From: redpointai
The field of autonomous vehicles (AVs) has seen significant advancements, with systems like Waymo operating commercially in certain areas [00:10:13]. However, achieving full, widespread autonomy still presents several complex challenges. Vincent Vanhoucke, a distinguished engineer at Waymo and former head of Google’s robotics team [00:00:39], highlights key technical and operational hurdles.
The Role and Limitations of AI Models
The recent “foundation model revolution,” involving large language models (LLMs) and visual-multimodal models (VMMs), has significantly impacted autonomous vehicle technology [00:01:48].
Benefits of LLMs/VMMs
These models enhance AV capabilities by providing:
- World Knowledge: A semantic understanding of the world, including recognizing various police cars or emergency vehicles even in new cities where specific data hasn’t been collected [00:03:22], [00:04:49]. They can interpret complex scenes like accident sites, understanding the semantic context [00:04:08].
- Enhanced Reasoning: Pre-training on vast amounts of visual and text data improves their reasoning capabilities [00:04:54].
- Scalability: Larger models generally lead to better performance [00:05:00]. These large models act as “teacher models” whose knowledge is distilled into the smaller models that run onboard the car [00:02:22].
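The teacher-to-onboard distillation idea can be sketched in a few lines. This is a generic illustration of knowledge distillation, not Waymo's actual training setup; all function names, logits, and the temperature value are invented:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    A large offboard "teacher" model produces soft targets; a small
    onboard "student" model is trained to match them.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# The loss is minimized when the student matches the teacher exactly,
# and grows as the two distributions diverge.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among non-top classes, which is where much of the transferred "world knowledge" lives.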
Areas Where AI Models Fall Short
While powerful, AI models are not ideal for all aspects of self-driving:
- Safety and Regulatory Constraints: Aspects requiring strict contracts on safety or regulatory compliance are best expressed in an explicit, verifiable way, rather than implicitly within an AI model [00:05:37], [00:05:45]. This allows for verification that the proposed driving plan meets safety and compliance requirements [00:05:50].
- Verification Layer: A “checking layer,” or guard rails, is needed around the output of reasoning models to ensure actions are safe and well-behaved [00:06:07], [00:06:36].
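One way such an explicit, auditable checking layer might look, as a minimal sketch: the `PlanStep` fields and thresholds are invented for illustration and are not Waymo's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlanStep:
    speed_mps: float        # proposed speed at this step
    min_clearance_m: float  # predicted distance to nearest obstacle

def verify_plan(plan: List[PlanStep], speed_limit_mps: float = 15.0,
                min_clearance_m: float = 1.0) -> bool:
    """Explicit guardrail around a model-proposed plan.

    Every constraint is a plain, verifiable predicate, so compliance
    can be audited, unlike constraints learned implicitly in a model.
    """
    return all(step.speed_mps <= speed_limit_mps and
               step.min_clearance_m >= min_clearance_m
               for step in plan)

safe = [PlanStep(10.0, 2.5), PlanStep(12.0, 1.8)]
unsafe = [PlanStep(10.0, 2.5), PlanStep(18.0, 0.4)]  # too fast, too close
```

The point is architectural: the AI model proposes, but an explicitly written layer disposes, so regulators and engineers can inspect exactly which contract each plan is held to.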
Key Challenges for Full Autonomy
The primary hurdles for widespread autonomous driving relate to scaling, the long tail of rare problems, and the development of advanced world models.
Scaling and the “Long Tail” of Problems
Waymo’s experience, driving millions of miles, reveals that rare, exceptional events become common occurrences [00:11:24].
- Rare Scenarios: While an average human driver might experience a specific difficult situation once in a lifetime, an AV fleet encounters it every week or month [00:11:50].
- Overcoming Edge Cases: Problems like driving in snow or fog, or on highways, are primarily scaling challenges rather than fundamental blockers [00:10:31]. The focus is on solving for this “long tail” of problems [00:12:11].
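Back-of-the-envelope arithmetic (with entirely made-up rates, chosen only for illustration) shows why a fleet compresses a lifetime of rare events into weeks:

```python
# Illustrative, invented numbers: a once-in-a-driving-lifetime event.
human_lifetime_miles = 500_000          # rough miles one person drives in a lifetime
event_rate_per_mile = 1 / human_lifetime_miles

fleet_miles_per_week = 1_000_000        # hypothetical fleet mileage
expected_events_per_week = fleet_miles_per_week * event_rate_per_mile
print(expected_events_per_week)  # -> 2.0 such events every week
```

Under these assumptions, an event a single driver would see once in a lifetime lands in the fleet's logs twice a week, which is what turns the long tail into an engineering target rather than an anecdote.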
Addressing Long-Tail Problems: Simulation and Synthetic Data
Since there isn’t a massive amount of real-world data for these rare scenarios, Waymo uses:
- Simulation: Extensive use of simulation to test models [00:12:36].
- Synthesizing Scenarios: Generating scenarios that could happen but haven’t been observed [00:12:39].
- Modifying Real Scenarios: Taking real-world risky situations (where nothing bad happened) and modifying them to be worse (e.g., making other drivers “drunk” or “adversarial”) to stress-test the system and improve reactivity [00:13:13].
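The scenario-perturbation idea above can be sketched with a toy agent representation; the fields, the severity model, and the perturbation itself are invented for illustration, not Waymo's actual simulation stack:

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Agent:
    speed_mps: float
    reaction_time_s: float

def make_adversarial(agent: Agent, severity: float, rng: random.Random) -> Agent:
    """Perturb a logged agent to behave worse than it actually did:
    faster, and slower to react, as if impaired or adversarial."""
    return replace(
        agent,
        speed_mps=agent.speed_mps * (1 + severity * rng.random()),
        reaction_time_s=agent.reaction_time_s * (1 + 2 * severity * rng.random()),
    )

rng = random.Random(0)
logged = Agent(speed_mps=12.0, reaction_time_s=0.8)
worse = make_adversarial(logged, severity=0.5, rng=rng)
```

Replaying the ego vehicle against many such degraded variants of a real near-miss stress-tests reactivity far beyond what the original log contained.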
The Need for Physically Realistic World Models
A significant technical advance that could “completely change the landscape” is the development of reliable, physically realistic world models [00:14:05].
- Goal: To simulate the real world with physical realism and accurate rendering, creating a “digital twin” of the environment [00:14:15], [00:15:31].
- Current State: Proto-world models like Sora or Veo can predict future video frames that look physically plausible [00:14:45]. However, they are not yet controllable enough for precise functional use [00:16:37], [00:17:18].
- Challenge of Causality: A core issue is teaching models to understand causality – how an input change leads to a specific output [00:17:30], [00:18:00]. While LLMs show signs of causal reasoning through “chain of thought,” it’s not yet clear if new architectures are needed or just better data engineering [00:46:46], [00:47:05].
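The controllability requirement can be stated as a tiny interface sketch: intervening on an action should change the predicted future in a causally consistent way. The one-dimensional toy model here is purely illustrative, standing in for a learned world model:

```python
def rollout(world_model, state, actions):
    """Roll a world model forward under a chosen action sequence.

    Controllability means an intervention on the actions changes the
    predicted future accordingly, i.e. the model captures causality,
    not just plausible-looking continuations.
    """
    states = [state]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states

# Toy stand-in world model: 1-D position updated by commanded velocity.
toy_model = lambda s, a: s + a

baseline = rollout(toy_model, 0.0, [1, 1, 1])
intervened = rollout(toy_model, 0.0, [1, 2, 1])  # change one action...
# ...and the predicted future shifts in the expected way.
```

Current video-generation models can produce the `states` sequence; what they lack is this reliable, intervention-respecting dependence on `actions`.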
Sensor Suites and the “Superhuman” Bar
Waymo utilizes a rich sensor suite (cameras, lidars, radars) due to their complementary strengths and weaknesses, enabling cross-validation of data [00:22:30].
- Business Strategy: Waymo chose to “over-sensorize” initially to solve the hard problem first, allowing them to gather data and inform future cost reduction decisions [00:24:24]. This contrasts with L2 driving systems that prioritize lower cost [00:23:31].
- Beyond Human Level: The bar for L4 autonomous driving is considered “above human level” [00:26:34]. Waymo’s safety reports indicate they are already safer than the average human driver, with fewer collisions and injuries [00:26:46]. This “superhuman” performance is seen as a business requirement for successful L4 driving [00:27:10]. The question remains whether this can be achieved with a simpler sensor suite [00:27:34]. Redundancy in sensing is likely to remain important [00:25:47].
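The cross-validation idea behind the rich sensor suite can be illustrated with a toy fusion check; the detection format and tolerance are assumptions for illustration, not Waymo's actual pipeline:

```python
def cross_validated(detections: dict, tolerance_m: float = 1.0) -> bool:
    """Confirm an object only if at least two independent sensor
    modalities place it within `tolerance_m` of each other.

    `detections` maps modality name -> estimated range in meters,
    or None if that sensor saw nothing.
    """
    ranges = [r for r in detections.values() if r is not None]
    return any(abs(a - b) <= tolerance_m
               for i, a in enumerate(ranges)
               for b in ranges[i + 1:])

# Camera and lidar agree; radar missed the object entirely.
agree = cross_validated({"camera": 42.3, "lidar": 42.8, "radar": None})
# Only one modality fired: not enough for confirmation.
alone = cross_validated({"camera": 42.3, "lidar": None, "radar": None})
```

Because cameras, lidars, and radars fail in different conditions, requiring agreement between any two yields redundancy that a single-sensor suite cannot match.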
Human Trust and Public Acceptance
Beyond technology, integrating AVs into society requires building trust with the local community and addressing logistical challenges of entering new cities [00:21:19].
Building trust is "a lot more than just technology" [00:22:15].
The Broader Robotics Context
Autonomous cars are fundamentally robots [00:08:36], and lessons from general robotics apply.
Current State of General Robotics
Unlike autonomous driving, which has a “nominal system that works,” the general robotics space is still “chasing the nominal use case” – getting a generalized robot to do anything desired [00:09:58], [00:31:35].
- Lack of Generalization in Motion: While robots can generalize based on visual inputs, they struggle with generalizing motion or “skills” [00:32:30], [00:32:54].
- Path to Commercial Success: Commercial success might come from highly optimized, task-specific robots, but the broader vision of general-purpose AI robots (e.g., making coffee, tidying rooms) still requires breakthroughs [00:33:03].
Impact of LLMs on Robotics
The application of LLMs and VMMs to robotics has been a significant surprise, particularly in overcoming the “common sense knowledge” bottleneck [00:33:53], [00:34:31].
- Common Sense: LLMs provide robots with everyday knowledge (e.g., a cup goes on the table, not the floor) that was previously difficult to inject [00:34:52].
- Action as Language: The realization that robot actions can be viewed as a different “language” or “dialect” allows leveraging multimodal and multilingual models for robotic control [00:35:55]. This means the same “machinery” for language processing can apply to robot actions [00:36:26].
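A common way to treat actions as another "dialect" is to discretize each continuous action dimension into tokens from a fixed vocabulary, so a sequence model can emit actions the same way it emits words. This is a generic sketch, with assumed action ranges and bin counts, not any specific system's tokenizer:

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Discretize each continuous action dimension into one of `bins`
    integer tokens that can share a sequence model's vocabulary."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                    # clamp to range
        tokens.append(int((a - low) / (high - low) * (bins - 1)))
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, bins=256):
    """Invert the discretization (up to quantization error)."""
    return [low + t / (bins - 1) * (high - low) for t in tokens]

action = [0.25, -0.8, 0.0]  # e.g. hypothetical end-effector deltas
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
```

Once actions are tokens, the same next-token machinery used for text and images applies unchanged, which is what makes multimodal models transferable to robot control.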
Approaches to Building Robot Models
Two main approaches exist for developing robot models:
- Hardware-Centric: Building the most capable humanoid robot first, then programming tasks [00:38:58]. This can be very expensive and hard to operationalize for data acquisition [00:40:07].
- Software-First: Building the intelligence first, trusting that it can be retargeted to different hardware platforms [00:39:15]. This path, exemplified by work like RT-X, focuses on data acquisition speed and scalability [00:39:33].
Simulation vs. Real-World Data for Robotics
- Locomotion and Navigation: Simulation has been highly effective in these contexts due to a manageable “sim-to-real gap” [00:41:17].
- Manipulation: It has been difficult to get sufficient diversity and quality of experience from simulation for manipulation tasks, especially regarding contact physics [00:41:31]. The effort required to make simulation realistic for manipulation is very high [00:42:10].
- Scaling Real-World Data: For manipulation, scaling physical operations to collect large amounts of real-world data has been a faster path [00:42:27].
Data Acquisition in Robotics
A crucial bottleneck is acquiring data at scale, particularly motion data for physical skills [00:39:55], [00:46:30].
- Methods: Strategies include kinesthetic teaching, puppeteering, or teleoperation [00:44:36].
- Third-Party Imitation: A promising but unsolved area is learning from passively watching videos of people doing tasks, which again links back to the challenge of inferring causality from observation [00:45:05].
- Visual Information Transfer: Multimodal models have significantly accelerated data acquisition by transferring visual knowledge (e.g., knowing who Taylor Swift is without specific training data) [00:45:45].
Future Outlook
The trajectory of autonomous driving and robotics will be determined by how these challenges in scaling, world modeling, and data acquisition are addressed. Progress in LLMs and robotics models is expected to accelerate [01:03:11].
The biggest milestone for self-driving cars now is expansion across geographies, demonstrating robustness in diverse environments [01:00:16]. Waymo is exploring international deployments, like driving on the left side of the road in Tokyo [01:00:28].
The long-term vision is a future where humans look back and consider driving cars by hand “crazy,” given the accident rates it entailed [01:03:54]. Reaching it will require continued innovation in integrating AI into autonomous vehicles and broader robotic systems.