From: redpointai
Vincent Vanhoucke, a distinguished engineer at Waymo and founder and former lead of Google’s robotics team, shares his perspective on the current state and future potential of AI in robotics and autonomous vehicles. His expertise spans both fields, offering a unique view of their integration and impact [00:00:40].
AI in Autonomous Vehicles: Waymo’s Approach
Waymo, a pioneer in self-driving technology, was working on autonomous vehicles long before the recent explosion of advances in Large Language Models (LLMs) and diffusion models [00:01:30].
Integration of LLMs and VLMs
The advent of LLMs and Vision-Language Models (VLMs) has not rendered Waymo’s existing technology obsolete; rather, these models act as an enhancement [00:01:41]. They facilitate the creation of “teacher models” – large-scale models that can ingest vast amounts of data, including internet data, to build a comprehensive understanding of the Waymo driver, car behavior, and the environment [00:01:48]. This “World Knowledge” provides semantic understanding, allowing the autonomous system to recognize elements like police cars or accident scenes, even in new cities where specific local data might be scarce [00:03:22]. LLMs also enhance the reasoning capabilities of these systems [00:04:54].
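Waymo has not published the exact mechanics, but the teacher-model idea maps onto standard knowledge distillation: the large model’s outputs become soft training targets for a smaller onboard model. A minimal PyTorch sketch, with illustrative names and an arbitrarily chosen temperature:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Match the student's output distribution to the teacher's
    temperature-smoothed distribution (soft-label distillation)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence; the T^2 factor keeps gradient magnitudes comparable
    # across temperatures (Hinton et al., 2015).
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```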
Limitations and Safety
While AI models are powerful for generating driving plans, strict safety contracts and regulatory compliance are still enforced outside the AI model [00:05:30]. This allows explicit verification that the AI-proposed plan meets all safety and compliance requirements [00:05:43].
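The interview doesn’t detail the checker, but the shape is a generative planner feeding an explicit, rule-based verifier. A toy sketch; the constraint names and limits below are hypothetical, not Waymo’s actual safety contract:

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    speed_mps: float      # planned speed at this timestep
    lateral_accel: float  # m/s^2
    min_gap_m: float      # distance to nearest other road user

# Hypothetical hard constraints; real safety contracts are far richer.
SPEED_LIMIT_MPS = 13.4    # ~30 mph
MAX_LATERAL_ACCEL = 3.0   # m/s^2
MIN_SAFE_GAP_M = 2.0      # meters

def verify_plan(plan: list[PlanStep]) -> bool:
    """Check an AI-proposed plan against explicit constraints, outside
    the learned model, so compliance is verified rather than assumed."""
    return all(
        step.speed_mps <= SPEED_LIMIT_MPS
        and abs(step.lateral_accel) <= MAX_LATERAL_ACCEL
        and step.min_gap_m >= MIN_SAFE_GAP_M
        for step in plan
    )
```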
Current Challenges and Future Breakthroughs
A primary challenge for autonomous vehicles today is scaling to handle “long-tail problems”: rare, exceptional, or difficult scenarios that a human driver might encounter once in a lifetime but that an autonomous fleet, with millions of miles driven, experiences regularly [00:11:11]. Waymo addresses this through extensive simulation, synthesizing scenarios to validate its models against potential issues not yet observed in real-world data [00:12:34].
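As an illustration of scenario synthesis, one logged scene can be perturbed into many harder variants; the scene parameters below are invented for the example:

```python
import random

def synthesize_variants(base: dict, n: int = 1000) -> list[dict]:
    """Toy long-tail synthesis: perturb one logged scene into many harder
    variants (a closer, faster pedestrian) to probe a model before such
    cases are ever observed on the road."""
    variants = []
    for _ in range(n):
        v = dict(base)
        v["pedestrian_distance_m"] = base["pedestrian_distance_m"] * random.uniform(0.2, 1.0)
        v["pedestrian_speed_mps"] = base["pedestrian_speed_mps"] * random.uniform(1.0, 3.0)
        variants.append(v)
    return variants

scene = {"pedestrian_distance_m": 25.0, "pedestrian_speed_mps": 1.4}
hard_cases = synthesize_variants(scene)
```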
A significant technical advance that could revolutionize autonomous driving is the development of reliable, physically realistic world models [00:14:05]. These models would allow for highly accurate and controllable simulations of the real world, serving as a “digital twin” for autonomous driving [00:15:27]. While video prediction models like Sora or Veo are “proto world models,” the challenge lies in making them controllable and ensuring physical realism, especially for long-tail problems where current models are not yet proficient [00:14:47]. A key hurdle is injecting causality into these models, moving beyond mere correlation [00:17:30].
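To make the controllability point concrete, here is a toy action-conditioned predictor: because the next state depends on a chosen action, the model can answer counterfactual “what if” rollouts, which is what a digital twin needs. This is a sketch of the general idea, not Waymo’s architecture:

```python
import torch
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    """Toy latent world model: predict the next latent state from the
    current state and an action. Conditioning on the action is what makes
    the model controllable; physical realism has to come from data."""
    def __init__(self, state_dim: int = 256, action_dim: int = 4, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def rollout(model: nn.Module, state: torch.Tensor, actions: list) -> torch.Tensor:
    """Unroll a counterfactual trajectory ("what if the car did X?") by
    feeding predictions back in: the digital-twin use case in miniature."""
    states = [state]
    for action in actions:
        states.append(model(states[-1], action))
    return torch.stack(states)
```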
Sensor Suite
Waymo employs a comprehensive sensor suite of cameras, lidars, and radars [00:22:30]. These sensors are complementary: their individual strengths and weaknesses offset each other, providing critical redundancy and allowing data to be cross-validated [00:22:40]. This approach contrasts with others that prioritize simpler, cheaper sensor setups for L2 driving, aiming to scale up later [00:24:07]. Waymo’s strategy was to “over-sensorize” initially to solve the harder problem first, then use the collected data to inform cost reduction [00:24:29].
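The redundancy argument can be stated as a voting rule across modalities; a deliberately simple sketch:

```python
def cross_validated(detections: dict[str, bool], required: int = 2) -> bool:
    """Toy redundancy check: accept an object only when enough independent
    modalities agree, so a single sensor's failure surfaces as a
    disagreement instead of a silent miss."""
    return sum(detections.values()) >= required

# A pedestrian seen by camera and lidar but missed by radar still passes:
assert cross_validated({"camera": True, "lidar": True, "radar": False})
```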
The performance bar for L4 (fully autonomous) driving is considered to be above human level [00:26:37]. Waymo’s safety reports indicate they are already safer than the average human driver, with fewer collisions and reported injuries [00:26:46]. This “superhuman” capability is seen as a business requirement for successful L4 driving [00:27:14].
Milestones and Expansion
The next major milestones for autonomous vehicles will center on expansion into new geographies [00:30:14]. Waymo is, for example, beginning data collection in Tokyo, marking its first international expansion and its first experience with left-side driving [00:30:28]. The robustness and portability of its models across different cities are proving remarkable, with much of the effort in a new city focused on evaluation, regulatory compliance, and community engagement to build trust [00:30:30].
Broader Robotics Space
The robotics field is still “chasing the nominal use case” – striving to achieve a generalized robot capable of performing diverse tasks [00:31:35]. Unlike autonomous driving, which has reached commercial deployment after decades, general-purpose robotics has not yet had its “1995 ride” moment (referencing the first transcontinental autonomous drive) [00:31:53].
Impact of Large Models
The application of LLMs and VLMs to robotics has been surprisingly effective [00:33:43]. A key breakthrough has been the ability to quickly translate high-level, common-sense knowledge from chat models (e.g., how to make coffee, or what objects are on a table) into actionable plans for robots [00:34:04]. This works in part because robot actions can be viewed as just another “language,” letting multimodal and multilingual large models be leveraged for generalization [00:35:55].
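One published way of treating actions as a language, in the spirit of RT-2-style action tokenization, is to discretize each continuous action dimension into a fixed number of bins so that action vectors become token sequences. A sketch, with the value range and bin count as assumptions:

```python
import numpy as np

def action_to_tokens(action: np.ndarray, low: float = -1.0,
                     high: float = 1.0, bins: int = 256) -> list[int]:
    """Discretize a continuous action vector into integer tokens so a
    language-style model can emit robot actions as 'just another language'."""
    clipped = np.clip(action, low, high)
    ids = np.round((clipped - low) / (high - low) * (bins - 1))
    return ids.astype(int).tolist()

def tokens_to_action(tokens: list[int], low: float = -1.0,
                     high: float = 1.0, bins: int = 256) -> np.ndarray:
    """Decode token ids back into a continuous action for the controller."""
    return low + np.asarray(tokens, dtype=float) / (bins - 1) * (high - low)

# Round-trip: a 7-DoF action becomes 7 tokens and back.
a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -1.0, 1.0])
assert np.allclose(tokens_to_action(action_to_tokens(a)), a, atol=1 / 255)
```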
Generalization and Data Acquisition
A significant challenge in robotics is generalizing motion and skills, as many robot demos are highly specific to a single task [00:32:33]. There are two main approaches to building generalized robotic models:
- Hardware-Centric: Building the most capable humanoid robot first, then developing the software [00:38:58].
- Software-First: Building general intelligence and trusting it can be retargeted to new platforms [00:39:15].
The software-first path, exemplified by efforts like RT-X, has shown faster progress due to its focus on efficient data acquisition [00:39:37]. The bottleneck in robotics remains acquiring high-quality data at scale [00:39:55].
While simulation is valuable for locomotion and navigation, the “sim-to-real gap” has made it less effective for complex manipulation tasks due to the difficulty of accurately tuning physics and achieving diverse contact experiences [00:41:15]. Consequently, scaling physical operations to collect real-world data has been a more effective approach for manipulation [00:42:27].
New data acquisition strategies are crucial, including kinesthetic teaching, puppeteering, and teleoperation [00:44:34]. A desired future capability is “third-party imitation,” where robots can learn from watching videos of humans [00:45:05]. Multimodal models have already accelerated visual information transfer to robots (e.g., recognizing Taylor Swift without specific training) [00:45:48], shifting the focus to acquiring motion and physical skills data [00:46:39].
Unanswered Questions in Robotics
Key questions for the future of robotics include:
- Can motion be generalized across different actions and environments, similar to how perception generalizes [00:48:03]?
- Are there fundamental differences between robotics and other areas of AI that will require new techniques, or will existing large-model architectures suffice [00:48:35]? For example, diffusion models, effective for video generation, also work well for motion generation in robotics [00:49:10] (see the denoising sketch after this list).
- Will scaling laws, similar to those observed in LLMs, apply to robotics models, even if the constants differ [00:49:42]? (See the power-law fit after this list.)
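A minimal sketch of the diffusion-for-motion idea mentioned above: sample an action trajectory by iteratively denoising Gaussian noise, DDPM-style. The denoiser passed in is a stand-in for a trained network:

```python
import torch

def ddpm_sample(denoiser, steps: int = 50, horizon: int = 16, action_dim: int = 7):
    """Minimal DDPM-style sampler over an action trajectory (the
    'diffusion policy' idea): start from Gaussian noise and denoise
    step by step into a motion plan."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(horizon, action_dim)  # pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t)              # model's estimate of the injected noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Smoke test with a dummy denoiser (a trained network would go here):
trajectory = ddpm_sample(lambda x, t: torch.zeros_like(x))
```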
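On the scaling-law question, the LLM-style hypothesis is that evaluation loss falls as a power law in scale, L(N) = aN^(-b) + c. Fitting that form would look like the following; the data points are hypothetical and purely illustrative, since whether robotics data follows this form is exactly the open question:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # L(N) = a * N^(-b) + c: the LLM-style scaling-law form.
    return a * np.power(n, -b) + c

# Hypothetical (parameter count, eval loss) points for illustration only.
sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([2.80, 2.20, 1.80, 1.55])
(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(50.0, 0.2, 1.0))
print(f"fitted exponent b = {b:.3f}")
```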
Broader AI Reflections and Impact
Reasoning Capabilities
The evolution of reasoning capabilities in LLMs, particularly through “chain-of-thought” prompting, has been a significant development [00:52:46]. This allows users to access expert-level knowledge (e.g., complex physics or legal information) instantly, fundamentally changing how information is accessed and utilized [00:54:27].
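A minimal example of the chain-of-thought pattern: include one worked, step-by-step answer in the prompt so the model reasons before answering. The questions are invented for illustration:

```python
# A minimal chain-of-thought prompt: showing one worked, step-by-step
# answer nudges the model to reason before answering the next question.
prompt = (
    "Q: A car travels 30 km in 20 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step. 20 minutes is 1/3 of an hour, so the\n"
    "speed is 30 km / (1/3 h) = 90 km/h. The answer is 90 km/h.\n"
    "Q: A train covers 45 km in 30 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step."
)
```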
Applications of AI
AI models are highly effective in domains where solutions are hard to generate but easy to verify, such as coding or math [00:56:02]. This “actor-critic” pattern, where a generative model proposes a solution and another system verifies it, is broadly applicable [00:56:41]. In autonomous driving, it’s easier to verify a plan against hard constraints than to generate it from scratch [00:57:24].
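A toy instance of the generate-and-verify shape, using factoring, where checking a candidate is trivial even when finding one is not:

```python
import random

def is_factor(candidate: int, target: int) -> bool:
    """The 'critic': verifying a proposed factor is one modulo operation,
    even though finding one can be hard."""
    return 1 < candidate < target and target % candidate == 0

def generate_and_verify(target: int, tries: int = 100_000) -> int | None:
    """The 'actor' proposes candidates (here, naively at random); the
    verifier accepts or rejects. The same shape applies to checking a
    driving plan against hard constraints."""
    for _ in range(tries):
        candidate = random.randint(2, target - 1)
        if is_factor(candidate, target):
            return candidate
    return None

print(generate_and_verify(91))  # likely finds 7 or 13
```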
AI reasoning capabilities will matter most for multi-step problems that require attributing credit across steps [00:57:51]. The preferred development paradigm now involves bootstrapping with a large model, extensive supervised learning, and then reinforcement-learning (RL) fine-tuning to reach expert performance [00:59:04].
Future Directions
Key questions for the next 12-24 months in the broader LLM space include:
- The progress of “world models” and controllable video/world generation, potentially leading to purely generative video games [00:59:43].
- Whether current large multimodal architectures can effectively create good world models, or if new architectural leaps are needed [01:00:17].
- The significant compute investments required for these next steps [01:01:13].
Overhyped vs. Underhyped
Humanoid robotics is viewed as both overhyped and underhyped [01:02:01]. While there’s significant investment, success is not guaranteed, risking a “humanoid winter” if patience runs out [01:02:30]. However, it’s also underhyped because the field needs to succeed for the broader progress of robotics [01:02:46]. Progress in LLM and robotics models is expected to accelerate even further [01:03:16].
Societal Impact and Policy Implications
There’s a potential future where humans look back at today’s reliance on human-driven cars as “crazy” due to the high accident rates [01:04:04]. The expectation is that autonomous vehicles will surpass human safety levels [01:04:09].
While many homes already have “robots” (e.g., dishwashers, washing machines) [01:05:40], general-purpose mobile manipulators (like Rosie from The Jetsons) will take much longer to become commonplace [01:05:53]. The bar for household robots is extremely high regarding safety and preventing damage to the home environment [01:06:11]. More immediate applications for mobile robots are expected in structured environments like logistics, industrial settings, last-mile delivery, offices, and hospitals, where costs of damage can be more easily managed [01:07:28].
One under-talked implication of AI progress is its transformative effect on education [01:08:15]. AI offers a “magical tool” for interactive and engaging learning, moving beyond concerns about cheating to unlock new pedagogical approaches [01:08:40].
Another exciting, under-discussed area is the use of AI techniques to design new products, particularly in industries not traditionally associated with technology. An example is using AI to design plant-based cheese, exploring the “design space” for non-animal-based products that could have a massive environmental impact [01:10:40]. This highlights the potential of “AI plus something you don’t think about when you think about technology” [01:11:54].