From: redpointai
Vincent Vanhoucke, a distinguished engineer at Waymo, discusses the evolving landscape of autonomous vehicles and robotics with the integration of AI, particularly large language models (LLMs) and vision models (VMs) [00:00:49].
Impact of LLMs and VMs on Autonomous Vehicles
The recent advancements in foundation models, including LLMs and VMs, have influenced Waymo’s technology without necessitating a complete overhaul of existing systems [00:01:40]. This “foundation model revolution” allows for the creation of very large-scale “teacher models” that incorporate all available driving data in addition to internet data [00:01:48]. The teacher model is then used to distill its knowledge into the onboard models of the car [00:02:22]. This approach provides a different mode of supervision for existing models and lays the foundation for evolving toward bigger, higher-capacity, more expressive models [00:02:37].
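A minimal sketch of this teacher-student distillation pattern, assuming a PyTorch setup; the model interfaces, temperature, and training loop are illustrative assumptions, not Waymo’s actual pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the small onboard model to match the output distribution of
    the large offboard teacher (soft-label distillation)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

def train_step(student, teacher, batch, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(batch)   # large teacher, run offboard
    student_logits = student(batch)       # compact, latency-bound onboard model
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```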
The primary contribution of LLMs and VMs to autonomous driving is “world knowledge” – the semantic understanding of the world [00:03:14]. This includes recognizing elements like police cars or emergency vehicles, even in new cities where they might look different [00:03:46]. These models also understand accident scenes, leveraging vast web data to provide semantic context that might not be present in Waymo’s direct driving data [00:04:08]. This extensive pre-training on visual and text data enhances the reasoning capabilities of the autonomous driver [00:04:47].
While AI models are central, there are aspects of self-driving that require explicit, external control, particularly concerning safety and regulatory compliance [00:05:30]. A “checking layer” or “guard rails” ensures that the AI’s proposed driving plan meets strict requirements for safety and general good behavior [00:06:07]. This allows the system to leverage AI’s power for strategy while maintaining verifiable safety standards [00:06:24].
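A toy sketch of such a checking layer, where a learned planner proposes ranked candidates and an explicit rule layer accepts the first one that passes; every constraint name and threshold here is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    max_speed_mps: float
    min_gap_m: float            # closest predicted distance to any road user
    crosses_double_yellow: bool

def satisfies_guardrails(plan: Plan, speed_limit_mps: float) -> bool:
    """Explicit, auditable rules that any proposed plan must pass."""
    if plan.max_speed_mps > speed_limit_mps:
        return False
    if plan.min_gap_m < 1.0:    # minimum clearance in meters (made up)
        return False
    if plan.crosses_double_yellow:
        return False
    return True

def select_plan(ranked_candidates, speed_limit_mps, fallback_plan):
    # The learned planner proposes candidates in preference order; the
    # checking layer accepts the first that passes every hard constraint.
    for plan in ranked_candidates:
        if satisfies_guardrails(plan, speed_limit_mps):
            return plan
    return fallback_plan        # e.g., a conservative slow-down maneuver
```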
Transition to Waymo and Robotics Parallels
Vanhoucke’s personal experience with Waymo’s product after an accident highlighted its “magical” and universally applicable nature, influencing his decision to join the company [00:07:00].
He identifies the core problem of autonomous driving as similar to robotics: perception, planning, and actuation [00:08:21]. An autonomous car is essentially a robot with sensors (cameras) as input and actuation (steering, acceleration) as output [00:08:37]. The main difference lies in the operational domain [00:09:09]. While robotics research often chases nominal behavior (e.g., getting a robot to perform a task), autonomous driving has reached a commercial product level where the challenge is primarily about scaling [00:09:58].
Current State and Challenges in Autonomous Vehicles
Currently, there are few “big blockers” for autonomous vehicles [00:10:31]. Waymo avoids driving in snow, but this reflects prioritization rather than an insurmountable technological hurdle [00:10:37]. Most problems, like fog or highway driving, have been addressed over time [00:10:59]. The major challenge in achieving full autonomy now revolves around scaling, specifically addressing the “long tail” of problems that arise when driving millions of miles [00:11:11]. Experiences that are rare for a human driver become common occurrences for a large fleet of autonomous vehicles, demanding solutions for these exceptional and difficult scenarios [00:11:50].
To solve these long tail problems, Waymo heavily utilizes simulation and synthesizes scenarios that correspond to potential issues, even if they haven’t been observed in the real world [00:12:34]. They also modify existing risky scenarios to make them worse, such as introducing drunk or adversarial drivers, to make the car more reactive and understand worst-case scenarios [00:13:17].
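A sketch of this scenario-perturbation idea; the scenario schema and perturbation parameters are hypothetical, chosen only to show how one logged event fans out into many harder variants:

```python
import random

def perturb_scenario(scenario, rng):
    """Take a logged risky scenario and synthesize a harder variant,
    e.g. by making another agent slower to brake and prone to drifting,
    loosely mimicking an impaired or adversarial driver."""
    variant = dict(scenario)
    agent = dict(variant["agents"][0])
    agent["brake_delay_s"] = agent.get("brake_delay_s", 0.0) + rng.uniform(0.5, 2.0)
    agent["lateral_drift_mps"] = rng.uniform(0.2, 1.0)
    variant["agents"] = [agent] + variant["agents"][1:]
    return variant

# Fan one logged event out into many worst-case variants for the simulator.
base = {"agents": [{"id": "veh_1"}, {"id": "ped_1"}]}
variants = [perturb_scenario(base, random.Random(seed)) for seed in range(100)]
```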
Future Technical Advances and Research
A significant technical advance that could transform the landscape for autonomous driving is the development of reliable, physically realistic world models [00:14:05]. These models would enable simulating the real world with precise physical realism and accurate scene rendering [00:14:15]. Current video prediction models like Sora or Veo are “proto-world models” that can unroll plausible futures [00:14:47]. The key is making these models controllable, physically realistic, rich, and plausible in terms of visuals and behavior [00:15:09]. A “digital twin of the world” for autonomous driving could be a game-changer [00:15:31].
The main factor hindering world-model development is injecting causality [00:17:30]. While models can learn the correlations needed to produce plausible videos, control requires understanding how actions lead to outcomes [00:18:00]. Injecting causality into machine learning models has been a long-standing struggle [00:18:17].
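One common way to inject a minimal form of causality is to condition prediction on actions, so the model must learn how interventions change outcomes rather than merely fit correlations. A sketch under that assumption (the architecture and dimensions are invented):

```python
import torch
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    """Predicts the next latent state from the current state AND the action
    taken, forcing the model to tie outcomes to interventions rather than
    to correlations alone."""
    def __init__(self, state_dim=256, action_dim=8):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, state, action):
        return self.dynamics(torch.cat([state, action], dim=-1))

# Roll out a plausible future under a candidate 10-step action sequence;
# changing the actions changes the rollout, which is what "controllable" means.
model = ActionConditionedWorldModel()
state = torch.zeros(1, 256)
for action in torch.randn(10, 1, 8):
    state = model(state, action)
```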
Waymo’s research strategy involves a trade-off between pushing the state of the art internally and leveraging external research [00:18:52]. For bespoke problems unique to AVs, Waymo steers the conversation, for example by releasing the Waymo Open Dataset, a standard benchmark for AV research [00:19:26]. Because Waymo is at the forefront of AV AI, it often has to build the next thing itself rather than relying on the broader community [00:20:14].
When entering a new city, Waymo’s models are remarkably robust and portable [00:20:30]. The primary effort is focused on extensive evaluation to ensure the models are robust to local variations (e.g., different emergency vehicle designs) and to convince regulators and the community of their thoroughness [00:21:15]. Logistics and building trust with the local community are also crucial [00:21:36].
Sensor Suites and the Superhuman Bar
Waymo’s autonomous cars use a complementary sensor suite of cameras, lidars, and radars [00:22:30]. The diversity of these sensors provides cross-verification, allowing the system to identify and investigate disagreements between different sensor inputs [00:22:57].
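A toy sketch of cross-verification between modalities; the detection format and matching threshold are invented for illustration:

```python
def cross_check(camera_dets, lidar_dets, max_dist_m=0.5):
    """Return camera detections that no lidar detection corroborates;
    disagreements can trigger caution or extra scrutiny of that region."""
    disagreements = []
    for det in camera_dets:
        corroborated = any(
            abs(det["x"] - ld["x"]) < max_dist_m
            and abs(det["y"] - ld["y"]) < max_dist_m
            for ld in lidar_dets
        )
        if not corroborated:
            disagreements.append(det)
    return disagreements

camera = [{"x": 4.2, "y": 1.0, "label": "pedestrian"}]
lidar = []  # lidar sees nothing near (4.2, 1.0)
print(cross_check(camera, lidar))  # the unmatched detection is flagged
```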
Historically, AV companies have pursued two approaches: starting with Level 2 driving assistance systems with economic constraints on sensor costs, or starting with a highly sensorized approach to solve the hardest problem first [00:23:17]. Waymo chose the latter, prioritizing solving the complex problem with abundant sensor data, which now provides the data to inform cost reduction and simplification efforts [00:24:19].
The sensor story is also about redundancy, which is unlikely to disappear [00:25:41]. While humans can drive with just eyes, the bar for Level 4 (L4) autonomous driving is above human level [00:26:32]. Waymo’s safety reports indicate they are already safer than the average human driver, with fewer collisions and injuries [00:26:46]. This “superhuman” performance is seen as a business requirement for successful L4 driving [00:27:10].
Superhuman Performance
Vanhoucke notes a viral video of a Waymo car avoiding an incident with a falling scooter, highlighting the “superhuman” displays of Waymo driving [00:27:48]. Once society observes such capabilities, it becomes difficult to accept a lower, human-level standard [00:27:58].
Milestones and Future Outlook
A significant historical milestone is the 1995 transcontinental autonomous ride, which achieved over 99% autonomy, leading to premature assumptions that full self-driving was imminent [00:28:37]. It took 30 years to reach commercial deployment, highlighting the difficulty in predicting timelines [00:29:17].
Current milestones for autonomous vehicles include technological validation in cities like Phoenix and San Francisco, and strong user validation, where people “love” the product [00:29:32]. The main focus moving forward is on scaling and expansion into various geographies, such as the initial international experiment of data collection in Tokyo, which involves driving on the left side of the road [00:30:19].
Broader Robotics Space and AI Integration
In the broader robotics space, the challenge remains getting a generalized robot to perform any desired task [00:31:32]. While progress has been rapid, a “convincing generalist system” has not yet emerged [00:31:53]. One key question is the ability to generalize motion and skills, as many robot demos are specialized for a single task, even if the environment is randomized [00:32:33]. A commercially successful robot might be highly optimized for a single use case [00:33:03].
The application of LLMs to robotics has been surprising [00:34:43]. The ability to quickly translate high-level natural language instructions (e.g., “make coffee”) into actionable plans for a robot, leveraging the common-sense knowledge embedded in LLMs, was a breakthrough [00:34:10]. The realization that robot actions could be treated as a different “language” or “dialect,” similar to English or Chinese, allowed roboticists to leverage the existing machinery of large multimodal and multilingual models [00:35:55].
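One published way to treat actions as a “dialect” is to discretize each action dimension into bins that become extra tokens in the model’s vocabulary, loosely in the spirit of RT-2’s action tokens; the ranges and bin count below are illustrative assumptions:

```python
import numpy as np

N_BINS = 256                 # 256 discrete symbols per action dimension
LOW, HIGH = -1.0, 1.0        # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list:
    """Map a continuous action vector to discrete tokens that can sit in
    the same vocabulary as ordinary text tokens."""
    clipped = np.clip(action, LOW, HIGH)
    bins = ((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).round().astype(int)
    return bins.tolist()

def tokens_to_action(tokens) -> np.ndarray:
    """Inverse map: decode the model's emitted tokens back into a command."""
    return np.asarray(tokens) / (N_BINS - 1) * (HIGH - LOW) + LOW

command = np.array([0.12, -0.40, 0.88])   # e.g., end-effector deltas
tokens = action_to_tokens(command)         # three integers in [0, 255]
decoded = tokens_to_action(tokens)         # close to the original command
```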
Generalizable vs. Task-Specific Models
The future of robotics models is likely to involve both generalizable “teacher” models and task-optimized versions [00:37:04]. This parallels the instruction tuning paradigm in large language models, where generic capabilities are developed and then quickly adapted for specific tasks through prompting or fine-tuning [00:37:23]. The ideal scenario would be prompting-style adaptation at test time [00:38:19].
Approaches to building powerful generalized robot models vary:
- Hardware-centric: Building the most capable humanoid robot first, then expecting it to accomplish tasks [00:38:58].
- Software-first: Building the intelligence first, then trusting it can be retargeted to new platforms [00:39:15].
Vanhoucke gained confidence in the software-first approach from the RT-X project, as it allows for faster progress by optimizing for data-collection speed [00:39:32]. Operationalizing expensive, wobbly robots for data acquisition is a significant hurdle [00:40:07].
Simulation vs. Real-World Data
The debate between using purely simulated data versus teleoperated real-world data in robotics is ongoing [00:40:58]. While simulation has been effective for locomotion and navigation where the “sim2real gap” is manageable, it has been challenging for manipulation due to the difficulty in replicating diverse experiences and contact quality [00:41:15]. The cost of setting up diverse, representative, and physically realistic simulation environments for manipulation is very high [00:41:51]. Vanhoucke’s experience suggests that scaling physical operations to collect real-world data is often a faster path for manipulation, avoiding the sim2real gap [00:42:25].
Data Acquisition Flywheels and Human-Robot Interaction
A crucial element for robotics is a flywheel for acquiring data at scale [00:43:33]. Human-robot interaction (HRI) for data acquisition is a rich but under-discussed area [00:44:05]. Strategies include kinesthetic teaching, puppeteering, teleoperation with gloves, and synthesizing behaviors in simulation [00:44:34].
A desired development is “third-party imitation”: the ability for robots to learn from observing videos of people [00:45:00]. This ties back to the world-model challenge of inferring causality from observation [00:45:17]. The transfer of visual information from large multimodal models to robots has been a significant accelerator, as robots can now understand concepts like “Taylor Swift” without explicit training [00:45:45]. The remaining challenge is acquiring the right kind of motion data for physical skills [00:46:30].
Injecting causality into models may not require new architectures but rather proper data engineering and inductive biases [00:46:46]. Scaling laws observed in large models for behavior and perception suggest that the same log-linear growth patterns apply to autonomous driving models, albeit with different constants [00:49:42].
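The “same law, different constants” claim corresponds to a power law, which is linear in log-log space; a sketch of fitting its constants on made-up numbers:

```python
import numpy as np

# Power-law scaling: loss ~ a * N**(-b), i.e. log(loss) = log(a) - b*log(N),
# a straight line in log-log space. Different domains share the form but
# fit different constants a and b.
N = np.array([1e6, 1e7, 1e8, 1e9])         # model or data scale (illustrative)
loss = np.array([2.1, 1.6, 1.25, 0.98])    # made-up evaluation losses
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
b, a = -slope, np.exp(intercept)
print(f"loss ~= {a:.2f} * N^(-{b:.3f})")
```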
Reflections on AI Progress and Future Implications
Changed Minds
Vanhoucke has been particularly fascinated by the evolution of reasoning capabilities in AI, especially the impact of “chain of thought” thinking [00:52:46]. He gives a personal example of using Gemini to quickly solve a physics problem for a science fiction story that he had pondered for a decade [00:53:31]. This highlights the newfound accessibility to vast knowledge and reasoning capabilities, leading to the question of what other unasked questions could now be easily answered [00:54:27].
Broad Applicability of AI
The ability to generate plausible answers for problems that are “hard to generate but easy to verify” (e.g., coding, math) is broadly applicable [00:55:58]. This “actor-critic” pattern, familiar from reinforcement learning, turns the hard problem of generation into the easier problem of verification, enabling applications beyond coding and math, including autonomous driving [00:56:37].
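A minimal sketch of the generate-and-verify loop described here, with a toy problem standing in for coding, math, or plan verification:

```python
import random

def solve_by_verification(problem, generate, verify, n_candidates=32):
    """Generation is hard, verification is easy: sample many candidates
    and return the first one that checks out."""
    for _ in range(n_candidates):
        candidate = generate(problem)      # e.g., an LLM sampling a program,
        if verify(problem, candidate):     # a proof, or a driving plan
            return candidate
    return None  # nothing verified: fall back or escalate

# Toy instance: proposing a nontrivial divisor is "hard to generate",
# but checking one is a single modulo operation.
propose = lambda n: random.randrange(2, n)
check = lambda n, d: n % d == 0
print(solve_by_verification(91, propose, check, n_candidates=500))
# almost surely prints 7 or 13, since 91 = 7 * 13
```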
Areas where these models will be effective include multi-step processes requiring credit assignment [00:57:51]. This is likened to “RL done right”: large models and supervised learning provide a strong bootstrap, with reinforcement learning used for fine-tuning to achieve expert-level reasoning [00:58:48].
Top-of-Mind Questions
- The progression of the “world model” thrust, aiming for controllable video and world generation [00:59:43]. This could lead to purely generative video games [01:00:07].
- Whether current large multimodal model architectures are sufficient for good world models, or whether new architectural leaps are needed [01:00:21].
- The potential for models to act as digital twins of anything, effectively turning every computer into a generative model, which will require massive compute investments [01:00:42].
Quick Fire Round
Overhyped/Underhyped
- Humanoid Robotics: Both overhyped and underhyped [01:02:01]. It’s overhyped by large investments not yet justified by current capabilities, risking a “humanoid winter” if patience runs out [01:02:09]. However, it’s also underhyped because the potential impact is massive, and those in robotics “can’t afford not to make them work” [01:02:46].
Predictions
- LLM Model Progress: More progress this year than last year [01:03:11].
- Robotics Models Progress: More progress this year [01:03:17].
- Self-driving car rides exceeding human drivers in the US: Hopes that in the future, human driving will seem “crazy” given the accidents it generates [01:03:54]. The question is whether this happens in his lifetime [01:04:21].
- Go-to thing when new model comes out: Checks LLM leaderboards to see where it stands on metrics, rather than relying on “vibes” [01:04:52]. Focuses on whether the model helps in daily life or business [01:05:30].
- Most Americans having a robot in their house: Already have robots like dishwashers, but mobile manipulators like Rosie will take a long time [01:05:39]. Household robots need to justify their “square footage” and meet extremely high safety standards [01:06:01]. Roomba’s success is attributed to its contained area of operation [01:06:39]. More immediate applications are likely in logistics, industrial settings, near-home spaces (e.g., last-meter delivery), office environments, and hospitals where scale exists and there are established protocols for addressing minor damage [01:07:28].
- Under-talked about implications of AI progress: Its impact on education [01:08:12]. The focus on cheating ignores the potential for AI to be a “magical tool to learn things” through interactive, engaging conversations [01:08:34].
Exciting AI Startups/Research
- Cheese: Specifically, startups designing plant-based cheese using AI techniques [01:10:07]. This involves designing new products that are cheaper, more sustainable, and can have a massive impact due to the scale of animal farming and milk production [01:10:39]. An AI-based blue cheese is already served in top restaurants [01:11:29]. The speaker finds “AI plus something you don’t think about when you think about technology” to be particularly exciting [01:11:54].
To learn more about Vincent Vanhoucke’s thoughts on machine learning, you can visit his blog on Medium [01:12:35].