From: redpointai

Autonomous vehicle technology has seen significant advancements, with companies like Waymo leading the charge. Vincent Vanhoucke, a distinguished engineer at Waymo and former lead of Google’s robotics team, provides insights into the current state and future trajectory of this field [00:00:30] [00:00:40].

Impact of AI Models on Autonomous Driving

The recent foundation model revolution, spanning large language models (LLMs) and vision-language models (VLMs), has profoundly influenced autonomous vehicle development [00:01:48]. These models make it possible to train “teacher models” on vast amounts of data, including internet data, and use them to build comprehensive models of the Waymo driver, of car behavior, and of the environment [00:01:51] [00:02:00]. This approach enriches existing systems without requiring a complete overhaul, essentially giving every model a better teacher and more information [00:02:22] [00:02:55].
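
To make the “better teacher” idea concrete, here is a minimal sketch of teacher-student knowledge distillation, the generic technique this pattern builds on; the model sizes, shapes, and training loop are illustrative assumptions, not Waymo’s actual stack:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the real models: a large frozen "teacher" trained on broad
# data, and a compact "student" that must run onboard. Shapes are illustrative.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
teacher.eval()  # the teacher is trained elsewhere and frozen here

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
temperature = 2.0  # softens the teacher's distribution

def distill_step(batch: torch.Tensor) -> float:
    """One distillation step: the student learns to match the teacher."""
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    # KL divergence between softened distributions -- the classic
    # knowledge-distillation loss (Hinton et al., 2015).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = distill_step(torch.randn(32, 128))  # fake sensor-feature batch
```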

A key contribution of LLMs and VLMs is “world knowledge”: semantic understanding of the surrounding environment [00:03:14] [00:03:22]. For example, these models can recognize what a police car or an accident scene looks like even when such scenarios appear rarely in Waymo’s own driving data, as in a newly entered city [00:03:33] [00:04:06]. This external knowledge from the web enhances the driver’s capabilities [00:04:33].

However, AI models are not exclusively relied upon for all aspects of self-driving [00:05:11]. Aspects related to strict safety contracts and regulatory constraints are expressed explicitly, outside the AI model [00:05:30] [00:05:41]. This “checking layer” or “guard rails” ensures that the AI-proposed driving plan meets all safety and compliance requirements [00:06:01] [00:06:36].
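
A minimal sketch of what such a checking layer could look like; the plan representation and rule set here are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DrivingPlan:
    """A candidate trajectory proposed by the learned planner (illustrative)."""
    max_speed_mps: float
    min_gap_m: float          # closest predicted distance to any road user
    crosses_solid_line: bool

# Explicit, human-auditable constraints kept outside the AI model.
SPEED_LIMIT_MPS = 13.4   # e.g. a 30 mph zone
MIN_SAFE_GAP_M = 1.5

def check_plan(plan: DrivingPlan) -> list[str]:
    """Return the list of violated constraints; empty means the plan passes."""
    violations = []
    if plan.max_speed_mps > SPEED_LIMIT_MPS:
        violations.append("exceeds speed limit")
    if plan.min_gap_m < MIN_SAFE_GAP_M:
        violations.append("insufficient clearance to road users")
    if plan.crosses_solid_line:
        violations.append("illegal lane-boundary crossing")
    return violations

plan = DrivingPlan(max_speed_mps=12.0, min_gap_m=2.1, crosses_solid_line=False)
if violations := check_plan(plan):
    print("rejected:", violations)   # fall back to a safe alternative
else:
    print("plan accepted")
```

The design point is that the rules are explicit code, not learned weights, so they can be audited against safety and regulatory requirements independently of the model.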

Waymo’s Journey and Challenges

Vincent Vanhoucke’s transition to Waymo was partly accidental, prompted by his personal experience using the service while recovering from an injury [00:07:00] [00:07:02]. He found the product “magical”: an AI system that could easily touch everyone [00:07:27] [00:07:46].

At its core, an autonomous car is a robot, sharing the same inputs (sensors, cameras) and outputs (actuation like steering and acceleration) as a manipulation robot [00:08:31] [00:08:53]. The major difference lies in the operational domain [00:09:06]. While general robotics still focuses on achieving nominal behavior (e.g., picking up an object, making coffee) [00:09:12] [00:09:53], autonomous driving has a working nominal system at a commercial product level of safety and performance [00:09:58] [00:10:02].

The primary challenges and bottlenecks in self-driving cars for Waymo today revolve around scaling [00:10:19]. While issues like driving in snow are solvable with more attention [00:10:37], the “long tail” of rare, exceptional, and difficult problems dominates the equation [00:11:30] [00:12:01]. What a human driver might encounter once in a lifetime, Waymo cars experience weekly or monthly [00:11:46]. Waymo addresses these long-tail problems through extensive simulation and by synthesizing scenarios, often modifying real-world events to make them worse (e.g., “make all the drivers drunk drivers” or “actively adversarial”) [00:12:35] [00:13:17].
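
The scenario-synthesis idea can be sketched as a perturbation over logged data; the scenario format and parameters below are invented for illustration:

```python
import copy
import random

def make_adversarial(scenario: dict, aggressiveness: float = 0.5) -> dict:
    """Perturb a logged scenario so other agents behave worse than observed.

    `scenario` is an illustrative dict of agent tracks; real systems use far
    richer log formats. This mimics the idea of turning recorded drivers into
    erratic or adversarial ones to stress-test the planner.
    """
    worse = copy.deepcopy(scenario)
    for agent in worse["agents"]:
        # Exaggerate speed and add lane-keeping noise ("drunk driver").
        agent["speed_mps"] *= 1.0 + aggressiveness * random.random()
        agent["lateral_jitter_m"] = aggressiveness * random.uniform(0.0, 1.0)
        # Occasionally steer toward the AV ("actively adversarial").
        if random.random() < aggressiveness:
            agent["target"] = "ego_vehicle"
    return worse

logged = {"agents": [{"id": 1, "speed_mps": 10.0}, {"id": 2, "speed_mps": 8.0}]}
variants = [make_adversarial(logged, a) for a in (0.2, 0.5, 0.9)]
```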

The Future: Reliable World Models

A critical technical advance that could fundamentally change the autonomous driving landscape is the development of reliable, physically realistic world models [00:14:05] [00:14:11]. These models would enable simulation of the real world with high physical realism and accurate scene rendering [00:14:18] [00:14:26]. Current video prediction models like Sora or Veo are proto-world models that can unroll plausible futures, but they lack the physical realism and controllability needed for autonomous driving [00:14:45] [00:15:08]. The challenge is to make these models controllable, physically realistic, rich in detail, and plausible in visual appearance and object behavior [00:15:09] [00:15:13].

The main obstacle to building effective world models is the deep question of causality [00:17:30]. While current models can generate plausible videos by learning correlations, achieving controllability requires understanding causality – how an input change leads to a specific output [00:18:00] [00:18:02].
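
One way to see what “controllable” demands is to sketch the interface such a world model would have to expose; the types and method names are assumptions, not any shipping system:

```python
from typing import Protocol
import numpy as np

class WorldModel(Protocol):
    """Interface a controllable world model would need to expose.

    Unlike a pure video predictor, `step` is conditioned on an action, so the
    same state can be unrolled under different driving decisions. All types
    here are illustrative.
    """
    def encode(self, observation: np.ndarray) -> np.ndarray: ...
    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray: ...
    def render(self, state: np.ndarray) -> np.ndarray: ...

def rollout(model: WorldModel, obs: np.ndarray,
            actions: list[np.ndarray]) -> list[np.ndarray]:
    """Unroll one plausible future for a given action sequence."""
    state = model.encode(obs)
    frames = []
    for action in actions:
        state = model.step(state, action)    # dynamics must be causal here:
        frames.append(model.render(state))   # change the action, change the future
    return frames
```

The `step(state, action)` signature is exactly where the causality question bites: a model trained only on correlations can produce plausible frames, but it cannot guarantee that varying the action input changes the future in the physically right way.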

Waymo’s Research and Sensor Strategy

Waymo actively conducts significant research, often leading the state-of-the-art in AI, acknowledging that they cannot always rely on the broader community for solutions to bespoke AV problems [00:19:59] [00:20:11]. Their Waymo Open Datasets, for instance, were designed to steer academic research toward relevant AV challenges [00:19:37].

Waymo’s models demonstrate remarkable robustness and portability across different cities, with adaptations mostly focusing on thorough evaluation and ensuring no critical differences are missed (e.g., variations in emergency vehicle appearance) [00:20:30] [00:20:56].

Waymo employs a comprehensive sensor suite, including cameras, lidars, and radars [00:22:30] [00:22:32]. These sensors are highly complementary, with their strengths and weaknesses balancing each other out, providing crucial redundancy and orthogonal data for verification [00:22:40] [00:22:50].
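
A toy sketch of how orthogonal sensors can verify each other; the detection format and gating threshold are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """An object hypothesis from one sensor modality (illustrative)."""
    sensor: str                     # "camera", "lidar", or "radar"
    position_m: tuple[float, float]  # (x, y) in the vehicle frame
    confidence: float

def cross_validate(detections: list[Detection],
                   gate_m: float = 1.0) -> list[Detection]:
    """Keep objects corroborated by at least two modalities.

    A toy version of using orthogonal sensors for verification: a camera-only
    ghost (e.g. a reflection) is dropped unless lidar or radar also sees
    something at roughly the same place.
    """
    confirmed = []
    for det in detections:
        corroborators = [
            o for o in detections
            if o.sensor != det.sensor
            and abs(o.position_m[0] - det.position_m[0]) < gate_m
            and abs(o.position_m[1] - det.position_m[1]) < gate_m
        ]
        if corroborators:
            confirmed.append(det)
    return confirmed
```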

Unlike companies that start with L2 driving assistance and attempt to scale up to L4 (fully autonomous) with economic constraints on sensors, Waymo historically chose to “over-sensorize” from the start [00:23:17] [00:24:20] [00:24:29]. This strategy allowed them to solve the harder problem first, gathering essential data to inform future cost reduction and simplification efforts [00:24:32] [00:24:51]. The argument that humans drive with only eyes and don’t need lidar is countered by the conviction that the bar for L4 driving is above human level [00:26:15] [00:26:32] [00:26:37]. Waymo’s safety reports indicate they are already safer than the average human driver, with fewer collisions and injuries [00:26:46]. This “superhuman” performance is seen as a business requirement for successful L4 driving [00:27:10] [00:27:14].

Milestones and Future Trajectory

The history of autonomous vehicles highlights the difficulty of predicting timelines. The first transcontinental autonomous drive in 1995 achieved over 99% autonomy, leading to initial over-optimism [00:28:40] [00:28:49]. It took 30 years to reach commercial deployment [00:29:17].

Currently, Waymo has achieved technology validation in cities like Phoenix and San Francisco, along with strong user validation [00:29:32] [00:29:41]. The next milestones will focus on expansion into new geographies [00:30:14]. Waymo has begun collecting data in Tokyo, marking their first international experiment and initial foray into left-side driving [00:30:31].

Trust: More Than Just Technology

Waymo emphasizes building trust with local communities through respectful operations and supporting the areas they work in, recognizing that public trust is crucial for deployment [00:21:59] [00:22:04].

Generalization in Robotics

A significant open question in robotics is whether motion and actions can be generalized the way models generalize over visual inputs [00:48:06]. While current robot demonstrations often showcase a single task, the goal is for robots to generalize across diverse skills [00:32:54] [00:32:57].

The application of LLMs to robotics has been a pleasant surprise [00:34:04]. The key breakthrough is the ability of LLMs to provide common-sense knowledge (e.g., a cup goes on a table, a microwave is in the kitchen) that was historically difficult to inject into robotics [00:34:35] [00:34:42]. This high-level, albeit fuzzy, language-based knowledge can be quickly translated into actionable robot plans [00:35:17]. Furthermore, recognizing robot actions as another form of language or “dialect” allows leveraging multimodal and multilingual large models for robot control [00:35:55] [00:36:10].
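
A minimal sketch of the “actions as a dialect” idea: discretize continuous robot commands into tokens that a language model can emit. The ranges and bin count are illustrative, not taken from any particular model:

```python
import numpy as np

# Discretize each action dimension into 256 bins so a continuous robot
# command becomes a short token sequence a language model can produce.
ACTION_LOW = np.array([-1.0, -1.0, -1.0])   # e.g. end-effector dx, dy, dz
ACTION_HIGH = np.array([1.0, 1.0, 1.0])
NUM_BINS = 256

def action_to_tokens(action: np.ndarray) -> list[int]:
    """Map a continuous action to discrete tokens."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return [int(t) for t in (normalized.clip(0, 1) * (NUM_BINS - 1))]

def tokens_to_action(tokens: list[int]) -> np.ndarray:
    """Invert the mapping when decoding the model's output."""
    normalized = np.array(tokens) / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)

tokens = action_to_tokens(np.array([0.25, -0.5, 0.0]))  # -> [159, 63, 127]
action = tokens_to_action(tokens)
```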

Approaches to Robotics Development

There are two main approaches to developing robots:

  1. Hardware-Centric: Building the most capable humanoid robot first, then developing the software [00:38:58]. This path can be very expensive and difficult to scale due to the complexity of operationalizing wobbly, costly robots for data acquisition [00:40:07] [00:40:13].
  2. Software-First: Building intelligence first and trusting that it can be readily retargeted to new platforms [00:39:15]. This approach, exemplified by the RT-X project, allows faster progress by optimizing for data collection and execution speed, especially since the fundamental problem of robotic manipulation is not yet solved [00:39:37] [00:40:41].

While simulation is valuable for locomotion and navigation due to a smaller sim-to-real gap, it has been less successful for manipulation [00:41:12] [00:41:19]. The cost of setting up diverse, realistic, and tuned physics environments for manipulation simulation is very high [00:42:02]. Scaling physical operations to collect large amounts of real-world data has proven to be a faster path [00:42:27].

Data Acquisition and Human-Robot Interaction

A critical bottleneck in robot learning is data acquisition [00:44:10]. Effective methods include kinesthetic teaching, puppeteering, and teleoperation with gloves [00:44:36]. A desired but currently unsolved method is “third-party imitation,” where robots learn by observing videos of humans, which again relies on inferring causality from observation [00:45:04] [00:45:23].
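
A hedged sketch of how a teleoperated demonstration might be logged as (observation, action) pairs for later imitation learning; `robot` and `teleop` are hypothetical interfaces standing in for a real robot driver and a teleoperation device (glove, puppeteering rig, etc.):

```python
import json
import time

def record_demonstration(robot, teleop, path: str, hz: float = 10.0) -> None:
    """Log a teleoperated episode as (observation, action) pairs.

    Real pipelines store richer, binary data; this sketch only shows the
    loop structure: the human commands, the robot mirrors, and both sides
    of each timestep are saved for imitation learning.
    """
    episode = []
    period = 1.0 / hz
    while teleop.is_active():
        obs = robot.get_observation()      # e.g. camera image, joint angles
        action = teleop.read_command()     # the human operator's command
        robot.apply(action)                # the robot mirrors the human
        episode.append({"obs": obs, "action": action, "t": time.time()})
        time.sleep(period)
    with open(path, "w") as f:
        json.dump(episode, f)              # consumed later by a training job
```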

The advent of large multimodal models has accelerated data acquisition by letting visual knowledge transfer directly to robots [00:45:45]. For instance, a robot can move a Coke can toward a picture of Taylor Swift without any explicit teaching about Taylor Swift, because that knowledge is embedded in the multimodal model [00:46:02]. The remaining challenge is acquiring motion data for physical skills [00:46:39].
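
A minimal sketch of how a pretrained multimodal embedding space can ground a target the robot was never taught about; the `encoder` interface is a hypothetical stand-in for any CLIP-style model:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_target(crops: list[np.ndarray], query: str, encoder) -> int:
    """Choose which detected image region best matches a language query.

    `encoder` stands in for a CLIP-style multimodal model with
    `encode_image` / `encode_text` methods (hypothetical names). The robot
    never needs to be taught who Taylor Swift is; that knowledge lives in
    the pretrained embedding space.
    """
    text_vec = encoder.encode_text(query)            # e.g. "Taylor Swift"
    scores = [cosine(encoder.encode_image(c), text_vec) for c in crops]
    return int(np.argmax(scores))                    # index of the best crop
```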

AI in Consumer Devices and Robotics

While robots already exist in homes (e.g., dishwashers, washing machines), a consumer mobile manipulator along the lines of the Jetsons’ “Rosie” robot is still some way off [00:55:50]. The bar for a robot in a home is extremely high because it must be safe and must not damage its surroundings [00:56:01] [00:56:11]. Roomba-like robots have succeeded because they operate in low-risk areas [00:56:41]. More immediate applications for mobile robots are expected in logistics, industrial, near-home (e.g., last-meter delivery), office, and hospital environments, where there is scale and infrastructure to manage potential incidents [01:07:31].

Under-talked Implications of AI

A significant, yet under-discussed, implication of AI progress is its transformative effect on education [01:08:12]. AI offers an interactive and engaging tool for learning that goes beyond concerns about cheating [01:08:34].

Overhyped and Underhyped in AI

Vincent Vanhoucke views humanoid robotics as both overhyped and underhyped [01:02:01]. It is overhyped in that heavy investment could end in a “humanoid winter” if success doesn’t materialize, and underhyped in that roboticists should be working on it anyway, because “we can’t afford not to make them work” [01:02:30] [01:02:46] [01:02:48].

Progress in both LLM and robotics models is expected to accelerate [01:03:11] [01:03:16].

An exciting area of AI startups and research is the application of AI techniques to designing new products, such as plant-based cheese [01:10:04] [01:10:40]. This amounts to using AI to explore the design space of non-animal products, with massive potential impact [01:10:53] [01:11:05].

Key Questions for the Future

Key questions for the next few years include:

  • Can motion be generalized in the space of actions, similar to perception? [00:48:06]
  • Are there fundamental differences between robotics and other areas of AI that will require new techniques? [00:48:35] (Currently, many techniques like diffusion models from video generation are working well in robotics [00:49:06].)
  • Will scaling laws continue to apply to large models in autonomous driving, or will they hit limits? [00:49:46] (So far, log-linear growth patterns similar to those in LLMs are observed, albeit with different constants [00:49:56]; a minimal curve-fit sketch follows this list.)
  • How far can the “world model” thrust take us, especially with controllable video and world generation? [00:59:43] If current large multimodal architectures cannot produce good world models, a new architectural leap may be necessary [01:00:21].
  • What are the computational demands for creating “digital twins” of anything, and will massive investments in compute be justified? [01:00:49] [01:01:08].
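
As a footnote to the scaling-laws question above, here is a minimal sketch of how a log-linear (power-law) trend is typically checked: fit a line in log-log space. The data points are made up; only the recipe is the point:

```python
import numpy as np

# Fit loss = a * compute**b by linear regression in log-log space.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # hypothetical training FLOPs
loss = np.array([2.8, 2.3, 1.9, 1.6])          # hypothetical eval loss

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)  # slope, intercept
a = np.exp(log_a)
print(f"loss ~ {a:.2f} * compute^{b:.3f}")     # b < 0 when scaling helps
```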

Vincent Vanhoucke's Blog

Vincent Vanhoucke shares his insights and “random thoughts about machine learning” on his Medium blog [01:12:32].