From: redpointai
The integration of large language models (LLMs) and vision-language models (VLMs) is profoundly transforming the landscape of autonomous vehicles and robotics. These advanced models are not merely incremental improvements; they are fundamentally reshaping how self-driving cars perceive and interact with the world, and how robots learn and generalize tasks [00:00:49].
LLMs in Self-Driving: The Waymo Experience
Waymo, a pioneer in self-driving technology, has leveraged LLM and VLM advancements without discarding its existing frameworks [01:40]. The foundation model revolution allows Waymo to build “teacher models” – large-scale models run in the cloud that ingest vast amounts of driving data and internet data [01:48]. The “teacher” then distills its knowledge into the onboard models of the autonomous car, providing a more informed form of supervision [02:22]; a minimal sketch of this teacher-student pattern follows.
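A minimal sketch of the teacher-student distillation pattern described above, in PyTorch. The model shapes, loss, and stand-in data are illustrative assumptions, not Waymo’s actual pipeline:

```python
# Minimal teacher-student distillation sketch (illustrative, not Waymo's pipeline).
# A large cloud-side "teacher" produces soft targets that supervise a small
# onboard "student" that must run under the car's compute constraints.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))  # large, cloud-side
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))    # small, onboard

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
features = torch.randn(16, 64)  # stand-in for perception features from logged driving data
with torch.no_grad():
    teacher_logits = teacher(features)  # the teacher runs offline, free of onboard limits
loss = distillation_loss(student(features), teacher_logits)
loss.backward()
optimizer.step()
```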
Key Manifestations and Benefits
- World Knowledge: LLMs and VLMs introduce “World Knowledge,” which is the semantic understanding of the environment [03:14]. This allows the autonomous driver to recognize unfamiliar objects like different regional police cars or accident scenes, even if the car hasn’t experienced them directly in its training data [03:46]. This knowledge is derived from pre-training on extensive visual and text data from the web [04:20].
- Enhanced Reasoning: The pre-training on diverse data significantly enhances the models’ reasoning capabilities [04:51]. Larger models generally perform better, with scale improving the ability to process and understand complex scenarios [05:00].
Limitations and Safety
While LLMs offer significant advantages, they are not a complete solution for all aspects of self-driving [05:11].
- Safety and Regulatory Constraints: Aspects related to strict safety contracts and regulatory constraints must be expressed explicitly, outside of the AI model [05:37]. This external layer verifies that the AI-proposed driving plan meets all safety and compliance requirements, ensuring reasonable behavior at all times [05:59]. This “checking layer” or set of “guard rails” around the reasoning models’ output is crucial [06:35]; a toy sketch of such a layer follows.
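A toy sketch of what such an explicit checking layer could look like. The trajectory fields, thresholds, and rule set are assumptions for illustration, not Waymo’s actual safety contract:

```python
# Illustrative "checking layer" sketch: explicit, auditable rules applied to a
# model-proposed plan. Fields and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    speed_mps: float        # planned speed at this point
    clearance_m: float      # distance to the nearest obstacle
    lateral_accel: float    # lateral acceleration, m/s^2

SPEED_LIMIT_MPS = 13.4      # e.g. a 30 mph zone
MIN_CLEARANCE_M = 0.5
MAX_LATERAL_ACCEL = 3.0

def verify_plan(plan: list[TrajectoryPoint]) -> bool:
    """Return True only if every point satisfies the explicit safety contract."""
    return all(
        p.speed_mps <= SPEED_LIMIT_MPS
        and p.clearance_m >= MIN_CLEARANCE_M
        and abs(p.lateral_accel) <= MAX_LATERAL_ACCEL
        for p in plan
    )

proposed = [TrajectoryPoint(12.0, 1.2, 1.1), TrajectoryPoint(14.5, 0.9, 1.3)]
if not verify_plan(proposed):
    pass  # fall back to a conservative plan; never execute an unverified one
```

The key design property is that the rules live outside the learned model: they are explicit and auditable, and they hold regardless of what the planner proposes.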
Sensor Suite Strategy
Waymo uses a complementary suite of cameras, lidars, and radars [02:30:57]. Each sensor type has unique strengths and weaknesses that complement the others, providing diversity for cross-validation [02:40:00]; a toy sketch of this cross-validation idea follows the list below.
- Waymo’s historical decision to “over-sensorize” was based on solving the harder L4 (fully autonomous) problem first [02:41:50]. This approach generates the necessary data to inform decisions on cost reduction and simplification for future generations of cars [02:48:07].
- Redundancy: The need for redundancy in sensors is unlikely to disappear, as different sensors provide diverse and complementary information for safety [02:55:00].
- Superhuman Performance: The bar for L4 driving is considered “above human level” [03:00:27]. Waymo’s safety reports indicate they are safer than the average human driver, with fewer collisions and reported injuries [03:05:02].
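As a toy illustration of the cross-validation idea referenced above (the sensor names and the voting rule are assumptions, not Waymo’s fusion logic):

```python
# Sketch of cross-validating detections across a diverse sensor suite:
# trust a detection only when independent modalities agree on it.
def cross_validate(camera_hits, lidar_hits, radar_hits, min_agreement=2):
    """Keep a detection only if at least `min_agreement` modalities report it."""
    confirmed = []
    for obj in camera_hits | lidar_hits | radar_hits:
        votes = sum(obj in hits for hits in (camera_hits, lidar_hits, radar_hits))
        if votes >= min_agreement:
            confirmed.append(obj)
    return confirmed

# Example: lidar and radar both see the pedestrian the camera misses in fog.
print(cross_validate({"car_1"}, {"car_1", "ped_7"}, {"car_1", "ped_7"}))
```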
Scaling and Milestones
The main challenges for self-driving today revolve around scaling and addressing the “long tail” of rare, exceptional, and difficult problems [01:11:16]. As autonomous vehicles drive millions of miles, events that might occur once in a human’s lifetime become common occurrences [01:18:00].
- Simulation: Waymo heavily utilizes simulation and synthetic scenarios to validate models against potential problems that may not have been observed in the real world [01:34:00]. They also modify real-world scenarios to make them more challenging (e.g., adding adversarial drivers) to improve the car’s reactivity [01:38:00] (see the scenario-hardening sketch after this list).
- Future Milestones: The next major milestones in autonomous vehicles will be centered on geographic expansion, such as Waymo’s first international experiment driving on the left side of the road in Tokyo [03:00:00].
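A sketch of the scenario-hardening idea from the simulation bullet above. The scenario schema and the specific perturbations are illustrative assumptions:

```python
# Illustrative perturbation of a logged scenario to make surrounding agents
# more adversarial before replaying it against the driving stack.
import random

def harden_scenario(scenario, aggression=0.3, seed=0):
    """Return a copy where other drivers react more slowly and follow closer."""
    rng = random.Random(seed)
    hardened = []
    for agent in scenario["agents"]:
        agent = dict(agent)  # leave the logged original untouched
        agent["reaction_time_s"] *= 1 + aggression * rng.random()
        agent["follow_gap_m"] = max(1.0, agent["follow_gap_m"] * (1 - aggression))
        hardened.append(agent)
    return {**scenario, "agents": hardened}

logged = {"agents": [{"id": "veh_2", "reaction_time_s": 1.2, "follow_gap_m": 8.0}]}
print(harden_scenario(logged))
```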
Advancements in Simulation and World Modeling
A significant technical advance that could change the landscape of autonomous driving is the development of reliable, physically realistic world models [01:14:15].
- Video Prediction Models: Early “proto-world models” include video prediction models like Sora or Veo, which can unroll an image or scene into a plausible future [01:46:00].
- Controllable and Physically Realistic Models: The challenge is to make these models controllable and physically realistic, creating a “digital twin” of the world for autonomous driving [01:50:00].
- Causality: A deep question at the heart of these world models is causality [01:30:00]. While current models can generate plausible videos by learning correlations, understanding and injecting causality (how an input changes an output) is crucial for functional, controllable models [01:57:00]. This may come from proper data engineering and inductive biases rather than new architectures [01:47:00].
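A sketch of the distinction drawn in the causality bullet: a pure video predictor learns correlations over futures, while a controllable world model must also condition on the ego action – the causal input a planner can intervene on. All class and method names here are hypothetical:

```python
# Hypothetical interface: a next-state predictor conditioned on the ego action.
# A video model learns p(future | past); a controllable world model needs
# p(future | past, action) so that changing the action changes the rollout.
import torch
import torch.nn as nn

class ControllableWorldModel(nn.Module):
    def __init__(self, state_dim=128, action_dim=4):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, state_dim)
        )

    def rollout(self, state, actions):
        """Unroll a latent state under a sequence of candidate ego actions."""
        trajectory = []
        for action in actions:
            state = self.dynamics(torch.cat([state, action], dim=-1))
            trajectory.append(state)
        return trajectory

model = ControllableWorldModel()
future = model.rollout(torch.zeros(1, 128), [torch.zeros(1, 4)] * 5)  # 5-step rollout
```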
LLMs and General Robotics
An autonomous car is fundamentally a robot, sharing core problems of perception, planning, and actuation [08:31:00]. However, general robotics is still “chasing the nominal use case” – how to get a generalized robot to perform any desired task [03:13:00].
Impact of LLMs on Robotics
The surprising breakthrough with LLMs in robotics is the rapid transition from a chatbot describing how to make coffee to an actionable plan for a robot [03:34:00].
- Common Sense Knowledge: LLMs provide robots with common sense knowledge that was previously difficult to embed, such as knowing a cup belongs on a table, not the floor, or that a microwave is found in the kitchen [03:42:00]. LLMs aggregate this everyday knowledge, which robots previously had no access to [03:50:00].
- Action as Language: The realization that robot actions can be viewed as a “different language” (body actions instead of words) allows robotics to leverage the machinery of multimodal and multilingual large language models [03:55:00].
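A sketch of the action-as-language idea, in the spirit of systems like RT-2: continuous robot commands are quantized into discrete tokens a multimodal language model can emit like words. The bin count and token format are illustrative assumptions:

```python
# Quantize continuous actions into discrete tokens so an LLM can "speak" them.
import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each action dimension to one of NUM_BINS discrete token strings."""
    clipped = np.clip(action, low, high)
    bins = ((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return [f"<act_{b}>" for b in bins]

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert the quantization back to approximate continuous commands."""
    bins = np.array([int(t[5:-1]) for t in tokens])
    return low + bins / (NUM_BINS - 1) * (high - low)

tokens = action_to_tokens(np.array([0.1, -0.5, 0.9]))  # e.g. gripper dx, dy, dz
print(tokens, tokens_to_action(tokens))
```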
Generalization and Specialization
The future of robotics models will likely involve both generalized and specialized approaches [03:58:00].
- Generalized Teacher Models: The goal is to build a generalized “teacher” or backbone model that can be easily retargeted and optimized for single tasks [03:09:00]. This is analogous to instruction tuning in LLMs, where generic capabilities are developed and then adapted via prompting or fine-tuning for specific tasks [03:22:00].
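A minimal sketch of this retargeting recipe: freeze a pretrained generalist backbone and fine-tune only a small task-specific head. Shapes and names are illustrative assumptions:

```python
# Retarget a generalist backbone to one task by training only a small head.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())  # pretrained, general-purpose
task_head = nn.Linear(512, 7)                             # e.g. 7-DoF arm commands

for param in backbone.parameters():
    param.requires_grad = False                           # keep general capabilities intact

optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-4)
obs = torch.randn(8, 512)                                 # stand-in task observations
target = torch.randn(8, 7)                                # demonstrated actions
loss = nn.functional.mse_loss(task_head(backbone(obs)), target)
loss.backward()
optimizer.step()
```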
Hardware vs. Software Approaches
Two main approaches exist for building generalized robot models:
- Hardware-Centric: Building the most capable humanoid robot first, then expecting it to accomplish tasks [03:58:00].
- Software-First: Building general intelligence and trusting it can be retargeted to new platforms relatively easily [03:15:00].
- The “software-first” path, as demonstrated by work like RT-X, is favored for faster progress given the current bottleneck of data acquisition [03:37:00]. Relying on expensive, wobbly robots for data collection is a significant challenge to scalability [04:05:00].
Simulation vs. Real-World Data
The debate between training on simulation versus teleoperated real-world data remains unresolved [04:00:00].
- Locomotion and Navigation: Simulation has been effective here, with the sim-to-real gap being manageable [04:02:00].
- Manipulation: Simulation struggles for manipulation due to the difficulty in replicating the diversity of experience, contact quality, and realistic physics [04:08:00]. Scaling up physical operations to collect real-world data has proven to be a faster path for manipulation [04:19:00].
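A sketch of domain randomization, the standard tool for managing the sim-to-real gap discussed above. The randomized parameters are illustrative; contact-related quantities like friction are precisely the ones that make manipulation hard to simulate:

```python
# Train across many sampled "worlds" so a policy cannot overfit one physics setting.
import random

def randomize_physics(seed=None):
    rng = random.Random(seed)
    return {
        "friction": rng.uniform(0.4, 1.2),        # contact quality: hardest to match
        "object_mass_kg": rng.uniform(0.05, 0.5),
        "motor_latency_s": rng.uniform(0.00, 0.04),
        "camera_hue_shift": rng.uniform(-0.1, 0.1),
    }

for episode in range(3):
    print(randomize_physics(seed=episode))  # each episode runs under new physics
```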
Data Acquisition Bottlenecks and Solutions
- Human-Robot Interaction (HRI): The HRI field needs to focus more on data acquisition strategies [04:05:00]. Different methods include kinesthetic teaching, puppeteering, and teleoperation [04:34:00].
- Third-Party Imitation: Learning by observing videos of people performing tasks (third-party imitation) is a promising but currently unsolved area, requiring the ability to infer causality from observation [04:05:00].
- Multimodal Models: The transfer of visual information from large multimodal models to robots significantly accelerates data acquisition [04:45:00]. For example, a robot can understand “Taylor Swift” without explicit training data, because that knowledge is embedded in the larger model [04:06:00].
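A sketch of this open-vocabulary transfer via a shared vision-language embedding space. The encoders here are hypothetical stubs standing in for a pretrained CLIP-style model; the point is that a concept like “Taylor Swift” needs no robot-specific training data:

```python
# Open-vocabulary grounding sketch. embed_text / embed_image are fake stubs;
# a real system would call a pretrained vision-language encoder instead.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)  # fake embedding
    return rng.standard_normal(512)

def embed_image(crop) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(str(crop))) % 2**32)  # fake embedding
    return rng.standard_normal(512)

def locate(instruction: str, scene_crops: list) -> int:
    """Pick the scene region whose embedding best matches the instruction."""
    query = embed_text(instruction)
    scores = []
    for crop in scene_crops:
        emb = embed_image(crop)
        scores.append(float(query @ emb / (np.linalg.norm(query) * np.linalg.norm(emb))))
    return int(np.argmax(scores))

crops = ["crop_of_album_cover", "crop_of_mug", "crop_of_chair"]
print(locate("the Taylor Swift album cover", crops))
```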
Unanswered Questions and Future Directions
- Generalizing Motion: A key question is whether robots can generalize motion and actions in the same way they generalize perception [04:54:00].
- Robotics as Another AI Language: The hypothesis that robotics is “just another language of AI” appears to hold, with techniques like diffusion models from video generation proving effective for motion generation in robotics [04:42:00].
- Scaling Laws: Scaling laws observed in large language models also apply to autonomous driving models, showing similar log-linear growth with data and size, suggesting that further scaling may continue to yield performance gains [04:56:00].
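A sketch of checking such a log-linear trend: fit a power law error ≈ a·N^(−b) by regressing log-error on log-data. The data points below are made up for illustration:

```python
# Fit a power-law scaling trend: a straight line in log-log space.
import numpy as np

dataset_sizes = np.array([1e6, 1e7, 1e8, 1e9])    # e.g. miles of driving data
eval_error = np.array([0.30, 0.21, 0.15, 0.105])  # hypothetical benchmark error

slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(eval_error), deg=1)
print(f"power-law exponent b ≈ {-slope:.2f}")
predicted = np.exp(intercept) * (1e10) ** slope   # extrapolate one decade further
print(f"predicted error at 1e10: {predicted:.3f}")
```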
Broader Implications of AI Progress
The advancements in AI, particularly with LLMs, are poised to have a profound impact beyond autonomous systems.
- Impact on Education: AI tools like ChatGPT are often discussed in the context of cheating, but their true potential lies in serving as interactive, engaging, and personalized learning tools that transform education [05:15:00].
- AI in Daily Life: AI’s application extends to seemingly unconventional areas, such as using AI techniques to design new products like plant-based cheese, which can have a massive impact on sustainability and daily life [05:24:00].
- Generalized Applicability: The problem space of “hard to generate, but easy to verify” is broad, allowing LLMs to be highly effective in diverse applications beyond coding and math [05:58:00]. This generative-discriminative paradigm (actor-critic model) is applicable to many fields [05:49:00] (a toy sketch of the generate-and-verify loop follows this list).
- Reasoning and Reinforcement Learning (RL): Anything multi-step that requires credit assignment can benefit from enhanced reasoning capabilities [05:58:00]. The current paradigm suggests bootstrapping with a large model via supervised learning, then using RL for fine-tuning to achieve expert-level reasoning [05:58:00].
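A toy sketch of the generate-and-verify loop referenced above, using integer factoring as a stand-in for tasks that are hard to generate for but easy to check:

```python
# Generate many candidates with a cheap unreliable generator, keep only those
# an exact checker accepts; verified outputs can then supervise the generator.
import random

def propose(n, rng):
    """Unreliable generator: guess a factor pair of n."""
    a = rng.randint(2, n - 1)
    return (a, n // a)

def verify(n, candidate):
    """Cheap, exact discriminator: check the proposal directly."""
    a, b = candidate
    return a * b == n and a > 1 and b > 1

rng = random.Random(0)
n = 391  # = 17 * 23, a toy "hard to generate, easy to verify" instance
accepted = [c for c in (propose(n, rng) for _ in range(10000)) if verify(n, c)]
print(accepted[:3])  # verified solutions become training signal (RL-style)
```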