From: hu-po

Robotics has traditionally faced challenges in real-world applications because a robotic system is a chain of different technologies, including hardware, vision sensors, vision algorithms, and control algorithms, and its overall performance is limited by the weakest link [00:39:00]. Large language models (LLMs) have emerged as a potential solution to these problems, improving overall performance in robotic systems [01:04:00].

How LLMs Address Robotic Challenges

Traditionally, programming a robot required explicit, precise instructions for every movement and action, such as exact trajectories and positions [05:58:00]. This meant that only robotics engineers could effectively use robots, and any change in task required extensive reprogramming, limiting robots to very specific, repetitive jobs [06:30:00].

LLMs introduce a “fuzziness” that allows for tasks to be described in natural language, enabling non-experts to give high-level instructions to robots [07:08:00]. This represents a significant advancement, as LLMs can convert these high-level human commands into a set of precise, executable instructions for the robot [07:39:00].

PaLM-SayCan: Google’s Approach

Google’s research project, PaLM-SayCan (combining their PaLM LLM with a helper robot), utilizes Chain of Thought prompting to enable robots to perform complex tasks [03:05:00].

Robot Hardware

The robots used by Google’s Everyday Robots (a group that has since been largely defunded [02:27:00]) consisted of:

  • A Roomba-like base [01:34:00].
  • A depth sensor (similar to a Velodyne LiDAR sensor) mounted under the robot’s “chin” for depth perception [01:41:00].
  • A head with pan, tilt, and twist capabilities, allowing the robot to choose what to look at [02:00:00].
  • A single arm with multiple degrees of freedom and a pincer gripper [02:09:00].

Chain of Thought Prompting in Action

Chain of Thought prompting is a prompt engineering technique that forces an LLM to “show its work” by producing the intermediate steps needed to carry out a requested task [04:15:00]. A simple example is adding “let’s think step by step” to a prompt [04:37:00]. This technique yields a fundamentally different and often better answer from the LLM [04:55:00].
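
As a minimal illustration (not the exact prompt Google uses), a zero-shot Chain of Thought prompt can be built by appending the trigger phrase to the user’s request before querying any instruction-following LLM; the `query_llm` helper below is a placeholder:

```python
# Minimal zero-shot Chain of Thought sketch. `query_llm` is a placeholder,
# not part of PaLM-SayCan; swap in any LLM API or local model call.

def build_cot_prompt(task: str) -> str:
    """Append the zero-shot CoT trigger so the model spells out its steps."""
    return f"{task}\nLet's think step by step."

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the actual model call

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "I just worked out, can you bring me a drink and a snack to recover?"
    )
    print(prompt)
    # The LLM's reply would then enumerate sub-steps rather than a one-line answer.
```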

When a human gives a complex command like “I just worked out, can you bring me a drink and a snack to recover?” [05:32:00], PaLM-SayCan interprets it [08:08:00] using Chain of Thought reasoning to split the prompt into manageable subtasks (see the sketch after this list) [07:25:00]:

  1. Find water (localization task) [12:12:00].
  2. Pick up the water (grasping/manipulation task) [12:27:00].
  3. Bring it to you (navigation task) [12:42:00].
  4. Put down the water (manipulation task) [12:49:00].
  5. Find an apple (localization task).
  6. Pick up the apple (grasping/manipulation task).
  7. Bring it to you (navigation task).
  8. Put down the apple (manipulation task).
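
The resulting plan is effectively an ordered list of calls into a small library of primitive skills. Below is a hedged sketch of that idea; the skill names are hypothetical stand-ins for the robot’s actual learned policies:

```python
# Hypothetical (skill, object) plan produced by the LLM decomposition above.
PLAN = [
    ("find", "water"),           # localization
    ("pick_up", "water"),        # grasping / manipulation
    ("bring_to_user", "water"),  # navigation
    ("put_down", "water"),       # manipulation
    ("find", "apple"),
    ("pick_up", "apple"),
    ("bring_to_user", "apple"),
    ("put_down", "apple"),
]

def execute(plan):
    """Dispatch each step; a real system would call the corresponding skill policy."""
    for skill, obj in plan:
        print(f"executing {skill}({obj!r})")

execute(PLAN)
```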

Affordance Model

PaLM-SayCan incorporates an “affordance model” that scores how possible each step is for the robot in its current environment [10:35:00]. This model, likely a neural network, determines whether an action (e.g., “find water”) is feasible [10:44:00]. The LLM then scores the relevance of the action to the original command, and the combined score guides the robot’s actions, ensuring each chosen step is both relevant and possible [11:26:00].
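
A minimal sketch of that scoring idea, assuming the two models are exposed as probability functions: the LLM’s relevance score for each candidate skill is multiplied by the affordance model’s feasibility score (as in the SayCan paper), and the highest-scoring skill is executed next.

```python
def choose_next_skill(candidate_skills, instruction, observation,
                      llm_relevance, affordance):
    """Pick the skill whose relevance-times-feasibility score is highest.

    llm_relevance(instruction, skill) -> probability the skill helps the command.
    affordance(observation, skill)    -> probability the robot can do it right now.
    Both arguments are placeholders for the LLM and the learned affordance model.
    """
    scored = {
        skill: llm_relevance(instruction, skill) * affordance(observation, skill)
        for skill in candidate_skills
    }
    return max(scored, key=scored.get)
```

The product form means a step that is highly relevant but currently impossible, or possible but irrelevant, is never selected.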

Manipulation of Open World Objects (MOO)

Another significant development is “Manipulation of Open World Objects” (MOO), which focuses on allowing robots to handle objects they have never seen before [14:43:00]. This addresses the impracticality of robots having first-hand experience with every possible object [14:34:00].

Leveraging Vision-Language Models

MOO uses pre-trained vision-language models (VLMs) like CLIP or BLIP, which have been trained on captioned images from the internet [15:34:00]. These VLMs are adept at relating visual images to textual captions, allowing them to identify objects in an image based on descriptions [15:54:00].
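
As an illustration of that image-to-caption matching, the snippet below scores one image against a few candidate captions using the open-source CLIP checkpoint in Hugging Face `transformers` (one plausible choice of model; the image path and captions are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("table_scene.jpg").convert("RGB")  # placeholder camera frame
captions = ["a bottle of water", "an apple", "a coffee mug"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means the caption better describes the image.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```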

MOO works by:

  1. Extracting object-identifying information from a natural language command using a VLM like OWL-ViT (an open-vocabulary object detector) [17:17:00].
  2. Conditioning the robot’s policy (the neural network controlling its actions) on the current image, the instruction, and the extracted object information [17:56:00]. This allows the robot to “grab this part of the image” as identified by the VLM [19:59:00] (see the sketch after this list).
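
A hedged sketch of that pipeline, using the OWL-ViT checkpoint in Hugging Face `transformers`: the detector localizes the object phrase taken from the command, and the detection is reduced to a single-pixel mask at the box center that is stacked onto the camera image as an extra channel for the policy. The policy network itself is omitted, and the image path and query are placeholders.

```python
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("table_scene.jpg").convert("RGB")  # placeholder camera frame
queries = [["a bottle of water"]]                     # object phrase from the command

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# Keep the highest-scoring box and mark its center in a single-channel mask.
best = detections["scores"].argmax()
x0, y0, x1, y1 = detections["boxes"][best].tolist()
mask = np.zeros((image.height, image.width), dtype=np.float32)
mask[int((y0 + y1) / 2), int((x0 + x1) / 2)] = 1.0

# The manipulation policy would consume the image plus the mask as a fourth channel.
policy_input = np.dstack([np.asarray(image, dtype=np.float32) / 255.0, mask])
print(policy_input.shape)  # (height, width, 4)
```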

Comparison to End-to-End Learning (RT-1)

Earlier systems, such as RT-1, were often trained end-to-end via behavior cloning from human demonstrations [28:31:00]. This meant the entire system, from input (human command, image) to output (robot action), was trained as one unit [29:06:00]. However, RT-1 was brittle when faced with previously unseen objects, because their language embeddings were novel to the trained policy [28:03:00].

MOO, in contrast, uses a frozen VLM (like OWL-ViT) that is not fine-tuned for the robot but provides consistent object detections [30:45:00]. The manipulation policy is conditioned on these VLM detections, making generalization to unseen objects simpler [28:19:00]. MOO demonstrates significantly better performance on both seen objects (92% vs. 54% for RT-1) and unseen objects (75% vs. 25% for RT-1) [39:31:00].

Input Modalities

MOO can generalize to other, non-language-based input modalities for specifying objects of interest (see the sketch after this list), such as:

  • Finger pointing: Allowing a human to simply point at an object [18:42:00].
  • Clicking on an image: Selecting a specific part of the image [22:01:00].
  • Image-based querying: Providing a stock image of the target object, especially useful when objects are hard to describe in words or when many similar objects are present [41:37:00].
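
Whichever modality is used, it can be reduced to the same conditioning signal the policy already consumes. A short hedged sketch for the click case, reusing the single-pixel mask representation from the detection sketch above (the coordinates are placeholders):

```python
import numpy as np

def mask_from_click(height: int, width: int, click_xy: tuple[int, int]) -> np.ndarray:
    """Turn a user click (pixel coordinates) into the single-pixel object mask."""
    x, y = click_xy
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y, x] = 1.0
    return mask

# A pointed-at location or a stock-image match could be converted the same way
# once it has been localized to a pixel in the robot's camera frame.
mask = mask_from_click(480, 640, (320, 240))
print(mask.sum())  # 1.0
```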

Data and Model Capacity

In machine learning and AI, a strong correlation has been established between increased performance and larger dataset sizes and model capacity [37:04:00]. This scaling trend, enabled by advancements in the internet, digital cameras, cell phones, and GPUs, contributes significantly to progress in areas like robotic manipulation [38:06:00].

Future Outlook

While past predictions about robots in every home (e.g., ASIMO in 2000 [20:45:00]) vastly underestimated the complexity of real-world robotics, the emergence of LLMs and their ability to interpret natural language for robot control supplies a crucial missing piece [20:52:00].

With hardware continuously improving (cheaper motors, cameras, and laser scanners) thanks to advancements across industries, and with LLMs now able to convert human commands into executable tasks, the future of robotics looks promising [47:24:00]. Although robots may not be in every home within a year, the speaker optimistically suggests they could be widespread within five to ten years [47:11:00].