From: mk_thisisit
Current artificial intelligence (AI) systems are limited in their understanding of the physical world [00:00:13]. Despite their ability to manipulate language effectively, they lack crucial features such as permanent memory, reasoning, and planning capabilities [00:05:45].
Current Limitations of AI
AI systems are often perceived as intelligent because of their proficiency in language manipulation [00:00:09]. However, this perception can be misleading, as these systems fundamentally do not comprehend the physical world [00:00:13]. Key features of intelligent behavior, such as reasoning and planning, are currently not reproducible in AI [00:05:53].
Human-like Abilities
While AI systems might develop “emotions” like excitement or joy from predicting successful outcomes [00:06:30], they will not inherently possess human flaws such as anger or jealousy [00:07:13]. The concept of consciousness itself lacks a clear, measurable definition, making it difficult to assess in both biological and artificial entities [00:07:27].
Ineffective Learning Paradigms for the Physical World
There are three primary paradigms in machine learning:
- Supervised Learning – This classic approach trains systems by providing the correct answer for each example, such as the label of an object in an image [00:08:51]. While effective for specific tasks, it demands vast amounts of labeled, high-quality data [00:10:14].
- Reinforcement Learning – Considered closer to human learning, this method provides feedback on whether a result was good or bad [00:09:51]. It excels in game environments (e.g., chess or Go) where systems can play millions of games against themselves [00:10:20]. However, it is “extremely ineffective” in the real world; for example, training a self-driving car purely with reinforcement learning would result in thousands of crashes [00:10:17].
- Self-Supervised Learning – This method, which has driven recent advancements in natural language processing and chatbots, trains systems to capture the structure of the input data itself, for example by predicting missing words in text [00:10:55]; a minimal sketch of this objective follows the list. While highly successful for language models, it falls short when applied to the physical world [00:12:27].
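To make the self-supervised objective concrete, here is a deliberately tiny, counting-based sketch of "fill in the masked word from its context". The toy corpus and the counting model are illustrative assumptions; real language models use neural networks trained on vastly more text, but the key property is the same: the training signal comes from the data itself, with no human-provided labels.

```python
# Toy illustration of the self-supervised objective: mask a word in a
# sentence and predict it from its neighbors. A minimal counting-based
# sketch, not how production language models are built.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# "Training": record which word appears between a given left and right neighbor.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        left, target, right = words[i - 1], words[i], words[i + 1]
        context_counts[(left, right)][target] += 1

def predict_masked(left: str, right: str) -> str:
    """Predict the most likely word between `left` and `right`."""
    candidates = context_counts.get((left, right))
    return candidates.most_common(1)[0][0] if candidates else "<unk>"

# "the [MASK] sat" -> the data itself supplies the answer; no labels needed.
print(predict_masked("the", "sat"))  # likely "cat" or "dog"
```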
Language vs. Physical World Understanding
The physical world is significantly harder for AI to understand than language [00:12:35]. Language is discrete and built from a finite set of symbols, so a model can assign a probability to every possible next word [00:12:47]. In contrast, predicting events in a continuous, high-dimensional space like video is mathematically intractable: the model would have to account for an effectively infinite number of unpredictable details [00:47:47].
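A rough numerical contrast may help. In the sketch below, the vocabulary size and frame resolution are assumed, illustrative figures, not numbers from the interview; the point is only that a next-word predictor faces an enumerable set of outcomes, while even one small video frame spans hundreds of thousands of continuous dimensions.

```python
# Rough contrast between the two prediction problems described above.
import math

# Language: a finite, discrete output space. A model can assign a
# probability to every possible next token and normalize them.
vocab_size = 100_000
print(f"next-word outcomes: {vocab_size:,} (enumerable, probabilities sum to 1)")

# Video: a continuous, high-dimensional output space. Even one small frame
# has hundreds of thousands of real-valued dimensions.
height, width, channels = 256, 256, 3
dims_per_frame = height * width * channels
print(f"dimensions in one {height}x{width} RGB frame: {dims_per_frame:,}")

# Even coarsely quantized to 8 bits per value, the outcome space is
# astronomically large, which is why pixel-level prediction tends to
# collapse to blurry averages instead of committing to one plausible future.
log10_outcomes = dims_per_frame * math.log10(256)
print(f"~10^{log10_outcomes:.0f} possible quantized frames")
```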
Humans and animals develop an intuitive understanding of physics, such as gravity, within months of birth [00:13:53]. Cats, for instance, are adept at planning complex physical actions like climbing and jumping due to their intuitive grasp of physics [00:14:31]. Replicating this “physical intuition” in computers remains a major challenge [00:14:44].
The Moravec Paradox
The Moravec paradox highlights the counter-intuitive difficulty of certain tasks for AI [00:14:50]. As robotics expert Hans Moravec observed, computers can excel at abstract tasks like playing chess or solving mathematical puzzles, but struggle with physical manipulation tasks that are easy for even young children or animals [00:15:01]. This is because the space of discrete objects and symbols is easier for computers to manipulate, while the real world is “too complicated” [00:15:24].
Sensory data, such as sight and touch, carries an "absolutely huge" amount of information compared to language [00:15:59]. This disparity explains why large language models (LLMs) can pass law exams or solve math problems, yet there are still no robots capable of the everyday physical tasks easily performed by cats or dogs, nor fully autonomous cars [00:16:06]. The ability to process and understand complex sensory data is crucial if machines are to learn as effectively as humans and animals [00:16:40].
A typical large language model is trained on approximately 20-30 trillion tokens (words), equivalent to about 10^14 bytes of data [00:17:34]. This amount of data, representing all publicly available text on the internet, would take hundreds of thousands of years for a human to read [00:18:06]. Strikingly, a small child processes a comparable amount of information through their visual system in the first four years of life [00:18:13]. This comparison underscores that AI systems will never reach human-level intelligence by training solely on text data; they must be taught to understand the real world [00:18:45].
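The comparison can be checked with back-of-envelope arithmetic. The token count comes from the figure above; the bytes-per-token, optic-nerve bandwidth, and waking hours below are assumed round numbers, used only to show that both quantities land near the same order of magnitude.

```python
# Back-of-envelope version of the data comparison above.

# LLM side: ~20-30 trillion tokens at a few bytes per token.
tokens = 25e12          # midpoint of the 20-30 trillion range
bytes_per_token = 4     # rough assumption
llm_bytes = tokens * bytes_per_token
print(f"LLM training text: ~{llm_bytes:.1e} bytes")          # ~1e14

# Child side: visual input over the first four years of life.
optic_nerve_bandwidth = 2e6          # bytes/second, rough assumption
waking_hours_per_day = 12            # rough assumption
seconds_awake = 4 * 365 * waking_hours_per_day * 3600
child_bytes = optic_nerve_bandwidth * seconds_awake
print(f"visual input by age four: ~{child_bytes:.1e} bytes")  # ~1e14
```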
Future Development Goals for AI
Researchers are working on designing new types of AI systems, still based on deep learning, that can function in the physical world, possess permanent memory, and are capable of reasoning and planning [00:06:09].
One proposed solution to the challenge of predicting continuous, high-dimensional data in the physical world is the Joint Embedding Predictive Architecture (JEPA) [00:48:37]. JEPA is a macro-architecture where the system learns an abstract representation of the input data and then makes predictions within that abstract space, rather than attempting to predict every unpredictable detail of the original input [00:48:42]. This approach makes the problem more tractable for AI.
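Below is a minimal sketch of that idea, assuming simple MLP encoders and arbitrary dimensions; it is not the published JEPA implementation, and it omits the regularization needed to keep representations from collapsing. It only shows where the prediction and the loss live: in the learned abstract space rather than in the raw input.

```python
# Minimal sketch of the joint-embedding predictive idea: encode both the
# observed input x and the target y, then predict y's *representation* from
# x's representation, instead of reconstructing y detail by detail.
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    def __init__(self, input_dim=1024, embed_dim=128):
        super().__init__()
        self.context_encoder = nn.Sequential(
            nn.Linear(input_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.target_encoder = nn.Sequential(
            nn.Linear(input_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.predictor = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, y):
        sx = self.context_encoder(x)        # abstract representation of the context
        with torch.no_grad():               # target branch treated as fixed in this sketch
            sy = self.target_encoder(y)     # abstract representation of the target
        pred = self.predictor(sx)           # prediction made in the abstract space
        # The loss compares representations, so unpredictable low-level
        # detail in y never has to be reconstructed.
        return nn.functional.mse_loss(pred, sy)

x = torch.randn(8, 1024)   # e.g. the visible part of an observation
y = torch.randn(8, 1024)   # e.g. the part to be predicted
loss = TinyJEPA()(x, y)
loss.backward()
```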
Reasoning and Planning
Current large language models use primitive methods for reasoning, often generating many possible output sequences and then selecting the best one [00:26:51]. This process is computationally expensive and differs from human thought, where internal mental models of the world are used to predict outcomes and plan actions [00:27:14].
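A minimal sketch of this "generate many, pick the best" pattern is shown below. The generator and scorer are placeholder functions standing in for an LLM's sampling and evaluation steps; the takeaway is that the compute cost grows linearly with the number of candidates explored.

```python
# Sketch of best-of-N selection: sample many candidate answers, score each,
# keep the highest-scoring one. Placeholders stand in for the real model.
import random

def generate_candidate(prompt: str) -> str:
    # Stand-in for sampling one full answer from a language model.
    return f"{prompt} -> answer #{random.randint(1, 1000)}"

def score(candidate: str) -> float:
    # Stand-in for a verifier or reward model judging the answer.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate_candidate(prompt) for _ in range(n)]  # n forward passes
    return max(candidates, key=score)   # keep only the best candidate

print(best_of_n("plan a route from A to B"))
```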
For AI to achieve true intelligence, it needs to develop the ability for hierarchical planning, defining intermediate goals to achieve a larger objective [00:29:52]. This is a “great challenge” for the coming years [00:30:44].
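As a toy illustration of what hierarchical planning means, the sketch below expands a high-level goal into intermediate goals, and those into lower-level steps. The specific decomposition is invented for illustration and is not from the interview.

```python
# Toy illustration of hierarchical planning: a high-level objective is split
# into intermediate goals, each refined into lower-level steps.
subgoals = {
    "travel to the airport": ["leave the building", "catch a taxi", "ride to the terminal"],
    "leave the building": ["stand up", "walk to the door", "take the elevator down"],
}

def plan(goal: str, depth: int = 0) -> None:
    """Recursively expand a goal into intermediate goals until none remain."""
    print("  " * depth + goal)
    for sub in subgoals.get(goal, []):
        plan(sub, depth + 1)

plan("travel to the airport")
```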
Robotics and AI Integration
While production robots excel at simple, automated tasks in controlled environments [00:31:36], more complex applications like autonomous driving still lack human-level reliability [00:31:59]. This is not due to physical limitations of robots, but because they are “not smart enough to deal with the real world” [00:34:06]. Companies are banking on rapid AI advancements in the next 3-5 years to make humanoid robots and autonomous vehicles truly viable [00:34:10]. The upcoming decade is predicted to be the “decade of robotics” [00:42:42].
The biggest challenge is the integration of AI, robotics, and sensors for skillful use [00:33:18]. Creating AI systems that understand the physical world, possess permanent memory, and can reason and plan will form the foundation for more adaptive robots [00:33:29].
AI progress has been discontinuous, with periods of rapid advancement followed by stagnation [00:34:55]. However, recent acceleration since 2013 is attributed to increased investment and more talented people entering the field [00:35:13].