From: aidotengineer

Next token prediction is one of two key scaling paradigms to have emerged in AI research, with its scaling era taking off around 2020–2021 [00:01:28] [00:05:22]. This paradigm is also commonly referred to as “pre-training” [00:01:41].

How it Works

At its core, next token prediction operates as a “world-building machine” [00:01:48]. The model learns to comprehend the world by predicting the subsequent word or token in a sequence [00:01:51]. This learning rests on the fact that events unfold causally: initial actions produce consequences that cannot be undone, so the order of tokens carries real information about the world [00:02:02]. By predicting what comes next, the model inherently learns the physics of the world [00:02:09].

The tokens used for pre-training can be diverse, including strings, words, or pixels [00:02:14]. To accurately predict the next token, the model must develop an understanding of how the world functions [00:02:24].
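The objective described above can be illustrated with a toy sketch: a bigram “language model” trained by counting, which predicts the next token given the previous one. The corpus, token choices, and function names below are invented for illustration and are far simpler than real pre-training, but the objective is the same: estimate the distribution over the next token and pick a likely continuation.

```python
# Toy sketch of the next-token-prediction objective: a bigram model
# "trained" by counting which token follows which. Corpus is invented.
from collections import Counter, defaultdict

corpus = "the capital of france is paris . the capital of italy is rome .".split()

# "Training": count how often each token follows each context token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Return the estimated distribution P(next | prev) as a dict."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def predict(prev):
    """Greedy decoding: return the most probable next token."""
    return counts[prev].most_common(1)[0][0]

print(predict("capital"))             # "of" follows "capital" every time
print(next_token_distribution("is"))  # {"paris": 0.5, "rome": 0.5}
```

Even this counting model absorbs a “fact” from its data (what follows “capital of france is”); scaling the same objective to vast corpora and neural networks is what drives the learning described above.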

Multitask Learning

Next token prediction can be viewed as a massive multitask learning process [00:02:34]. During pre-training, some tasks are relatively easy for the model to learn, such as:

  • Translation: For example, learning that the English word “boarding” translates to “embarquement” in French [00:02:45].
  • Factual Knowledge: Understanding general world facts like “the capital of France is Paris” [00:02:56]. This information is often readily available and prevalent on the internet, making it easier for the model to absorb [00:03:01].

The Importance of Compute

Scaling compute during the pre-training stage is crucial because it allows the model to learn a class of tasks that are inherently more challenging [00:03:11]. These difficult tasks include:

  • Physics and Problem Solving: The model learns about physics, problem-solving, generation, and logical expressions [00:03:27].
  • Spatial Reasoning: While not yet perfect, models also learn some spatial reasoning [00:03:35].
  • Math: Tasks requiring computation, such as complex math problems, demand significant compute for the model to derive the correct next token prediction [00:03:43]. This often necessitates techniques like Chain of Thought to aid in reasoning [00:03:53].
  • Creative Writing: This is exceptionally difficult for models. While they can imitate a writing style, creative writing involves world-building, storytelling, and maintaining plot coherence [00:04:08]. A single prediction error can easily derail the plot [00:04:26]. Measuring “good” creative writing is also an open research problem, and enabling models to invent new forms of writing or create coherent novels over long periods remains a significant challenge [00:04:46].
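The Chain of Thought idea mentioned for math can be sketched in miniature: instead of producing the answer in a single step, computation is spread across intermediate “thought” tokens, each of which is an easier prediction. The decomposition and function name below are a toy illustration, not a real model.

```python
# Toy illustration of Chain of Thought: decompose a multiplication into
# intermediate steps, so each emitted "step" requires only a small
# amount of computation, rather than producing the answer in one shot.
def chain_of_thought_multiply(a, b):
    """Return the intermediate reasoning steps and the final answer."""
    tens, ones = b // 10 * 10, b % 10
    p1, p2 = a * tens, a * ones
    steps = [
        f"{a} * {b} = {a} * {tens} + {a} * {ones}",  # split the problem
        f"{a} * {tens} = {p1}",                      # easy partial product
        f"{a} * {ones} = {p2}",                      # easy partial product
        f"{p1} + {p2} = {p1 + p2}",                  # combine
    ]
    return steps, p1 + p2

steps, answer = chain_of_thought_multiply(47, 23)
for s in steps:
    print(s)
print(answer)  # 1081
```

The analogy to language models: each line of the chain is a sequence of tokens the model predicts, and conditioning on its own earlier steps lets it spend more compute per problem than a single next-token prediction allows.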

Transition to Post-Training and Agents

The era of 2020–2021 saw significant scaling of pre-training at organizations like Anthropic and OpenAI [00:05:22]. Products like GitHub Copilot, which leveraged the model’s pre-trained knowledge of code for autocomplete features, emerged from this paradigm [00:05:38]. Researchers then applied techniques like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) in a “post-training” stage to refine the model’s usefulness [00:06:06] [00:06:16]. This post-training taught models to complete functions, understand docstrings, generate multi-line completions, and apply diffs [00:06:23].
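One core ingredient of RLHF can be sketched concretely: training a reward model on human preference pairs so that the preferred completion scores higher. The pairwise loss below (a Bradley–Terry-style objective) is a standard formulation, but the scores here are placeholder numbers rather than outputs of a real reward model, and this omits the later policy-optimization step of full RLHF.

```python
# Minimal sketch of the pairwise reward-model loss used in RLHF:
# -log sigmoid(r_chosen - r_rejected). The loss is low when the reward
# model scores the human-preferred completion well above the rejected one.
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss on a (chosen, rejected) pair."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen completion yields a lower loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

In practice the scores come from a neural reward model, and minimizing this loss over many human-labeled pairs is what teaches it which completions people prefer.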

Next token prediction forms the foundation upon which more advanced AI models and agents are built, particularly those capable of highly complex reasoning through techniques like Chain of Thought and the use of real-world tools [00:10:03].