From: aidotengineer

Karina, an AI researcher at OpenAI, discusses the scaling paradigms that have shaped AI research over the past few years, particularly focusing on how these paradigms have unlocked new frontiers in product development [00:00:32].

Two Major AI Scaling Paradigms

Over the past few years, two primary scaling paradigms have emerged in AI research [00:01:28]:

  1. Next Token Prediction (Pre-training) [00:01:34]
  2. Scaling Reinforcement Learning on Chain of Thought [00:07:05]

1. Next Token Prediction (Pre-training)

This paradigm is described as a “world-building machine”: the model learns to understand the world by predicting the next token [00:01:48]. This works because some sequences are caused by earlier actions and are irreversible, so predicting them forces the model to learn some “physics of the world” [00:02:00]. Tokens can be anything: strings, words, or pixels [00:02:11].
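
To make the objective concrete, here is a minimal, illustrative sketch of next token prediction as a training loss. The toy model, vocabulary size, and random data below are assumptions for illustration only, not anything described in the talk:

```python
# Minimal sketch of the next-token-prediction objective (illustrative only;
# the tiny model, vocab, and random "text" are toy assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits for the next token at each position

model = TinyLM()
tokens = torch.randint(0, vocab_size, (4, seq_len))  # stand-in for real text
logits = model(tokens[:, :-1])                       # predict from each prefix
loss = F.cross_entropy(                              # maximize log p(next token | prefix)
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

The same loss, applied to web-scale token streams, is what turns next token prediction into the massive multi-task learner described below.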

Next token prediction functions as massive multi-task learning [00:02:34]. While some tasks, like translation, are easy to learn, others are significantly harder [00:02:45]. The model learns problem-solving, generation, logical expressions, and spatial reasoning [00:03:30].

For complex computational tasks such as math, where the model must actually compute numbers while predicting the next token, the difficulty is very high [00:03:43]. This is where Chain of Thought becomes crucial, allowing the model to reason through these computationally intensive tasks [00:03:53]. Creative writing is another difficult case: maintaining plot coherence is hard, and measuring “good” creative writing remains an open research problem [00:04:08].
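
To make the Chain of Thought idea concrete for the math case, here is a hedged prompting sketch using the openai Python SDK; the model name, question, and setup are placeholders of my choosing, not examples from the talk:

```python
# Hypothetical Chain-of-Thought prompting sketch (openai SDK; model name
# is a placeholder, any chat-capable model would do).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = ("A train travels 120 km in 1.5 hours, then 80 km in 0.5 hours. "
            "What is its average speed?")

# Direct answer: the model must do all the arithmetic implicitly.
direct = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": question}],
)

# Chain of Thought: spend tokens on intermediate computation before answering.
cot = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + " Think step by step, showing each calculation, "
                              "then state the final answer on its own line.",
    }],
)
print(cot.choices[0].message.content)
```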

The period of 2020-2021 was an era of scaling pre-training significantly [00:05:27]. An early product from this era was GitHub Copilot, which used next token prediction on billions of code tokens [00:05:38].

Post-Training with Reinforcement Learning

The capabilities of pre-trained models were further refined in the “post-training” era using reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) [00:06:06]. This process made models like GitHub Copilot more useful for tasks such as completing function bodies, understanding docstrings, generating multi-line completions, and predicting and applying diffs [00:06:23]. This era continues to be explored for pushing models to reason through complex codebases [00:06:37].
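
One standard way the RLHF step is implemented is to first train a reward model on human preference pairs; the sketch below shows the common Bradley-Terry-style pairwise loss with toy stand-in embeddings, and is a generic illustration rather than OpenAI's internal method:

```python
# Sketch of a pairwise reward-model loss used in RLHF (Bradley-Terry style).
# The linear reward model and random embeddings are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
reward_model = nn.Linear(d, 1)  # maps a response embedding to a scalar reward

# Embeddings of human-preferred ("chosen") vs. dispreferred ("rejected") responses.
chosen, rejected = torch.randn(8, d), torch.randn(8, d)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Maximize the margin by which chosen responses out-score rejected ones.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
# The trained reward model then supplies the RL signal (e.g., for PPO) that
# fine-tunes the policy; RLAIF swaps the human labels for AI feedback.
```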

2. Scaling Reinforcement Learning on Chain of Thought

This paradigm, which emerged more recently (last year, with OpenAI’s o1 model), focuses on scaling reinforcement learning on Chain of Thought for highly complex reasoning [00:07:01]. Its effectiveness stems from the model learning how to think during training, leveraging strong signals from RL feedback [00:07:24].

To tackle increasingly difficult tasks, such as solving medical problems, models need to dedicate significant time to thinking through the problem [00:08:00]. This paradigm involves creating more complex environments where models can use tools to think through and verify their outputs during the Chain of Thought process [00:08:13].
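
One way to picture such an environment is a generate-verify loop, in which the model calls a tool to check an intermediate result and backtracks on failure. The toy arithmetic checker below is purely illustrative and not a description of OpenAI's training setup:

```python
# Toy generate-verify loop: the "model" proposes answers, a tool checks them,
# and a failed check triggers backtracking. Purely illustrative.
import random

def propose_answer(question: str, attempt: int) -> int:
    # Stand-in for a model's chain-of-thought step; wrong on the first try
    # on purpose, to exercise the backtracking path.
    correct = 17 * 23
    return correct if attempt > 0 else correct + random.choice([-1, 1])

def verify_with_tool(answer: int) -> bool:
    # Tool call: here a trivial calculator; in practice, code execution,
    # search, or other verifiers available during the Chain of Thought.
    return answer == 17 * 23

question = "What is 17 * 23?"
for attempt in range(3):
    answer = propose_answer(question, attempt)
    if verify_with_tool(answer):
        print(f"verified answer: {answer} (attempt {attempt + 1})")
        break
    # Verification failed: backtrack and reason again.
```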

Research challenges related to Chain of Thought include measuring its faithfulness and enabling models to backtrack if they follow a wrong direction [00:08:37].

New Interaction Paradigms

This shift necessitates new interaction paradigms with humans [00:09:02]. To avoid long waiting times (e.g., 15 seconds to 30 minutes for a model response), one approach is to stream the model’s thoughts to the user, requiring clear summaries of these thoughts [00:09:20].
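
A minimal sketch of the streaming pattern, using the openai SDK's streaming mode (the model name and prompt are placeholder assumptions):

```python
# Sketch of streaming a model's output to the user as it is produced, so
# long-running responses don't leave a blank screen. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user",
               "content": "Plan a week-long study schedule for linear algebra."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # surface partial output as it arrives
```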

The Future of Agents and Co-Innovation

The current year (at the time of the talk) is considered the “year of agents” at OpenAI, with a focus on highly complex reasoning: models trained with layered Chain of Thought to act as robust agents, using real-world tools like browsing, search, and computer use over long horizons [00:10:01].

The next stage envisions agents as “co-innovators” [00:10:27]. This builds upon existing reasoning and tool use capabilities, adding creativity enabled by human-AI collaboration [00:10:33]. The goal is to create new affordances for humans to collaborate with AI, co-creating the future [00:10:52].

Product Development and Design Challenges

These scaling paradigms have accelerated product research and development [00:11:05]. A rapid evaluation cycle is possible by:

  • Using strong reasoning models to distill knowledge into smaller, faster-iterating models [00:11:30].
  • Employing complex reasoning models to synthetically generate new data for post-training and reinforcement learning environments [00:11:43]; a minimal sketch of this loop appears below.

This enables the creation of new classes of tasks, such as simulating different users for multiplayer human-AI collaboration [00:12:00].
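
A minimal sketch of that distillation/synthetic-data loop (the model names and output file are hypothetical placeholders, not details from the talk): a strong "teacher" reasoning model labels or generates examples, which then become fine-tuning data for a smaller "student" model.

```python
# Hypothetical distillation loop: a strong teacher model generates synthetic
# chat-format training data for a smaller, faster student model.
import json
from openai import OpenAI

client = OpenAI()
prompts = [
    "Summarize why chain of thought helps on math problems.",
    "Explain next token prediction to a new engineer.",
]

with open("distillation_data.jsonl", "w") as f:  # hypothetical output file
    for prompt in prompts:
        teacher_answer = client.chat.completions.create(
            model="o1-preview",  # placeholder for a strong reasoning model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # One chat-style fine-tuning record per prompt for the student model.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher_answer},
        ]}) + "\n")
# The JSONL can then feed a fine-tuning job for a smaller model; the same
# pattern can generate synthetic users or environments for RL.
```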

Design Challenges and Product Learnings:

  • Familiarity for new capabilities: Bringing unfamiliar capabilities into familiar form factors (e.g., surfacing a 100K-token context window through file uploads rather than infinitely long chats) [00:13:39].
  • Modular Compositions: Designing product features as modular compositions that scale with increasing model capabilities, exemplified by ChatGPT Tasks, which can go beyond reminders to continuous story generation or daily searches [00:15:19].
  • Bridging Real-time and Asynchronous Tasks: Addressing the challenge of models performing long-duration tasks (e.g., 10 hours of research) while maintaining human trust [00:15:42]. This can be solved by giving humans new collaborative affordances to verify, edit, and provide real-time feedback for model self-improvement [00:16:02].
  • Virtual Teammates: Early products like Claude in Slack explored virtual teammates, offering insights into multiplayer collaboration with tools and image uploads [00:16:21].
  • Flexible Interfaces (Canvas): Products like Canvas demonstrate how human collaborative affordances can scale and foster creative capabilities [00:17:14]. Canvas operates as a co-creator and co-editor, capable of fine-grained editing, running searches to generate reports, and letting humans verify outputs [00:17:38]. This interface scales to multiplayer and multi-agent scenarios, where models can act as critics or editors [00:17:57].

Future Applications

  • Personalized Tutors: Models are becoming highly multimodal and flexible, adapting to individual learning styles (e.g., visual vs. auditory learners) [00:18:16].
  • Generative Entertainment: Models can create games and tools on the fly, enabling non-coders to develop and deploy their own applications or even start businesses, moving toward pair programming and code creation [00:18:50]. Canvas, for example, functions as a pair programmer: it can write code, search API documentation, and perform real-time data analysis on uploaded CSVs [00:19:34].
  • AI for Research and Knowledge Creation: Models can assist in research by reproducing papers or open-source GitHub repositories [00:20:12]. Humans and AI can collaborate to form new research hypotheses, verify directions, and delegate tasks to AI assistants [00:21:05].
  • Invisible Software Creation: The future may involve seamless software creation for all, especially on mobile devices [00:21:32].
  • Changing Internet Access: Predictions suggest people will click through far fewer links, with models acting as a cleaner, more personalized lens on information and generating multimodal outputs like interactive 3D visualizations for learning [00:21:56].
  • Dynamic AI Interfaces: The AI interface could be a blank canvas that morphs based on user intent (e.g., becoming an IDE for coding or generating tools for writers) [00:22:42].
  • Co-direction and Superhuman Tasks: Co-innovation will involve co-directing highly capable, agentic reasoning systems on superhuman tasks to create new novels, films, games, and scientific knowledge [00:23:31].