From: redpointai
Bob McGrew, former Chief Research Officer at OpenAI, shared insights into the future of AI and the progress of models, drawing from his six and a half years at the forefront of AI research [00:00:00].
Pre-training Progress: Beyond the Wall
While external observers might perceive a slowdown in model capabilities after GPT-4, the internal view at major labs like OpenAI is quite different: the impression of a “wall” is misleading [00:01:00].
The Role of Compute and Algorithmic Improvements
To achieve significant progress in pre-training, an immense increase in compute is required. For instance, moving from GPT-2 to GPT-3, or GPT-3 to GPT-4, necessitated a 100x increase in effective compute [00:01:41]. This increase is achieved through:
- Adding more hardware: Investing in more chips and building larger data centers [00:02:01].
- Algorithmic improvements: While these can yield significant gains (50%, 2x, 3x), they are not sufficient on their own for generational leaps [00:02:08].
Building new data centers is a slow, multi-year process, which explains the perceived gaps between major model releases [00:02:16].
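As a rough illustration of how those two levers combine, here is a back-of-the-envelope sketch; the ~100x generation jump comes from the discussion above, while the specific algorithmic-gain values are assumptions chosen only for illustration.

```python
# Back-of-the-envelope sketch of "effective compute" between model generations.
# The ~100x generation jump is from the discussion above; the algorithmic gains
# tried below (1.5x, 2x, 3x) are illustrative, not measured values.

GENERATION_JUMP = 100  # e.g., GPT-3 -> GPT-4 effective-compute multiplier

def hardware_scaling_needed(target: float, algorithmic_gain: float) -> float:
    """Raw hardware scaling still required after algorithmic improvements
    contribute a multiplicative `algorithmic_gain` (2.0 means a 2x gain)."""
    return target / algorithmic_gain

for gain in (1.5, 2.0, 3.0):
    hw = hardware_scaling_needed(GENERATION_JUMP, gain)
    print(f"{gain:.1f}x algorithmic gain still leaves ~{hw:.0f}x of hardware scaling")
```

Even a 3x algorithmic gain leaves roughly 33x of scaling to come from hardware, which is why new data centers gate each generation.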
O1: A New Generation through Reinforcement Learning
OpenAI’s O1 model represents a significant leap, effectively a 100x compute increase over GPT-4, primarily achieved through advanced reinforcement learning (RL) techniques [00:03:08]. This approach allows the model to create longer, more coherent “Chains of Thought,” packing more compute into its answers [00:04:25]. What’s crucial about this method is that it doesn’t immediately require new data centers, opening significant room for algorithmic improvements [00:05:02]. The same techniques could theoretically extend thought processes from minutes to hours or even days [00:05:31].
Progress in 2025 and Beyond
Future AI progress will look different from the pre-training-driven leaps of the past. As new generations of models are developed, new and unforeseen problems emerge that take time to work through even after new data centers come online [00:04:06]. The focus for 2025 is expected to be on “test-time compute” [00:05:57].
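The interview does not describe how O1 spends its extra inference budget, but a generic way to picture “test-time compute” is best-of-n sampling over reasoning chains, sketched below. This is an illustration of the general idea, not OpenAI’s method; generate_chain and score_answer are hypothetical stand-ins for a model call and a verifier.

```python
# Generic illustration of test-time compute: spend ~n times the inference
# budget by sampling several chains of thought and keeping the best-scoring
# answer. This is NOT a description of O1's internals; `generate_chain` and
# `score_answer` are hypothetical stand-ins for a model call and a verifier.
import random
from typing import Callable, Tuple

def best_of_n(prompt: str,
              generate_chain: Callable[[str], Tuple[str, str]],
              score_answer: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n (chain-of-thought, answer) pairs and return the best answer."""
    candidates = [generate_chain(prompt) for _ in range(n)]
    _, best_answer = max(candidates, key=lambda ca: score_answer(prompt, ca[1]))
    return best_answer

# Toy usage with stubs standing in for real model and verifier calls.
stub_generate = lambda p: ("...reasoning...", str(random.randint(0, 10)))
stub_score = lambda p, a: -abs(int(a) - 7)  # pretend the correct answer is 7
print(best_of_n("toy question", stub_generate, stub_score, n=16))
```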
New Form Factors and the Rise of AI Agents
Current chatbot interactions are well-handled by models like GPT-4 [00:06:22]. However, to unlock the full capabilities of advanced models like O1, new “form factors” are needed [00:06:09]. These models excel at structured, long-duration tasks like programming or writing complex documents, which require sustained, coherent reasoning [00:07:03].
The most exciting development is enabling long-term actions, essentially powerful AI agents [00:08:04]. These agents could book flights, shop, or solve problems by interacting with the real world [00:08:16].
Challenges for AI Agents and Enterprise Adoption
The primary challenge for AI agents is reliability [00:08:57]. If an agent makes mistakes while performing actions (e.g., buying something or sending an email), it can lead to wasted time, embarrassment, or financial loss [00:09:16]. Achieving higher reliability (e.g., from 90% to 99% or 99% to 99.9%) demands another order of magnitude increase in compute, which takes years of work [00:09:50].
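Those reliability targets compound sharply over multi-step tasks. The arithmetic below illustrates that compounding; the step counts and reliability figures are illustrative, not numbers from the interview.

```python
# The chance an agent completes a multi-step task without a single mistake is
# (per-step reliability) ** (number of steps). Figures are illustrative.
for per_step in (0.90, 0.99, 0.999):
    for steps in (10, 50):
        print(f"{per_step:.1%} per step over {steps:>2} steps -> "
              f"{per_step ** steps:.1%} flawless-run probability")
```

At 90% per step, a 50-step task almost never completes cleanly; at 99.9%, it usually does, which is why each extra “nine” of reliability is worth another order of magnitude of compute.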
For enterprise adoption, the key is providing context to the AI, which is currently a hand-holding process [00:11:19]. This context (co-workers, projects, codebases, preferences) is scattered across various enterprise systems (Slack, documents, Figma) [00:11:36]. Solutions include:
- Building libraries of “connectors” to integrate data sources [00:12:02].
- Developing “computer use” models, like Anthropic’s, which can control a mouse and keyboard as a general “hammer” [00:12:39]. However, this increases token count significantly (10x-100x), again emphasizing the need for models with long, coherent Chains of Thought [00:13:34].
The future will likely see a mix of these approaches [00:14:46]. Widespread computer use agents are still a few years away: expect compelling demos to become limited-use-case deployments within a year, and surprisingly effective but not perfectly reliable agents within two [00:16:13]. Adoption hinges on the tolerable level of mistakes [00:16:54].
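To make the connector approach listed above concrete, here is a minimal sketch of what a shared connector interface might look like; every class and method name is a hypothetical illustration, not any vendor’s actual API.

```python
# Minimal sketch of the "library of connectors" idea: each enterprise data
# source implements the same narrow interface so an agent can gather context
# from Slack, documents, Figma, and so on. All names here are hypothetical
# illustrations, not a real product's API.
from abc import ABC, abstractmethod
from typing import List

class Connector(ABC):
    """One integration per data source, returning text snippets for context."""
    @abstractmethod
    def search(self, query: str, limit: int = 5) -> List[str]:
        ...

class SlackConnector(Connector):
    def search(self, query: str, limit: int = 5) -> List[str]:
        # A real implementation would call the messaging platform's search API.
        return [f"[slack] message mentioning '{query}'"][:limit]

class DocsConnector(Connector):
    def search(self, query: str, limit: int = 5) -> List[str]:
        # A real implementation would query a document store.
        return [f"[docs] section relevant to '{query}'"][:limit]

def gather_context(connectors: List[Connector], task: str) -> str:
    """Concatenate snippets from every connector into one context block that
    is prepended to the agent's prompt."""
    return "\n".join(s for c in connectors for s in c.search(task))

print(gather_context([SlackConnector(), DocsConnector()], "Q3 launch plan"))
```

The contrast with computer use models shows up directly in the prompt: connectors return a handful of structured snippets, while screen-driving approaches push raw UI state into context, which is where the 10x-100x token blow-up comes from.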
Multimodal AI and Video Models
Multimodal AI, incorporating vision and audio alongside text, is another exciting area [00:18:14]. While models like DALL-E (image generation) and Whisper (audio) started as separate entities, they are gradually being integrated into core models [00:18:46].
Video has been the most challenging modality to integrate [00:18:54]. OpenAI’s Sora is a pioneering example [00:19:00]. Two key differences for video models are:
- Extended sequences: Video is not a single prompt but an unfolding story, requiring new user interfaces for creation [00:19:46].
- High cost: Training and running video models are very expensive [00:20:18].
The progress of video models is expected to mirror that of LLMs: quality will improve, especially for extended, coherent generations (e.g., from seconds to hours of video), and costs will drop dramatically, making high-quality, realistic video practically free [00:21:23]. A full-length, AI-generated movie that audiences genuinely want to watch is predicted within two years, driven by human creative vision leveraging AI as the tool [00:23:08].
Robotics: A Resurgent Field
Robotics, a personal passion for McGrew, is expected to see widespread, though somewhat limited, adoption within five years [00:24:50]. Foundation models are a breakthrough due to their ability to:
- Translate vision into plans of action [00:25:32].
- Enable natural language interaction with robots, simplifying control (e.g., talking to a robot instead of typing commands) [00:26:01].
A major unresolved question in robotics is whether to learn primarily in simulation or the real world [00:26:29]. While simulators are useful for rigid bodies, they struggle with “floppy” materials (cloth, cardboard), making real-world demonstrations necessary for general-purpose robots [00:27:11].
Mass consumer adoption of home robots is still distant due to safety concerns (robot arms can be dangerous) and the unconstrained nature of home environments [00:28:10]. However, widespread deployment in constrained work environments like warehouses and retail is expected within five years [00:28:36].
Specialization vs. General Models
Frontier labs will continue to develop large, general-purpose models that perform well across various modalities and tasks [00:29:41]. However, specialization offers significant price-performance advantages [00:29:55]. A common strategy for companies is to:
- Define the desired AI task.
- Run it against a best-in-class frontier model to generate a large dataset.
- Fine-tune a much smaller, cheaper model on this dataset for specific use cases (e.g., customer service chatbots) [00:30:21].
While these specialized models may not be as robust as frontier models when going “off-script,” the cost savings justify the trade-off [00:31:02].
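The specialization recipe above is essentially distillation. The sketch below shows the dataset-generation step; the JSONL chat format is a common convention for hosted fine-tuning endpoints, and the stubbed frontier-model call is a placeholder for a real API, so treat the specifics as assumptions.

```python
# Sketch of the specialization recipe above: use a frontier model to label a
# task-specific dataset, then fine-tune a smaller, cheaper model on it.
# `call_frontier_model` is a stub standing in for a real chat-completion call;
# the JSONL chat format shown is a common convention, not a specific vendor's API.
import json
from typing import Callable, List

def build_training_set(task_prompts: List[str],
                       call_frontier_model: Callable[[str], str],
                       path: str = "distill_train.jsonl") -> str:
    """Write (prompt, frontier-model answer) pairs as JSONL chat examples."""
    with open(path, "w") as f:
        for prompt in task_prompts:
            answer = call_frontier_model(prompt)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
    return path

# Toy usage with a stubbed "frontier model"; in practice this would be an API
# call, and the resulting file would be submitted to a fine-tuning job for a
# much smaller model (e.g., a customer-service chatbot).
prompts = ["How do I reset my password?", "Where is my order?"]
path = build_training_set(prompts, lambda p: f"(frontier answer to: {p})")
print(f"Wrote distillation data to {path}")
```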
The Slow Burn of AI Productivity
Despite rapid advancements in AI capabilities, the impact on overall productivity statistics (e.g., GDP growth) has been surprisingly slow [00:31:42]. This phenomenon, reminiscent of the internet in the 1990s, is attributed to several factors:
- Underestimation of job complexity: What humans do in a “job” is composed of many tasks, and often a core task remains difficult to automate [00:33:05].
- Boilerplate vs. Core Reasoning: AI initially automates the “boilerplate” parts of a job (e.g., code generation), leaving the more challenging “giving direction” or “figuring out what to do” aspects to humans [00:33:37].
McGrew is particularly excited about applying AI to “boring” problems in areas like procurement, where infinite patience is more valuable than infinite intelligence [00:34:11]. AI can save significant money by meticulously comparison shopping, a task human experts find tedious [00:34:47]. Studies showing productivity gains from AI, particularly among lower-performing individuals, are seen as hopeful; AI helps those who know what to do but struggle with how to do it [00:35:55].
OpenAI’s Culture and Pivots
OpenAI’s culture is characterized by its numerous “refoundings” or pivots, which were often controversial but necessary [00:53:53]:
- Nonprofit to For-profit: Transitioning from a research-focused nonprofit to a for-profit entity to secure funding [00:41:01].
- Microsoft Partnership: Partnering with Microsoft, initially controversial, provided crucial compute resources [00:41:41].
- Building Products: The decision to build their own products (like the API) alongside research, demonstrating model value [00:42:04].
- ChatGPT Release: The famously “accidental” release of ChatGPT, which the team initially doubted met the bar for daily use but which exploded in popularity [00:44:05]. At first, the team worried it would be a passing fad, like DALL-E 2 [00:45:25].
- Focus on Language Modeling: The decision to double down on language modeling (including multimodal work) was a painful but critical choice, leading to the shutdown of exploratory projects like robotics and games teams [00:59:34]. This came from the conviction, learned from projects like Dota 2, that problems could be solved by increasing scale [01:00:37].
These shifts, occurring every 18-24 months, fundamentally changed the company’s purpose and the identity of its people, moving from paper writing to building a single model for global use [00:42:20].
The Future of Artificial General Intelligence (AGI)
McGrew expresses a deep critique of the concept of AGI as a single “moment,” believing that progress will be continuous and “fractal” [00:47:11]. He anticipates an AGI future that feels “banal” – where self-driving cars take people to offices where they boss around AI armies, and it still feels like “office space” [00:47:25].
He contends that solving reasoning was the last fundamental challenge needed to scale to human-level intelligence [00:47:59]. Now, the remaining challenge is scaling, which encompasses systems, hardware, optimization, and data problems [00:48:17]. In this sense, the path to AGI is “predestined,” though the scaling work itself is immensely difficult [00:48:50].
Societal Impact and the Scarcity of Agency
McGrew believes that society is moving from a world where intelligence is a critical scarcity to one where it will be ubiquitous and free [00:49:16]. The new scarce factor of production will likely be agency: the ability to ask the right questions and pursue the right projects [00:49:34]. AI will struggle to solve these core human problems [00:49:47]. This shift will feel continuous, like an exponential curve [00:50:11]. While AI can create a video from a vague prompt, a human’s agency is needed to craft the desired video through detailed choices [00:50:58].
AI Research and Future Outlook
McGrew emphasizes the importance of grit in top AI researchers [00:39:00]. He recounts Aditya Ramesh’s 18 months of perseverance in getting DALL-E to generate a “pink panda skating on ice,” proving that neural networks could be creative, even though the initial results were just a blur [00:38:12].
He views engineers and researchers as “artists” who must be allowed freedom to pursue their vision, especially in research, where stifling artistry can prevent foundational breakthroughs [00:39:05].
In terms of future research, McGrew finds programming a consistently useful metric for evaluating models because it scales from simple line completion to entire website creation, and it’s far from a solved problem [00:53:56]. He also sees immense potential for AI in social sciences, particularly in simulating human interaction for product management and AB testing [00:55:25].
McGrew left OpenAI after accomplishing his core research program goals (pre-training, multimodal, reasoning) and felt it was time to pass the torch to the next generation [01:03:51]. He plans to explore new ideas, learn, and meet people, much like his two-year period between Palantir and OpenAI, without rushing into his next venture [01:04:36].
He concludes that progress in AI will continue to be exciting and will not slow down, but it will change [01:06:55].
Overhyped vs. Underhyped
- Overhyped: New architectures, which tend to fall apart at scale [01:03:08]; their limitations, like those of current models, are often overlooked.
- Underhyped: O1 [01:03:24].