From: redpointai
Noam Brown, a research scientist at OpenAI, was a key contributor to O1, a model at the forefront of reasoning in Large Language Models (LLMs) [00:00:29]. He has a long history of tackling complex problems in games such as Diplomacy and poker [00:00:34]. Brown’s insights shed light on the current scaling challenges in AI and the pivotal role of test-time compute in advancing toward Artificial General Intelligence (AGI) [00:00:41].
The Evolution of AI Scaling
Early AGI Skepticism and the Need for General Inference Compute
In late 2021, Noam Brown expressed skepticism about short AGI timelines, primarily because there was no general method for scaling inference compute (or test-time compute) for language models [00:06:37]. While scaling pre-training produced models capable of impressive feats, they still struggled with basic tasks like drawing a Tic-Tac-Toe board or making optimal moves in simple games [00:07:07]. Brown believed that solving this general scaling problem for inference compute would be an “extremely hard research problem” that could take at least a decade [00:07:44]. To his surprise, that timeline collapsed to just two or three years [00:08:12].
The “Soft Wall” of Pre-training Costs
Scaling AI models through pre-training alone faces an economic “soft wall” [00:03:01]. The cost of training frontier models has escalated from thousands of dollars for GPT-2 to potentially hundreds of millions of dollars today for models like GPT-4 [00:01:52]. While throwing more money, resources, and data into pre-training will continue to yield better models, a 10x improvement could cost billions, then tens of billions of dollars, eventually becoming economically intractable [00:02:22]. This means spending trillions of dollars on a single model is unlikely [00:02:55].
The Rise of Test-Time Compute
Test-time compute is seen as the next frontier for pushing model capabilities [00:03:15]. Similar to the early days of GPT-2 and the scaling laws for pre-training, there’s significant “low-hanging fruit” for algorithmic improvements in test-time compute [00:03:37]. The potential ceiling for test-time compute is vast; while a ChatGPT query might cost a penny today, people might be willing to pay millions of dollars for a query that solves crucial societal problems [00:04:30]. This represents an eight-order-of-magnitude potential increase, indicating ample room for further scaling and algorithmic enhancements [00:05:01].
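The eight-orders-of-magnitude figure follows directly from the two price points mentioned: roughly a penny per query today versus a hypothetical million-dollar query. A quick sanity check, using those two illustrative numbers (the exact prices are assumptions drawn from the discussion, not published figures):

```python
import math

# Illustrative figures from the discussion: a ~$0.01 ChatGPT query today
# versus a query someone might one day pay ~$1,000,000 for.
cost_today = 0.01
cost_ceiling = 1_000_000

# Ratio of the two prices, expressed in orders of magnitude (powers of 10).
orders_of_magnitude = math.log10(cost_ceiling / cost_today)
print(orders_of_magnitude)  # 8.0
```

That headroom is the point: pre-training has perhaps one more order of magnitude of economically viable scaling, while per-query spend could plausibly grow by eight.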
Noam Brown’s Insights and OpenAI’s Approach
Redefining AGI Timelines
Brown’s initial skepticism about AGI timelines shifted dramatically with the breakthroughs in scaling test-time compute [00:34:20]. He now believes progress will accelerate faster than he originally thought [00:34:48]. The median view among OpenAI researchers aligns with Sam Altman’s tweet, suggesting they “basically know what we’ve got to do to build AGI” [00:05:25].
Organizational Vision and Adaptation
OpenAI, despite pioneering large-scale pre-training, was “very on board” with pushing test-time compute [00:09:35]. Although their initial motivation might have been different (e.g., overcoming the data wall), the techniques and agendas proved compatible [00:10:07]. The leadership at OpenAI recognized the potential of this new direction and invested heavily, demonstrating an ability to avoid the “innovator’s dilemma” and disrupt its own established paradigm [00:11:21].
The Emergent Behavior of O1
The first “sign of life” that convinced Brown of faster progress came from letting the model “think for longer” [00:13:39]. This simple change led to emergent behaviors: the model began trying different strategies, breaking down multi-step problems, and recognizing and correcting its own mistakes [00:13:46]. These qualitative changes were more significant than mere performance improvements, indicating a profound shift in model capabilities and the potential for greater scalability [00:14:21].
Implications and Future Directions
Test-Time Compute’s Ceiling and Algorithmic Improvements
O1 trades response speed for deeper reasoning, excelling at “very hard problems” that might typically require a PhD [00:15:24]. While models like GPT-4o offer faster responses for less difficult reasoning tasks, the long-term goal is a single, unified model that can either provide immediate good responses or engage in deep thinking when required [00:16:03]. The key is to shift from extending specific game algorithms to a more general approach that starts with a broad domain (like language) and focuses on scaling test-time compute effectively within that breadth [00:18:20].
The “Bitter Lesson” and its Application to AI Development
Richard Sutton’s “Bitter Lesson” emphasizes that methods that scale well with more compute and data consistently outperform methods that encode human knowledge or rely on complex scaffolding [00:26:05]. Brown applies this to current AI development, suggesting that many “prompting tricks” and “scaffolding techniques” built to push model capabilities are ultimately temporary and will be replaced by models that inherently scale better [00:26:48]. This poses a challenge for startups that might invest heavily in such custom solutions, only to find core model capabilities improving and making their specialized workflows obsolete [00:27:37].
The Future of AI Models: Unified Capabilities and Specialized Tools
The eventual future points to a single, highly capable model that can handle all tasks [00:16:18]. This model would be general and powerful but might also utilize specialized tools (like a calculator for multiplication) to save cost and increase speed for specific, simple tasks [00:20:29]. These tools could also offer flat-out better performance than the general model for their specific function [00:21:19].
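The routing idea above can be sketched in a few lines. This is a hypothetical illustration, not OpenAI’s design: simple, exact operations go to cheap deterministic tools, and anything unmatched would fall through to the general model (represented here only by a placeholder error).

```python
import operator

# Hypothetical tool registry: exact and far cheaper than a model call
# for their narrow functions (the calculator example from the text).
TOOLS = {
    "multiply": operator.mul,
    "add": operator.add,
}

def dispatch(task: str, a: float, b: float) -> float:
    """Route a task to a specialized tool if one exists."""
    tool = TOOLS.get(task)
    if tool is not None:
        return tool(a, b)  # flat-out better performance for this function
    # In a real system, this branch would invoke the general model.
    raise NotImplementedError("route to the general model here")

print(dispatch("multiply", 12, 34))  # 408
```

The design choice mirrors the argument in the text: the general model stays maximally capable, while tools exist purely to save cost and latency on tasks where exact, specialized answers are available.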
Impact on AI Hardware and Infrastructure
The success of O1 and the emphasis on test-time compute significantly alters the perspective on hardware development [00:35:05]. The mindset shifts from massive pre-training runs being the primary cost to a major focus on optimizing inference compute [00:35:11]. This creates a “big opportunity for a lot of creativity on the hardware side” to adapt to this new paradigm [00:35:29].
AI’s Role in Society and Research
Advanced AI models, particularly those with general intelligence and communication abilities (like LLMs), offer new avenues for scientific research [00:36:05]. They can be used for social science experiments (e.g., simulating economic games like the ultimatum game) and neuroscience, offering scalability and ethical advantages over human subjects [00:36:11]. The fact that LLMs already communicate in human language sidesteps the long-standing AI problem of emergent communication between agents [00:40:46]. In robotics, while hardware iteration is slow and expensive, progress is expected in the long term [00:41:31].
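To make the ultimatum-game example concrete, here is a toy version with scripted agents standing in for the LLM players such an experiment would use. The offer fraction and acceptance threshold are assumptions for illustration, not empirical findings:

```python
# Toy ultimatum game: a proposer splits a pot, a responder accepts or
# rejects; rejection leaves both with nothing. In the experiments the
# text describes, LLMs would play these roles in place of human subjects.

def proposer(pot: int) -> int:
    """Offer a share of the pot to the responder (a low-ball 25% here)."""
    return pot // 4

def responder(offer: int, pot: int, threshold: float = 0.3) -> bool:
    """Accept only offers at or above an assumed fairness threshold."""
    return offer >= threshold * pot

def play(pot: int = 100) -> tuple[int, int]:
    """Return the (proposer, responder) payoffs for one round."""
    offer = proposer(pot)
    if responder(offer, pot):
        return pot - offer, offer  # deal: split as proposed
    return 0, 0                    # rejection: both get nothing

print(play())  # (0, 0) — the 25% offer falls below the 30% threshold
```

Swapping the scripted policies for model queries is what makes this scalable: thousands of rounds with varied thresholds and framings can be run at negligible cost compared to recruiting human participants.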
The most exciting application is the advancement of scientific research, where models will increasingly surpass expert humans in various domains, acting as partners to accelerate the frontier of human knowledge [00:42:41]. O1’s initial strengths appear to be in math and coding, which show noticeable and continuous progress [00:43:59].
Academia’s Role in a Compute-Intensive AI Landscape
Academia faces challenges in AI research due to the heavy reliance on data and compute resources [00:29:04]. Brown advises PhD students not to try to compete directly with frontier industry research labs on capabilities by adding “clever prompting” or “tricks” for marginal performance gains [00:30:07]. Instead, he suggests focusing on investigating novel architectures or approaches that demonstrate promising scaling trends with more data and compute, even if they don’t immediately achieve state-of-the-art performance [00:30:21]. Such foundational research will be recognized and invested in by industry labs [00:31:07].