Noam Brown, a research scientist at OpenAI, was a key contributor to the o1 models, which are at the forefront of reasoning in large language models (LLMs) [00:00:26]. His background includes work on Diplomacy and poker at FAIR, Meta's AI research lab [00:00:33]. Brown discusses the current state of AI model capabilities, emphasizing the importance of test-time compute and how it has changed his perspective on AGI timelines [00:00:41].
Evolution of AI Timelines and Compute
In late 2021, Noam Brown expressed skepticism to Ilya Sutskever about AGI timelines, believing it would take over a decade [00:06:26]. His primary reason was the lack of a general method for scaling inference compute [00:06:42]. He observed that while models were becoming smarter through pre-training, they still struggled with basic tasks like Tic-Tac-Toe, leading him to believe that scaling pre-training alone wouldn’t lead to superintelligence [00:07:01].
To Brown’s surprise, Sutskever agreed that scaling pre-training alone wouldn’t achieve superintelligence and was also focused on the test-time compute direction [00:07:50]. The problem Brown thought would take at least a decade to solve was largely addressed in two or three years [00:08:09]. He now believes that while other research questions remain, none will be harder than the problems already solved [00:08:27], and progress will continue rapidly [00:05:59].
The Role of Scaling in AI Model Development
Scaling Pre-training
Scaling pre-training has shown continued improvements in models [00:01:39]. The significant advancements from GPT-2 (which cost on the order of $50,000 to train) to GPT-4 (costing hundreds of millions of dollars for some labs) primarily reflect increased resources [00:01:51]. Brown asserts that throwing more money, resources, and data at pre-training will continue to yield better models [00:02:22].
However, further scaling eventually hits a "soft wall": each successive 10x improvement costs more, reaching billions or tens of billions of dollars [00:02:30]. This is an economic barrier rather than a hard technical limit; at some point it is simply no longer "economically worth it to push that further" [00:02:47].
Test-Time Compute
Brown is particularly excited about test-time compute because it is still in its early stages, akin to the GPT-2 era of pre-training, with significant room for 1,000x scaling and algorithmic improvements [00:03:14].
The ceiling for test-time compute is best thought of in terms of dollar value [00:04:30]. A ChatGPT query currently costs about a penny [00:04:37], yet Brown believes people would pay millions of dollars for solutions to critical societal problems [00:04:55]. Going from roughly $0.01 to $1 million per query is eight orders of magnitude of potential growth, indicating substantial room for further advancement through both increased investment and algorithmic enhancements [00:05:00].
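As a quick sanity check on that arithmetic, here is a minimal sketch; the $0.01 and $1 million figures are the rough numbers from the conversation, not exact prices:

```python
import math

# Rough figures from the conversation, not exact prices.
cost_per_query_today = 0.01       # dollars: roughly a penny per ChatGPT query
value_of_hard_answer = 1_000_000  # dollars: what a critical solution might be worth

# log10 of the ratio gives the orders of magnitude of headroom.
print(math.log10(value_of_hard_answer / cost_per_query_today))  # -> 8.0
```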
OpenAI’s Approach to Test-Time Compute
When Noam Brown joined OpenAI, he found a surprising openness to the idea of scaling test-time compute, despite OpenAI being the pioneer of large-scale pre-training [00:09:34]. Initially, OpenAI’s motivation for this direction was to overcome the “data wall” rather than directly focusing on test-time compute scalability [00:10:07]. However, their agendas proved compatible [00:10:22].
The development of o1 grew out of an exploratory research direction [00:10:33]. A key breakthrough came from letting the model "think for longer," which led to the emergent display of desired behaviors such as trying different strategies, breaking down problems, and self-correcting mistakes [00:13:38]. This qualitative shift, observed around October 2023, was a significant moment, even more so than the immediate performance improvements [00:14:14].
Brown attributes OpenAI's success in this area to organizational excellence: the lab recognized the potential of test-time compute and invested heavily in it [00:12:09]. Its willingness to pursue a risky direction that disrupts its own pioneering paradigm demonstrates that OpenAI is not trapped in the "innovator's dilemma" [00:11:58].
The Bitter Lesson
Noam Brown invokes Richard Sutton's "Bitter Lesson" from reinforcement learning (RL): methods that leverage more compute and data consistently outperform those that try to inject human knowledge or clever tricks [00:26:05]. Applied to LLMs, Brown argues that current scaffolding techniques and prompting tricks, while providing short-term gains, are unlikely to scale well with more data and compute [00:26:46]. He believes that methods which inherently scale test-time compute, like o1, will eventually replace these temporary solutions [00:27:15].
For startups and builders, Brown advises caution against over-investing in custom scaffolding or specialized agentic workflows that might become obsolete as core model capabilities improve out of the box [00:28:02].
O1 Model Capabilities and Future Vision
The o1 model takes images as input [00:16:38]. It is particularly adept at very hard reasoning tasks, making it a tool for "real power users," such as university researchers tackling complex questions that would normally require a PhD [00:15:24]. While o1 is the more intelligent choice for hard problems, GPT-4o offers faster responses for less complex reasoning tasks [00:15:15].
Ultimately, Brown envisions a future with a single, highly capable model that can handle any query, performing deep thinking when required or providing immediate responses for simpler tasks [00:16:18].
He highlights the shift from problem-specific reasoning techniques (like Monte Carlo tree search for Go) to a more general inference-compute capability [00:16:53]. His experience with Diplomacy, where domain-specific techniques hit a ceiling, convinced him that achieving superhuman performance required starting from a general method rather than trying to extend a technique from one domain to many [00:18:16].
Future Directions and Impact
Agentic Models
Brown is excited about models becoming more "agentic" [00:23:51]. Historically, models were too brittle for long-horizon tasks requiring many intermediate steps [00:24:14]. The o1 model serves as a proof of concept that models can independently figure out and tackle the intermediate steps of complex problems without excessive prompting [00:24:40].
Specialized Tools and Cost Optimization
While a single general model is the goal, Brown acknowledges the role of specialized tools [00:20:58]. For instance, rather than performing large-number multiplication internally (an expensive use of reasoning), o1 should ideally call a calculator tool or write a Python script [00:20:29]. Such tools offer speed and cost savings, and sometimes flat-out better performance than a general model [00:21:07].
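A minimal sketch of this tool-routing idea follows; the tool registry and keyword-based routing are hypothetical, for illustration only, and not OpenAI's actual mechanism:

```python
# Illustrative sketch of the tool-use pattern Brown describes: route
# exact arithmetic to cheap, reliable code instead of spending model
# reasoning tokens on it. The registry and routing below are invented
# for this example.

def multiply_tool(a: int, b: int) -> int:
    """Exact big-number multiplication: fast, cheap, and always correct."""
    return a * b

TOOLS = {"multiply": multiply_tool}

def answer(query: str):
    # A real system would let the model decide when to emit a tool call;
    # here a trivial keyword check stands in for that decision.
    if query.startswith("multiply"):
        _, a, b = query.split()
        return TOOLS["multiply"](int(a), int(b))
    raise NotImplementedError("would fall back to the model's own reasoning")

print(answer("multiply 123456789 987654321"))  # -> 121932631112635269
```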
Hardware and Compute Scalability Challenges in AI
The rise of o1 fundamentally shifts the focus for hardware development [00:35:05]. Previously, the mindset was massive pre-training runs followed by cheap inference [00:35:09]. Brown anticipates a major shift toward inference compute, creating opportunities to optimize hardware around this new paradigm [00:35:20].
Role of Academia
Noam Brown notes the challenges PhD students in academia face, as pushing the frontier of AI model training and deployment increasingly relies on access to significant data and compute [00:29:04]. He advises against competing directly with frontier industry labs on capabilities that require massive resources [00:30:07]. Instead, he suggests investigating novel architectures or approaches that show promising scaling trends with more data and compute, even if they don't immediately achieve state-of-the-art performance on current evaluations [00:30:21]. Such research, he says, will attract attention from industry labs willing to invest the resources to explore its potential at large scale [00:31:09].
Underexplored Applications
Brown is particularly excited about the potential of AI models to advance scientific research [00:42:10]. As models surpass human expert capabilities in various domains, they can act as partners to researchers, enabling new discoveries or accelerating existing processes [00:42:41]. Math and coding are specific areas where o1 has shown impressive results [00:43:59].
Another promising area is using models for social science experiments, including economics and game theory [00:36:09]. Models trained on vast human data can imitate human behavior, providing a scalable and cost-effective alternative to human subjects [00:36:17]. This also opens up possibilities for experiments that might be ethically problematic or cost-prohibitive with human participants, such as the ultimatum game with large sums of money [00:38:01].
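To illustrate how cheap such an experiment becomes, here is a minimal sketch of an LLM playing the responder in the ultimatum game. It assumes the OpenAI Python client; the model name and prompt wording are illustrative choices, not details from the conversation:

```python
# Minimal sketch: an LLM plays the responder in the ultimatum game,
# so stakes can be varied far beyond any human-subjects budget.
# Assumes the OpenAI Python client; model and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def responder_decision(pot: int, offer: int) -> str:
    prompt = (
        f"You are playing the ultimatum game. A proposer splits ${pot:,}. "
        f"They offer you ${offer:,} and keep the rest. If you accept, both "
        "get the proposed amounts; if you reject, both get nothing. "
        "Answer with exactly one word: ACCEPT or REJECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Stakes that would be cost-prohibitive with human participants:
for pot, offer in [(10, 1), (1_000_000, 10_000)]:
    print(pot, offer, responder_decision(pot, offer))
```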
Conclusion
Noam Brown encourages skeptics to examine the progress in AI, particularly the advances in scaling test-time compute [00:47:09]. He emphasizes that the current state of AI is "complete science fiction" compared to just five or ten years ago [00:46:51], and that this paradigm significantly addresses concerns about hitting a progress wall [00:47:11]. He predicts that model progress will accelerate in 2025 [00:45:29].