From: redpointai

The path to AGI and superintelligence has seen significant advancements, particularly in the realm of scaling test-time compute. This section explores the evolving perspectives and breakthroughs that are shaping the journey towards advanced AI.

Evolving Timelines for AGI

In late 2021, Noam Brown expressed skepticism about reaching AGI within a decade, chiefly because no general method for scaling inference compute (test-time compute) had yet been figured out [00:00:05]. This was considered an “extremely hard research problem” [00:00:13]. Yet the problem, which he anticipated would take at least a decade, was largely solved in just two or three years [00:00:16].

Despite other unsolved research questions, Brown believes none will be harder than those already overcome [00:00:20]. This optimism aligns with Sam Altman’s sentiment that “we basically know what we’ve got to do to build AGI” [00:05:20], a view shared by many OpenAI researchers [00:05:25]. Brown’s optimism increased significantly once he saw that test-time compute could be scaled in a general way [00:34:12].

The Role of Scaling in AGI Development

Pre-training vs. Test-Time Compute

While scaling pre-training has yielded significant improvements, as seen from GPT-2 to GPT-4, it faces an economic “soft wall” where costs become intractable (e.g., trillions of dollars per training run) [00:01:40].

Noam Brown and Ilya Sutskever agreed that scaling pre-training alone would not lead to superintelligence [00:07:01]. Early models, even after extensive pre-training, struggled with tasks like Tic-Tac-Toe, making suboptimal or even illegal moves [00:07:07]. This highlighted the need for reasoning capabilities beyond what scaling pre-training alone can deliver [00:07:03].

The Promise of Test-Time Compute

Test-time compute (inference compute) is viewed as having significant room for growth, similar to where pre-training was in the GPT-2 era [00:03:15]. There is substantial “low-hanging fruit” for algorithmic improvements in this area [00:03:42].

The potential ceiling for test-time compute is high: one could imagine spending millions of dollars on a single query for critical societal problems, an eight-order-of-magnitude increase over the roughly one-cent cost of a ChatGPT query today [00:04:30].
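The arithmetic behind that ceiling is straightforward; a quick sanity check (the one-cent figure is the rough estimate cited in the conversation):

```python
# An eight-order-of-magnitude increase over a ~$0.01 ChatGPT query
# lands at $1,000,000 per query, i.e. "millions of dollars" territory.
cost_per_query_usd = 0.01        # rough cost of a ChatGPT query today
orders_of_magnitude = 8
scaled_cost = cost_per_query_usd * 10 ** orders_of_magnitude
print(f"${scaled_cost:,.0f} per query")  # → $1,000,000 per query
```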

Emergent Behavior and Reasoning

A key breakthrough for test-time compute models such as o1 was the realization that allowing the model to “think for longer” produced emergent, desired behaviors [00:13:39]: trying different strategies, breaking problems down into smaller steps, and recognizing and correcting mistakes [00:12:48]. This qualitative change, rather than a merely quantitative performance improvement, gave strong conviction about the direction of the research [00:14:10].
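The mechanics of o1’s training are not public, but the simplest way to see why more thinking time can help is best-of-n sampling against a verifier; a toy sketch where `sample_candidate` and `score` are hypothetical stand-ins for a stochastic model and a grader:

```python
import random

TRUE_ANSWER = 13 * 17  # toy task: the model should arrive at 221

def sample_candidate(rng: random.Random) -> int:
    """Hypothetical stand-in for one stochastic reasoning attempt."""
    return TRUE_ANSWER + rng.choice([-2, -1, 0, 1, 2])

def score(answer: int) -> int:
    """Hypothetical verifier: closer to correct scores higher."""
    return -abs(answer - TRUE_ANSWER)

def best_of_n(n: int, seed: int = 0) -> int:
    """More samples (more test-time compute) can only improve the best pick."""
    rng = random.Random(seed)
    return max((sample_candidate(rng) for _ in range(n)), key=score)

# With a fixed seed, the n=64 pool contains the n=1 sample, so spending
# more inference compute never does worse here:
print(best_of_n(1), best_of_n(64))
```

This is only one primitive; the point is that quality becomes a function of inference compute rather than of model weights alone.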

Challenges and Shifts in AI Development

The “Bitter Lesson” and Scaffolding

Richard Sutton’s “Bitter Lesson” holds that methods which leverage computation and data ultimately outperform approaches that try to encode human knowledge and tricks into models [00:25:57]. In the long run, techniques that scale well with more data and compute, like o1, are expected to prevail over “scaffolding” and “prompting tricks” [00:26:39].

Startups are cautioned against investing heavily in scaffolding and customization around current model limitations, as rapidly improving base models may soon negate these efforts [00:27:37].

Generalization vs. Specificity in AI Development

Noam Brown’s work on games like poker and Diplomacy, which involved scaling test-time compute, shaped his approach to general AI development [00:00:33]. He initially aimed to extend algorithms from specific domains (like poker) to more and more games [00:17:38]. However, the difficulty of making poker and Go techniques work on the far more general game of Diplomacy, where his systems reached human-level but never superhuman performance, changed his mindset [00:18:16].

This led to the “jump to the endpoint” approach: start with a generally capable model (a language model) and figure out how to scale test-time compute for “everything” [00:18:20]. This shift in mindset, from extending narrow techniques to generalizing broadly, was crucial [00:18:56].

The Future of AI Models and Applications

Model Convergence and Tool Use

The long-term vision is a single, general model that can handle all queries [00:15:07]. Such a model would think deeply when required but respond immediately for simpler tasks [00:16:25].
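How such a model would allocate compute is not specified in the conversation; a minimal sketch of the idea, where `needs_deep_thought` is a hypothetical keyword heuristic standing in for what would really be a learned router:

```python
def needs_deep_thought(query: str) -> bool:
    """Hypothetical difficulty heuristic; a real system would learn this."""
    hard_markers = ("prove", "derive", "optimize", "design")
    return any(marker in query.lower() for marker in hard_markers)

def respond(query: str) -> str:
    """Route hard queries to extended reasoning, easy ones to a fast path."""
    if needs_deep_thought(query):
        return f"[thinks at length before answering: {query}]"
    return f"[answers immediately: {query}]"

print(respond("What is the capital of France?"))
print(respond("Prove that the sum of two even numbers is even."))
```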

Highly specialized, fast, and cheap tools (like a calculator for multiplication) will likely exist alongside general models [00:20:29]. General models like o1 could call these tools to save cost, potentially achieving better results than if they performed the specialized task themselves [00:20:55]. This parallels how humans use specialized tools [00:21:26].
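A minimal sketch of that division of labor, with a hypothetical `answer` router and a restricted arithmetic evaluator as the specialized tool (none of this reflects o1’s actual tool interface):

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    """Specialized tool: exact, cheap arithmetic via a restricted AST walk."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def answer(query: str) -> str:
    """Hypothetical router: hand pure arithmetic to the tool, not the model."""
    if query and set(query) <= set("0123456789+-*/(). "):
        return str(calculator(query))
    return f"[general model reasons about: {query}]"

print(answer("37 * 491"))  # → 18167
```

The general model never needs to "do the multiplication in its head"; it delegates, exactly as a person reaches for a calculator.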

Agentic AI and Research

The development of o1 provides a proof of concept for agentic behavior in AI [00:24:40]. It can autonomously identify and tackle the intermediate steps of a complex problem, a capability that previously required extensive prompting [00:24:43]. This breakthrough means that “a lot of the scaffolding techniques” currently used for agents might “go away” as models become more capable [00:25:25].

A significant area of exploration is using these models to advance scientific research, acting as partners that let researchers achieve things previously impossible, or achieve them much faster [00:42:10]. While o1 shows strong performance in math and coding [00:43:59], its broader impact on fields like chemistry, biology, or theoretical mathematics remains to be seen [00:43:25].

Social Science and Human Simulation

AI models, particularly those trained on vast amounts of human data, can be used for social science experiments and neuroscience research [00:36:05]. They offer a scalable, cheaper alternative to human subjects and can even enable experiments that would be unethical or cost-prohibitive with humans (e.g., the ultimatum game played for large sums of money) [00:36:23].
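As a concrete illustration of the kind of experiment meant here, a toy ultimatum-game round with rule-based stand-ins for model “subjects” (`proposer` and `responder` are hypothetical policies, not actual model outputs):

```python
def proposer(pot: float) -> float:
    """Hypothetical subject policy: offer 40% of the pot."""
    return 0.4 * pot

def responder(pot: float, offer: float) -> bool:
    """Hypothetical subject policy: reject offers below 25% of the pot."""
    return offer >= 0.25 * pot

def ultimatum_round(pot: float) -> tuple[float, float]:
    """One round: if the offer is rejected, both players get nothing."""
    offer = proposer(pot)
    if responder(pot, offer):
        return pot - offer, offer
    return 0.0, 0.0

# Stakes that would be cost-prohibitive with human subjects:
print(ultimatum_round(1_000_000))  # → (600000.0, 400000.0)
```

With LLM subjects in place of these fixed rules, the same harness could sweep pot sizes no ethics board would approve paying out to humans.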
The ability of LLMs to communicate in human language effectively solves the problem of AI-to-AI communication [00:40:43].

Broader Implications

Hardware and Inference Compute

The emergence of models like o1 shifts the focus of hardware development from pre-training optimization to inference-compute optimization [00:35:05]. This paradigm change creates new opportunities for creativity on the hardware side [00:35:29].

Academia’s Role in AI Research

Academia faces challenges in pushing the frontier of AI research because of the field’s heavy reliance on data and compute resources [00:28:52]. Students are incentivized to pursue “clever prompting” and “little tricks” for marginal performance gains that get papers accepted, which may not lead to impactful long-term research [00:29:14].

Academia is instead encouraged to investigate novel architectures or approaches that show promising scaling trends, even if they do not immediately achieve state-of-the-art performance [00:30:07]. Industry labs actively monitor such research [00:30:44].

Redefining AGI

Noam Brown prefers to move away from the term AGI [00:45:35]. Instead, he focuses on AI that can “accelerate human productivity and make our lives easier” [00:45:59]. He acknowledges that AI will likely not match human capabilities in physical tasks for a long time, ideally preserving a human edge [00:45:43].

The progress in AI, particularly under the test-time compute paradigm, has been “astounding” and has addressed many past concerns about hitting a “wall” [00:46:47].