From: redpointai
Percy Liang, a leading AI researcher and co-founder of Together AI, provides insights into OpenAI’s new o1 model and its broader implications for AI development and research. The model signals a significant shift in the field, moving towards more ambitious, long-term AI tasks and agent-based systems [00:01:07].
Initial Reactions and Paradigm Shift
From a product perspective, Liang initially found o1 slow and difficult to use for many of the tasks he wanted it for [00:00:57]. He even switched back to GPT-4 for certain applications, like writing a React app, because he prioritized immediate, faster feedback over o1’s capabilities; this highlights that evaluation is multi-dimensional, spanning cost, speed, accuracy, and customizability [00:09:05].
However, from a research standpoint, o1 represents a shift towards “test-time compute,” where models work on tasks over extended periods (days, weeks, or even months) rather than providing instant prompt-response interactions [00:01:15]. This direction could enable AI to tackle more ambitious projects, such as inventing new research directions or developing new drugs [00:02:07].
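To make “test-time compute” concrete, here is a minimal sketch of one common instantiation, best-of-n sampling against a verifier, in which extra inference compute buys more candidate solutions. Note that `call_model` and `score` are hypothetical stand-ins for a model API and a task-specific checker, not o1’s actual mechanism.

```python
import random

# Hypothetical stand-ins: a real system would call an actual model API and
# a task-specific verifier (unit tests, a proof checker, a reward model).
def call_model(prompt: str, temperature: float = 0.8) -> str:
    """Pretend to sample one candidate solution from a language model."""
    return f"candidate (t={temperature}, seed={random.random():.3f})"

def score(candidate: str) -> float:
    """Pretend verifier score for a candidate solution."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Spend more inference-time compute by sampling n candidates
    and keeping the one the verifier scores highest."""
    candidates = [call_model(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Prove that the sum of two even numbers is even."))
```

Raising `n` trades latency and cost for solution quality, which is exactly the dimension o1-style models move along.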
This also marks a return to the concept of AI agents, reminiscent of the earlier focus on reinforcement learning [00:02:39]. In this new paradigm, language model generations are interpreted as actions in a broader space, allowing agents to gain experience and learn from feedback over time in complex tasks [00:02:54].
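As a sketch of that framing, the toy loop below treats each model generation as an action, feeds environment observations back into the agent’s history, and stops when the episode ends; `ToyEnv` and `call_model` are invented for illustration and do not correspond to any particular agent framework.

```python
class ToyEnv:
    """Trivial environment: the episode ends when the agent emits 'DONE'."""
    def __init__(self) -> None:
        self.steps = 0

    def step(self, action: str) -> tuple[str, bool]:
        self.steps += 1
        done = action.strip() == "DONE" or self.steps >= 5
        return f"step {self.steps}: received {action!r}", done

def call_model(history: list[str]) -> str:
    """Stand-in for a language model choosing the next action from the
    interaction history; a real agent would prompt an LM here."""
    return "DONE" if len(history) >= 3 else "explore"

def run_agent() -> list[str]:
    env, history, done = ToyEnv(), [], False
    while not done:
        action = call_model(history)          # generation interpreted as an action
        observation, done = env.step(action)  # feedback from the environment
        history.append(observation)           # experience accumulated over time
    return history

print(run_agent())
```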
Performance and Evaluation Challenges
o1 demonstrates incredible ability in specific domains like math and coding, where reasoning chains are crucial and better supervision can be applied [00:03:26], [00:09:48]. Liang and his team also tested o1 on Cybench, a “Capture the Flag” cybersecurity benchmark [00:04:43]. While o1 improved subtask performance, it did not show a large overall gain because it ignored the benchmark’s existing agent templates and frameworks, acting like a “normal language model” rather than integrating with the established structure [00:04:47]. This highlights the importance of compatibility when integrating new AI models into larger systems [00:05:49].
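The subtask-versus-overall gap is easy to see in a small scoring sketch. The task schema below is hypothetical (Cybench’s real format differs); it only shows how a model can lift the rate of solved guided subtasks while the all-or-nothing overall solve rate barely moves.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    solved: bool          # did the agent capture the final flag?
    subtasks_solved: int  # guided intermediate steps completed
    subtasks_total: int

# Invented example results for three CTF-style tasks.
tasks = [
    Task("crypto-easy", solved=True, subtasks_solved=3, subtasks_total=3),
    Task("pwn-medium", solved=False, subtasks_solved=2, subtasks_total=4),
    Task("web-hard", solved=False, subtasks_solved=1, subtasks_total=5),
]

overall = sum(t.solved for t in tasks) / len(tasks)
subtask = sum(t.subtasks_solved for t in tasks) / sum(t.subtasks_total for t in tasks)
print(f"overall solve rate: {overall:.0%}, subtask solve rate: {subtask:.0%}")
```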
The landscape of AI evaluation is complex: the training data of proprietary models is unknown, which makes benchmark results hard to trust [00:30:26]. As models become more capable, new benchmarks are constantly needed [00:31:10]. Liang suggests that AI models themselves can be used to generate benchmarks, as in his team’s “AutoBencher” paper, to create more challenging and diverse inputs [00:32:13]. He advocates for rubrics in evaluation to provide concrete, anchored judgments, similar to grading exams [00:33:35].
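As a rough illustration of rubric-anchored grading (not AutoBencher itself), the sketch below scores a response criterion by criterion and aggregates the points; the rubric items and the keyword-matching `judge` are invented stand-ins for a human grader or LM judge.

```python
# Points awarded per rubric criterion; the criteria are illustrative.
RUBRIC = {
    "states the final answer explicitly": 2,
    "justifies each step": 2,
    "cites relevant definitions": 1,
}

def judge(response: str, criterion: str) -> bool:
    """Toy judgment via keyword matching; in practice a human grader
    or an LM judge would decide whether the criterion is met."""
    return criterion.split()[0] in response.lower()

def grade(response: str) -> float:
    """Aggregate anchored per-criterion judgments into one score."""
    earned = sum(pts for crit, pts in RUBRIC.items() if judge(response, crit))
    return earned / sum(RUBRIC.values())

print(grade("The proof justifies each step and cites the key definitions."))  # 0.6
```

Because each criterion is judged separately against a fixed anchor, scores are more reproducible than a single unanchored 1-to-10 rating.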
Implications for AI Development and Research
Shift in Scaffolding and Debugging
Because o1 internalizes its reasoning, the prompt-chaining “scaffolding” commonly used with models like GPT-4 may become dispensable [00:06:31]. While this can lead to more powerful models, it also makes debugging difficult because the model’s internal trace is not exposed to the user [00:07:02]. This lack of transparency, while potentially a competitive advantage for proprietary models, limits a developer’s ability to customize and debug applications, especially for novel use cases [00:07:49].
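For contrast, here is a minimal sketch of the prompt-chaining scaffolding being displaced: every intermediate output is visible, loggable, and debuggable, which is precisely what disappears when the reasoning trace is internalized. `call_model` is a hypothetical stand-in for an LM API call.

```python
def call_model(prompt: str) -> str:
    """Stand-in for a single LM call."""
    return f"<output for: {prompt[:40]}...>"

def chained_pipeline(question: str) -> str:
    """Plan -> draft -> revise, with every intermediate step exposed."""
    plan = call_model(f"Outline a solution plan for: {question}")
    draft = call_model(f"Following this plan, draft an answer.\nPlan: {plan}")
    final = call_model(f"Critique and revise this draft.\nDraft: {draft}")
    for stage, value in [("plan", plan), ("draft", draft), ("final", final)]:
        print(f"[{stage}] {value}")  # each stage can be inspected and debugged
    return final

chained_pipeline("Why does the sky appear blue?")
```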
Impact on Inference Market
The “test-time compute” paradigm of o1 has significant implications for the inference market. Liang views inference as a fundamental, low-level building block that needs to be robust and affordable [00:44:10]. The shift also creates opportunities to optimize inference for specific use cases, such as high-throughput settings where agentic workflows generate large numbers of candidate outputs [00:45:56].
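One way to picture the high-throughput setting: an agentic workflow fans out many candidate generations at once, so the serving layer can batch aggressively and amortize cost, accepting higher per-request latency. The sketch below uses only the standard library with a stubbed `call_model`; a real stack would batch at the GPU level.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stub for a model call; a real client would hit an inference API."""
    return f"candidate for {prompt!r}"

def fan_out(prompt: str, n: int = 64) -> list[str]:
    """Issue n generations concurrently, trading per-request latency
    for aggregate throughput, the regime agentic workflows care about."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(call_model, [prompt] * n))

candidates = fan_out("propose the next action for the agent")
print(len(candidates), "candidates generated")
```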
Qualitative Changes and Future Milestones
Liang believes AI capabilities are still advancing rapidly, with o1 representing a “qualitative change” in how these systems might be used [00:49:50]. He suggests that meaningful milestones for AI include solving open math problems or discovering new insights that extend human knowledge, moving beyond merely mimicking expert human behavior [00:48:22]. He anticipates that AI will be able to contribute novel insights in fields like ML research within the next few years [00:58:33].
Broader Ecosystem and Regulation
Liang emphasizes that thinking about AI’s role must be holistic, extending beyond the model itself to the larger ecosystem of actors with varying incentives [00:15:14]. He advocates for regulation that prioritizes transparency and disclosure, enabling policymakers, researchers, and third-party auditors to understand AI’s risks and benefits [00:20:13]. He likens such disclosure requirements to nutrition labels on food: they give downstream decision-makers the information needed to understand an AI system’s outputs [00:21:39].
Liang considers “agents” to be both overhyped and underhyped [00:57:04]. While the concept has already gone through a full hype cycle, the potential of agents remains significant, especially in areas like scientific discovery and improving researcher productivity, which he sees as underexplored application areas [00:59:06].