From: redpointai

Percy Liang, a leading AI researcher and co-founder of Together AI, discusses OpenAI’s o1 model, its implications for the future of AI, and the broader landscape of AI model development and evaluation [00:00:03].

Initial Impressions of OpenAI’s o1 [00:00:44]

From a product standpoint, o1 was initially perceived as slow and difficult to use [00:00:57]. However, from a research perspective, its release signals a significant shift towards “test-time compute” [00:01:15]. This approach allows AI to tackle more ambitious tasks that might take days, weeks, or even months for humans to complete, such as new research or drug discovery [00:01:45].
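
One concrete instance of test-time compute is self-consistency sampling: rather than accepting a single greedy answer, spend extra inference on several independent reasoning chains and aggregate their answers. A minimal sketch, assuming a hypothetical `sample_chain` stand-in for a temperature-sampled model call:

```python
import random
from collections import Counter

def sample_chain(question: str) -> str:
    """Hypothetical stand-in for one sampled reasoning chain from a model.

    In practice this would be an API call with temperature > 0; here we
    simulate a noisy solver to keep the sketch self-contained."""
    return random.choice(["42", "42", "41"])  # mostly-correct toy answers

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Spend more test-time compute: sample n chains, majority-vote the answer."""
    answers = [sample_chain(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # extra samples make "42" far more likely
```

The extra samples trade inference cost for reliability, which is the essence of the test-time-compute framing.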

Historically, language models were seen as prompt-in, response-out systems measured by tokens per second [00:01:29]. o1 represents a small but directional step toward AI agents that can reason, plan, and execute tasks over extended periods [00:02:00]. This marks a return to concepts from reinforcement learning, where agents take actions and receive feedback to improve over time [00:03:03].
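
That reinforcement-learning framing can be made concrete as a bare agent loop: the model proposes an action, the environment returns an observation and a reward, and the trajectory feeds back into the next decision. A minimal sketch; `Environment` and `propose_action` are hypothetical toy stand-ins, not any real API:

```python
from dataclasses import dataclass

@dataclass
class Environment:
    """Hypothetical toy environment: the agent must count up to a target."""
    target: int = 3
    state: int = 0

    def step(self, action: str):
        if action == "increment":
            self.state += 1
        done = self.state >= self.target
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def propose_action(observation, history):
    """Stand-in for a model call that reasons over the trajectory so far."""
    return "increment"

env = Environment()
observation, history, done = env.state, [], False
while not done:
    action = propose_action(observation, history)
    observation, reward, done = env.step(action)   # act, then receive feedback
    history.append((action, observation, reward))  # feedback informs future steps
print(history)
```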

Capabilities and Evaluation Challenges [00:03:20]

o1 has shown impressive capabilities in specific domains like math and coding, where multi-step reasoning chains are beneficial [00:09:48]. New benchmarks, such as “Cybench” for Capture the Flag cybersecurity exercises, are emerging to test these advanced reasoning abilities [00:03:41]. Some of these challenges are so complex that even a team of human competitors might take over 24 hours to solve them [00:04:04].

However, evaluating models like o1 presents challenges [00:04:41]:

  • Compatibility Issues: When integrated into existing systems, o1 might ignore predefined templates for agent behavior (e.g., reflection and planning), leading to less-than-expected overall performance despite improvements in sub-tasks [00:05:01]. This highlights that raw benchmark scores don’t always tell the full story [00:05:34].
  • Monotonic Progress vs. System Fit: Simply dropping in a new model doesn’t guarantee improvements if it doesn’t fit the existing system [00:05:42]. Compatibility is a crucial, often overlooked factor [00:06:05].
  • Train-Test Overlap: A persistent problem in evaluation is the unknown content of proprietary training data, making it hard to trust benchmark results; a mechanical screening approach is sketched after this list [00:30:26].
  • Moving Target: As AI models improve, new benchmarks are constantly needed to capture their evolving capabilities [00:31:10]. Language models themselves are increasingly used to generate new, diverse evaluation inputs [00:32:13].
  • Lack of Transparency: Compared to earlier models, newer ones often don’t provide visibility into their internal reasoning processes, making debugging and customization difficult [00:07:02]. This “black box” nature can hinder development, especially for novel applications where data coverage might be limited [00:07:49].
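
Where training data is accessible, the train-test overlap concern can at least be screened mechanically by checking whether benchmark items appear nearly verbatim in the training corpus. A minimal word-level n-gram sketch (the strings are toy placeholders):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams; long n-grams rarely collide by accident."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(test_item: str, train_corpus: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in training data."""
    test = ngrams(test_item, n)
    if not test:
        return 0.0
    train = ngrams(train_corpus, n)
    return len(test & train) / len(test)

train = "the quick brown fox jumps over the lazy dog near the river bank today"
test = "the quick brown fox jumps over the lazy dog near the old barn"
print(overlap_fraction(test, train))  # high values flag likely contamination
```

A high overlap fraction is a useful contamination flag, though it cannot prove a clean split when the training corpus is proprietary and inaccessible.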

Paradigm Shift: Internalized Reasoning and its Implications [00:06:45]

The o1 model signifies a shift where reasoning scaffolding, previously managed by developers through prompt chaining, is internalized within the model and not exposed to the user [00:06:52]. While this aims for greater ease of use, it complicates debugging when things go wrong, as there’s no “stack trace” of the model’s internal steps [00:07:06]. This tension exists between OpenAI’s desire for the model to “just take care of it” and a developer’s need for transparency and customization [00:07:20].
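
The difference is easiest to see in code: an explicit prompt chain exposes every intermediate step as a debuggable trace, while an internalized-reasoning interface returns only the final answer. A minimal sketch, with `call_model` as a hypothetical stand-in for any single model call:

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a single model call."""
    return f"<answer to: {prompt!r}>"

def explicit_chain(question: str):
    """Developer-managed scaffolding: every step is inspectable."""
    trace = []
    plan = call_model(f"Plan the steps to answer: {question}")
    trace.append(("plan", plan))
    draft = call_model(f"Follow this plan: {plan}\nAnswer: {question}")
    trace.append(("draft", draft))
    final = call_model(f"Reflect on and improve this draft: {draft}")
    trace.append(("final", final))
    return final, trace  # the trace is the debugging "stack trace"

def internalized(question: str):
    """o1-style interface: reasoning happens inside; only the answer comes back."""
    return call_model(question)  # no intermediate steps to inspect

answer, trace = explicit_chain("Why is the sky blue?")
for step, output in trace:
    print(step, "->", output)
```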

This trend also impacts open-source models and the competitive landscape. With closed models internalizing their logic, the transparency previously offered by explicit prompt chains is diminishing, leading to more opaque systems [00:08:00].

Impact on the Inference Market [00:43:50]

The rise of models like o1 with their focus on “test-time compute” has significant implications for the inference market:

  • Fundamental Building Block: Inference remains a core, low-level primitive required for every aspect of AI, from training to agentic workflows and synthetic data generation [00:44:10]. It needs to be robust and cost-effective [00:44:20].
  • Abstraction Shift: The market is moving beyond merely serving specific models like Llama 3 [00:45:10]. The focus is on serving models that perform well for specific use cases, and customization becomes key for achieving better and faster performance [00:45:40].
  • Optimization for Agentic Workflows: The new agentic workflows create opportunities for further optimization, especially in high-throughput settings where many possibilities need to be generated, as sketched below [00:45:58].
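
One way to picture the high-throughput case is best-of-n generation: fan out many candidate continuations in a single batch and keep the highest-scoring one, a pattern that rewards inference stacks optimized for large batches. A minimal sketch; `generate_batch` and `score` are hypothetical stand-ins for a batched serving call and a verifier:

```python
import random

def generate_batch(prompt: str, n: int) -> list[str]:
    """Stand-in for a batched inference call producing n candidates at once;
    a real serving stack would exploit the batch for GPU throughput."""
    return [f"candidate-{i}: {prompt}" for i in range(n)]

def score(candidate: str) -> float:
    """Stand-in for a verifier or reward model ranking candidates."""
    return random.random()

def best_of_n(prompt: str, n: int = 64) -> str:
    """Fan out n candidates in one batch, keep the highest-scoring one."""
    candidates = generate_batch(prompt, n)
    return max(candidates, key=score)

print(best_of_n("Refactor this function to remove the race condition."))
```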

Evolution of AI Architectures [00:40:08]

Historically, model architectures like LSTMs, CNNs, and Transformers were often developed through intuition and experimentation, guided by thinking about gradients and how they should flow [00:40:47]. However, newer architectures like Mamba (state-space models) emerged from mathematical breakthroughs, solving fundamental problems that then found application within neural networks [00:41:03].
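
At its core, the state-space idea behind models like Mamba is a linear recurrence: a hidden state is updated by fixed matrices at every step, and the mathematical work was about making such recurrences expressive and efficient at scale. A purely illustrative discrete recurrence with arbitrary toy matrices:

```python
# Discrete linear state-space model: x_t = A x_{t-1} + B u_t, y_t = C x_t.
# The matrices below are arbitrary toy values, not a trained model.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

A = [[0.9, 0.1], [0.0, 0.8]]   # state transition
B = [[1.0], [0.5]]             # input projection
C = [[1.0, -1.0]]              # output readout

def run_ssm(inputs):
    x = [0.0, 0.0]  # hidden state carries the whole sequence history
    outputs = []
    for u in inputs:
        x = vadd(matvec(A, x), matvec(B, [u]))  # x_t = A x_{t-1} + B u_t
        outputs.append(matvec(C, x)[0])         # y_t = C x_t
    return outputs

print(run_ssm([1.0, 0.0, 0.0, 1.0]))  # outputs decay with the state dynamics
```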

It’s likely that Transformers will not be the ultimate architecture [00:42:04]. Future innovations in model architectures are more probable in new domains like video or complex agentic settings, where existing architectures might break down [00:42:19]. Structural changes are needed for genuinely novel architectures to emerge, often driven by tackling fundamentally different problems, much like machine translation spurred Transformer development [00:43:11].

The Future of AI: Beyond Mimicry [00:48:16]

AI is moving beyond merely mimicking human capabilities. The next significant milestones involve AI extending human knowledge, such as solving open math problems or discovering new research insights that haven’t been solved by humans [00:48:22]. Finding “zero-days” in cybersecurity is another example of a potential game-changer [00:49:12].

Progress in AI is not slowing down; rather, it continues to move quickly, with qualitative changes like o1’s approach representing different ways to think about using these systems [00:49:26]. Advances in chip power and cost reduction will further drive this progress [00:50:03].

Underexplored Application Areas [00:59:01]

While commercial applications like RAG (Retrieval Augmented Generation) and summarization are well-explored, more fundamental areas remain underexplored [00:59:06]. These include:

  • Fundamental science and scientific discovery [00:59:21]
  • Improving researcher productivity [00:59:26]

Such areas are crucial as they feed back into and enhance the entire AI ecosystem [00:59:34].