From: redpointai
Michelle Pokris, a post-training research lead at OpenAI, led the effort to tune GPT-4.1 for developers, prioritizing real-world usage over benchmark performance [00:00:04]. In this conversation she discusses the current and future state of AI agents [00:00:17].
Current State of AI Agents
AI agents currently work remarkably well in well-scoped domains [00:09:15]: the successful use cases are those where the model has the necessary tools available and the user’s intent is explicit [00:09:21].
However, the main challenge lies in bridging the gap to the “fuzzy and messy real world” [00:09:30]. This includes situations where users may not know what the agent can do, the agent lacks awareness of its own capabilities, or it is not sufficiently connected to real-world information [00:09:38]. Michelle Pokris believes that most of the core capability for agents is already present; the difficulty lies in providing sufficient context to the model [00:09:53].
A key area for improvement is handling ambiguity: developers should be able to tune whether the model asks for more information or makes reasonable assumptions when faced with an unclear request [00:10:00]. Benchmarks for function calling and agentic tool use often show models being “misgraded” or encountering ambiguous cases, suggesting that the underlying models are usually doing the right thing but are limited by the context and instructions they are given rather than by raw capability [00:10:44].
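A minimal sketch of how a developer might tune this at the prompt level, assuming the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the system-prompt wording and the `run` helper are illustrative, not a prescribed OpenAI pattern:

```python
# Sketch: steering how a model handles ambiguous requests by swapping
# the ambiguity policy in the system prompt. Prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

ASK_FIRST = (
    "If the user's request is ambiguous or missing details, ask one "
    "clarifying question before acting. Do not guess."
)
ASSUME_AND_GO = (
    "If the user's request is ambiguous, state your assumptions briefly "
    "and proceed with the most likely interpretation."
)

def run(user_message: str, clarify: bool = True) -> str:
    """Send the same request under either ambiguity policy."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": ASK_FIRST if clarify else ASSUME_AND_GO},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(run("Book me a flight to SF next week", clarify=True))
```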
Future Developments for AI Agents
To advance agents, improvements are needed on both the engineering and model sides [00:11:30].
Engineering Side
- APIs and UIs: Easier interfaces are required to monitor agent actions, view summaries of their activities, and intervene to change their trajectory [00:11:37]. An example of this steerability exists in OpenAI’s “Operator” product [00:11:47].
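As an illustration of the kind of intervention point described above, here is a framework-agnostic sketch of a human-in-the-loop gate around an agent’s proposed steps; the `Action` type and `execute` callback are hypothetical stand-ins, not Operator’s actual API:

```python
# Sketch: log each proposed agent action and let a human approve, skip,
# or halt the trajectory before anything runs. Types are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str          # e.g. "browser.click" or "shell.run"
    argument: str      # tool-specific payload
    summary: str       # short human-readable description of the step

def supervised_run(actions: list[Action], execute: Callable[[Action], str]) -> None:
    for step, action in enumerate(actions, start=1):
        print(f"[step {step}] {action.summary} ({action.tool}: {action.argument})")
        decision = input("approve / skip / stop? ").strip().lower()
        if decision == "stop":
            print("trajectory halted by user")
            break
        if decision == "skip":
            continue
        result = execute(action)  # hand off to the real tool implementation
        print(f"  -> {result}")
```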
Model Side
- Robustness and “Grit”: Models need to be trained for greater resilience when things go wrong, such as API errors, preventing them from getting “stuck” [00:12:00].
- Longer-term Task Execution: While underlying model capabilities are strong, their full potential isn’t realized because context and tools aren’t sufficiently connected [00:10:29]. Coding agents are expected to emerge soon, given that models like GPT-4.1 already perform strongly on benchmarks such as SWE-bench [00:35:22].
- Generalization: The ability to supervise long-running coding tasks is a key capability [00:35:35]. Models like GPT-4.1 can already integrate developer-specified tools into their chain of thought, using previous tool calls and outputs to continue reasoning [00:35:44]. This means agentic capabilities, such as customer support, are largely “there” and mainly need to be assembled into cohesive products [00:36:04]; see the sketch after this list.
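The sketch below illustrates both the robustness and tool-use points under stated assumptions: it uses the OpenAI Python SDK’s function-calling interface with an illustrative `lookup_order` tool, and it retries transient API errors with backoff so the loop doesn’t get stuck; none of this is presented as OpenAI’s internal approach:

```python
# Sketch: a tool-calling loop with simple "grit" (retry on API errors).
# The lookup_order tool is a stub; the retry policy is illustrative.
import json, time
from openai import OpenAI, APIError

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order's status by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def lookup_order(order_id: str) -> str:
    return json.dumps({"order_id": order_id, "status": "shipped"})  # stubbed result

def complete_with_retry(messages, attempts: int = 3):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="gpt-4.1", messages=messages, tools=TOOLS)
        except APIError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off, then try again

def run(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = complete_with_retry(messages).choices[0].message
        if not reply.tool_calls:
            return reply.content
        messages.append(reply)  # keep the tool call in the conversation
        for call in reply.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": lookup_order(**args),
            })

print(run("Where is order 1234?"))
```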
Role of Fine-Tuning
Fine-tuning, particularly reinforcement fine-tuning (RFT), is seen as a powerful way to push the frontier in specific domains [00:22:08]. RFT is data-efficient, often requiring only hundreds of samples [00:22:16].
Examples of effective RFT applications include:
- Teaching an agent to select a workflow [00:22:37].
- Guiding an agent’s decision-making process [00:22:44].
- Deep tech applications in fields like chip design or biology (e.g., drug discovery), where verifiable data allows the model to be pushed to the best possible results [00:24:47].
Michelle Pokris suggests that if a model’s current pass rate is low (e.g., 10%) but can be significantly improved with fine-tuning (e.g., to 50%), it indicates a capability “right on the cusp” that a future model will likely master [00:18:53].
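A rough sketch of how that pass-rate heuristic might be measured, assuming the OpenAI Python SDK; the exact-match grader, the one-item task set, and the fine-tuned model ID are placeholders rather than a recommended evaluation design:

```python
# Sketch: compare a base model's pass rate against a fine-tuned one on a
# fixed task set. grade(), the tasks, and the ft: model ID are placeholders.
from openai import OpenAI

client = OpenAI()

def grade(output: str, expected: str) -> bool:
    """Toy grader: exact match. Real graders are usually task-specific."""
    return output.strip() == expected.strip()

def pass_rate(model: str, tasks: list[dict]) -> float:
    passed = 0
    for task in tasks:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        if grade(reply.choices[0].message.content, task["expected"]):
            passed += 1
    return passed / len(tasks)

tasks = [{"prompt": "What is 17 * 24? Answer with the number only.", "expected": "408"}]
base = pass_rate("gpt-4.1", tasks)
tuned = pass_rate("ft:gpt-4.1:my-org::example", tasks)  # hypothetical fine-tune ID
print(f"base {base:.0%} -> fine-tuned {tuned:.0%}")
```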
Generalization vs. Specificity
While OpenAI’s immediate goal for GPT-4.1 was a targeted model for developers, the long-term philosophy leans towards the “G” in AGI, striving for a single, general model that can handle various tasks [00:15:54]. The belief is that combining efforts on one general model yields better results [00:16:51].
However, there’s still room for targeted approaches when an acute need arises, as was the case with GPT-4.1 being decoupled from ChatGPT to accelerate development and make specific training choices like upweighting coding data [00:16:17]. Michelle Pokris acknowledges that more targeted models might be pursued again depending on demand [00:17:20].
Regarding specialized foundation models (e.g., robotics or biology), there’s a strong belief that generalization improves capabilities, and combining everything produces a much better overall result [00:25:49].
Role of Developers and Companies
Companies are advised to stay on top of rapid AI progress by developing strong internal evaluations (evals) for their specific use cases [00:17:56]. This enables them to quickly assess new models and adapt their prompts and scaffolding [00:18:11]. Michelle Pokris suggests building features that are “just out of reach” of current models, as new model releases can quickly enable them [00:18:22].
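One way such an internal eval might be structured, sketched under assumptions: the task suite, the substring grader, and the prompt scaffolds below are placeholders, and the point is simply that the same grid can be rerun unchanged whenever a new model ships:

```python
# Sketch: run one task suite across several model / prompt-scaffold
# combinations so a new release can be assessed quickly. All names are
# illustrative; swap in your own cases and grading logic.
import itertools
from openai import OpenAI

client = OpenAI()

SUITE = [
    {"prompt": "Summarize: 'Order #88 arrived damaged.'", "must_contain": "damaged"},
]
SCAFFOLDS = {
    "terse": "Answer in one sentence.",
    "structured": "Answer as a bulleted list of facts.",
}
MODELS = ["gpt-4.1", "gpt-4.1-mini"]

def passes(case: dict, output: str) -> bool:
    return case["must_contain"].lower() in output.lower()

for model, (name, system) in itertools.product(MODELS, SCAFFOLDS.items()):
    score = 0
    for case in SUITE:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": case["prompt"]}],
        )
        score += passes(case, reply.choices[0].message.content)
    print(f"{model:12s} {name:11s} {score}/{len(SUITE)}")
```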
While scaffolding to make products work today is valuable (offering “arbitrage” for a few months), companies should be prepared to change things as core capabilities like context windows, reasoning, and instruction following continue to improve [00:19:54]. Multimodal capabilities are also rapidly improving, making it worthwhile to connect models to as much information as possible, even if initial results are modest [00:20:36].
For app layer companies, strong AI expertise may not be as crucial as understanding the product and being “scrappy engineers” who can combine models and solutions [00:31:31]. The most successful developers and startups are those who deeply understand their problem, have comprehensive evaluations for subcomponents, and design modular systems [00:30:16].
Personalization and AGI
The trend of enhanced memory in models means that future ChatGPT experiences will be highly personalized, adapting to individual users and their preferences [00:39:03]. Increased steerability through features like custom instructions will also allow users to fine-tune the model’s “personality” [00:39:25].
The ultimate goal of combining different model families (like GPT-4o’s conversational strengths and o3’s reasoning capabilities) into a single model, such as GPT-5, is a significant research challenge. The difficulty lies in striking the right balance: being a delightful conversationalist while knowing when to engage in deep reasoning [00:37:25].
Michelle Pokris believes that even if model progress stopped today, there would still be at least 10 years of value to build from existing capabilities [00:37:07].