From: redpointai
Michelle Pokrass, a key figure behind GPT-4.1 at OpenAI, served as the post-training research lead, significantly enhancing these models for developers [00:00:04]. Her work focused on improving the models’ real-world utility and user experience [00:00:55].
Development Philosophy: Utility over Benchmarks
A core goal for GPT-4.1 was to create a model that is a “joy to use for developers” [00:01:10]. Models optimized purely for benchmarks often stumble in practical application: failing to follow instructions, producing odd formatting, or offering too-short context windows [00:01:17].
Instead, development prioritized developer feedback, converting it into an internal evaluation metric that served as a “north star” during research [00:01:32] [00:02:07]. This involved extensive user conversations to surface key insights, even for subtle issues like models failing to ignore their world knowledge when instructed to use only the provided context [00:02:52]. Feedback was also gathered from internal customers building on OpenAI models [00:03:39].
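To make that last failure mode concrete, here is a minimal sketch of a “use only the provided context” eval. The `ask_model` helper, the fabricated document, and the exact-match grading rule are illustrative assumptions, not OpenAI’s internal harness:

```python
# Minimal context-grounding eval sketch. `ask_model` is a hypothetical
# wrapper around a chat-completions call; the grading rule is illustrative.
CONTEXT = "The Zephyr 9 phone ships with a 5,000 mAh battery."  # made-up document
QUESTION = "What is the capital of France?"  # answerable only from world knowledge

SYSTEM = (
    "Answer using ONLY the provided context. "
    "If the answer is not in the context, reply exactly: NOT_IN_CONTEXT"
)

def grade(answer: str) -> bool:
    # Pass only if the model refuses to fall back on its world knowledge.
    return answer.strip() == "NOT_IN_CONTEXT"

def run_case(ask_model) -> bool:
    prompt = f"Context:\n{CONTEXT}\n\nQuestion: {QUESTION}"
    return grade(ask_model(system=SYSTEM, user=prompt))
```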
Key Improvements in GPT-4.1
GPT-4.1 introduced several significant improvements:
- Improved UI and Coding Capabilities: The model generates better UIs and code, a feature “snuck in near the very end” of development [00:06:08].
- Nano Model Adoption: The “nano” variant proved highly successful due to its low cost and high speed, driving increased AI adoption across use cases such as Box’s 17-page document reading feature [00:06:15].
- Instruction Following and Long Context: Both areas saw substantial improvements, which are incredibly beneficial for agents [00:08:59].
- Code Improvements:
- Local Scope Mastery: The model excels at problems with local context, such as changing libraries where files are closely related [00:13:38].
- Global Understanding (Ongoing): Challenges remain with tasks requiring global context, reasoning across many parts of the code, or handling highly technical details across files [00:12:50].
- Front-end Coding: Significant improvements in generating front-end code, with an ongoing focus on linting and code style to make it professional-grade [00:13:13].
- Reduced Irrelevant Edits: The rate of irrelevant code edits dropped from 9% in GPT-4o to 2% in GPT-4.1, though efforts continue toward zero [00:13:52].
The Development Process
Shipping a model like GPT-4.1 involves a large team effort [00:07:14]. The project included:
- Pre-training: The standard-size model received a “mid-train” (freshness) update, while the mini and nano variants were new pre-trains [00:07:30].
- Post-training: Michelle Pokrass’s team focused on determining the best data mix, optimal parameters for reinforcement learning (RL), and reward weighting [00:07:46].
- Timeline: The process included approximately three months of evaluating pain points with GPT-4o, followed by a three-month period of intensive training experiments, and finally about a month of alpha testing with rapid iteration and feedback incorporation [00:08:08].
Evals have a short shelf life of about three months, owing to the rapid pace of AI progress and how quickly they saturate [00:08:48].
Agent Capabilities and Future Directions
With GPT-4.1, agents demonstrate remarkable performance in “well-scoped domains” where tools and user intent are clear [00:09:16]. However, challenges persist in bridging to the “fuzzy and messy real world,” where users might not know what the agent can do, or the agent lacks awareness of its own capabilities or real-world context [00:09:28].
A key area for improvement is handling ambiguity, allowing developers to tune whether the model should ask for more information or make assumptions [00:10:00]. Many supposed failures in external benchmarks for agentic tool use are misgraded or stem from ambiguous situations, indicating that the models’ underlying capabilities often exceed what current implementations capture [00:10:41].
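One way a developer might expose that ask-versus-assume tuning today is through the system prompt. A minimal sketch, with hypothetical prompt text and function, not a documented OpenAI pattern:

```python
# Hypothetical developer-side switch between "ask" and "assume" behavior.
ASK_MODE = (
    "If the request is ambiguous or missing required parameters, "
    "ask one concise clarifying question before acting."
)
ASSUME_MODE = (
    "If the request is ambiguous, state your assumption in one line "
    "and proceed; never block on a clarifying question."
)

def system_prompt(clarify_first: bool) -> str:
    base = "You are an agent with access to the user's tools."
    return f"{base}\n{ASK_MODE if clarify_first else ASSUME_MODE}"
```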
For longer-term task execution, both engineering and model-side changes are needed [00:11:30]. Engineering-wise, APIs and UIs need to better track agent actions, provide summaries, and allow for intervention and steering [00:11:35]. Model-wise, increased robustness and “grit” are desired for handling errors like API failures [00:12:00].
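A rough sketch of how that engineering-side scaffolding and model-side “grit” could combine: retry transient tool failures rather than abandoning the task, and keep an action log a UI could summarize for human intervention. The error type, backoff policy, and log shape here are assumptions, not a prescribed pattern:

```python
import time

class TransientToolError(Exception):
    """Stand-in for a retryable failure (timeouts, 5xx responses, etc.)."""

def call_tool_with_grit(tool, args, log, max_retries=3):
    """Retry transient tool failures instead of giving up, recording
    every action so a UI can summarize and let a human steer."""
    for attempt in range(1, max_retries + 1):
        try:
            result = tool(**args)
            log.append({"tool": tool.__name__, "attempt": attempt, "ok": True})
            return result
        except TransientToolError:
            log.append({"tool": tool.__name__, "attempt": attempt, "ok": False})
            if attempt == max_retries:
                raise  # give up only after exhausting retries
            time.sleep(2 ** attempt)  # back off, then try again
```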
Model Evolution and OpenAI’s Strategy
OpenAI’s general philosophy leans towards the “G” in AGI, aiming for one general model that simplifies the product offering [00:15:54]. GPT-4.1’s targeted development for developers was an exception due to an acute need and the ability to move faster by decoupling from ChatGPT [00:16:15]. This allowed for specific data weighting (e.g., upweighting coding data, downweighting chat data) [00:16:39].
The expectation is a return to simplification, believing models improve when all researchers contribute to a single, powerful model [00:16:51]. Cross-domain generalization has also shown benefit [00:17:01].
The future GPT-5 aims to combine the strengths of different models, like the conversational fluency of GPT-4o with the hard reasoning capabilities of o3, presenting a challenge in striking the right balance without sacrificing specific strengths [00:37:12]. Personalization through “enhanced memory” and steerability via custom instructions are seen as key levers for future model personalities [00:39:03].
Advice for Companies and Users
To succeed in the rapidly evolving AI landscape, companies should:
- Maintain strong evals: The most successful startups thoroughly understand their use cases and have robust evals, allowing them to quickly test new models (see the harness sketch after this list) [00:17:58].
- Adapt prompts and scaffolding: Be prepared to tune prompts and system scaffolding to new model capabilities [00:18:15].
- Build “just out of reach” features: Focus on use cases that are almost but not quite solvable by current models (e.g., a task with a 10% pass rate that fine-tuning could lift to 50% is on the cusp) [00:18:25].
- Embrace scaffolding initially: Build necessary scaffolding to deliver value immediately, but be ready to adapt as underlying model capabilities improve (e.g., longer context windows, better reasoning, improved instruction following, multimodal capabilities) [00:19:54].
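A minimal sketch of the kind of custom eval harness the first point describes; the `complete(model, prompt)` wrapper and the test cases are placeholders, and a real suite would be derived from production usage data:

```python
# Reusable eval suite sketch. `complete(model, prompt)` is a placeholder
# for a provider API call; cases would come from real usage.
CASES = [
    {"prompt": "Extract the invoice total from: ...", "expect": "$1,240.00"},
    {"prompt": "Summarize this ticket in one sentence: ...", "expect_contains": "refund"},
]

def run_suite(complete, model: str) -> float:
    passed = 0
    for case in CASES:
        out = complete(model, case["prompt"])
        if "expect" in case:
            passed += int(out.strip() == case["expect"])  # exact-match case
        else:
            passed += int(case["expect_contains"] in out.lower())  # substring case
    return passed / len(CASES)

# When a new model ships, re-run the same suite and compare:
# baseline = run_suite(complete, "current-model")
# candidate = run_suite(complete, "new-model")
```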
Fine-Tuning: SFT and RFT
Fine-tuning has seen a “renaissance” [00:21:26]:
- Supervised Fine-tuning (SFT): Primarily used for speed and latency improvements [00:21:46].
- Reinforcement Fine-Tuning (RFT): This approach allows users to push the “frontier in your specific area” with high data efficiency, requiring only hundreds of samples [00:22:08]. RFT is particularly useful for teaching agents workflows or for deep-tech applications where verifiable, unique data is available [00:22:33]. Examples include chip design and biology (e.g., drug discovery), where results are easily verifiable [00:24:47].
Companies should consider RFT when no existing model performs to their needs [00:24:22].
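To make the “verifiable data” requirement concrete, here is an illustrative RFT-style grader for a biology-flavored task: the reward is computed programmatically against a lab-measured ground truth rather than from human labels. The function and reward shape are hypothetical, not OpenAI’s actual RFT grader API:

```python
def grade_affinity_prediction(model_answer: str, assay_value: float) -> float:
    """Hypothetical RFT-style grader: score a numeric prediction against
    a lab-verifiable assay result. Returns a reward in [0, 1]."""
    try:
        predicted = float(model_answer.strip())
    except ValueError:
        return 0.0  # unparseable answers earn no reward
    # Simple inverse-relative-error reward; real graders are task-specific.
    error = abs(predicted - assay_value) / max(abs(assay_value), 1e-6)
    return max(0.0, 1.0 - error)
```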
Michelle Pokrass’s Career and Future Research
Michelle Pokrass joined OpenAI two and a half years prior to the interview, initially on the API engineering team, with a background in back-end distributed systems [00:40:36]. She transitioned to focusing on model improvements for the API, specifically addressing developer needs like structured outputs [00:41:14]. She later formed a team, rebranded “Power Users Research,” focused on the most discerning users, including developers, because their current use cases often predict mainstream usage a year later [00:41:40].
Exciting future research areas include:
- AI for AI: Using models to make other models better, especially in reinforcement learning, by generating synthetic data and using signals to track model progress [00:32:04].
- Speed of Iteration: Improving the efficiency of experiments to run more research with fewer GPUs, enabling faster feedback on model performance [00:32:26].
Pokrass is bullish on generalists in the AI app space, believing that strong product understanding and scrappy engineering skills will be more important than deep AI research expertise for companies building on top of foundation models [00:31:31].
She highlights the “capabilities overhang” of current AI models, suggesting that even if model progress stopped today, there would be at least ten years of building value from existing capabilities [00:36:42].
Overhyped vs. Underhyped
- Overhyped: Benchmarks, particularly saturated agentic ones or those showing unrealistic “best numbers” [00:43:38].
- Underhyped: Companies using their own custom evals derived from real usage data [00:43:55].
Pokrass changed her mind on fine-tuning, moving from a “bear” to believing RFT is highly valuable for pushing the frontier in specific domains [00:44:10]. She expects model progress to continue at a similarly rapid pace, with no signs of slowing down and no “fast takeoff” yet [00:44:56].
For more information, refer to the GPT-4.1 blog post or reach out to Michelle Pokrass on Twitter or via email [00:46:05].