From: redpointai

Michelle Pokrass, as OpenAI’s Post-Training Research Lead, played a pivotal role in refining GPT-4.1, making it significantly more effective for developers [00:00:04]. The focus for this model shifted from optimizing solely for benchmarks to prioritizing real-world usage and utility [00:00:55].

Development Philosophy and Process

The primary goal for GPT-4.1 was to create a model that is a “joy to use for developers” [00:01:10]. This involved addressing common developer pain points, such as models not following instructions, strange formatting, or limited context windows [00:01:22].

To achieve this, the development process included:

  • User Feedback Integration: Extensive conversations with users to identify issues and transform their feedback into actionable evaluations (evals) [00:01:41]. An internal instruction following eval, based on real API usage, served as a “north star” during development [00:02:00].
  • Identifying Key Problems: Users rarely arrive with ready-made evals; more often, developers describe a general “weirdness” in their use cases, which requires deeper investigation to uncover the underlying issues [00:02:42]. For example, one insight revealed that models needed to get better at ignoring pre-existing world knowledge and relying exclusively on in-context information [00:03:00]. Recurring themes from customer feedback and internal model usage guide which evals to prioritize [00:03:30].
  • Rapid Iteration: The process involved about three months of eval development, followed by three months of intense training and experimentation, and finally a month of alpha testing to rapidly incorporate feedback [00:08:08].
  • Eval Shelf Life: The rapid progress in AI means that the shelf life of an eval is often only about three months, as models quickly saturate existing benchmarks [00:08:48].

Shipping the Model

Shipping GPT-4.1 involved a large team and three models: standard, mini, and nano [00:07:14]. The larger model received a “mid-train” or freshness update, while the mini and nano versions were entirely new pre-trains [00:07:36]. Michelle Pokrass’s team focused heavily on post-training, determining the optimal mix of data, RL training parameters, and reward weightings [00:07:46].

A key decision for GPT-4.1 was to decouple its development from ChatGPT [00:16:22]. This allowed for faster training and feedback cycles, and enabled specific model training choices, such as removing ChatGPT specific datasets and significantly increasing the weighting of coding data [00:16:27].

Key Improvements and Capabilities

Agents and Tool Use

GPT-4.1 brought significant improvements to agent development tools through enhanced instruction following and long context capabilities [00:09:00].

  • Current State: Agents excel in “well-scoped domains” where all necessary tools are provided and user intent is clear [00:09:16].
  • Challenges: The main challenge is bridging the gap to “fuzzy and messy real-world” scenarios [00:09:30]. This includes the agent’s awareness of its own capabilities, connection to real-world information, and handling ambiguity (e.g., knowing when to ask for more information versus making assumptions) [00:09:42].
  • Future Progress: Progress requires both engineering and model changes [00:11:30]. On the engineering side, better APIs and UIs are needed for monitoring, summarizing, and steering agent actions [00:11:37]. On the modeling side, improved robustness and “grit” are needed to handle errors (e.g., API 500s) [00:12:00]. Most current agent benchmarks are nearing saturation or have ambiguous failure cases [00:10:42].
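To make the “grit” point concrete, here is a minimal, hypothetical retry wrapper of the kind agent builders currently add on the engineering side to absorb transient tool failures; `TransientAPIError` and `tool_fn` are illustrative stand-ins, not part of any real SDK:

```python
import random
import time

class TransientAPIError(Exception):
    """Stand-in for an HTTP 5xx error from a downstream tool API."""

def call_with_retry(tool_fn, *args, max_retries=4, base_delay=1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff plus jitter.

    Until models handle transient errors with more "grit" natively,
    scaffolding like this keeps a single API 500 from derailing an agent.
    """
    for attempt in range(max_retries):
        try:
            return tool_fn(*args, **kwargs)
        except TransientAPIError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the agent loop
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```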

Code Capabilities

GPT-4.1 demonstrated “remarkably good” performance in coding, especially for locally scoped problems where files are closely related [00:12:31].

  • Areas for Improvement:
    • Global Understanding: The model still needs improvement in understanding global context and reasoning across many disparate parts of a codebase [00:12:50].
    • Front-End Coding: While improved, the goal is to produce front-end code that a human engineer would be proud of, focusing on linting and code style [00:13:14].
    • Irrelevant Edits: Reducing changes beyond what was requested (down from 9% with GPT-4o to 2% with GPT-4.1, with the goal of zero) [00:13:32].
  • Benchmarks: While some code benchmarks like SWE-bench are still useful, many are becoming saturated, necessitating the creation of new ones [00:14:54].
  • Personal Use: Michelle uses Codex (mainly with o4-mini, for speed) and still dabbles with GitHub Copilot, Windsurf, and Cursor for her personal coding [00:14:11].

Model Strategy and Evolution

Generalization vs. Specialization

OpenAI’s general philosophy leans into the “G” in AGI, aiming for one general model [00:15:54]. The long-term goal is to simplify the product offering, having one model for both ChatGPT and API use cases [00:16:04]. However, GPT-4.1’s separate development was due to a “particularly acute need” for developers, allowing faster progress by decoupling from ChatGPT [00:16:17]. While a targeted approach was successful, the belief is that combining the “creative energies of all researchers” on one model yields better overall results [00:16:51].

Future of Model Families (GPT-5)

The challenge for future models like GPT-5 will be combining diverse capabilities [00:37:27]. For example, balancing the conversational delight of the GPT-4o series with the strong reasoning abilities of o3, ensuring the model knows when to engage in casual chat versus deep thought [00:38:03]. Choices about upweighting or downweighting specific data types (like chat vs. coding) are “zero-sum decisions” that must be carefully balanced [00:38:21].

Advice for Developers and Companies

Companies must have strong evals tailored to their specific use cases to quickly assess new models [00:17:58]. They should also be able to adapt their prompts and scaffolding to new models [00:18:13].
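As a sketch of what such a use-case-specific eval might look like in practice (the cases, grading rule, and model names here are illustrative placeholders, not a prescribed harness):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical eval cases drawn from your own real usage data.
EVAL_CASES = [
    {"prompt": "Summarize this support ticket: ...", "expected": "refund request"},
]

def grade(output: str, expected: str) -> bool:
    """Replace with your own check: exact match, regex, or model-graded."""
    return expected.strip().lower() in output.strip().lower()

def run_eval(model: str) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if grade(resp.choices[0].message.content or "", case["expected"]):
            passed += 1
    return passed / len(EVAL_CASES)

# Re-run the same suite whenever a new model ships, e.g.:
# print(run_eval("gpt-4.1"), run_eval("gpt-4.1-mini"))
```

Re-running the same suite against each new release gives an immediate, use-case-specific read on whether to switch models.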

Building for the Future

  • “Just Out of Reach”: Focus on use cases that are “maybe just out of reach of the current models,” or that work only occasionally but have high potential [00:18:25]. If a problem can be fine-tuned from 10% to 50% pass rate, it’s likely “on the cusp” and a future model will “crush it” [00:18:51].
  • Scaffolding: Building scaffolding (e.g., RAG, repeating instructions) is worthwhile for immediate value and short-term arbitrage [00:19:54]; a minimal example follows this list. However, companies must be prepared to change their approach as model capabilities (context windows, reasoning, instruction following) continuously improve [00:20:16].
  • Multimodal Capabilities: Acknowledge the significant improvements in multimodal understanding, especially with new pre-trains [00:20:50]. Connecting models to diverse information sources (even if results are currently mediocre) is a good strategy as capabilities will improve [00:21:09].
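To make the scaffolding point concrete, here is a minimal, hypothetical RAG-style prompt builder that also repeats key instructions after the context; `retrieve` is an assumed stand-in for whatever retrieval layer you already have:

```python
def build_scaffolded_prompt(question: str, retrieve) -> str:
    """Compose a RAG-style prompt; `retrieve` is a hypothetical function
    returning relevant document snippets for the question."""
    docs = retrieve(question, k=5)
    context = "\n\n".join(docs)
    instructions = (
        "Answer using ONLY the documents below. "
        "If the answer is not in the documents, say you don't know."
    )
    # Repeating key instructions after the context is a common workaround
    # for instruction drift in long prompts; drop it once models no longer
    # need the reminder.
    return (
        f"{instructions}\n\n<documents>\n{context}\n</documents>\n\n"
        f"Question: {question}\n\nReminder: {instructions}"
    )
```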

Finetuning Approaches and Considerations in AI

Fine-tuning has seen a “renaissance” in usefulness [00:21:26]. It can be categorized into two camps:

  1. Fine-tuning for Speed and Latency (SFT): This remains the workhorse, allowing models like 4.1 to be run at a fraction of the latency [00:21:46]. (A minimal SFT sketch follows this list.)
  2. Fine-tuning for Frontier Capabilities (RFT): This approach allows pushing the frontier in specific domains, even with small datasets (e.g., hundreds of samples) [00:22:01]. RFT is expected to be generally available soon and is particularly useful for:
    • Teaching agents complex workflows [00:22:37].
    • Deep tech applications where organizations have unique, verifiable data (e.g., chip design, biology, drug discovery) [00:22:48].
    • RFT is based on the same reinforcement learning process OpenAI uses internally, making it robust and effective [00:23:32].
  • When to Use Which Fine-tuning:
    • Preference Fine-tuning: For stylistic adjustments [00:24:04].
    • SFT: For simple tasks like classification where the model gets a small percentage of cases wrong [00:24:13].
    • RFT: For problems where no current model in the market meets the specific needs [00:24:22].
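For the SFT camp above, a minimal sketch using the OpenAI fine-tuning API’s chat-format JSONL. The training example is a toy classification case, and the model snapshot name is an assumption; check the fine-tuning docs for currently supported models:

```python
import json
from openai import OpenAI

client = OpenAI()

# SFT expects JSONL where each line is a complete chat example.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the ticket: billing, bug, or other."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the data and launch the fine-tuning job.
uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4.1-mini-2025-04-14",  # illustrative snapshot; verify in the docs
)
print(job.id)
```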

Choosing the Right Model and Prompting

For general ChatGPT use, 4o is recommended, with GPT-4.5 for creative tasks and o3 for the hardest math or critical problems like tax filing [00:26:36].

For enterprise users:

  • Start with 4.1 to see if it meets the use case [00:27:35].
  • If faster speeds are needed, look into mini and nano, and fine-tuning them [00:27:39].
  • If 4.1’s capabilities are insufficient, try o4-mini for reasoning, then o3, and finally RFT with o4-mini [00:27:48].

Effective prompting techniques for 4.1 include:

  • Structuring prompts with XML: This helps organize instructions effectively [00:28:23].
  • “Keep Going”: Explicitly telling the model to continue until the problem is solved can significantly improve performance [00:28:30].
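A quick, illustrative sketch combining both techniques; the tags and task are placeholders, and only the general pattern is the point:

```python
from openai import OpenAI

client = OpenAI()

# XML-style tags organize the instructions; the explicit persistence
# instruction ("keep going until solved") addresses premature stopping.
system_prompt = """\
<instructions>
You are a coding assistant. Keep going until the user's problem is
completely solved before ending your turn. Do not stop at a partial fix.
</instructions>
<output_format>
Return a unified diff only, no commentary.
</output_format>"""

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "<task>Fix the off-by-one error in pagination.py</task>"},
    ],
)
print(resp.choices[0].message.content)
```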

OpenAI’s Internal Focus and Future Research

Power Users and Evals

OpenAI’s “power users research team” focuses on understanding and improving models for discerning users, including developers, who often use advanced features and know the models’ capabilities best [00:41:44]. Insights from power users predict what median users will be doing in the future [00:42:27].

Sophisticated companies excel at creating modular systems with granular evals, breaking down problems into subcomponents to understand precisely what is working or failing [00:30:16]. This modularity allows for faster development in the long run [00:30:47].

Future Research Areas

  • Models Improving Models: Using AI models to make other models better, particularly in reinforcement learning, by generating signals to determine if a model is on the right track [00:32:04]. This includes leveraging synthetic data, which is an “incredibly powerful trend” [00:33:18].
  • Speed of Iteration: Improving the speed of experimentation, reducing GPU usage per experiment, and ensuring sufficient scale to derive clear signals [00:32:26].
  • Generalization in Agents: The trend suggests that learning to use one set of tools improves the model’s ability to use other tools, leading to broadly capable agents rather than highly specialized, single-tool trained ones [00:34:14]. Given current performance levels, capable coding agents are expected soon [00:35:22].
  • Personality and Steerability: Enhancing model personality through “enhanced memory” (making it more useful as it learns about the user) and increased steerability via custom instructions [00:39:03].

Personal Journey at OpenAI

Michelle Pokrass joined OpenAI two and a half years ago on the API engineering team, with a background in back-end distributed systems [00:40:36]. After about a year and a half, she shifted to focusing on models specifically for the API, identifying a need to improve models for developers (e.g., structured outputs) [00:41:14]. She then formed and now leads the Power Users Research team, which focuses on developers and other discerning users across OpenAI’s products [00:41:40]. The pace of shipping remains consistently fast despite the company’s growth [00:42:51].

Quick Fire Round

  • Overhyped: Benchmarks, especially saturated agentic ones or those showing unrealistic best numbers [00:43:38].
  • Underhyped: Companies’ own evals derived from real usage data [00:43:55].
  • Mind Change: Michelle was initially a “fine-tuning bear” but was convinced by RFT’s ability to push the frontier in specific domains [00:44:10]. The shift came because RFT uses essentially the same reinforcement learning algorithm OpenAI relies on internally, making it a “big shift” in eliciting capabilities [00:44:47].
  • Model Progress: Expect model progress to be about the same this year, with continued speed and many new models [00:45:01]. There is still “tens of trillions of dollars of value” to be extracted from current models, representing at least 10 years of building [00:36:31].
  • Exciting Consumer Products (outside OpenAI): Products that take AI out of the digital world, such as Levels and Whoop (for health insights) [00:45:46].

For more information, readers can refer to the GPT-4.1 blog post or reach out to Michelle Pokrass on Twitter for feedback [00:46:05].