From: redpointai

Michelle Pokrass, Post-Training Research Lead at OpenAI, discusses the focus on real-world utility and developer experience for models like GPT-4.1, shifting away from solely optimizing for benchmarks [00:00:55]. This focus on utility makes strong personalization and steerability essential for AI models.

Prioritizing User Utility and Feedback

The primary goal for models like GPT-4.1 was to be a “joy to use for developers” [00:01:10]. This contrasts with models optimized only for benchmarks, which might stumble on basic real-world issues like instruction following or formatting [00:01:17].

To achieve user utility, OpenAI heavily relies on:

  • User Feedback: Talking to users and gathering their feedback is crucial for identifying key insights and problems [00:01:41]. This feedback is then transformed into internal evaluations (evals) that can be used during research and development [00:01:44] (see the sketch after this list).
  • Internal Usage: OpenAI uses its own models internally and has internal customers building on top of them, providing additional insights into performance [00:03:34].
  • “North Star” Evals: An internal instruction-following eval, based on real API usage and user feedback, served as a “north star” during the development of GPT-4.1 [00:01:59]. However, the shelf life of an eval is short (around three months) due to rapid progress, so there is a constant need for new and relevant evaluations [00:08:48].
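
A minimal sketch of how user feedback might be turned into a reusable eval, assuming the OpenAI Python SDK; the two cases, their prompts, and their pass checks are hypothetical stand-ins for the kinds of real-world failures users report:

```python
# Hypothetical feedback-derived eval: each case pairs a prompt with a
# programmatic check distilled from a user-reported failure.
from openai import OpenAI

client = OpenAI()

CASES = [
    {
        "prompt": "List three fruits as a JSON array of strings.",
        "check": lambda out: out.strip().startswith("[") and out.strip().endswith("]"),
    },
    {
        "prompt": "Reply with exactly the word OK and nothing else.",
        "check": lambda out: out.strip() == "OK",
    },
]

def run_eval(model: str) -> float:
    """Return the fraction of feedback-derived cases the model passes."""
    passed = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if case["check"](resp.choices[0].message.content):
            passed += 1
    return passed / len(CASES)

print(run_eval("gpt-4.1"))
```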

“There’s an interesting insight I got recently from talking to a user where it turns out our models could do better on kind of sometimes you want to tell them ignore everything you know about the world and only use the information in context.” [00:02:58]
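
One way to act on that insight in an application is to pin the model to the provided context. A minimal sketch using the OpenAI Python SDK; the system-prompt wording and the example context are illustrative:

```python
# Illustrative "context-only" instruction: tell the model to ignore its
# world knowledge and answer strictly from the supplied document.
from openai import OpenAI

client = OpenAI()

context = "Acme's refund window is 45 days."  # e.g., a retrieved document

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "Ignore everything you know about the world. Answer using ONLY "
                "the information in the provided context. If the answer is not "
                "in the context, say you don't know."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: What is Acme's refund window?",
        },
    ],
)
print(resp.choices[0].message.content)
```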

Model Development and Iteration

Shipping a model like GPT-4.1 involves significant work from various teams [00:07:11]:

  • Pre-training: The three models (standard, mini, nano) are based on “semi-new pre-trains” or “mid-trains,” which are freshness updates [00:07:22].
  • Post-training: Michelle’s team focuses on post-training, determining the best mix of data, parameters for reinforcement learning (RL) training, and weighting of different rewards [00:07:46].
  • Rapid Iteration: The process involved months of evaluation setup, followed by a flurry of training experiments, and then alpha testing to incorporate feedback rapidly [00:08:08].

Strategies for Personalization and Steerability

1. Fine-tuning

Fine-tuning has seen a “renaissance” in its perceived helpfulness with newer models [00:21:26].

  • SFT (Supervised Fine-tuning): Primarily used for speed and latency improvements; a 4.1 model can be fine-tuned to respond at a fraction of the latency [00:21:49]. It is also useful for simpler classification tasks or for closing a small remaining percentage of errors [00:24:13] (see the sketch after this list).
  • RFT (Reinforcement Fine-tuning): This method, based on the same RL process used internally by OpenAI, allows users to “push the frontier in your specific area” [00:22:08]. It is data-efficient, requiring only hundreds of samples [00:22:16].
    • Applications: RFT is effective for teaching agents specific workflows, decision processes [00:22:37], and deep tech applications where verifiable, unique data is available (e.g., chip design, biology/drug discovery) [00:22:48].
  • Preference Fine-tuning: Useful for stylistic adjustments to model output [00:24:04].
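
For the SFT path, a minimal sketch of starting a job via the OpenAI fine-tuning API; the file name and model snapshot are placeholders, and the training file is assumed to be JSONL chat examples:

```python
# Sketch: upload training data and start a supervised fine-tuning job.
from openai import OpenAI

client = OpenAI()

# train.jsonl is a placeholder path; each line holds a {"messages": [...]} example.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune a smaller snapshot for cheaper, lower-latency inference.
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-mini-2025-04-14",  # placeholder snapshot name
    training_file=training_file.id,
)
print(job.id, job.status)
```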

2. Prompting and Instruction Following

Instruction following is one of the “hardest things to define in ML” because it means hundreds of different things to different people [00:04:28].

  • Structured Prompts: Using XML or well-structured prompts improves model performance [00:28:23].
  • “Keep Going” Prompt: Explicitly telling the model to “please don’t come back to me until you’ve solved the problem” can lead to “remarkably better performance” [00:28:30] (combined with a structured prompt in the sketch after this list).
  • Custom Instructions: Users can employ custom instructions to tweak the model’s personality or behavior, such as avoiding capital letters or follow-up questions [00:39:29].
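
The structured-prompt and “keep going” tips can be combined, as in this sketch; the XML tag names and the toy task are illustrative:

```python
# Illustrative prompt: XML structure plus an explicit persistence instruction.
from openai import OpenAI

client = OpenAI()

prompt = """<instructions>
Fix the bug in the code below.
Please don't come back to me until you've solved the problem.
</instructions>
<code>
def add(a, b):
    return a - b  # bug: should be a + b
</code>"""

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```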

3. Enhanced Memory

Enhanced memory is a powerful lever for personality and personalization [00:39:03]. As a model learns more about a user, it can adapt to their preferences, making the interaction more useful [00:39:13]. Michelle notes that her ChatGPT experience is “so different from like my mom’s or my husband’s” due to this personalization [00:39:07].
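
A minimal sketch of memory-driven personalization, assuming stored preferences are prepended as a system message; the in-memory store, user ID, and preference strings are hypothetical:

```python
# Hypothetical memory store: per-user preferences injected into the prompt.
from openai import OpenAI

client = OpenAI()

# In practice this would be a database keyed by user ID.
MEMORY = {
    "user-123": ["prefers lowercase responses", "dislikes follow-up questions"],
}

def chat(user_id: str, message: str) -> str:
    prefs = "; ".join(MEMORY.get(user_id, []))
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": f"Known user preferences: {prefs}"},
            {"role": "user", "content": message},
        ],
    )
    return resp.choices[0].message.content
```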

Challenges and Future Directions

General vs. Purpose-Built Models

While GPT-4.1 was purpose-built for developers by upweighting coding data and removing chat-specific data [00:16:39], Michelle’s general philosophy leans towards making one “general” model (AGI) that simplifies the product offering [00:15:54]. Combining capabilities typically produces a much better result due to cross-domain generalization [00:16:51].

The challenge for future models like GPT-5 will be to combine different skill sets, such as being a delightful conversationalist while also knowing when to engage in hard reasoning [00:37:57].

Agentic Capabilities and Ambiguity

Agents work “remarkably well in well scoped domains” where tools and user intent are clear [00:09:16]. The current challenge is bridging the gap to “fuzzy and messy real world” scenarios [00:09:30], where users might not know what the agent can do or the agent lacks awareness of its own capabilities [00:09:38].

  • Improving Ambiguity Handling: Models need to be more steerable regarding ambiguity—should they ask for more information or proceed with assumptions? [00:10:00]
  • Robustness and “Grit”: On the modeling side, there’s a need for more robustness, particularly when external APIs encounter errors. This is referred to as training in more “grit” [00:12:00]; a scaffolding-side analogue is sketched after this list.
  • Tool Use Generalization: OpenAI has found that learning to use one set of tools makes a model better at other sets of tools, indicating a strong generalization capability that reduces the need for “tool-specific training” [00:34:21].
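
While “grit” in the transcript refers to model training, the same idea can be approximated on the scaffolding side. A sketch, where the tool callable and its failure modes are hypothetical:

```python
# Retry a flaky external tool with exponential backoff; on final failure,
# return the error as tool output so the model can recover (retry with
# different arguments, or ask the user) instead of crashing the agent loop.
import time

def call_tool_with_grit(tool, args: dict, retries: int = 3, backoff: float = 1.0) -> dict:
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as exc:  # external API hiccup
            if attempt == retries - 1:
                return {"ok": False, "error": str(exc)}
            time.sleep(backoff * 2 ** attempt)
```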

Future Research Areas

  • Models Improving Models: Using AI models to make other models better, particularly in reinforcement learning for figuring out if a model is on the right track [00:32:04] (see the sketch after this list). Synthetic data has been an “incredibly powerful trend” in this area [00:33:18].
  • Speed of Iteration: A focus on improving the speed of running experiments by optimizing GPU usage and ensuring sufficient scale for signal extraction [00:32:26].
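
A rough stand-in for the “models improving models” idea is model grading: use one model to score whether another model’s output is on track. A sketch, where the rubric and the 0–10 scale are illustrative choices:

```python
# Sketch of an LLM grader providing a coarse reward-like signal.
from openai import OpenAI

client = OpenAI()

def grade(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Question: {question}\nCandidate answer: {answer}\n"
                    "Rate from 0 to 10 how well the answer is on track toward "
                    "solving the question. Reply with a single integer."
                ),
            }
        ],
    )
    # Assumes the model complies; production code would guard this parse.
    return int(resp.choices[0].message.content.strip())
```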

Recommendations for Companies

  • Develop Strong Evals: The most successful companies know their use case well and have robust evals to quickly test new models when they are released [00:17:58]. It’s beneficial to break down problems into subcomponents with specific evals to pinpoint what is and isn’t working [00:30:21].
  • Adapt Prompts and Scaffolding: Be prepared to switch prompts and scaffolding to tune them to particular models [00:18:15] (see the configuration sketch after this list).
  • Build “Just Out of Reach” Features: Focus on use cases that are “just out of reach” for current models (e.g., something that works one in ten times today but could reach 50% with fine-tuning). These are likely to be “crushed” by future models, letting companies be first to market [00:18:25].
  • Anticipate Future Trends: While building scaffolding to ship value today is essential, keep an eye on trends like improving context windows, reasoning capabilities, and instruction following [00:20:05]. Multimodal capabilities are also rapidly improving and are currently “underhyped” [00:20:36].
  • AI Expertise for App Companies: Michelle believes that successful app companies will need “scrappy engineers” who understand the product, rather than deep AI research expertise, as models become easier to combine and use [00:31:31].
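
One way to keep prompts and scaffolding swappable per the advice above is to externalize them by model, so evaluating a new release means adding one entry; the model names and templates here are placeholders:

```python
# Hypothetical per-model prompt registry: tune scaffolding per model
# without touching application code.
PROMPTS = {
    "gpt-4.1": "<instructions>{task}</instructions>",
    "gpt-4.1-mini": "Task: {task}\nBe concise.",
}

def build_prompt(model: str, task: str) -> str:
    template = PROMPTS.get(model, PROMPTS["gpt-4.1"])  # default scaffold
    return template.format(task=task)
```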

Model Selection Guide for Enterprise Users

For enterprise users, Michelle suggests a decision tree for model selection [00:27:31]:

  1. Start with GPT-4.1: See if it meets the use case requirements [00:27:35].
  2. For Speed/Cost Optimization: If 4.1 works, consider fine-tuning or using smaller models like Mini and Nano for faster and cheaper inference [00:27:39].
  3. For Harder Problems: If 4.1 is insufficient, try o4-mini for its reasoning capabilities, then o3, and finally RFT with o4-mini to push the frontier in specific domains [00:27:48].
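
The decision tree above, expressed as a small function; the 0.9 pass-rate threshold and the eval_score callback (a pass rate on your own evals) are hypothetical:

```python
# Sketch of the model-selection decision tree.
def pick_model(eval_score, needs_low_latency: bool) -> str:
    """eval_score(model_name) -> pass rate on your own evals, in [0, 1]."""
    if eval_score("gpt-4.1") >= 0.9:
        # 4.1 works: optimize speed/cost with smaller or fine-tuned models.
        if needs_low_latency and eval_score("gpt-4.1-mini") >= 0.9:
            return "gpt-4.1-mini"  # or nano, or a fine-tune of either
        return "gpt-4.1"
    # Harder problems: escalate through the reasoning models.
    if eval_score("o4-mini") >= 0.9:
        return "o4-mini"
    if eval_score("o3") >= 0.9:
        return "o3"
    return "o4-mini + RFT"  # push the frontier in your specific domain
```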

Overhyped vs. Underhyped

**Overhyped:** Benchmarks, especially agentic ones that are saturated or presented with unrealistic numbers [00:43:38].

**Underhyped:** Companies' own internal evals using real usage data to understand model performance [00:43:55].