From: redpointai
Michelle Pokrass, the post-training research lead for GPT-4.1 at OpenAI, played a crucial role in shaping these models for developers [00:00:04]. Her work has focused on making models practical and useful in real-world usage, moving beyond mere benchmark optimization [00:00:55].
Focus on Real-World Utility
The primary goal for GPT-4.1 was to create a model that is “a joy to use for developers” [00:01:10]. Unlike models optimized solely for benchmarks, which may falter on basic tasks like following instructions or formatting, GPT-4.1 prioritized practical application [00:01:17].
To achieve this, OpenAI focused on:
- Gathering customer feedback and transforming it into actionable evaluations (evals) for research [00:01:41].
- Developing an internal instruction-following eval based on real API usage and user input, serving as a guiding “north star” during development [00:01:59].
- Identifying recurring themes from customer feedback and internal model usage to prioritize which evals to pursue [00:03:28].
An interesting insight from user feedback was the need for models to ignore pre-existing world knowledge and only use information provided in context, a capability not typically measured by standard benchmarks [00:03:00].
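A minimal sketch of what such a context-grounding eval could look like, assuming a hypothetical `ask_model` helper that wraps a chat call; the test cases and prompt wording below are invented for illustration, not OpenAI’s internal eval:

```python
# Minimal context-grounding eval sketch (illustrative only).
CASES = [
    {
        # The context deliberately overrides world knowledge: the model
        # should answer from the provided context, not from what it "knows".
        "context": "Internal doc: our API rate limit is 7 requests per second.",
        "question": "What is the API rate limit?",
        "expected": "7",
    },
]

def run_eval(ask_model) -> float:
    """Return the fraction of cases answered from the context alone."""
    passed = 0
    for case in CASES:
        prompt = (
            "Answer ONLY from the context below. Ignore prior knowledge.\n"
            f"Context: {case['context']}\nQuestion: {case['question']}"
        )
        answer = ask_model(prompt)
        passed += case["expected"] in answer
    return passed / len(CASES)
```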
“The shelf life of an eval is like three months, unfortunately. Progress is so fast; things are getting saturated so quickly.” [00:04:47]
Michelle emphasizes the continuous need for new evals, particularly for long-context real-world scenarios and more nuanced instruction following [00:04:08].
Post-Training and Model Development
The process of shipping a model like GPT-4.1 involves a large team and significant effort [00:07:11]. The three models in the family (standard, mini, and nano) are built from either “mid-trains” (freshness updates to an existing pre-train) or entirely new pre-trains [00:07:36].
Michelle’s team focuses on post-training [00:07:46], which involves:
- Determining the optimal mix of data [00:07:52].
- Setting the best parameters for RL (Reinforcement Learning) training [00:07:54].
- Weighting different rewards [00:07:57].
This process involves a flurry of training and running “tons of experiments” to see how different datasets or parameter tweaks impact performance [00:08:15]. An alpha testing phase, typically lasting about a month, allows for rapid training and feedback incorporation [00:08:27].
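The tooling isn’t described in the interview, but the sweep workflow might be sketched like this; every dataset name, reward name, and weight below is invented for illustration:

```python
from itertools import product

# Hypothetical post-training sweep: the real data mixes and reward
# weightings are not public; these are placeholders.
data_mixes = ["base", "base_plus_coding_upweighted"]
reward_weights = [
    {"instruction_following": 1.0, "formatting": 0.5},
    {"instruction_following": 1.0, "formatting": 1.0},
]

def train_and_eval(mix: str, weights: dict) -> float:
    """Stand-in for an expensive RL run followed by the eval suite."""
    return 0.0  # placeholder score; a real pipeline would train and evaluate

# Run the grid of experiments and keep the best configuration.
results = {}
for mix, weights in product(data_mixes, reward_weights):
    results[(mix, tuple(weights.items()))] = train_and_eval(mix, weights)
best_config = max(results, key=results.get)
```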
Fine-Tuning: Enhancing AI Models
Fine-tuning is a critical technique for improving AI model performance and adaptability. Michelle categorizes fine-tuning into two main camps:
1. Fine-tuning for Speed and Latency (SFT)
This approach targets latency and cost, and remains the “workhorse” of OpenAI’s SFT (Supervised Fine-Tuning) offering [00:21:50]. If a model like GPT-4.1 works well for a task but needs to be faster, users should consider fine-tuning the mini and nano models, which can deliver similar quality on a narrow task at a fraction of the cost [00:21:46] [00:27:39].
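As a concrete sketch of the SFT path, the standard supervised fine-tuning flow through the OpenAI Python SDK looks roughly like this (model snapshot names change over time, so check the current docs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of {"messages": [...]} chat-format examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a supervised fine-tuning job on a small model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini-2025-04-14",  # verify the current snapshot name
)
print(job.id, job.status)
```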
2. Fine-tuning for Frontier Capabilities (RFT)
RFT (Reinforcement Fine-Tuning) allows users to push the frontier of what a model can do in a specific domain [00:22:08].
- Data Efficiency: RFT is “so data efficient” that it can achieve significant improvements with on the order of a hundred samples [00:22:16]. This contrasts with earlier assumptions that massive datasets were always required for fine-tuning [00:23:07].
- Internal Process: RFT uses “basically the same RL process” that OpenAI uses internally to improve its own models, indicating its effectiveness and robustness compared to SFT [00:23:32].
- Use Cases: RFT works well for:
  - Teaching an agent how to select a workflow or reason through a decision process [00:22:37].
  - Applications in deep tech where organizations possess unique, verifiable data, such as chip design or drug discovery in biology [00:22:48].
Michelle’s view on fine-tuning has evolved from being skeptical to recognizing RFT’s value for pushing capabilities in specific domains [00:44:10].
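The mechanics aren’t shown in the interview, but RFT’s reliance on verifiable data comes down to a programmatic grader that scores each rollout. Here is a hedged sketch for a workflow-selection task; the task, labels, and partial-credit scheme are entirely invented, and this is not OpenAI’s RFT API:

```python
# Hypothetical RFT-style grader: scores a model's chosen workflow
# against a verifiable ground truth.
def grade(sample: dict, model_output: str) -> float:
    """Return a reward in [0, 1] for one rollout."""
    chosen = model_output.strip().lower()
    if chosen == sample["correct_workflow"]:
        return 1.0                      # exact match: full reward
    if chosen in sample["acceptable_fallbacks"]:
        return 0.5                      # verifiable partial credit
    return 0.0                          # wrong: no reward

sample = {
    "correct_workflow": "escalate_to_human",
    "acceptable_fallbacks": {"request_more_info"},
}
print(grade(sample, "escalate_to_human"))  # 1.0
```

Because the reward is computed by code rather than by human judgment, the signal is cheap to produce and hard to game, which is what makes domains with verifiable data a good fit.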
When to Use Which Fine-Tuning Method:
- Preference Fine-tuning: For stylistic adjustments [00:24:04].
- SFT: For simpler tasks where the model gets a small percentage wrong (e.g., 10% of cases in classification) and you want to close that gap [00:24:13].
- RFT: For problems where “just no model in the market does what you need” [00:24:22], especially with verifiable data.
The Evolution of AI Models and Agents
OpenAI’s strategy generally leans into the “G” in AGI (Artificial General Intelligence), aiming for a single, general model [00:15:54]. However, GPT-4.1 was a specific instance where a “particularly acute need” for developers led to decoupling its development from ChatGPT [00:16:15]. This allowed for faster training, quicker feedback loops, and specific model training choices, such as upweighting coding data and removing ChatGPT-specific datasets [00:16:27]. Despite this specific approach, the long-term goal is simplification and convergence towards more general models [00:16:49].
State of AI Agents
Agents currently perform “remarkably well in well-scoped domains” where tools are available and user intent is clear [00:09:16]. The challenge lies in bridging the gap to the “fuzzy and messy real world,” where users may not know what an agent can do, or the agent lacks awareness of its own capabilities or real-world context [00:09:28].
Future improvements for agents include:
- Engineering Side: APIs and UIs that make it easier to follow an agent’s actions, provide summaries, and allow users to intervene and change trajectory [00:11:35].
- Model Side: Enhanced robustness and “grit” to handle errors (e.g., API 500s) without getting stuck [00:12:00]; a minimal retry sketch follows this list.
- Ambiguity Handling: Making it easier for developers to tune whether a model should ask for more information or proceed with assumptions when faced with ambiguity [00:10:00].
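Some of that robustness can be approximated in today’s scaffolding. A minimal retry-with-backoff wrapper around a flaky tool call might look like this; `tool` is any hypothetical callable that can fail transiently, and this is illustrative scaffolding, not part of any agent framework:

```python
import time

def call_tool_with_grit(tool, *args, retries: int = 3, backoff: float = 1.0):
    """Retry a flaky tool call with exponential backoff instead of giving up.

    `tool` may raise on transient failures (e.g., an HTTP 500 from a
    downstream API); we retry a few times before surfacing the error.
    """
    for attempt in range(retries):
        try:
            return tool(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the agent loop
            time.sleep(backoff * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```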
AI in Code
GPT-4.1 and other models show “remarkably good” performance in code when the problem is “locally scoped” (e.g., modifying a library where the relevant files sit close together) [00:12:31]. However, challenges remain in:
- Global Context: Reasoning about many parts of complex code [00:12:54].
- Technical Detail Transfer: Passing extremely technical details between files [00:13:01].
- Front-end Coding: While improved, further enhancement is needed to produce “beautiful” code that front-end engineers would be proud of, addressing linting and code style [00:13:13].
- Irrelevant Edits: Reducing instances where the model changes more than requested or injects its own style [00:13:33].
Future of AI Research and Application
Multimodality and Generalization
Michelle emphasizes the rapidly improving multimodal capabilities of the new pre-trains [00:20:50]. Many things that didn’t work previously now function seamlessly, suggesting that connecting models to “as much information about your task as possible” will yield better results over time [00:21:09]. The trend points towards generalization, where combining different capabilities produces “a much better result” [00:26:05].
Personalized Learning through AI
Enhanced memory is a key aspect of future models, allowing AI to learn about individual users [00:39:03]. This means a user’s ChatGPT experience will be “so different” from another’s, adapting to personal preferences and becoming more useful [00:39:07]. Increased steerability via custom instructions will also allow users to fine-tune AI personalities [00:39:25].
AI in Language Learning
Language learning is not called out explicitly in the conversation, but the same personalization and steerability trends suggest tailored educational experiences: an AI tutor that adapts to an individual’s level and preferences the way ChatGPT adapts in general conversation [00:39:03].
Accelerating Research
A significant area of focus for OpenAI is using AI models to make other models better [00:32:04], particularly in reinforcement learning where signals from models can guide improvements [00:32:10]. This includes improving the speed of iteration, allowing researchers to run more experiments faster with fewer resources [00:32:26]. Synthetic data has been an “incredibly powerful trend” in this regard [00:33:18].
Outlook on AI Progress
Michelle believes that AI model progress will continue at a rapid pace, similar to the last year, without slowing down or undergoing an immediate “fast takeoff” [00:44:56]. She estimates that even if model progress completely stopped now, there would be “10 years of building at least” to extract value from existing capabilities [00:37:08]. This “capabilities overhang” is compared to the internet, which continues to integrate into the world [00:36:42].
Combining Model Capabilities
The future GPT-5 aims to combine the diverse strengths of models like GPT-4o (a great conversationalist with strong tone matching) and o3 (strong reasoning and deep thinking) [00:37:25]. The challenge is balancing these capabilities, ensuring the model remains a “delightful chitchat partner” while knowing “when to reason” without unnecessary delays [00:38:00].
Recommendations for Companies and Developers
- Prioritize Evals: The most successful startups have deep knowledge of their use cases and robust evals [00:17:58]. This enables them to quickly assess new models when they drop [00:18:06].
- Adapt Prompts and Scaffolding: Be prepared to switch prompts and scaffoldings to tune them to particular models [00:18:15].
- Build Just Out of Reach: Focus on use cases that are “maybe just out of reach” of current models or work inconsistently (e.g., 1 out of 10 times) [00:18:25]. If fine-tuning shows a 10% pass rate that can be boosted to 50%, it’s likely a candidate for a future model to “crush it” [00:18:51].
- Embrace Scaffolding: It is “super worth it” to build scaffolding to ship value to users, even if it’s a few months of arbitrage before core capabilities improve [00:19:54]. However, keep future trends in mind (e.g., improving context windows, reasoning, instruction following) [00:20:05].
- Modular Systems: Make systems modular and easy to plug different solutions into, to move faster in the long run [00:30:47]; see the interface sketch after this list.
- Generalist Skills: Michelle is “long generalists,” believing that strong product understanding and scrappy engineering skills will be more important than deep AI research expertise for app-layer companies [00:31:29].
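One way to get the modularity Michelle describes is a thin interface that any model provider can satisfy, so swapping models is a one-line change. A minimal sketch with invented class names:

```python
from typing import Protocol

class ModelClient(Protocol):
    """Minimal interface any model provider can implement."""
    def complete(self, prompt: str) -> str: ...

class OpenAIClient:
    """One concrete provider; others can be added without touching app code."""
    def __init__(self, model: str = "gpt-4.1"):
        from openai import OpenAI
        self._client = OpenAI()
        self._model = model

    def complete(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

def summarize(doc: str, client: ModelClient) -> str:
    # Application code depends only on the interface, so a new model or
    # provider plugs in without changing this function.
    return client.complete(f"Summarize:\n{doc}")
```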
Prompting Tips for 4.1
- Structured Prompts: Using XML or other structured formats can significantly improve model performance [00:28:23].
- “Keep Going”: Explicitly instructing the model to “keep going” or “please don’t come back to me until you’ve solved the problem” can lead to “remarkably better performance” [00:28:30]. OpenAI aims to train future models to exhibit this persistence by default [00:28:54]. An example combining both tips follows.
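Putting both tips together, a prompt might look something like this; the tag names and task are illustrative, not an official template:

```python
# Illustrative 4.1-style prompt: XML-like structure plus an explicit
# persistence instruction. The tags are arbitrary.
prompt = """
<instructions>
You are fixing a failing test suite.
Keep going until every test passes; please don't come back to me
until you've solved the problem.
</instructions>
<context>
Repository: a Python package with pytest tests under tests/.
</context>
<task>
Find the failing tests, fix the underlying bugs, and re-run the suite.
</task>
"""
```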
Conclusion
Michelle Pokrass encourages developers and power users to provide feedback, especially when models aren’t working well with specific prompts, to help continuously improve the models [00:46:12].