From: redpointai
OpenAI’s approach to AI model development, particularly with GPT-4.1, prioritizes real-world utility and developer experience over benchmark performance alone [00:00:54]. Michelle Pokrass, Post-Training Research Lead at OpenAI, emphasized that the goal was to create a model that is a “joy to use for developers” [00:01:10].
Focus on Real-World Usage over Benchmarks
Traditional model optimization for benchmarks can lead to models that perform well on paper but “stumble over basic things” in practice, such as not following instructions, having weird formatting, or insufficient context length [00:01:17]. To address this, OpenAI focused on direct developer feedback [00:01:32], transforming it into actionable evaluations (evals) for research [00:01:40].
A “long leadup” to model training was dedicated to “getting our house in order on evals” [00:01:47]. An internal instruction following eval, based on real API usage and user feedback, served as a “north star” during development [00:02:00].
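To make this concrete, here is a minimal, hypothetical sketch of what turning developer feedback into an instruction-following eval can look like. The cases, checks, and structure are illustrative assumptions, not OpenAI’s internal eval; only the OpenAI Python SDK calls are real.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical cases distilled from developer complaints ("it ignores my formatting
# instructions"); each pairs a prompt carrying an explicit instruction with a check.
CASES = [
    {
        "prompt": "List three prime numbers. Respond with a JSON array of integers and nothing else.",
        "check": lambda out: isinstance(json.loads(out), list),
    },
    {
        "prompt": "Summarize 'The cat sat on the mat.' in at most five words.",
        "check": lambda out: len(out.split()) <= 5,
    },
]

def run_eval(model: str) -> float:
    """Return the fraction of cases where the model followed the instruction."""
    passed = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        output = resp.choices[0].message.content.strip()
        try:
            passed += bool(case["check"](output))
        except Exception:
            pass  # malformed output (e.g. invalid JSON) counts as a failure
    return passed / len(CASES)

print(f"instruction-following pass rate: {run_eval('gpt-4.1'):.0%}")
```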
Challenges in Evaluation Development
Identifying the most critical evals is challenging because users often don’t provide a comprehensive list, but rather describe “weird” edge cases [00:02:26]. A significant amount of work involves talking to users to extract key insights [00:02:51]. For example, it was discovered that models needed to learn to ignore world knowledge and solely use in-context information for specific user needs, a requirement not captured by standard benchmarks [00:02:57].
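That requirement can be expressed as an eval case by handing the model a document that contradicts common world knowledge and checking that it answers from the document alone. The snippet below is a hedged sketch; the document, question, and expected string are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()

# A deliberately counterfactual document: a correct answer must come from the text,
# not from the model's world knowledge.
document = (
    "Company fact sheet (fictional): the tallest building in the world is the "
    "Skyhaven Tower, completed last year at a height of 1,200 meters."
)
question = "According to the document, what is the tallest building in the world?"

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Answer using only the provided document. Ignore outside knowledge."},
        {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
    ],
)
answer = resp.choices[0].message.content
print("PASS" if "Skyhaven" in answer else "FAIL", "-", answer)
```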
Decisions on which evals to pursue are based on recurring customer themes, internal model usage, and feedback from internal customers building on the models [00:03:27]. OpenAI actively seeks more real-world, long-context evals, which are difficult to create [00:04:08], and better definitions for “instruction following,” as it encompasses “hundreds of different things” for users [00:04:26].
A surprising challenge emerged where one alpha user preferred an earlier version of GPT-4.1 over the final shipping version, despite all evals showing improvement. This highlighted how niche use cases might not be covered by standard metrics [00:05:02].
The Model Shipping Process
Shipping a model like GPT-4.1 involves a large team [00:07:14]. The latest release included semi-new pre-trained models: a “mid-train” freshness update for the standard size, and new pre-trains for the mini and nano models [00:07:19].
Post-Training and Iteration
The post-training team focuses on determining the best mix of data, optimal parameters for Reinforcement Learning (RL) training, and the appropriate weighting of different rewards (a toy illustration of reward weighting follows the list below) [00:07:46]. The development of GPT-4.1 involved:
- Identifying developer pain points with previous models [00:08:01].
- Approximately three months focused on evaluation [00:08:08].
- A “flurry of training” for the subsequent three months, involving running “tons of experiments” with different datasets and parameter tweaks [00:08:13].
- Linking these experiments with new pre-trains [00:08:23].
- About one month of rapid alpha testing, training, and incorporating feedback [00:08:27].
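As a toy illustration of the reward-weighting part of that work (the components and weights below are hypothetical, not OpenAI’s actual reward mix), combining several reward signals into a single RL training signal can be as simple as a weighted sum:

```python
# Hypothetical reward components and weights; a real post-training run would tune
# this mix experimentally, which is part of what the experiments above explore.
REWARD_WEIGHTS = {
    "instruction_following": 0.5,
    "code_correctness": 0.3,
    "formatting": 0.2,
}

def combined_reward(component_scores: dict) -> float:
    """Weighted sum of per-component rewards, each assumed to lie in [0, 1]."""
    return sum(weight * component_scores.get(name, 0.0)
               for name, weight in REWARD_WEIGHTS.items())

# A sample that follows instructions well but produces mediocre code.
print(combined_reward({"instruction_following": 0.9, "code_correctness": 0.4, "formatting": 1.0}))
```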
Challenges in Model Development and Evaluation Lifespan
A key challenge is the short “shelf life of an eval,” which is about three months due to rapid progress and quick saturation of benchmarks [00:08:48]. This necessitates continuous hunting for new evaluation methods [00:08:55]. While some benchmarks, like SWEBench for code and the Aider evals, remain useful, many become saturated and lose their utility [00:15:08].
Rapid AI Progress and Adaptability
Staying current with the rapid pace of AI model releases (a new model seemingly drops every month) is a significant challenge for companies [00:17:40]. Best practices for companies using these APIs include:
- Strong Evals: The most successful startups “know their use case really well” and have robust evals, allowing them to quickly test new models (see the sketch after this list) [00:17:58].
- Flexible Prompts and Scaffolding: Customers who can easily switch and tune their prompts and scaffolding to particular models are highly successful [00:18:11].
- Building for the Near Future: Develop products that are “just out of reach” of current models [00:18:22]. If a problem achieves a 10% pass rate with current models and can be fine-tuned to 50%, it’s likely a future model will “crush it” in a few months, making such companies first to market [00:18:48].
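A hedged sketch of what the first two practices can look like in code: a fixed eval suite plus per-model prompt configuration, so a newly released model can be dropped in and compared within minutes. The models, prompts, and cases below are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical per-model prompt overrides; keeping these in configuration makes it
# cheap to re-run the same suite whenever a new model ships.
MODEL_CONFIGS = {
    "gpt-4.1-mini": {"system": "Answer concisely. When asked for JSON, output JSON only."},
    "gpt-4.1": {"system": "Answer concisely. When asked for JSON, output JSON only."},
}

EVAL_CASES = [  # toy cases; a real suite would mirror the product's actual traffic
    {"prompt": "Return the number 7 as a JSON object with the key 'value'.",
     "check": lambda out: '"value"' in out and "7" in out},
]

def pass_rate(model: str) -> float:
    config = MODEL_CONFIGS[model]
    passed = 0
    for case in EVAL_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": config["system"]},
                      {"role": "user", "content": case["prompt"]}],
        )
        passed += bool(case["check"](resp.choices[0].message.content))
    return passed / len(EVAL_CASES)

for model_name in MODEL_CONFIGS:
    print(model_name, f"{pass_rate(model_name):.0%}")
```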
Scaffolding and Future Trends
Companies often build extensive scaffolding around current model limitations to make products work today [00:19:19]; a minimal example of such scaffolding appears after the list below. While this arbitrage is worthwhile for shipping immediate value [00:19:54], it’s crucial to be prepared for changes as model capabilities improve, potentially obviating existing scaffolding [00:20:05]. Key trends to monitor include:
- Improving Context Windows: These will continue to get better [00:20:19].
- Enhanced Reasoning Capabilities: Models will become better at complex reasoning [00:20:22].
- Better Instruction Following: This will consistently improve [00:20:24].
- Natively Multimodal Models: Models are becoming “so natively multimodal” and easy to use with different data types; connecting them to as much information as possible, even where today’s results are mediocre, will yield better outcomes as capabilities improve [00:20:36].
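As an example of the scaffolding mentioned above, here is a hedged sketch that works around a context-window limit by summarizing chunks before answering. The chunk size, prompts, and model choices are assumptions, and the whole scaffold becomes unnecessary once a model can read the full document directly.

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_over_long_doc(document: str, question: str, chunk_chars: int = 20_000) -> str:
    """Map-reduce scaffold: summarize chunks that fit in context, then answer over
    the summaries. As context windows grow, this can collapse to a single call."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    notes = [ask("gpt-4.1-mini", f"Extract facts relevant to: {question}\n\n{chunk}")
             for chunk in chunks]
    return ask("gpt-4.1", f"Using these notes, answer: {question}\n\nNotes:\n" + "\n".join(notes))
```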
Fine-Tuning and Model Capabilities
Fine-tuning has seen a “renaissance” [00:21:26]. It’s generally categorized into two camps:
- Fine-tuning for speed and latency: Using models like GPT-4.1 and then fine-tuning them (SFT) for faster, cheaper inference, making them a “workhorse” (see the sketch after this list) [00:21:46].
- Fine-tuning for frontier capabilities (RFT): With Reinforcement Fine-Tuning (RFT), users can “push the frontier” in specific niche areas with remarkable data efficiency (hundreds of samples) [00:22:08]. RFT is particularly useful for teaching agents workflows or decision processes, and in deep tech where unique, verifiable data allows for “absolute best results” [00:22:33].
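For the first camp, the workflow is the standard supervised fine-tuning API: upload examples of the narrow task, then train a smaller model on them. The sketch below uses real OpenAI SDK calls, but the file contents, model snapshot name, and omitted hyperparameters are placeholders to adapt.

```python
from openai import OpenAI

client = OpenAI()

# train.jsonl holds one chat example per line, e.g.
# {"messages": [{"role": "user", "content": "Classify sentiment: 'great battery life'"},
#               {"role": "assistant", "content": "positive"}]}
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Distil the task onto a smaller, cheaper model so it can serve as the low-latency
# "workhorse"; the snapshot name below is a placeholder for whichever model you target.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini-2025-04-14",
)
print(job.id, job.status)
```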
RFT is essentially the same RL process OpenAI uses internally for model improvement and is “less fragile” than SFT [00:23:32]. A mental model for choosing a fine-tuning approach:
- Stylistic changes: Use Preference Fine-Tuning [00:24:02].
- Simple accuracy gaps: Use SFT (e.g., closing a 10% error gap for classification with a model like Nano) [00:24:09].
- No model on the market works: Turn to RFT when no existing model can meet the need [00:24:22]. Verifiable domains such as chip design, biology, and drug discovery (where outcomes, even if long-term, are verifiable) are strong candidates for RFT, as sketched below [00:24:47].
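What makes a domain “verifiable” is that a reward can be computed programmatically rather than judged subjectively. The sketch below is a generic illustration of such a verifier for code, not OpenAI’s RFT grader format; chip design or biology would substitute simulation results or assay predictions as the check.

```python
import pathlib
import subprocess
import sys
import tempfile

def code_reward(candidate_solution: str, test_code: str) -> float:
    """Toy verifiable reward: 1.0 if the model's code passes the provided tests, else 0.0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = pathlib.Path(workdir) / "check.py"
        script.write_text(candidate_solution + "\n\n" + test_code)
        result = subprocess.run([sys.executable, str(script)], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0

# Example: the model must implement add(); the assert acts as the verifier.
print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```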
Overhyped vs. Underhyped Areas in AI
- Overhyped: Benchmarks, especially agentic ones that are saturated or where reported “absolute best numbers” differ from realistic usage [00:43:38].
- Underhyped: Companies’ own evals and using real usage data to understand what’s working [00:43:55]. Michelle’s personal view on fine-tuning shifted from being a “bear” to recognizing RFT’s value for pushing the frontier in specific domains [00:44:10]. This change was influenced by the fact that RFT provides access to the same reinforcement learning algorithms used internally for model training, enabling users to achieve capabilities previously exclusive to OpenAI [00:44:42].
Future of Model Development and OpenAI’s Strategy
The “shelf life of an eval” being only about three months points to the continuing challenge of evaluating progress [00:08:48]. New benchmarks, like successors to SWEBench, will be necessary as existing ones become saturated [00:15:16].
Generalization vs. Specialization
OpenAI’s general philosophy leans towards the “G in AGI,” aiming for “one model that’s general” [00:15:54]. The goal is to simplify product offerings and model selection (e.g., in ChatGPT) [00:16:04]. However, for GPT-4.1, a decision was made to “decouple from ChatGPT” due to an acute need from developers, allowing for faster training, feedback, and shipping on a different timeline [00:16:15]. This allowed specific choices like removing ChatGPT-specific datasets and significantly upweighting coding data [00:16:35].
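As a toy illustration of what “upweighting” a data source means in practice (the sources and weights below are hypothetical, not GPT-4.1’s actual mix), it amounts to raising the probability that the next training example is drawn from that source.

```python
import random

# Hypothetical sampling weights for a post-training data mix; upweighting coding data
# simply means giving it a larger share of sampled examples.
MIX_WEIGHTS = {"coding": 0.5, "instruction_following": 0.3, "chat": 0.2}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training example comes from."""
    sources, weights = zip(*MIX_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({source: round(draws.count(source) / len(draws), 3) for source in MIX_WEIGHTS})
```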
Despite this targeted approach, the expectation is generally to simplify in the future, as models improve when “creative energies of all researchers at OpenAI are working on them” [00:16:49]. There is “room for both” targeted and general models, and OpenAI may pursue targeted releases again if demand exists [00:17:11]. The trend of cross-domain generalization, where capabilities in one area benefit others, supports the generalist approach [00:17:01].
Regarding specialized foundation models (e.g., robotics, biology), the current trend observed internally is that “combining everything just produces a much better result,” supporting the idea of general models [00:25:49].
Challenges in GPT-5 Development
The core challenge for GPT-5 is combining the distinct capabilities of different models [00:37:21]. For example, GPT-4o is excellent for chat, matching tone, and conversational flow, while o3 excels at hard reasoning problems [00:37:27]. The task is to train a single model that can be a “delightful chitchat partner” but also knows when to engage in deep reasoning, which involves “zero-sum decisions” on data weighting [00:38:00].
Internal Process and Future Research
OpenAI is focused on improving its “speed of iteration” in research [00:32:26]. This involves running more experiments with fewer GPUs to quickly determine if a job is working [00:32:30]. This is not purely an infrastructure problem; it also involves ML challenges to ensure training at sufficient scale for signal detection [00:32:47].
A major area of excitement is using models to make models better, particularly in reinforcement learning [00:32:04]. This leverages “synthetic data,” which has been an “incredibly powerful trend” [00:33:18]. The more powerful models become, the easier it is to improve future models [00:33:28].
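A hedged sketch of the general pattern of models improving models via synthetic data: a strong model drafts candidate training examples, a model-based judge filters them, and accepted examples are saved for later training. The model names, prompts, and filename are placeholders, and this is not a description of OpenAI’s internal pipeline.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_candidate(topic: str) -> dict:
    """Ask a strong model to draft a question/answer training example."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content":
                   f"Write a hard question about {topic} and a correct, concise answer. "
                   'Return JSON: {"question": "...", "answer": "..."}'}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def judge(example: dict) -> bool:
    """Use a model as a filter; only accepted examples become training data."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content":
                   "Is this answer correct and well-posed? Reply YES or NO.\n" + json.dumps(example)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

with open("synthetic.jsonl", "w") as f:
    for _ in range(5):
        example = generate_candidate("long-context retrieval")
        if judge(example):
            f.write(json.dumps({"messages": [
                {"role": "user", "content": example["question"]},
                {"role": "assistant", "content": example["answer"]},
            ]}) + "\n")
```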
Conclusion
The field of AI model training and deployment is characterized by rapid advancements and evolving challenges. From the constant need for fresh evals to the strategic balancing act between specialized and generalized models, the process requires a deep understanding of both technical capabilities and user needs. The trend toward more data-efficient fine-tuning and the use of AI to improve AI itself suggests a future of continued innovation and capability expansion. Michelle Pokrass estimates that even if model progress stopped today, there would still be “10 years of building at least” to extract value from existing capabilities [00:37:08].