From: redpointai
The development of AI models, particularly at OpenAI, has shifted focus from traditional benchmarks to real-world utility and user satisfaction, with continuous customer feedback driving model refinement [00:00:55]. Michelle Pokrass, a post-training research lead at OpenAI, highlights that the goal is to create models that are a “joy to use for developers” [00:01:09].
Shifting Focus from Benchmarks to Utility
Historically, AI models were often optimized for benchmarks, which, while looking “really great,” could lead to models stumbling over basic real-world issues like instruction following or formatting [00:01:17]. For GPT-4.1, the focus was explicitly on what developers had been requesting, turning that feedback into actionable evaluations that could be used during research [00:01:32]. OpenAI developed an internal instruction following eval based on real API usage and user input, serving as a “north star” during development [00:01:57].
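The episode doesn’t show the internal eval itself, but the shape of the idea is simple; below is a minimal sketch, using the public `openai` Python SDK, of how recurring developer complaints can become pass/fail prompt checks. The cases here are hypothetical, not OpenAI’s.

```python
# Minimal sketch of an instruction-following eval built from developer
# feedback: each case pairs an instruction-heavy prompt with a programmatic
# check. The cases are hypothetical; the real eval uses actual API traffic.
import json
from openai import OpenAI

client = OpenAI()

CASES = [
    {
        "prompt": "Reply with exactly three bullet points and no preamble.",
        "check": lambda out: sum(l.lstrip().startswith("-")
                                 for l in out.splitlines()) == 3,
    },
    {
        "prompt": "Return a JSON object with keys 'name' and 'id', nothing else.",
        "check": lambda out: set(json.loads(out)) == {"name", "id"},
    },
]

def pass_rate(model: str) -> float:
    passed = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        try:
            passed += bool(case["check"](resp.choices[0].message.content))
        except Exception:
            pass  # unparseable output counts as a failure
    return passed / len(CASES)

print(pass_rate("gpt-4.1"))
```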
The Feedback Gathering Process
Gathering customer feedback is often about understanding subtle issues rather than explicit requests for specific evaluations [00:02:33]. Users might describe an issue as “kind of weird in this one use case,” requiring OpenAI to actively pull out key insights by generating prompts and investigating [00:02:42]. For example, a recent insight revealed that models could improve in situations where they need to “ignore everything you know about the world and only use the information in context” [00:03:00]. This type of nuanced feedback would not typically be captured by standard benchmarks [00:03:13].
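As an illustration of that insight (the specifics below are invented, not from the episode), a context-grounding test case deliberately puts information in the prompt that overrides what the model might otherwise “know,” then grades whether the answer came from the context:

```python
# Hypothetical test for context-only grounding: the prompt supplies
# authoritative information, and the grader checks the model answered
# from that context rather than from prior knowledge.
context = (
    "Internal knowledge base (authoritative for this task):\n"
    "As of this quarter, the Orion API's rate limit is 12 requests/minute."
)
question = "What is the Orion API's rate limit?"

prompt = (
    "Answer using ONLY the information in the context below. "
    "Ignore anything you otherwise know.\n\n"
    f"{context}\n\nQuestion: {question}"
)

def grade(output: str) -> bool:
    # Pass only if the in-context figure appears in the answer.
    return "12" in output
```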
The determination of what evaluations matter most comes from:
- Repeated themes in customer feedback [00:03:30].
- Internal usage of the models [00:03:34].
- Insights from internal customers building on OpenAI’s models [00:03:39].
OpenAI actively seeks more feedback, particularly for long-context real-world evaluations and instruction following [00:04:08]. The company offers an “eval product” where users can opt-in to receive free inference on their evaluations in exchange for their data [00:04:01].
The Model Development and Evaluation Cycle
The process of shipping a model like GPT-4.1 involves significant effort from large teams [00:07:14]. This includes:
- Pre-training: Developing new “mid-trains” or fresh pre-trains for different model sizes (standard, mini, nano) [00:07:22].
- Post-training: Michelle’s team focuses on post-training, which involves determining the best mix of data, parameters for reinforcement learning (RL) training, and weighting of different rewards [00:07:46].
- Iterative Evaluation: A lead-up of three months was spent on developing and refining evaluations to understand the biggest model problems [00:08:08]. This was followed by a three-month period of intensive training, running numerous experiments to tweak datasets and parameters [00:08:13].
- Alpha Testing: A final month of alpha testing involved rapid training cycles, gathering feedback, and incorporating it as much as possible [00:08:27].
A significant challenge is the rapid saturation of benchmarks; the “shelf life of an eval is like three months” due to fast progress [00:08:48]. This necessitates a continuous hunt for new and relevant evaluations [00:08:55].
Unexpected Discoveries and User Adoption
Upon release, GPT-4.1 showed unexpected capabilities, particularly improved UI generation and coding [00:06:05]. The “nano” model, being small, cheap, and fast, spurred significant AI adoption, demonstrating demand for models at many points along the cost-latency curve [00:06:15].
Evaluating AI Systems and Managing Human-AI Collaboration
Current State of Agents
Agents work “remarkably well” in well-scoped domains where the model has the right tools and clear user intent [00:09:16]. The challenge now lies in bridging the gap to the “fuzzy and messy real world,” where user requests are ambiguous, or the agent lacks awareness of its own capabilities [00:09:28]. Michelle believes many capabilities are already present, but the difficulty is getting sufficient context into the model [00:09:53]. Future improvements include enabling developers to tune for ambiguity (e.g., asking for more information vs. proceeding with assumptions) [00:10:00].
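There is no dedicated API knob for this yet, so today the trade-off lives in the prompt. A minimal, purely illustrative sketch:

```python
# Illustrative only: steering an agent's behavior under ambiguity through
# the system prompt, since no dedicated API setting exists for this today.
ASK_FIRST = (
    "If the user's request is ambiguous or missing key details, ask one "
    "concise clarifying question before taking any action."
)
PROCEED = (
    "If the user's request is ambiguous, make a reasonable assumption, "
    "state it explicitly, and proceed; do not ask clarifying questions."
)

def build_messages(user_request: str, ask_when_unsure: bool) -> list[dict]:
    system = ASK_FIRST if ask_when_unsure else PROCEED
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_request},
    ]
```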
When OpenAI digs into “incorrect” answers on agentic tool-use benchmarks, many turn out to be misgraded, ambiguous, or cases where the model simply did not follow instructions [00:10:47]. This suggests that existing benchmarks for function calling and agentic tool use are becoming saturated [00:11:11].
Long-Term Task Execution
To make progress on longer, multi-step, and ambiguous tasks, both engineering and model-side changes are needed [00:11:30].
- Engineering: Develop APIs and UIs that allow users to follow an agent’s actions, see summaries, and “jump in and change the trajectory” [00:11:35] (see the sketch after this list).
- Modeling: Enhance robustness and “grit” so models don’t get stuck when external systems (like APIs) fail [00:11:58].
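On the engineering side, a minimal sketch of such an interface, with all types and step logic hypothetical, is an agent loop that surfaces each step and accepts mid-run redirection:

```python
# Sketch of an interruptible agent loop: every step is surfaced to the user,
# who can let it continue, redirect it, or stop it. Types are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    goal: str
    steps: list[str] = field(default_factory=list)

def agent_step(traj: Trajectory) -> str:
    # Placeholder for a real model/tool call that returns a step summary.
    return f"step {len(traj.steps) + 1} toward: {traj.goal}"

def run_agent(goal: str, max_steps: int = 10) -> Trajectory:
    traj = Trajectory(goal)
    for _ in range(max_steps):
        summary = agent_step(traj)
        traj.steps.append(summary)
        print(summary)  # surface the trajectory as it unfolds
        cmd = input("[enter]=continue, text=redirect, q=stop> ").strip()
        if cmd == "q":
            break
        if cmd:  # the user jumps in and changes the trajectory
            traj.goal = cmd
    return traj
```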
Model Evaluation in Code Generation
GPT-4.1 and other models excel at locally scoped coding problems, where the relevant files sit close together [00:12:31]. However, challenges remain in tasks requiring “global context” or reasoning about many disparate parts of a codebase [00:12:50]. While front-end coding has improved significantly, the goal is for models to produce code a front-end engineer would be “proud of,” with attention to linting and code style [00:13:13]. Another ongoing focus is reducing “irrelevant edits,” where the model changes more than was asked; the rate fell from 9% with GPT-4o to 2% with GPT-4.1 [00:13:33].
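The episode doesn’t specify how the irrelevant-edit rate is computed; one plausible back-of-envelope version, assuming “irrelevant” means changed lines outside the region the request targeted, looks like this:

```python
# Rough sketch of an irrelevant-edit metric: the fraction of changed lines
# that fall outside the region the request actually targeted. This
# definition is an assumption; OpenAI's exact measurement isn't described.
import difflib

def changed_lines(before: str, after: str) -> set[int]:
    """Line numbers in `before` that the edit touched."""
    touched = set()
    matcher = difflib.SequenceMatcher(
        None, before.splitlines(), after.splitlines()
    )
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag != "equal":
            touched.update(range(i1, i2))
    return touched

def irrelevant_edit_rate(before: str, after: str, requested: set[int]) -> float:
    touched = changed_lines(before, after)
    if not touched:
        return 0.0
    return len(touched - requested) / len(touched)
```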
While benchmarks like SWE-bench remain useful for differentiating significant performance gaps (e.g., 55% vs. 35% pass rates), many others are “fully saturated and not useful” [00:14:53]. The strategy is to “use the most out of an eval during its lifespan and then move on and create another one” [00:15:14].
Best Practices for Businesses in a Rapidly Evolving AI Landscape
Companies using AI APIs need strategies to stay current with rapid model progress:
- Develop Strong Evals: The most successful startups have deep knowledge of their use case and “really good evals” that allow them to quickly test new models when they drop [00:17:58].
- Adapt Prompts and Scaffolding: Be able to switch between models quickly, tuning prompts and scaffolding to each one [00:18:13] (see the sketch after this list).
- Build for “Just Out of Reach” Capabilities: Focus development on problems that current models struggle with (e.g., works 1/10 times, but ideally 9/10). These are likely to be “crushed” by future models [00:18:22]. A heuristic for “just out of reach” is if fine-tuning can improve a 10% pass rate to 50% [00:18:51].
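Putting the first two recommendations together, a harness that keeps per-model prompts alongside a fixed eval suite makes trying a newly released model a one-line config change. Everything below is illustrative, using the public `openai` SDK; the configs and case format are assumptions.

```python
# Sketch of the "strong evals + per-model prompts" pattern: when a new model
# drops, add a config entry and rerun the same suite. Names are hypothetical.
from openai import OpenAI

client = OpenAI()

MODEL_CONFIGS = {
    # Prompts and scaffolding tuned per model, so swapping models is cheap.
    "gpt-4.1":      {"system": "Be concise. Follow formats exactly."},
    "gpt-4.1-mini": {"system": "Be concise. Think step by step, then answer."},
}

def score(model: str, cases: list[dict]) -> float:
    # Each case: {"prompt": str, "check": callable(str) -> bool}.
    cfg = MODEL_CONFIGS[model]
    passed = 0
    for case in cases:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": cfg["system"]},
                {"role": "user", "content": case["prompt"]},
            ],
        )
        passed += bool(case["check"](resp.choices[0].message.content))
    return passed / len(cases)

# When a new model ships: add a config entry, run score() on the same cases,
# and compare against the incumbent before switching production traffic.
```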
Building scaffolding for current model limitations is “super worth it” for startups to deliver immediate value and achieve a few months of “arbitrage” [00:19:54]. However, it’s crucial to be prepared to adapt, keeping future trends in mind, such as improving context windows, reasoning capabilities, and instruction following [00:20:05]. Multimodal capabilities are also rapidly improving, making it worthwhile to connect models to as much task information as possible, even if results are “meh” today [00:20:41].
The Role of Fine-Tuning: SFT vs. RFT
Fine-tuning has seen a “renaissance” [00:21:26] and is generally categorized into two camps:
- Fine-tuning for Speed and Latency (SFT): Supervised fine-tuning is the “workhorse” for making models like GPT-4.1 run at a fraction of the latency [00:21:46]. It suits simpler tasks, such as classification, where a model might be wrong 10% of the time and SFT can close that gap [00:24:13] (a sketch of this path follows the list).
- Fine-tuning for Frontier Capabilities (RFT): Reinforcement fine-tuning (RFT), based on the same RL process OpenAI uses internally, allows pushing the frontier in specific areas [00:22:08]. RFT is “extremely data efficient,” often requiring only hundreds of samples [00:22:16], and it is less fragile than SFT [00:23:38]. RFT is particularly useful for:
- Teaching an agent to pick workflows or reason through decision processes [00:22:37].
- Deep tech applications where an organization has unique, verifiable data, such as chip design or drug discovery [00:22:48]. RFT is recommended when “no model in the market does what you need” [00:24:22].
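The SFT path referenced in the list might look like the following, using the public fine-tuning API. The example rows and the choice of `gpt-4.1-mini` as a fine-tunable base are assumptions to check against the current docs.

```python
# Sketch of the SFT path for a simple classification task, via the public
# fine-tuning API. Data and model snapshot are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Supervised examples: prompt -> desired label, in chat format.
examples = [
    {"messages": [
        {"role": "user", "content": "Ticket: 'App crashes on login.' Label?"},
        {"role": "assistant", "content": "bug"},
    ]},
    # ... hundreds to thousands more rows for a typical SFT run
]

with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")

training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini",  # small, fast base; assumed fine-tunable
)
print(job.id)
```

RFT differs in what you supply: instead of labeled completions, you provide a grader that scores model outputs, and the model is trained with RL against that signal.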
Model Selection for Businesses
For developers, the recommendation is to “start with 4.1” and see if it works well [00:27:33]. If speed is a priority, explore GPT-4.1 mini and nano, and consider fine-tuning them [00:27:39]. If 4.1 struggles with certain tasks, move to o4-mini for its reasoning capabilities, then o3, and finally RFT on o4-mini if needed [00:27:48].
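Encoded as a simple helper (the eval function and the 90% threshold are placeholders, not OpenAI guidance), that escalation path looks like this:

```python
# The episode's escalation path as a helper: start with gpt-4.1 and reach
# for reasoning models only when your evals say you need them.
# eval_pass_rate is hypothetical; plug in your own eval suite.
ESCALATION = ["gpt-4.1", "o4-mini", "o3"]

def pick_model(eval_pass_rate, threshold: float = 0.9) -> str:
    for model in ESCALATION:
        if eval_pass_rate(model) >= threshold:
            return model
    return "o4-mini"  # nothing clears the bar: a candidate for RFT
```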
Future of AI Models and Research
Combining Model Capabilities
OpenAI aims to simplify its product offering, moving towards “one model that’s general” to lean into the “G in AGI” [00:15:54]. While GPT-4.1 temporarily decoupled from ChatGPT to move faster and optimize for developers (e.g., removing chat-specific data, upweighting coding data), the general philosophy is that models improve when “creative energies of all researchers at OpenAI are working on them” [00:16:15].
The challenge for future models like GPT-5 is combining diverse capabilities: being a “delightful chitchat partner” while also knowing “when to reason” without unnecessary delays [00:37:57]. This involves striking the right balance in training data, as some decisions (e.g., upweighting coding data) can be zero-sum [00:38:21].
Research Directions
Key research areas include:
- Models Improving Models: Using existing models to generate signals and synthetic data for improving future models, a “remarkably powerful trend” [00:32:04] (sketched after this list).
- Speed of Iteration: Accelerating the research cycle by running more experiments with fewer GPUs, allowing researchers to quickly determine if an approach is working [00:32:26]. This requires ensuring sufficient scale for signals during training [00:32:51].
- Generalization of Tool Use: While products like Deep Research focused on deep training on specific tools, the trend with O3 models shows that learning one set of tools improves the model’s ability to use other tools more broadly [00:34:04]. This suggests less need for “one tool specific training” going forward [00:34:27].
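For the “models improving models” trend flagged above, a minimal sketch, with hypothetical prompts and an assumed teacher/grader pairing, is a generate-then-filter loop:

```python
# Sketch of "models improving models": a strong teacher model drafts
# candidate training examples and only graded-good ones are kept.
# Prompts and model choices here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def generate_candidates(task: str, n: int) -> list[str]:
    out = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="o3",  # strong teacher model, assumed
            messages=[{"role": "user",
                       "content": f"Write one training example for: {task}"}],
        )
        out.append(resp.choices[0].message.content)
    return out

def keep(example: str) -> bool:
    # Filter with a model-based grader; a real pipeline would also dedupe
    # and spot-check a sample by hand.
    prompt = f"Is this a clear, correct training example? Answer yes or no.\n\n{example}"
    resp = client.chat.completions.create(
        model="gpt-4.1",  # grader model, assumed
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

synthetic = [ex for ex in generate_candidates("summarize support tickets", 50)
             if keep(ex)]
```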
Future capabilities expected sooner rather than later include coding agents, given current benchmark performance, and long workflows like customer support, where models can build on previous tool calls and outputs [00:35:07].
Industry Perspectives on AI Development
- Overhyped: Benchmarks, especially agentic ones that are saturated, or claims based on “absolute best numbers” rather than realistic performance [00:43:38].
- Underhyped: Internal evaluations using real usage data to understand what works well [00:43:55].
- Model Progress: Expected to continue at a fast pace, similar to the last year, without necessarily being a “fast takeoff” [00:44:56].
Michelle Pokrass leads the “power users research team” at OpenAI, focusing on users who push the models’ limits (e.g., developers and discerning ChatGPT users). This focus is strategic because “the things that the power users are doing today are going to be the things that the median users are doing a year from now,” providing valuable insights for future model improvements [00:41:42].