From: aidotengineer
Public AI benchmarks are often criticized for being rigged, and major players have significant incentives to keep it that way [00:00:01]. According to Darius, CEO of Scorecard, who built evaluation systems for autonomous cars at Waymo and Uber ATG and for rockets at SpaceX, and who holds patents on evaluating autonomous systems, companies across legal tech, health tech, and finance have employed a recurring set of "eval tricks" [00:00:10].
The Influence and Flaws of Public Benchmarks
A benchmark is defined by three components: the model being tested, a test set of questions, and a metric for scoring [00:00:48]. The crucial aspect is that benchmarks standardize the test set and metrics across models to allow for comparability, similar to the SAT [00:01:04].
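To make those three components concrete, here is a minimal sketch of a benchmark as a fixed test set plus a fixed metric applied to an arbitrary model. The names (`BenchmarkItem`, `run_benchmark`, `exact_match`) are illustrative placeholders, not from any real harness:

```python
# Minimal sketch of the three benchmark components: a model under test,
# a shared test set, and a scoring metric. Names are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    question: str
    reference_answer: str


def exact_match(prediction: str, reference: str) -> float:
    """Metric: 1.0 if the prediction matches the reference exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def run_benchmark(
    model: Callable[[str], str],           # the model being tested
    test_set: List[BenchmarkItem],         # standardized questions
    metric: Callable[[str, str], float],   # standardized scoring
) -> float:
    """Average metric score over the test set; scores are comparable across
    models only because the test set and metric are held fixed."""
    scores = [metric(model(item.question), item.reference_answer) for item in test_set]
    return sum(scores) / len(scores)
```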
However, these scores wield immense influence, controlling billions in market value, investment decisions, and public perception [00:01:22]. Simon Willison noted that billions of dollars are riding on these scores [00:01:31]. When a company like OpenAI or Anthropic claims the top spot, it affects funding, enterprise contracts, developer mind share, and market dominance [00:01:36]. A single benchmark score can define market leaders and dismantle competitors [00:02:02]. For example, Sonar acquired AutoCodeRover largely because it showed strong results on SWE-bench [00:01:54].
Common Tricks to Game the System
When the stakes are high, companies find “creative ways to win” [00:02:12].
1. Apples-to-Oranges Comparisons
This trick involves comparing the best configuration of one model against the standard configuration of another [00:02:22].
- Example: xAI released benchmark results for Grok 3, showing it beating competitors [00:02:27]. However, OpenAI engineers found that xAI had compared its best configuration against other models' standard configurations [00:02:37]. In particular, the chart omitted OpenAI's o3 models' high performance at "consensus@64," which involves running the model 64 times and taking the consensus answer, a much more expensive process [00:03:00]. Claiming performance leadership requires comparing best-to-best or standard-to-standard (a minimal sketch of consensus@64 appears below) [00:03:04].
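For reference, a hedged sketch of what "consensus@64" means in practice: sample the same question many times and keep the most common answer. `sample_model` is a placeholder for any stochastic model call, not a real API:

```python
# Sketch of consensus@k (majority voting over repeated samples).
from collections import Counter
from typing import Callable


def consensus_at_k(sample_model: Callable[[str], str], question: str, k: int = 64) -> str:
    """Run the model k times and return the most frequent answer.
    Roughly k times the cost of a single (standard) call, which is why comparing
    one model's consensus@64 score against another model's single attempt is an
    apples-to-oranges comparison."""
    answers = [sample_model(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```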
2. Privileged Access to Test Questions
This trick involves companies gaining early or exclusive access to benchmark data [00:03:21].
- Example: FrontierMath was promoted as a highly protected benchmark for advanced mathematics [00:03:25]. However, OpenAI, which funded FrontierMath, gained access to the entire dataset [00:03:38]. Although there was a verbal agreement not to train on the data, and OpenAI employees publicly called it a "strongly held out evaluation set," the optics create a trust problem [00:03:47]. A company that funds a benchmark, sees all the questions, evaluates models internally, and announces scores before independent verification undermines the system [00:03:57].
3. Optimizing for Style Over Substance
Models can be optimized to perform well on style rather than accuracy [00:04:31].
- Example: Meta entered 27 different versions of Llama 4 Maverick into LM Arena, each tweaked for maximum appeal, not necessarily accuracy [00:04:40]. One version, when asked for a riddle with the answer “3.145”, gave a long, emoji-filled, flattering, nonsensical response that still beat Claude’s correct answer because it was “chatty and engaging” [00:05:01]. This means companies are “literally training models to be wrong, but charming” [00:05:09].
- Researchers at LM Arena showed that filtering out style effects (length, formatting, personality) completely changed the rankings: GPT-4o Mini and Grok 2 dropped, while Claude 3.5 Sonnet jumped to tie for first [00:05:14]. This indicates that benchmarks often measure charm rather than accuracy, akin to choosing a surgeon based on bedside manner instead of skill [00:05:30]. (A simplified sketch of style-controlled ranking follows this list.)
- Even human SATs have this issue, with 39% of score variance being attributed to essay length; writing more generally leads to higher scores [00:05:40].
- The industry tends to measure “what sells” rather than “what matters” [00:05:50].
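As a rough illustration of how style control can work, the sketch below fits pairwise "battle" outcomes with per-model strength terms plus a length covariate, so the strength coefficients are read with the length effect regressed out. The data and feature choices are invented for illustration and are far simpler than LM Arena's actual methodology:

```python
# Simplified style-controlled ranking: per-model strengths + a style covariate
# (response-length difference) fit on pairwise preference outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

MODELS = ["model_a", "model_b", "model_c"]
IDX = {m: i for i, m in enumerate(MODELS)}

# Each battle: (left model, right model, left length, right length, 1 if left won).
# Toy data for illustration only.
battles = [
    ("model_a", "model_b", 900, 300, 1),
    ("model_b", "model_c", 250, 700, 0),
    ("model_a", "model_c", 850, 400, 1),
    ("model_c", "model_b", 350, 800, 1),
]

X, y = [], []
for left, right, len_left, len_right, left_won in battles:
    row = np.zeros(len(MODELS) + 1)
    row[IDX[left]] += 1.0                      # +1 strength for the left model
    row[IDX[right]] -= 1.0                     # -1 strength for the right model
    row[-1] = (len_left - len_right) / 1000.0  # style covariate: length difference
    X.append(row)
    y.append(left_won)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = dict(zip(MODELS, clf.coef_[0][: len(MODELS)]))
print("style-adjusted strengths:", strengths)
print("length (style) coefficient:", clf.coef_[0][-1])
```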
The “Evaluation Crisis”
The fundamental problem is Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [00:05:57]. Benchmarks have become billion-dollar targets, naturally leading them away from measuring what truly matters [00:06:07].
Experts and creators of these benchmarks acknowledge the problem:
- Andrej Karpathy (co-founder of OpenAI): "My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now" [00:06:24].
- John Yang (creator of SWE-bench): "It's sort of like we kind of just made these benchmarks up" [00:06:36].
- Maarten Sap (CMU): "The yardsticks are like pretty fundamentally broken" [00:06:44].
Fixing Public Metrics
To address the issues with public metrics, improvements are needed across model comparisons, test sets, and metrics [00:07:06].
- Model Comparisons: Require "apples-to-apples" comparisons with the same computational budget and constraints, avoiding cherry-picked configurations [00:07:13]. Cost-performance trade-offs should be displayed transparently, as the Arc Prize does [00:07:24]. (A brief sketch of budget-matched reporting follows this list.)
- Test Sets: Demand transparency, open-sourcing data, methodologies, and code [00:07:34]. There should be no financial ties between benchmark creators and model companies [00:07:39]. Regular rotation of test questions is necessary to prevent overfitting [00:07:45].
- Metrics: Implement controls for style effects to measure substance over engagement [00:07:51]. All attempts should be publicly required to prevent cherry-picking the best run [00:07:59].
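A hedged sketch of what budget-matched, fully disclosed reporting could look like: every configuration is evaluated under the same budget cap and the full cost-versus-score table is published rather than a single cherry-picked run. `Config` and `evaluate` are hypothetical placeholders, and the budget accounting is an assumption, not the Arc Prize's actual methodology:

```python
# Budget-matched reporting: evaluate every configuration under one shared budget
# cap and disclose the full cost/score table instead of one best run.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Config:
    model_name: str
    variant: str               # e.g. "standard" or "consensus@64"
    cost_per_question: float   # USD, an assumed accounting unit


def report(configs: List[Config],
           evaluate: Callable[[Config], float],
           budget_per_question: float) -> None:
    """Print score and cost for every configuration under the shared budget cap,
    disclosing (rather than silently dropping) configurations that exceed it."""
    for cfg in configs:
        if cfg.cost_per_question > budget_per_question:
            print(f"{cfg.model_name} ({cfg.variant}): over budget, excluded")
            continue
        score = evaluate(cfg)
        print(f"{cfg.model_name} ({cfg.variant}): score={score:.3f}, "
              f"cost/question=${cfg.cost_per_question:.4f}")
```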
Progress is being made through LM Arena's style-controlled rankings and the emergence of independent, domain-specific benchmarks such as LegalBench, MedQA, and their fintech counterparts [00:08:11]. Cross-cutting efforts like AgentEval and BetterBench are also working to "benchmark the benchmarks" [00:08:34].
Building Custom Evaluations for Specific Use Cases
Instead of chasing public benchmarks, a more effective approach is to build a set of evaluations that directly matter for a specific use case [00:08:53].
Steps to Build Custom Evaluations
- Gather Real Data: Use actual queries from your production system. Five real user problems are more valuable than 100 academic questions [00:09:04].
- Choose Your Metrics: Select metrics (e.g., quality, cost, latency) relevant to your application. A chatbot needs different metrics than a medical diagnosis system [00:09:21].
- Test the Right Models: Don’t rely solely on leaderboards. Test the top five models against your specific data, as a model that tops generic benchmarks might fail on your unique legal documents [00:09:34].
- Systematize It: Establish a consistent, repeatable evaluation process, whether built in-house or on a platform like Scorecard [00:09:47]. (A minimal harness sketch follows this list.)
- Keep Iterating: Models and needs evolve, so make evaluation a continuous process, not a one-time event [00:09:55].
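A minimal harness sketch under the steps above: real production queries, metrics that matter to the application (quality, cost, latency), and the same run repeated across candidate models. `call_model` and `judge_quality` stand in for your own model clients and graders and are not part of any real SDK:

```python
# Minimal custom-eval harness: same queries, same metrics, run per candidate model.
import time
from typing import Callable, Dict, List, Tuple


def evaluate_model(
    call_model: Callable[[str], Tuple[str, float]],  # returns (answer, cost in USD)
    judge_quality: Callable[[str, str], float],      # scores answer vs. query, 0..1
    real_queries: List[str],                         # pulled from production, not academia
) -> Dict[str, float]:
    quality, cost, latency = [], [], []
    for query in real_queries:
        start = time.perf_counter()
        answer, dollars = call_model(query)
        latency.append(time.perf_counter() - start)
        cost.append(dollars)
        quality.append(judge_quality(query, answer))
    n = len(real_queries)
    return {
        "avg_quality": sum(quality) / n,
        "avg_cost_usd": sum(cost) / n,
        "avg_latency_s": sum(latency) / n,
    }

# Usage: run the same harness over each candidate and keep results versioned so
# every iteration is comparable to the last, e.g.
# results = {name: evaluate_model(client, grader, real_queries)
#            for name, client in candidates.items()}
```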
At Scorecard, this process forms a continuous workflow: identify issues, build improvements, run evaluations before deployment, get feedback, improve, and deploy only when quality bars are met [00:10:03]. This pre-deployment evaluation loop distinguishes teams that ship reliable AI from those constantly firefighting production issues [00:10:25]. While it takes more effort than checking a leaderboard, it is the only way to build AI that genuinely serves users [00:10:34].
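One way such a quality bar might be enforced is a simple pre-deployment gate that blocks the release when eval results fall short. The thresholds, result fields, and hardcoded sample results below are illustrative assumptions, not Scorecard's actual workflow:

```python
# Pre-deployment quality gate: exit non-zero if any bar is missed, so CI blocks the release.
import sys

QUALITY_BARS = {"avg_quality": 0.85, "avg_cost_usd": 0.01, "avg_latency_s": 2.0}


def gate(results: dict) -> bool:
    """True only if quality meets its bar and cost/latency stay under theirs."""
    return (
        results["avg_quality"] >= QUALITY_BARS["avg_quality"]
        and results["avg_cost_usd"] <= QUALITY_BARS["avg_cost_usd"]
        and results["avg_latency_s"] <= QUALITY_BARS["avg_latency_s"]
    )


if __name__ == "__main__":
    # Sample numbers standing in for output from the harness sketched earlier.
    results = {"avg_quality": 0.88, "avg_cost_usd": 0.008, "avg_latency_s": 1.4}
    if not gate(results):
        print("Quality bar not met; blocking deployment.")
        sys.exit(1)
    print("Quality bar met; safe to deploy.")
```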
Ultimately, all benchmarks are flawed, but some are useful [00:11:04]. The key is knowing which ones, and measuring what truly matters to your users rather than what sells on public forums [00:11:00].