From: aidotengineer
The game of AI benchmarks is often rigged, with significant players having strong incentives to maintain the status quo [00:00:01]. Darius, CEO of Scorecard, who built evaluation systems for Waymo, Uber ATG (OG AI agents), and SpaceX, notes that his team has observed “every eval trick in the book” [00:00:10].
At its core, a benchmark consists of three components:
- The model being tested [00:00:50].
- A set of questions, referred to as the test set [00:00:52].
- A metric to keep score [00:00:55].
A benchmark is essentially a bundle of many individual evaluations [00:01:00]. Its key function is to standardize the test set and metric across different models so that results are comparable, similar to the SAT [00:01:04].
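To make those three components concrete, here is a minimal sketch in Python. The `ask_model` callable, the metric, and the two sample questions are hypothetical placeholders, not any real benchmark's harness.

```python
# Minimal sketch of a benchmark: a fixed test set plus a fixed metric,
# applied to any model. `ask_model` is a hypothetical callable (question -> answer).

def exact_match(predicted: str, expected: str) -> float:
    """Metric: 1.0 if the answer matches exactly, else 0.0."""
    return float(predicted.strip().lower() == expected.strip().lower())

def run_benchmark(ask_model, test_set: list[dict]) -> float:
    """Score one model on the shared test set with the shared metric."""
    scores = [exact_match(ask_model(item["question"]), item["answer"])
              for item in test_set]
    return sum(scores) / len(scores)

test_set = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]

# Because the test set and metric are held fixed, two models become comparable:
# score_a = run_benchmark(model_a, test_set)
# score_b = run_benchmark(model_b, test_set)
```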
Why Benchmarks Control Billions [00:00:31]
Benchmark scores significantly influence market value, investment decisions, and public perception [00:01:22]. Simon Willison highlights that billions of dollars in investment now ride on these scores [00:01:31]. When companies like OpenAI or Anthropic claim the top spot, it affects not only funding but also enterprise contracts, developer mindshare, and market dominance [00:01:40]. A single benchmark result, like AutoCodeRover’s strong performance on SWE-bench, can lead to acquisitions [00:01:56]. In other words, a single number can define market leaders and eliminate competitors [00:02:02].
Common Tricks to Manipulate AI Benchmark Results
When the stakes are high, companies employ creative methods to win [00:02:12].
1. Apples-to-Oranges Comparisons [00:02:22]
Companies often compare their best configurations against other models’ standard configurations [00:02:40]. For example, xAI released benchmark results for Grok 3 showing it beating competitors [00:02:27]. It was later observed, however, that the charts omitted OpenAI’s o3 scores at “consensus@64” (running the model 64 times and taking the consensus answer), a setting that is far more computationally expensive [00:02:49]. True performance leadership requires comparing best to best, or standard to standard [00:03:06]. This kind of selective reporting is a common issue [00:03:13].
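For intuition, here is a rough sketch of what “consensus@64” means and why it inflates compute cost. `sample_model` is a hypothetical callable; this is an illustration, not xAI’s or OpenAI’s actual evaluation code.

```python
from collections import Counter

def consensus_at_k(sample_model, question: str, k: int = 64) -> str:
    """Sample the model k times and return the majority answer.

    This costs roughly k times the inference of a single call, so quoting
    your own consensus@64 score against a rival's single-shot score is an
    apples-to-oranges comparison.
    """
    answers = [sample_model(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# A fair chart holds the sampling budget constant:
# single-shot vs. single-shot, or consensus@64 vs. consensus@64.
```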
2. Privileged Access to Test Questions [00:03:23]
Another controversial trick is gaining privileged access to benchmark test questions [00:03:23]. FrontierMath, billed as an “impossible to game” benchmark for advanced mathematics, was funded by OpenAI, which in turn gained access to its entire dataset [00:03:25]. While there was a verbal agreement not to train on the data, and OpenAI employees publicly stated it was a “strongly held out evaluation set” [00:03:47], the optics are problematic: a company funds a benchmark, has access to all of its questions, evaluates its models internally, and announces scores before independent verification, creating a trust problem [00:03:57]. Financial ties between benchmark creators and model companies undermine the entire system [00:04:20].
3. Optimizing for Style Over Substance [00:04:31]
Models can be optimized for style over accuracy, which is a subtle yet significant trick [00:04:35]. Meta, for instance, entered 27 different versions of Llama 4 Maverick into LM Arena, each tweaked for “appeal, not necessarily accuracy” [00:04:40]. An example showed a private version of the model giving a long, emoji-filled, flattering, yet nonsensical response to a math riddle, beating Claude’s correct answer simply because it was “chatty and engaging” [00:04:53]. Companies are effectively training models to be “wrong, but charming” [00:05:09].
Researchers at LM Arena have shown that controlling for style effects (length, formatting, personality) can drastically change rankings, with models like GPT-4o Mini and Grok 2 dropping and Claude 3.5 Sonnet rising to tie for first [00:05:14]. This indicates that benchmarks often measure charm rather than accuracy [00:05:30]. The issue mirrors the human SAT, where essay length significantly impacts scores [00:05:40]. The industry currently prioritizes measuring “what sells” over “what matters” [00:05:50].
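The general idea behind style control can be sketched as a pairwise (Bradley-Terry-style) regression with style differences added as covariates. This is only an illustration of the approach under simplifying assumptions, not LM Arena’s exact methodology, and the function and variable names are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_strengths(battles, n_models):
    """battles: iterable of (model_a, model_b, len_a, len_b, a_won) tuples.

    Fits pairwise outcomes with the response-length difference as an extra
    covariate, so the per-model coefficients reflect substance more than
    verbosity. Real systems add more style features (formatting, emoji, ...).
    """
    X, y = [], []
    for a, b, len_a, len_b, a_won in battles:
        row = np.zeros(n_models + 1)
        row[a], row[b] = 1.0, -1.0            # Bradley-Terry model indicators
        row[-1] = (len_a - len_b) / 1000.0    # style covariate: length difference
        X.append(row)
        y.append(int(a_won))
    fit = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
    strengths, length_effect = fit.coef_[0][:n_models], fit.coef_[0][-1]
    return strengths  # rank models by these, with the length effect separated out
```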
The Fundamental Problem: Goodhart’s Law [00:05:57]
These issues are a natural outcome of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [00:06:03]. By turning benchmarks into targets worth billions, they have inevitably stopped measuring what truly matters, driven by incentives [00:06:07].
Expert Confirmation [00:06:17]
Leaders in AI confirm this evaluation crisis:
- Andrej Karpathy (co-founder of OpenAI): “My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now” [00:06:24].
- John Yang (creator of SWE-bench): “It’s sort of like we kind of just made these benchmarks up” [00:06:39].
- Maarten Sap (CMU): “The yardsticks are like pretty fundamentally broken” [00:06:44].
When the creators and leaders of AI acknowledge that benchmarks are broken and metrics cannot be trusted, it signals a serious problem [00:06:53].
How to Fix Public Metrics [00:07:06]
Fixing public metrics requires improvements across all three components of a benchmark:
Model Comparisons
- Require “apples-to-apples” comparisons: models should have the same computational budget and constraints, with no cherry-picking of configurations [00:07:15].
- Transparently show cost-performance trade-offs [00:07:24].
Test Sets
- Demand transparency: open-source data, methodologies, and code [00:07:34].
- Ensure no financial ties between benchmark creators and model companies [00:07:39].
- Implement regular rotation of test questions to prevent overfitting [00:07:45].
Metrics
- Control for style effects to measure substance over engagement [00:07:51].
- Require all attempts to be public to prevent cherry-picking of the best run [00:07:59].
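Of these fixes, test-question rotation is the easiest to sketch in code. One illustrative scheme, assuming the benchmark maintains a large private question pool, deterministically draws a fresh slice each period and skips anything already published:

```python
import hashlib

def rotating_test_set(question_pool: list[str], published: set[str],
                      period: str, size: int) -> list[str]:
    """Select `size` questions for a period (e.g. "2025-Q3"), excluding any
    question that has already been published, so a frozen public set can't
    be overfit. Sketch only; real benchmarks also refresh the pool itself."""
    candidates = [q for q in question_pool if q not in published]
    return sorted(
        candidates,
        key=lambda q: hashlib.sha256(f"{period}:{q}".encode()).hexdigest(),
    )[:size]
```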
Progress is being made: LM Arena now publishes style-controlled rankings, and independent, open-source benchmarks are emerging in specific domains such as law (LegalBench), medicine (MedQA), and finance [00:08:11]. Cross-cutting efforts like Agent Eval and BetterBench are also working to benchmark the benchmarks themselves [00:08:34].
Building Your Own Meaningful Evaluations [00:08:55]
To truly win the evaluation game, stop chasing public benchmarks and instead build evaluations that fit your specific use case [00:08:53]. That is what makes AI applications reliable.
Here’s a five-step process (a minimal code sketch follows the list):
- Gather Real Data: Five actual queries from a production system are more valuable than 100 academic questions [00:09:04]. Real user problems consistently outperform synthetic benchmarks [00:09:18].
- Choose Your Metrics: Select what truly matters for your application (e.g., quality, cost, latency). A chatbot’s metrics differ from a medical diagnosis system [00:09:21].
- Test the Right Models: Don’t rely solely on leaderboards. Test the top models on your specific data; a model that tops generic benchmarks might fail on your unique legal documents [00:09:34].
- Systematize It: Implement consistent, repeatable evaluation processes. This can be built in-house or using platforms like Scorecard [00:09:47].
- Keep Iterating: As models improve and needs change, evaluation must be a continuous process, not a one-time event [00:09:55].
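As a concrete starting point, here is a minimal sketch of steps 1-4 in Python. The example queries, the crude `quality` check, and the `prompt -> (output, cost_usd)` model interface are all assumptions to adapt to your own stack (or to replace with a platform such as Scorecard).

```python
import time

# Step 1: a handful of real production queries (placeholders here).
real_cases = [
    {"input": "Summarize this support ticket: ...", "expected_topic": "billing"},
    {"input": "Draft a reply about a late shipment.", "expected_topic": "shipping"},
]

# Step 2: metrics that matter for *this* application.
def quality(output: str, case: dict) -> float:
    return float(case["expected_topic"] in output.lower())  # crude placeholder check

# Steps 3-4: run every candidate model through the same cases, repeatably.
def evaluate(models: dict, cases: list[dict]) -> dict:
    """`models` maps a name to a callable prompt -> (output, cost_usd); assumed interface."""
    results = {}
    for name, call in models.items():
        rows = []
        for case in cases:
            start = time.time()
            output, cost_usd = call(case["input"])
            rows.append({"quality": quality(output, case),
                         "latency_s": time.time() - start,
                         "cost_usd": cost_usd})
        results[name] = {k: sum(r[k] for r in rows) / len(rows) for k in rows[0]}
    return results

# Step 5: rerun this on every model, prompt, or pipeline change and compare runs.
```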
This continuous cycle involves identifying issues, building improvements, running pre-deployment evaluations, getting feedback, and only deploying when quality bars are met [00:10:06]. This “pre-deployment evaluation loop” distinguishes teams that ship reliable AI from those constantly firefighting production issues [00:10:25]. It is more work than checking a leaderboard, but it is the only way to build AI that truly serves users [00:10:34].
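A minimal sketch of such a pre-deployment gate, building on the `evaluate()` sketch above; the metric names and thresholds are placeholders for whatever quality bar your team sets.

```python
def deploy_gate(candidate: dict, bars: dict) -> bool:
    """Ship only when every quality bar is met; otherwise keep iterating."""
    checks = {
        "quality":   candidate["quality"]   >= bars["min_quality"],
        "latency_s": candidate["latency_s"] <= bars["max_latency_s"],
        "cost_usd":  candidate["cost_usd"]  <= bars["max_cost_usd"],
    }
    for metric, ok in checks.items():
        print(f"{metric}: {'PASS' if ok else 'FAIL'} ({candidate[metric]:.3f})")
    return all(checks.values())

# Example use in a release script (names are illustrative):
# results = evaluate({"candidate": call_candidate}, real_cases)
# if not deploy_gate(results["candidate"], {"min_quality": 0.9,
#                                           "max_latency_s": 2.0,
#                                           "max_cost_usd": 0.01}):
#     raise SystemExit("Quality bar not met; do not deploy.")
```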
The AI benchmarks game is rigged due to the immense stakes involved: market caps, acquisitions, and developer mind share [00:10:45]. However, companies don’t have to participate in this game. Instead, they can build evaluations that genuinely help ship better products by measuring what matters to their users, not just what garners attention on social media [00:10:54]. All benchmarks are flawed, but some are useful; the key is discerning which ones [00:11:04].