From: aidotengineer
AI benchmarks are designed to compare models, but their significant influence extends to market value, investment decisions, and public perception, often leading to a “rigged game” where the biggest players have incentives to maintain the current system [00:00:01].
What is a Benchmark?
A benchmark is composed of three core components [00:00:48]:
- Model: The AI system being tested [00:00:50].
- Test Set: A collection of questions or scenarios [00:00:52].
- Metric: The method used to keep score [00:00:55].
Crucially, benchmarks standardize the test set and metrics across different models, allowing for comparable results, much like the SAT exam uses the same questions and scoring system for different test takers [00:01:04].
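To make these three components concrete, here is a minimal sketch of a benchmark harness in Python. The questions, the metric, and the ask_model callable are illustrative placeholders rather than any real benchmark's code; the point is simply that the test set and metric stay fixed while the model varies.
```python
# Minimal benchmark harness: the test set and metric are fixed; only the model varies.
# `ask_model(model_name, question)` is a placeholder for whatever client you use.

TEST_SET = [
    {"question": "What is 17 * 24?", "answer": "408"},
    {"question": "What is the capital of Australia?", "answer": "Canberra"},
]

def exact_match(prediction: str, reference: str) -> float:
    """Metric: 1.0 if the answer matches exactly (ignoring case/whitespace), else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model_name: str, ask_model) -> float:
    """Run one model over the shared test set and return its average score."""
    scores = [
        exact_match(ask_model(model_name, item["question"]), item["answer"])
        for item in TEST_SET
    ]
    return sum(scores) / len(scores)
```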
Benchmarks’ Control Over Billions
The scores from these benchmarks wield immense power: they move billions in market value, influence investment decisions, and shape public perception [00:01:22]. According to Simon Willison, billions of dollars of investment are now being evaluated on the basis of these scores [00:01:31].
When companies like OpenAI or Anthropic claim the top spot on a benchmark, it directly affects funding, enterprise contracts, developer mind share, and market dominance [00:01:39]. For example, Andrej Karpathy’s tweets about benchmarks can shape entire ecosystems because he has millions of followers [00:01:46]. In one recent acquisition, Sonar acquired AutoCodeRover in large part because AutoCodeRover had shown strong results on SWE-bench [00:01:53]. This demonstrates how a single number can define market leaders and potentially undermine competitors [00:02:02].
Common Tricks Companies Use to Manipulate AI Benchmark Results
When the stakes are high, companies often find “creative ways to win” in the benchmarking game [00:02:09].
1. Apples-to-Oranges Comparisons
This trick involves comparing a company’s best-configured model against competitors’ standard configurations [00:02:22].
- Example: xAI released benchmark results for Grok 3 that showed it beating competitors [00:02:27]. It later emerged that xAI had compared its own best configuration, “consensus@64” (running the model 64 times and taking the consensus answer, a far more expensive setup; see the sketch below), against the standard single-run configurations of OpenAI’s models [00:02:37]. This selective reporting paints a misleading picture of relative performance [00:03:13].
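For readers unfamiliar with the term, “consensus@64”-style evaluation roughly means sampling the model many times and taking a majority vote over the answers. The sketch below illustrates the idea under that assumption; ask_model is a placeholder, and the exact procedure xAI used may differ.
```python
from collections import Counter

def consensus_at_k(ask_model, model_name: str, question: str, k: int = 64) -> str:
    """Sample the model k times and return the most frequent answer.

    This costs roughly k times as much as a single call, which is why comparing
    a consensus@64 score against competitors' single-shot scores is misleading.
    `ask_model(model_name, question)` is an illustrative placeholder.
    """
    answers = [ask_model(model_name, question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```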
2. Privileged Access to Test Questions
This involves a company gaining early or exclusive access to benchmark test data, potentially undermining the integrity of the evaluation [00:03:23].
- Example: FrontierMath was promoted as a highly secure benchmark for advanced mathematics [00:03:25]. However, OpenAI, which funded FrontierMath, had access to the entire dataset [00:03:38]. Although there was a verbal agreement not to train on the data, and OpenAI employees described it as a “strongly held-out evaluation set,” the optics create a trust problem [00:03:47]: the company funding the benchmark could evaluate its models internally and announce scores before any independent verification [00:03:57]. Financial ties like this between benchmark creators and the companies being evaluated can undermine the entire system [00:04:20].
3. Optimizing for Style Over Substance
Models can be optimized to score well on the stylistic aspects of a response rather than its accuracy, especially in human-preference benchmarks [00:04:35].
- Example: Meta entered 27 different versions of Llama 4 Maverick into LM Arena, each tweaked for “appeal” rather than strict accuracy [00:04:40]. One private version, asked a riddle whose answer is 3.145, produced a long, emoji-filled, flattering, but nonsensical response [00:04:52]. This “charming” but incorrect answer still beat Claude’s correct answer because it was chatty and engaging [00:05:01]. Companies are effectively training models to be “wrong, but charming” [00:05:09].
- Researchers at LM Arena have shown that filtering out style effects (length, formatting, personality) completely changes the rankings [00:05:14]. With style controlled, models like GPT-4o Mini and Grok 2 dropped, while Claude 3.5 Sonnet tied for first [00:05:22]. This suggests these benchmarks often measure charm rather than accuracy [00:05:30] (see the sketch below for a simplified illustration of style control).
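LM Arena’s published style control works by adding style covariates to its ranking model; the sketch below is a much simpler stand-in that extracts crude style signals and keeps only style-matched comparisons. It is meant to illustrate the general idea of separating substance from presentation, not to reproduce their method.
```python
import re

def style_features(response: str) -> dict:
    """Crude style signals that can sway human preference independently of accuracy."""
    return {
        "length": len(response),
        "emoji_count": len(re.findall("[\U0001F300-\U0001FAFF]", response)),
        "list_items": len(re.findall(r"^\s*[-*\d]", response, flags=re.MULTILINE)),
        "bold_spans": response.count("**") // 2,
    }

def style_matched_battles(battles):
    """Keep only head-to-head comparisons where both responses have similar style,
    so the remaining win rates reflect substance more than charm.

    `battles` is a list of (response_a, response_b, winner) tuples.
    """
    kept = []
    for a, b, winner in battles:
        fa, fb = style_features(a), style_features(b)
        if abs(fa["length"] - fb["length"]) < 200 and fa["emoji_count"] == fb["emoji_count"]:
            kept.append((a, b, winner))
    return kept
```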
The Fundamental Problem: Goodhart’s Law
The core issue stems from Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [00:05:57]. Benchmarks have become targets worth billions, leading them to stop measuring what truly matters, an outcome guaranteed by the existing incentives [00:06:07].
Expert Consensus on the Evaluation Crisis
Leading AI researchers and benchmark creators acknowledge the severity of the problem:
- Andrej Karpathy (Co-founder of OpenAI): “My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now” [00:06:24].
- John Yang (creator of SWE-bench): “It’s sort of like we kind of just made these benchmarks up” [00:06:39].
- Martin Sat (CMU): “The yardsticks are like pretty fundamentally broken” [00:06:46].
When the creators and leaders of AI admit the metrics are untrustworthy, it signals a serious problem with AI benchmarking practices [00:06:53].
Fixing Public Metrics
To improve public benchmarks, all three components (model comparisons, test sets, and metrics) need reform [00:07:06].
- Model Comparisons:
- Require apples-to-apples comparisons with the same computational budget and constraints, with no cherry-picked configurations (illustrated in the sketch after this list) [00:07:14].
- Transparently show cost-performance trade-offs [00:07:24].
- Test Sets:
- Demand transparency through open-sourced data, methodologies, and code [00:07:34].
- Eliminate financial ties between benchmark creators and model companies [00:07:39].
- Implement regular rotation of test questions to prevent overfitting [00:07:45].
- Metrics:
- Control for style effects to measure substance rather than just engagement [00:07:51].
- Require all attempts to be publicly recorded to prevent cherry-picking the best run [00:07:59].
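One way to picture several of these reforms together is a submission record that carries its configuration, test-set version, cost, and full list of attempts along with the score. The sketch below is a hypothetical data structure, not any benchmark’s actual schema.
```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRun:
    """One publicly recorded attempt; configuration and cost travel with the score,
    so apples-to-oranges setups (e.g. consensus@64 vs. single-shot) stay visible."""
    model: str
    config: str            # e.g. "single-shot, temperature 0" or "consensus@64"
    test_set_version: str  # rotated versions make overfitting to leaked questions harder
    score: float
    cost_usd: float
    attempts: list = field(default_factory=list)  # every run, not just the best one

def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Only compare runs that used the same configuration and test-set version."""
    return a.config == b.config and a.test_set_version == b.test_set_version
```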
Progress is being made: LM Arena now offers style-controlled rankings [00:08:11], and independent benchmarks are emerging in specific domains, such as LegalBench for law, MedQA for medicine, and finance-focused suites, along with cross-cutting efforts like AgentEval and BetterBench, which aim to benchmark the benchmarks themselves [00:08:20].
Building Effective Internal Evaluations
Instead of solely relying on public benchmarks, companies can build their own evaluation systems tailored to their specific use cases [00:08:53].
- Gather Real Data: Even five actual queries from your production system are more valuable than 100 academic questions, because evals built on real user problems consistently beat synthetic benchmarks (see the sketch after this list) [00:09:04].
- Choose Your Metrics: Select metrics (e.g., quality, cost, latency) that are relevant to the application; a chatbot’s needs differ from a medical diagnosis system [00:09:21].
- Test the Right Models: Don’t just follow leaderboards; test the top five models on your own data, since a model that excels on generic benchmarks may fail on your domain-specific documents [00:09:34].
- Systematize It: Establish consistent, repeatable evaluation processes, either by building in-house systems or using platforms like Scorecard [00:09:47].
- Keep Iterating: Evaluation should be a continuous process, not a one-time event, as models improve and needs evolve [00:09:55].
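A minimal sketch of what such an internal evaluation harness might look like is shown below. The queries, model names, and the call_model and judge callables are all illustrative placeholders; the point is real production inputs plus the quality, latency, and cost metrics that matter to your application.
```python
import time

# A handful of real production queries beats a large synthetic set.
REAL_QUERIES = [
    {"input": "Summarize this refund policy for a customer...", "topic": "refunds"},
    {"input": "Draft a reply to this invoice dispute...", "topic": "billing"},
]

CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]  # your current shortlist

def evaluate_model(model_name, call_model, judge):
    """Score one model on the metrics that matter to *your* application.

    `call_model(model_name, prompt)` returns (text, cost_usd); `judge(output, case)`
    returns a quality score between 0 and 1. Both are placeholders for your own
    client and grading logic (human review, a rubric, or an LLM judge).
    """
    rows = []
    for case in REAL_QUERIES:
        start = time.time()
        output, cost = call_model(model_name, case["input"])
        rows.append({
            "quality": judge(output, case),
            "latency_s": time.time() - start,
            "cost_usd": cost,
        })
    n = len(rows)
    return {k: sum(r[k] for r in rows) / n for k in ("quality", "latency_s", "cost_usd")}

# Outer loop, with your own client and judge plugged in:
# for name in CANDIDATE_MODELS:
#     print(name, evaluate_model(name, call_model, judge))
```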
Scorecard, for instance, implements a continuous pre-deployment evaluation loop: identify issues, build improvements, run evaluations before deployment, get feedback, improve, deploy only when quality bars are met, then monitor and restart the cycle [00:10:05]. This continuous process differentiates teams that ship reliable AI from those constantly firefighting production issues [00:10:25]. While this requires more effort than checking a leaderboard, it’s the only way to build AI that truly serves users [00:10:34].
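The “deploy only when quality bars are met” step can be wired into CI as a simple gate, as in the sketch below. The thresholds are illustrative, and the results dict would come from an eval run like the one sketched earlier; this is not Scorecard’s actual API.
```python
import sys

# Illustrative quality bar: tune these thresholds to your own product requirements.
QUALITY_BAR = {"quality": 0.85, "latency_s": 2.0, "cost_usd": 0.01}

def gate(results: dict) -> bool:
    """Allow deployment only if every metric meets its bar."""
    return (
        results["quality"] >= QUALITY_BAR["quality"]
        and results["latency_s"] <= QUALITY_BAR["latency_s"]
        and results["cost_usd"] <= QUALITY_BAR["cost_usd"]
    )

if __name__ == "__main__":
    # In CI, this dict would come from the pre-deployment eval run sketched above.
    results = {"quality": 0.88, "latency_s": 1.4, "cost_usd": 0.006}
    if not gate(results):
        print("Quality bar not met; blocking deployment:", results)
        sys.exit(1)
    print("Quality bar met; proceeding to deploy.")
```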
The benchmarks game remains rigged due to high stakes like market caps, acquisitions, and developer mind share [00:10:45]. However, companies can choose to build evaluations that genuinely help them ship better products by measuring what matters to their users, not what sells on social media platforms [00:10:54]. As the saying goes, “All benchmarks are wrong, but some are useful. The key is knowing which ones” [00:11:04].