From: aidotengineer
The way AI benchmarks are currently structured allows them to control billions in market value and mind share, influencing investment decisions and public perception [00:00:31]. Simon Willison highlights that billions of dollars of investment are evaluated based on these scores [00:01:31].
Influence on the AI Ecosystem
When major players like OpenAI or Anthropic claim the top spot in a benchmark, it impacts not only their funding but also their enterprise contracts, developer mind share, and overall market dominance [00:01:38]. The influence extends to shaping entire ecosystems, as when Andrej Karpathy, a co-founder of OpenAI, tweets about a benchmark to millions of followers [00:01:46]. A tangible example is Sonar’s acquisition of AutoCodeRover, which was driven by AutoCodeRover’s strong performance on SWE-bench [00:01:53]. This demonstrates how a single number can define market leaders and potentially destroy competitors [00:02:02].
The Problem: A Rigged Game
The speaker asserts that the AI benchmark game is “rigged” because the stakes are too high for it not to be [00:00:01], [00:02:12], [00:10:45]. This leads companies to find creative ways to win, undermining the trustworthiness of AI benchmarks.
Common Tricks to Game Benchmarks
- Apples-to-Oranges Comparisons: Companies may compare their best-performing, often more expensive, configurations against competitors’ standard ones [00:02:22], [00:02:40]. For instance, xAI released Grok 3 benchmark results showing superiority, but compared its model under a high-compute configuration (consensus@64, i.e., running the model 64 times per question and taking a majority vote) against other models’ standard single-attempt configurations, without disclosing the difference [00:02:27], [00:02:49]. This “selective reporting” misrepresents true performance [00:03:13]. (A minimal sketch of this configuration gap follows the list below.)
- Privileged Access to Test Questions: Some companies gain early or privileged access to benchmark datasets [00:03:23]. An example is OpenAI funding FrontierMath and receiving access to its entire dataset [00:03:38]. Even with verbal agreements not to train on the data, the optics of a funding company having access to the evaluation questions, running internal evaluations, and announcing scores before independent verification create a trust problem [00:03:57], [00:04:17].
- Optimizing for Style Over Substance: Models are trained to optimize for style (e.g., chatty, engaging responses) rather than accuracy [00:04:35]. Meta entered 27 versions of Llama 4 Maverick into LM Arena, some tweaked for “appeal” over “accuracy” [00:04:40]. One version, though factually incorrect, beat a correct answer from Claude because of its engaging, emoji-filled, flattering response [00:04:57]. This means companies are “literally training models to be wrong, but charming” [00:05:11]. Researchers at LM Arena showed that when style effects (length, formatting, personality) are filtered out, rankings change dramatically, revealing that models are often being measured for “charm” rather than accuracy [00:05:14], [00:05:30].
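To make the configuration gap concrete, here is a minimal sketch in Python, with a hypothetical `call_model` stub standing in for a real model API and a toy question. The point is the budget, not the numbers: consensus@64 samples the model 64 times and majority-votes, so quoting it against another model’s single attempt compares two very different amounts of compute.

```python
from collections import Counter
import random

def call_model(question: str) -> str:
    # Hypothetical stand-in for a real API call: a stochastic model that
    # answers this toy question correctly about 40% of the time.
    return "42" if random.random() < 0.4 else str(random.randint(0, 9))

def single_attempt(question: str) -> str:
    return call_model(question)                         # 1 sample

def consensus_at_k(question: str, k: int = 64) -> str:
    samples = [call_model(question) for _ in range(k)]  # k samples = k x the cost
    return Counter(samples).most_common(1)[0][0]        # majority vote

question, gold = "What is 6 times 7?", "42"
print("single attempt correct:", single_attempt(question) == gold)
print("consensus@64 correct:  ", consensus_at_k(question) == gold, "(at 64x the compute)")
```

A model reported with consensus@64 is effectively a different, far more expensive system than the same model run once, which is why undisclosed configuration differences make headline comparisons misleading.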
The “Evaluation Crisis”
This situation is a “natural outcome” of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [00:05:57]. Since benchmarks have become targets worth billions, they no longer accurately measure what matters [00:06:07]. Experts in the field confirm this crisis:
- Andrej Karpathy (co-founder of OpenAI): “My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now” [00:06:24].
- John Yang (creator of SWE-bench): “It’s sort of like we kind of just made these benchmarks up” [00:06:39].
- Maarten Sap (CMU): “The yardsticks are like pretty fundamentally broken” [00:06:44]. These statements from the very people who build and lead AI indicate a serious problem with the trustworthiness of current metrics [00:06:53].
Fixing Public Metrics
To improve public metrics, adjustments are needed across all three components of a benchmark: model comparisons, test sets, and metrics [00:07:09].
Recommendations
- Model Comparisons: Require “apples-to-apples” comparisons with the same computational budget and constraints, transparently showing cost-performance trade-offs [00:07:15].
- Test Sets: Demand transparency by open-sourcing data, methodologies, and code [00:07:34]. Crucially, there should be no financial ties between benchmark creators and model companies [00:07:39]. Regular rotation of test questions is also necessary to prevent overfitting [00:07:45].
- Metrics: Implement controls for style effects to measure substance over engagement [00:07:51]. All attempts should be public to prevent cherry-picking the best run [00:07:59].
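As one illustration of the style-control recommendation above, below is a minimal sketch of style-adjusted ranking on synthetic pairwise-preference data. This is not LM Arena’s actual code; it only shows the general idea of adding a style covariate (here, response-length difference) to a Bradley-Terry-style logistic regression so that model strength is estimated separately from whatever bonus a judge gives to longer answers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "battles": each row is one pairwise comparison between two models.
rng = np.random.default_rng(0)
n_models, n_battles = 5, 2000
model_a = rng.integers(0, n_models, n_battles)
model_b = (model_a + rng.integers(1, n_models, n_battles)) % n_models  # b != a
len_diff = rng.normal(0, 1, n_battles)   # style covariate: length of A minus B

# Planted ground truth: real skill plus a large bonus for simply being longer.
skill = np.array([0.0, 0.4, 0.8, 1.2, 1.6])
logits = skill[model_a] - skill[model_b] + 0.9 * len_diff
a_won = (rng.random(n_battles) < 1 / (1 + np.exp(-logits))).astype(int)

# Bradley-Terry-style design matrix: +1 for model A, -1 for model B,
# plus one extra column for the style covariate.
X = np.zeros((n_battles, n_models + 1))
X[np.arange(n_battles), model_a] += 1.0
X[np.arange(n_battles), model_b] -= 1.0
X[:, -1] = len_diff

clf = LogisticRegression(fit_intercept=False, C=10.0).fit(X, a_won)
strength = clf.coef_[0][:n_models]   # style-adjusted model strengths
style_effect = clf.coef_[0][-1]      # how much sheer length sways the judge
print("style-adjusted ranking (best first):", np.argsort(-strength))
print("estimated length effect:", round(float(style_effect), 2))
```

On this synthetic data the regression should roughly recover the planted skill ordering, with the length bonus absorbed by the style coefficient instead of inflating any one model’s rating.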
There is progress: LM Arena now offers style-controlled rankings [00:08:11], and independent, open-source benchmarks are emerging for specific domains such as law (LegalBench), medicine (MedQA), and finance [00:08:20].
Building Useful Internal Evaluations
To truly win the evaluation game, the advice is to “stop playing” the rigged public benchmark game [00:08:46]. Instead, companies should build evaluations that directly matter for their specific use cases [00:08:55].
Steps for Effective Internal Evaluation
- Gather Real Data: Utilize actual queries from production systems, as a few real user problems are more valuable than many academic questions [00:09:04].
- Choose Your Metrics: Select metrics relevant to the application, such as quality, cost, and latency. A chatbot’s metrics differ from a medical diagnosis system’s [00:09:21].
- Test the Right Models: Don’t rely solely on leaderboards. Test the top models on your specific data, as a model that excels on generic benchmarks might fail on domain-specific documents [00:09:34].
- Systematize It: Implement consistent, repeatable evaluation processes, either by building in-house systems or using platforms like Scorecard [00:09:46].
- Keep Iterating: Make evaluation a continuous process, adapting as models improve and needs change [00:09:55]. This means a repeating cycle of identifying issues, building improvements, running evaluations before deployment, gathering feedback, and only deploying when quality bars are met [00:10:06] (a minimal sketch of this gate follows the list).
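Putting the steps together, the sketch below shows what such a pre-deployment gate can look like. All names here are assumptions for illustration: `call_model` stands in for your provider call, `grade` for whatever quality check fits your domain (exact match, rubric, LLM judge), and the thresholds are placeholders. Whether you build this in-house or on a platform like Scorecard, the essential shape is the same: run real queries, measure the metrics you chose, and only deploy when the bar is met.

```python
import time

QUALITY_BAR = 0.90      # placeholder: deploy only when quality meets the bar
LATENCY_BAR_S = 2.0     # placeholder: and p95 latency stays acceptable

def call_model(model: str, query: str) -> str:
    # Hypothetical stand-in for your real provider/API call.
    return "Paris" if "capital of france" in query.lower() else "I'm not sure"

def grade(expected: str, actual: str) -> bool:
    # Swap in whatever quality check fits your domain (rubric, LLM judge, ...).
    return expected.strip().lower() == actual.strip().lower()

def run_eval(model: str, cases: list[dict]) -> dict:
    correct, latencies = 0, []
    for case in cases:                                   # real production queries
        start = time.time()
        answer = call_model(model, case["query"])
        latencies.append(time.time() - start)
        correct += grade(case["expected"], answer)
    latencies.sort()
    p95_idx = min(int(0.95 * len(latencies)), len(latencies) - 1)
    return {"quality": correct / len(cases), "p95_latency_s": latencies[p95_idx]}

def ready_to_deploy(report: dict) -> bool:
    return report["quality"] >= QUALITY_BAR and report["p95_latency_s"] <= LATENCY_BAR_S

# Tiny inline sample; in practice these come from production logs.
cases = [
    {"query": "What is the capital of France?", "expected": "Paris"},
    {"query": "What is the capital of Japan?", "expected": "Tokyo"},
]
report = run_eval("candidate-model-v2", cases)
print(report, "-> deploy" if ready_to_deploy(report) else "-> hold and iterate")
```

The gate, not the harness, is the important part: evaluation runs before every deployment, and a release that misses the quality or latency bar goes back into the identify-improve-evaluate cycle rather than out to users.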
This pre-deployment evaluation loop is crucial for shipping reliable AI and avoiding constant production issues [00:10:25]. While more work than checking a leaderboard, it’s the only way to build AI that truly serves users [00:10:34]. The fundamental takeaway is that “all benchmarks are wrong, but some are useful. The key is knowing which ones” [00:11:04].