From: aidotengineer
AI benchmarks are a critical component of the AI industry: they control billions in market value, influence investment decisions, and shape public perception [00:00:31], [00:01:22]. When major players like OpenAI or Anthropic claim a top spot, it affects enterprise contracts, developer mind share, and market dominance [00:01:41]. For example, Sonar acquired AutoCodeRover after it posted strong results on SWE-bench [00:01:56]. This is a system where a single number can crown market leaders and destroy competitors [00:02:02].
A benchmark consists of three main components:
- A model being tested [00:00:50].
- A test set (questions) [00:00:52].
- A metric (how the score is kept) [00:00:55].
The key insight is that benchmarks standardize the test set and metrics across models, making them comparable, similar to the SAT [00:01:06]. However, when the stakes are high, companies employ various strategies to manipulate benchmark results [00:02:12].
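In code terms, the setup looks roughly like the sketch below; the function and metric names are illustrative placeholders, not any real harness. The point is that the test set and metric stay fixed, only the model varies, so the resulting scores are comparable.

```python
# Minimal sketch of the three benchmark components: a model, a test set,
# and a metric. Names here are illustrative, not from a real benchmark.
from typing import Callable

def run_benchmark(
    model_answer: Callable[[str], str],       # the model being tested
    test_set: list[tuple[str, str]],          # (question, reference answer) pairs
    metric: Callable[[str, str], float],      # how the score is kept
) -> float:
    """Score one model on a fixed test set with a fixed metric."""
    scores = [metric(model_answer(q), ref) for q, ref in test_set]
    return sum(scores) / len(scores)

def exact_match(prediction: str, reference: str) -> float:
    # Because the test set and metric are held constant, any two models
    # scored this way can be compared directly.
    return float(prediction.strip() == reference.strip())
```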
Common Tricks to Game the System
1. Apples-to-Oranges Comparisons
This trick involves comparing a company’s best-configured model against other models’ standard configurations [00:02:22], [00:02:39]. For instance, xAI released benchmark results for Grok 3 showing it beating competitors [00:02:27]. However, it was later observed that xAI’s charts omitted OpenAI’s o3 results at “consensus@64,” a configuration that runs the model 64 times and takes the majority answer [00:02:49]. Consensus@64 is far more expensive to run, but a claim of performance leadership requires comparing best configuration to best configuration, or standard to standard, not mixing the two [00:03:04].
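To make the configuration gap concrete, here is a toy sketch of the two setups being mixed. The sampler and answers are invented; this is not xAI’s or OpenAI’s actual evaluation code.

```python
# Single-attempt scoring vs. "consensus of N" (majority vote over N samples).
from collections import Counter
import random

def single_attempt(answer_sampler) -> str:
    return answer_sampler()

def consensus_n(answer_sampler, n: int = 64) -> str:
    # Sample n answers and return the most common one. Much more expensive,
    # and usually more accurate, than a single attempt.
    votes = Counter(answer_sampler() for _ in range(n))
    return votes.most_common(1)[0][0]

# Toy model: answers correctly 40% of the time, otherwise guesses wrong.
def noisy_model() -> str:
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

# Comparing consensus_n(model_A) against single_attempt(model_B) mixes
# configurations; a fair comparison holds n constant for both models.
```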
2. Privileged Access to Test Questions
This strategy involves gaining early or exclusive access to benchmark data [00:03:23]. FrontierMath, a benchmark for advanced mathematics, was supposed to be highly protected [00:03:27]. However, OpenAI funded FrontierMath and gained access to the entire dataset [00:03:40]. Despite verbal agreements not to train on the data and public statements calling it a “strongly held out evaluation set,” the optics create a trust problem [00:03:47], [00:04:17]: the company funding the benchmark can see all the questions, evaluate its models internally, and announce scores before any independent verification [00:03:57]. When benchmark creators accept money from the companies they evaluate, it undermines the entire system [00:04:20].
3. Optimizing for Style Over Substance
Models can be optimized to perform well on human preference benchmarks by focusing on stylistic elements rather than accuracy [00:04:35]. Meta, for example, entered 27 different versions of Llama 4 Maverick into LM Arena, each tweaked to maximize appeal, not necessarily accuracy [00:04:40]. One private version gave a long, emoji-filled, flattering response that made no sense but beat Claude’s correct answer because it was chatty and engaging [00:04:57].
This means companies are training models to be “wrong, but charming” [00:05:11]. Researchers at LM Arena showed this can be controlled for: when style effects (length, formatting, personality) were filtered out, the rankings changed completely [00:05:15]. GPT-4o Mini and Grok 2 dropped, while Claude 3.5 Sonnet jumped up to tie for first [00:05:22]. This indicates that current public benchmarks often measure which model is most charming, not which is most accurate [00:05:30]. The industry prefers measuring what sells over what matters [00:05:51].
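The sketch below illustrates the general idea of style control on toy data. It is in the spirit of LM Arena’s published approach but is not their methodology or code: it simply regresses battle outcomes on a style feature (response-length difference) so a model’s edge can be separated from its verbosity.

```python
# Toy style control: estimate a model's head-to-head strength while holding
# response length constant. All data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one battle of model A vs. model B: the normalized length
# difference of A's answer minus B's, and whether A won.
length_diff = np.array([[0.9], [0.8], [0.7], [0.6], [-0.5], [-0.6], [0.1], [-0.2]])
a_won       = np.array([ 1,     1,     1,     1,      0,      0,     1,      0])

naive_win_rate = a_won.mean()  # mixes real quality with sheer verbosity

style_model = LogisticRegression().fit(length_diff, a_won)
# style_model.intercept_ estimates A's edge when both answers are the same
# length; style_model.coef_ captures how much length alone drives wins.
```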
The Fundamental Problem: Goodhart’s Law
These issues are a natural outcome of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [00:05:59]. Benchmarks have become targets worth billions, leading them to stop measuring what truly matters because the incentives guarantee it [00:06:07].
Expert Consensus on Broken Benchmarks
Experts and creators of these benchmarks acknowledge the problem:
- Andrej Karpathy (co-founder of OpenAI): “My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now” [00:06:27].
- John Yang (creator of SWE-bench): “It’s sort of like we kind of just made these benchmarks up” [00:06:40].
- Maarten Sap (CMU): “The yardsticks are like pretty fundamentally broken” [00:06:44].
When the very people who build these models and benchmarks distrust the metrics, it signals a serious problem [00:06:53].
How to Fix Public Metrics
To improve public benchmarks:
- Model Comparisons: Require “apples-to-apples” comparisons with the same computational budget and constraints, no cherry-picking configurations, and transparent cost-performance trade-offs [00:07:14].
- Test Sets: Demand transparency, open-source data, clear methodologies and code, and no financial ties between benchmark creators and model companies [00:07:33]. Regular rotation of test questions is also needed to prevent overfitting [00:07:45].
- Metrics: Implement controls for style effects so that substance is measured rather than engagement [00:07:51]. All attempts should be made public to prevent cherry-picking the best run [00:07:59]; a small sketch of that idea follows this list.
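As a tiny illustration of the “publish every attempt” point (all numbers below are invented):

```python
# Report the distribution of evaluation runs, not just the single best one.
import statistics

runs = [71.2, 68.9, 69.4, 74.1, 70.3]   # hypothetical scores from five runs

cherry_picked = max(runs)               # what a press release tends to show
honest_report = {
    "mean": round(statistics.mean(runs), 1),
    "stdev": round(statistics.stdev(runs), 1),
    "all_runs": runs,
}
print(cherry_picked, honest_report)
```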
Progress is being made with LM Arena’s style-controlled rankings and the emergence of more independent, open-source benchmarks in specific domains, such as LegalBench, MedQA, fintech and agent evals, and BetterBench [00:08:11].
Building Evaluations That Actually Matter
Instead of chasing public benchmarks, companies should focus on building evaluations that truly matter for their specific use case [00:08:53]. This approach involves the following steps (a minimal eval-loop sketch follows the list):
- Gather Real Data: Use actual queries from your production system, as real user problems are more valuable than synthetic benchmarks [00:09:04].
- Choose Your Metrics: Select metrics (e.g., quality, cost, latency) that are relevant to your application; a chatbot needs different metrics than a medical diagnosis system [00:09:21].
- Test the Right Models: Don’t rely solely on leaderboards; test the top models on your specific data, as a model that tops generic benchmarks might fail on your unique legal documents [00:09:34].
- Systematize It: Establish consistent, repeatable evaluation processes, either by building in-house or using a platform [00:09:47].
- Keep Iterating: Make evaluation a continuous process, not a one-time event, as models improve and needs change [00:09:55].
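Here is a minimal sketch of such an eval loop; call_model, judge_quality, and prod_queries.jsonl are hypothetical placeholders for your own model API, your quality metric (exact match, rubric, LLM judge, human review), and your logged production queries.

```python
# Evaluate a model on real production queries against the metrics that
# matter for this application: quality, latency, and cost.
import json
import time

def evaluate(call_model, judge_quality, queries_path="prod_queries.jsonl"):
    results = []
    with open(queries_path) as f:
        for line in f:
            case = json.loads(line)            # {"query": ..., "expected": ...}
            start = time.perf_counter()
            answer, cost_usd = call_model(case["query"])
            results.append({
                "quality": judge_quality(answer, case["expected"]),
                "latency_s": time.perf_counter() - start,
                "cost_usd": cost_usd,
            })
    # Track these averages per release so regressions surface before deploy.
    n = len(results)
    return {k: sum(r[k] for r in results) / n
            for k in ("quality", "latency_s", "cost_usd")}
```

Re-running the same loop on each candidate model or prompt change keeps the evaluation consistent and repeatable, whether it lives in-house or on a platform.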
This continuous loop of identifying issues, building improvements, running evaluations before deployment, and monitoring after deployment is what separates teams that ship reliable AI from those constantly firefighting production issues [00:10:06]. It takes more effort than checking a leaderboard, but it is the only way to build AI that genuinely serves users [00:10:34].
Ultimately, the benchmark game is rigged because the stakes are so high: market caps, acquisitions, and developer mind share [00:10:45]. However, companies can choose to build evaluations that help them ship better products by measuring what matters to their users, not what sells on public leaderboards [00:10:54].