From: aidotengineer

AI benchmarks are designed to standardize test sets and metrics across different models, allowing for comparable evaluations, similar to how the SAT provides a standardized test for different students [00:01:04]. However, the integrity and trustworthiness of these benchmarks are frequently undermined by various practices.

Why Benchmarks Control Billions

Benchmarks significantly influence billions in market value, investment decisions, and public perception within the AI industry [00:01:22]. Simon Willison notes that billions of dollars in investment are now evaluated based on these scores [00:01:31]. When companies like OpenAI or Anthropic claim the top spot, it affects not only funding but also enterprise contracts, developer mindshare, and market dominance [00:01:40]. A single tweet from Andrej Karpathy about a benchmark can shape entire ecosystems [00:01:46]. The acquisition of AutoCodeRover by Sonar, for instance, was influenced by AutoCodeRover's strong results on SWE-Bench [00:01:53]. A system in which a single number can define market leaders creates high stakes, and high stakes lead companies to find "creative ways to win" [00:02:02].

Common Tricks Companies Use to Manipulate AI Benchmark Results

There are three primary ways companies game the benchmark system [00:00:35].

Trick 1: Apples-to-Oranges Comparisons

This trick involves comparing a company's best configuration against other models' standard configurations [00:02:22]. For example, xAI released benchmark results for Grok 3 showing it beating competitors [00:02:27]. It was later discovered that xAI had compared its optimal configuration (consensus@64, which runs the model 64 times and takes the consensus answer) against the standard, single-run configurations of other models [00:02:37]. A fair comparison pits best against best or standard against standard, not a selectively reported best against everyone else's standard [00:03:06].
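To make the compute gap concrete, here is a minimal Python sketch of the difference between a standard single-response run and a consensus@64 run. The `ask_model` function is a hypothetical stand-in for any model API, not a real library call; the point is that the consensus run spends roughly 64 times the compute, so its score is only comparable to other consensus@64 scores.

```python
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical stand-in for one sampled model call."""
    raise NotImplementedError

def standard_run(question: str) -> str:
    # Standard configuration: one sample, one answer.
    return ask_model(question)

def consensus_run(question: str, n: int = 64) -> str:
    # "Consensus@64": sample 64 answers and keep the most common one.
    # This costs ~64x the standard run above, so scores from the two
    # configurations should never be lined up against each other.
    answers = [ask_model(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```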

Trick 2: Privileged Access to Test Questions

Another, more controversial trick is gaining privileged access to benchmark test questions [00:03:21]. FrontierMath, a benchmark intended to be a super-secret, impossible-to-game evaluation of advanced mathematics, was funded by OpenAI [00:03:27]. That funding granted OpenAI access to the entire dataset [00:03:40]. While there was a verbal agreement not to train on the data, and OpenAI employees publicly described it as a "strongly held out evaluation set," the optics alone create a trust problem [00:03:47]. A company funding a benchmark, seeing all the questions, evaluating internally, and announcing scores before independent verification undermines trust [00:03:57]. When benchmark creators accept money from the companies they evaluate, the integrity of the entire system is jeopardized [00:04:20].

Trick 3: Optimizing for Style Over Substance

This subtle trick involves models optimizing for engaging style rather than factual accuracy [00:04:35]. For instance, Meta entered 27 tweaked versions of Llama 4 Maverick into LM Arena, each optimized for appeal rather than accuracy [00:04:40]. One private version, when asked for a riddle with the answer 3.145, provided an emoji-filled, nonsensical but flattering response that beat Claude’s correct answer because it was “chatty and engaging” [00:04:53]. This indicates that companies are “literally training models to be wrong, but charming” [00:05:09].

Researchers at LM Arena have shown that filtering out style effects (length, formatting, personality) can change rankings dramatically: models like GPT-4o Mini and Grok 2 drop, while Claude 3.5 Sonnet ties for first [00:05:14]. In other words, benchmarks often measure charm rather than accuracy, akin to choosing a surgeon for bedside manner instead of surgical skill [00:05:30]. The same issue affects the human SAT, where essay length accounts for 39% of score variance [00:05:40]. The industry currently prioritizes measuring what sells over what truly matters [00:05:50].
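The idea behind style-controlled rankings can be illustrated with a toy sketch (this is not LM Arena's actual code): fit a Bradley-Terry-style logistic regression on pairwise battles that includes a style covariate such as response-length difference, so the per-model strength coefficients are estimated after accounting for length. The battle data below is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy battle log: (winner_index, loser_index, winner_length, loser_length)
models = ["model_a", "model_b", "model_c"]
battles = [(0, 1, 1200, 300), (0, 2, 1500, 400), (2, 1, 350, 320)]

X, y = [], []
for winner, loser, w_len, l_len in battles:
    row = np.zeros(len(models) + 1)
    row[winner], row[loser] = 1.0, -1.0        # Bradley-Terry model indicators
    row[-1] = np.log(w_len) - np.log(l_len)    # style covariate: length difference
    X.append(row);  y.append(1)                # winner beat loser
    X.append(-row); y.append(0)                # mirrored row keeps the fit symmetric

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = dict(zip(models, clf.coef_[0][:len(models)]))  # style-adjusted strengths
length_bias = clf.coef_[0][-1]                 # positive => longer answers win more
print(strengths, length_bias)
```

With real battle data, refitting without the style covariate shows how much of a model's apparent lead comes from style rather than substance.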

The Fundamental Problem: Goodhart’s Law and the AI Evaluation Crisis

These issues stem from Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure" [00:05:57]. Benchmarks have become targets worth billions, and those incentives all but guarantee that they no longer measure what truly matters [00:06:09].

Experts acknowledge this “evaluation crisis” [00:06:27]:

  • Andrej Karpathy (Co-founder of OpenAI): “My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now” [00:06:24].
  • John Yang (Creator of SWE-Bench): "It's sort of like we kind of just made these benchmarks up" [00:06:39].
  • Maarten Sap (CMU): "The yardsticks are like pretty fundamentally broken" [00:06:44].

When the creators of benchmarks and leaders in AI express such doubts about the metrics, it signals a serious problem [00:06:53].

How to Fix Public Metrics and Build Trust

Restoring trust in public metrics requires improvements in three areas: how models are compared, how test sets are handled, and which metrics are reported [00:07:06].

Model Comparisons

  • Require Apples-to-Apples Comparisons: Mandate the same computational budget, constraints, and prohibit cherry-picking configurations [00:07:14].
  • Show Cost-Performance Trade-offs: Transparently display these trade-offs, as demonstrated by initiatives like the Arc Prize [00:07:23].
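As a sketch of what apples-to-apples reporting could look like (a hypothetical record structure with made-up numbers, not any benchmark's real schema), each published score carries its configuration and sampling budget, and runs are only ranked against runs with the same budget:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    model: str
    config: str              # e.g. "single sample" or "consensus@64"
    samples_per_task: int    # sampling budget used to produce the score
    score: float             # benchmark accuracy
    cost_per_task_usd: float

def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    # Apples-to-apples: only rank runs with the same configuration and budget.
    return a.config == b.config and a.samples_per_task == b.samples_per_task

runs = [
    BenchmarkRun("model_x", "consensus@64", 64, 0.93, 1.92),   # made-up numbers
    BenchmarkRun("model_y", "single sample", 1, 0.89, 0.03),   # made-up numbers
]
if not comparable(runs[0], runs[1]):
    print("Refusing to rank: different configurations or sampling budgets.")
for r in runs:
    print(f"{r.model:10s} {r.config:14s} score={r.score:.2f} cost=${r.cost_per_task_usd:.2f}/task")
```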

Test Sets

  • Transparency and Open Source: Open source the data, methodologies, and code [00:07:33].
  • No Financial Ties: Ensure no financial connections exist between benchmark creators and model companies [00:07:39].
  • Regular Rotation: Routinely rotate test questions to prevent overfitting [00:07:45].

Metrics

  • Control for Style Effects: Develop methods to measure substance and accuracy, not just engagement [00:07:50].
  • Public Reporting of All Attempts: Require all evaluation attempts to be publicly disclosed to prevent cherry-picking the best run [00:07:59].

Progress and Independent Efforts

Some progress is being made:

  • LM Arena’s style-controlled rankings offer the ability to mitigate style effects [00:08:11].
  • More independent benchmarks are emerging in specific domains, such as Legal Bench, MedQA, and FinTech [00:08:20].
  • Cross-cutting efforts like Agent Eval and Better Bench are working to benchmark benchmarks themselves [00:08:34].

Building Evaluations That Actually Matter

Instead of chasing rigged public benchmarks, a better approach is to build a set of evaluations tailored to your specific use case [00:08:42]. Effective evaluation strategies focus on real-world utility rather than leaderboard placement.

Steps for Effective Evaluation:

  1. Gather Real Data: Use actual queries from production systems. Five real user problems are more valuable than 100 academic questions [00:09:04].
  2. Choose Your Metrics: Select metrics that are crucial for your application, such as quality, cost, or latency [00:09:21].
  3. Test the Right Models: Don’t rely solely on leaderboards. Test the top models against your specific data, as generic top performers may fail on specialized documents [00:09:33].
  4. Systematize It: Implement consistent, repeatable evaluation processes, either by building them internally or using platforms like Scorecard; a minimal harness is sketched after this list [00:09:46].
  5. Keep Iterating: Make evaluation a continuous process, adapting as models improve and needs change [00:09:54].
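Here is a minimal sketch of steps 1-4, assuming hypothetical `call_model` and `judge_quality` helpers that wrap your own model APIs and quality rubric (nothing here is a real library API):

```python
import time

def call_model(model: str, query: str) -> tuple[str, float]:
    """Return (answer, cost_in_usd) for one query. Placeholder for your API wrapper."""
    raise NotImplementedError

def judge_quality(query: str, answer: str, expected: str) -> float:
    """Return a 0-1 quality score (exact match, rubric, or LLM judge). Placeholder."""
    raise NotImplementedError

def evaluate(model: str, eval_set: list[dict]) -> dict:
    quality, cost, latency = [], [], []
    for case in eval_set:                       # real production queries, not academic ones
        start = time.perf_counter()
        answer, usd = call_model(model, case["query"])
        latency.append(time.perf_counter() - start)
        cost.append(usd)
        quality.append(judge_quality(case["query"], answer, case["expected"]))
    n = len(eval_set)
    return {"model": model,
            "quality": sum(quality) / n,        # the metrics you chose in step 2
            "cost_usd": sum(cost) / n,
            "latency_s": sum(latency) / n}

# Step 3: test the top candidate models against your own data, not the leaderboard.
# results = [evaluate(m, eval_set) for m in ["model_x", "model_y"]]
```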

This approach creates a continuous cycle: identify issues, build improvements, run evaluations before deployment, get feedback, and only deploy when quality bars are met [00:10:05]. This pre-deployment evaluation loop is crucial for shipping reliable AI and avoiding constant firefighting of production issues [00:10:25]. While more work than checking a leaderboard, it is the only way to build AI that truly serves users [00:10:34].
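A minimal sketch of that pre-deployment gate, reusing the result dictionary produced by `evaluate()` in the sketch above; the thresholds are placeholder assumptions to tune for your product:

```python
# Hypothetical quality gate run before every deployment.
QUALITY_BAR = 0.90      # assumed thresholds; tune to your own quality bar
MAX_COST_USD = 0.05
MAX_LATENCY_S = 2.0

def ready_to_ship(result: dict) -> bool:
    # Only deploy when quality clears the bar without regressing cost or latency.
    return (result["quality"] >= QUALITY_BAR
            and result["cost_usd"] <= MAX_COST_USD
            and result["latency_s"] <= MAX_LATENCY_S)

# if all(ready_to_ship(r) for r in results): deploy(); otherwise iterate and re-run.
```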

Ultimately, the benchmarks game is rigged because the stakes are so high [00:10:45]. However, companies can opt out of this game by building evaluations that prioritize user needs and product improvement [00:10:54]. The key is to measure what matters to your users, not what generates buzz on social media [00:11:00]. As the saying goes, "All benchmarks are wrong, but some are useful. The key is knowing which ones" [00:11:06].