From: aidotengineer
Public benchmarks for AI systems steer billions of dollars in market value and shape public perception, yet they are prone to manipulation and can be misleading [00:00:01]. This environment has created an evaluation crisis in which even experts question which metrics to trust [06:27:00]. Instead of chasing potentially “rigged” public benchmarks, the focus should shift to building custom evaluations tailored to specific use cases [08:53:00].
Understanding Benchmarks
A benchmark is composed of three main elements [00:48:00]:
- Model: The AI system being tested [00:50:00].
- Test Set: A collection of questions or data used for testing [00:52:00].
- Metric: How the score is computed [00:55:00].
The key insight is that benchmarks standardize the test set and metrics across various models to enable comparability, similar to how the SAT uses the same questions and scoring system for different test-takers [01:04:00].
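To make this concrete, here is a minimal Python sketch of those three parts; the test questions, metric, and model interface are illustrative placeholders, not any particular benchmark’s implementation:

```python
# Minimal sketch of a benchmark: a fixed test set and a fixed metric,
# applied identically to every model so that scores are comparable.

from typing import Callable, List, Tuple

# A "model" here is just any callable that maps a question to an answer.
Model = Callable[[str], str]

# Fixed test set: (question, reference answer) pairs shared by all models.
TEST_SET: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def exact_match(prediction: str, reference: str) -> float:
    """Fixed metric: 1.0 if the answer matches the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_benchmark(model: Model) -> float:
    """Apply the same test set and the same metric to any model."""
    scores = [exact_match(model(q), ref) for q, ref in TEST_SET]
    return sum(scores) / len(scores)

# Scores are comparable because only the model changes between runs:
# score_a = run_benchmark(model_a)
# score_b = run_benchmark(model_b)
```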
The Flaws of Public AI Benchmarks
When billions of dollars in investment, enterprise contracts, and developer mind share hinge on benchmark scores, there’s a strong incentive for companies to manipulate the system [01:22:00]. Common “tricks” include:
- Apples-to-Oranges Comparisons: Companies may compare their best, often more expensive, configurations against competitors’ standard ones, selectively reporting results to appear superior [02:22:00]. For example, xAI released benchmarks for Grok 3 showing it beating competitors, but it was comparing its best configuration (consensus@64, i.e., running the model 64 times and taking the majority answer) against other models’ standard configurations [02:49:00]; see the sketch after this list.
- Privileged Access to Test Questions: Organizations funding benchmarks may gain early or full access to test data, allowing them to evaluate their models internally and announce scores before independent verification [03:23:00]. This creates a trust problem, even if no explicit training on the data occurs [04:17:00].
- Optimizing for Style Over Substance: Models can be trained to produce chatty, engaging, or flattering responses that appeal to evaluators, even if the content is incorrect [04:35:00]. Research shows that filtering out style effects (length, formatting, personality) can completely change model rankings, indicating that often, “we’re measuring which model is most charming” instead of most accurate [05:15:00].
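As a rough illustration of the apples-to-oranges issue above, the sketch below contrasts a single-attempt score with a consensus@N (majority-vote) score; the toy model and grading are hypothetical and only meant to show why the two configurations are not comparable:

```python
import random
from collections import Counter
from typing import Callable

def single_attempt_correct(model: Callable[[str], str], question: str, answer: str) -> bool:
    """Standard configuration: one sample, graded as-is."""
    return model(question) == answer

def consensus_at_n(model: Callable[[str], str], question: str, answer: str, n: int = 64) -> bool:
    """Consensus@N-style configuration: sample n times and grade the majority answer.

    This costs roughly n times more compute, so comparing it against other
    models' single attempts is not an apples-to-apples comparison.
    """
    votes = Counter(model(question) for _ in range(n))
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == answer

# Toy model that is right only ~40% of the time per sample, with wrong answers
# scattered across many digits.
def noisy_model(question: str) -> str:
    return "4" if random.random() < 0.4 else str(random.randint(0, 9))

# A single attempt is usually wrong, but consensus_at_n(noisy_model, "What is 2 + 2?", "4")
# almost always returns True, which is why the two setups should not share a leaderboard.
```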
These issues highlight Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure” [06:03:00]. Experts like Andrej Karpathy and John Yang acknowledge an evaluation crisis and that current benchmarks are “pretty fundamentally broken” [06:27:00].
Principles for Fixing Public AI Metrics
To improve public metrics, it’s necessary to address their three components [07:06:00]:
- Model Comparisons: Require “apples-to-apples” comparisons under the same computational budget and constraints, prohibit cherry-picking, and transparently report cost-performance trade-offs [07:14:00].
- Test Sets: Demand transparency through open-sourced data, methodologies, and code, with no financial ties between benchmark creators and model companies [07:33:00]. Regular rotation of test questions is also crucial to prevent overfitting [07:45:00].
- Metrics: Implement controls for style effects to measure substance over engagement [07:51:00]. All attempts should be publicly disclosed to prevent cherry-picking the best run [07:59:00].
Progress is being made with style-controlled rankings in platforms like LM Arena and the emergence of independent, open-source benchmarks in specific domains like legal tech, health tech, and finance [08:11:00].
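As a simplified illustration of what style control means (not LM Arena’s actual method), the sketch below compares a raw pairwise win rate with a crude length-controlled one; the comparison records and the length threshold are assumptions made for the example:

```python
from typing import Dict, List

# Each record: a pairwise comparison between model A and B, the judge's
# preferred model name, and the two response lengths (a crude proxy for style).
Comparison = Dict[str, object]

def raw_win_rate(comparisons: List[Comparison], model: str) -> float:
    """Raw win rate: confounded by style, since longer or chattier answers often win."""
    wins = sum(1 for c in comparisons if c["winner"] == model)
    return wins / len(comparisons)

def length_controlled_win_rate(comparisons: List[Comparison], model: str,
                               max_length_gap: int = 100) -> float:
    """Crude style control: only count comparisons where both responses are of
    similar length, so verbosity alone cannot decide the outcome."""
    matched = [c for c in comparisons
               if abs(c["len_a"] - c["len_b"]) <= max_length_gap]
    wins = sum(1 for c in matched if c["winner"] == model)
    return wins / len(matched) if matched else float("nan")

# If a model's ranking drops sharply once length is controlled for, the raw
# metric was rewarding charm rather than substance.
```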
Building Custom, Effective AI Evaluations
Instead of relying on flawed public benchmarks, organizations should create a set of evaluations that genuinely matter for their specific use case [08:55:00].
Five Steps to Effective Custom Evaluations
- Gather Real Data: Focus on actual queries from your production system. Five real user problems are significantly more valuable than 100 academic questions, as they reflect genuine needs [09:04:00].
- Choose Your Metrics: Define what truly matters for your application. This might include quality, cost, or latency. A chatbot, for example, requires different metrics than a medical diagnosis system [09:21:00].
- Test the Right Models: Do not solely depend on public leaderboards. Test the top five models directly on your specific data. A model like GPT-4 might excel on generic benchmarks but fail when applied to your unique legal documents [09:34:00].
- Systematize It: Establish consistent and repeatable evaluation processes, built either in-house or on a dedicated platform like Scorecard [09:47:00]; a minimal harness sketch follows this list.
- Keep Iterating: Recognize that models evolve and business needs change. Evaluation should be a continuous process, not a one-time event [09:55:00].
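The sketch below strings the first four steps together into a minimal evaluation harness; the queries, the quality check, and the model clients are hypothetical placeholders rather than any specific platform’s API:

```python
# Minimal sketch of a custom evaluation harness following the five steps above.

import time
from typing import Callable, Dict, List

# Step 1: real production queries, each paired with what a good answer must contain.
REAL_QUERIES: List[Dict[str, str]] = [
    {"query": "Summarize clause 4.2 of the attached NDA.", "must_mention": "confidentiality"},
    {"query": "Is a verbal agreement enforceable here?", "must_mention": "jurisdiction"},
]

# Step 2: the metrics that matter for this application (quality and latency here).
def quality(answer: str, must_mention: str) -> float:
    return 1.0 if must_mention.lower() in answer.lower() else 0.0

def evaluate(call_model: Callable[[str], str]) -> Dict[str, float]:
    """Step 3: run the same real queries against a candidate model."""
    quality_scores, latencies = [], []
    for item in REAL_QUERIES:
        start = time.time()
        answer = call_model(item["query"])
        latencies.append(time.time() - start)
        quality_scores.append(quality(answer, item["must_mention"]))
    return {
        "avg_quality": sum(quality_scores) / len(quality_scores),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

# Steps 4 and 5: systematize by running evaluate() for every candidate model on a
# schedule, storing the results, and comparing runs as models and needs change, e.g.:
# results = {name: evaluate(client) for name, client in candidate_models.items()}
```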
The Continuous Evaluation Workflow
A robust evaluation strategy integrates seamlessly into the AI development lifecycle [10:06:00]:
- Identify Issues: Pinpoint problems within your AI system.
- Build Improvements: Develop solutions based on identified issues.
- Run Evaluations Before Deployment: Critically, evaluations are performed before new versions are deployed to production [10:12:00].
- Continuous Cycle: This creates a cycle where you run evaluations, get feedback, improve the model, and only deploy once your quality bar is met. Post-deployment, monitoring continues, leading to further iterations [10:16:00].
This pre-deployment evaluation loop is crucial for teams that consistently ship reliable AI, preventing constant firefighting of production issues [10:25:00]. While this approach demands more effort than merely checking a leaderboard, it is the only way to build AI that truly serves its users [10:34:00].
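One way to encode that quality bar is a simple pre-deployment gate like the sketch below, which assumes an evaluate() function like the one sketched earlier and team-chosen thresholds; the names and numbers are illustrative, not a specific platform’s API:

```python
# Pre-deployment gate: run evals, compare against the quality bar, and only
# deploy when the bar is met; otherwise iterate and re-run.

QUALITY_BAR = {"avg_quality": 0.90, "avg_latency_s": 2.0}

def meets_quality_bar(results: dict) -> bool:
    """Quality must be at or above the bar; latency at or below it."""
    return (results["avg_quality"] >= QUALITY_BAR["avg_quality"]
            and results["avg_latency_s"] <= QUALITY_BAR["avg_latency_s"])

def release(candidate_results: dict) -> None:
    if meets_quality_bar(candidate_results):
        print("Quality bar met -> deploy, then keep monitoring in production.")
    else:
        print("Quality bar not met -> improve the system and re-run the evals.")
```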
Conclusion
The benchmarks game is heavily influenced by high stakes like market capitalization, acquisitions, and developer mind share [10:45:00]. However, companies don’t have to participate in this rigged system. By building custom evaluations that measure what matters to your users rather than what generates buzz, organizations can ship superior products and achieve effective AI implementation [10:55:00]. As the saying goes, “All benchmarks are wrong, but some are useful. The key is knowing which ones” [11:04:00].