From: aidotengineer

Effective evaluation is crucial for building AI systems that deliver real-world value, moving beyond mere “fancy demos” [02:01:01]. Just as traditional software requires unit and integration testing before changes are pushed to production, AI applications require robust evaluations [02:11:00]. A key insight is that evaluators and data sets must be iteratively aligned, similar to how an LLM application itself is aligned [12:59:00].

Fundamentals of Evaluation

To effectively test the quality of an AI application before production, three key components are needed:

  1. Agent: This is whatever system or component is being evaluated [02:55:00]. This could range from an end-to-end AI agent, a small function within an agent, or even a retrieval pipeline [02:59:00]. Examples include customer service chatbots or Q&A agents processing legal contracts [03:10:00]. Each agent has unique requirements, such as accuracy, compliance, explainability, or nuance specific to its domain [03:19:00].
  2. Data Set: This component defines what the agent is evaluated against [03:48:00]. It should include both the expected inputs (queries and requests the system will receive in production) and the ideal outputs (what good responses should look like) [04:12:00]. Crucially, the data set must cover not just the “happy path” but also tricky edge cases where things might go wrong [04:26:00]. These examples should be written by domain experts who understand the business context and can define the requirements for the agent [04:34:00]. Many teams “stumble” by relying on a few handwritten test cases that don’t cover all use cases [03:57:00].
  3. Evaluators: This refers to how quality is measured [04:52:00].
    • Human Evaluators: Traditionally, subject matter experts review outputs, score them, and provide feedback. While this works, it is very slow and expensive [04:57:00].
    • Code-based Evaluators: Effective for objective metrics like response time or latency [05:09:00].
    • LLM Evaluators: These have gained popularity because they promise to combine nuanced reasoning with the speed and scalability of automated systems [05:21:00]. They offer significant advantages in speed (evaluations that take humans 8-10 hours can be completed in under an hour) [06:43:00], cost (up to a 10x reduction compared to human evaluation) [07:19:00], and consistency (over 80% agreement with human judgments) [07:41:00]. Research papers like NLG Eval and SPADE show strong correlations between human judgments and LLM scores [08:09:00]. A minimal sketch of such an evaluator follows this list.
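
For illustration, here is a minimal LLM-as-judge sketch in Python. It assumes the OpenAI Python client; the judge model, the criteria text, and the score_response helper are placeholders invented for this example, not an implementation prescribed by the talk.

```python
# Minimal LLM-as-judge sketch. The model name, criteria, and helper are
# illustrative placeholders, not a prescribed implementation.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a customer-service chatbot response.
Criteria: the response must be factually consistent with the provided context,
address the user's actual question, and comply with the refund policy.

User query: {query}
Retrieved context: {context}
Agent response: {response}

Answer with a single word, PASS or FAIL, followed by a one-sentence critique."""

def score_response(query: str, context: str, response: str) -> tuple[bool, str]:
    """Return (passed, critique) as judged by the evaluator model."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever judge model you standardize on
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, context=context, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip()
    return verdict.upper().startswith("PASS"), verdict
```

The binary PASS/FAIL output keeps verdicts easy to compare against human labels later, which matters once alignment is measured with F1.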

These three components are dynamic and must evolve over time. As an agent improves, its data set may need to include more challenging cases, and evaluation criteria may become more sophisticated, requiring different kinds of evaluators [05:57:00].

Challenges with LLM Evaluators

Despite their advantages, LLM evaluators face two major problems:

Criteria Drift

This occurs when an evaluator’s notion of “good” no longer aligns with the user’s notion of “good” [08:49:00]. Generic evaluation criteria taken from popular frameworks (such as Ragas, Promptfoo, or LangChain) may not measure what matters for a unique use case [09:10:00]. For example, an e-commerce recommendation system’s evaluator might focus too heavily on keyword relevance and miss the broader context of user intent, leading to user complaints in production despite good test scores [09:21:00]. Drift can also happen when the underlying LLM behind the evaluator changes, leading to inconsistent grading [10:30:00]. The concept of criteria drift is explored in the “EvalGen” paper by Shankar and team at Berkeley, which argues that evaluation criteria must evolve over time, balancing true positives against false positives to maximize the evaluator’s F1 score against human judgments [10:50:00].

Data Set Drift

This problem arises when data sets lack sufficient test coverage: the test cases no longer represent real-world usage patterns [11:19:00]. Hand-crafted test cases that look complete at first fail to hold up once real users send messy, context-dependent, or multi-faceted inputs that were never anticipated [11:27:00]. Even if metrics look good on the static test cases, the system can underperform significantly in production because the tests do not reflect reality [12:15:00].

The Solution: Iterative Alignment

The fundamental insight for fixing both problems is that evaluators and data sets need to be iteratively aligned, similar to how an LLM application itself is aligned [12:59:00]. This insight underpins the methods and metrics described below.

A three-step approach for achieving this alignment involves:

  1. Align Evaluators with Domain Experts: Have domain experts regularly grade outputs and critique the evaluator’s results [13:19:00]. Use their feedback and few-shot examples of critiques to refine the evaluator prompt, grounding its understanding of “good” and “bad” in the real world [13:38:00]. Continuous iteration on the evaluator prompt is necessary [13:47:00].
  2. Keep Data Sets Aligned with Real-World User Queries: Treat the test bank as a “living, breathing thing” [14:09:00]. When the system underperforms in production, automatically flow those underperforming queries back into the test suite [14:19:00]. These real-world failures are invaluable for improving the test bank and identifying where the evaluation system falls short [16:30:00].
  3. Measure and Track Alignment Over Time: Use concrete metrics such as F1 score (for binary judgments) or correlation coefficients (for Likert scales) to track how well the evaluator matches human judgment with every iteration [14:31:00]. This shows whether the evaluator is truly improving or regressing [14:50:00]. A small sketch of these alignment metrics follows this list.
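
To make step 3 concrete, here is a small sketch of the alignment metrics, assuming you have paired human and evaluator labels for the same outputs. The use of scikit-learn and SciPy, and the sample data, are assumptions for illustration, not tool choices from the talk.

```python
# Evaluator-human alignment metrics. Library choices and data are illustrative.
from sklearn.metrics import f1_score
from scipy.stats import spearmanr

def binary_alignment(human: list[bool], evaluator: list[bool]) -> float:
    """F1 of the evaluator's PASS/FAIL verdicts against human ground truth."""
    return f1_score(human, evaluator)

def likert_alignment(human: list[int], evaluator: list[int]) -> float:
    """Spearman correlation for 1-5 Likert scores (rank-based, so an evaluator
    that is consistently harsher or softer than humans is not unfairly penalized)."""
    rho, _p_value = spearmanr(human, evaluator)
    return rho

# Example: compare two evaluator-prompt versions on the same expert-graded sample.
human_labels = [True, False, True, True, False, True]
evaluator_v1 = [True, True,  True, False, False, True]
evaluator_v2 = [True, False, True, True,  False, False]
for name, preds in [("v1", evaluator_v1), ("v2", evaluator_v2)]:
    print(name, round(binary_alignment(human_labels, preds), 2))  # v1: 0.75, v2: 0.86
```

Tracking these numbers for every new evaluator-prompt version is what turns “the evaluator seems better” into a measurable claim.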

While this sounds like significant work, it is “far less work than dealing with the consequences of a meaningless eval that doesn’t really tell you anything” [14:56:00].

Practical Steps for Effective Evaluation Alignment

  • Customize the LLM Evaluator Prompt: Instead of relying on templated metrics, carefully tailor the evaluation criteria. Add few-shot examples of critiques from domain experts and decide whether to use binary or Likert scales (binary is highly recommended) [15:11:00]. Ensure that the metrics measure what is truly meaningful to the specific use case and business context [15:38:00]. A sketch of such a customized prompt appears after this list.
  • Involve Domain Experts Early: Get domain experts to “evaluate the evaluator” by reviewing its judgments [15:52:00]. Starting with as few as 20 examples in spreadsheets can provide a good sense of alignment and inform necessary changes [15:59:00].
  • Log and Continuously Improve the Test Bank: Every production underperformance is an opportunity to improve the test bank [16:26:00]. Continuously add these real-world failure cases, with their ground-truth labels, to the test bank [16:43:00] (see the logging sketch after this list).
  • Iterate LLM Evaluator Prompts: Evaluator prompts are not static; they need to evolve [16:55:00]. Test new versions against the expanding test bank, making them more specific to the use case [17:01:00]. Investing in an “eval console” or similar tool allows domain experts to directly iterate on prompts and gauge agreement with evaluator critiques [17:08:00].
  • Measure Alignment Systematically: Set up a simple dashboard to track alignment scores (F1 or correlation metrics) over time [17:23:00]. This systematic measurement reveals whether the evaluator template is improving or not [17:47:00].
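
As a sketch of the first bullet, a customized evaluator prompt with domain-expert critiques and a binary verdict might look like the following. The few-shot examples, policy details, and criteria are invented for illustration; in practice they should come verbatim from your experts’ graded outputs.

```python
# Evaluator prompt customized with few-shot critiques from domain experts.
# The example critiques, policy details, and criteria are illustrative placeholders.
FEW_SHOT_CRITIQUES = [
    {
        "response": "You can return the item within 90 days for a full refund.",
        "verdict": "FAIL",
        "critique": "Policy allows 30 days, not 90; the answer is confidently wrong.",
    },
    {
        "response": "Returns are accepted within 30 days with a receipt. "
                    "Would you like me to start a return for your order?",
        "verdict": "PASS",
        "critique": "Accurate policy details and a helpful next step.",
    },
]

def build_judge_prompt(query: str, response: str) -> str:
    """Assemble a binary-verdict evaluator prompt grounded in expert critiques."""
    examples = "\n\n".join(
        f"Response: {ex['response']}\nVerdict: {ex['verdict']}\nCritique: {ex['critique']}"
        for ex in FEW_SHOT_CRITIQUES
    )
    return (
        "You evaluate customer-support answers for policy accuracy and helpfulness.\n"
        "Grade PASS or FAIL (binary only) and give a one-sentence critique.\n\n"
        f"Graded examples from our support leads:\n{examples}\n\n"
        f"Now grade this case.\nUser query: {query}\nResponse: {response}\nVerdict:"
    )
```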

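For the test-bank bullet, the logging itself can be simple: whenever production monitoring flags an underperforming query, append it, together with the expert-written ground truth, to a versioned test file. The JSONL layout, file path, and field names below are assumptions, not a prescribed schema.

```python
# Append flagged production failures to a living test bank (JSONL file).
# The file path, schema, and flagging mechanism are illustrative assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

TEST_BANK = Path("test_bank.jsonl")

def add_failure_case(query: str, bad_response: str, expected: str, notes: str) -> None:
    """Record a real-world failure plus expert ground truth as a new test case."""
    record = {
        "added_at": datetime.now(timezone.utc).isoformat(),
        "source": "production_failure",
        "input": query,
        "observed_output": bad_response,
        "expected_output": expected,   # written by a domain expert
        "expert_notes": notes,
    }
    with TEST_BANK.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```
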
Ultimately, LLM evaluations are only as good as their alignment with real-world usage [18:07:00]. It’s crucial to avoid “static evaluation” and instead build iterative feedback loops into the development process. The goal is continuous improvement, not perfection [18:00:00].