From: aidotengineer

The success of AI systems in the real world hinges on effective evaluations [00:01:57]. While traditional testing frameworks exist, unique patterns emerge with AI that necessitate a different approach to evaluation [00:00:38]. Getting evaluations [00:01:50] right is not just about catching bugs or measuring accuracy; it’s about building AI systems that truly deliver value [00:01:57].

What is an AI Evaluation?

An evaluation [00:02:09] in AI, similar to unit and integration testing in traditional software, is crucial before pushing changes to an AI application into production [00:02:11]. Many teams express uncertainty about what constitutes a robust evaluation framework [00:02:42].

To test quality before production, three key components are needed:

  1. Agent [00:02:55]: This is whatever is being evaluated [00:02:59], ranging from an end-to-end agent to a small function or retrieval pipeline [00:03:00]. Agents can be customer service chatbots, Q&A agents for legal contracts, or other complex systems [00:03:10]. Each agent has unique requirements, such as accuracy, compliance with regulations, explainability, or nuance around specific standards [00:03:19]. The evaluation [00:03:40] must account for all these aspects [00:03:43].
  2. Dataset [00:03:48]: Considered the most important component, this is what the agent is evaluated against [00:03:53]. A robust dataset must include both inputs (queries/requests the system will receive in production) and ideal outputs (what good responses should look like) [00:04:12]. It should cover not only the “happy path” but also tricky edge cases where things might go wrong [00:04:24]. These examples should ideally be written by domain experts who understand the business context and can define quality requirements [00:04:34].
  3. Evaluators [00:04:52]: This defines how quality is measured [00:04:55].

These components are dynamic: as the agent improves, the dataset must include more challenging cases, and the evaluation criteria must become more sophisticated [00:05:57].
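
To make the three components concrete, here is a minimal sketch of an evaluation harness in Python. Every name in it (TestCase, run_agent, evaluate, the example queries and ideal outputs) is a hypothetical stand-in for illustration, not something prescribed above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestCase:
    input: str          # a query/request the system will receive in production
    ideal_output: str   # what a good response looks like, written by a domain expert
    is_edge_case: bool = False

# Dataset: happy-path queries plus tricky edge cases where things might go wrong.
dataset = [
    TestCase("What is your refund policy?",
             "A correct summary of the 30-day refund policy, with no invented terms."),
    TestCase("Cancel my order, refund me, and escalate this to legal.",
             "Handles the multi-part request or routes it to a human agent.",
             is_edge_case=True),
]

def run_agent(query: str) -> str:
    """The 'agent' under test: an end-to-end chatbot, a single function,
    or a retrieval pipeline. Stubbed out here."""
    raise NotImplementedError

def evaluate(output: str, ideal: str) -> bool:
    """The 'evaluator': defines how quality is measured
    (a human grader, plain code, or an LLM judge)."""
    raise NotImplementedError

def run_eval(agent: Callable[[str], str], cases: List[TestCase]) -> float:
    """Score the agent against the dataset; returns the fraction of cases judged acceptable."""
    passed = sum(evaluate(agent(c.input), c.ideal_output) for c in cases)
    return passed / len(cases)
```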

Types of Evaluators

Traditional Human Evaluators

Traditionally, human evaluators, often subject matter experts, review outputs, score them, and provide feedback [00:04:57].

  • Pros: Provides nuanced judgment and understanding of complex contexts [00:05:26].
  • Cons: Very slow and expensive [00:05:05]. For example, processing a thousand test cases with human evaluators (e.g., via Mechanical Turk) could take a full day of work and cost hundreds of dollars [00:06:54].

Goal-Based Evaluators

These are great for objective metrics like response time or latency [00:05:09].
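
For objective metrics like latency, the evaluator can be a few lines of plain code. A minimal sketch, in which the 1.5-second budget and the run_agent function are illustrative assumptions:

```python
import time
from typing import Callable, Dict

LATENCY_BUDGET_S = 1.5  # assumed latency target for illustration, not a number from the talk

def latency_evaluator(run_agent: Callable[[str], str], query: str) -> Dict[str, object]:
    """Objective, code-based check: did the agent respond within the latency budget?"""
    start = time.perf_counter()
    _response = run_agent(query)  # response content is judged separately (e.g., by an LLM judge)
    elapsed = time.perf_counter() - start
    return {"latency_s": round(elapsed, 3), "passed": elapsed <= LATENCY_BUDGET_S}
```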

LLM Evaluators (Automated)

Large Language Model (LLM) evaluators have gained popularity because they promise to combine nuanced reasoning with the speed and scalability of automated systems [00:05:21]. They act as an “LLM-as-a-judge” [00:06:29].
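
Here is a minimal sketch of an LLM-as-a-judge evaluator, assuming the OpenAI Python SDK; the grading prompt, the binary PASS/FAIL scale, and the model choice are illustrative, not the exact setup described in the talk.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Ideal answer (written by a domain expert): {ideal}
Assistant's answer: {answer}

Reply with exactly one word: PASS if the answer meets the bar set by the
ideal answer, otherwise FAIL."""

def llm_judge(question: str, ideal: str, answer: str) -> bool:
    """Binary LLM-as-a-judge verdict for one test case."""
    response = client.chat.completions.create(
        model="gpt-4o",   # in practice, pin a specific model snapshot so judgments stay consistent
        temperature=0,    # reduce run-to-run variance in the judge itself
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, ideal=ideal, answer=answer)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

A binary verdict also makes it straightforward to compare the judge against human PASS/FAIL labels later on.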

Benefits of LLM Evaluators

  • Speed: Evaluations [00:06:43] that used to take 8-10 hours with human evaluators can now be completed in under an hour [00:06:46]. A thousand test cases could be processed in 50-60 minutes [00:07:07].
  • Cost Reduction: Costs can drop from several hundred dollars for human evaluation to around $120 for LLM evaluators, roughly a 10x reduction [00:07:21].
  • Consistency: LLM evaluators show over 80% consistency with human judgments [00:07:45]. This is comparable to the agreement levels seen between different human evaluators, who also don’t agree 100% of the time [00:07:55]. Research papers like NLG Eval and SPADE show strong correlations between human judgments and LLM scores [00:08:12].

Challenges with LLM Evaluators

Despite their benefits, LLM evaluators present two major problems:

  1. Criteria Drift [00:08:49]: This occurs when an evaluator’s notion of what is “good” no longer aligns with the user’s notion of “good” [00:10:41]. Popular frameworks like Ragas, Prompts, or LangChain often rely on built-in evaluation criteria that are designed for generalizability rather than for your specific use case [00:09:01].

    • Example: An AI startup building an LLM-based recommendation system for e-commerce websites found that while its evaluator checked standard boxes like context relevance and generation relevance, it missed what users actually counted as relevant in production [00:09:26]. The evaluator indexed too heavily on keyword relevance without considering the broader context of product descriptions or user queries [00:09:58].
    • Underlying Model Changes: Even if an LLM evaluator works fine on a single test case, its consistency can drop if the underlying LLM model changes (e.g., using an unstable version of OpenAI’s models) [00:10:22].
    • Research: The “EvalGen” paper by Shankar and team at Berkeley highlighted that evaluation criteria need to evolve over time, balancing true positives with false positives to maximize the F1 score of alignment against human judgments [00:10:50].
  2. Dataset Drift [00:11:19]: This problem refers to a lack of test coverage in datasets [00:11:21]. Hand-written test cases, perfect in theory, often fail to represent the messy, context-dependent inputs from real-world users in production [00:11:38].

    • Real-world Usage: Users constantly ask about broader topics, require real-world data (e.g., from search APIs), or combine multiple questions in unanticipated ways [00:11:53].
    • False Confidence: Metrics can still look good because the evaluator keeps scoring well against the outdated test cases, but those tests no longer represent reality [00:12:18].

Fixing Evaluation Problems: Iterative Alignment

The key insight to fixing these problems is that evaluators and datasets need to be iteratively aligned, similar to how an LLM application itself is aligned [00:12:59].

Here’s a three-step approach for effective evaluation [00:13:08]:

  1. Align your Evaluators with Domain Experts [00:13:19]:

    • Have domain experts regularly grade outputs and critique the evaluator’s results [00:13:25].
    • Use their critiques and few-shot examples in the evaluator prompt to ground it in a real-world notion of what is good or bad [00:13:36].
    • Continuously massage and iterate on the evaluator prompt rather than relying only on templated library metrics [00:13:46].
  2. Keep your Data Sets Aligned with Real-World User Queries [00:14:09]:

    • Treat your test bank as a living, breathing entity that is continuously fed from production logs [00:14:13].
    • Automatically (or manually) flow underperforming queries from production back into the test suite [00:14:19].
  3. Measure and Track Alignment Over Time [00:14:31]:

    • Use concrete metrics like F1 score for binary judgments or correlation coefficients for Likert scales [00:14:35].
    • Track how well your evaluator matches human judgment with every iteration to determine if it’s improving or regressing [00:14:44].
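
A minimal sketch of this measurement step, assuming binary PASS/FAIL judgments plus scikit-learn and SciPy; the labels and scores below are placeholder data, not real results.

```python
from sklearn.metrics import f1_score
from scipy.stats import spearmanr

# Binary judgments (PASS=1 / FAIL=0) on the same test cases -- placeholder data.
expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]   # domain experts (ground truth)
judge_labels  = [1, 1, 0, 0, 0, 1, 1, 1]   # LLM evaluator

alignment_f1 = f1_score(expert_labels, judge_labels)

# Likert-scale ratings (1-5): use a correlation coefficient instead.
expert_scores = [5, 4, 2, 5, 1, 3]
judge_scores  = [4, 4, 2, 5, 2, 3]
alignment_rho, _ = spearmanr(expert_scores, judge_scores)

# Log these numbers for every evaluator-prompt iteration; a falling score
# means the evaluator is regressing away from human judgment.
print(f"F1 vs. experts: {alignment_f1:.2f}  |  Spearman rho: {alignment_rho:.2f}")
```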

Practical Steps for Effective Evaluation

  • Customize the LLM Evaluator Prompt [00:15:14]: Instead of relying on templated metrics, carefully tailor your evaluation criteria [00:15:22]. Add few-shot examples of critiques provided by domain experts [00:15:25]. Decide between binary scales (highly recommended) and Likert scales for ratings [00:15:31]. Ensure you’re measuring something meaningful to your specific use case and business context [00:15:38].
  • Involve Domain Experts Early [00:15:50]: Get them to evaluate the evaluator [00:15:56]. Starting with 20 examples in a spreadsheet can provide a good sense of whether evaluator judgments align with domain experts, guiding future changes to the evaluator prompt [00:15:59].
  • Start with Logging [00:16:17]: Every time your system underperforms in production, it’s an opportunity to improve your test bank [00:16:26]. These real-world failures are “golden” because they highlight exactly where your evaluation system is falling short [00:16:41]. Continuously add these test cases and ground truth labels to your test bank [00:16:44] (a minimal sketch of this loop follows the list).
  • Iterate LLM Evaluator Prompts [00:16:55]: Evaluator prompts are not static; they need to evolve [00:16:58]. Test new versions against your expanding test bank and make them more specific to your use case [00:17:03].
  • Invest in an Eval Console [00:17:08]: Whether built internally or bought as a tool, an eval console lets domain experts iterate on the evaluator prompt and record whether they agree with its critiques and judgments [00:17:10].
  • Systematic Measurement [00:17:23]: Track alignment scores (F1 or correlation metrics) over time using a simple dashboard [00:17:26]. This systematic tracking informs whether your evaluator template is improving or not [00:17:44].
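
As a rough sketch of that logging loop, assuming a JSONL file as the test bank; the file name, record fields, and the add_production_failure helper are all hypothetical.

```python
import json
from datetime import datetime, timezone

TEST_BANK = "test_bank.jsonl"  # assumed location and format of the living test bank

def add_production_failure(query: str, bad_output: str, ideal_output: str) -> None:
    """Append an underperforming production case, plus its expert-written
    ground-truth label, so the next evaluator iteration is graded against it."""
    record = {
        "input": query,
        "observed_output": bad_output,
        "ideal_output": ideal_output,            # written or approved by a domain expert
        "source": "production_failure",
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(TEST_BANK, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```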

Ultimately, your LLM evaluations [00:18:07] are only as good as their alignment with real-world usage [00:18:10]. Avoid static evaluation [00:18:14] and build iterative feedback loops into your development process [00:18:24].