From: aidotengineer

Traditional software development relies on static tests, but AI evaluations must be dynamic and continuously aligned with real-world usage to build effective AI systems [00:01:50], [00:18:10]. This requires iterative alignment of evaluators and datasets, much like aligning an LLM application itself [00:13:01].

Fundamentals of AI Evaluation

An evaluation in AI functions similarly to unit or integration testing in traditional software, preventing the deployment of changes without proper verification [00:02:09], [00:02:17]. To test quality before production, three key components are essential [00:02:49]:

  1. Agent: The agent is the system or component being evaluated, which could be an end-to-end agent, a specific function, or a retrieval pipeline [00:02:55], [00:03:08]. Different agents, such as customer service chatbots or Q&A systems for legal contracts, have unique requirements regarding accuracy, compliance, explainability, and nuance that evaluations must account for [00:03:19], [00:03:43].

  2. Dataset: The dataset is crucial as it defines what the agent is evaluated against [00:03:48], [00:03:51]. It must include both the types of queries and requests the system will receive in production (inputs) and the ideal responses (outputs) [00:04:12]. Importantly, datasets need to cover not only typical “happy paths” but also tricky edge cases where things might go wrong [00:04:24], [00:04:26]. These examples should ideally be written by domain experts who understand the business context and quality requirements [00:04:34].

  3. Evaluators: Evaluators determine how quality is measured [00:04:52], [00:04:55].

    • Human Evaluators: Traditionally, subject matter experts review outputs, score them, and provide feedback, though this is slow and expensive [00:04:57], [00:05:05].
    • Code-Based Evaluators: Suitable for objective metrics like response time or latency [00:05:09].
    • LLM Evaluators: Promise to combine the nuance of human reasoning with the speed and scalability of automated systems [00:05:21], [00:05:35].

These components are dynamic and must evolve as the AI system improves [00:05:57]. An improving agent requires more challenging test cases, and increasingly sophisticated evaluation criteria may necessitate different types of evaluators [00:06:02], [00:06:08].
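
To make these three components concrete, here is a minimal sketch, assuming Python; the names (`TestCase`, `Evaluator`, `run_eval`) and the pass/fail scoring convention are illustrative assumptions, not something prescribed in the talk:

```python
# A minimal sketch of the three components: the agent under test, a dataset of
# input/ideal-output pairs, and an evaluator that scores each response.
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class TestCase:
    input: str                  # a query the system could receive in production
    ideal_output: str           # a domain expert's notion of a good response
    is_edge_case: bool = False  # cover tricky cases, not just happy paths


class Evaluator(Protocol):
    def score(self, case: TestCase, actual_output: str) -> float:
        """Return a quality score, e.g. 1.0 = pass, 0.0 = fail."""
        ...


Agent = Callable[[str], str]  # anything that maps a query to a response


def run_eval(agent: Agent, dataset: list[TestCase], evaluator: Evaluator) -> float:
    """Run the agent over the dataset and return the mean evaluator score."""
    scores = [evaluator.score(case, agent(case.input)) for case in dataset]
    return sum(scores) / len(scores)
```

Swapping out the evaluator (human review, a code check on latency, or an LLM judge) leaves the rest of this loop unchanged, which is what allows each component to evolve independently.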

The Rise of LLM Evaluators

LLM evaluators have become popular due to their compelling advantages [00:06:19], [00:06:22]:

  • Speed: Evaluations that previously took 8-10 hours with human evaluators can now be completed in under an hour; a full day’s work on Mechanical Turk can be done in 50-60 minutes with an LLM evaluator [00:06:43], [00:07:00], [00:07:07]. This is a huge improvement, not merely an incremental one [00:07:14].
  • Cost: A 10x cost reduction is observed; traditional human evaluations via Mechanical Turk cost several hundred dollars for a thousand ratings, while LLM evaluators range from roughly $1 to $20 depending on the model [00:07:19], [00:07:26].
  • Consistency: LLM evaluators achieve over 80% consistency with human judgments, a level comparable to agreement rates between different human evaluators [00:07:45], [00:08:00]. Research papers like NLG Eval and SPADE show strong correlations between human judgments and LLM scores [00:08:09].
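
A hedged sketch of what an LLM evaluator (“LLM as judge”) can look like, assuming the OpenAI Python client; the model choice, prompt wording, and PASS/FAIL protocol are illustrative assumptions rather than anything specified in the talk:

```python
# A minimal LLM-as-judge sketch: the judge compares an actual answer against an
# ideal answer and returns a binary verdict.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's response.

Question: {question}
Ideal answer: {ideal}
Actual answer: {actual}

Does the actual answer convey the same substance as the ideal answer?
Reply with exactly one word: PASS or FAIL."""


def llm_judge(question: str, ideal: str, actual: str) -> bool:
    """Return True if the judge model rates the response as a PASS."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, ideal=ideal, actual=actual),
        }],
        temperature=0,  # keep judgments as repeatable as possible
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```

A judge like this delivers the speed and cost gains above, but it is exactly the kind of evaluator that suffers from the drift problems described next.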

Challenges with LLM Evaluators

Despite their promise, LLM evaluators face two major problems that necessitate continuous improvement [00:08:42]:

  1. Criteria Drift: Relying on the built-in evaluation criteria that ship with popular frameworks (e.g., Ragas, Promptfoo, LangChain) can lead to issues [00:08:54]. These generic criteria may not measure what matters for a unique use case [00:09:10]. For example, an e-commerce recommendation system’s evaluator might check for context relevance but miss what users actually consider a relevant recommendation, leading to user complaints despite good test results [00:09:21]. Criteria drift occurs when the evaluator’s definition of “good” no longer aligns with the user’s [00:10:41].

  2. Dataset Drift: This problem arises from a lack of test coverage in datasets [00:11:19]. Handcrafted, “perfect” test cases may not hold up when real-world users introduce context-dependent, messy inputs [00:11:41]. Users often ask questions broader than anticipated, require real-world data (e.g., from search APIs), or combine multiple questions in unexpected ways [00:11:53]. When datasets don’t represent reality, metrics can still appear good, creating a false sense of security [00:12:18], [00:12:40].

A Three-Step Approach to Continuous Improvement

To address these challenges and ensure evaluations work effectively, an iterative alignment process is crucial [00:12:49], [00:12:59]:

  1. Align Evaluators with Domain Experts: Regularly have domain experts grade outputs and critique evaluator results to identify what the evaluator is missing or overemphasizing [00:13:22], [00:13:30]. Use these critiques as few-shot examples in the evaluator prompt to better ground the evaluator’s notion of good and bad [00:13:38], [00:13:40]. This involves significant massaging and iteration on the evaluator prompt itself [00:13:46].

  2. Keep Datasets Aligned with Real-World User Queries: Your test bank needs to be a living, breathing entity [00:14:09], [00:14:16]. Automatically flow underperforming queries from production back into the test suite [00:14:19]. These real-world failures are “golden” opportunities to improve the test bank and the application [00:16:32]. Continuously add these test cases, with ground-truth data, to refine the dataset (see the sketch after this list) [00:16:43], [00:16:47].

  3. Measure and Track Alignment Over Time: Use concrete metrics such as F1 score for binary judgments or correlation coefficients for Likert scales to track how well your evaluator matches human judgment [00:14:31], [00:14:35]. This systematic measurement shows whether the evaluator is truly improving or regressing with each iteration [00:14:47].
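
A minimal sketch of steps 2 and 3, assuming Python with scikit-learn and SciPy; the helper `add_failures_to_test_bank`, its log fields, and the example labels are illustrative, not data or code from the talk:

```python
# Step 2: feed underperforming production queries back into the test bank.
# Step 3: measure how closely the LLM evaluator matches domain-expert judgments.
from scipy.stats import spearmanr
from sklearn.metrics import f1_score


def add_failures_to_test_bank(production_logs: list[dict], test_bank: list[dict]) -> None:
    """Append production queries the agent handled poorly; domain experts supply
    the ground-truth outputs later."""
    for log in production_logs:
        if log["evaluator_verdict"] == "FAIL" or log["user_flagged"]:
            test_bank.append({"input": log["query"], "ideal_output": None})


# Binary (pass/fail) judgments on the same outputs: F1 measures agreement.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
llm_verdicts = [1, 1, 0, 0, 0, 1, 1, 1]
print("F1 vs. human judgment:", f1_score(human_labels, llm_verdicts))

# Likert-scale (1-5) ratings: use a correlation coefficient instead.
human_ratings = [5, 4, 2, 5, 1, 3, 4, 5]
llm_ratings   = [4, 4, 2, 3, 1, 4, 4, 5]
rho, _ = spearmanr(human_ratings, llm_ratings)
print("Spearman correlation vs. human ratings:", round(rho, 2))
```

Recomputing the F1 score or correlation after each change to the evaluator prompt shows whether alignment with domain experts is improving or regressing.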

Practical Implementation Steps

  • Customize the LLM Evaluator Prompt: Don’t rely solely on templated metrics, which can be meaningless for your use case [00:15:16], [00:15:18]. Carefully tailor the evaluation criteria, add few-shot examples of critiques from domain experts, and decide between binary or Likert-scale ratings (binary is highly recommended) [00:15:22], [00:15:33]. Ensure that what you’re measuring is truly meaningful to your use case and business context [00:15:38]. (A sketch of such a customized prompt follows this list.)
  • Involve Domain Experts Early: Get domain experts to evaluate the evaluator, even starting with 20 examples in spreadsheets, to gauge alignment between evaluator judgments and expert opinions [00:15:53], [00:15:59]. Their feedback will inform changes to the evaluator prompt [00:16:09].
  • Log and Iterate: Log every time your system underperforms in production [00:16:19]. These real-world failures should be continuously added to your test bank with ground truth data [00:16:43]. Iteratively improve LLM evaluator prompts by testing new versions against your expanding test bank and making them more specific to your use case [00:16:55].
  • Invest in an Eval Console: Build or use a tool that lets domain experts iterate on evaluator prompts and record whether they agree with the evaluator’s critiques and judgments [00:17:08].
  • Systematic Measurement: Set up a simple dashboard to track alignment scores (F1 score or correlation metrics) over time [00:17:26], [00:17:30]. This provides a systematic way to track the improvement of your evaluator template [00:17:47].
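
As an illustration of the first bullet above, here is a sketch of a customized judge prompt with use-case-specific criteria, expert few-shot critiques, and a binary verdict; the legal-contract criteria and graded examples are invented placeholders, not material from the talk:

```python
# A hedged sketch of a customized evaluator prompt: use-case-specific criteria,
# few-shot critiques written by domain experts, and a binary verdict.
# All criteria and examples below are illustrative placeholders.
CUSTOM_JUDGE_PROMPT = """You are evaluating responses from our legal-contract Q&A assistant.

A response PASSES only if it:
- cites the specific contract clause it relies on,
- avoids giving definitive legal advice,
- answers the question the user actually asked.

Here are graded examples with expert critiques:

Example 1
Question: "Can we terminate early under section 4?"
Response: "Yes, termination is always allowed."
Expert verdict: FAIL
Expert critique: "No clause citation, and the blanket 'always' overstates the contract."

Example 2
Question: "What notice period applies to renewals?"
Response: "Clause 7.2 requires 60 days' written notice before the renewal date."
Expert verdict: PASS
Expert critique: "Cites the clause and answers precisely."

Now grade the following. Reply with exactly one word: PASS or FAIL.

Question: {question}
Response: {actual}"""
```

A template like this slots into the same judge call sketched earlier, with `{question}` and `{actual}` filled in per test case.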

The ultimate goal is not perfection but continuous improvement, achieved by building iterative feedback loops into the development process [00:18:00], [00:18:02], [00:18:24]. This approach helps overcome the limitations of current AI models and ensures that advances in model technology translate into real-world impact and efficiency.