From: aidotengineer

Evaluation in AI applications is crucial for ensuring systems deliver value in the real world rather than remaining mere demonstrations [00:01:57]. Just as traditional software requires unit and integration tests before deployment, AI applications need robust evaluations to assess quality before changes are pushed to production [00:02:11].

While many teams claim to have automated tests or internal testing processes, there is often uncertainty regarding what constitutes a good evaluation framework [00:02:36].

Key Components of AI Evaluation

To test the quality of an AI system before production, three key components are necessary [00:02:52] (a minimal sketch of how they fit together follows this list):

  1. Agent: This is the system being evaluated [00:02:55]. It could be an end-to-end agent, a small function within an agent, or even a retrieval pipeline [00:03:01]. Agents vary widely, from customer service chatbots to Q&A agents processing legal contracts, each with unique requirements and challenges [00:03:10]. For example, a financial Q&A system needs to be accurate, compliant with regulations, explain its reasoning, and understand financial accounting standards [00:03:22]. The evaluation must account for these specific aspects [00:03:40].

  2. Dataset: This is the data against which the agent is evaluated [00:03:48]. An effective dataset must include both the expected inputs (queries and requests the system will receive in production) and the ideal outputs (what good responses should look like) [00:04:12]. It should cover not only common scenarios but also tricky edge cases where problems might arise [00:04:26]. These examples should ideally be created by domain experts who understand the business context and can define quality requirements [00:04:34].

  3. Evaluators: This component determines how quality is measured [00:04:52].
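
As a rough illustration of how these three components fit together, the sketch below pairs a toy agent with a small dataset and an evaluator stub. It is not from the talk: the function names, model id, and dataset entries are placeholders, and it assumes an OpenAI-style Python client.

```python
# Minimal sketch of the three components: agent, dataset, evaluators.
# All names here (answer_financial_question, the model id, the example
# queries) are illustrative placeholders, not from the talk.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Agent: the system under test. Here it is a single LLM call, but it could
#    just as well be a retrieval pipeline or an end-to-end agent.
def answer_financial_question(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[
            {"role": "system", "content": (
                "You are a financial Q&A assistant. Be accurate, stay "
                "compliant, cite the relevant accounting standard, and "
                "explain your reasoning.")},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

# 2. Dataset: expected inputs plus ideal outputs, covering common scenarios
#    and tricky edge cases, ideally written by domain experts.
dataset = [
    {
        "input": "How is deferred revenue recognized under ASC 606?",
        "expected": "Recognized as the related performance obligations are satisfied.",
        "tags": ["common"],
    },
    {
        "input": "The client already paid; can we book next year's subscription revenue now?",
        "expected": "No. Payment timing does not change recognition; the obligation is not yet satisfied.",
        "tags": ["edge-case", "compliance"],
    },
]

# 3. Evaluators: how quality is measured (an LLM-as-judge version is sketched
#    in the next section).
def evaluate(query: str, expected: str, actual: str) -> bool:
    raise NotImplementedError
```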

Types of Evaluators

Historically, different approaches have been used for evaluating AI agents:

  • Human Evaluators: Traditionally, subject matter experts review outputs, score them, and provide feedback [00:04:57]. While effective, this method is often slow and expensive [00:05:05].
  • Code-based Evaluators: These are suitable for objective metrics like response time or latency [00:05:09]. However, metrics like ROUGE-L often fall short for nuanced linguistic evaluation [00:05:14].
  • Large Language Model (LLM) Evaluators: These promise to combine the nuance of human reasoning with the speed and scalability of automated systems [00:05:21]. They have gained popularity due to their compelling benefits [00:06:19] (a minimal LLM-as-judge sketch follows this list):
    • Speed: Evaluations that previously took 8-10 hours with human evaluators can now be completed in under an hour [00:06:46]. For example, 1,000 test cases that might take a full day with human evaluators can be done in 50-60 minutes using an LLM evaluator (assuming sequential execution) [00:06:54].
    • Cost Reduction: Human evaluations via platforms like Mechanical Turk can cost several hundred dollars for 1,000 ratings [00:07:21]. LLM evaluators can achieve similar results for $120, representing a roughly 10x cost reduction [00:07:28].
    • Consistency: LLM evaluators show over 80% consistency with human judgments [00:07:45]. This consistency is comparable to the agreement observed between different human evaluators, as humans do not always agree 100% of the time [00:07:54]. Research papers like NLG-Eval and SPADE have shown strong correlations between human judgments and LLM scores [00:08:09], and major model providers are increasingly using LLM evaluators for alignment [00:08:20].
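
To make the LLM-evaluator idea concrete, here is a minimal LLM-as-judge sketch. The prompt wording, judge model, and pass/fail criteria are assumptions for illustration rather than the speaker's implementation; it again assumes an OpenAI-style client.

```python
# Minimal LLM-as-judge sketch. Prompt wording and model id are placeholders.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer to a financial question.

Question: {query}
Reference answer: {expected}
Assistant answer: {actual}

Judge whether the answer is accurate, compliant, and explains its reasoning.
Respond with JSON: {{"pass": true or false, "critique": "<one sentence>"}}"""

def llm_judge(query: str, expected: str, actual: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",            # placeholder judge model
        temperature=0,             # keep judgments as repeatable as possible
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, expected=expected, actual=actual)}],
    )
    return json.loads(response.choices[0].message.content)
```

Running a full test bank through a judge like this takes minutes rather than the hours a human review pass would, which is where the speed and cost gains above come from.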

Challenges with LLM Evaluators

Despite their benefits, LLM evaluators face two major problems [00:08:42]:

  • Criteria Drift: This occurs when an evaluator’s definition of “good” no longer aligns with the user’s perception of quality [00:10:41]. Standard evaluation criteria in popular frameworks (e.g., Ragas, Promptfoo, LangChain) are designed for generalizability and may not measure what is important for a unique use case [00:09:01]. For instance, an e-commerce recommendation system’s evaluator might focus too much on keyword relevance, missing broader product context, leading to user complaints in production despite good test scores [00:09:31]. Criteria can also drift if the underlying LLM used for evaluation changes unexpectedly [00:10:29]. Research by Shreya Shankar and her team at UC Berkeley highlights that evaluation criteria must evolve over time to balance true positives and false positives effectively [00:10:50].
  • Dataset Drift: This refers to a lack of test coverage in datasets, where carefully crafted test cases fail to represent real-world user queries [00:12:19]. When real users interact with a system, their inputs are often context-dependent, messy, or combine multiple questions in unexpected ways [00:11:41]. If the test dataset doesn’t reflect these varied usage patterns, evaluation metrics might look good, but the system will underperform in production [00:11:50].

Improving AI Evaluation Methods

To overcome these challenges, evaluations and datasets must be iteratively aligned, similar to how LLM applications themselves are aligned [00:12:59]. A three-step approach for effective evaluation involves:

  1. Align Evaluators with Domain Experts:

    • Have domain experts regularly grade outputs and critique the evaluator’s results [00:13:22].
    • Use these critiques and few-shot examples in the evaluator’s prompt to ground its understanding of quality [00:13:38].
    • Continuously massage and iterate on the evaluator prompt itself, rather than relying solely on templated metrics [00:13:47].
    • Involve domain experts early in the process to validate evaluator judgments, even by starting with 20 examples in a spreadsheet [00:15:52].
  2. Keep Datasets Aligned with Real-World User Queries:

    • Log all queries, especially underperforming ones from production, and automatically flow them back into the test suite [00:14:11]. These real-world failures are invaluable for identifying where the evaluation system falls short [00:16:26] (see the first sketch after this list).
    • Continuously add these test cases with ground truth labels to the test bank [00:16:43].
  3. Measure and Track Alignment Over Time:

    • Use concrete metrics like F1 score for binary judgments or correlation coefficients for Likert scales [00:14:35] (see the second sketch after this list).
    • Track how well the evaluator matches human judgment with every iteration [00:14:41]. This informs whether the evaluator is truly improving or regressing [00:14:50].
    • Set up a simple dashboard to track alignment scores (F1 or correlation metrics) to systematically monitor improvements in the evaluator template [00:17:28].
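
The first sketch below illustrates step 2: flagging underperforming production interactions and appending them to the test bank for expert labeling. The logging hook, file format, and field names are assumptions made for the example, not a prescribed implementation.

```python
# Sketch: route judged production failures back into the test bank (step 2).
# File name and record fields are illustrative assumptions.
import json
from pathlib import Path

TEST_BANK = Path("test_bank.jsonl")

def log_production_interaction(query: str, answer: str, judge_result: dict) -> None:
    """Call after each production request; failures become new test cases."""
    if not judge_result["pass"]:
        record = {
            "input": query,
            "model_answer": answer,
            "critique": judge_result["critique"],
            "expected": None,  # a domain expert adds the ground-truth label later
            "source": "production-failure",
        }
        with TEST_BANK.open("a") as f:
            f.write(json.dumps(record) + "\n")
```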
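The second sketch illustrates step 3: scoring how well the evaluator matches human judgment on each iteration of the judge prompt, using F1 for binary verdicts and a rank correlation for Likert ratings. It assumes scikit-learn and SciPy are available; the example numbers are made up.

```python
# Sketch: track evaluator-vs-human alignment per judge-prompt iteration (step 3).
from sklearn.metrics import f1_score
from scipy.stats import spearmanr

def alignment_score(human_labels, judge_labels, scale: str = "binary") -> float:
    """F1 for binary pass/fail judgments, Spearman correlation for Likert ratings."""
    if scale == "binary":
        return f1_score(human_labels, judge_labels)
    correlation, _p_value = spearmanr(human_labels, judge_labels)
    return correlation

# Example: judge-prompt v3 graded against 10 expert-labeled cases (made-up data).
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(f"judge-prompt v3 alignment (F1): {alignment_score(human, judge):.2f}")
```

Plotting this score per prompt version on a simple dashboard shows whether the evaluator is genuinely improving or regressing.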

Practical Steps for Customizing Evaluations

  • Customize LLM Evaluator Prompts: Avoid relying on templated metrics [00:15:14]. Carefully tailor evaluation criteria, add few-shot examples of critiques from domain experts, and choose appropriate rating scales (binary is highly recommended over Likert) [00:15:22]. Ensure the metrics chosen are meaningful to the specific use case, application, and business context [00:15:38] (a prompt sketch follows this list).
  • Iterate LLM Evaluator Prompts: Prompts are not static; they need to evolve [00:16:55]. Test new versions against an expanding test bank and make them more specific [00:17:01]. Consider investing in an evaluation console tool, or building one internally, to allow domain experts to iterate on prompts and confirm agreement with evaluator judgments [00:17:08].
  • Implement an Iterative Feedback Loop: AI systems, unlike traditional software, require continuous feedback loops [00:18:20]. The goal is not perfection, but continuous improvement based on real-world usage and alignment [00:18:00].
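
As one way to customize a judge prompt along these lines, the sketch below embeds domain-expert critiques as few-shot examples and sticks to a binary pass/fail scale. The example critiques and wording are placeholders; in practice they come from your own experts and test bank.

```python
# Sketch: a judge prompt grounded in domain-expert critiques (placeholder text).
FEW_SHOT_CRITIQUES = """
Example 1
Question: Can we recognize the full contract value up front?
Answer: Yes, as soon as the contract is signed.
Expert judgment: FAIL - ignores performance obligations under ASC 606.

Example 2
Question: How should we treat a pending lawsuit?
Answer: Disclose a contingent liability if a loss is probable and estimable, and explain why.
Expert judgment: PASS - accurate, compliant, and the reasoning is explained.
"""

def build_judge_prompt(query: str, expected: str, actual: str) -> str:
    return (
        "You are grading a financial Q&A assistant. Apply the same standards "
        "as the expert judgments below.\n"
        + FEW_SHOT_CRITIQUES
        + f"\nNow grade this case.\nQuestion: {query}\nReference: {expected}\n"
        f"Answer: {actual}\nRespond with PASS or FAIL and a one-sentence critique."
    )
```

Each time experts disagree with the judge, their critique can be added to the few-shot block and the new prompt version re-scored against the test bank before it replaces the old one.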

Ultimately, LLM evaluations are only as good as their alignment with real-world usage [00:18:07]. Avoiding static evaluation and building iterative feedback loops into the development process yields significant payoffs in improving evaluation over time [00:18:13].