From: aidotengineer

Evaluations are considered the most significant challenge and the “missing piece” when attempting to scale generative AI (GenAI) workloads [00:01:11]. Justin Mohler, a principal applied AI architect at AWS, highlights that a lack of evaluations is the primary reason why many GenAI projects fail to scale [00:01:45]. When implemented, evaluations can unlock the ability to scale these applications [00:02:07].

Why Evaluations are Crucial

A primary objective of any evaluation framework should be to identify problems [00:04:43]. While evaluations do produce scores, this is a lesser goal compared to uncovering issues and even suggesting solutions, especially when generative AI reasoning is incorporated [00:04:50]. Designing an evaluation framework with a “find errors” mindset leads to a more effective design than merely aiming to measure performance [00:05:08].

Mohler’s team at AWS uses evaluations as a filter to distinguish “science projects” (those unlikely to scale) from successful ones [00:05:42]. A team’s unwillingness to invest time in building a gold-standard evaluation set is often a sign that the project will not achieve scale [00:05:57]. Conversely, wildly successful projects are marked by a strong willingness to invest significant time in building evaluation frameworks [00:06:32].

Case Study: Document Processing Workload

In one instance, a customer’s document processing workload, after 6-12 months of development by six to eight engineers, had an accuracy of only 22% and faced potential cancellation [00:02:26]. The core issue was a complete lack of evaluations [00:03:07]. After designing and implementing an evaluation framework, it became clear where the problems lay, making fixes trivial [00:03:22]. Within six months, the accuracy improved to 92%, leading to its launch as the largest document processing workload on AWS in North America at the time [00:03:43].

Challenges and Mindset Shift in Generative AI Evaluations

Practitioners from traditional AI/ML backgrounds often associate evaluations with exact numerical scores [00:07:00]. GenAI outputs, however, are usually free-form text, which can seem daunting to quantify precisely [00:07:02]. Yet humans have been grading free text for centuries (e.g., English essays), and GenAI output can be evaluated similarly [00:07:14]. The key is to go beyond assigning a score and instead point out what went wrong and where improvements can be made [00:08:14].

The Importance of Evaluating Reasoning

When evaluating AI systems, it’s crucial to understand how the model arrived at its answer, not just the answer itself [00:09:11].

  • 2x4 Example: Drilling a perfect hole through a 2x4 might seem successful, but if the method used to produce it was unsafe or erratic (e.g., wielding a chainsaw one-handed), the underlying system needs rethinking [00:08:42].
  • Meteorology Example: A model asked to summarize weather data might incorrectly report “sunny and bright” despite receiving data indicating rain [00:10:05]. While the output is clearly wrong (a score of zero), asking the model to explain its reasoning (e.g., “it’s important to mental health to be happy, so I decided not to talk about the rain”) provides critical insight into the problem [00:10:18]. Conversely, a correct output (e.g., “sunny” response to “sunny” data) can be misleading if the model’s reasoning was flawed (e.g., “I just chose ‘sunny’ randomly”) [00:10:40].

Understanding the model’s reasoning is vital for effective troubleshooting and improvement, as simply evaluating the output can be insufficient [00:10:29].
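As an illustration (not taken from the talk), one lightweight way to make reasoning inspectable is to ask the model to return its answer and its reasoning together as structured output; the helper name call_llm and the prompt wording below are hypothetical placeholders.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to whichever model endpoint you use
    and return the raw completion text."""
    raise NotImplementedError

SUMMARY_PROMPT = """You are summarizing weather sensor data.

Data: {weather_data}

Respond in JSON with exactly two fields:
  "answer":    a one-sentence weather summary
  "reasoning": the specific data points and logic that led to that summary
"""

def summarize_with_reasoning(weather_data: str) -> dict:
    raw = call_llm(SUMMARY_PROMPT.format(weather_data=weather_data))
    result = json.loads(raw)
    # Both fields flow into the evaluation: the answer is scored, and the
    # reasoning is inspected to catch "right answer, wrong method" cases.
    return {"answer": result["answer"], "reasoning": result["reasoning"]}
```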

Prompt Decomposition and Segmentation

While not exclusive to evaluations, prompt decomposition is often performed in conjunction with them [00:11:23]. It involves breaking a large, complex prompt into a series of smaller, chained prompts [00:13:02]. This allows for attaching an evaluation to each section, helping to pinpoint where errors occur and focus improvement efforts [00:13:07].
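A minimal sketch of what such a chain can look like, assuming a hypothetical call_llm helper and a hypothetical per-step judge; the step names and prompts are illustrative, not the speaker’s actual pipeline.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your model provider and return the completion text."""
    raise NotImplementedError

def extract_observations(raw_report: str) -> str:
    return call_llm(f"List the individual weather observations in:\n{raw_report}")

def classify_conditions(observations: str) -> str:
    return call_llm(f"Classify these observations as sunny, cloudy, or rainy:\n{observations}")

def write_summary(condition: str) -> str:
    return call_llm(f"Write a one-sentence public weather summary for: {condition}")

def judge(step_name: str, step_input: str, step_output: str) -> float:
    """Hypothetical per-step judge (e.g., an LLM judge prompt) returning a 0-100 score."""
    raise NotImplementedError

# One large prompt becomes a chain of small steps, each with its own evaluation,
# so a bad final summary can be traced to the step that actually introduced the error.
STEPS: list[tuple[str, Callable[[str], str]]] = [
    ("extract", extract_observations),
    ("classify", classify_conditions),
    ("summarize", write_summary),
]

def run_with_evals(raw_report: str) -> tuple[str, dict[str, float]]:
    value, scores = raw_report, {}
    for name, step in STEPS:
        output = step(value)
        scores[name] = judge(name, value, output)  # score each segment, not just the end result
        value = output
    return value, scores
```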

This approach also helps determine whether GenAI is even the appropriate tool for a given segment of the prompt [00:13:18]. For instance, in the weather company example, the model sometimes got simple numerical comparisons wrong (e.g., concluding that 7 is less than 5) [00:12:41]. Replacing the GenAI calculation for that specific step with a Python mathematical comparison brought the step’s accuracy to 100% [00:13:30].
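The deterministic replacement for a step like that can be a one-line comparison; the function below is a hypothetical example of swapping the model call for plain Python.

```python
# Deterministic replacement for the numeric-comparison step: plain Python gets
# "is 7 less than 5?" right every time, so this segment is exact by construction.
def is_less_than(a: float, b: float) -> bool:
    return a < b

print(is_less_than(7, 5))  # False: 7 is not less than 5
```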

Semantic Routing

A common pattern in scaled workloads is semantic routing, where an input query is first classified to determine the task type [00:14:02]. Easy tasks might be routed to a smaller, faster model, while harder tasks go to a larger, more complex model [00:14:11]. Attaching evaluations to each step in this process allows for using the right model for the job and significantly increases accuracy by eliminating “dead space” or unnecessary tokens from prompts [00:14:34].
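A hedged sketch of this routing pattern, assuming a hypothetical call_model helper; the model identifiers, classification prompt, and route names are placeholders, not a specific product setup.

```python
def call_model(model_id: str, prompt: str) -> str:
    """Hypothetical helper for whichever model API is in use."""
    raise NotImplementedError

# Illustrative routes: easy tasks go to a small, fast model; harder ones to a larger model.
ROUTES = {
    "simple_lookup": "small-fast-model",
    "multi_step_analysis": "large-model",
}

def route_and_answer(query: str) -> tuple[str, str]:
    # Step 1: a cheap classification call decides the task type.
    task_type = call_model(
        "small-fast-model",
        f"Classify this query as 'simple_lookup' or 'multi_step_analysis':\n{query}",
    ).strip()
    # Step 2: send the query to the model sized for that task; default to the stronger one.
    model_id = ROUTES.get(task_type, "large-model")
    answer = call_model(model_id, query)
    # Both the routing decision and the final answer get their own evaluations.
    return task_type, answer
```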

Seven Habits of Highly Effective Generative AI Evaluations

Successful scaled GenAI workloads almost always incorporate these seven habits [00:15:37]:

  1. Fast: Evaluation frameworks should provide results quickly, ideally within about 30 seconds [00:15:50]. This enables the rapid iteration (hundreds of changes and tests per day) that is crucial for innovation and accuracy improvement [00:16:19]. Achieving this speed often means using GenAI as a judge (or plain Python for numeric outputs) and parallelizing test-case generation and judging [00:16:51]; a sketch of this pattern appears after the list.
  2. Quantifiable: Effective evaluation frameworks must produce numerical scores [00:18:21]. Individual scores may show some jitter, but this can be mitigated by running a large set of test cases and averaging the results, much as grades are averaged in school [00:18:54].
  3. Explainable: Beyond just looking at outputs, it’s vital to examine the model’s reasoning process [00:20:09]. This applies to both the generation by the model and the scoring by the judge [00:20:19]. Designing a “judge prompt” with clear instructions and asking it to explain its reasoning helps in prompt engineering for the judge itself [00:21:11].
  4. Segmented: Nearly all scaled workloads involve multiple steps, not just a single prompt [00:21:28]. Each step should be evaluated individually to identify precise error sources [00:21:36]. This also helps in choosing the most appropriate model (e.g., a small, fast model like Nova Micro for simple tasks) for each segment, optimizing for cost and latency [00:21:50].
  5. Diverse: Test cases should cover all anticipated use cases, both in-scope and out-of-scope [00:22:10]. A rule of thumb is around 100 test cases, with more for core use cases and fewer for edge cases [00:22:18]. Building diverse test cases helps define the project’s scope and ensures the model handles various queries appropriately [00:19:32].
  6. Traditional: While GenAI is powerful, it’s important not to abandon traditional AI evaluation techniques [00:22:30]. For numeric outputs, direct numeric evaluations are superior [00:22:52]. For RAG architectures, metrics like retrieval precision, recall, and F1 scores are still relevant [00:23:00]. Measuring cost and latency also relies on traditional tools [00:23:11].
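To make the Fast, Quantifiable, and Explainable habits concrete, here is a sketch of a judge run that scores test cases in parallel, keeps the judge’s reasoning, and averages the scores; call_llm, the judge-prompt wording, and the 0-100 scale are assumptions rather than the speaker’s exact setup.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your model provider and return the completion text."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading a generated answer against a gold-standard answer.

Question: {question}
Gold-standard answer: {gold}
Generated answer: {generated}

Return JSON with "score" (0-100) and "reasoning" (why you gave that score).
"""

def judge_one(case: dict) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(**case))
    verdict = json.loads(raw)
    # Keep the judge's reasoning alongside the score so failures stay explainable.
    return {"score": float(verdict["score"]), "reasoning": verdict["reasoning"], **case}

def run_eval(cases: list[dict]) -> dict:
    # Judge all test cases in parallel so a full run stays fast enough
    # to support many iterations per day.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(judge_one, cases))
    # Averaging over a large test set smooths out per-case jitter in the scores.
    return {"average_score": mean(r["score"] for r in results), "results": results}
```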

Building an Evaluation Framework

A robust evaluation framework starts with a gold standard set [00:23:31]. This set serves as the foundation for the entire system, so investing time in its quality is paramount [00:23:52]. It’s generally not recommended to use GenAI to create the gold standard set, as it risks building a system that replicates the GenAI’s own errors [00:24:00]. While GenAI can help create a “silver standard set,” it must always be human-reviewed for accuracy [00:24:16].
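As a sketch of what a human-authored gold-standard entry might contain (field names are assumed, not prescribed by the talk), each case pairs an input with an expected answer and a category used later for per-category breakdowns:

```python
# Illustrative gold-standard records: human-written inputs and expected answers,
# plus a category used later to break results down per use case.
gold_standard = [
    {
        "category": "precipitation",
        "input": "Hourly data: 12mm of rain recorded between 06:00 and 09:00.",
        "expected": "Expect a wet morning with roughly 12mm of rainfall.",
    },
    {
        "category": "out_of_scope",
        "input": "Which stocks should I buy today?",
        "expected": "Decline politely: the assistant only answers weather questions.",
    },
]
```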

Evaluation Process

  1. Input Generation: An input from the gold standard set is fed into a prompt template and then an LLM to generate an output, which includes both the answer and its reasoning [00:24:26].
  2. Judging: The generated output is compared against the matching answer from the gold standard using a “judge prompt” [00:24:41]. The judge generates a numerical score and, crucially, the reasoning behind that score [00:24:45].
  3. Categorization and Summary: Outputs are often categorized (with the categories typically defined in the gold standard set), so results can be broken down by category and summaries of right and wrong answers generated for each one [00:24:54]. This provides actionable insights for improvement; a minimal categorization sketch follows below.
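A minimal sketch of the categorization step, assuming each judged result carries a category, a score, and the judge’s reasoning (field names and the passing threshold are illustrative):

```python
from collections import defaultdict

def summarize_by_category(results: list[dict], passing_score: float = 80.0) -> dict:
    """Group judged results by category and summarize failures for each one."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r)

    summary = {}
    for category, items in buckets.items():
        failures = [r for r in items if r["score"] < passing_score]
        summary[category] = {
            "cases": len(items),
            "failed": len(failures),
            # The judge's reasoning for each failure is what makes the report actionable.
            "failure_reasons": [r["reasoning"] for r in failures],
        }
    return summary
```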

Evaluations and fine-tuning are deeply intertwined: evaluations provide the feedback loop needed for iterative improvement. Evaluating AI agents brings additional challenges as well, since agents introduce extra layers of complexity that demand equally robust evaluation methods and metrics.