From: aidotengineer

The biggest challenge in scaling Generative AI workloads is evaluations [00:01:11]. While concerns like cost, hallucinations, accuracy, and capacity frequently arise, a lack of evaluations is identified as the number one issue across all workloads [00:01:43]. Evaluations are considered the “missing piece” to scaling Generative AI, as their implementation often unlocks the ability to scale a workload [00:02:07].

Case Study: Document Processing Workload

In July 2024, an AWS Principal Applied AI Architect was called in to assist a customer with a document processing workload [00:02:26]. This project had been ongoing for six to twelve months with six to eight engineers, but its accuracy was only 22% [00:02:52]. The core issue discovered was the complete absence of evaluations [00:03:07].

After an evaluations framework was designed and implemented, it became trivial to identify and fix problems [00:03:26]. The real challenge was not increasing accuracy, but knowing where the problems were and their causes [00:03:33]. Within six months, the customer achieved 92% accuracy, exceeding their 90% threshold for production launch [00:03:47]. This led to the workload becoming the single largest document processing workload on AWS in North America at the time [00:03:57].

Purpose of Evaluations

The primary goal of an evaluation framework should be to discover problems [00:04:44]. While generating a quality score is important, it’s a secondary benefit [00:04:58]. An effective framework pinpoints issues and can even suggest solutions, especially if it incorporates Generative AI reasoning [00:04:53]. Adopting a mindset that the framework will find errors leads to a different and more effective design compared to one solely focused on measuring performance [00:05:08].

Evaluations as a Success Filter

Evaluations serve as a critical filter distinguishing a “science project” from a successful project [00:05:44]. Teams unwilling to invest time in building a “gold standard set” for evaluations, often because they view the work as boring, are likely to create projects that do not scale [00:06:12]. Conversely, successful projects, some achieving 100x ROI or significant cost reductions, are characterized by teams eager to invest substantial time in building robust evaluation frameworks [00:06:38].

Overcoming Generative AI Evaluation Challenges

Traditional AI/ML evaluations often rely on exact numerical scores (e.g., F1 score, precision, recall) [00:07:08]. Generative AI outputs, however, are typically free text, which can seem daunting to evaluate mathematically [00:07:04]. Yet humans have been grading and evaluating free text for centuries, much as a professor grades an essay [00:07:22].

Importance of Reasoning in Evaluations

Simply receiving a score (like an ‘F’ on an essay) without explanation is unhelpful for improvement [00:08:12]. With Generative AI, it’s crucial to understand the model’s reasoning behind its output, akin to a good professor pointing out where a student went wrong [00:08:20].

For example, in a weather summary use case, if the model incorrectly reports “sunny and bright” despite data showing rain, a simple “zero” score doesn’t explain why [00:10:14]. However, if the model explains its reasoning (e.g., “it’s important to mental health to be happy, so I decided not to talk about the rain”), this insight reveals the underlying flaw [00:10:23]. Similarly, even if a correct output is generated (e.g., “sunny” when it’s sunny), if the model’s reasoning is flawed (e.g., “the weather is sunny and bright, and it’s nice to be happy”), it indicates a systemic issue that could lead to failures in other cases [00:11:01]. Evaluating the reasoning process helps identify and fix such problems [00:10:29].
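
One way to make that reasoning inspectable is to have the model return its reasoning alongside its answer, so the evaluation can grade both. The sketch below is a minimal illustration of this pattern, assuming access to Amazon Bedrock’s Converse API; the model ID and prompt wording are illustrative rather than taken from the talk.

```python
import json
import boto3

# Minimal sketch: ask the model for its reasoning alongside the answer, so a judge
# (or a human) can catch flawed reasoning even when the answer happens to be right.
# Assumes Bedrock access; the model ID and prompt wording are illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SUMMARY_PROMPT = """Given this weather data: {data}
Return JSON with two keys:
  "reasoning": the steps you took to reach your summary,
  "summary": a one-sentence weather summary."""

def summarize_weather(weather_data: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": SUMMARY_PROMPT.format(data=weather_data)}]}],
        inferenceConfig={"temperature": 0},
    )
    # Keep both fields so the evaluation can grade the reasoning, not just the answer.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```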

Prompt Decomposition for Scalability

Prompt decomposition is a technique where a large, complex prompt is broken down into a series of chained, smaller prompts [00:13:02]. This allows evaluations to be attached to each section of the prompt, making it easier to pinpoint where errors occur [00:13:09]. This approach also helps determine if Generative AI is even the appropriate tool for a particular section [00:13:24].

In the weather company example, a complex prompt included mathematical comparisons (e.g., “if wind speed is less than five, it’s not very windy”) [00:12:29]. When scaled, Claude occasionally miscalculated (e.g., “wind speed is seven, seven is less than five”) [00:12:45]. By decomposing the prompt and replacing the mathematical comparison with a Python step, accuracy for that section became 100% [00:13:38].
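
A minimal sketch of that decomposition is shown below, assuming a Bedrock Converse client; the threshold, model ID, and prompt wording are illustrative, not the customer’s actual prompt.

```python
import boto3

# Prompt decomposition sketch: the numeric comparison runs deterministically in Python,
# so that step can no longer be miscalculated by the model; the model only phrases the
# result. Assumes Bedrock access; the model ID and 5 mph threshold are illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def describe_wind(wind_speed_mph: float) -> str:
    # Deterministic step: accuracy here is 100% by construction.
    windiness = "not very windy" if wind_speed_mph < 5 else "windy"
    # Chained prompt step: the model writes the prose but does no arithmetic.
    prompt = (
        f"The wind speed is {wind_speed_mph} mph, which is considered {windiness}. "
        "Write one friendly sentence about the wind for a weather summary."
    )
    response = bedrock.converse(
        modelId="amazon.nova-micro-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]
```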

Semantic Routing and its Benefits for Scaling

Semantic routing is a common pattern for building scalable AI systems [00:14:02]. It involves routing an input query to different models based on its complexity [00:14:15]:

  • Easy tasks go to smaller, faster models (e.g., Nova Micro) [00:14:14].
  • Hard tasks go to larger models [00:14:15].

Attaching evaluations to each step of this process helps ensure the right model is used for the job [00:14:34]. This segmented approach often significantly increases accuracy because it removes “dead space” or unnecessary instructions, reducing cost and confusion for the model [00:15:22].
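
The sketch below shows what a minimal semantic router could look like, assuming Bedrock model IDs that may differ by account and region; the classifier prompt and EASY/HARD labels are illustrative. Each of the two steps (routing and answering) would carry its own evaluation set.

```python
import boto3

# Semantic routing sketch: a small, fast model classifies the query, and the answer is
# generated by a model sized to the difficulty. Model IDs and labels are illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

ROUTER_MODEL = "amazon.nova-micro-v1:0"                    # classifies the query
EASY_MODEL = "amazon.nova-micro-v1:0"                      # handles simple queries
HARD_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # handles complex queries

def _ask(model_id: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]

def route_and_answer(query: str) -> str:
    # Step 1: classify difficulty; this step gets its own evaluation set.
    label = _ask(ROUTER_MODEL, f"Classify this query as EASY or HARD. Reply with one word.\n\n{query}")
    # Step 2: answer with the model matched to the difficulty; evaluated separately.
    return _ask(HARD_MODEL if "HARD" in label.upper() else EASY_MODEL, query)
```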

Seven Habits of Highly Effective Generative AI Evaluations

Successfully scaled Generative AI workloads almost always incorporate these seven habits [00:15:37]:

  1. Fast [00:15:50]: Evaluations should provide results in seconds, not days or weeks [00:16:21]. A target of 30 seconds for an evaluation framework run is ideal [00:16:54]. This allows for hundreds of changes and tests daily, significantly accelerating the pace of innovation and accuracy improvements [00:16:25]. This speed is achieved by using Generative AI as a judge and processing test cases and judgments in parallel (see the judge sketch after this list) [00:17:31].
  2. Quantifiable [00:18:21]: Effective frameworks produce numerical scores [00:18:24]. While scores may have some “jitter,” this can be mitigated by having numerous test cases and averaging the results [00:19:04].
  3. Numerous [00:19:17]: A diverse and comprehensive set of test cases is essential to cover all use cases and understand the project’s scope [00:19:22]. Building 100 test cases is a useful rule of thumb, ensuring core use cases have many examples while edge cases have a few [00:22:19]. This process can even help teams clarify product design by defining what questions should be answered and what should be redirected [00:19:50].
  4. Explainable [00:20:09]: Evaluations should provide insight into how the model reached its output and how the judge reasoned its score [00:20:19]. This requires engineering the “judge prompt” with clear instructions and a rubric, similar to a professor grading an essay [00:21:13].
  5. Segmented [00:21:24]: Almost all scaled workloads involve multiple steps, meaning each step needs to be evaluated individually [00:21:36]. This allows for determining the most appropriate and smallest model for each step (e.g., using a smaller model for semantic routing) [00:22:05].
  6. Diverse [00:22:10]: The test cases should cover all anticipated use cases, including those intended to be handled and those out of scope [00:22:20]. This ensures the model correctly redirects out-of-scope queries [00:20:06].
  7. Traditional [00:22:30]: Not everything needs to be evaluated by Generative AI [00:22:37]. Traditional techniques remain powerful and important for certain aspects of evaluation, such as numeric outputs, database accuracy, retrieval precision (for RAG architectures), and measuring cost and latency [00:23:19].
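
To make the first, second, and fourth habits concrete, the sketch below shows one way an LLM-as-judge loop could be wired up: test cases are judged in parallel for speed, the judge returns a numeric score against a rubric, and its reasoning is kept so each grade is explainable. The Bedrock client, model ID, rubric, and 0-10 scale are illustrative assumptions rather than details from the talk.

```python
import json
import boto3
from concurrent.futures import ThreadPoolExecutor

# LLM-as-judge sketch: parallel judging keeps a full run fast, the rubric makes the
# score quantifiable, and the returned reasoning makes each grade explainable.
# Assumes Bedrock access; the model ID, rubric, and 0-10 scale are illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = """You are grading a generated answer against a gold standard answer.
Rubric: 10 = same meaning and complete, 5 = partially correct, 0 = wrong or off-topic.
Gold standard: {gold}
Generated answer: {generated}
Return JSON: {{"reasoning": "<why you chose this score>", "score": <number from 0 to 10>}}"""

def judge_one(case: dict) -> dict:
    # `case` is expected to carry "gold" and "generated" keys; extra keys pass through.
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(**case)}]}],
        inferenceConfig={"temperature": 0},
    )
    verdict = json.loads(response["output"]["message"]["content"][0]["text"])
    return {**case, **verdict}  # keep inputs, score, and reasoning together

def judge_all(cases: list[dict]) -> list[dict]:
    # Judging test cases in parallel keeps the whole run in the seconds range.
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(judge_one, cases))
    # Averaging across many test cases smooths out score jitter.
    print("average score:", sum(r["score"] for r in results) / len(results))
    return results
```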

Evaluation Workflow

A typical evaluation workflow for Generative AI involves:

  1. Gold Standard Set [00:23:33]: This is the most crucial part, requiring significant investment to ensure accuracy [00:23:54]. Using Generative AI to create the gold standard is discouraged, as it can propagate errors; human review is essential even for “silver standard” sets generated by AI [00:24:23].
  2. Generation [00:24:30]: An input from the gold standard set is fed into the prompt template and LLM to generate an output, which includes both the answer and its reasoning [00:24:36].
  3. Judging [00:24:43]: The generated output is compared against the gold standard’s matching answer using a “judge prompt” [00:24:45]. The judge generates a numerical score and its reasoning [00:24:48].
  4. Categorization and Summary [00:24:57]: Categories, often included in the gold standard set, allow for breaking down and summarizing the evaluation results for both correct and incorrect answers by category [00:25:05]. This provides clear insights into performance trends and areas needing improvement [00:17:50].
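
A minimal aggregation sketch for this categorization step follows, assuming each judged test case already carries a “category” from the gold standard set and a numeric “score” (for example, the output of a judge loop like the one sketched earlier); the pass threshold is an illustrative choice.

```python
from collections import defaultdict

# Categorization/summary sketch: group judged results by gold-standard category so the
# report shows which slice of the workload needs attention. The 7.0 pass threshold on
# a 0-10 scale is an illustrative assumption.
def summarize_by_category(judged_cases: list[dict], pass_threshold: float = 7.0) -> dict:
    buckets = defaultdict(lambda: {"passed": 0, "failed": 0, "scores": []})
    for case in judged_cases:
        bucket = buckets[case["category"]]
        bucket["scores"].append(case["score"])
        bucket["passed" if case["score"] >= pass_threshold else "failed"] += 1
    return {
        category: {
            "passed": b["passed"],
            "failed": b["failed"],
            "avg_score": round(sum(b["scores"]) / len(b["scores"]), 2),
        }
        for category, b in buckets.items()
    }
```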