From: aidotengineer

Justin Mohler, a Principal Applied AI Architect at AWS, presented a talk on the “seven habits of highly effective generative AI evaluations” [00:00:05]. His small specialist team focuses on helping customers scale GenAI workloads [00:00:32], having observed many successful and failed deployments across various industries and customer sizes [00:00:34]. Through this experience, they’ve gathered best practices from successful workloads and identified common failure points [00:00:52].

The Core Challenge in Scaling Generative AI

The single biggest challenge in scaling generative AI [00:01:11] is the lack of proper evaluations [00:01:43]. While concerns like cost, hallucinations, accuracy, and capacity often arise [00:01:33], evaluations are the “missing piece to scaling GenAI” [00:01:47]. Implementing evaluations can unlock the ability to scale solutions [00:02:07].

Customer Example: Document Processing Workload

In July 2024, Mohler was called in to an escalated document processing workload that had been in development for 6-12 months with 6-8 engineers [00:02:20]. The project’s accuracy was only 22%, and cancellation was being considered [00:02:49]. After a couple of weeks of discovery, it became evident the team had “zero evaluations” beyond a single end-to-end accuracy number [00:03:06].

Designing an evaluation framework made the problems clear, and fixing them became “trivial” [00:03:22]. Within six months, the team built the framework, fixed issues, and achieved 92% accuracy by January [00:03:36]. This success allowed them to launch to production at scale and become the single largest document processing workload on AWS in North America at the time [00:03:48].

Why Evaluations are Crucial for Project Success

Mohler views evaluations as the “number one filter that separates a science project from a successful project” [00:05:42]. Teams willing to invest time in creating a “gold standard set” for evaluations are more likely to have successful projects, whereas those who see it as “boring” often end up with unscalable “science projects” [00:06:14].

Purpose of Evaluation Frameworks

While GenAI evaluations do produce a score, measuring quality is a secondary goal [00:04:31]. The main goal of any evaluation framework should be to discover problems [00:04:43]. A good framework tells you where the problems are and can even suggest solutions, especially if it incorporates generative AI reasoning [00:04:50].

Thinking of evaluations as a tool to find errors leads to a different design mindset than simply measuring performance [00:05:03].

Addressing Generative AI Evaluation Challenges

Generative AI workloads, especially those producing free-text outputs, can seem daunting to evaluate for teams coming from a traditional AI/ML background focused on exact numeric scores [00:07:09]. However, humans have been grading and evaluating free text for centuries (e.g., essay grading) [00:07:14].

The key distinction with generative AI evaluations is to go deeper than just a score [00:08:12]. Like a good professor, the evaluation should point out what went wrong and where to improve [00:08:16].

Evaluating Reasoning

A unique complexity in generative AI evaluation is that the methodology or reasoning behind the output matters, not just the output itself [00:09:11].

For instance, a meteorology company used a model to summarize sensor data. If the model incorrectly states “today it’s sunny and bright” when it is actually raining, the output score is zero [00:10:05]. But if the model’s reasoning is revealed to be “it’s important to mental health to be happy, so I decided not to talk about the rain” [00:10:18], this provides crucial insight into fixing the problem [00:10:29]. Conversely, a correct output (“sunny”) could still be problematic if the underlying reasoning is flawed [00:11:09].

Prompt Decomposition

Prompt decomposition is a technique often used in the context of evaluations [00:11:26]. It involves breaking one large, complex prompt into a chain of smaller prompts [00:13:02] so that an evaluation can be attached to each section [00:13:07], making it easier to identify where errors occur and where to focus effort [00:13:11].

This technique also helps determine whether generative AI is even the right tool for a specific part of the prompt [00:13:18]. For example, a weather company’s prompt struggled with simple mathematical comparisons (e.g., “is seven less than five?”) [00:12:48]. By decomposing the prompt and replacing the mathematical comparison with a Python step, accuracy increased to 100% [00:13:45].
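To make this concrete, here is a minimal sketch of prompt decomposition in Python. The `call_llm` helper, the JSON keys, and the weather wording are assumptions for illustration, not the speaker’s actual implementation; the point is that each chained step produces an output that can be evaluated on its own, and the failing mathematical comparison becomes plain Python.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper standing in for whatever model API is used."""
    raise NotImplementedError

def extract_temperatures(sensor_report: str) -> dict:
    # Step 1 (LLM): pull structured values out of free text.
    prompt = (
        "Extract today's forecast temperature and the seasonal average from this "
        "report as JSON with keys 'forecast' and 'average':\n" + sensor_report
    )
    return json.loads(call_llm(prompt))

def compare_to_average(forecast: float, average: float) -> str:
    # Step 2 (plain Python): the numeric comparison the model struggled with
    # is handled deterministically instead of by the LLM.
    return "below average" if forecast < average else "at or above average"

def summarize(sensor_report: str, comparison: str) -> str:
    # Step 3 (LLM): write the user-facing summary using the verified comparison.
    prompt = (
        "Write a one-sentence weather summary for this report:\n"
        f"{sensor_report}\nThe forecast is {comparison} for this time of year."
    )
    return call_llm(prompt)

def process_report(sensor_report: str) -> str:
    # The chain: each step's input and output can be checked against its own
    # gold standard, pinpointing exactly where errors are introduced.
    temps = extract_temperatures(sensor_report)
    comparison = compare_to_average(temps["forecast"], temps["average"])
    return summarize(sensor_report, comparison)
```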

Semantic Routing

Semantic routing is a common pattern where an input query is first categorized to determine the task type (easy vs. hard) and then routed to an appropriate model: a small model for easy tasks, a large model for hard ones [00:14:15]. Attaching an evaluation to each step in this process makes it possible to prove which model is most appropriate for that step, often significantly increasing accuracy by removing “dead space”, i.e. unnecessary tokens and instructions [00:14:56].
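A minimal sketch of this routing pattern follows, again with a hypothetical `call_llm(model, prompt)` helper and placeholder model names; in practice the classification step and the answering step would each get their own evaluation.

```python
SMALL_MODEL = "small-model"  # placeholder: cheap, fast model for easy queries
LARGE_MODEL = "large-model"  # placeholder: more capable model for hard queries

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper standing in for whatever model API is used."""
    raise NotImplementedError

def classify_difficulty(query: str) -> str:
    # Step 1: categorize the task type. Evaluating this step separately shows
    # how often queries are routed to the wrong model.
    prompt = (
        "Classify the following query as 'easy' or 'hard'. "
        "Respond with a single word.\n\nQuery: " + query
    )
    return call_llm(SMALL_MODEL, prompt).strip().lower()

def route_and_answer(query: str) -> str:
    # Step 2: route to the model that per-step evaluations have shown to be
    # appropriate, keeping each model's prompt free of unnecessary instructions.
    model = LARGE_MODEL if classify_difficulty(query) == "hard" else SMALL_MODEL
    return call_llm(model, query)
```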

Seven Habits of Highly Effective Generative AI Evaluations

These are the seven most common trends observed across successfully scaled generative AI workloads [00:15:37]. Mohler states he has never seen a workload scale without evaluations, and most include these seven habits [00:15:46].

  1. Fast:

    • Evaluation results should be available quickly, ideally within seconds, not days [00:15:58].
    • This rapid feedback allows teams to make hundreds of changes and tests daily, significantly accelerating the pace of innovation and accuracy improvements [00:16:30].
    • A target rule of thumb is 30 seconds for an evaluation framework run, using generative AI as a judge or Python for numeric outputs [00:17:11]; the framework sketch at the end of this piece illustrates such a run. This 30 seconds can be broken down into:
      • 10 seconds for parallel generation across 100 test cases [00:17:25].
      • 10 seconds for parallel judging of those results against a gold standard [00:17:39].
      • 10 seconds for summarizing the judged output, often broken down by categories, highlighting right and wrong trends [00:18:15].
  2. Quantifiable:

    • Effective frameworks always produce numbers [00:18:24].
    • While scores might have some “jitter,” this is managed by having enough test cases and averaging the results, similar to how multiple assignments contribute to a student’s final grade [00:19:14].
  3. Numerous:

    • Test cases should be numerous, covering all use cases and the full scope of the project [00:19:22].
    • Building 100 test cases helps teams define the project’s scope, identifying what queries should be answered and what should be redirected [00:20:06].
  4. Explainable:

    • Evaluations should provide insight into how the model reached its output, not just the output itself [00:20:13]. This includes the reasoning for both the generation and the scoring by the judge [00:20:20].
    • Just as prompt engineering is done for the user-facing prompt, the judge prompt also needs to be engineered [00:20:35]. Asking the judge to explain its reasoning helps in this process [00:21:22].
    • The judge should operate with a clear “rubric” of rules and instructions, similar to how a professor grades an essay [00:21:13].
  5. Segmented:

    • Almost all scaled workloads are multi-step processes, not single prompts [00:21:35].
    • Each step should be evaluated individually [00:21:38]. This allows for determining the most appropriate and smallest model for each specific step, optimizing for cost and efficiency [00:22:06].
  6. Diverse:

    • Test cases should be diverse enough to cover the full scope of the project, including its edge cases [00:22:27].
    • For core use cases, many examples are needed, while edge cases might only require a few [00:22:27].
  7. Traditional:

    • It’s important not to discard traditional tooling and evaluation techniques when working with GenAI [00:22:39].
    • For numeric outputs, traditional numeric evaluations are appropriate [00:22:54].
    • For RAG (Retrieval Augmented Generation) architectures, database accuracy evaluations, retrieval precision, recall, and F1 scores are still relevant [00:23:08] (see the sketch after this list).
    • Measuring cost and latency also relies on traditional tooling [00:23:13].
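As a small illustration of the point about traditional tooling, the sketch below computes retrieval precision, recall, and F1 for a RAG retriever against a gold standard set of relevant documents; the document IDs are made up for the example.

```python
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    """Standard precision/recall/F1 for one test case of a RAG retriever."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the retriever returned doc2 and doc5, but the gold standard marks
# doc2 and doc7 as relevant -> precision 0.5, recall 0.5, F1 0.5.
print(retrieval_metrics({"doc2", "doc5"}, {"doc2", "doc7"}))
```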

Evaluation Framework Example

The evaluation framework presented can be broken down into four steps:

  1. Gold Standard Set: The most crucial part to invest time in [00:23:36]. Errors in this set will lead to a system that creates similar errors [00:23:48]. It’s generally not recommended to use GenAI to create the gold standard directly, as it can propagate the system’s own biases or errors [00:24:13]. A “silver standard set” can be generated by GenAI as a guess, but must be human-reviewed for accuracy [00:24:23].
  2. Generation: An input from the gold standard set is fed into a prompt template and an LLM to generate an output, which includes both the answer and the reasoning [00:24:36].
  3. Judging: The generated output and its reasoning are compared with the matching answer from the gold standard input, using a “judge prompt” to generate a score and the reasoning behind that score [00:24:48].
  4. Categorization and Summary: The gold standard set often includes categories [00:24:58]. The final step involves breaking down the results by category and generating a summary for right and wrong answers within each category [00:25:16].
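A minimal sketch of this four-step loop is shown below. The `call_llm` helper, the JSON shapes, the rubric wording, and the worker count are assumptions for illustration rather than the speaker’s actual framework; it also shows how running generation and judging in parallel across the test cases keeps the whole run within the fast feedback budget described earlier.

```python
import json
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Hypothetical helper standing in for whatever model API is used."""
    raise NotImplementedError

# 1. Gold standard set: human-reviewed inputs, answers, and categories.
gold_standard = [
    {"input": "...", "answer": "...", "category": "summarization"},
    # ... roughly 100 cases covering the project's full scope
]

def generate(case: dict) -> dict:
    # 2. Generation: produce both the answer and the reasoning behind it.
    output = call_llm(
        "Answer the following and explain your reasoning. Return JSON with "
        "keys 'answer' and 'reasoning'.\n" + case["input"]
    )
    return {**case, "generated": json.loads(output)}

def judge(result: dict) -> dict:
    # 3. Judging: compare the generated answer and reasoning against the gold
    # standard answer using a rubric, returning a score plus the reasoning
    # behind that score.
    verdict = call_llm(
        "Grade the model answer against the gold answer using the rubric. "
        "Return JSON with keys 'score' (0-10) and 'reasoning'.\n"
        f"Question: {result['input']}\n"
        f"Gold answer: {result['answer']}\n"
        f"Model answer: {result['generated']['answer']}\n"
        f"Model reasoning: {result['generated']['reasoning']}"
    )
    return {**result, "judged": json.loads(verdict)}

def run_evaluation(cases: list) -> dict:
    # Generation and judging run in parallel across all test cases.
    with ThreadPoolExecutor(max_workers=20) as pool:
        generated = list(pool.map(generate, cases))
        judged = list(pool.map(judge, generated))
    # 4. Categorization and summary: average scores per category.
    by_category = defaultdict(list)
    for result in judged:
        by_category[result["category"]].append(result["judged"]["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in by_category.items()}
```

In a fuller framework, the final step would also feed the judged reasoning back through a model to produce the per-category summaries of right and wrong trends described above.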