From: aidotengineer
Evaluations are identified as the most significant challenge and “the missing piece” in scaling generative AI (GenAI) workloads [01:11:00] [01:43:00] [01:47:48]. Although cost, hallucinations, accuracy, and capacity are the concerns most often raised, the lack of evaluations is the number one issue seen across workloads [01:31:00] [01:45:00]. Implementing an evaluation framework is what unlocks the ability to scale GenAI projects [02:07:00] [02:12:00].
Why Evaluations are Crucial for Scaling AI
A real-world example highlights the transformative power of evaluations: a customer had been working on a document processing workload for 6-12 months with 6-8 engineers and was considering cutting the project because accuracy sat at only 22% [02:26:00] [02:49:00]. The core problem was that they had “zero evaluations,” with only a single accuracy number at the end of their process [03:06:00] [03:10:00]. After designing and implementing an evaluation framework that let them pinpoint the exact problems, fixing those problems became “trivial” [03:16:00] [03:22:00] [03:26:00]. Within six months, by January, accuracy had risen to 92%, exceeding their 90% launch threshold, and the workload became the largest document processing workload on AWS in North America at the time [03:36:00] [03:43:00] [03:47:00] [03:52:00].
This demonstrates that evaluations are not merely important for measuring quality but are essential for diagnosing and resolving issues that hinder scalability [04:02:00].
Primary Goal of Evaluation Frameworks
The main objective of any evaluation framework should be to discover problems [04:43:00]. While GenAI evaluations do produce a score, measuring quality is a secondary benefit [04:31:00] [04:58:00]. If the framework can identify where problems are and even suggest solutions (especially with generative AI reasoning incorporated), it enables improvement and unlocks workload scaling [04:48:00] [04:53:00] [04:56:00]. Designing an evaluation framework with the mindset of finding errors leads to a very different and more effective design than merely measuring performance [05:02:00] [05:08:00].
Evaluations as a “Filter” for Project Success
Evaluations serve as the primary filter distinguishing “science projects” from successful, scalable projects [05:42:00] [05:44:00]. Teams unwilling to invest time in creating a “gold standard set” for evaluations (e.g., spending two hours on it) often indicate a lack of commitment to scaling, making their efforts more akin to a science project [05:57:00] [06:06:00] [06:09:00] [06:12:00]. Conversely, successful projects, often achieving 100x ROI or significant cost reductions, are characterized by teams eager to invest substantial time (e.g., four hours) in building a robust evaluation framework [06:22:00] [06:25:00] [06:28:00] [06:32:00] [06:36:00].
Addressing “Baggage” (Free Text Challenges)
A common concern in GenAI is how to evaluate free text outputs, which can seem daunting compared to traditional AI/ML’s specific numerical metrics like F1 score or precision/recall [06:59:00] [07:02:00] [07:08:00] [07:11:00]. However, humanity has been grading and evaluating free text for hundreds, if not thousands, of years (e.g., professors grading essays) [07:14:00] [07:22:00] [07:26:00]. Just as a good professor explains what went wrong and how to improve, GenAI evaluations can go deeper than just a score to provide actionable insights for improvement [08:01:00] [08:14:00] [08:16:00] [08:20:00].
Evaluating Reasoning, Beyond Just Output
A critical aspect of best practices for AI evaluation in GenAI is evaluating how the model arrived at an output, not just the output itself [09:11:00] [09:20:00].
The 2x4 Analogy
If a 1-inch hole is successfully drilled through a 2x4, the output (the hole) looks good [08:42:00] [08:45:00]. However, if the “methodology” used was dangerous or unsustainable (e.g., using a precarious setup), the system is flawed despite a successful individual outcome [09:11:00] [09:16:00] [09:22:00] [09:27:00].
Meteorology Company Use Case
A meteorology company summarizing local weather from sensor data sometimes received correct-looking summaries, but when the model’s reasoning for a “sunny” output was “it’s important to mental health to be happy, so I decided not to talk about the rain” while it was actually raining, the underlying process was fundamentally broken [09:44:00] [09:50:00] [09:56:00] [10:05:00] [10:17:00] [10:20:00]. Knowing the reasoning provides critical insight into how to fix the problem [10:29:00] [10:30:00]. Without evaluating reasoning, a seemingly correct output can mask a flawed underlying process, which is a real danger when scaling [10:52:00].
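To make reasoning inspectable at all, the generation step has to return it in a form the evaluation can read. The sketch below is a hedged illustration of that idea only: the prompt wording, JSON field names, and the weather example are assumptions for demonstration, not the company’s actual implementation.

```python
import json

# Illustrative generation prompt: the model must return its reasoning
# alongside the answer so the evaluation can inspect *how* it got there.
GENERATION_PROMPT = """You are summarizing local weather from sensor readings.

Sensor readings:
{sensor_data}

Return JSON with two fields:
  "reasoning": the observations you relied on and how you weighed them
  "summary":   a one-sentence weather summary for the public
"""

def parse_generation(raw: str) -> dict:
    """Parse the model output; a missing or empty chain of reasoning
    is itself treated as an evaluation failure."""
    out = json.loads(raw)
    assert out.get("reasoning"), "no reasoning returned"
    assert out.get("summary"), "no summary returned"
    return out

# What a *broken* process looks like even with a plausible-sounding output:
# a judge that sees this reasoning contradict the sensor data fails the case.
example = parse_generation(json.dumps({
    "reasoning": "It's important to mental health to be happy, so I ignored the rain.",
    "summary": "Expect a sunny, pleasant afternoon.",
}))
print(example["reasoning"])
```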
Prompt Decomposition and its Impact on Evaluations
Prompt decomposition involves breaking down large, complex prompts into a series of chained, smaller prompts [11:23:00] [12:59:00] [13:02:00]. This technique, though not exclusive to evaluations, is often used in that context because it allows for attaching an evaluator to each section [11:28:00] [13:05:00] [13:07:00].
This approach offers several benefits:
- Error Localization: It helps pinpoint where errors are occurring within a complex workflow, rather than just knowing “something is going wrong” in the whole prompt [11:56:00] [12:01:00] [12:03:00] [13:11:00].
- Tool Appropriateness: It enables deciding if generative AI is even the right tool for a specific part of the prompt [13:18:00] [13:21:00]. For example, a simple mathematical comparison (like “is seven larger than five?”) is better handled by Python for perfect accuracy than by GenAI [13:28:00] [13:30:00] [13:33:00] [13:35:00].
- Semantic Routing: A common pattern involves semantic routing, where an initial step determines the task type and directs it to the appropriate model (e.g., easy tasks to a small model, hard tasks to a large model) [14:02:00] [14:06:00] [14:11:00] [14:14:00]. Attaching evaluations to each step of this process often significantly increases accuracy by removing “dead space” or “dead tokens” (unnecessary instructions) that can confuse the model and add cost [14:34:00] [14:37:00] [14:47:00] [14:52:00] [14:54:00] [15:10:00] [15:14:00] [15:17:00].
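As a rough sketch of how decomposition plus routing might be wired together, the snippet below uses one function per step so an evaluator can be pointed at each one. The `call_model` helper, the model IDs, and the routing labels are placeholders rather than any specific product’s API.

```python
# Sketch of a decomposed pipeline with an evaluator attachable to each step.
# call_model is a stand-in for whatever LLM invocation you use.

def call_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def route(task: str) -> str:
    """Step 1 (semantic routing): classify the task so easy requests go to a
    small model and hard ones to a large model. Evaluate this step on its own."""
    label = call_model("small-router-model", f"Classify as 'easy' or 'hard': {task}")
    return "small-model" if "easy" in label.lower() else "large-model"

def compare_numbers(a: float, b: float) -> bool:
    """Step 2 (tool appropriateness): 'is a larger than b?' is ordinary code,
    so it is answered in Python with perfect accuracy instead of a prompt."""
    return a > b

def answer(task: str) -> dict:
    """Step 3: only the parts that genuinely need generation hit the LLM.
    Returning per-step outputs lets an evaluator score each one separately."""
    model_id = route(task)
    draft = call_model(model_id, f"Answer concisely: {task}")
    return {"model": model_id, "draft": draft}
```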
Seven Habits of Highly Effective Generative AI Evaluations
Successfully scaled GenAI workloads almost universally include these seven habits in their evaluation frameworks [15:35:00] [15:37:00] [15:42:00] [15:44:00].
1. Fast
Evaluations must be fast to enable rapid iteration [15:50:00]. A target rule of thumb is 30 seconds for an evaluation framework to run [16:51:00] [16:54:00]. This allows teams to make hundreds of changes and run hundreds of tests daily, significantly accelerating the pace of innovation and accuracy improvement [16:19:00] [16:22:00] [16:25:00] [16:27:00]. This speed is achieved by using generative AI as a judge and processing generation and judging in parallel [16:57:00] [16:58:00] [17:18:00] [17:28:00].
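One common way to hit that kind of budget is to fan the gold standard set out across a thread pool so each case’s generation and judging happen concurrently. A minimal sketch, assuming the caller supplies `generate` (the workload under test) and `judge` (the LLM-as-judge call):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_eval(
    gold_set: list[dict],
    generate: Callable[[dict], str],      # the workload under test
    judge: Callable[[dict, str], float],  # LLM-as-judge, returns a score
    workers: int = 32,
) -> list[float]:
    """Run generation and judging for every test case in parallel; each case
    is independent, so wall-clock time is bounded by the slowest case rather
    than the sum of all cases."""
    def evaluate(case: dict) -> float:
        return judge(case, generate(case))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, gold_set))
```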
2. Quantifiable
All effective frameworks produce numbers [18:21:00] [18:24:00]. While GenAI scores may have some “jitter” (not always the exact same number), this is mitigated by using numerous test cases and averaging scores, similar to how grades are averaged in school [18:29:00] [18:31:00] [18:51:00] [18:54:00] [18:56:00] [18:58:00] [19:03:00].
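A small sketch of the averaging idea: collect one score per test case and report the mean alongside the spread, so judge jitter stays visible without dominating the headline number. The field names are illustrative.

```python
from statistics import mean, stdev

def summarize(scores: list[float]) -> dict:
    """Average across many test cases so per-case judge jitter washes out;
    the spread shows how noisy the judge itself is."""
    return {
        "mean_score": round(mean(scores), 3),
        "spread": round(stdev(scores), 3) if len(scores) > 1 else 0.0,
        "cases": len(scores),
    }

print(summarize([0.8, 0.9, 0.85, 1.0, 0.75]))
```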
3. Explainable
Evaluations should provide insight into how the model reached its output, including its reasoning for both generation and scoring [20:09:00] [20:11:00] [20:13:00] [20:17:00] [20:19:00]. This is akin to a professor using a rubric to explain why a paper received a certain score, providing clear instructions for improvement [20:41:00] [20:44:00] [20:48:00] [20:50:00] [20:56:00] [21:02:00].
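A hedged example of what an explainable judge prompt might look like; the rubric wording and JSON field names are illustrative assumptions, not a prescribed format.

```python
# Illustrative judge prompt: like a professor's rubric, it must return the
# reasoning behind the score, not just the number.
# Filled with .format(question=..., gold_answer=..., model_answer=..., model_reasoning=...)
JUDGE_PROMPT = """You are grading a generated answer against a gold standard.

Question:        {question}
Gold answer:     {gold_answer}
Model answer:    {model_answer}
Model reasoning: {model_reasoning}

Rubric:
- 10: matches the gold answer and the reasoning is sound
- 5:  partially correct, or correct answer reached through flawed reasoning
- 0:  incorrect, or the reasoning contradicts the inputs

Return JSON: {{"score": <0-10>, "reasoning": "<why this score, and what to fix>"}}
"""
```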
4. Segmented
Almost all scaled workloads are multi-step, requiring individual evaluation of each step [21:24:00] [21:26:00] [21:28:00] [21:31:00] [21:35:00] [21:36:00]. This allows for choosing the most appropriate (and often smallest, most cost-effective) model for each specific task within the workflow [21:44:00] [21:47:00] [21:50:00] [21:51:00] [22:01:00] [22:03:00].
5. Diverse
Evaluation frameworks need to be diverse, covering all intended use cases with a sufficient number of test cases [22:10:00] [22:12:00] [22:13:00] [22:15:00] [22:16:00]. Creating 100 test cases is a valuable exercise for defining project scope and ensuring questions cover both in-scope and out-of-scope scenarios (to measure proper redirection) [22:18:00] [22:20:00] [22:21:00] [22:26:00] [22:28:00] [22:30:00] [22:32:00].
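The shape of a diverse gold standard set might look like the sketch below, with categories spanning the intended use cases plus explicit out-of-scope cases whose expected behavior is a redirection; the field names and example rows are assumptions.

```python
# Illustrative gold standard entries: in-scope categories cover every intended
# use case, and out-of-scope entries check that the system redirects politely.
gold_standard = [
    {"category": "billing",      "in_scope": True,
     "input": "Why was I charged twice this month?",
     "expected": "Explains duplicate-charge checks and the refund process."},
    {"category": "forecasting",  "in_scope": True,
     "input": "Summarize tomorrow's weather for Denver.",
     "expected": "One-sentence summary consistent with the sensor data."},
    {"category": "out_of_scope", "in_scope": False,
     "input": "Write me a poem about my cat.",
     "expected": "Politely declines and redirects to supported topics."},
]
```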
6. Traditional
It’s important not to discard traditional evaluation techniques when working with GenAI [22:30:00] [22:32:00] [22:35:00]. If an output is numeric, a simple numeric comparison is appropriate [22:52:00] [22:54:00]. For RAG (Retrieval Augmented Generation) architectures, traditional database accuracy evaluations, retrieval precision, and F1 scores remain very powerful [22:58:00] [23:00:00] [23:02:00] [23:04:00] [23:06:00]. Measuring cost and latency also still relies on traditional tooling [23:11:00] [23:13:00].
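For the retrieval step of a RAG architecture, those traditional metrics need no LLM at all; a minimal sketch computing precision, recall, and F1 over retrieved document IDs:

```python
def retrieval_scores(retrieved: set[str], relevant: set[str]) -> dict:
    """Traditional retrieval metrics for the RAG step: no LLM judge needed.
    `retrieved` is the set of document IDs returned, `relevant` the gold set."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(retrieval_scores({"doc1", "doc3", "doc9"}, {"doc1", "doc2", "doc3"}))
```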
Building an Evaluation Framework: A Visual Example
A typical evaluation platform for AI agents involves several steps:
- Gold Standard Set: Start with a meticulously built “gold standard set” of inputs and expected outputs [23:31:00] [23:33:00] [23:36:00]. This is the most crucial investment of time, as the entire system is designed towards it; errors in the gold standard will propagate [23:38:00] [23:41:00] [23:44:00] [23:46:00]. GenAI should generally not be used to create the gold standard directly, as it can introduce the same errors the GenAI system itself has [24:00:00] [24:03:00] [24:08:00] [24:11:00]; however, it can be helpful for generating a “silver standard” that still requires human review for accuracy [24:16:00] [24:18:00] [24:20:00] [24:22:00].
- Generation: An input from the gold standard is fed into a prompt template and then into the LLM to generate an output, which includes both the answer and the reasoning [24:26:00] [24:28:00] [24:30:00] [24:33:00] [24:36:00].
- Judging: The generated output is compared against the matching gold standard answer using a “judge prompt” [24:38:00] [24:41:00] [24:43:00] [24:45:00]. The judge then generates a score (number) and the reasoning behind that score [24:48:00].
- Categorization and Summary: The gold standard set often includes categories, allowing for the final step of breaking down evaluation results by category [24:54:00] [24:57:00] [24:58:00] [25:02:00]. This generates a summary of right and wrong answers for each category, providing actionable insights for improvement [25:09:00] [25:12:00].
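Tying the steps above together, a hedged end-to-end sketch of the loop: the gold standard drives generation, a judge compares each output to its matching gold answer, and results roll up by category. The `generate` and `judge` callables stand in for the LLM calls, and the record fields are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

def run_framework(
    gold_set: list[dict],                 # human-verified inputs and expected outputs
    generate: Callable[[str], dict],      # returns {"answer": ..., "reasoning": ...}
    judge: Callable[[dict, dict], dict],  # returns {"score": float, "reasoning": str}
) -> dict:
    by_category: dict[str, list[dict]] = defaultdict(list)

    for case in gold_set:
        output = generate(case["input"])   # generation step: answer + reasoning
        verdict = judge(case, output)      # judging step: score + why
        by_category[case["category"]].append(
            {"input": case["input"], "score": verdict["score"],
             "why": verdict["reasoning"]}
        )

    # Categorization and summary: average score plus the lowest-scoring case
    # per category, so it is obvious where to go and what to fix.
    return {
        cat: {"mean_score": mean(r["score"] for r in results),
              "worst": min(results, key=lambda r: r["score"])}
        for cat, results in by_category.items()
    }
```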