From: aidotengineer
Justin Mohler, a principal applied AI architect at AWS, shares insights on scaling generative AI workloads based on his 15 years of experience in natural language processing and four years in generative AI at AWS [00:00:14] [00:00:20] [00:00:23]. His small specialist team assists customers with scaling GenAI workloads across various industries and customer sizes, including some of the largest in North America [00:00:30] [00:00:32] [00:00:35] [00:00:38] [00:00:42]. Through this experience, he has observed common failure points and best practices from successful projects, which he shares in this talk [00:00:47] [00:00:50] [00:00:52] [00:01:02].
The Biggest Challenge in Scaling Generative AI
According to Mohler, the biggest challenge in scaling generative AI is evaluations [00:01:11] [00:01:21]. While concerns like cost, hallucinations, accuracy, and capacity are often raised, a lack of evaluations is consistently the number one issue across all workloads [00:01:29] [00:01:33] [00:01:35] [00:01:37] [00:01:43]. Evaluations are the “missing piece” that, when added, unlocks the ability to scale generative AI workloads [00:01:47] [00:02:07] [00:02:14].
Customer Example: Document Processing Workload
In July 2024, Mohler was called in to assist a customer with a document processing workload that had been in development for six to twelve months with six to eight engineers [00:02:20] [00:02:26] [00:02:30] [00:02:34]. The project’s accuracy was only 22%, leading the VP of technology to consider cutting it [00:02:42] [00:02:48] [00:02:52].
During discovery, Mohler found that the project had zero evaluations [00:03:01] [00:03:06]. While they had an end-to-end process, they only had a single accuracy number [00:03:10] [00:03:12]. Mohler designed an evaluation framework, which revealed the exact locations of the problems [00:03:16] [00:03:20]. Fixing these issues became trivial once identified [00:03:25] [00:03:29]. Over the next six months, the team built the framework and fixed the issues, achieving 92% accuracy by January, exceeding their 90% production launch threshold [00:03:36] [00:03:43] [00:03:47]. This allowed them to launch and become the single largest document processing workload on AWS in North America at that time [00:03:52] [00:03:57].
What are Generative AI Evaluations?
While evaluations in traditional AI/ML often focus on measuring quality (e.g., F1 score, precision, recall) [00:04:11] [00:04:15] [00:04:22], the main goal with any generative AI evaluation framework should be to discover problems [00:04:40] [00:04:43]. Although generative AI evaluations produce a score, this is a secondary benefit [00:04:31]. A framework that pinpoints where the problems are, and can even suggest solutions through generative AI reasoning, is what allows a workload to be improved [00:04:48] [00:04:51] [00:04:56]. Designing an evaluation framework with an error-finding mindset is crucial [00:05:03] [00:05:08].
Evaluations as a Filter for Project Success
Mohler’s team uses evaluations as the number one filter to distinguish between a “science project” and a “successful project” [00:05:35] [00:05:42] [00:05:44]. Teams willing to invest time in building a gold standard set for evaluations are typically successful, while those who just want “toys to play with” without focusing on evaluation often fail to scale [00:05:57] [00:06:06] [00:06:09] [00:06:12] [00:06:19]. Wildly successful projects, often achieving 100x ROI or significant cost reductions, prioritize evaluation frameworks [00:06:22] [00:06:25] [00:06:28] [00:06:30].
Addressing Generative AI Evaluation Complexities
Evaluating generative AI outputs, often free-form text, can seem daunting, especially for those with a traditional AI/ML background accustomed to precise numeric scores [00:06:54] [00:06:59] [00:07:02] [00:07:09]. However, humans have been grading free text for centuries (e.g., essays) [00:07:14] [00:07:22]. The key is to not just assign a score, but to point out what went wrong and where to improve, similar to a good professor providing feedback with a rubric [00:08:01] [00:08:16] [00:20:48].
The Importance of Reasoning
Evaluating how a generative AI model arrived at an answer is as critical as evaluating the answer itself [00:09:20] [00:09:31].
- 2x4 Example: If a 1-inch hole is drilled through a 2x4, the output appears correct [00:08:42] [00:08:45] [00:08:49]. However, if the methodology was dangerous (e.g., drilling with unsafe tools and posture), even a correct output indicates a flawed system that needs rethinking [00:09:11] [00:09:16] [00:09:25].
- Meteorology Company Example: A meteorology company used generative AI to summarize local weather from sensor data [00:09:42] [00:09:44] [00:09:47].
- If the input indicates rain and 40° but the summary says “sunny and bright,” the score is zero [00:09:56] [00:10:00] [00:10:05] [00:10:12]. However, simply knowing the score doesn’t explain why it failed [00:10:14].
- If the model’s reasoning is, “It’s important to mental health to be happy, so I decided not to talk about the rain,” this insight reveals the problem: the model is prioritizing a positive tone over factual accuracy [00:10:16] [00:10:20] [00:10:22] [00:10:23]. This allows for targeted fixes [00:10:29].
- Conversely, if the input is “sunny” and the response is “sunny” (score 10/10), it appears successful [00:10:40] [00:10:44]. But if the reasoning was equally flawed (e.g., “I decided not to talk about the rain”), this indicates the prompt isn’t working correctly, and the success was coincidental [00:10:54] [00:10:57] [00:11:00] [00:11:03].
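A minimal sketch of what judging the reasoning alongside the answer could look like for the meteorology example. The rubric wording, field names, and the `call_llm` stand-in are illustrative assumptions, not the speaker's implementation:

```python
import json

# Hypothetical judge prompt: grades the summary against the sensor input
# AND checks whether the stated reasoning is sound, then explains the score.
JUDGE_PROMPT = """You are grading a weather summary against sensor data.

Sensor input: {sensor_input}
Model's summary: {summary}
Model's stated reasoning: {reasoning}

Rubric:
- 10: the summary matches the sensor input AND the reasoning is sound.
- 0: the summary contradicts the sensor input (e.g., "sunny" during rain).
- Deduct points whenever the reasoning ignores or overrides factual input,
  even if the final summary happens to be correct.

Return JSON only: {{"score": <0-10>, "explanation": "<what went wrong and where>"}}
"""


def judge(sensor_input: str, summary: str, reasoning: str, call_llm) -> dict:
    """call_llm is any function that takes a prompt string and returns the judge model's text reply."""
    prompt = JUDGE_PROMPT.format(
        sensor_input=sensor_input, summary=summary, reasoning=reasoning
    )
    return json.loads(call_llm(prompt))


# A correct-looking answer produced by flawed reasoning should still lose
# points, because the success was coincidental:
# judge("sunny, 75F", "Sunny and bright today.",
#       "I decided not to talk about the rain.", call_llm=my_model_call)
```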
Prompt Decomposition
Prompt decomposition involves breaking a large, complex prompt into a series of chained, smaller prompts [00:11:23] [00:11:59] [00:13:02]. While not exclusive to evaluations, it greatly aids them because evaluations can then be attached to each section of the prompt [00:11:28] [00:11:30] [00:13:07]. This allows identification of exactly which part of the prompt is failing, helping to focus improvement efforts [00:11:56] [00:11:59] [00:13:11] [00:13:15].
It also helps determine if generative AI is the right tool for a specific part of the prompt [00:13:18] [00:13:21]. For the weather company, a section of their prompt handled wind speed classification (e.g., wind speed > 5 means windy) [00:12:12] [00:12:14] [00:12:21]. While this worked in proof-of-concept, it sometimes failed at scale (e.g., Claude incorrectly stating “seven is less than five”) [00:12:35] [00:12:38] [00:12:41]. By decomposing the prompt, they replaced the mathematical comparison with a Python step, achieving 100% accuracy for that part [00:12:51] [00:13:28] [00:13:30] [00:13:33] [00:13:38] [00:13:42].
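A minimal sketch of that decomposition, assuming the wind threshold and field names from the example: the numeric comparison moves out of the prompt into plain Python, and the remaining LLM step handles only the language task, so each piece can be evaluated on its own. The `call_llm` stand-in is a placeholder for whatever model client is in use:

```python
# Wind classification is deterministic code, not a prompt instruction:
# "wind speed > 5 means windy" is now correct 100% of the time.
WINDY_THRESHOLD = 5  # assumption taken from the talk's example


def classify_wind(sensor_data: dict) -> str:
    return "windy" if sensor_data["wind_speed"] > WINDY_THRESHOLD else "calm"


def summarize_weather(sensor_data: dict, call_llm) -> str:
    # Step 1: Python handles the numeric comparison.
    wind_label = classify_wind(sensor_data)
    # Step 2: a smaller, focused prompt handles only the language task,
    # so its evaluation can target just this step.
    prompt = (
        "Write a one-sentence local weather summary.\n"
        f"Conditions: {sensor_data['conditions']}, "
        f"temperature: {sensor_data['temp_f']}F, wind: {wind_label}."
    )
    return call_llm(prompt)


# summarize_weather({"conditions": "rain", "temp_f": 40, "wind_speed": 7}, my_model_call)
```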
Semantic Routing
Semantic routing is a common pattern in agentic architectures for generative AI [00:14:02]. An incoming query is first routed to the appropriate model based on its task complexity (e.g., easy tasks to small models, hard tasks to large models) [00:14:03] [00:14:08] [00:14:11]. This ensures the “right model for the job” is used [00:14:18].
Attaching evaluations to each step of semantic routing is crucial [00:14:34] [00:14:37]. For a semantic router, the evaluation input might be a query, and the output is a simple number indicating the chosen route [00:14:39] [00:14:41] [00:14:43]. Breaking down prompts this way significantly increases accuracy by removing “dead space” or “dead tokens” (unnecessary instructions for a given task), which reduces cost and confusion for the model [00:14:47] [00:14:50] [00:14:52] [00:14:54] [00:15:10] [00:15:14] [00:15:17].
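A hedged sketch of a semantic router along these lines. The model IDs and the `call_llm(prompt, model_id)` signature are placeholders rather than details from the talk; the point is that the router's output is a single number, so its evaluation reduces to comparing that number with a gold standard label:

```python
SMALL_MODEL = "small-model-id"  # cheap model for easy tasks (and for routing itself)
LARGE_MODEL = "large-model-id"  # larger model reserved for complex tasks

ROUTER_PROMPT = (
    "Classify this query by difficulty. Reply with a single digit:\n"
    "1 = simple lookup, 2 = multi-step reasoning.\n"
    "Query: {query}"
)


def route(query: str, call_llm) -> str:
    # The router prompt contains only routing instructions, so there are
    # no "dead tokens" adding cost or confusing the model.
    reply = call_llm(ROUTER_PROMPT.format(query=query), SMALL_MODEL)
    return SMALL_MODEL if reply.strip() == "1" else LARGE_MODEL


def answer(query: str, call_llm) -> str:
    model_id = route(query, call_llm)
    # Each route then uses a prompt holding only the instructions that
    # route actually needs.
    return call_llm(f"Answer the user's question:\n{query}", model_id)
```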
Seven Habits of Highly Effective Generative AI Evaluations
Mohler identifies seven common trends among successful generative AI workloads that have scaled, all of which include robust evaluations [00:15:35] [00:15:37] [00:15:40] [00:15:42].
- Fast: Evaluations should run quickly, ideally within 30 seconds [00:15:50] [00:15:53] [00:16:54]. This enables hundreds of changes and tests daily, accelerating the pace of innovation and accuracy improvements [00:16:21] [00:16:25] [00:16:27]. A 30-second target allows for:
- 10 seconds for parallel generation across 100 test cases [00:17:15] [00:17:18] [00:17:25].
- 10 seconds for parallel judgment of these results by generative AI (or Python for numeric outputs) [00:17:28] [00:17:31] [00:17:35] [00:17:39].
- 10 seconds to summarize judge outputs by categories, highlighting right and wrong trends [00:17:41] [00:17:44] [00:17:46] [00:17:50]. This focuses on discovering errors and how to fix them [00:18:09] [00:18:12]. (A sketch of this parallel loop appears after the list.)
- Quantifiable: Effective frameworks produce numbers, even if there’s some jitter [00:18:21] [00:18:24] [00:18:29]. This jitter is managed by using numerous test cases and averaging scores, similar to how grades are averaged in school [00:18:51] [00:18:54] [00:19:00] [00:19:03] [00:19:07].
- Explainable: Evaluations should provide insight into the model’s reasoning during generation and scoring [00:20:09] [00:20:11] [00:20:13] [00:20:17]. Just as users’ prompts need engineering, so do the judge’s prompts, ensuring the judge scores correctly [00:20:27] [00:20:30] [00:20:33]. Clear instructions and a rubric help the judge explain its reasoning [00:20:49] [00:21:09] [00:21:13].
- Segmented: Nearly all scaled workloads involve multiple steps, not a single prompt [00:21:24] [00:21:26] [00:21:28] [00:21:33]. Each step should be evaluated individually to determine the most appropriate model (e.g., Nova Micro for semantic routing) [00:21:36] [00:21:39] [00:21:50] [00:21:53] [00:21:55].
- Diverse: The test cases should cover all in-scope use cases, ideally around 100 cases [00:22:10] [00:22:12] [00:22:18]. This ensures broad coverage and helps teams define the project’s scope, including how to handle out-of-scope queries [00:19:17] [00:19:22] [00:19:32] [00:19:54] [00:20:01].
- Traditional: Do not abandon traditional AI/ML evaluation techniques [00:22:30] [00:22:32]. For numeric outputs, use numeric evaluations [00:22:52] [00:22:54]. For RAG architectures, traditional retrieval and database accuracy metrics, such as precision and F1 scores, are still very powerful [00:22:58] [00:23:02] [00:23:04]. Measuring cost and latency also relies on traditional tooling [00:23:11] [00:23:13].
- Good Gold Standard Set: The most important investment of time is building a high-quality gold standard set [00:23:33] [00:23:36] [00:23:52]. The entire system is designed around this set, so errors in the gold standard will propagate [00:23:38] [00:23:41]. Generative AI should not be used to create the gold standard set directly, as it can introduce the same errors [00:24:00] [00:24:03] [00:24:07]. It can, however, generate a “silver standard set” which still requires human review for accuracy [00:24:16] [00:24:20] [00:24:22].
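As flagged in the "Fast" habit above, here is a minimal sketch of a parallel evaluation loop that also illustrates the "Quantifiable" habit: roughly 100 gold standard cases are generated and judged in parallel to fit the ~30-second budget, and scores are averaged per category to smooth out judge jitter. The `generate` and `judge` callables and the gold-set field names are assumptions:

```python
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_evals(gold_set: list[dict], generate, judge) -> dict:
    """gold_set items are assumed to look like {"input": ..., "expected": ..., "category": ...}."""
    with ThreadPoolExecutor(max_workers=20) as pool:
        # ~10 s: generate outputs for all test cases in parallel.
        outputs = list(pool.map(lambda case: generate(case["input"]), gold_set))
        # ~10 s: judge every output in parallel (an LLM judge, or plain
        # Python when the output is numeric).
        scores = list(pool.map(
            lambda pair: judge(pair[0]["input"], pair[0]["expected"], pair[1])["score"],
            zip(gold_set, outputs),
        ))
    # ~10 s remain for summarizing; here we simply average per category so
    # jitter in individual judgments washes out.
    by_category: dict[str, list[float]] = {}
    for case, score in zip(gold_set, scores):
        by_category.setdefault(case["category"], []).append(score)
    return {cat: statistics.mean(vals) for cat, vals in by_category.items()}
```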
Evaluation Framework Flow
An evaluation framework typically involves the following steps:
- Input: Select an input from the gold standard set [00:24:26] [00:24:28].
- Generation: Pass the input through the prompt template and LLM to generate an output, including the answer and its reasoning [00:24:30] [00:24:33] [00:24:36].
- Judgment: Compare the generated output with the matching answer from the gold standard set using a “judge prompt.” The judge then generates a score and the reasoning behind that number [00:24:38] [00:24:41] [00:24:43] [00:24:45] [00:24:48].
- Categorization & Summary: The output is categorized (often pre-defined in the gold standard set) to break down results by category and generate a summary of right and wrong answers for each [00:24:54] [00:24:57] [00:25:02] [00:25:09].