From: aidotengineer

Effective evaluations are considered the single biggest challenge and the “missing piece” to scaling generative AI (GenAI) workloads [00:01:21]. Many concerns arise when attempting to scale GenAI, such as cost, hallucinations, accuracy, and capacity, but the most common missing element is a robust evaluation framework [00:01:31]. Implementing such frameworks can unlock the ability to scale these applications [00:02:07].

Why Evaluations are Crucial

A real-world example highlights the transformative power of evaluations. A document processing workload, worked on by six to eight engineers for six to twelve months, stalled at 22% accuracy [00:02:26]. The core issue was a complete lack of evaluations beyond a single, end-to-end accuracy number [00:03:06]. By designing an evaluation framework that identified specific problems, the team was able to achieve 92% accuracy within six months, leading to the project’s successful launch as the largest document processing workload on AWS in North America at the time [00:03:36].

The primary goal of any evaluation framework should be to discover problems and, ideally, suggest solutions [00:04:43]. While measuring quality is important, the ability to pinpoint and fix errors is paramount for improvement [00:04:56].

Evaluations as a Project Filter

Within AWS, evaluations serve as the primary filter distinguishing “science projects” from successful ventures [00:05:42]. Teams willing to invest time in building a robust gold standard for evaluations are the ones whose projects achieve significant return on investment and scale [00:06:22].

Unique Complexities of Generative AI Evaluation

Unlike traditional machine learning, where outputs are numerical or categorical and can be scored directly with standard metrics (e.g., F1 score, precision, recall), GenAI outputs are typically free text, which has no single obvious way to be scored [00:07:02].

The Importance of Reasoning

A critical aspect of GenAI evaluation is assessing not just the output, but how the model arrived at that output [00:09:20]. This means evaluating the model’s reasoning. For instance, a weather summary GenAI application given sensor data indicating rain and wind might output “today it’s sunny and bright outside.” While the output is clearly wrong (a score of zero), understanding why it produced that output (e.g., “it’s important to mental health to be happy, so I decided not to talk about the rain”) provides crucial insight for correction [00:10:01]. Conversely, a correct output (e.g., input: sunny, output: sunny) might still be based on flawed reasoning, indicating a fragile system prone to future errors [00:10:40].
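
A minimal sketch of how reasoning can be evaluated alongside the output, using the weather-summary example above. The prompt templates and the `call_llm` helper are hypothetical placeholders for whatever model client is in use, not anything specified in the talk.

```python
import json

# Hypothetical LLM hook: wire this to your model provider of choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

GENERATION_PROMPT = """\
You are a weather summarizer. Given the sensor readings below, return JSON
with two keys: "summary" (one sentence) and "reasoning" (why you wrote it).

Sensor readings: {sensor_data}
"""

JUDGE_PROMPT = """\
You are grading a weather summary. Score it from 0 to 10 against the sensor
readings, and separately judge whether the stated reasoning is sound.
Return JSON with keys: "score" (0-10), "reasoning_sound" (true/false),
"explanation" (one sentence).

Sensor readings: {sensor_data}
Candidate summary: {summary}
Candidate reasoning: {reasoning}
"""

def evaluate_case(sensor_data: str) -> dict:
    generated = json.loads(call_llm(GENERATION_PROMPT.format(sensor_data=sensor_data)))
    verdict = json.loads(call_llm(JUDGE_PROMPT.format(
        sensor_data=sensor_data,
        summary=generated["summary"],
        reasoning=generated["reasoning"],
    )))
    # A correct answer built on flawed reasoning gets flagged rather than silently passed.
    return {"generated": generated, "verdict": verdict}
```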

Prompt Decomposition

One technique to manage the complexity of GenAI evaluations is prompt decomposition [00:11:23]. This involves breaking down a large, complex prompt into a series of chained, smaller prompts [00:13:02].

  • Benefit 1: Granular Evaluation: By attaching an evaluation to each section of the prompt, developers can pinpoint exactly where errors are occurring [00:13:07].
  • Benefit 2: Right Tool for the Job: Decomposition allows for determining whether GenAI is even the appropriate tool for a given segment of the prompt [00:13:18]. For example, a mathematical comparison (“is seven larger than five?”) is better handled by Python than by GenAI, guaranteeing accuracy while reducing cost and confusion (a minimal sketch follows this list) [00:13:28].
  • Benefit 3: Cost and Accuracy Improvement: By removing “dead space” or unnecessary instructions for a given task (e.g., complex instructions for an easy query), decomposition can significantly increase accuracy and reduce cost [00:14:50].
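
A minimal sketch of the “right tool for the job” point: a numeric comparison is pulled out of the prompt and handled deterministically in Python, so only the genuinely generative sub-steps are sent to a model. The function name is illustrative.

```python
def is_larger(a: float, b: float) -> bool:
    """Deterministic sub-step: no LLM call, no cost, no chance of hallucination."""
    return a > b

# In a decomposed pipeline, simple comparisons like this are resolved locally
# before (or instead of) prompting a model.
assert is_larger(7, 5) is True
```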

An example of prompt decomposition in practice is semantic routing [00:14:02]. Here, an initial step uses GenAI to classify an input query (e.g., easy vs. hard task) and then routes it to the appropriate model (small model for easy tasks, large model for hard tasks) [00:14:11]. Evaluations are attached to each step, ensuring the correct routing and model selection [00:14:34].
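
A minimal sketch of semantic routing under these assumptions: `classify_difficulty`, the model hooks, and the route labels are hypothetical, and each step would have its own evaluation set attached.

```python
from typing import Callable

# Hypothetical model hooks; replace with real client calls to your provider.
def invoke_small_model(query: str) -> str:
    raise NotImplementedError("cheap, fast model goes here")

def invoke_large_model(query: str) -> str:
    raise NotImplementedError("large, expensive model goes here")

def classify_difficulty(query: str) -> str:
    """Step 1: a small classification prompt labels the query 'easy' or 'hard'.
    This step gets its own evaluation (routing accuracy against the gold set)."""
    raise NotImplementedError("small classification prompt goes here")

ROUTES: dict[str, Callable[[str], str]] = {
    "easy": invoke_small_model,   # small model for easy queries
    "hard": invoke_large_model,   # large model reserved for hard queries
}

def answer(query: str) -> str:
    label = classify_difficulty(query)
    # Step 2 gets its own evaluation too (answer quality per route).
    return ROUTES.get(label, invoke_large_model)(query)
```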

Seven Habits of Highly Effective Generative AI Evaluations

Successfully scaled GenAI workloads consistently exhibit the following seven habits in their evaluation frameworks [00:15:35]:

  1. Fast [00:15:50]:

    • Iterative improvement is key; evaluations should enable rapid iteration, ideally within seconds [00:16:19].
    • A target of 30 seconds for a full evaluation run is recommended, achieved by parallelizing generation (100 test cases in roughly 10 seconds), judging (100 judges in roughly 10 seconds), and summarizing the results (a parallelized sketch appears after this list) [00:16:51].
    • For speed, a GenAI model is often used as a “judge” to evaluate the outputs [00:16:58].
  2. Quantifiable [00:18:21]:

    • Effective frameworks produce numerical scores, even if there’s slight “jitter” in results [00:18:24].
    • This jitter is mitigated by using a large set of test cases and averaging the scores across them [00:18:54].
  3. Explainable [00:20:09]:

    • Beyond just the output, understand the reasoning behind the generation [00:20:11].
    • Also ensure the judge’s reasoning is clear, typically by engineering a scoring rubric and asking the judge model to explain its score [00:20:20]. This also verifies that the judge prompt itself is engineered correctly.
  4. Segmented [00:21:24]:

    • Most scaled GenAI workloads involve multiple steps, not a single prompt [00:21:28].
    • Each step should be evaluated individually [00:21:36], allowing for selection of the most appropriate (and often smallest, most cost-effective) model for that specific step [00:21:50].
  5. Diverse [00:22:10]:

    • Evaluation sets must cover all intended use cases, including core cases (many examples) and edge cases (fewer examples) [00:22:10].
    • Include questions that fall outside the desired scope to measure the model’s ability to redirect such queries [00:19:58].
  6. Traditional [00:22:30]:
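
A minimal sketch of habits 1 through 3 (fast, quantifiable, explainable) working together: test cases are generated and judged in parallel, each judgment carries a numeric score plus a written explanation, and jitter is smoothed by averaging across the set. `generate_answer` and `judge_answer` are hypothetical stand-ins for the prompts sketched earlier.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical hooks: generation returns {"answer": ..., "reasoning": ...};
# the judge returns {"score": <number>, "explanation": <string>} per its rubric.
def generate_answer(case: dict) -> dict:
    raise NotImplementedError

def judge_answer(case: dict, generated: dict) -> dict:
    raise NotImplementedError

def run_eval(gold_standard: list[dict], workers: int = 100) -> dict:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        generated = list(pool.map(generate_answer, gold_standard))        # ~10 s in parallel
        judged = list(pool.map(judge_answer, gold_standard, generated))   # ~10 s in parallel
    # Averaging across many test cases damps the jitter of individual judgments.
    return {
        "mean_score": mean(j["score"] for j in judged),
        "explanations": [j["explanation"] for j in judged],
    }
```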

The Gold Standard Set

The foundation of an effective evaluation framework is the gold standard set [00:23:31]. This set dictates the entire system’s design and purpose, so investing time in building a high-quality, error-free gold standard is crucial [00:23:38].

Avoid GenAI for Gold Standard Creation

Using GenAI to create the gold standard set is generally a poor practice, as it can embed the AI’s own errors into the benchmark, leading to a system that perpetuates those same errors [00:24:00]. While GenAI can generate a “silver standard” (a preliminary guess), it must always be reviewed and confirmed for accuracy by a human [00:24:16].
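
A minimal sketch of the silver-standard idea: GenAI drafts candidate answers, but every row is written out unverified and is only promoted to the gold standard after a human confirms it. The CSV columns and the `draft_answer` helper are illustrative assumptions.

```python
import csv

def draft_answer(question: str) -> str:
    """Hypothetical GenAI call that produces a first-guess ('silver') answer."""
    raise NotImplementedError

def build_silver_standard(questions: list[str], path: str = "silver_standard.csv") -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "draft_answer", "human_verified"])
        writer.writeheader()
        for q in questions:
            # Drafts start unverified; a reviewer confirms (or corrects) the answer
            # before the row is allowed into the gold standard.
            writer.writerow({"question": q, "draft_answer": draft_answer(q), "human_verified": False})
```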

The Evaluation Process Visualized

An evaluation process typically flows as follows (a code sketch of this flow appears after the list) [00:23:28]:

  1. Input from Gold Standard Set: A specific query or scenario is taken from the pre-defined, human-verified gold standard [00:24:26].
  2. Prompt Template & LLM: This input is fed into the GenAI prompt template and processed by the Large Language Model (LLM) [00:24:28].
  3. Generated Output: The LLM produces an output, which should ideally include both the answer and the reasoning behind it [00:24:33].
  4. Judge Prompt: The generated output is compared against the correct answer from the gold standard, often by another GenAI model acting as a “judge” [00:24:38].
  5. Judge Output: The judge generates a numerical score and the reasoning behind that score [00:24:45].
  6. Categorization & Summary: The evaluation results are categorized (often based on categories pre-defined in the gold standard) to provide a summary of correct and incorrect answers, highlighting trends and areas for improvement [00:24:54].
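
A minimal sketch of the six steps above as a single loop. The gold-standard schema, category handling, and LLM hooks are assumptions for illustration, not the talk’s exact implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GoldCase:
    query: str        # step 1: input from the human-verified gold standard
    expected: str     # correct answer used by the judge
    category: str     # pre-defined category used for the summary

# Hypothetical hooks for steps 2-5.
def generate(query: str) -> dict:
    """Steps 2-3: prompt template + LLM, returning {'answer': ..., 'reasoning': ...}."""
    raise NotImplementedError

def judge(expected: str, generated: dict) -> dict:
    """Steps 4-5: judge prompt comparing the output to the gold answer,
    returning {'score': <number>, 'explanation': <string>}."""
    raise NotImplementedError

def evaluate(gold_set: list[GoldCase]) -> dict:
    by_category: dict[str, list[float]] = defaultdict(list)
    for case in gold_set:                          # step 1
        generated = generate(case.query)           # steps 2-3
        verdict = judge(case.expected, generated)  # steps 4-5
        by_category[case.category].append(verdict["score"])
    # Step 6: the categorized summary highlights where errors cluster.
    return {cat: sum(scores) / len(scores) for cat, scores in by_category.items()}
```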