Prompt decomposition is a technique in which a large, complex prompt is broken down into a series of smaller, chained prompts [00:12:59]. While not exclusive to evaluations, the method is most often applied in that context [00:11:26].
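As a rough sketch of the idea, the single large prompt below is replaced by a chain of focused calls. The `call_llm` helper and all prompt text are illustrative assumptions, not a specific API:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model API; returns a canned reply here."""
    return f"<model reply to: {prompt[:40]}...>"

# Before: one large prompt that extracts data, applies logic, and writes prose.
monolithic_summary = call_llm(
    "Given this sensor data, extract the readings, decide whether it is windy, "
    "and write a local weather summary: <sensor data>"
)

# After: a chain of smaller prompts, each responsible for one task, so an
# evaluation can be attached to each link in the chain.
readings = call_llm("Extract temperature and wind speed from this sensor data: <sensor data>")
wind_note = call_llm(f"Given these readings, describe the wind conditions: {readings}")
summary = call_llm(f"Write a short local weather summary from:\n{readings}\n{wind_note}")
```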
Why Use Prompt Decomposition?
The primary challenge with evaluating large, complex prompts is that an evaluation can typically only be attached to the entire prompt, providing an overall score without pinpointing specific issues [00:11:32], [00:11:51]. This makes it difficult to understand where errors originate within a complex sequence of instructions [00:11:54].
By breaking down a prompt:
- Pinpoint Errors: You can attach an evaluation to each section of the prompt, allowing you to identify which specific parts are performing well and which are not, and therefore where to focus improvement efforts [00:13:05], [00:13:11] (see the sketch after this list).
- Optimize Tool Selection: It helps determine if generative AI is even the most appropriate tool for a particular part of the prompt [00:13:18].
- Improve Accuracy: By removing “dead space” or “dead tokens” (unnecessary instructions for a given sub-task), evaluations often show a significant increase in accuracy [00:14:54], [00:15:20]. This also reduces cost and opportunities for the model to get confused [00:15:14], [00:15:17].
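A minimal sketch of the per-step scoring this enables, with invented step names, scores, and threshold; the point is that a weak link stands out instead of dissolving into one aggregate number:

```python
# Per-step scores from separate eval sets (all numbers invented for illustration).
step_scores = {
    "extract_readings": 0.99,
    "wind_comparison":  0.97,   # a 2-3% failure rate surfaces here, not in an aggregate
    "write_summary":    0.99,
}

chain_accuracy = 1.0
for step, score in step_scores.items():
    chain_accuracy *= score     # independent per-step errors compound across the chain
    if score < 0.99:            # illustrative bar for "needs attention"
        print(f"focus improvement effort on: {step} ({score:.0%})")

print(f"approximate whole-chain accuracy: {chain_accuracy:.1%}")
```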
Practical Example: Weather Summary Workload
Consider a meteorology company that used a single, large prompt to create summaries of local weather based on sensor data [00:12:07], [00:12:12]. This prompt contained various instructions, including conditional logic to determine windiness (e.g., if the wind speed is less than 5, it is not very windy; if it is more than 5, it is windy) [00:12:16], [00:12:21].
While this worked well at the proof-of-concept (POC) stage, at scale the generative AI model (Claude) processed the mathematical comparison incorrectly about 2-3% of the time, producing errors like “the wind speed is seven, seven is less than five, so it’s not windy” [00:12:35].
The solution involved a series of prompt decompositions [00:12:50]. For the mathematical comparison, a Python script was integrated into the chaining steps instead of relying on the generative AI [00:13:38]. Python is “perfectly accurate” for such comparisons, so generative AI is not needed there [00:13:30]. This change brought the accuracy of that specific step to 100% [00:13:42].
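A minimal sketch of that fix, assuming the threshold of 5 from the example; the `call_llm` helper and prompt wording are hypothetical:

```python
def is_windy(wind_speed: float, threshold: float = 5.0) -> bool:
    """Deterministic comparison: plain Python is perfectly accurate here."""
    return wind_speed > threshold

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model API used in the chain."""
    return f"<model reply to: {prompt[:40]}...>"

# The numeric decision is computed first, then handed to the model as a fact,
# so it can no longer conclude that "seven is less than five".
wind_speed = 7.0
conditions = "windy" if is_windy(wind_speed) else "not very windy"
summary = call_llm(
    f"Write a local weather summary. The wind speed is {wind_speed} "
    f"and it is {conditions}."
)
```

Moving the comparison out of the prompt also deletes the windiness instructions from the prompt text itself, which is the same "dead token" removal described above.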
Segmented Evaluations
Evaluations within a prompt decomposition framework are described as “segmented” [00:21:24]. This means that each individual step in a multi-step workload is evaluated separately [00:21:36]. This is particularly useful because:
- Model Selection: It allows you to evaluate which model is most appropriate for each specific step [00:21:47], [00:21:50]. For instance, a smaller, faster model like Nova Micro might be suitable for simpler tasks, such as semantic routing, that only require a numeric output [00:21:53], [00:21:58] (see the sketch after this list).
- Cost Efficiency: By proving which minimal model is effective for each step, costs can be optimized [00:22:01].
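A sketch of what segmented model selection might look like, with invented model names, costs, and scores standing in for real evaluation results:

```python
# Candidate models with assumed per-call costs (ids and numbers are illustrative).
CANDIDATES = [
    ("nova-micro", 0.000035),
    ("nova-lite",  0.00006),
    ("large-model", 0.003),
]

def evaluate(model_id: str, step: str) -> float:
    """Placeholder: would run the step's eval set against this model."""
    fake_scores = {("nova-micro", "semantic_routing"): 0.97}
    return fake_scores.get((model_id, step), 0.99)

def pick_model(step: str, accuracy_bar: float = 0.95) -> str:
    """Return the cheapest candidate proven good enough for this step."""
    for model_id, _cost in sorted(CANDIDATES, key=lambda m: m[1]):
        if evaluate(model_id, step) >= accuracy_bar:
            return model_id
    return CANDIDATES[-1][0]    # fall back to the largest model

print(pick_model("semantic_routing"))  # -> nova-micro: a numeric-output task suits it
```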
Semantic Routing
Semantic routing is a common pattern that exemplifies prompt decomposition [00:14:02]. In this pattern, an incoming query or input is first routed based on the type of task it represents [00:14:06]. An easy task might be directed to a small model, while a difficult task goes to a larger model [00:14:11], [00:14:14]. Attaching evaluations to each step in this process allows for precise measurement of accuracy and efficiency [00:14:34].
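A minimal routing sketch under those assumptions; the model IDs, the numeric-label protocol, and `call_llm` are all hypothetical:

```python
def call_llm(model_id: str, prompt: str) -> str:
    """Hypothetical stand-in for a model API addressed by model id."""
    return f"<{model_id} reply>"

def route(query: str) -> str:
    """Step 1: a small, fast model classifies task difficulty. Asking for a
    bare numeric label keeps this step cheap and easy to evaluate."""
    label = call_llm("small-model", f"Reply only with 1 (easy) or 2 (hard): {query}")
    return "small-model" if label.strip() == "1" else "large-model"

def answer(query: str) -> str:
    """Step 2: the chosen model handles the query. Each step can carry its own
    evaluation: routing accuracy for step 1, answer quality for step 2."""
    return call_llm(route(query), query)

print(answer("What's the weather like today?"))
```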