From: aidotengineer
Companies that adopt a test-driven development approach build more reliable, robust AI systems for production, from simple workflows to advanced agentic ones [00:00:18]. Success for an AI product in production is not just about the models, but about how the system is built around them [00:04:15].
The Evolution of AI and the Need for Evaluation
In 2023, many companies built AI wrappers, and critics argued these products were not defensible [00:00:44]. However, companies like Cursor AI, an AI-powered IDE, achieved significant growth, demonstrating the impact of AI adoption [00:00:52]. This was partly because models were improving at coding and coding was an obvious target for disruption [00:01:11]. More importantly, new techniques and patterns were developed to orchestrate models to work better, sync with data, and perform effectively in production [00:01:26].
These techniques are crucial due to clear limits in model performance, such as hallucinations and overfitting [00:01:38]. While model providers improved tooling, significant leaps in model capabilities, similar to the jump from GPT-3.5 to GPT-4, started to slow down [00:01:48]. Models began reaching limits on existing tests, despite more data [00:02:01].
Recently, new training methods like self-reinforced learning, which trains models without labeled data, have pushed the field forward [00:02:32]. Reasoning models now use Chain of Thought thinking at inference time, allowing them to “think” before answering and solve complex reasoning problems [00:02:58]. Model providers are also adding capabilities like tool use, research capabilities, and near-perfect OCR accuracy [00:03:24].
However, traditional benchmarks are saturated, leading to the introduction of new ones to capture the performance of these reasoning models, especially for truly difficult tasks [00:03:41].
Test-Driven Development for Reliable AI Products
The best AI teams follow a structured, test-driven approach: they experiment, evaluate at scale, deploy in production, and continuously monitor, observe, and improve their product [00:05:22]. This process is even more critical with agentic workflows [00:13:34].
Stages of Evaluation
1. Experimentation (Prototyping)
Before building anything production-grade, extensive experimentation is needed to prove whether AI models can solve a given use case [00:06:12].
- Try different prompting techniques: Explore few-shot or Chain of Thought for simple or complex reasoning tasks [00:06:23].
- Test various techniques: Prompt chaining, which splits instructions across multiple prompts, can be effective; so can agentic workflows like ReAct, which plan, reason, and refine their answers [00:06:33]. A minimal prompt-chaining sketch follows this list.
- Involve domain experts: Engineers should not be the sole prompt tweakers; involving domain experts saves engineering time and validates the approach [00:06:53].
- Stay model-agnostic: Incorporate and test different models to find which performs best for the specific use case [00:07:13].
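As a concrete illustration of prompt chaining, here is a minimal sketch assuming an OpenAI-compatible chat API; the model name, prompts, and `call_llm` helper are illustrative choices, not part of the talk. Keeping the model a parameter is also one simple way to stay model-agnostic.

```python
# Minimal prompt-chaining sketch: split one instruction into two focused
# prompts so each step can be evaluated in isolation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Keeping the model a parameter makes it easy to swap providers later.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def chained_summary(document: str) -> str:
    # Step 1: extract only the key claims.
    claims = call_llm(f"List the key factual claims in this document:\n\n{document}")
    # Step 2: summarize from the extracted claims rather than the raw text.
    return call_llm(f"Write a three-sentence summary of these claims:\n\n{claims}")
```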
2. Evaluation (At Scale)
Once proof of concept is established, the next stage involves evaluating at scale to ensure production readiness for hundreds or millions of requests per minute [00:07:44].
- Create a dataset: Build a dataset of hundreds of examples to test models and workflows against [00:07:57].
- Balance quality, cost, latency, and privacy: Define priorities early, as trade-offs are inevitable. For example, high quality might sacrifice speed, or cost-critical applications might use lighter, cheaper models [00:08:06].
- Use ground truth data: Having subject matter experts design ground-truth datasets and test models against them is very useful [00:08:32]. Synthetic benchmarks help but may not fully evaluate performance for specific use cases [00:08:46].
- Utilize LLMs for evaluation: If ground truth data is unavailable, an LLM can reliably evaluate another model’s response [00:08:58]; see the judge sketch after this list.
- Flexible testing framework: The framework should be dynamic to capture non-deterministic responses, allow custom metric definitions (e.g., using Python or TypeScript), and avoid strict limitations [00:09:14].
- Run evaluations at every stage: Implement guard rails to check internal nodes and ensure correct responses at every step of the workflow [00:09:48]. Evaluations should be run during prototyping and revisited with real data after deployment [00:10:03].
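A minimal sketch of LLM-as-judge evaluation over a small dataset, reusing the hypothetical `call_llm` helper from the experimentation sketch; the rubric prompt, scoring scale, and dataset contents are assumptions for illustration, not a prescribed standard.

```python
# LLM-as-judge sketch: grade generated answers against a small ground-truth
# dataset and aggregate the scores into one metric per prompt/model version.
import json

dataset = [  # in practice: hundreds of examples curated with domain experts
    {"input": "What is the refund window?", "expected": "30 days from delivery."},
]

JUDGE_PROMPT = (
    "You are grading an AI answer.\n"
    "Question: {question}\nExpected: {expected}\nAnswer: {answer}\n"
    'Respond with JSON only, e.g. {{"score": 0.0, "reason": "..."}}, score in [0, 1].'
)


def judge(question: str, expected: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, expected=expected, answer=answer))
    return json.loads(raw)  # a robust framework would validate or repair this output


def run_eval(generate) -> float:
    scores = [
        judge(row["input"], row["expected"], generate(row["input"]))["score"]
        for row in dataset
    ]
    return sum(scores) / len(scores)
```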
3. Deployment (Production)
Once satisfied with evaluation, the product can be deployed [00:10:15].
- Monitor beyond deterministic outputs: Log all LLM calls, track inputs, outputs, and latency, as AI models are unpredictable [00:10:35]. This is especially important for agentic workflows which can take different paths and make decisions [00:10:56].
- Handle API reliability: Maintain stability in API calls with retries and fallback logic to prevent outages [00:11:09]; a sketch follows this list.
- Version control and staging: Always deploy in controlled environments before wider public rollout to prevent regressions when updating prompts or other workflow parts [00:11:35]. Decouple AI feature deployments from overall app deployment schedules, as AI features may need more frequent updates [00:12:00].
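A minimal sketch of the retry-and-fallback pattern combined with basic logging of inputs, outputs, and latency, again reusing the hypothetical `call_llm` helper; the model names, retry counts, and backoff schedule are placeholder choices.

```python
# Retry-and-fallback sketch with basic logging of model, latency, and prompt.
import logging
import time

logger = logging.getLogger("llm")


def call_with_fallback(prompt: str, models=("gpt-4o", "gpt-4o-mini"), retries: int = 3) -> str:
    last_error = None
    for model in models:  # fall back to the next model if one keeps failing
        for attempt in range(retries):
            start = time.time()
            try:
                answer = call_llm(prompt, model=model)
                logger.info("model=%s latency=%.2fs prompt=%r", model, time.time() - start, prompt[:80])
                return answer
            except Exception as exc:  # rate limits, timeouts, provider outages
                last_error = exc
                logger.warning("model=%s attempt=%d failed: %s", model, attempt + 1, exc)
                time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("All models and retries exhausted") from last_error
```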
4. Continuous Improvement
After deployment, capture user responses to create a feedback loop for identifying edge cases and continuously improving the workflow [00:12:26].
- Re-run evaluations: Use captured production data to re-evaluate and test new prompts that address emerging cases [00:12:38].
- Build a caching layer: For repeat queries, caching can drastically reduce costs and improve latency by storing frequent responses and serving them instantly [00:12:47] (sketched after this list).
- Fine-tune custom models: Over time, use production data to fine-tune custom models, which can create better responses for specific use cases, reduce reliance on API calls, and lower costs [00:13:16].
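A minimal sketch of an exact-match caching layer in front of the hypothetical `call_llm` helper; production systems often use semantic (embedding-based) caches instead, which this sketch omits.

```python
# Exact-match caching sketch: repeated queries are served from memory instead
# of triggering a new model call.
import hashlib

_cache: dict[str, str] = {}


def cached_call(prompt: str, model: str = "gpt-4o-mini") -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:  # only the first occurrence pays the cost and latency
        _cache[key] = call_llm(prompt, model=model)
    return _cache[key]
```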
Evaluating AI Agents
When it comes to agentic workflows, evaluation is not just about measuring performance at every step, but also about assessing the agents’ behavior to ensure they make correct decisions and follow intended logic [00:13:53].
A framework for agentic behavior defines different levels based on control, reasoning, and autonomy:
- L0: Simple LLM Call + RAG: An LLM call retrieves data (e.g., from a vector database) with inline evaluations. Reasoning is primarily within the prompt and model behavior, with no external agent organizing decisions [00:15:19].
- L1: Tool-Using Agent: The AI system uses tools, deciding when to call APIs or retrieve data from a vector database. Memory plays a key role for multi-threaded conversations, and evaluation is needed at every step to ensure correct decisions and accurate responses when tools are used [00:15:52]. A minimal sketch of this pattern follows the list.
- L2: Structured Reasoning Agent: Workflows move beyond simple tool use to structured reasoning. They notice triggers, plan actions, and execute tasks in a structured sequence, breaking down tasks into multiple steps, retrieving information, calling tools, and refining as needed. The agent’s behavior is more intentional, actively deciding what needs to be done [00:17:12]. This process is finite; the workflow terminates after completing its planned steps [00:18:16].
- L3: Proactive & Autonomous Agent: These systems proactively take actions without direct input, continuously monitoring their environment and reacting as needed. They can interact with external services (email, Slack, Google Drive) and plan next moves, either executing actions or requesting human input. They act less as tools and more as independent systems [00:18:33].
- L4: Fully Creative/Inventor Agent: The AI moves beyond automation and reasoning to invent, creating its own new workflows, utilities (agents, prompts, function calls, tools), and solving problems in novel ways. This level is currently out of reach due to model constraints like overfitting and inductive bias [00:19:38].
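To make the L1 pattern concrete, here is a hand-rolled sketch of a tool-using agent with a step-level check, reusing the hypothetical `call_llm` helper; the decision format, `retrieve` stub, and guardrail prompt are illustrative rather than taken from the talk.

```python
# Hand-rolled L1 sketch: the model decides whether to call a retrieval tool,
# and an inline check evaluates the answer before it is returned.
def retrieve(query: str) -> str:
    # stand-in for a vector-database lookup
    return "...retrieved passages..."


def l1_agent(question: str) -> str:
    decision = call_llm(
        "If you need documents to answer, reply TOOL:<search query>. "
        f"Otherwise reply ANSWER:<your answer>.\nQuestion: {question}"
    )
    if decision.startswith("TOOL:"):
        context = retrieve(decision[len("TOOL:"):].strip())
        answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}")
    else:
        answer = decision.removeprefix("ANSWER:").strip()
    # step-level guardrail: check the response before handing it back
    check = call_llm(f"Does this answer address the question? Reply YES or NO.\nQ: {question}\nA: {answer}")
    if not check.strip().upper().startswith("YES"):
        answer = call_llm(f"Rewrite the answer so it directly addresses the question.\nQ: {question}\nA: {answer}")
    return answer
```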
Currently, many production-grade solutions sit at L1, focusing on orchestrating models to interact better with systems and data [00:20:41]. Significant innovation is expected at L2 this year, with AI agents developing to plan and reason for complex tasks [00:21:41]. L3 and L4 are still limited by current models and the surrounding logic [00:22:22].
Practical Example: An SEO Agent
An SEO agent demonstrates evaluation in practice by automating keyword research, content analysis, and creation [00:23:04]. This workflow operates between L1 and L2 [00:23:37]:
- SEO Analyst/Researcher: Takes a keyword and other parameters (writing style, audience), calls Google Search, and analyzes top-performing articles [00:23:43]. It identifies strengths to amplify and missing segments or areas for improvement [00:23:53]. The researcher then uses these identified missing pieces to perform additional searches and gather more data [00:25:50].
- Writer: Uses the research and planning information as context to create a first draft [00:26:02].
- Editor (LLM-based Judge): Evaluates the first draft against predefined rules set in its prompt; the feedback is passed back to the writer through a memory component (chat history) [00:26:19]. This forms a loop that continues until the criteria are met, ensuring a useful, impressive first draft [00:26:33]. The agent thus features an embedded evaluator that tells it whether it is performing well [00:14:03], with an LLM judge standing in for human review of the outputs. A sketch of this loop appears below.
This iterative process of analysis, writing, and LLM-based evaluation allows for the creation of high-quality content that leverages comprehensive context [00:24:49].
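A minimal sketch of the writer-editor loop described above, reusing the hypothetical `call_llm` helper; the editor prompt, approval signal, and round limit are assumptions meant to illustrate the finite feedback loop through shared chat history.

```python
# Writer-editor loop sketch: the editor (an LLM judge) reviews each draft and
# its feedback is appended to a shared chat history until it approves or the
# round limit is reached, so the loop is guaranteed to terminate.
MAX_ROUNDS = 3


def write_article(research: str, brief: str) -> str:
    history = [f"Brief: {brief}", f"Research:\n{research}"]
    draft = call_llm("\n\n".join(history) + "\n\nWrite the first draft of the article.")
    for _ in range(MAX_ROUNDS):
        feedback = call_llm(
            f"You are an editor. Review this draft against the brief and research. "
            f"Reply APPROVED if it meets the bar, otherwise list concrete fixes.\n\nDraft:\n{draft}"
        )
        if feedback.strip().upper().startswith("APPROVED"):
            break
        history.append(f"Editor feedback: {feedback}")
        draft = call_llm("\n\n".join(history) + "\n\nRevise the draft to address the latest feedback.")
    return draft
```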