From: aidotengineer
Companies that have successfully deployed reliable AI solutions in production have adopted a test-driven development approach [00:19:41]. This method enables the creation of stronger, more reliable systems [00:22:05].
Evolution of AI and the Need for TDD
While models have improved significantly, particularly in coding [01:11:00], and AI adoption has skyrocketed [01:14:00], there are clear limits to model performance, including hallucinations [01:41:00] and overfitting [01:43:00]. Major advancements in model capabilities, like the jump between GPT-3.5 and GPT-4, have slowed down [01:57:00]. Models began to hit limits on existing tasks despite more data [02:14:00].
New training methods, such as real reinforcement learning (e.g., DeepSeek R1) [02:42:00], and the use of Chain of Thought thinking in reasoning models (like o1 and o3) [03:03:00], have pushed the field forward, allowing models to solve more complex reasoning problems [03:20:00]. Models are also gaining more capabilities like tool use and improved OCR accuracy [03:24:00].
However, success for an AI product in production depends less on the models themselves and more on how you build around them [04:15:00]. Techniques such as prompt engineering (e.g., Chain of Thought) [04:25:00], Retrieval Augmented Generation (RAG) [04:36:00], managing memory for multi-threaded conversations [04:42:00], graph RAG [04:56:00], and agentic RAG [05:12:00] are crucial. These techniques, while evolving, are not enough on their own [05:22:00]. A deep understanding of the problem and a test-driven development approach are essential to finding the right mix of techniques, models, and logic [05:27:00].
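To make the RAG pattern above concrete, here is a minimal sketch in Python. The documents, the keyword-overlap retriever, and the `call_llm` stub are illustrative placeholders; a real system would use an embedding model, a vector store, and a provider SDK.

```python
# Minimal RAG sketch: retrieve relevant context, then augment the prompt with it.
# `call_llm` is a placeholder for whichever provider SDK you actually use.

def call_llm(prompt: str) -> str:
    # Replace with a real model call (OpenAI, Anthropic, Gemini, ...).
    return f"[model answer based on a prompt of {len(prompt)} chars]"

DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am-5pm EST.",
    "Premium plans include priority support and a dedicated account manager.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    # Production systems would use embeddings and a vector store instead.
    words = set(query.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("How long do I have to return a product?"))
```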
Test-Driven Development for AI Products
The best AI teams follow a structured approach involving experimentation, evaluation, deployment, and continuous monitoring [05:48:00].
1. Experimentation
Before building anything production-grade, extensive experimentation is needed to prove if AI models can solve the use case [06:14:00].
- Try different prompting techniques [06:23:00]:
  - Few-shot prompting.
  - Chain of Thought for more complex reasoning [06:27:00].
- Test various techniques [06:33:00]:
  - Prompt chaining, splitting instructions into multiple prompts [06:38:00] (a minimal chaining sketch follows this list).
  - Agentic workflows like ReAct, which involve planning, reasoning, and refining [06:41:00].
- Involve domain experts [06:53:00]: Engineers should not be the only ones tweaking prompts [06:57:00]; involving domain experts saves engineering time [07:01:00].
- Stay model agnostic [07:13:00]: Incorporate and test different models based on the use case, e.g., Gemini 2.0 Flash for OCR [07:17:00].
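As referenced above, here is a minimal prompt-chaining sketch: one narrow prompt extracts facts, a second prompt writes from them. The `call_llm` stub, the prompt wording, and the sample document are assumptions for illustration.

```python
# Prompt chaining sketch: split one large instruction into two focused prompts,
# feeding the first model's output into the second.
# `call_llm` is a placeholder for the provider SDK under test.

def call_llm(prompt: str) -> str:
    # Replace with a real model call for whichever model you are experimenting with.
    return f"[model output for: {prompt[:40]}...]"

def extract_key_points(document: str) -> str:
    # Step 1: a narrow prompt that only extracts facts.
    return call_llm(f"List the key facts in this document as bullet points:\n{document}")

def write_summary(key_points: str) -> str:
    # Step 2: a second prompt that only writes prose from the extracted facts.
    return call_llm(f"Write a two-sentence summary based on these facts:\n{key_points}")

document = "Q3 revenue grew 12% year over year, driven by the new enterprise tier."
print(write_summary(extract_key_points(document)))
```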
2. Evaluation
Once initial proof of concept is established, evaluation is crucial for production readiness, especially when dealing with high request volumes [07:44:00].
- Create a data set of hundreds of examples for testing workflows [07:57:00].
- Balance quality, cost, latency, and privacy [08:06:00]. Define priorities early, as no AI system perfects all aspects [08:10:00].
- Use ground truth data [08:32:00]: Having subject matter experts design ground-truth test sets and evaluating models against them is very useful [08:38:00].
- Utilize LLMs for evaluation [09:00:00]: Even without ground truth, an LLM can reliably evaluate another model’s response [09:04:00] (a minimal judge sketch follows this list).
- Use a flexible testing framework [09:14:00]: AI systems are dynamic, so evaluation workflows must be dynamic as well, capable of capturing non-deterministic responses, defining custom metrics, and writing those metrics in Python or TypeScript [09:23:00].
- Run evaluations at every stage [09:48:00]: Implement guardrails to check internal nodes and ensure correct responses at each step [09:50:00]. Evaluate during prototyping and use real data later [10:05:00].
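A rough sketch of the LLM-as-judge idea with a custom pass-rate metric, assuming a stubbed `call_llm` and a hypothetical two-case test set:

```python
# LLM-as-judge evaluation sketch: score each workflow output against a rubric
# and aggregate a custom pass-rate metric. All names here are illustrative.

def call_llm(prompt: str) -> str:
    # Placeholder judge/workflow model; swap in a real provider call.
    return "PASS"  # a real judge would return PASS/FAIL plus a justification

TEST_CASES = [
    {"input": "Summarize our refund policy.", "expected_topic": "refunds"},
    {"input": "What are your support hours?", "expected_topic": "support hours"},
]

def run_workflow(user_input: str) -> str:
    # Stand-in for the workflow under test (prompt chain, agent, etc.).
    return call_llm(user_input)

def judge(output: str, case: dict) -> bool:
    verdict = call_llm(
        "You are an evaluator. Reply PASS if the answer covers the topic "
        f"'{case['expected_topic']}', otherwise reply FAIL.\n\nAnswer:\n{output}"
    )
    return verdict.strip().upper().startswith("PASS")

results = [judge(run_workflow(case["input"]), case) for case in TEST_CASES]
print(f"pass rate: {sum(results)}/{len(results)}")
```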
3. Deployment
After extensive evaluation, deploying to production requires specific considerations [10:25:00].
- Monitor beyond deterministic outputs [10:35:00]: Log all LLM calls, track inputs, outputs, and latency to understand and debug unpredictable AI behavior [10:39:00]. This is especially critical for agentic workflows due to their complexity and decision-making paths [10:56:00].
- Handle API reliability [11:09:00]: Maintain stability with retries and fallback logic to prevent outages [11:12:00]. For example, during an OpenAI outage, a fallback model could be used [11:19:00] (a retry-and-fallback sketch follows this list).
- Use version control and staging [11:35:00]: Deploy in controlled environments first to avoid regressions when updating prompts or workflows [11:37:00].
- Decouple AI feature deployments from scheduled app deployments [12:00:00], as AI features often need more frequent updates [12:11:00].
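A minimal sketch of the retry-and-fallback idea, assuming a placeholder `call_model` and hypothetical primary/fallback model names:

```python
# Retry-and-fallback sketch for API reliability: retry the primary model with
# exponential backoff, then fall back to a secondary model if it stays down.
# `call_model` and the model names below are placeholders, not a specific SDK.

import time

class ModelUnavailable(Exception):
    """Raised by `call_model` when the provider is down or times out."""

def call_model(model: str, prompt: str) -> str:
    # Replace with real provider calls; raise ModelUnavailable on outages.
    return f"[{model} response]"

def generate(prompt: str, primary: str = "primary-model",
             fallback: str = "fallback-model", retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return call_model(primary, prompt)
        except ModelUnavailable:
            time.sleep(2 ** attempt)  # exponential backoff between retries
    # Primary is still unavailable after all retries: switch to the fallback.
    return call_model(fallback, prompt)

print(generate("Draft a status update for the weekly report."))
```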
4. Continuous Improvement
Deployment is not the end; continuous monitoring and feedback loops are vital for improvement [12:26:00].
- Capture user responses to identify edge cases in production [00:26:00], then run evaluations again with new prompts to address them [12:38:00].
- Build a caching layer [12:46:00]: For repeat queries, caching drastically reduces costs and improves latency by storing and serving frequent responses instantly [12:50:00] (a caching sketch follows this list).
- Fine-tune custom models [13:16:00]: Use accumulated production data to fine-tune models for specific use cases, reducing reliance on API calls and lowering costs [13:19:00].
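A minimal caching sketch for repeat queries, assuming exact-match keys and an in-memory store; production systems might add TTLs, persistence, or semantic (embedding-based) matching:

```python
# Caching sketch: serve identical requests from memory instead of re-calling
# the model, cutting cost and latency for repeat queries.

import hashlib

def call_llm(prompt: str) -> str:
    # Placeholder for the real model call.
    return f"[fresh model response for: {prompt[:30]}...]"

_cache: dict[str, str] = {}

def cached_llm(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # cache miss: pay for one model call
    return _cache[key]                  # cache hit: instant, no API cost

cached_llm("What is your refund policy?")  # miss -> model call
cached_llm("What is your refund policy?")  # hit  -> served from cache
print(len(_cache))  # 1 entry despite two requests
```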
Levels of Agentic Behavior
Every AI workflow has some level of agentic behavior, differing in control, reasoning, and autonomy [14:40:00]. The following framework defines those levels:
- L0: Basic LLM Call [15:19:00]: An LLM call retrieves data, with inline evaluations, and produces a response. Reasoning and decision-making are entirely within the model’s prompt and behavior, with no external agent organizing actions [15:30:00].
- L1: Tool Use [15:52:00]: The AI system can now use various tools and decide when to call them [15:59:00]. This introduces more agentic behavior as the model chooses specific tools or retrieves more data [16:10:00]. Memory becomes crucial for multi-threaded conversations, and evaluation is needed at every step to ensure correct decisions and accurate responses [16:25:00] (a minimal tool-use sketch follows this list).
  - Many production-grade solutions currently fall within the L1 segment [20:41:00]. The focus here is on orchestrating models to interact with systems and data effectively [21:01:00].
- L2: Structured Reasoning [17:12:00]: Workflows move beyond simple tool use to structured reasoning [17:26:00]. The system notices triggers, plans actions, and executes tasks in a structured sequence [17:28:00]. It can break down tasks, retrieve information, call other tools, and refine its output in a continuous loop [17:37:00]. Agentic behavior becomes more intentional, as the system actively decides what needs to be done and spends more time thinking [17:55:00]. The process is still finite, terminating after completing its planned steps [18:16:00].
  - This year is expected to see significant innovation in L2, with AI agents being developed for planning and reasoning using models like o1, o3, or DeepSeek R1 [21:41:00].
- L3: Autonomy [18:33:00]: The system proactively takes actions without waiting for direct input [18:45:00]. Instead of terminating after a single request, it continuously monitors its environment and reacts as needed [18:52:00]. It can access external services like email, Slack, or Google Drive, plan next moves, and execute actions or ask for human input [19:02:00]. AI workflows become independent systems rather than mere tools [19:19:00].
- L4: Fully Creative [19:38:00]: The AI moves beyond automation and reasoning to become an inventor [19:41:00]. It can create its own new workflows, utilities (agents, prompts, function calls, tools), and solve problems in novel ways [19:56:00]. True L4 is currently out of reach due to constraints like overfitting and inductive bias in models [20:08:00].
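To illustrate the L1 pattern referenced above, here is a minimal tool-use sketch. The stubbed `call_llm` decision logic, the calculator tool, and the text protocol are all assumptions; real systems would use the provider's structured tool-calling interface.

```python
# L1 tool-use sketch: the model decides whether to answer directly or call a
# tool, and the orchestrating code executes that choice and feeds the result back.

def call_llm(prompt: str) -> str:
    # Placeholder model: pretends to request the calculator for pricing questions
    # and to answer directly otherwise. A real model would emit structured tool calls.
    if prompt.startswith("Tool "):
        return "ANSWER:The total cost is $59.97."
    if "cost" in prompt:
        return "TOOL:calculator:19.99*3"
    return "ANSWER:No tool needed."

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool
}

def run_l1_agent(user_input: str) -> str:
    decision = call_llm(user_input)
    if decision.startswith("TOOL:"):
        _, tool_name, tool_arg = decision.split(":", 2)
        tool_result = TOOLS[tool_name](tool_arg)
        # Feed the tool result back to the model for the final answer.
        final = call_llm(f"Tool {tool_name} returned {tool_result}. Question: {user_input}")
        return final.removeprefix("ANSWER:")
    return decision.removeprefix("ANSWER:")

print(run_l1_agent("What is the total cost of three items at $19.99?"))
```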
Case Study: SEO AI Agent
An example of an AI agent that automates the entire SEO process (keyword research, content analysis, content creation) demonstrates a workflow between L1 and L2 levels of agentic behavior [23:04:00].
Workflow Overview
The SEO agent workflow involves multiple interconnected components:
- SEO Analyst and Researcher: Takes a keyword, calls Google Search, and analyzes top-performing articles [23:43:00]. It identifies good components to amplify and missing segments for improvement [23:52:00]. The researcher then uses these identified gaps to make further searches and gather more data [25:53:00].
- Writer: Takes the information from the research phase and creates a first draft of the article [24:13:00]. The content is generated using the context from analyzed articles [26:21:00].
- Editor (LLM-based Judge): An embedded evaluator that assesses the first draft based on predefined rules [24:21:00].
- Feedback Loop and Memory: The editor’s feedback is passed back to the writer via a memory component (chat history), creating a continuous loop until certain criteria are met (e.g., article quality, number of iterations) [24:31:00] (a sketch of this loop follows below).
This workflow highlights the importance of evaluating agent behavior to ensure it makes the right decisions and follows intended logic [14:03:00]. The agent can decide whether to use tools and has an embedded evaluator [23:12:00].
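A minimal sketch of the writer/editor feedback loop described above, assuming stubbed model calls, a simple list as chat-history memory, and an iteration cap as the stopping criterion:

```python
# Writer/editor loop sketch: the editor (an LLM judge) critiques each draft,
# feedback accumulates in a shared chat-history "memory", and the loop stops
# when the draft is approved or an iteration cap is reached.

def call_llm(role: str, prompt: str) -> str:
    # Placeholder: the editor approves on the second revision so the loop ends.
    if role == "editor":
        return "APPROVED" if "revision 2" in prompt else "Add more keyword coverage."
    return f"[draft revision {prompt.count('feedback:') + 1}]"

def write_article(research_notes: str, max_iterations: int = 3) -> str:
    memory: list[str] = [f"research: {research_notes}"]  # shared chat history
    draft = ""
    for _ in range(max_iterations):
        draft = call_llm("writer", "\n".join(memory))
        verdict = call_llm("editor", f"Review this draft: {draft}")
        if verdict.strip() == "APPROVED":
            break
        # The editor's feedback flows back to the writer through memory.
        memory.append(f"feedback: {verdict}")
    return draft

print(write_article("top articles cover keyword X but miss topic Y"))
```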
Tools and Frameworks
Vellum offers workflows and an SDK designed to bridge product and engineering teams, speeding up AI development while adhering to the test-driven approach [27:51:00]. The workflow SDK provides building blocks, infinite customizability, and a self-documenting syntax, allowing developers to understand agent behavior directly from the code [28:14:00]. The UI and code remain synchronized so teams stay aligned during definition, debugging, and improvement [28:34:00]. The SDK is open source and free [28:43:00].