From: aidotengineer
In the realm of AI solutions, from simple applications to advanced agentic workflows, a key differentiator for companies successfully deploying reliable AI in production has been the adoption of a test-driven development (TDD) approach [00:00:10] [00:00:18]. This methodology enables the creation of stronger and more reliable systems for production [00:00:22].
Evolution of AI Development
Around 2023, many AI wrappers were being built, and questions arose about their defensibility strategy [00:00:44] [00:00:50]. However, the field progressed significantly, as exemplified by Cursor AI, an AI-powered IDE that reached $100 million in ARR in just 12 months [00:00:54] [00:00:59]. This rapid growth was attributed to several factors:
- Models improved at coding [00:01:11].
- AI adoption skyrocketed [00:01:14].
- Coding was an obvious target for AI disruption [00:01:17].
- Crucially, new techniques and patterns emerged for orchestrating models to work better with data and effectively in production [00:01:26].
These techniques became vital due to inherent limitations in model performance, such as hallucinations and overfitting [00:01:38]. While model providers have shipped better tooling, significant leaps in model performance, like the jump from GPT-3.5 to GPT-4, have become less frequent [00:01:47]. For years, making models bigger and feeding them more data made them smarter, but this approach hit a wall: improvements slowed down and models reached their limits on existing tasks [00:02:01].
However, new training methods have emerged, pushing the field forward [00:02:38]. For example, the DeepSeek-R1 model was trained using real reinforcement learning, meaning it learned autonomously without labeled data [00:02:42]. This method is reportedly used by OpenAI for their reasoning models (e.g., o1, o3), which employ Chain of Thought thinking at inference time to enable complex reasoning [00:02:57]. Model providers are also shipping new capabilities, such as tool usage, research features, and near-perfect OCR accuracy (e.g., Gemini 2.0 Flash) [00:03:24].
As traditional benchmarks become saturated, new ones are being introduced to capture the performance of these reasoning models, like Humanity's Last Exam, which measures performance on difficult tasks [00:03:41].
Beyond just model improvements, success in production AI products is increasingly about how systems are built around the models [00:04:10]. This has led to the evolution of several techniques:
- Prompting: Advanced techniques like Chain of Thought for better model interaction [00:04:25].
- Retrieval Augmented Generation (RAG): Grounding model responses with proprietary data (see the sketch after this list) [00:04:31].
- Memory: Essential for multi-threaded conversations and for carrying long context across interactions [00:04:42].
- Graph RAG: Experimentation with hierarchy of responses [00:04:54].
- Agentic RAG: Combining reasoning models with RAG for more powerful workflows [00:05:12].
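As a minimal sketch of the RAG technique above, the snippet below retrieves context from a placeholder vector store and grounds the model's answer in it. The `search_index` helper, the model name, and the OpenAI-style client are illustrative assumptions, not part of the original talk.

```python
from openai import OpenAI

client = OpenAI()

def search_index(query: str, k: int = 3) -> list[str]:
    # Placeholder retriever; in practice this would run a top-k similarity
    # search over proprietary documents in a vector database.
    return ["<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>"][:k]

def answer_with_rag(question: str) -> str:
    # Ground the model's answer in the retrieved context.
    context = "\n\n".join(search_index(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_with_rag("What does our refund policy say about digital goods?"))
```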
However, even with these techniques, deep understanding of the problem and a test-driven development approach are necessary to find the right mix of techniques, models, and logic for a specific use case [00:05:22].
The Test-Driven Development Approach in AI
The most effective AI teams follow a structured TDD approach [00:05:48]:
- Experiment: Explore initial concepts and proofs of concept [00:05:52].
- Evaluate: Rigorously test and refine the system [00:05:54].
- Scale: Prepare for handling larger loads [00:05:54].
- Deploy: Roll out the solution to production [00:05:57].
- Monitor, Observe, and Improve: Continuously capture responses and feedback from production to refine the product [00:06:01].
Stages of TDD in AI
1. Experimentation
Before building anything production-grade, extensive experimentation is crucial to prove that AI models can solve the specific use case [00:06:14].
- Try different prompting techniques: Explore few-shot, Chain of Thought, or prompt chaining, which splits instructions across multiple prompts for better results (see the sketch after this list) [00:06:23].
- Adopt agentic workflows: Techniques like ReAct, which involve planning, reasoning, and refining before generating an answer [00:06:41].
- Involve domain experts: Engineers should not be the sole prompt tweakers; domain experts can save significant engineering time by validating proofs of concept [00:06:53].
- Stay model agnostic: Incorporate and test different models to identify which ones perform best for specific use cases (e.g., Gemini 2.0 Flash for OCR) [00:07:13].
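To illustrate the prompt-chaining idea from the list above, here is a minimal sketch that splits one task into two sequential prompts and feeds the first output into the second. The helper function, model name, and example ticket are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

support_ticket = "My invoice was charged twice and the dashboard shows an error."

# Step 1: extract the distinct issues before asking for a reply.
issues = complete(f"List the distinct issues in this support ticket:\n{support_ticket}")

# Step 2: draft a response grounded in the extracted issues.
reply = complete(f"Write a short support reply addressing each issue:\n{issues}")
print(reply)
```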
2. Evaluation
Once a proof of concept is established, the evaluation stage involves creating a dataset of hundreds of examples to test models and workflows against [00:07:55].
- Balance quality, cost, latency, and privacy: Define priorities early, as no AI system can perfectly optimize all these factors [00:08:06]. Trade-offs are necessary; for example, high quality might sacrifice speed, or critical cost might necessitate lighter models [00:08:16].
- Use ground truth data: Subject matter experts designing datasets and testing workflows against them is highly useful for accurate evaluation [00:08:32]. Synthetic benchmarks can help but won’t fully evaluate for specific use cases [00:08:46].
- Utilize LLMs for evaluation: Even without ground truth data, an LLM can reliably evaluate another model’s response (a sketch of this LLM-as-judge pattern follows this list) [00:08:58].
- Employ a flexible testing framework: Whether in-house or external, the framework must be dynamic enough to capture non-deterministic responses, allow custom metrics written in, e.g., Python or TypeScript, and avoid rigid structures, emphasizing customizability [00:09:14].
- Run evaluations at every stage: Implement guardrails to check internal nodes and ensure models produce correct responses at every step of the workflow [00:09:48]. Test during prototyping and leverage the evaluation phase with real data [00:10:03].
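A minimal sketch of the LLM-as-judge idea mentioned above: a second model scores each response against a rubric, giving a custom metric that can be run at any stage of the workflow. The rubric wording, the 1–5 scale, and the model name are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> int:
    # A second model grades the answer against a simple rubric.
    rubric = (
        "Score the answer from 1 (poor) to 5 (excellent) for factual accuracy "
        "and completeness. Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# Evaluate a small dataset of recorded question/answer pairs.
dataset = [
    {"question": "What year was the company founded?", "answer": "2017"},
    {"question": "Which plan includes SSO?", "answer": "The Enterprise plan."},
]
scores = [judge(row["question"], row["answer"]) for row in dataset]
print(f"Average judge score: {sum(scores) / len(scores):.2f}")
```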
3. Deployment
Once workflows are extensively evaluated and satisfactory, they are ready for production deployment [00:10:15].
- Monitor non-deterministic outputs: Log all LLM calls, track inputs, outputs, and latency [00:10:35]. AI models are unpredictable, so debugging issues and understanding behavior at every step is critical, especially for complex agentic workflows that can take different paths and make autonomous decisions [00:10:46].
- Handle API reliability: Maintain stability in API calls with retries and fallback logic to prevent outages, for example if a primary model provider experiences downtime (see the sketch after this list) [00:11:09].
- Version control and staging: Always deploy in controlled environments before wider public rollouts [00:11:35]. This ensures prompt updates don’t introduce regressions or break existing production workflows [00:11:47].
- Decouple deployments: AI feature updates will likely be more frequent than overall app deployments, so independent deployment schedules are advisable [00:12:00].
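A minimal sketch of the retry-and-fallback pattern from the list above, assuming two hypothetical provider clients `call_primary` and `call_fallback`; the backoff schedule is an illustrative choice.

```python
import time

def call_primary(prompt: str) -> str:
    # Hypothetical primary provider client; raises to simulate an outage.
    raise TimeoutError("primary provider unavailable")

def call_fallback(prompt: str) -> str:
    # Hypothetical secondary provider client.
    return "response from fallback model"

def generate(prompt: str, retries: int = 3, backoff_s: float = 1.0) -> str:
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except Exception:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between retries
    return call_fallback(prompt)  # last resort: switch providers instead of failing

print(generate("Summarize today's incident report."))
```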
4. Continuous Improvement
After deployment, the focus shifts to ongoing enhancement.
- Create feedback loops: Capture user responses from production to identify edge cases for continuous improvement [00:12:26]. Re-run evaluations with this new data to test new prompts or solutions [00:12:38].
- Build a caching layer: For systems handling repeat queries, caching significantly reduces costs and improves latency by storing frequent responses and serving them instantly instead of repeatedly calling expensive LLMs (sketched after this list) [00:12:47].
- Fine-tune custom models: Once sufficient reliable production data is collected, it can be used to fine-tune custom models for better responses specific to the use case, potentially reducing reliance on API calls and lowering costs [00:13:16].
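A minimal sketch of the caching layer described above, using an in-memory dictionary keyed on a hash of the prompt; `call_llm` is a hypothetical stand-in, and a production system would more likely use a shared store such as Redis with a TTL.

```python
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an expensive LLM call.
    return f"fresh answer for: {prompt}"

def cached_completion(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay for the first occurrence of a query
    return _cache[key]

cached_completion("What are your support hours?")         # cache miss: hits the LLM
print(cached_completion("What are your support hours?"))  # cache hit: instant and free
```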
Test-Driven Development for Agentic Workflows
The TDD process becomes even more critical for agentic workflows [00:13:35] because these workflows are characterized by:
- Wide range of tool usage and API calls [00:13:41].
- Multi-agent structures executing tasks in parallel [00:13:48].
- Complex decision-making and autonomous paths [00:11:06].
For agentic workflows, evaluation is not just about measuring performance at every step, but also assessing the behavior of agents to ensure they make the right decisions and follow intended logic [00:13:53].
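A minimal sketch of what decision-level evaluation can look like: instead of only scoring final answers, the test asserts that the agent's routing step picks the intended tool for each input. `route_to_tool` and the tool names are hypothetical stand-ins for the agent's actual decision logic.

```python
def route_to_tool(user_message: str) -> str:
    # Hypothetical routing step; a real agent would let the model choose the tool.
    return "search_api" if "latest" in user_message.lower() else "vector_db"

# Each case pins down which tool the agent *should* pick, not just the final answer.
test_cases = [
    ("What are the latest GPU prices?", "search_api"),   # fresh data -> web search
    ("Summarize our onboarding policy.", "vector_db"),   # internal data -> retrieval
]

for message, expected_tool in test_cases:
    chosen = route_to_tool(message)
    assert chosen == expected_tool, f"{message!r}: expected {expected_tool}, got {chosen}"

print("All routing decisions match the intended logic.")
```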
Levels of Agentic Behavior
A framework defines different levels of agentic behavior based on their control, reasoning, and autonomy [00:14:30]:
- L0 (Basic LLM Call): An LLM call with retrieval from a vector database and inline evaluations produces a response [00:15:19]. Reasoning is baked into the prompt and model behavior; no external agent organizes decisions or plans [00:15:32].
- L1 (Tool Usage): The system can now use tools, deciding when to call APIs or retrieve data from a vector database [00:15:55]. Memory plays a key role for multi-threaded conversations, capturing context throughout the workflow [00:16:24]. Evaluation is needed at every step to ensure correct decisions and accurate responses [00:16:37]. This can range from simple to very complex workflows with multiple tools and branching logic [00:16:50].
- L2 (Structured Reasoning): Workflows move beyond simple tool use to structured reasoning [00:17:14]. The system notices triggers, plans actions, and executes tasks in a structured sequence [00:17:28]. It can break down tasks, retrieve information, call tools, evaluate their usefulness, and refine as needed in a continuous loop to generate a final output [00:17:37]. Agentic behavior is more intentional, actively deciding what needs to be done and spending more time thinking [00:17:54]. The process is still finite, terminating once steps are completed [00:18:16].
- L3 (Autonomy): Systems proactively take actions without direct input [00:18:32]. Instead of terminating after a single request, they continuously monitor their environment and react as needed [00:18:50]. They can access external services (email, Slack, Google Drive) to plan next moves or ask for human input [00:19:02]. These become independent systems that genuinely ease a person's workload [00:19:22].
- L4 (Creative/Inventor): The AI moves beyond automation and reasoning to become an inventor [00:19:39]. It can create its own new workflows, utilities (agents, prompts, function calls, tools), and solve problems in novel ways [00:19:50]. True L4 is currently out of reach due to model constraints like overfitting and inductive bias, but it represents the goal of AI that invents, improves, and solves problems in unforeseen ways [00:20:08].
Currently, many production-grade AI solutions fall within the L1 segment, focusing on orchestrating models to interact better with systems and data [00:20:41]. L2 is expected to see the most innovation in the near future, with AI agents being developed to plan and reason for complex tasks [00:21:41]. L3 and L4 are still limited by current models and surrounding logic, though innovation continues in these areas [00:22:22].
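As a concrete illustration of the L1 tier, the sketch below lets the model decide whether to call a tool, using an OpenAI-style function-calling interface; the `get_weather` tool, its schema, and the model name are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# The model may call this tool or answer directly; that decision is the "L1" part.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": "Do I need an umbrella in Amsterdam today?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call the tool
    args = json.loads(message.tool_calls[0].function.arguments)
    print("Model requested get_weather with:", args)
else:  # the model answered directly without the tool
    print(message.content)
```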
Example: SEO Agent Workflow
An example of an effective agentic workflow is an SEO agent that automates the SEO process from keyword research to content creation [00:23:04]. This agent, which sits between the L1 and L2 levels described above [00:23:37], includes:
- SEO Analyst and Researcher: Takes a keyword and calls Google Search to analyze top-performing articles [00:23:43]. It identifies strong components to amplify and missing segments for improvement [00:23:53]. The researcher then conducts further searches to gather more data [00:25:36].
- Writer: Uses the gathered information as context to create a first draft [00:26:02]. The content is designed to be useful and contextually relevant, not just generic [00:26:13].
- Editor: An LLM-based judge evaluates the first draft against predefined rules [00:24:19]. Feedback is passed back to the writer in a continuous loop until specific criteria are met (see the sketch below) [00:24:31].
- Memory Component: Captures all previous conversations between the writer and editor within the loop [00:24:40].
This process ensures a useful and impressive first draft that leverages context intelligently [00:24:50].
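A minimal sketch of the writer/editor loop described above, with hypothetical `write_draft` and `review_draft` functions standing in for LLM calls; the memory list carries the editor's feedback back to the writer until the draft is approved or a maximum number of passes is reached.

```python
def write_draft(keyword: str, research: str, memory: list[str]) -> str:
    # Hypothetical writer LLM call; prior editor feedback is folded into the prompt.
    feedback = "; ".join(memory) or "none yet"
    return f"Draft on '{keyword}' using research ({research}); feedback applied: {feedback}"

def review_draft(draft: str) -> tuple[bool, str]:
    # Hypothetical LLM-as-judge editor grading the draft against predefined rules.
    approved = "comparison table" in draft
    return approved, "Tighten the introduction and add a comparison table."

def seo_agent(keyword: str, research: str, max_passes: int = 3) -> str:
    memory: list[str] = []          # conversation history between writer and editor
    draft = write_draft(keyword, research, memory)
    for _ in range(max_passes):
        approved, feedback = review_draft(draft)
        if approved:
            break
        memory.append(feedback)     # editor feedback flows back to the writer
        draft = write_draft(keyword, research, memory)
    return draft

print(seo_agent("ai test-driven development", "top 5 ranking articles"))
```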
Vellum Workflows SDK
Vellum Workflows was designed to bridge the gap between product and engineering teams, accelerating AI development while adhering to a test-driven approach [00:27:50]. Recognizing developers’ desire for more control and flexibility, the workflow SDK provides building blocks that are infinitely customizable, with a self-documenting syntax that reveals agent behavior directly in the code [00:28:04]. Its expressiveness ensures understanding at every stage, and the UI and code remain synchronized for team alignment during definition, debugging, or improvement [00:28:28]. The SDK is open source and free [00:28:44].