From: aidotengineer
Vellum has observed that companies successfully deploying reliable AI solutions, from simple to advanced agentic workflows, adopt a test-driven development approach to build robust systems [00:00:06]. This approach is key to building effective agentic workflows [00:00:28].
Evolution of AI Models and Applications
In 2023, many were building “AI wrappers,” prompting arguments that such products lacked a defensibility strategy [00:00:44]. However, the success of Cursor AI, an AI-powered IDE that hit $100 million ARR in 12 months, demonstrated a shift [00:00:52]. This success was driven by models improving at coding, skyrocketing AI adoption, and coding being an obvious first target for disruption by AI models [00:01:11].
More importantly, new techniques and patterns emerged for orchestrating these models to work better, stay in sync with data, and perform effectively in production [00:01:26].
Limitations of Early AI Models
These techniques became crucial due to clear limits in model performance [00:01:35]:
- Hallucinations [00:01:41]
- Overfitting [00:01:43]
- Need for more structured outputs for developers [00:01:45]
While model providers shipped better tooling, significant “leaps” like that between GPT-3.5 and GPT-4 slowed down [00:01:47]. For years, making models bigger and feeding them more data made them smarter, but eventually improvements started to slow, and models seemed to hit limits on existing benchmarks [00:02:01].
New Training Methods
Recent advancements in AI model training methods have pushed the field forward [00:02:24]:
- DeepSeek R1 Model: The first model reportedly trained without labeled data, using “real reinforcement learning” where the model learns on its own [00:02:42].
- Chain of Thought (CoT) Thinking: Used by reasoning models (e.g., OpenAI’s o1 and o3) at inference or response time to generate answers [00:03:03]. This allows models to “think” before responding, solving more complex reasoning problems [00:03:12].
Model providers are also enhancing capabilities, including tool use, research capabilities, and near-perfect OCR accuracy (e.g., Gemini 2.0 Flash) [00:03:24].
Benchmarking Challenges
Traditional benchmarks are saturated, leading to the introduction of new benchmarks that capture the performance of new reasoning models [00:03:41]. “Humanity’s Last Exam,” for example, measures performance on truly difficult tasks, where even the latest models still struggle [00:03:51].
Beyond Model Performance: Building Around AI
For an AI product to work effectively in production, success is not just about the models themselves, but about how you build around them [00:04:10]. This “building around” has evolved in parallel to model training [00:04:20].
Key evolutions in AI techniques and patterns:
- Prompting: Learning to prompt models better and developing advanced techniques like Chain of Thought [00:04:25].
- Retrieval Augmented Generation (RAG): Grounding model responses with proprietary data became important [00:04:31].
- Memory: Crucial for multi-threaded conversations [00:04:42].
- Long Context Models: Enabled new use cases [00:04:47].
- Graph RAG: Experimenting with a hierarchy of responses [00:04:52].
- Reasoning Models: Using models that take more time to think in real-time, developing new use cases [00:04:59].
- Agentic RAG: Making workflows even more powerful [00:05:12].
Even with these techniques, success requires understanding the problem deeply and adopting a test-driven development approach to find the right mix of techniques, models, and logic for a specific use case [00:05:22].
Test-Driven Development for Reliable AI Products
The most effective AI teams follow a structured approach: experiment, evaluate, scale, deploy, and then continuously monitor, observe, and improve their product [00:05:46].
1. Experimentation
Before building anything production-grade, extensive experimentation is needed to prove whether AI models can solve the use case [00:06:12].
- Try different prompting techniques: few-shot prompting for simpler tasks, Chain of Thought for more complex reasoning tasks [00:06:23].
- Test various techniques:
- Prompt chaining: Splitting instructions into multiple sequential prompts (see the sketch after this list) [00:06:33].
- Agentic workflows (e.g., ReAct): Involve planning, reasoning, and refining before generating an answer [00:06:41].
- Involve domain experts: Engineers shouldn’t be the sole prompt tweakers; involving experts saves engineering time [00:06:53].
- Stay model agnostic: Incorporate and test different models to determine which are best for a specific use case (e.g., Gemini 2.0 Flash for OCR) [00:07:13].
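To make the prompt-chaining technique above concrete, here is a minimal sketch that splits one task into two sequential prompts, with the second prompt consuming the first prompt’s output. The `call_llm` helper, the prompts, and the summarize-then-extract task are illustrative assumptions, not something shown in the talk.

```python
# Minimal prompt-chaining sketch. `call_llm` is a hypothetical placeholder for
# a real model client (OpenAI, Anthropic, a Vellum workflow node, etc.).
def call_llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"  # replace with a real SDK call


def summarize_then_extract(document: str) -> str:
    # Step 1: a small, focused prompt that only summarizes.
    summary = call_llm(f"Summarize the following document in 5 bullet points:\n\n{document}")

    # Step 2: a second prompt that works only on the summary, keeping each
    # instruction simple and easy to evaluate in isolation.
    action_items = call_llm(
        "From the summary below, list any action items as a JSON array of strings.\n\n"
        f"{summary}"
    )
    return action_items


print(summarize_then_extract("Meeting notes: ship the beta on Friday; Ana to write docs."))
```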
2. Evaluation
Once models prove effective in experimentation, evaluation is critical to ensure performance under high request loads in production [00:07:44].
- Create a dataset: Develop a dataset of hundreds of examples to test models and workflows against [00:07:57].
- Balance quality, cost, latency, and privacy: No AI system achieves perfection in all areas, so defining priorities early is crucial (e.g., sacrificing speed for quality, or using a cheaper model if cost is critical) [00:08:03].
- Use ground truth data: Subject matter experts should design ground-truth datasets and test models against them. Synthetic benchmarks are less effective for specific use cases [00:08:32].
- LLM-based evaluation: If ground truth data is unavailable, an LLM can evaluate another model’s response; this is a standard and reliable method (a sketch follows this list) [00:08:57].
- Flexible testing framework: The framework should be dynamic to capture non-deterministic responses, allow custom metrics (Python, TypeScript), and avoid overly rigid structures [00:09:14].
- Run evaluations at every stage: Implement guardrails to check internal nodes and ensure correct responses at each step. Test during prototyping and re-evaluate with real data [00:09:48].
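As a hedged illustration of the evaluation ideas above (a small dataset, a custom Python metric, and an LLM judge when no exact ground truth match exists), the sketch below runs a hypothetical `run_workflow` over a few examples. Every name, prompt, and score format here is an assumption for illustration.

```python
# Illustrative evaluation loop: dataset -> workflow -> custom metric + LLM judge.
def call_llm(prompt: str) -> str:
    return "4"  # placeholder; replace with a real model call


def run_workflow(inputs: dict) -> str:
    return "Refunds are accepted within 30 days [policy-12]."  # workflow under test


def contains_citation(output: str) -> bool:
    # Example of a deterministic custom metric that runs alongside the LLM judge.
    return "[" in output and "]" in output


dataset = [
    {"inputs": {"question": "What is our refund window?"}, "expected": "30 days"},
    # ...hundreds of examples in practice
]

results = []
for example in dataset:
    output = run_workflow(example["inputs"])
    judge_prompt = (
        "Rate 1-5 how well the answer matches the expected answer.\n"
        f"Expected: {example['expected']}\nAnswer: {output}\n"
        "Reply with a single digit."
    )
    results.append({
        "judge": int(call_llm(judge_prompt).strip()),
        "has_citation": contains_citation(output),
    })

print("avg judge score:", sum(r["judge"] for r in results) / len(results))
```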
3. Deployment
Once satisfied with the product, deploy it to production [00:10:23].
- Monitor more than deterministic outputs: Log all LLM calls, track inputs, outputs, and latency [00:10:35].
- AI models are unpredictable; debugging and understanding behavior at every step is essential, especially with complex agentic workflows that can take different paths and make their own decisions [00:10:46].
- Handle API reliability: Maintain stability with retries and fallback logic to prevent outages, e.g., switching to a different model if OpenAI goes down (a minimal sketch follows this list) [00:11:09].
- Version control and staging: Always deploy in controlled environments before wider public release. Ensure new updates don’t introduce regressions or break existing workflows [00:11:35].
- Decouple deployments: Separate AI feature updates from scheduled app deployments, as AI features often need more frequent updates [00:12:00].
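A minimal sketch of the logging, retry, and fallback ideas above; the model names, `call_model` helper, and backoff policy are assumptions rather than any specific provider’s API.

```python
import time

PRIMARY, FALLBACK = "primary-model", "fallback-model"  # illustrative names


def call_model(model: str, prompt: str) -> str:
    return f"<{model} output>"  # replace with a real SDK call; may raise during outages


def generate(prompt: str, retries: int = 3) -> str:
    for model in (PRIMARY, FALLBACK):          # fall back if the primary keeps failing
        for attempt in range(retries):
            try:
                start = time.time()
                output = call_model(model, prompt)
                # Log every call: model, latency, and (in practice) inputs and outputs.
                print({"model": model, "latency_s": round(time.time() - start, 3)})
                return output
            except Exception:
                time.sleep(2 ** attempt)       # exponential backoff before retrying
    raise RuntimeError("All models failed")


print(generate("Summarize today's support tickets."))
```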
4. Continuous Improvement
After deployment, capture user responses to identify edge cases and continuously improve the workflow [00:12:21].
- Feedback loops: Capture responses, run evaluations again, and develop new prompts to solve new cases [00:12:26].
- Caching layer: For repeat queries, caching drastically reduces costs and improves latency by storing frequent responses and serving them instantly instead of calling expensive LLMs (sketched after this list) [00:12:47].
- Fine-tune custom models: Use collected production data to fine-tune custom models for specific use cases, reducing reliance on API calls and lowering costs [00:13:16].
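As a sketch of the caching idea above, the snippet below serves exact repeats of a prompt from an in-memory dictionary instead of re-calling the model. In production the store would typically be Redis or a database, and semantic (embedding-based) caching is a common extension; all names here are illustrative.

```python
import hashlib

_cache: dict[str, str] = {}  # stand-in for Redis or another shared store


def call_llm(prompt: str) -> str:
    return "<expensive model output>"  # placeholder for a real, paid model call


def cached_generate(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]        # instant and effectively free
    output = call_llm(prompt)     # slow, paid call only on a cache miss
    _cache[key] = output
    return output


cached_generate("What is your refund policy?")  # miss: calls the model
cached_generate("What is your refund policy?")  # hit: served from the cache
```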
Agentic Workflows
The test-driven approach is even more critical for agentic workflows due to their complexity, tool use, multiple APIs, and parallel execution [00:13:34]. Evaluation must not only measure performance at every step but also assess agent behavior to ensure correct decisions and adherence to intended logic [00:13:53].
Every AI workflow has some level of agentic behavior, varying in control, reasoning, and autonomy [00:14:36]. A framework defines five levels of agentic behavior [00:14:50]:
Levels of Agentic Behavior
L0: Basic LLM Call [00:15:19]
- Involves an LLM call, data retrieval from a vector database, and inline evaluations [00:15:21].
- No external reasoning, planning, or decision-making beyond what’s in the prompt or model behavior [00:15:30]. The model itself does the reasoning within the prompt [00:15:38].
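A minimal sketch of this L0 pattern, with `search_vector_db` and `call_llm` as hypothetical placeholders and a trivial inline check standing in for a real evaluation:

```python
def search_vector_db(query: str, k: int = 3) -> list[str]:
    return ["<relevant chunk>"] * k  # placeholder for a real vector store query


def call_llm(prompt: str) -> str:
    return "<answer grounded in the retrieved chunks>"  # placeholder model call


def answer(question: str) -> str:
    chunks = search_vector_db(question)
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    response = call_llm(prompt)
    # Inline evaluation: a simple guardrail checked before returning the response.
    assert response.strip(), "empty model response"
    return response  # no planning or decision-making outside the prompt itself
```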
L1: Tool Use [00:15:52]
- The AI system uses tools (APIs) and decides when to call them and which actions to take (a sketch follows this list) [00:15:56].
- The model decides whether to call a tool or a vector database for data retrieval [00:16:13].
- Memory plays a key role for multi-threaded conversations, capturing context [00:16:23].
- Evaluation is needed at every step to ensure correct decisions, tool usage, and accurate responses [00:16:35].
- Workflows can be simple or complex with branching and many tools [00:16:50].
- Many production-grade solutions currently fall within the L1 segment (e.g., Redfin, Headspace) [00:20:41]. Focus is on orchestration: how models interact with systems and data [00:21:01].
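A hedged sketch of an L1 loop: the model decides whether to request a tool, the orchestration code executes it, and the result is appended to memory (the chat history). The JSON tool-call protocol, the `get_weather` tool, and the `call_llm` stub are assumptions for illustration.

```python
import json


def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for an external API

TOOLS = {"get_weather": get_weather}


def call_llm(messages: list[dict]) -> str:
    # Placeholder: the first turn requests a tool, later turns return a final answer.
    if any(m["role"] == "tool" for m in messages):
        return "It is sunny in Berlin."
    return json.dumps({"tool": "get_weather", "args": {"city": "Berlin"}})


def run(user_message: str, max_steps: int = 5) -> str:
    memory = [{"role": "user", "content": user_message}]        # memory = chat history
    for _ in range(max_steps):
        reply = call_llm(memory)
        try:
            request = json.loads(reply)                          # the model chose a tool
        except json.JSONDecodeError:
            return reply                                         # plain text: final answer
        result = TOOLS[request["tool"]](**request["args"])       # execute the chosen tool
        memory.append({"role": "tool", "content": result})       # feed the result back
    return "Stopped after max_steps"


print(run("What's the weather in Berlin?"))
```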
L2: Structured Reasoning [00:17:12]
- Workflows move beyond simple tool use to structured reasoning [00:17:26].
- The workflow notices triggers, plans actions, and executes tasks in a structured sequence [00:17:28].
- Can break down tasks, retrieve information, call tools, evaluate usefulness, refine, and continuously loop to generate output (see the sketch after this list) [00:17:37].
- Agentic behavior becomes more intentional; the system actively decides what needs to be done and spends more time thinking [00:17:54].
- The process is still finite, terminating after completing planned steps [00:18:16].
- Most innovation this year is expected to happen at L2, with AI agents planning and reasoning using models like o1, o3, or DeepSeek R1 [00:21:41].
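A minimal L2 sketch under those assumptions: plan the task, execute each step, evaluate the result, refine, and terminate after a fixed number of rounds. Every helper and prompt below is an illustrative placeholder.

```python
def call_llm(prompt: str) -> str:
    return "<model output>"  # placeholder for a reasoning-capable model call


def plan(task: str) -> list[str]:
    # A reasoning model would break the task into a structured sequence of steps.
    return call_llm(f"Break this task into 3 short steps:\n{task}").split("\n")


def execute(step: str) -> str:
    return call_llm(f"Do this step and report the result:\n{step}")


def is_good_enough(draft: str) -> bool:
    return call_llm(f"Is this result complete? Answer yes or no.\n{draft}").lower().startswith("yes")


def run(task: str, max_rounds: int = 3) -> str:
    draft = ""
    for _ in range(max_rounds):                    # finite: stops after the planned rounds
        for step in plan(task):
            draft += execute(step) + "\n"
        if is_good_enough(draft):                  # evaluate the usefulness of the output
            break
        task = f"Improve this draft:\n{draft}"     # refine and loop again
    return draft
```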
L3: Proactive Autonomy [00:18:33]
- The system proactively takes actions without waiting for direct input [00:18:45].
- Instead of terminating after a single request, it stays alive, continuously monitors its environment (e.g., email, Slack, Google Drive), and reacts as needed [00:18:52].
- Can plan next moves, execute actions in real-time, or ask humans for input [00:19:11].
- AI workflows become less of a tool and more of an independent system, truly easing work [00:19:19].
L4: Fully Creative/Inventor [00:19:38]
- Goes beyond automation and reasoning to become an inventor [00:19:44].
- Creates its own new workflows, utilities (agents, prompts, function calls, tools) [00:19:56].
- Solves problems in novel ways [00:20:06].
- Currently out of reach due to model constraints such as overfitting (models adhering too closely to their training data) and inductive bias (models making assumptions based on that data) [00:20:09]. The goal is AI that invents, improves, and solves problems in unforeseen ways [00:20:30].
L3 and L4 levels are still limited by current models and surrounding logic [00:22:22].
Practical Example: SEO Agent
An SEO agent automates the entire SEO process, from keyword research to content analysis and creation [00:23:04]. It decides when to use tools and has an embedded evaluator [00:23:12]. This workflow lies between L1 and L2 agentic workflows [00:23:37].
Workflow Steps:
- SEO Analyst & Researcher:
- Takes a keyword (e.g., “Chain of Thought prompting”) and parameters (writing style, audience) [00:24:43].
- Calls Google Search to analyze top-performing articles for the keyword [00:23:45].
- Identifies strong components to amplify in the new article and missing segments/areas for improvement [00:23:52].
- The researcher utilizes missing opportunities to perform further searches and gather more data, aiming for a superior article [00:25:38].
- Writer:
- Receives all research and planning information as context [00:24:13].
- Creates a comprehensive first draft, using context from analyzed articles [00:26:02].
- Can connect to a RAG system to look into a database of articles and learnings [00:26:27].
- Editor:
- An LLM-based judge that evaluates the first draft against predefined rules [00:24:19].
- Provides feedback to the writer [00:26:36].
- The feedback is passed back to the writer via a memory component (chat history) [00:24:31].
- This feedback loop continues until specific criteria are met, e.g., the editor deems the post “excellent” or a set number of loops is completed (see the sketch after this list) [00:24:33].
- Final Article: The process yields a useful, contextually rich article, not a generic, unhelpful piece [00:24:49].
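A hedged sketch of that researcher, writer, and editor loop; the prompts, the “excellent” stopping criterion, and the `call_llm` helper are illustrative assumptions rather than Vellum’s actual implementation.

```python
def call_llm(prompt: str) -> str:
    return "<model output>"  # placeholder for a real model call


def research(keyword: str) -> str:
    # In the real workflow this step calls Google Search and analyzes top articles.
    return call_llm(f"Summarize what top-ranking articles cover for: {keyword}")


def write(keyword: str, notes: str, chat_history: list[str]) -> str:
    feedback = "\n".join(chat_history)             # memory: prior editor feedback
    return call_llm(
        f"Write an article on '{keyword}'.\nResearch:\n{notes}\nFeedback so far:\n{feedback}"
    )


def edit(draft: str) -> str:
    # LLM-as-judge: reply "excellent" or give concrete feedback.
    return call_llm(f"Review this draft against our style rules:\n{draft}")


def seo_agent(keyword: str, max_loops: int = 3) -> str:
    notes, history = research(keyword), []
    draft = write(keyword, notes, history)
    for _ in range(max_loops):                     # loop until approved or budget spent
        verdict = edit(draft)
        if "excellent" in verdict.lower():
            break
        history.append(verdict)                    # feedback flows back via chat history
        draft = write(keyword, notes, history)
    return draft


print(seo_agent("chain of thought prompting"))
```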
This workflow saves significant time by providing a strong foundation for content creation [00:27:47]. Products like Vellum Workflows are designed to bridge product and engineering teams, speeding up AI development while adhering to a test-driven approach [00:27:50]. The Vellum workflow SDK offers building blocks, customizability, and self-documenting syntax, keeping UI and code in sync for alignment across teams [00:28:04].