From: aidotengineer

Vellum has observed that companies successfully deploying reliable AI solutions, from simple to advanced agentic workflows, adopt a test-driven development approach to build robust systems [00:00:06]. This approach is key to building effective agentic workflows [00:00:28].

Evolution of AI Models and Applications

In 2023, many were building “AI wrappers,” drawing criticism that they lacked a defensibility strategy [00:00:44]. However, the success of Cursor AI, an AI-powered IDE that hit $100 million ARR in 12 months, demonstrated a shift [00:00:52]. This success was driven by models improving at coding, skyrocketing AI adoption, and coding being an obvious first target for AI disruption [00:01:11].

More importantly, new techniques and patterns emerged for orchestrating these models to work better, sync with data, and perform effectively in production [00:01:26].

Limitations of Early AI Models

These techniques became crucial due to clear limits in model performance [00:01:35].

While model providers shipped better tooling, significant “leaps” like the one between GPT-3.5 and GPT-4 slowed down [00:01:47]. For years, making models bigger and feeding them more data made them smarter, but eventually improvements began to slow, and models seemed to hit ceilings on existing benchmarks [00:02:01].

New Training Methods

Recent advancements in AI model training methods have pushed the field forward [00:02:24]:

  • DeepSeek R1 Model: The first model reportedly trained without labeled data, using “real reinforcement learning” where the model learns on its own [00:02:42].
  • Chain of Thought (CoT) Thinking: Used by reasoning models (e.g., OpenAI’s o1 and o3) at inference or response time to generate answers [00:03:03]. This allows models to “think” before responding, solving more complex reasoning problems [00:03:12]. (A minimal prompting sketch follows.)
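
To make the inference-time “thinking” concrete, here is a minimal sketch of chain-of-thought prompting; `call_llm` is a hypothetical stand-in for any chat-completion API, not a specific provider’s SDK.

```python
# Minimal chain-of-thought prompting sketch.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with your model provider's SDK call.
    raise NotImplementedError

def answer_with_cot(question: str) -> str:
    # Asking for explicit intermediate reasoning before the final answer
    # is the core of chain-of-thought prompting.
    prompt = (
        "Think through the problem step by step, then give the final "
        "answer on the last line, prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```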

Model providers are also enhancing capabilities, including tool use, research capabilities, and near-perfect OCR accuracy (e.g., Gemini 2.0 Flash) [00:03:24].

Benchmarking Challenges

Traditional benchmarks are saturated, prompting the introduction of new benchmarks that capture the performance of the new reasoning models [00:03:41]. “Humanity’s Last Exam,” for example, measures performance on truly difficult tasks where even the latest models still struggle [00:03:51].

Beyond Model Performance: Building Around AI

For an AI product to work effectively in production, success is not just about the models themselves, but about how you build around them [00:04:10]. This “building around” has evolved in parallel to model training [00:04:20].

Key evolutions in AI techniques and patterns:

  • Prompting: Learning to prompt models better and developing advanced techniques like Chain of Thought [00:04:25].
  • Retrieval Augmented Generation (RAG): Grounding model responses in proprietary data became important [00:04:31] (see the sketch after this list).
  • Memory: Crucial for multi-threaded conversations [00:04:42].
  • Long Context Models: Enabled new use cases [00:04:47].
  • Graph RAG: Experimenting with graph-based, hierarchical retrieval [00:04:52].
  • Reasoning Models: Using models that spend more time thinking at inference time, opening up new use cases [00:04:59].
  • Agentic RAG: Making workflows even more powerful [00:05:12].
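
As referenced in the RAG item above, a minimal sketch of retrieval-augmented generation follows; `retrieve` and `call_llm` are hypothetical stand-ins for a vector-store query and a chat-completion call.

```python
# Minimal RAG sketch: ground the model's answer in retrieved documents.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Hypothetical stand-in: query your vector database here.
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call your model provider here.
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below; say you don't know if the "
        "context is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```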

Even with these techniques, success requires understanding the problem deeply and adopting a test-driven development approach to find the right mix of techniques, models, and logic for a specific use case [00:05:22].

Test-Driven Development for Reliable AI Products

The most effective AI teams follow a structured approach: experiment, evaluate, scale, deploy, and then continuously monitor, observe, and improve their product [00:05:46].

1. Experimentation

Before building anything production-grade, extensive experimentation is needed to prove whether AI models can solve the use case [00:06:12].

  • Try different prompting techniques: Few-shot prompting for simpler tasks, Chain of Thought for complex reasoning [00:06:23].
  • Test various techniques:
    • Prompt chaining: Splitting instructions into multiple sequential prompts [00:06:33] (sketched after this list).
    • Agentic workflows (e.g., ReAct): Involve planning, reasoning, and refining before generating an answer [00:06:41].
  • Involve domain experts: Engineers shouldn’t be the sole prompt tweakers; involving experts saves engineering time [00:06:53].
  • Stay model-agnostic: Incorporate and test different models to determine which works best for a specific use case (e.g., Gemini 2.0 Flash for OCR) [00:07:13].
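
The prompt-chaining item above can be sketched in a few lines: split one complex instruction into sequential prompts, feeding each output into the next. `call_llm` is again a hypothetical stand-in.

```python
# Prompt chaining sketch: two focused prompts instead of one complex one.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call your model provider here.
    raise NotImplementedError

def summarize_then_brief(document: str) -> str:
    # Step 1: extract key points from the raw input.
    key_points = call_llm(f"List the key points of this document:\n\n{document}")
    # Step 2: operate on the intermediate result, not the raw input.
    return call_llm(f"Rewrite these key points as a one-paragraph brief:\n\n{key_points}")
```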

2. Evaluation

Once models prove effective in experimentation, evaluation is critical to ensure performance under high request loads in production [00:07:44].

  • Create a dataset: Develop a dataset of hundreds of examples to test models and workflows against [00:07:57].
  • Balance quality, cost, latency, and privacy: No AI system achieves perfection in all areas, so defining priorities early is crucial (e.g., sacrificing speed for quality, or using a cheaper model if cost is critical) [00:08:03].
  • Use ground truth data: Subject matter experts should design datasets and test models against them; synthetic benchmarks are less effective for specific use cases [00:08:32].
  • LLM-based evaluation: If ground truth data is unavailable, an LLM can evaluate another model’s response; this is a standard and reliable method [00:08:57] (see the sketch after this list).
  • Flexible testing framework: The framework should be dynamic enough to capture non-deterministic responses, allow custom metrics (in Python or TypeScript), and avoid rigid structures [00:09:14].
  • Run evaluations at every stage: Implement guardrails to check internal nodes and ensure correct responses at each step. Test during prototyping and re-evaluate with real data [00:09:48].
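
As referenced above, a minimal LLM-as-judge sketch might look like the following; `call_llm` is a hypothetical stand-in for the (ideally stronger) judge model, and the JSON rubric is illustrative.

```python
# LLM-as-judge sketch: grade another model's response without ground truth.
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call your judge model here.
    raise NotImplementedError

def judge(question: str, response: str) -> dict:
    prompt = (
        "Grade the response to the question on a 1-5 scale for accuracy "
        "and completeness. Reply as JSON: "
        '{"score": <int>, "reason": "<short explanation>"}\n\n'
        f"Question: {question}\nResponse: {response}"
    )
    return json.loads(call_llm(prompt))
```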

3. Deployment

Once satisfied with the product, deploy it to production [00:10:23].

  • Monitor more than deterministic outputs: Log all LLM calls, track inputs, outputs, and latency [00:10:35].
    • AI models are unpredictable; debugging and understanding behavior at every step is essential, especially with complex agentic workflows that can take different paths and make their own decisions [00:10:46].
  • Handle API reliability: Maintain stability with retries and fallback logic to prevent outages (e.g., switching to a different model if OpenAI goes down) [00:11:09] (see the sketch after this list).
  • Version control and staging: Always deploy in controlled environments before wider public release. Ensure new updates don’t introduce regressions or break existing workflows [00:11:35].
  • Decouple deployments: Separate AI feature updates from scheduled app deployments, as AI features often need more frequent updates [00:12:00].
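
A minimal sketch of the retry-and-fallback pattern mentioned above, assuming two hypothetical provider clients (`call_primary`, `call_fallback`):

```python
# Retry with exponential backoff, then fall back to a second provider.
import time

def call_primary(prompt: str) -> str:
    # Hypothetical stand-in for the primary provider's client.
    raise NotImplementedError

def call_fallback(prompt: str) -> str:
    # Hypothetical stand-in for the fallback provider's client.
    raise NotImplementedError

def call_with_fallback(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except Exception:
            time.sleep(2 ** attempt)  # back off between retries
    return call_fallback(prompt)      # primary is down: switch providers
```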

4. Continuous Improvement

After deployment, capture user responses to identify edge cases and continuously improve the workflow [00:12:21].

  • Feedback loops: Capture responses, run evaluations again, and develop new prompts to solve new cases [00:12:26].
  • Caching layer: For repeat queries, caching drastically reduces costs and improves latency by storing frequent responses and serving them instantly instead of calling expensive LLMs [00:12:47] (sketched after this list).
  • Fine-tune custom models: Use collected production data to fine-tune custom models for specific use cases, reducing reliance on API calls and lowering costs [00:13:16].
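
A minimal in-process version of the caching layer described above; a production cache would likely live in Redis or similar and may normalize or embed queries before matching. `call_llm` is a hypothetical stand-in.

```python
# Response cache sketch: pay for each unique prompt only once.
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call your model provider here.
    raise NotImplementedError

def cached_llm(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only the first occurrence hits the LLM
    return _cache[key]                  # repeats are served instantly
```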

Agentic Workflows

The test-driven approach is even more critical for agentic workflows due to their complexity, tool use, multiple APIs, and parallel execution [00:13:34]. Evaluation must not only measure performance at every step but also assess agent behavior to ensure correct decisions and adherence to intended logic [00:13:53].

Every AI workflow has some level of agentic behavior, varying in control, reasoning, and autonomy [00:14:36]. A framework defines five levels of agentic behavior [00:14:50]:

Levels of Agentic Behavior

  1. L0: Basic LLM Call [00:15:19]

    • Involves an LLM call, data retrieval from a vector database, and inline evaluations [00:15:21].
    • No external reasoning, planning, or decision-making beyond what’s in the prompt or model behavior [00:15:30]. The model itself does the reasoning within the prompt [00:15:38].
  2. L1: Tool Use [00:15:52]

    • The AI system uses tools (APIs), deciding when to call them and what actions to take [00:15:56] (a minimal sketch follows this list of levels).
    • The model decides whether to call a tool or a vector database for data retrieval [00:16:13].
    • Memory plays a key role for multi-threaded conversations, capturing context [00:16:23].
    • Evaluation is needed at every step to ensure correct decisions, tool usage, and accurate responses [00:16:35].
    • Workflows can be simple or complex with branching and many tools [00:16:50].
    • Many production-grade solutions currently fall within the L1 segment (e.g., Redfin, Headspace) [00:20:41]. The focus here is on orchestration: how models interact with systems and data [00:21:01].
  3. L2: Structured Reasoning [00:17:12]

    • Workflows move beyond simple tool use to structured reasoning [00:17:26].
    • The workflow notices triggers, plans actions, and executes tasks in a structured sequence [00:17:28].
    • Can break down tasks, retrieve info, call tools, evaluate usefulness, refine, and continuously loop to generate output [00:17:37].
    • Agentic behavior becomes more intentional; the system actively decides what needs to be done and spends more time thinking [00:17:54].
    • The process is still finite, terminating after completing planned steps [00:18:16].
    • Most innovation this year is expected to happen at L2, with AI agents planning and reasoning using models like o1, o3, or DeepSeek R1 [00:21:41].
  4. L3: Proactive Autonomy [00:18:33]

    • The system proactively takes actions without waiting for direct input [00:18:45].
    • Instead of terminating after a single request, it stays alive, continuously monitors its environment (e.g., email, Slack, Google Drive), and reacts as needed [00:18:52].
    • Can plan next moves, execute actions in real-time, or ask humans for input [00:19:11].
    • AI workflows become less of a tool and more of an independent system, truly easing work [00:19:19].
  5. L4: Fully Creative/Inventor [00:19:38]

    • Goes beyond automation and reasoning to become an inventor [00:19:44].
    • Creates its own new workflows and utilities (agents, prompts, function calls, tools) [00:19:56].
    • Solves problems in novel ways [00:20:06].
    • Currently out of reach due to model constraints such as overfitting (models over-relying on their training data) and inductive bias (models making assumptions based on their training data) [00:20:09]. The goal is AI that invents, improves, and solves problems in unforeseen ways [00:20:30].

L3 and L4 levels are still limited by current models and surrounding logic [00:22:22].
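
To ground the L1 (tool use) level described above, here is a minimal sketch in which the model decides whether to call a tool and the application executes it; `call_llm`, the JSON protocol, and the `get_weather` tool are all illustrative assumptions rather than a specific function-calling API.

```python
# Minimal L1 sketch: the model decides, the application acts.
import json

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # stub tool for the sketch
}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call a model that can follow the JSON protocol.
    raise NotImplementedError

def run_l1(user_message: str) -> str:
    decision = json.loads(call_llm(
        "Decide whether a tool is needed. Reply as JSON: "
        '{"tool": "get_weather" or null, "arg": "<city>" or null, '
        '"answer": "<direct answer>" or null}\n\n'
        f"User: {user_message}"
    ))
    if decision.get("tool"):  # the model chose to act
        result = TOOLS[decision["tool"]](decision["arg"])
        return call_llm(f"Answer the user with this tool result: {result}\n\nUser: {user_message}")
    return decision["answer"]
```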

Practical Example: SEO Agent

An SEO agent automates the entire SEO process, from keyword research to content analysis and creation [00:23:04]. It decides when to use tools and has an embedded evaluator [00:23:12]. This workflow lies between L1 and L2 agentic workflows [00:23:37].

Workflow Steps:

  1. SEO Analyst & Researcher:
    • Takes a keyword (e.g., “Chain of Thought prompting”) and parameters (writing style, audience) [00:24:43].
    • Calls Google search to analyze top-performing articles for the keyword [00:23:45].
    • Identifies strong components to amplify in the new article and missing segments/areas for improvement [00:23:52].
    • The researcher utilizes missing opportunities to perform further searches and gather more data, aiming for a superior article [00:25:38].
  2. Writer:
    • Receives all research and planning information as context [00:24:13].
    • Creates a comprehensive first draft, using context from analyzed articles [00:26:02].
    • Can connect to a RAG system to look into a database of articles and learnings [00:26:27].
  3. Editor:
    • An LLM-based judge that evaluates the first draft against predefined rules [00:24:19].
    • Provides feedback to the writer [00:26:36].
    • The feedback is passed back to the writer via a memory component (chat history) [00:24:31].
    • This feedback loop continues until specific criteria are met (e.g., the editor deems the post “excellent,” or a set number of loops completes) [00:24:33] (sketched below).
  4. Final Article: The process yields a useful, contextually rich article, not a generic, unhelpful piece [00:24:49].
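
A minimal sketch of the writer-editor loop from steps 2-3, assuming a hypothetical `call_llm`; the 'EXCELLENT' sentinel and loop budget mirror the termination criteria described above.

```python
# Writer-editor loop sketch: revise until the editor approves or the budget runs out.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: call your model provider here.
    raise NotImplementedError

def write_article(research: str, max_loops: int = 3) -> str:
    history = ""  # memory component: editor feedback accumulates as chat history
    draft = call_llm(f"Write a first draft from this research:\n\n{research}")
    for _ in range(max_loops):
        feedback = call_llm(
            "Critique this draft against the style rules. Reply with only "
            f"'EXCELLENT' if no changes are needed:\n\n{draft}"
        )
        if "EXCELLENT" in feedback:
            break  # editor approval ends the loop early
        history += f"\nEditor feedback: {feedback}"
        draft = call_llm(f"Revise the draft using the feedback.{history}\n\nDraft:\n{draft}")
    return draft
```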

This workflow saves significant time by providing a strong foundation for content creation [00:27:47]. Products like Vellum Workflows are designed to bridge product and engineering teams, speeding up AI development while adhering to a test-driven approach [00:27:50]. The Vellum Workflows SDK offers building blocks, customizability, and self-documenting syntax, keeping UI and code in sync for alignment across teams [00:28:04].