From: aidotengineer
The success of AI products in production hinges not just on the models themselves, but on how systems are built and orchestrated around them [04:15:15]. This involves evolving techniques and patterns for orchestrating models, connecting them to data, and making them work reliably in production [01:26:00].

Evolution of AI Engineering and Orchestration

In 2023, while many built “AI wrappers,” the notion of defensibility was often questioned [00:44:00]. However, the rapid growth of AI-powered tools like Cursor AI demonstrates that significant advancements have occurred [00:52:00]. This progress is due to models improving at coding, increased AI adoption, and the development of new techniques and patterns for orchestrating models [01:11:00].

Why Orchestration Techniques are Essential

Despite model advancements, fundamental limitations persist:

  • Hallucinations are still a concern [01:41:00].
  • Overfitting remains a problem [01:43:00].
  • Developers require more structured outputs [01:45:00].
  • The “big jumps” in model performance, like between GPT-3.5 and GPT-4, have slowed down [01:53:00]. Models reached limits on existing tests, suggesting that making models bigger and adding more data hit a wall [02:01:00].

New training methods, such as pure reinforcement learning (e.g., the DeepSeek R1 model, trained without labeled data), have pushed the field forward [02:38:00]. Reasoning models (e.g., OpenAI’s o1 and o3) apply chain-of-thought reasoning at inference time, allowing them to “think” before responding and to solve complex reasoning problems [03:03:00]. Model providers are also adding more capabilities, such as tool use and near-perfect OCR accuracy (Gemini 2.0 Flash) [03:24:00]. However, traditional benchmarks are saturated, leading to the introduction of new, more difficult benchmarks like Humanity’s Last Exam [03:41:00].

Key Techniques and Patterns in AI Orchestration

Parallel to model training, AI engineering has developed various techniques to build robust systems:

  • Prompting Techniques: Learning how to prompt models better, leading to advanced techniques like Chain of Thought [04:25:00].
  • Retrieval Augmented Generation (RAG): Grounding model responses with proprietary data became crucial (see the sketch after this list) [04:31:00].
  • Memory: Essential for multi-threaded conversations to capture context [04:42:00].
  • Long Context Models: Enabled new use cases due to extended context windows [04:47:00].
  • Graph RAG: Experimenting with hierarchical responses [04:52:00].
  • Reasoning Models: Utilizing models that spend more time reasoning at inference time, opening new development areas [04:59:00].
  • Agentic RAG: Making workflows more powerful and autonomous [05:12:00].
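
To make the RAG pattern above concrete, here is a minimal sketch of the orchestration shape: retrieve relevant snippets, ground the prompt with them, then generate. The `embed` and `call_llm` helpers are hypothetical placeholders standing in for whatever embedding model and LLM client you use, not any specific vendor's API.

```python
# Minimal RAG sketch: retrieve relevant snippets, then ground the prompt with them.
# `embed` and `call_llm` are hypothetical placeholders for your embedding model
# and LLM client; swap in real implementations.
from math import sqrt

def embed(text: str) -> list[float]:
    # Placeholder embedding: a character histogram. Use a real embedding model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank proprietary documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (OpenAI, Anthropic, Gemini, ...).
    return f"[LLM response grounded in a prompt of {len(prompt)} chars]"

def answer(query: str, documents: list[str]) -> str:
    context = "\n".join(retrieve(query, documents))
    prompt = f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

if __name__ == "__main__":
    docs = ["Refunds are processed within 5 business days.",
            "Premium plans include priority support."]
    print(answer("How long do refunds take?", docs))
```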

Test-Driven Development for Reliable AI Products

To build AI products that truly work in production, a deep understanding of the problem and a test-driven development (TDD) approach are essential [05:24:00]. The best AI teams follow a structured approach involving experimentation, evaluation, deployment, and continuous improvement [05:46:00].

Stages of Test-Driven AI Development

1. Experimentation

Before building production-grade solutions, extensive experimentation is needed to prove if AI models can solve a specific use case [06:14:00].

  • Prompting Techniques: Try different approaches like few-shot or Chain of Thought prompting. Some work well for simple tasks, others for complex reasoning [06:23:00].
  • Workflow Techniques: Test prompt chaining (splitting instructions into multiple prompts; a sketch follows this list) or agentic workflows (e.g., ReAct, which involves planning, reasoning, and refining) [06:33:00].
  • Domain Experts: Involve domain experts early to tweak prompts and save engineering time, ensuring the solution addresses the actual problem [06:53:00].
  • Model Agnosticism: Incorporate and test different models to identify which ones perform best for specific use cases (e.g., Gemini 2.0 Flash for OCR) [07:13:00].
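
As a concrete illustration of prompt chaining from the workflow-techniques bullet above, the sketch below splits one complex instruction into two focused prompts, where the second prompt consumes the first prompt's output. `call_llm` is again a hypothetical placeholder for whichever model you are experimenting with.

```python
# Prompt-chaining sketch: split one big instruction into two focused prompts.
# `call_llm` is a hypothetical placeholder for your model client of choice.

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned string for illustration.
    return f"[model output for: {prompt[:40]}...]"

def extract_key_points(document: str) -> str:
    # Step 1: a narrow prompt that only extracts facts, nothing else.
    prompt = f"List the key facts in the document below as bullet points.\n\n{document}"
    return call_llm(prompt)

def write_summary(key_points: str, audience: str) -> str:
    # Step 2: a second prompt that consumes step 1's output and handles style.
    prompt = (f"Write a three-sentence summary for a {audience} audience "
              f"using only these facts:\n{key_points}")
    return call_llm(prompt)

if __name__ == "__main__":
    doc = "Q3 revenue grew 12%. Churn dropped to 2%. Two new regions launched."
    points = extract_key_points(doc)
    print(write_summary(points, audience="executive"))
```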

2. Evaluation

Once initial proofs of concept are established, evaluation ensures scalability and reliability in production [07:44:00].

  • Data Set Creation: Create a dataset of hundreds of examples to test models and workflows against [07:57:00].
  • Trade-offs: Balance quality, cost, latency, and privacy. Define priorities early, as no AI system perfectly achieves all [08:03:00]. For example, high quality might sacrifice speed, or cost-critical applications might use lighter, cheaper models [08:16:00].
  • Ground Truth Data: Use ground truth data designed by subject matter experts to evaluate workflows [08:32:00]. Synthetic benchmarks are helpful but don’t fully capture real-world use case performance [08:46:00].
  • LLM as Evaluator: If ground truth data is unavailable, an LLM can reliably evaluate another model’s response [08:58:00].
  • Flexible Testing Framework: Use a framework that is dynamic and customizable, so it can handle non-deterministic responses, define custom metrics, and allow scripting in languages like Python or TypeScript (a minimal sketch follows this list) [09:14:00].
  • Multi-Stage Evaluation: Run evaluations at every stage, including internal nodes of the workflow, to ensure correctness and test during prototyping and with real data [09:48:00].
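
A minimal sketch of such a testing setup is below, assuming a small hand-built dataset, a ground-truth check where expert labels exist, and a hypothetical `judge_llm` fallback where they don't. The `run_workflow` and `judge_llm` functions are placeholders; a dedicated evaluation framework would add reporting, concurrency, and richer custom metrics on top of this shape.

```python
# Evaluation-harness sketch: run a workflow over a dataset and score each case,
# using ground truth where available and an LLM judge otherwise.
# `run_workflow` and `judge_llm` are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Case:
    input: str
    expected: str | None  # ground truth from a subject matter expert, if any

def run_workflow(user_input: str) -> str:
    # Placeholder for the AI workflow under test.
    return "Refunds are processed within 5 business days."

def judge_llm(question: str, answer: str) -> bool:
    # Placeholder LLM-as-judge call: ask a second model whether the answer is acceptable.
    return len(answer) > 0

def exact_or_contains(expected: str, answer: str) -> bool:
    # Tolerant ground-truth metric, since responses are non-deterministic.
    return expected.lower() in answer.lower()

def evaluate(dataset: list[Case]) -> float:
    passed = 0
    for case in dataset:
        answer = run_workflow(case.input)
        ok = (exact_or_contains(case.expected, answer)
              if case.expected is not None
              else judge_llm(case.input, answer))
        passed += ok
    return passed / len(dataset)

if __name__ == "__main__":
    data = [Case("How long do refunds take?", "5 business days"),
            Case("Summarize our refund policy.", None)]
    print(f"pass rate: {evaluate(data):.0%}")
```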

3. Deployment

Deployment requires careful monitoring and robust infrastructure [10:25:00].

  • Monitoring: Log all LLM calls, track inputs, outputs, and latency. Since AI models are unpredictable, detailed monitoring is crucial for debugging and understanding behavior at every step, especially with agentic workflows [10:35:00].
  • API Reliability: Implement retries and fallback logic to maintain stability and prevent outages, e.g., switching models if one provider’s API is down (see the sketch after this list) [11:09:00].
  • Version Control and Staging: Always deploy in controlled environments before public rollout to prevent regressions when updating prompts or workflows [11:35:00].
  • Decoupled Deployments: Decouple AI feature deployments from the main app deployment schedule, as AI features often require more frequent updates [12:00:00].
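
A sketch of the retry-and-fallback idea from the API reliability bullet, assuming two hypothetical provider wrappers; real code would also add exponential backoff with jitter, timeouts, and structured logging of every attempt to feed the monitoring described above.

```python
# Retry-with-fallback sketch: try the primary model a few times, then fall back.
# `call_primary_model` and `call_fallback_model` are hypothetical provider wrappers.
import time

def call_primary_model(prompt: str) -> str:
    # Placeholder for the preferred provider's API call.
    raise TimeoutError("primary provider unavailable")  # simulate an outage

def call_fallback_model(prompt: str) -> str:
    # Placeholder for a secondary / cheaper provider.
    return "[fallback model response]"

def call_with_fallback(prompt: str, retries: int = 3, backoff_s: float = 0.5) -> str:
    for attempt in range(1, retries + 1):
        try:
            return call_primary_model(prompt)
        except Exception as err:
            # Log the failure so monitoring can surface flaky providers.
            print(f"primary attempt {attempt} failed: {err}")
            time.sleep(backoff_s * attempt)  # simple linear backoff
    # All retries exhausted: switch providers instead of surfacing an outage.
    return call_fallback_model(prompt)

if __name__ == "__main__":
    print(call_with_fallback("Summarize today's tickets."))
```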

4. Continuous Improvement

After deployment, the work continues by capturing user feedback and refining the system [12:21:00].

  • Feedback Loop: Capture user responses to identify edge cases in production and continuously improve workflows [12:26:00]. Rerun evaluations with this new data to test solutions for identified issues [12:38:00].
  • Caching Layer: Implement caching for repeat queries to drastically reduce costs and improve latency by storing frequent responses and serving them instantly instead of calling expensive LLMs repeatedly (sketched after this list) [12:47:00].
  • Fine-tuning Custom Models: Over time, use accumulated production data to fine-tune custom models, which can create better responses for specific use cases, reduce reliance on API calls, and lower costs [13:16:00].
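
The caching idea above can be as simple as keying responses on a normalized prompt, as in this sketch. `call_llm` is again a placeholder, and a production cache would typically live in an external store such as Redis with a TTL rather than in-process memory.

```python
# Response-cache sketch: serve repeat queries from a local store instead of
# re-calling an expensive LLM. `call_llm` is a hypothetical placeholder.
import hashlib

_cache: dict[str, str] = {}

def _key(prompt: str) -> str:
    # Normalize and hash the prompt so trivial whitespace/case differences still hit.
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def call_llm(prompt: str) -> str:
    # Placeholder for a real (and comparatively slow, costly) LLM call.
    return f"[fresh LLM answer to: {prompt}]"

def cached_call(prompt: str) -> str:
    key = _key(prompt)
    if key in _cache:
        return _cache[key]           # cache hit: instant and free
    response = call_llm(prompt)      # cache miss: pay for one real call
    _cache[key] = response
    return response

if __name__ == "__main__":
    print(cached_call("What is our refund policy?"))   # miss
    print(cached_call("what is our refund policy? "))  # hit, after normalization
```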

Agentic Workflows

Agentic workflows are becoming increasingly important, especially since they can use a wide range of tools, call different APIs, and have multi-agent structures executing tasks in parallel [13:35:00]. For agentic workflows, evaluation not only measures performance but also assesses agent behavior, ensuring they make correct decisions and follow intended logic [13:53:00].

Every AI workflow has some level of agentic behavior; the difference is a question of control, reasoning, and autonomy [14:36:00]. The following framework defines different levels of agentic behavior:

Levels of Agentic Behavior

  • L0: Basic LLM Call with Retrieval [15:19:00]:
    • Involves an LLM call, data retrieval from a vector database, and inline evaluations [15:21:00].
    • No explicit reasoning, planning, or decision-making beyond what’s defined in the prompt [15:30:00]. The model performs all reasoning within the prompt [15:38:00].
  • L1: Tool Use [15:52:00]:
    • The AI system can use tools and decides when to call APIs or retrieve more data (see the sketch after this framework) [15:59:00].
    • Memory becomes key for multi-threaded conversations, capturing context throughout the workflow [16:23:00].
    • Evaluation is needed at every step to ensure correct decisions and accurate responses when using tools [16:37:00]. These workflows can range from simple to highly complex with multiple branching paths and tools [16:50:00].
    • Many production-grade solutions currently fall within the L1 segment [20:40:00]. The focus here is on orchestration: how models interact with systems and data [21:01:00].
  • L2: Structured Reasoning [17:12:00]:
    • Workflows move beyond simple tool use to structured reasoning [17:26:00].
    • The system notices triggers, plans actions, and executes tasks in a structured sequence, breaking down tasks into multiple steps, retrieving information, calling tools, and refining its process in a continuous loop [17:28:00].
    • Agentic behavior is more intentional, actively deciding what needs to be done and spending more time to think [17:54:00].
    • The process is still finite, terminating once steps are completed [18:16:00].
    • This is where most innovation is expected to happen, with many AI agents developed to plan and reason using models like o1 or o3 [21:41:00].
  • L3: Proactive Autonomy [18:33:00]:
    • The system proactively takes actions without waiting for direct input [18:45:00].
    • Instead of terminating after a single request, it stays alive, continuously monitors its environment (e.g., email, Slack, Google Drive), and reacts as needed [18:50:00].
    • This makes AI workflows less of a tool and more of an independent system, capable of making work easier [19:19:00].
  • L4: Fully Creative / Inventor [19:38:00]:
    • The AI moves beyond automation and reasoning to become an inventor [19:44:00].
    • It can create its own new workflows, utilities (agents, prompts, function calls, tools), and solve problems in novel ways [19:50:00].
    • Currently, true L4 is out of reach due to model constraints like overfitting and inductive bias [20:08:00]. The goal is AI that invents, improves, and solves problems in unforeseen ways [20:30:00].

L3 and L4 are still limited by current models and surrounding logic, though innovation is occurring in these areas as well [22:22:00].
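
To ground the L1 level described above (referenced in its bullet), here is a sketch of a single tool-use decision: the model output is parsed for a tool request, the orchestrator executes the tool, the result is appended to memory, and each step can be inspected by evaluations. The `call_llm` helper and the "TOOL:" convention are illustrative assumptions, not any vendor's function-calling API.

```python
# L1 tool-use sketch: the model decides whether to call a tool, the orchestrator
# executes it, and the result is fed back with conversation memory.
# `call_llm` and the "TOOL:" convention are illustrative assumptions only.
import json

def get_weather(city: str) -> str:
    # Example tool; in practice this would hit a real API.
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 21})

TOOLS = {"get_weather": get_weather}

def call_llm(messages: list[dict]) -> str:
    # Placeholder model: requests a tool the first time, answers once it has the result.
    if not any(m["role"] == "tool" for m in messages):
        return 'TOOL: {"name": "get_weather", "args": {"city": "Berlin"}}'
    return "It is sunny and 21°C in Berlin."

def run_l1_agent(user_input: str) -> str:
    memory = [{"role": "user", "content": user_input}]  # multi-turn context
    for _ in range(3):  # cap the loop so the workflow stays finite
        reply = call_llm(memory)
        if reply.startswith("TOOL:"):
            request = json.loads(reply[len("TOOL:"):])
            result = TOOLS[request["name"]](**request["args"])
            # Evaluation hook: each tool decision and result can be logged and scored here.
            memory.append({"role": "tool", "content": result})
            continue
        return reply
    return "Stopped after too many tool calls."

if __name__ == "__main__":
    print(run_l1_agent("What's the weather in Berlin?"))
```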

Example: SEO Agent

An example of an AI agent that automates the SEO process (keyword research, content analysis, content creation) falls between the L1 and L2 levels of agentic workflow [23:04:00].

The agent’s workflow components and process:

  1. SEO Analyst and Researcher: Takes a keyword, writing style, and audience parameters. Calls Google Search to analyze top-performing articles, identifying strong points to amplify and missing segments for improvement [23:43:00]. The researcher then conducts further searches to capture more data on missing pieces [25:53:00].
  2. Writer: Takes the gathered information as context to create a detailed first draft [26:02:00]. This content utilizes all the provided context intelligently, potentially integrating with a RAG system of internal articles and learnings [26:21:00].
  3. Editor: An LLM-based judge evaluates the first draft against predefined rules and provides feedback [24:19:00].
  4. Loop and Memory: The feedback is passed back to the writer via a memory component (the chat history shared between the writer and editor). This loop continues until a specific criterion is met (e.g., an “excellent post” evaluation, or a set number of iterations), as sketched after this list [24:31:00].
  5. Final Article: The process yields a useful, well-informed piece of content, saving significant time [24:49:00].
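
The writer/editor loop in steps 2-4 can be expressed as a bounded feedback cycle. The sketch below uses hypothetical `writer_llm` and `editor_llm` placeholders, with a shared history list standing in for the memory component and a toy approval condition standing in for the editor's real evaluation rules.

```python
# Writer/editor loop sketch: the editor judges each draft and its feedback is
# appended to shared memory until the draft passes or iterations run out.
# `writer_llm` and `editor_llm` are hypothetical placeholders.

def writer_llm(history: list[str]) -> str:
    # Placeholder writer: produces a new draft using all prior context and feedback.
    return f"Draft v{len(history)} about the target keyword."

def editor_llm(draft: str) -> tuple[bool, str]:
    # Placeholder LLM judge: toy condition that approves the second draft.
    approved = "v2" in draft
    feedback = "" if approved else "Add more detail on the missing subtopics."
    return approved, feedback

def write_article(research_notes: str, max_iterations: int = 5) -> str:
    history = [f"Research notes: {research_notes}"]  # shared memory between agents
    draft = ""
    for _ in range(max_iterations):
        draft = writer_llm(history)
        approved, feedback = editor_llm(draft)
        if approved:
            break  # stop once the editor judges the post good enough
        history.append(f"Editor feedback: {feedback}")
    return draft

if __name__ == "__main__":
    print(write_article("Top articles cover X and Y but miss Z."))
```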

This example highlights the complexity of orchestrating multiple AI components and tools within a single workflow, where continuous improvement and evaluation are paramount [25:25:00]. Tools like Vellum Workflows and their SDK are designed to bridge the gap between product and engineering teams, speeding up AI development while adhering to a test-driven approach [27:49:00]. They provide building blocks, flexibility, and self-documenting syntax, keeping UI and code in sync for collaborative definition, debugging, and improvement of workflows [28:14:00].