From: aidotengineer
AI product success is not just about the underlying models, but about how solutions are built around them [04:15:01]. Building stronger, more reliable AI systems in production is achievable through a test-driven development (TDD) approach [00:18:01].
The Evolution of AI and Its Current State
In 2023, many AI wrappers were developed, amid debates about their defensibility [00:44:01]. Today, Cursor AI, an AI-powered IDE, has achieved significant growth, demonstrating the potential of AI applications [00:54:01]. This success is partly due to models getting better at coding and a surge in AI adoption [01:11:01]. More importantly, new techniques and patterns have emerged to orchestrate these models, enabling them to work more effectively with data and perform reliably in production [01:26:01].
Key challenges with model performance persist, including hallucinations and overfitting [01:41:01]. While model providers have shipped better tooling, significant leaps in raw model performance, similar to the jump from GPT-3.5 to GPT-4, have slowed down [01:48:01]. For years, making models bigger and adding more data made them smarter, but this approach has hit a wall, with improvements slowing and models reaching their limits on existing tests [02:01:01].
However, new training methods have emerged in recent months, pushing the field forward [02:32:01]:
- DeepSeek R1 Model: The first model trained without labeled data, using “real reinforcement learning,” allowing it to learn autonomously [02:42:01].
- Reasoning Models: Models like OpenAI’s o1 and o3 use chain-of-thought reasoning at inference time to generate answers, enabling them to “think” before responding and solve more complex reasoning problems [02:57:01].
- Enhanced Capabilities: Model providers are offering more capabilities, including tool use, research capabilities, and near-perfect OCR accuracy (e.g., Gemini 2.0 Flash) [03:24:01].
As traditional benchmarks become saturated, new ones are being introduced to capture the performance of these advanced reasoning models, such as Humanity’s Last Exam, which measures performance on truly difficult tasks [03:41:01].
Evolution of AI Techniques
Alongside model training, techniques for interacting with and orchestrating models have evolved [04:20:01]:
- Prompting Techniques: More advanced methods like Chain of Thought were developed [04:25:01].
- Retrieval Augmented Generation (RAG): Important for grounding model responses with proprietary data [04:31:01].
- Memory: Critical for multi-threaded conversations in AI workflows [04:42:01].
- Long Context Models: Enabled new use cases [04:47:01].
- Graph RAG: Experimentation with hierarchical retrieval and relationships between responses [04:52:01].
- Agentic RAG: Making workflows more powerful and autonomous [05:12:01].
The Test-Driven Development Approach
To build an AI product that truly works in production, it’s crucial to understand the problem deeply and adopt a test-driven development approach to find the right mix of techniques, models, and logic [05:22:01].
The best AI teams follow a structured approach:
- Experiment [05:52:01]
- Evaluate [05:54:01]
- Scale [05:54:01]
- Deploy [05:56:01]
- Continuously Monitor, Observe, and Improve [06:01:01]
1. Experimentation Phase
Before building anything production-grade, extensive experimentation is needed to prove whether AI models can solve the use case [06:14:01].
- Try different prompting techniques: Few-shot or Chain of Thought, suitable for simple to complex reasoning tasks [06:23:01].
- Test various techniques: Prompt chaining (splitting instructions into multiple sequential prompts) or agentic workflows like ReAct, which interleave planning, reasoning, and refinement (see the prompt-chaining sketch after this list) [06:33:01].
- Involve domain experts: Engineers shouldn’t be the sole prompt tweakers; domain experts save engineering time [06:53:01].
- Stay model agnostic: Incorporate and test different models, selecting those best suited for specific tasks (e.g., Gemini 2.0 Flash for OCR) [07:13:01].
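As a rough illustration of prompt chaining while staying model agnostic, here is a minimal Python sketch. The `call_llm` helper, the model name, and the two prompts are hypothetical placeholders, not a specific provider's API:

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in: route the prompt to the chosen provider's SDK and return its reply."""
    return f"[{model} reply to: {prompt[:40]}...]"


def extract_requirements(document: str, model: str) -> str:
    # Step 1 of the chain: a narrow prompt that only pulls out requirements.
    return call_llm(model, f"List the key requirements in this document:\n\n{document}")


def draft_plan(requirements: str, model: str) -> str:
    # Step 2: a second prompt that works only from step 1's output,
    # instead of one long prompt asking for everything at once.
    return call_llm(model, f"Write a short plan that satisfies these requirements:\n\n{requirements}")


def chained_workflow(document: str, model: str = "model-under-test") -> str:
    # Keeping the model a parameter makes it easy to swap providers per task
    # while experimenting (e.g., one model for extraction, another for drafting).
    return draft_plan(extract_requirements(document, model), model)


print(chained_workflow("Build an onboarding email sequence for new users."))
```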
2. Evaluation Phase
This stage ensures the workflow will work in production under high request volumes [07:44:01].
- Create a dataset: Develop a dataset of hundreds of examples for testing models and workflows [07:57:01].
- Balance trade-offs: Weigh quality, cost, latency, and privacy; no AI system optimizes all of them at once [08:06:01], so define priorities early [08:27:01].
- Use ground truth data: Having subject matter experts design evaluation datasets and testing against them is highly beneficial [08:32:01]. Synthetic benchmarks are less effective for specific use cases [08:46:01].
- LLM evaluation: If ground truth data is unavailable, an LLM can evaluate another model’s response [08:58:01].
- Flexible testing framework: Ensure the framework is dynamic, able to capture non-deterministic responses and let you define custom metrics in code, whether Python or TypeScript (a minimal Python sketch follows this list) [09:14:01]. Customizability is crucial [09:42:01].
- Run evaluations at every stage: Implement guard rails to check internal nodes and ensure correct responses at every step of the workflow [09:48:01]. Test during prototyping and re-evaluate with real data [10:03:01].
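To make these evaluation ideas concrete, the following is a minimal, framework-agnostic Python sketch that combines a deterministic custom metric with an LLM-as-judge fallback for examples lacking ground truth. The `call_llm` stub, the judging prompt, and the dataset rows are illustrative assumptions:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real provider call; returns a canned verdict so the sketch runs."""
    return "yes"


def exact_match(output: str, expected: str) -> float:
    # Deterministic metric for rows that do have ground truth.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def llm_judge(question: str, output: str) -> float:
    # LLM-as-judge for rows without ground truth: ask another model to grade the answer.
    verdict = call_llm(
        f"Question: {question}\nAnswer: {output}\n"
        "Is this answer accurate and complete? Reply yes or no."
    )
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0


dataset = [
    {"question": "What is 2 + 2?", "output": "4", "expected": "4"},
    {"question": "Summarize the refund policy.", "output": "Refunds within 30 days."},
]

scores = [
    exact_match(row["output"], row["expected"]) if "expected" in row
    else llm_judge(row["question"], row["output"])
    for row in dataset
]
print(f"pass rate: {sum(scores) / len(scores):.0%}")
```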
3. Scaling and Deployment
Once the team is satisfied with the workflow’s performance, it’s ready for production [10:23:01].
- Monitor more than deterministic outputs: Because AI models are unpredictable, log every LLM call and track inputs, outputs, and latency [10:35:01]. This is especially important for agentic workflows [10:56:01].
- Handle API reliability: Maintain stability with retries and fallback logic to prevent outages, for example switching to another model during provider downtime (see the sketch after this list) [11:09:01].
- Version control and staging: Deploy in controlled environments before wider public release to prevent regressions from prompt updates [11:35:01].
- Decouple deployments: AI features may need more frequent updates than the overall application, so deployment schedules should be separate [12:00:01].
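A hedged sketch of what retry-and-fallback logic with basic call logging might look like; the model names are placeholders and `call_model` stands in for whichever provider SDK is actually used:

```python
import logging
import time


def call_model(model: str, prompt: str) -> str:
    """Stand-in for the real provider SDK call; may raise on timeouts or 5xx errors."""
    raise TimeoutError("simulated provider outage")


def call_with_fallback(prompt: str,
                       models: tuple[str, ...] = ("primary-model", "backup-model"),
                       retries: int = 2,
                       backoff_seconds: float = 1.0) -> str:
    # Try each model in order; retry transient failures before falling back.
    last_error: Exception | None = None
    for model in models:
        for attempt in range(retries):
            start = time.perf_counter()
            try:
                result = call_model(model, prompt)
                # Log inputs, outputs, and latency for later inspection.
                logging.info("model=%s latency=%.2fs", model, time.perf_counter() - start)
                return result
            except Exception as err:  # in practice, catch the provider's specific error types
                last_error = err
                logging.warning("model=%s attempt=%d failed: %s", model, attempt + 1, err)
                time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError("all models failed") from last_error
```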
4. Continuous Improvement
After deployment, capture user feedback to identify edge cases and continuously improve the workflow [12:21:01].
- Feedback loop: Use captured responses to run evaluations again and test new prompts for new cases [12:26:01].
- Caching layer: For repeat queries, caching can significantly reduce costs and improve latency by serving frequent responses instantly instead of calling expensive LLMs (a minimal sketch follows this list) [12:46:01].
- Fine-tune custom models: Use reliable production data to fine-tune a custom model for specific use cases, reducing reliance on API calls and lowering costs [13:16:01].
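A minimal caching sketch, assuming an in-memory dictionary for illustration; a production system would more likely use a shared store such as Redis or a semantic cache:

```python
import hashlib

_cache: dict[str, str] = {}


def call_llm(prompt: str) -> str:
    """Stand-in for an expensive provider call."""
    return f"(model answer to: {prompt})"


def cached_llm(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:        # cache miss: pay for one model call
        _cache[key] = call_llm(prompt)
    return _cache[key]           # cache hit: instant, near-zero-cost reply


print(cached_llm("What are your support hours?"))  # first call hits the model
print(cached_llm("What are your support hours?"))  # repeat query served from the cache
```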
Challenges and Evolution of Agentic Workflows
The TDD process becomes even more crucial with agentic workflows, which use a wide range of tools, call different APIs, and may have multi-agent structures executing tasks in parallel [13:34:01]. Evaluation must not only measure performance at every step but also assess the behavior of agents to ensure correct decisions and intended logic [13:53:01].
Every AI workflow has some level of agentic behavior, differing in control, reasoning, and autonomy [14:40:01]. A framework defines different levels of agentic behavior:
L0: Basic LLM Call with RAG
- An LLM call retrieves data from a vector database, with some inline evaluations [15:19:01].
- No external agent organizes decisions or plans actions; the model performs all reasoning within the prompt (a minimal sketch follows) [15:32:01].
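A minimal L0 sketch, assuming a stubbed retriever and `call_llm`: one retrieval, one grounded prompt, and no agent making decisions:

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Stand-in for a vector-database similarity search."""
    return ["snippet about pricing", "snippet about refunds", "snippet about support"][:k]


def call_llm(prompt: str) -> str:
    """Stand-in for a provider call."""
    return "(answer grounded in the retrieved context)"


def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    # No agent decides anything here; the single prompt carries all the reasoning.
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")


print(answer("How do refunds work?"))
```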
L1: Tool Use
- The AI system can use tools and decide when to call APIs or retrieve data from a vector database [15:56:01].
- This shows more agentic behavior [16:10:01].
- Memory is key for multi-threaded conversations, capturing context throughout the workflow [16:23:01].
- Evaluation is needed at every step to ensure correct decisions and accurate responses [16:37:01]. These workflows can range from simple to complex, with multiple branching paths and numerous tools [16:50:01].
- Current production-grade solutions often fall within L1, focusing on orchestration (a minimal sketch of an L1 tool-call decision follows) [20:53:01].
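A minimal L1 sketch, assuming a stubbed `call_llm` that returns a routing decision and two hypothetical tools; a real implementation would typically use the provider's native tool-calling API:

```python
import json


def call_llm(prompt: str) -> str:
    """Stand-in for a provider call; here it returns a canned routing decision."""
    return json.dumps({"tool": "search_orders", "argument": "order #1234"})


TOOLS = {
    "search_orders": lambda arg: f"status of {arg}: shipped",
    "search_docs": lambda arg: f"top documentation snippet for {arg}",
}


def run_turn(user_message: str, memory: list[str]) -> str:
    memory.append(f"user: {user_message}")  # memory carries context across turns
    decision = json.loads(call_llm(
        f"Conversation so far: {memory}\n"
        f"Pick one tool from {list(TOOLS)} and an argument; reply as JSON."
    ))
    tool_result = TOOLS[decision["tool"]](decision["argument"])  # the system chose the tool
    reply = f"Based on the lookup: {tool_result}"
    memory.append(f"assistant: {reply}")
    return reply


print(run_turn("Where is order #1234?", memory=[]))
```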
L2: Structured Reasoning
- Workflows move from simple tool use to structured reasoning [17:15:01].
- The system notices triggers, plans actions, and executes tasks in a structured sequence, breaking down tasks into multiple steps [17:28:01].
- It can retrieve information, call tools, evaluate their usefulness, and refine as needed in a continuous loop [17:41:01].
- Agentic behavior becomes more intentional, with the system actively deciding what needs to be done and spending more time “thinking” [17:55:01].
- The process is still finite, terminating once steps are completed [18:16:01].
- L2 is expected to see significant innovation this year, with AI agents developed for planning and reasoning using advanced models (a minimal sketch of such a loop follows) [21:41:01].
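A rough sketch of an L2-style loop, under the assumption that `call_llm`, the planner, and the evaluation prompt are placeholders: the system plans steps, executes them, evaluates the result, and refines for a bounded number of rounds:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a provider call; returns canned output so the sketch runs."""
    return "looks good" if prompt.startswith("Evaluate") else "(step output)"


def plan(task: str) -> list[str]:
    # A real planner would come from the model; here the plan is canned.
    return [f"research: {task}", f"draft: {task}", f"review: {task}"]


def run_task(task: str, max_rounds: int = 3) -> str:
    result = ""
    for _ in range(max_rounds):                 # finite: the loop always terminates
        for step in plan(task):                 # execute the planned steps in sequence
            result = call_llm(f"Execute this step: {step}\nPrevious output: {result}")
        verdict = call_llm(f"Evaluate this result for '{task}': {result}")
        if "good" in verdict:                   # evaluator accepts the result
            break                               # otherwise loop again and refine
    return result


print(run_task("competitive analysis of note-taking apps"))
```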
L3: Proactive Autonomy
- Systems proactively take actions without waiting for direct input, continuously monitoring their environment and reacting as needed [18:33:01].
- They can access external services (email, Slack, Google Drive) to plan next moves and execute actions or ask for human input [19:02:01].
- The AI workflow becomes less of a tool and more of an independent system, for example one that acts like a marketer preparing a video [19:19:01].
L4: Fully Creative/Inventor AI
- This stage moves beyond automation and reasoning, with AI becoming an inventor [19:39:01].
- Instead of predefined tasks, it creates its own new workflows, utilities, agents, prompts, function calls, and tools [19:50:01].
- It solves problems in novel ways [20:06:01].
- True L4 is currently out of reach due to model constraints such as overfitting (models clinging to patterns in their training data) and inductive bias (models making assumptions based on that data) [20:12:01].
- The goal is AI that invents, improves, and solves problems in unforeseen ways [20:30:01].
L3 and L4 are still limited by current models and surrounding logic, though innovation is occurring in these areas [22:22:01].
Practical Example: SEO Agent Workflow
An SEO agent automates the SEO process from keyword research to content analysis and creation [23:04:01]. This particular agent sits between L1 and L2 on the agentic spectrum [23:37:01].
Workflow Steps:
- SEO Analyst: Takes a keyword, writing style, and audience parameters. Calls Google Search to analyze top-performing articles, identifying strong points to amplify and missing segments for improvement [23:43:01].
- Researcher: Utilizes the identified missing opportunities to perform further searches and capture more data [25:53:01].
- Writer: Takes all collected information as input to create a comprehensive first draft. The content is not generic but intelligently uses context from analyzed articles; it can also connect to a RAG database of internal articles and learnings [26:02:01].
- Editor (LLM-based judge): Evaluates the first draft based on predefined rules in its prompt and provides feedback [24:19:01].
- Refinement Loop: The feedback is passed back to the writer through a memory component (chat history). This loop continues until a specific criterion is met, for example the editor rating the post as “excellent” or a maximum number of loops being reached (a minimal sketch of this loop follows the list) [24:31:01].
- Final Article: Produces a useful and impressive first draft [24:49:01].
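A minimal sketch of the writer/editor refinement loop, assuming a stubbed `call_llm`, an illustrative “excellent” stopping criterion, and a plain list standing in for the memory component:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a provider call; returns canned text so the sketch runs."""
    return "excellent" if prompt.startswith("Editor") else "(next draft)"


def write_post(research: str, max_loops: int = 4) -> str:
    chat_history: list[str] = []             # the memory component shared by writer and editor
    draft = call_llm(f"Writer: create a first draft from this research:\n{research}")
    for _ in range(max_loops):               # cap the loop so it always terminates
        feedback = call_llm(f"Editor: review this draft against the style rules:\n{draft}")
        chat_history.append(f"feedback: {feedback}")
        if "excellent" in feedback.lower():  # editor signs off, stop refining
            break
        history = "\n".join(chat_history)
        draft = call_llm(f"Writer: revise the draft using this feedback:\n{feedback}\n"
                         f"History:\n{history}")
    return draft


print(write_post("keyword research, top-article analysis, missing segments"))
```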
This workflow saves significant time by providing a strong foundation for content creation [27:47:01].
Tools for AI Development
Platforms like Vellum Workflows are designed to bridge the gap between product and engineering teams, speeding up AI development while adhering to the test-driven approach [27:49:01]. An SDK offers building blocks, extensive customization, and self-documenting syntax so developers can own their definitions in the codebase [28:13:01]. Key features include:
- Code-UI Synchronization: The UI and code stay in sync for workflow definition, debugging, and improvement, ensuring team alignment [28:34:01].
- Open Source and Free: Available on GitHub [28:43:01].