From: aidotengineer

To build effective AI agents that stay relevant and helpful over time, the focus should not be on simply powering them with the latest, largest Large Language Model (LLM) [00:00:16]. Instead, they need simple data flywheels [00:00:21]. This article explains what data flywheels are, how NVIDIA applied one to an internal agent, the lessons learned along the way, and a framework for building data flywheels for your own AI agent use cases [00:00:24].

What Are AI Agents?

AI agents are currently generating significant buzz and are beginning to integrate into the workforce as new digital employees [00:00:48]. They manifest in various forms and sizes depending on their specific use case, such as customer service agents, software security agents, or research agents [00:00:57].

Fundamentally, agents are systems capable of perceiving, reasoning, and acting on an underlying task [00:01:09]. This means they can analyze data, formulate a reasonable plan to address a user query, and utilize tools, functions, and external systems to complete the task [00:01:20]. A complete cycle for AI agents involves capturing and learning from user feedback, continuously refining themselves based on user preferences and data to become more accurate and useful [00:01:36].
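
Conceptually, that perceive-reason-act cycle can be reduced to a few lines. Below is a minimal, hypothetical sketch in plain Python; the `llm` callable, the planning prompt, and the "tool_name: argument" format are illustrative assumptions, not any particular framework's API:

```python
# A minimal perceive-reason-act loop, sketched in plain Python. The `llm`
# callable and the tool-call format are hypothetical placeholders.

def agent_step(llm, tools: dict, query: str, context: list[str]) -> str:
    observation = "\n".join(context + [query])                  # perceive
    plan = llm(f"Plan the next action for:\n{observation}\n"
               "Reply either 'tool_name: argument' or a direct answer.")  # reason
    tool_name, sep, tool_arg = plan.partition(":")
    if sep and tool_name.strip() in tools:                      # act via a tool
        return tools[tool_name.strip()](tool_arg.strip())
    return plan                                                 # answer directly
```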

Challenges in AI Agent Development

Building effective agents can be difficult, and scaling them raises further technical challenges [00:01:56]. Key challenges include:

  • Rapid Data Change: Enterprise data, such as business intelligence, flows into systems constantly [00:02:04].
  • Evolving User Preferences: User preferences and customer needs are always changing [00:02:16].
  • Increased Inference Cost: Deploying larger, “chunkier” language models to support these use cases drives up inference costs [00:02:20], and heavier usage translates directly into higher costs [00:02:31].

Data Flywheels: The Solution

Data flywheels are a continuous loop or cycle of data processing and curation [00:02:44]. They involve:

  1. Model customization [00:02:48].
  2. Evaluation [00:02:50].
  3. Guardrailing for safer interactions [00:02:52].
  4. Building state-of-the-art Retrieval-Augmented Generation (RAG) pipelines alongside enterprise data to provide relevant and accurate responses [00:02:54].

As AI agents operate in production environments, this data flywheel cycle continuously curates ground truth data using inference data, business intelligence, and user feedback [00:03:07]. This process enables continuous experimentation and evaluation of existing and newer models to identify efficient, smaller models that provide comparable accuracy to larger models but offer lower latency, faster inference, and reduced total cost of ownership [00:03:20].
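
As a concrete illustration of the selection step at the end of this loop, here is a minimal sketch in plain Python (not a NeMo API) of promoting the cheapest candidate that stays within an accuracy margin of the current champion. The model names, costs, and margin are placeholders that loosely mirror the case study later in this article:

```python
# Minimal sketch of the promotion decision: keep the cheapest candidate
# within an accuracy margin of the current champion model.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    accuracy: float
    cost_per_1k_tokens: float   # illustrative relative cost units

def pick_model(candidates: list[EvalResult], champion: EvalResult,
               margin: float = 0.02) -> EvalResult:
    viable = [c for c in candidates if c.accuracy >= champion.accuracy - margin]
    return min(viable, key=lambda c: c.cost_per_1k_tokens, default=champion)

# Placeholder numbers loosely mirroring the case study below.
champion = EvalResult("llama-70b", 0.96, 1.00)
candidates = [
    EvalResult("llama-8b-finetuned", 0.96, 0.12),
    EvalResult("llama-1b-finetuned", 0.94, 0.02),
    champion,
]
print(pick_model(candidates, champion).model)   # -> llama-1b-finetuned
```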

NVIDIA NeMo Microservices

NVIDIA has announced NeMo Microservices, an end-to-end platform for building effective agents and generative AI systems, as well as powerful data flywheels around them [00:03:52]. NeMo Microservices provide components for each stage of the data flywheel loop:

  • NeMo Curator: Helps curate high-quality training datasets, including multimodal data [00:04:13].
  • NeMo Customizer: Facilitates fine-tuning and customizing underlying models using state-of-the-art techniques such as LoRA, p-tuning, and full supervised fine-tuning (SFT) [00:04:21].
  • NeMo Evaluator: Benchmarks models on academic and institutional benchmarks, and supports using LLMs as judges [00:04:34].
  • NeMo Guardrails: Adds guardrails to interactions for privacy, security, and safety [00:04:47].
  • NeMo Retriever: Enables building state-of-the-art RAG pipelines [00:04:51].

These microservices are exposed as simple-to-use API endpoints, allowing users to customize, evaluate, and guardrail large language models with just a few API calls [00:04:57]. They offer the flexibility to run on-premises, in the cloud, in data centers, or at the edge, with enterprise-grade stability and support from NVIDIA [00:05:14].
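
To give a feel for that "few API calls" workflow, here is a purely illustrative sketch. The endpoint paths, payload fields, dataset, and model names below are invented placeholders, not the documented NeMo Microservices API; consult NVIDIA's documentation for the real interface:

```python
# Illustrative only: paths, payloads, and names are placeholder assumptions,
# NOT the documented NeMo Microservices API.
import requests

BASE = "http://nemo.example.internal"  # assumed on-prem deployment URL

# Step 1: trigger a fine-tuning job (customization).
job = requests.post(f"{BASE}/v1/customization/jobs", json={
    "base_model": "llama-3.1-8b-instruct",   # placeholder model name
    "technique": "lora",                     # placeholder technique field
    "dataset": "router-ground-truth-v1",     # placeholder dataset name
}).json()

# Step 2: benchmark the fine-tuned model (evaluation).
evaluation = requests.post(f"{BASE}/v1/evaluation/jobs", json={
    "model": job.get("output_model", ""),
    "benchmark": "router-accuracy-testset",  # placeholder benchmark name
}).json()
print(evaluation)
```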

Sample Data Flywheel Architecture

A data flywheel architecture can be assembled from NeMo Microservices like Lego pieces [00:05:32]. In a typical setup, an end user interacts with the front end of an agent (e.g., a customer service agent) [00:05:43]. The interaction is guardrailed for safety, and on the back end a model served as an NVIDIA NIM (NVIDIA Inference Microservice) provides optimized inference [00:05:53].

To identify the optimal model without compromising accuracy, a data flywheel loop continuously curates data, stores it in the NeMo data store, and uses NeMo Customizer and Evaluator to trigger continuous retraining and evaluation [00:06:02]. Once a model meets the target accuracy, an IT admin or AI engineer can promote it to power the agent’s use case as the underlying NIM [00:06:20].

Real-World Case Study: NVInfo Agent

NVIDIA adopted and built this data flywheel for their NVInfo agent, an internal employee support agent [00:06:42]. This agent assists NVIDIA employees with access to enterprise knowledge across various domains, acting as a customer service or employee support chatbot [00:06:50]. It can answer queries spanning HR benefits, financial earnings, IT help, product documentation, and other internal employee needs [00:07:06].

The NVInfo agent’s data flywheel architecture involves an employee submitting a query, which is guardrailed for safety [00:07:26]. A crucial router agent, orchestrated by an LLM, manages multiple underlying expert agents [00:07:37]. Each expert agent specializes in a specific domain and is augmented with a RAG pipeline to retrieve relevant information [00:07:47].
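
To make the router-plus-experts pattern concrete, here is a hypothetical sketch of the routing step. The domain list, prompt, `llm` callable, and `expert_agents` mapping are illustrative assumptions, not the NVInfo agent's actual implementation:

```python
# Hypothetical sketch of an LLM-orchestrated router agent.

DOMAINS = ["hr", "finance", "it_help", "product_docs"]

def route(llm, query: str) -> str:
    prompt = (
        f"Classify the employee query into exactly one domain from {DOMAINS}. "
        f"Reply with the domain name only.\n\nQuery: {query}"
    )
    domain = llm(prompt).strip().lower()
    return domain if domain in DOMAINS else "fallback"

def answer(llm, expert_agents: dict, query: str) -> str:
    # Each expert agent wraps its own RAG pipeline over domain documents;
    # `expert_agents` maps every domain (plus "fallback") to a callable.
    return expert_agents[route(llm, query)](query)
```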

To determine which models should power these expert agents, a data flywheel loop is set up [00:08:02]. It continuously builds on user feedback and the inference logs generated in production while the router runs [00:08:12]. Ground truth data is curated with the help of subject matter experts and human-in-the-loop feedback [00:08:20]. NeMo Customizer and Evaluator then continuously evaluate multiple models and promote the most effective one as a NIM to power the router agent [00:08:27].

Router Agent Optimization

The router agent’s core problem is to accurately route a user query to the correct expert agent using a fast and cost-effective LLM [00:09:27]. Initial comparisons of candidate models showed that a 70B variant achieved 96% baseline accuracy on query routing without any fine-tuning [00:09:52], while smaller variants such as the 8B model managed only around 14% [00:10:24].

A common misconception in enterprise evaluations is to conclude from such results that only larger models are viable, given their higher out-of-the-box accuracy [00:10:33]. This is precisely where data flywheels provide a significant advantage [00:10:57].

NVIDIA employees were asked to submit queries and give feedback on the usefulness of the responses generated by the 70B variant [00:11:02]. This produced 1,224 curated data points: 729 satisfactory and 495 unsatisfactory responses [00:11:24]. The 495 unsatisfactory responses were investigated using NeMo Evaluator with an LLM as a judge [00:11:44], which attributed 140 of them to incorrect routing; further manual analysis by subject matter experts confirmed that 32 truly were routing errors [00:11:52].

From this analysis, a ground truth dataset of 685 data points was created and split 60/40 into a training set (for fine-tuning smaller models) and a test/evaluation set [00:12:09]. The results achieved with only these 685 data points were outstanding, a testament to the data flywheel setup [00:12:27].
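
For reference, a 60/40 split like the one described can be reproduced with the standard library alone. The file name below is an assumed placeholder for the curated dataset:

```python
# Reproducing the 60/40 split described above with the standard library.
import json
import random

with open("router_ground_truth.jsonl") as f:     # assumed: 685 curated examples
    data = [json.loads(line) for line in f]

random.seed(42)                  # fixed seed keeps the split reproducible
random.shuffle(data)
cut = int(len(data) * 0.6)       # 60% for fine-tuning
train, test = data[:cut], data[cut:]
print(len(train), len(test))     # 411 train / 274 test for 685 data points
```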

While the 70B variant delivered 96% accuracy, it took 26 seconds to generate the first token; the 8B variant started at 14% accuracy but with roughly 70% lower latency [00:12:36]. After fine-tuning, the 8B model matched the 70B variant’s accuracy [00:13:04]. Even the 1B variant reached 94% accuracy, just two percentage points below the 70B model [00:13:22].

This demonstrates the trade-off between accuracy and cost/resource management [00:13:31]. Deploying a 1B model, for example, could cut inference costs by 98% and deliver a 70x reduction in model size with 70x lower latency [00:13:42]. This is the power of data flywheels: an automated loop for continuous evaluation and fine-tuning that leverages production logs and enterprise knowledge to train smaller, more efficient models capable of replacing larger ones [00:13:59].
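
As a rough sanity check on those figures, a naive back-of-envelope calculation (assuming inference cost scales linearly with parameter count, which real serving costs only approximate) lands close to the numbers cited:

```python
# Back-of-envelope check under a naive linear cost-per-parameter assumption.
params_70b = 70e9
params_1b = 1e9

size_reduction = params_70b / params_1b       # 70x smaller
cost_savings = 1 - params_1b / params_70b     # ~0.986

print(f"{size_reduction:.0f}x model size reduction")
print(f"~{cost_savings:.0%} inference cost savings")  # ~99%, near the ~98% cited
```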

Best Practices for Building AI Agents: A Framework for Data Flywheels

To build effective agents and data flywheels, consider this four-step framework:

1. Monitor User Feedback

Focus on intuitive ways to collect user feedback signals [00:14:48]: a frictionless user experience, privacy-compliant collection, and both implicit and explicit signals for detecting model drift or inaccuracies in the agentic system [00:14:57].
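
One possible shape for logging such signals is sketched below. The field names and example implicit signals are assumptions for illustration, not a NeMo schema:

```python
# Hypothetical feedback-event schema; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    query: str
    response: str
    model: str
    explicit_rating: int | None = None   # e.g. thumbs-up = 1, thumbs-down = 0
    implicit_signal: str | None = None   # e.g. "copied_answer", "rephrased_query"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def is_unsatisfactory(event: FeedbackEvent) -> bool:
    """Flag events that should enter the error-analysis queue (step 2)."""
    return event.explicit_rating == 0 or event.implicit_signal == "rephrased_query"
```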

2. Analyze & Attribute Errors

Spend time analyzing and attributing errors or model drift to understand why the agent behaves the way it does [00:15:12]. Classify these errors, attribute the failures, and create a ground truth dataset for the subsequent steps [00:15:23].
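
An LLM-as-a-judge pass, like the one used in the NVInfo case study, is one way to do a first round of attribution before human review. The error taxonomy, prompt, and `llm` callable below are illustrative assumptions:

```python
# Sketch of LLM-as-a-judge error attribution; taxonomy and prompt are assumed.
ERROR_CLASSES = ["incorrect_routing", "retrieval_miss", "bad_generation", "other"]

def attribute_error(llm, query: str, response: str, routed_to: str) -> str:
    prompt = (
        f"A support agent routed this query to '{routed_to}' and answered:\n"
        f"Query: {query}\nAnswer: {response}\n"
        f"Pick the most likely failure cause from {ERROR_CLASSES}. "
        "Reply with one label only."
    )
    label = llm(prompt).strip()
    return label if label in ERROR_CLASSES else "other"
```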

3. Plan

Identify candidate models, generate synthetic datasets, experiment with and fine-tune them, and plan how to optimize resources and costs [00:15:34].

4. Execute

Execution involves not only triggering a data flywheel cycle but also establishing a regular cadence and mechanism to track accuracy and latency and to monitor performance through production logs [00:15:48], effectively managing the end-to-end GenAI Ops pipeline [00:16:05].
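
A minimal sketch of such a cadence check follows. The `evaluate_fn` and `alert_fn` hooks and the thresholds are placeholder assumptions; in production this would run on a scheduler such as cron or Airflow:

```python
# Hypothetical recurring check: re-evaluate the production model on fresh
# logs and alert on accuracy or latency drift.

ACCURACY_FLOOR = 0.94       # assumed routing-accuracy threshold
LATENCY_CEILING_MS = 500    # assumed p95 latency threshold

def cadence_check(evaluate_fn, alert_fn, model: str) -> None:
    """Run an evaluation pass and alert if the model has drifted."""
    accuracy, p95_latency_ms = evaluate_fn(model)
    if accuracy < ACCURACY_FLOOR or p95_latency_ms > LATENCY_CEILING_MS:
        alert_fn(f"{model}: accuracy={accuracy:.2%}, p95={p95_latency_ms}ms")
```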

This framework provides a robust foundation of best practices for building AI agents and the data flywheels that keep them effective [00:16:12].