From: aidotengineer

AI agents are increasingly becoming part of the workforce as a new kind of digital employee, appearing in forms such as customer service, software security, and research agents [00:50:53]. At their core, AI agents are systems that can perceive, reason, and act on an underlying task [01:12:14]: they analyze data, develop a plan based on a user query, and use tools and external systems to complete the task [01:20:22]. To close this loop effectively, agents must also capture and learn from user feedback, refining themselves to become more accurate and useful over time [01:36:19].
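As a minimal sketch of that perceive-reason-act cycle, the loop below wires the three steps together; `call_llm` and `search_docs` are hypothetical stand-ins, not any specific product API.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion endpoint.
    return f"planned answer for: {prompt[:60]}"

def search_docs(query: str) -> str:
    # Hypothetical tool call; a real agent would hit an external system here.
    return f"top documents matching {query!r}"

def run_agent(user_query: str) -> str:
    context = search_docs(user_query)                          # perceive
    answer = call_llm(f"Context: {context}\nQ: {user_query}")  # reason
    return answer                                              # act; feedback on this answer feeds the flywheel

print(run_agent("How do I enroll in benefits?"))
```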

Challenges in Building and Scaling AI Agents

Building AI agents is challenging, and scaling them is harder still [01:56:19]. Key challenges include:

  • Rapidly changing data: New data and business intelligence constantly flow into the systems that enterprise agents rely on [02:04:01].
  • Evolving user preferences: User preferences and customer needs change constantly [02:16:00].
  • High inference costs: Serving larger language models is expensive, and costs climb as usage grows [02:20:20].

What are Data Flywheels?

Data flywheels are continuous loops that keep AI agents relevant and helpful [00:10:00]. They start with enterprise data and involve:

  • Data processing and curation [02:44:00].
  • Model customization [02:48:00].
  • Evaluation [02:50:00].
  • Guardrailing for safer interactions [02:52:00].
  • Building state-of-the-art RAG (Retrieval-Augmented Generation) pipelines alongside enterprise data to provide accurate responses [02:54:00].

As AI agents operate in production environments, the data flywheel cycle continuously curates ground truth data using inference data, business intelligence, and user feedback [03:07:00]. This enables continuous experimentation and evaluation of models, surfacing efficient, smaller models that offer comparable accuracy to larger models but with lower latency, faster inference, and reduced total cost of ownership [03:20:00].
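As a hedged illustration of one such iteration, the sketch below strings together the curate, split, fine-tune/evaluate, and surface-the-smallest-model steps. Here `curate` and `finetune_and_eval` are toy stand-ins for what NeMo Curator, Customizer, and Evaluator jobs would do in a real deployment.

```python
import random

def curate(logs, feedback):
    # Keep only interactions that received usable feedback (toy curation).
    return [(log, fb) for log, fb in zip(logs, feedback) if fb is not None]

def finetune_and_eval(model: str, train, test) -> float:
    # Stand-in for fine-tuning `model` on `train` and scoring it on `test`.
    return random.uniform(0.90, 1.00)

def flywheel_iteration(logs, feedback, candidates, target_acc=0.96):
    ground_truth = curate(logs, feedback)
    n_train = int(0.6 * len(ground_truth))   # 60/40 split, as in the case study below
    train, test = ground_truth[:n_train], ground_truth[n_train:]
    scores = {m: finetune_and_eval(m, train, test) for m in candidates}
    # Surface the smallest model that clears the bar (candidates ordered smallest first).
    return next((m for m in candidates if scores[m] >= target_acc), None)
```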

NVIDIA NeMo Microservices

NVIDIA offers NeMo Microservices, an end-to-end platform for building powerful agentic and generative AI systems, along with the data flywheels that support them [03:52:00]. The microservices are exposed as simple API endpoints for ease of use [05:00:00] (a sketch of such a call appears after the list below). Key components include:

  • NeMo Curator: Curates high-quality training datasets, including multimodal data [04:13:00].
  • NeMo Customizer: Fine-tunes and customizes underlying models using state-of-the-art techniques such as LoRA, p-tuning, and full supervised fine-tuning (SFT) [04:21:00].
  • NeMo Evaluator: Benchmarks models against academic and institutional benchmarks, and supports evaluation with an LLM-as-a-judge [04:34:00].
  • NeMo Guardrails: Enforces guardrails on interactions for privacy, security, and safety [04:47:00].
  • NeMo Retriever: Builds state-of-the-art RAG pipelines [04:51:00].
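As a rough illustration of what "simple API endpoints" means in practice, the snippet below posts a fine-tuning job over HTTP. The URL, route, payload fields, and response shape are all assumptions made for illustration; consult NVIDIA's NeMo microservices documentation for the actual API.

```python
import requests

# Assumed endpoint and payload, for illustration only.
CUSTOMIZER_URL = "http://nemo-customizer.internal/v1/customization/jobs"

def start_finetune_job(base_model: str, dataset: str, technique: str = "lora") -> str:
    payload = {
        "base_model": base_model,   # e.g. an 8B Llama variant
        "dataset": dataset,         # dataset registered in the data store
        "technique": technique,     # LoRA, p-tuning, or full SFT
    }
    resp = requests.post(CUSTOMIZER_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]    # assumed response field
```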

These microservices can run anywhere: on premises, in the cloud, in data centers, or at the edge, with enterprise-grade stability and support from NVIDIA [05:14:00].

Sample Data Flywheel Architecture

A data flywheel architecture can be assembled from NeMo microservices like Lego pieces [05:32:00]. For an AI agent (e.g., a customer service agent) interacting with an end user, the system is guardrailed for safer interactions [05:43:00]. On the backend, the model is served via NVIDIA NIM for optimized inference [05:55:00].

To identify the most suitable model without sacrificing accuracy, a data flywheel loop is set up to continuously:

  1. Curate data [06:09:00].
  2. Store it in the NeMo Data Store [06:12:00].
  3. Use NeMo Customizer and NeMo Evaluator to trigger continuous retraining and evaluation [06:15:00].

Once a model meets the target accuracy, an IT administrator or AI engineer can promote it to power the agentic use case as the underlying NVIDIA NIM [06:23:00].
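A minimal sketch of that promotion gate, with `evaluate` and `deploy_nim` as hypothetical stand-ins for an Evaluator job and a NIM redeployment:

```python
TARGET_ACCURACY = 0.96

def evaluate(model: str, test_set: list) -> float:
    return 0.96  # stand-in for a NeMo Evaluator job result

def deploy_nim(model: str) -> None:
    print(f"promoting {model} to serve the agent")  # stand-in for redeploying the NIM

def maybe_promote(candidate: str, test_set: list, current: str) -> str:
    # A candidate replaces the serving model only if it meets the target.
    if evaluate(candidate, test_set) >= TARGET_ACCURACY:
        deploy_nim(candidate)
        return candidate
    return current
```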

Real-World Case Study: NV Info Agent

NVIDIA built and adopted a data flywheel for its NV Info agent, an internal employee support agent that provides access to enterprise knowledge across domains such as HR benefits, financial earnings, IT help, and product documentation [06:42:00].

In this system, when an employee submits a query, it is first guardrailed for safe and secure interaction [07:28:00]. A router agent, powered by an LLM, then orchestrates multiple expert agents [07:37:00]. Each expert agent excels in its own domain and is augmented with a RAG pipeline to fetch relevant information [07:47:00].
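A toy sketch of that router-plus-experts layout follows; the expert names and the keyword-based routing rule are made up for illustration (in the real system the router is itself an LLM, and each expert is RAG-augmented):

```python
EXPERTS = {
    "hr":      lambda q: f"HR-benefits answer to {q!r}",
    "it":      lambda q: f"IT-help answer to {q!r}",
    "finance": lambda q: f"earnings answer to {q!r}",
    "docs":    lambda q: f"product-docs answer to {q!r}",
}

def route(query: str) -> str:
    # Stand-in for the router LLM: pick an expert label for this query.
    return "it" if "laptop" in query.lower() else "docs"

def nv_info_agent(query: str) -> str:
    # (Guardrail checks on the query and the answer would sit here.)
    return EXPERTS[route(query)](query)

print(nv_info_agent("My laptop won't boot"))
```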

To select the models powering these expert agents, a data flywheel loop runs continuously on user feedback and production inference logs [08:03:00]. Ground truth data is continuously curated with subject matter experts and human-in-the-loop feedback [08:20:00]. NeMo Customizer and NeMo Evaluator are used to evaluate multiple models and promote the most effective one to power the router agent [08:25:00].

Router Agent Case Study

The router agent's task is to route a user query to the correct expert agent accurately, using a fast and cost-effective LLM [09:27:00].

Initial comparisons showed:

  • 70B Llama variant: Achieved a 96% baseline routing accuracy [09:55:00], but took 26 seconds to generate the first token [12:47:00].
  • 8B Llama variant: Without fine-tuning, routing accuracy was below 14% [10:24:00], but latency was almost 70% lower [12:57:00].

Enterprises often mistakenly conclude that only larger models are viable due to their higher initial accuracy [10:33:00]. However, data flywheels can change this [10:57:00].

NVIDIA deployed the 70B Llama variant and collected feedback from employees via a feedback form [11:02:00]. Of 1,224 data points, 495 were unsatisfactory responses [11:24:00]. Using NeMo Evaluator with an LLM-as-a-judge, 140 of these were attributed to incorrect routing [11:44:00]; manual review by a subject matter expert confirmed that 32 were genuinely routing errors [11:58:00].

This analysis led to a ground truth dataset of 685 data points, split 60/40 into a training set (for fine-tuning smaller models) and a test set (for evaluation) [12:09:00].
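For reference, here is the funnel and split arithmetic from the case study; the exact 60/40 split sizes are derived here and were not stated in the talk:

```python
total_feedback = 1224   # feedback data points collected
unsatisfactory = 495    # flagged as unsatisfactory by users
routing_blamed = 140    # attributed to routing by the LLM judge
confirmed = 32          # confirmed as routing errors by a subject matter expert

ground_truth = 685                   # curated dataset size
train = round(ground_truth * 0.6)    # 411 points for fine-tuning
test = ground_truth - train          # 274 points for evaluation
print(train, test)                   # 411 274
```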

Results After Fine-tuning with Data Flywheel:

  • After fine-tuning on just 685 data points, the 8B variant matched the 96% accuracy of the 70B variant [13:04:00].
  • A 1B variant reached 94% accuracy, only two percentage points below the 70B model [13:22:00].

By deploying the 1B model, NVIDIA achieved 98% savings in inference cost, a 70x reduction in model size, and 70x lower latency [13:42:00]. This shows how data flywheels enable continuous evaluation and fine-tuning, letting smaller, more efficient models replace larger ones in production [13:59:00].
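A back-of-envelope check on those reported gains (illustrative arithmetic only):

```python
params_before, params_after = 70e9, 1e9          # 70B -> 1B parameters
size_reduction = params_before / params_after    # 70x smaller model
cost_savings = 0.98                              # reported inference-cost savings
print(f"{size_reduction:.0f}x smaller, {cost_savings:.0%} cheaper inference")
```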

Framework for Building Effective Data Flywheels

To build effective data flywheels, consider the following framework:

1. Monitor User Feedback

  • Focus on intuitive ways to collect user feedback signals [14:50:00].
  • Consider intuitive user experience, privacy compliance, and both implicit and explicit signals [14:57:00].
  • This helps identify model drift or inaccuracies in the agentic system [15:05:00]; a sample feedback record combining both signal types is sketched below.
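A hypothetical feedback record mixing the explicit and implicit signals mentioned above; the field names are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackEvent:
    query_id: str
    thumbs_up: Optional[bool]   # explicit signal, e.g. from a feedback form
    copied_answer: bool         # implicit signal: the user reused the response
    rephrased_query: bool       # implicit signal: the user had to ask again

def is_unsatisfactory(ev: FeedbackEvent) -> bool:
    # Treat an explicit thumbs-down, or a rephrase without a copy, as a miss.
    if ev.thumbs_up is False:
        return True
    return ev.rephrased_query and not ev.copied_answer
```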

2. Analyze and Attribute Errors

  • Spend time analyzing and attributing errors or model drift to understand why the agent behaves in a certain way [15:12:00].
  • Classify errors and attribute failures [15:21:00].
  • Create a ground truth dataset from this analysis for further use [15:26:00], as sketched below.
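A sketch of that attribution step, with `judge` standing in for an LLM-as-a-judge call and `sme_review` for the human-in-the-loop check:

```python
def judge(query: str, routed_to: str) -> str:
    # Stand-in: a real judge model returns a failure category.
    return "routing_error"

def build_ground_truth(unsatisfactory_logs, sme_review):
    rows = []
    for log in unsatisfactory_logs:
        category = judge(log["query"], log["routed_to"])
        if category == "routing_error" and sme_review(log):  # human in the loop
            rows.append({"query": log["query"], "label": log["correct_expert"]})
    return rows
```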

3. Plan

  • Identify different models suitable for the task [15:34:00].
  • Generate synthetic datasets for experimentation [15:36:00] (see the sketch after this list).
  • Experiment with and fine-tune models [15:39:00].
  • Optimize resources and costs [15:41:00].
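One way the planning step might look in code: enumerate candidate models smallest-first and pad the training set with synthetic paraphrases. The model names are illustrative, and `paraphrase` is a hypothetical LLM call:

```python
CANDIDATES = ["llama-1b", "llama-8b", "llama-70b"]  # smallest first

def paraphrase(query: str) -> str:
    # Stand-in for an LLM-generated variant of the query.
    return f"(rephrased) {query}"

def augment(ground_truth, n_variants=2):
    out = list(ground_truth)
    for row in ground_truth:
        for _ in range(n_variants):
            out.append({"query": paraphrase(row["query"]), "label": row["label"]})
    return out
```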

4. Execute

  • Trigger the data flywheel cycle [15:46:00].
  • Set up a regular cadence and mechanism to track accuracy, latency, and performance [15:53:00], as sketched after this list.
  • Monitor production logs and manage the end-to-end GenAI Ops pipeline [16:05:00].
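A minimal sketch of that cadence, with `run_flywheel` as a hypothetical wrapper around the curate, fine-tune, and evaluate cycle and the metrics worth tracking:

```python
import time

def run_flywheel() -> dict:
    # Stand-in: curate -> fine-tune -> evaluate, then record the numbers to track.
    return {"accuracy": 0.96, "p95_latency_s": 0.4, "cost_per_1k_queries": 0.02}

def main(period_s: float = 7 * 24 * 3600):  # e.g. a weekly cadence
    while True:
        print("flywheel run:", run_flywheel())
        time.sleep(period_s)
```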

The power of the data flywheel is truly unleashed when the system continuously learns from ongoing production logs and accumulated knowledge to train smaller models [14:20:00].