From: aidotengineer
Effective AI agents that remain relevant and helpful over time do not depend solely on the largest available Large Language Models (LLMs) [00:16:18]. Instead, they rely on simple data flywheels [00:21:00]. This article explores what data flywheels are, how they can be applied, and a framework for building them [00:24:00].
What are AI Agents?
AI agents are systems capable of perceiving, reasoning, and acting on underlying tasks [01:12:00]. They process data, devise reasonable plans to address user queries, and utilize tools, functions, and external systems [01:20:00]. A complete agent also captures and learns from user feedback, preferences, and data, continuously refining itself for accuracy and usefulness [01:36:00]. AI agents are emerging as new digital employees in various forms, such as customer service, software security, and research agents [00:53:00].
Challenges in Building and Scaling AI Agents
Building and scaling AI agents can be challenging [01:56:00]. Key difficulties include:
- Rapidly changing data: Enterprise systems constantly receive new data and business intelligence [02:05:00].
- Evolving user preferences: Customer needs and user preferences change over time [02:16:00].
- Rising inference cost: Larger LLMs carry higher inference costs, and growing usage drives those costs up further [02:20:00].
How Data Flywheels Help
Data flywheels offer a solution to these challenges [02:38:00]. At its core, a data flywheel is a continuous loop of data processing, curation, model customization, evaluation, and guardrailing [02:44:00]. It integrates state-of-the-art RAG (Retrieval Augmented Generation) pipelines with enterprise data to provide accurate responses [02:56:00].
As AI agents operate in production, this cycle is triggered, leading to:
- Continuous data curation: Ground truth data is curated using inference data, business intelligence, and user feedback [03:10:00].
- Model evaluation and selection: Existing and newer models are continuously experimented with and evaluated [03:20:00].
- Efficiency: The process aims to surface smaller, more efficient models that match the accuracy of larger models while offering lower latency, faster inference, and a lower total cost of ownership [03:26:00]. A minimal code sketch of this loop follows below.
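To make the loop concrete, here is a minimal sketch of one flywheel cycle in Python. Every function below is a hypothetical in-memory stand-in, not a NeMo API; a real implementation would call out to curation, customization, and evaluation services.

```python
# Hypothetical stand-ins for each flywheel stage (not NeMo APIs).

def curate(logs):
    # Build ground truth from production logs that carry user feedback.
    return [r for r in logs if r.get("feedback") is not None]

def fine_tune(model_name, dataset):
    # Stand-in for LoRA / P-tuning / SFT customization.
    return {"base": model_name, "trained_on": len(dataset)}

def evaluate(tuned_model, dataset):
    # Stand-in for benchmarking or LLM-as-judge evaluation.
    return min(1.0, 0.5 + 0.001 * tuned_model["trained_on"])  # placeholder metric

def flywheel_cycle(production_logs, candidate_models):
    dataset = curate(production_logs)
    scored = [(m, evaluate(fine_tune(m, dataset), dataset))
              for m in candidate_models]
    # Promote the highest-scoring candidate for guardrailed deployment.
    return max(scored, key=lambda pair: pair[1])[0]
```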
NVIDIA NeMo Microservices
NVIDIA offers NeMo microservices, an end-to-end platform for building powerful agentic and generative AI systems, including data flywheels [03:55:00]. These microservices are exposed as simple API endpoints and include components for each stage of the data flywheel loop:
- NeMo Curator: Curates high-quality training data, including multimodal data [04:13:00].
- NeMo Customizer: Fine-tunes and customizes models using techniques such as LoRA, P-tuning, and full supervised fine-tuning (SFT) [04:21:00].
- NeMo Evaluator: Benchmarks models against academic and institutional benchmarks, and can use LLMs as judges [04:34:00].
- NeMo Guardrails: Adds guardrails to interactions for privacy, security, and safety [04:47:00].
- NeMo Retriever: Builds state-of-the-art RAG pipelines [04:51:00].
These microservices offer flexibility, allowing deployment on-prem, in the cloud, in data centers, or at the edge [05:14:00].
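As a rough illustration of the "simple API endpoints" idea, the snippet below posts a fine-tuning job over HTTP. The URL, path, and payload fields are all assumptions for illustration, not documented NeMo endpoints; consult the NeMo microservices documentation for the actual API.

```python
import requests

# Illustrative only: the endpoint URL and JSON schema are hypothetical
# placeholders for a customization microservice deployed on-prem or in-cloud.
resp = requests.post(
    "http://nemo-customizer.internal:8000/v1/fine-tune",  # assumed deployment URL
    json={
        "base_model": "llama-3.1-8b-instruct",
        "technique": "lora",                   # e.g. LoRA, P-tuning, or full SFT
        "dataset_id": "routing-ground-truth-v1",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```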
Case Study: NVIDIA NVInfo Agent
NVIDIA built a data flywheel for their internal NVInfo agent, an employee support agent that provides access to enterprise knowledge across various domains, such as HR benefits, financial earnings, and IT help [06:45:00].
Router Agent Problem
The NVInfo agent architecture features a main router agent, orchestrated by an LLM, that routes employee queries to multiple underlying expert agents [07:37:00]. Each expert agent specializes in a specific domain and uses a RAG pipeline to fetch relevant information [07:47:00]. The challenge was to route user queries to the correct expert agent accurately, using a fast and cost-effective LLM [09:27:00].
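A minimal sketch of this router pattern is shown below, assuming an `llm_call` function that sends a prompt to some LLM endpoint and returns its text reply. The agent names and prompt wording are illustrative, not NVIDIA's.

```python
# Hypothetical expert-agent registry; each maps to a domain RAG pipeline.
EXPERT_AGENTS = {"hr", "finance", "it"}

def route(query: str, llm_call) -> str:
    prompt = ("Route the employee query to exactly one expert agent.\n"
              f"Options: {', '.join(sorted(EXPERT_AGENTS))}.\n"
              f"Query: {query}\n"
              "Answer with the option name only.")
    choice = llm_call(prompt).strip().lower()
    return choice if choice in EXPERT_AGENTS else "it"  # fallback route

# Example: route("How do I enroll in dental coverage?", my_llm) -> "hr"
```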
Initial Model Comparison and Data Curation
Initially, a 70B Llama variant achieved 96% baseline accuracy in routing queries [09:56:00]. Smaller variants, such as the 8B model, however, showed subpar accuracy of only 14% [10:24:00].
To improve smaller models, a feedback loop was implemented:
- The 70B Llama variant was used in production [11:02:00].
- NVIDIA employees submitted queries, and feedback was collected on whether responses were useful [11:11:00].
- 1,224 data points were curated, with 729 satisfactory and 495 unsatisfactory responses [11:24:00].
- Nemo Evaluator, with an LLM as a judge, investigated the unsatisfactory responses [11:44:00].
- Of these, 140 errors were attributed to incorrect routing; further manual analysis by subject matter experts confirmed that 32 were truly routing errors [11:52:00].
- A ground truth dataset of 685 data points was created, split 60/40 for training/fine-tuning and testing/evaluation [12:09:00].
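The 60/40 split is straightforward; a minimal sketch (the function name and seed are assumptions) shows the resulting counts for the 685-point dataset:

```python
import random

def split_ground_truth(records, train_frac=0.6, seed=42):
    """Shuffle and split the curated ground-truth set, as in the case study."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# With 685 curated data points: 411 for fine-tuning, 274 for evaluation.
train, test = split_ground_truth(list(range(685)))
assert (len(train), len(test)) == (411, 274)
```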
Results
With only 685 data points and the data flywheel setup, the results were significant:
- The 70B variant had 96% accuracy but a latency of 26 seconds to generate the first token [12:36:00].
- The 8B variant initially had 14% accuracy but much lower latency [12:54:00].
- After fine-tuning, the 8B model was able to match the accuracy of the 70B variant [13:04:00].
- Even the 1B variant achieved 94% accuracy, only two percentage points below the 70B model [13:22:00].
Deploying the 1B model yielded a 98% reduction in inference cost, a 70x reduction in model size, and 70x lower latency [13:42:00]. This demonstrates the power of a data flywheel: continuously learning from production logs and surfacing smaller, more efficient models to replace larger ones [14:02:00].
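The selection step behind this outcome can be sketched with the case study's own figures hard-coded. The `Candidate` type and the accuracy tolerance are assumptions; the idea is simply to promote the smallest model within an acceptable accuracy gap of the baseline:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    params_b: float   # size in billions of parameters
    accuracy: float   # routing accuracy on the held-out split

baseline = Candidate("llama-70b", 70, 0.96)
tuned = [Candidate("llama-8b-finetuned", 8, 0.96),
         Candidate("llama-1b-finetuned", 1, 0.94)]

TOLERANCE = 0.02  # assumed acceptable accuracy gap
eligible = [c for c in tuned if c.accuracy >= baseline.accuracy - TOLERANCE]
winner = min(eligible, key=lambda c: c.params_b)
print(winner.name, f"{baseline.params_b / winner.params_b:.0f}x smaller")
# -> llama-1b-finetuned 70x smaller
```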
Framework for Building Effective Data Flywheels
Building effective data flywheels involves a four-step framework:
1. Monitor User Feedback
Start by establishing intuitive ways to collect user feedback signals [14:48:00]. This includes:
- Intuitive user experience: Making it easy for users to provide feedback [14:57:00].
- Privacy compliance: Ensuring feedback collection adheres to privacy standards [15:00:00].
- Implicit and explicit signals: Gathering both indirect (e.g., usage patterns) and direct (e.g., ratings, comments) feedback [15:00:00]. These signals reveal whether the agent’s models are experiencing drift or inaccuracies [15:05:00]. A minimal feedback record is sketched after this list.
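One way to capture both signal types is a single feedback record per response. The field names below are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical feedback record combining explicit and implicit signals.
@dataclass
class FeedbackEvent:
    query: str
    response_id: str
    thumbs_up: bool | None = None        # explicit: rating widget
    comment: str | None = None           # explicit: free-text feedback
    copied_response: bool = False        # implicit: user reused the answer
    follow_up_rephrase: bool = False     # implicit: user asked again
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```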
2. Analyze and Attribute Errors
Dedicate time to analyze and attribute the errors or model drift observed [15:12:00].
- Classify errors: Categorize the types of inaccuracies [15:21:00].
- Attribute failures: Determine the root causes of the agent’s undesirable behavior (see the triage sketch after this list) [15:23:00].
- Create ground truth data: Develop a high-quality ground truth dataset from the analysis, which will be used in subsequent steps [15:26:00].
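A triage pass like the one in the case study can be sketched as follows. The category labels and the `llm_judge` callable are assumptions; records flagged as routing errors would then go to subject matter experts for manual confirmation before entering the ground-truth set:

```python
# Hypothetical error taxonomy for an agent backed by a router and RAG.
ERROR_CATEGORIES = ["incorrect_routing", "retrieval_miss", "bad_generation"]

def attribute_errors(unsatisfactory, llm_judge):
    """Bucket unsatisfactory responses by LLM-as-judge error category."""
    buckets = {c: [] for c in ERROR_CATEGORIES}
    for record in unsatisfactory:
        label = llm_judge(record, ERROR_CATEGORIES)
        buckets.setdefault(label, []).append(record)
    return buckets
```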
3. Plan
This stage involves strategizing the improvements based on the analyzed data [15:34:00].
- Identify different models: Determine which models could be used or experimented with [15:36:00].
- Generate synthetic datasets: Create additional data points to augment training (see the augmentation sketch after this list) [15:38:00].
- Experiment and fine-tune: Test different approaches and fine-tune models using the curated data [15:39:00].
- Optimize resource and cost: Plan for efficient resource allocation and cost management [15:41:00].
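One simple form of synthetic-data augmentation for a routing task is paraphrasing each curated query while keeping its label. The sketch below assumes a `paraphrase` callable backed by some LLM; it is an illustration, not a prescribed method:

```python
def augment(ground_truth, paraphrase, n_variants=2):
    """Expand (query, label) pairs with paraphrased variants of each query."""
    augmented = list(ground_truth)
    for query, label in ground_truth:
        for _ in range(n_variants):
            augmented.append((paraphrase(query), label))  # same routing label
    return augmented
```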
4. Execute
The final step is to put the plan into action and manage the ongoing process [15:46:00].
- Trigger data flywheel cycle: Initiate the continuous process of data curation, model training, and evaluation [15:51:00].
- Set up a regular cadence: Establish a routine for tracking accuracy, latency, and performance (see the cadence sketch after this list) [15:53:00].
- Monitor production logs: Continuously monitor the agent’s behavior in production [16:02:00].
- Manage end-to-end GenAI Ops pipeline: Oversee the entire pipeline for generative AI operations [16:05:00].
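Tying the execute step together, a recurring cadence might look like the sketch below. The interval, accuracy floor, and both functions are assumptions; in practice the accuracy figure would come from monitored production logs and the trigger would kick off a full curation, fine-tuning, and evaluation cycle:

```python
import time

def accuracy_in_production() -> float:
    """Placeholder: in practice, computed from monitored production logs."""
    return 0.95

def run_flywheel(interval_s: int = 7 * 24 * 3600, accuracy_floor: float = 0.9):
    # Regular cadence: check tracked accuracy and re-trigger the flywheel
    # cycle (curation -> fine-tuning -> evaluation) when drift appears.
    while True:
        if accuracy_in_production() < accuracy_floor:
            print("drift detected: triggering a new flywheel cycle")
        time.sleep(interval_s)  # e.g., weekly
```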
This framework provides a structured approach to building effective data flywheels for AI agents [16:12:00].