From: aidotengineer

AI agents are gaining significant attention and are being integrated into the workforce as new digital employees [00:00:48]. They serve various functions, including customer service, software security, and research [00:01:02]. An AI agent is defined as a system capable of perceiving, reasoning, and acting on an underlying task [00:01:12]. They process data, develop plans based on user queries, and utilize tools, functions, and external systems to achieve their goals [00:01:20]. A complete AI agent cycle involves capturing and learning from user feedback to continuously refine performance, ensuring accuracy and usefulness [00:01:36].
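The perceive-reason-act-learn cycle described above can be sketched in a few lines of Python. Everything here is illustrative: the class name, the keyword-based tool selection (standing in for an LLM's plan), and the feedback log are assumptions, not any real agent framework.

```python
# Minimal sketch of the perceive -> reason -> act -> learn cycle.
# All names are hypothetical; keyword matching stands in for LLM reasoning.

class SimpleAgent:
    def __init__(self, tools):
        self.tools = tools          # tool name -> callable
        self.feedback_log = []      # captured user feedback feeds the flywheel

    def perceive(self, query):
        # A real agent would parse the query and gather context here.
        return {"query": query.lower()}

    def reason(self, state):
        # Pick a tool by trivial keyword match (stand-in for an LLM plan).
        for name in self.tools:
            if name in state["query"]:
                return name
        return "fallback"

    def act(self, tool_name, state):
        tool = self.tools.get(tool_name, lambda s: "no tool matched")
        return tool(state)

    def learn(self, query, response, useful):
        # Feedback capture is what closes the loop described above.
        self.feedback_log.append(
            {"query": query, "response": response, "useful": useful}
        )

    def run(self, query, feedback=None):
        state = self.perceive(query)
        response = self.act(self.reason(state), state)
        if feedback is not None:
            self.learn(query, response, feedback)
        return response


agent = SimpleAgent({"search": lambda s: f"searching for: {s['query']}"})
print(agent.run("Search the docs", feedback=True))
```

The point of the sketch is the shape of the loop, not the logic inside each step: feedback captured in `learn` is exactly the data a flywheel later curates.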

Challenges in Building and Scaling AI Agents

Building and scaling AI agents can be difficult [00:01:56] due to several factors:

  • Rapid Data Change: Enterprise customers constantly receive new data and business intelligence [00:02:05].
  • Evolving User Preferences: User preferences and customer needs frequently change [00:02:16].
  • Increased Inference Cost: Deploying large language models (LLMs) to support new use cases drives up inference costs, since expenses scale directly with usage [00:02:20].

This is where data flywheels provide a solution [00:02:37].

What are Data Flywheels?

A data flywheel is a continuous loop or cycle that ensures AI agents remain relevant and helpful over time [00:00:07]. It’s not about relying solely on the latest LLM, but rather about incorporating simple data flywheels [00:00:16].

At its core, a data flywheel starts with enterprise data and involves:

  • Data processing and curation: Continuously refining data [00:02:44].
  • Model customization: Adapting models to specific needs [00:02:48].
  • Evaluation: Benchmarking and assessing model performance [00:02:50].
  • Guardrailing: Ensuring safer interactions for privacy, security, and safety [00:02:52].
  • Building state-of-the-art RAG (Retrieval Augmented Generation) pipelines: To provide relevant and accurate responses [00:02:56].
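One turn of this loop can be sketched as a chain of stage functions mirroring the list above. Every function body is a toy stand-in under stated assumptions (e.g., a fake accuracy metric that grows with training data), not a real curation, training, or evaluation step.

```python
# Hedged sketch of one turn of a data flywheel; each stage is a placeholder.

def curate(logs):
    # Keep only interactions that carry explicit user feedback.
    return [record for record in logs if record.get("feedback") is not None]

def customize(model, dataset):
    # Stand-in for fine-tuning: track how many examples the model has seen.
    return {**model, "trained_on": model.get("trained_on", 0) + len(dataset)}

def evaluate(model):
    # Toy metric: pretend accuracy grows with training data, capped at 1.0.
    return min(1.0, 0.5 + 0.01 * model["trained_on"])

def flywheel_turn(model, production_logs, accuracy_target=0.9):
    dataset = curate(production_logs)
    candidate = customize(model, dataset)
    accuracy = evaluate(candidate)
    promoted = accuracy >= accuracy_target
    return candidate, accuracy, promoted


logs = [{"query": "q", "feedback": True}] * 50
candidate, accuracy, promoted = flywheel_turn({"trained_on": 0}, logs)
print(accuracy, promoted)
```

Guardrailing and retrieval would wrap the serving path rather than this training loop, which is why they do not appear as stages here.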

As AI agents operate in production, this cycle is triggered, leading to continuous data curation from inference data, business intelligence, and user feedback [00:03:07]. This enables experimentation and evaluation of existing and newer models to identify efficient, smaller models that maintain accuracy comparable to larger LLMs but offer lower latency, faster inference, and reduced total cost of ownership [00:03:20].

NVIDIA NeMo Microservices

NVIDIA has introduced NeMo Microservices as an end-to-end platform for building powerful agentic and generative AI systems and robust data flywheels around them [00:03:52]. These microservices offer components for each stage of the data flywheel loop:

  • NeMo Curator: Helps curate high-quality training datasets, including multimodal data [00:04:13].
  • NeMo Customizer: Facilitates fine-tuning and customizing models using techniques such as LoRA, p-tuning, and full SFT [00:04:21].
  • NeMo Evaluator: Evaluates models on academic and institutional benchmarks, and supports using an LLM as a judge [00:04:34].
  • NeMo Guardrails: Adds guardrails to interactions for privacy, security, and safety [00:04:47].
  • NeMo Retriever: Enables the creation of state-of-the-art RAG pipelines [00:04:51].

These microservices are exposed as simple API endpoints, allowing users to customize, evaluate, and guardrail LLMs with minimal calls [00:04:59]. They can be deployed anywhere – on-premises, in the cloud, in data centers, or at the edge – with enterprise-grade stability and support from NVIDIA [00:05:14].
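To illustrate the "simple API endpoints" idea, the snippet below builds (but does not send) customization and evaluation job payloads. The base URL, endpoint paths, field names, and model/dataset identifiers are all assumptions for illustration; consult NVIDIA's documentation for the actual NeMo Microservices API schemas.

```python
import json

# Hypothetical deployment URL; not a real endpoint.
BASE_URL = "http://nemo.example.internal"

def customization_request(base_model, dataset_id, technique="lora"):
    # Builds a fine-tuning job payload (illustrative fields only).
    return {
        "url": f"{BASE_URL}/v1/customization/jobs",
        "body": {
            "model": base_model,
            "dataset": dataset_id,
            "technique": technique,   # e.g. lora, p-tuning, full SFT
        },
    }

def evaluation_request(model, benchmark):
    # Builds an evaluation job payload (illustrative fields only).
    return {
        "url": f"{BASE_URL}/v1/evaluation/jobs",
        "body": {"model": model, "benchmark": benchmark},
    }


job = customization_request("base-8b-model", "routing-feedback-v1")
print(json.dumps(job, indent=2))
```

The takeaway is the workflow shape, not the schema: one call starts a fine-tuning job, another starts an evaluation run, and the flywheel orchestrates them.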

Sample Data Flywheel Architecture

A typical data flywheel architecture leveraging Nemo Microservices can be visualized as “Lego pieces” [00:05:32]. An end-user interacts with the front end of an agent (e.g., a customer service agent), which is guarded for safety [00:05:43]. On the backend, a model served as an NVIDIA NIM (NVIDIA Inference Microservice) provides optimized inference [00:05:57].

To identify the most suitable model without compromising accuracy, a data flywheel loop is established [00:06:02]. This loop continuously curates data, stores it in the NeMo Data Store, and uses NeMo Customizer and NeMo Evaluator to trigger cycles of retraining and evaluation [00:06:09]. Once a model meets the target accuracy, an IT admin or AI engineer can promote it to power the agentic use case [00:06:23].
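The promotion decision at the end of a flywheel cycle can be sketched as a simple gate: a candidate replaces the serving model only if it stays within an accuracy tolerance of the incumbent while being cheaper to run. The field names, tolerance, and cost figures below are illustrative assumptions, not measured values.

```python
# Sketch of a model-promotion gate; thresholds and fields are hypothetical.

def should_promote(candidate, incumbent, accuracy_tolerance=0.02):
    # Candidate must be within the tolerance of the incumbent's accuracy...
    meets_accuracy = (
        candidate["accuracy"] >= incumbent["accuracy"] - accuracy_tolerance
    )
    # ...and strictly cheaper to serve.
    cheaper = candidate["cost_per_1k_tokens"] < incumbent["cost_per_1k_tokens"]
    return meets_accuracy and cheaper


incumbent = {"name": "70b", "accuracy": 0.96, "cost_per_1k_tokens": 1.00}
candidate = {"name": "8b-finetuned", "accuracy": 0.96, "cost_per_1k_tokens": 0.10}
print(should_promote(candidate, incumbent))  # True under these assumed numbers
```

In practice the gate would also weigh latency and safety evaluations, but an explicit accuracy-versus-cost rule is the core of "promote it to power the agentic use case."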

Case Study: NVIDIA NVInfo Agent

NVIDIA adopted and built a data flywheel for its internal employee support agent, NVInfo, which provides access to enterprise knowledge across various domains like HR, finance, IT help, and product documentation [00:06:42].

The NVInfo agent’s data flywheel architecture involves:

  • An employee submits a query to the agent [00:07:28].
  • The query is guardrailed for safety and security [00:07:31].
  • A “router agent,” powered by an LLM, orchestrates multiple “expert agents” [00:07:37]. Each expert agent specializes in a specific domain and uses a RAG pipeline to fetch relevant information [00:07:47].
  • A data flywheel loop continuously builds on user feedback and production inference logs to determine which models power these expert agents [00:08:03].
  • Ground truth data is curated using subject matter experts and human-in-the-loop feedback [00:08:20].
  • NeMo Customizer and NeMo Evaluator continuously evaluate candidate models and promote the most effective one as a NIM to power the router agent [00:08:27].
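The router-plus-experts pattern above can be sketched with keyword matching standing in for the routing LLM. The domains, keywords, and canned responses are illustrative; in the real system each expert wraps its own RAG pipeline and the router is a fine-tuned model.

```python
# Toy sketch of a router agent dispatching to domain expert agents.
# Keyword routing stands in for an LLM; experts stand in for RAG pipelines.

EXPERTS = {
    "hr": lambda q: f"[HR expert] answer for: {q}",
    "it": lambda q: f"[IT expert] answer for: {q}",
    "finance": lambda q: f"[Finance expert] answer for: {q}",
}

ROUTING_KEYWORDS = {
    "vacation": "hr",
    "payroll": "finance",
    "laptop": "it",
    "vpn": "it",
}

def route(query):
    # Map the query to a domain; default to IT help (arbitrary choice here).
    for keyword, domain in ROUTING_KEYWORDS.items():
        if keyword in query.lower():
            return domain
    return "it"

def answer(query):
    domain = route(query)
    return domain, EXPERTS[domain](query)


print(answer("How do I reset my VPN token?"))
```

Because the whole system's quality hinges on `route` picking the right expert, this is the component the case study's flywheel targets for fine-tuning.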

Focus: The Router Agent

The router agent, an example of a mixture-of-agents architecture, understands user intent and context to route queries to the appropriate expert agent, which then uses a RAG pipeline [00:09:03]. The goal is to ensure accurate routing using a fast and cost-effective LLM [00:09:35].

Initially, a 70B LLM variant achieved a 96% baseline routing accuracy without fine-tuning, while smaller variants (e.g., 8B) had subpar accuracy of around 14% [00:09:55]. This gap often leads enterprises to default to larger, more expensive models [00:10:33].

Using a data flywheel, the team:

  1. Collected User Feedback: Employees submitted queries and feedback on response usefulness [00:11:02].
  2. Curated Data: Of 1,224 collected data points, 495 responses were marked unsatisfactory [00:11:24].
  3. Investigated Errors: Nemo Evaluator, using LLM as a judge, found 140 unsatisfactory responses were due to incorrect routing [00:11:44]. Manual analysis by subject matter experts confirmed 32 of these were true routing errors [00:11:58].
  4. Created Ground Truth Dataset: An 868-data-point dataset was created, split 60/40 for training/fine-tuning and testing/evaluation [00:12:08].
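The 60/40 split in step 4 is a standard shuffle-and-cut; the helper below is an illustrative implementation (the ratio comes from the text, the function itself does not).

```python
import random

# Sketch of a deterministic 60/40 train/test split for a ground-truth dataset.

def train_test_split(records, train_fraction=0.6, seed=42):
    shuffled = list(records)
    # Seeded shuffle so the split is reproducible across flywheel runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(100))
print(len(train), len(test))  # 60 40
```

A fixed seed matters here: each flywheel cycle should evaluate candidates on the same held-out test set so accuracy numbers are comparable across cycles.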

Results

With just 685 data points for fine-tuning, the results were significant [00:12:27]:

  • The 70B variant achieved 96% accuracy but took 26 seconds to produce the first token [00:12:36].
  • The 8B variant initially had only 14% accuracy, but roughly 70% lower latency [00:12:54].
  • After fine-tuning, the 8B variant matched the 70B variant’s accuracy [00:13:04].
  • Even the 1B variant achieved 94% accuracy, only 2% below the 70B model [00:13:22].

This demonstrated the trade-off between accuracy and cost/resource management [00:13:31]: deploying the 1B model, for example, gives up about 2% accuracy in exchange for much lower latency, faster inference, and a reduced total cost of ownership.

This highlights the power of data flywheels in automating continuous evaluation and fine-tuning, surfacing smaller, more efficient models that can replace larger, more costly ones in production workflows [00:13:59].

Framework for Creating Effective Data Flywheels

To build effective data flywheels, consider this four-step framework [00:14:43]:

  1. Monitor User Feedback: Implement intuitive ways to collect user feedback signals (implicit or explicit) to detect model drift or inaccuracies in your agentic system [00:14:48].
  2. Analyze & Attribute Errors: Investigate and classify errors or model drift to understand why the agent is behaving in a certain way [00:15:12]. Attribute failures and create a ground truth dataset from this analysis [00:15:23].
  3. Plan: Identify different models, generate synthetic datasets, and experiment with fine-tuning them [00:15:34]. Optimize resource and cost considerations during this phase [00:15:43].
  4. Execute: Put the plan into action by triggering the data flywheel cycle [00:15:46]. Set up a regular cadence to track accuracy and latency, monitor performance, and review production logs, effectively managing the end-to-end GenAI Ops pipeline [00:15:51].
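The four steps above can be condensed into a scheduled check: compare production metrics against thresholds and trigger a flywheel cycle when the model drifts. The metric names and threshold values are illustrative assumptions.

```python
# Sketch of step 4's regular cadence: detect drift, then trigger the flywheel.
# Thresholds and metric names are hypothetical examples.

THRESHOLDS = {"accuracy": 0.90, "p95_latency_s": 2.0}

def detect_drift(metrics):
    # Return the list of violated thresholds; empty means the model is healthy.
    issues = []
    if metrics["accuracy"] < THRESHOLDS["accuracy"]:
        issues.append("accuracy")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        issues.append("p95_latency_s")
    return issues

def cadence_tick(metrics, trigger_flywheel):
    issues = detect_drift(metrics)
    if issues:
        # Kicks off the curate -> fine-tune -> evaluate cycle.
        trigger_flywheel(issues)
    return issues


triggered = []
print(cadence_tick({"accuracy": 0.85, "p95_latency_s": 1.2}, triggered.append))
```

Running such a tick on a fixed schedule (plus user-feedback monitoring from step 1) is what turns a one-off fine-tuning effort into a continuous GenAI Ops pipeline.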

This framework helps in building and improving AI agents that are continuously learning and adapting based on real-world interactions [00:16:12].