From: aidotengineer

AI agents are rapidly becoming integral digital employees within the workforce, taking various forms such as customer service, software security, and research agents [00:00:48]. At their core, agents are systems designed to perceive, reason, and act on specific tasks, utilizing tools, functions, and external systems [00:01:12]. What completes the cycle for effective AI agents is their ability to continuously learn from user feedback, preferences, and data, thereby refining themselves to be more accurate and useful over time [00:01:36]. This continuous learning and refinement process is enabled by data flywheels [00:00:21].
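The perceive-reason-act cycle described above can be sketched in a few lines. This is a generic illustration, not any particular framework's API; `Tool`, `run_agent`, and the toy lookup tool are all invented for the example.

```python
# Minimal sketch of the perceive -> reason -> act agent loop.
# All names (Tool, run_agent, the toy tools) are illustrative, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    func: Callable[[str], str]

def run_agent(task: str, tools: dict[str, Tool], reason) -> str:
    """One agent step: reason about the task, pick a tool, then act on it."""
    tool_name, tool_input = reason(task)       # reason: decide which tool to use
    return tools[tool_name].func(tool_input)   # act: invoke the chosen tool

# Toy example: a "lookup" tool standing in for a RAG retriever.
tools = {"lookup": Tool("lookup", lambda q: f"docs for '{q}'")}
reason = lambda task: ("lookup", task)         # trivial policy for the sketch
print(run_agent("HR benefits", tools, reason))  # docs for 'HR benefits'
```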

Challenges in Building and Scaling AI Agents

Building and scaling AI agents presents several significant challenges [00:01:56]:

  • Rapidly Changing Data: Enterprise data and business intelligence continuously flow into systems [00:02:04].
  • Evolving User Preferences: Customer needs and user preferences are constantly changing [00:02:16].
  • High Inference Costs: Deploying large language models (LLMs) to support use cases drives up inference costs, and expenses grow as usage scales [00:02:20].

These challenges highlight the necessity of data flywheels [00:02:38].

What are Data Flywheels?

A data flywheel is a continuous loop or cycle that starts with enterprise data [00:02:41]. It encompasses:

  1. Data processing and curation [00:02:44]
  2. Model customization [00:02:48]
  3. Evaluation [00:02:50]
  4. Guardrailing for safer interactions [00:02:52]
  5. Building state-of-the-art RAG pipelines alongside enterprise data [00:02:54]

This cycle aims to provide relevant and accurate responses [00:03:00]. As AI agents operate in production, the data flywheel continuously curates ground truth data using inference data, business intelligence, and user feedback [00:03:07]. This process allows for continuous experimentation and evaluation of existing and newer models, enabling the identification and promotion of efficient, smaller models that achieve accuracy comparable to larger LLMs but with lower latency, faster inference, and reduced total cost of ownership [00:03:20].

NVIDIA’s AI and Data Flywheel Tools

NVIDIA has developed NeMo microservices as an end-to-end platform for building powerful agentic and generative AI systems, including robust data flywheels [00:03:52]. These microservices offer components for each stage of the data flywheel loop:

  • NeMo Curator: Helps curate high-quality training datasets, including multimodal data [00:04:13].
  • NeMo Customizer: Facilitates fine-tuning and customizing models using techniques like LoRA, P-tuning, and full supervised fine-tuning (SFT) [00:04:21].
  • NeMo Evaluator: Used for benchmarking against academic and institutional standards, as well as evaluation using LLMs as judges [00:04:34].
  • NeMo Guardrails: Enforces guardrails on interactions for privacy, security, and safety [00:04:47].
  • NeMo Retriever: Aids in building state-of-the-art RAG pipelines [00:04:51].

These microservices are exposed as simple API endpoints, allowing users to customize, evaluate, and guardrail LLMs with minimal calls [00:05:02]. They offer deployment flexibility across on-premise, cloud, data center, and edge environments, with enterprise-grade stability and support [00:05:14].
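Since the microservices are exposed as API endpoints, a flywheel client can be as simple as a few HTTP requests. The sketch below only builds request payloads; the base URL, endpoint paths, and payload fields are assumptions for illustration, not the documented NeMo API.

```python
# Hedged sketch: the endpoint paths and payload fields below are assumptions
# for illustration, not the documented NeMo microservices API.
import json

BASE = "http://nemo.example.internal"  # hypothetical deployment URL

def customization_request(base_model: str, dataset: str) -> dict:
    """Build a fine-tuning job request (fields are illustrative)."""
    return {"url": f"{BASE}/v1/customization/jobs",
            "body": {"model": base_model, "dataset": dataset, "method": "lora"}}

def evaluation_request(model: str, benchmark: str) -> dict:
    """Build an evaluation job request (fields are illustrative)."""
    return {"url": f"{BASE}/v1/evaluation/jobs",
            "body": {"model": model, "benchmark": benchmark}}

req = customization_request("llama-3.1-8b", "router-ground-truth")
print(json.dumps(req["body"]))
```

In a real deployment these payloads would be POSTed to the service and the returned job IDs polled for completion.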

Sample Data Flywheel Architecture

A typical data flywheel architecture leveraging NeMo microservices can be conceptualized as Lego pieces assembled to form a complete system [00:05:29]. An end-user interacts with the front end of an agent (e.g., a customer service agent) [00:05:43]. This interaction is guardrailed for safety, and the underlying model is served as an NVIDIA NIM for optimized inference [00:05:53].

To determine the optimal model without compromising accuracy, a data flywheel loop is established to:

  • Continuously curate data and store it in a data store [00:06:09].
  • Use NeMo Customizer and Evaluator to trigger continuous retraining and evaluation [00:06:17].
  • Promote models that meet target accuracy: IT administrators or AI engineers can deploy them as the underlying NIM powering the agentic use case [00:06:23].
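The promotion gate in this loop can be sketched as follows. All helper names (`fine_tune`, `evaluate`, `flywheel_cycle`) are placeholders standing in for the Customizer and Evaluator steps, not a real SDK.

```python
# Sketch of the flywheel promotion gate: retrain candidate models, evaluate
# them, and promote the first one that meets the target accuracy. All helper
# names are placeholders, not a real SDK.

def flywheel_cycle(candidates, curated_data, target_accuracy, fine_tune, evaluate):
    """Return the first candidate whose post-fine-tuning accuracy meets target."""
    for model in candidates:
        tuned = fine_tune(model, curated_data)   # Customizer-style step
        accuracy = evaluate(tuned)               # Evaluator-style step
        if accuracy >= target_accuracy:
            return tuned, accuracy               # promote as the serving NIM
    return None, None                            # keep the incumbent model

# Toy run with stubbed steps:
tuned, acc = flywheel_cycle(
    candidates=["8b", "70b"],
    curated_data=[],
    target_accuracy=0.95,
    fine_tune=lambda m, d: m + "-tuned",
    evaluate=lambda m: 0.96,
)
print(tuned, acc)   # 8b-tuned 0.96
```

Ordering candidates from smallest to largest means the loop naturally favors the cheapest model that clears the accuracy bar.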

Real-World Case Study: NVIDIA’s NV Info Agent

NVIDIA adopted and built a data flywheel for its internal employee support agent, “NV Info Agent,” which provides access to enterprise knowledge across various domains like HR benefits, financial earnings, IT help, and product documentation [00:06:45].

Architecture of the NV Info Agent’s Data Flywheel

When an employee submits a query, it is guardrailed for safety [00:07:26]. A router agent, orchestrated by an LLM, routes the query to one of multiple expert agents [00:07:37]. Each expert agent specializes in a specific domain and is augmented with a RAG pipeline to fetch relevant information [00:07:47].
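The router-plus-experts pattern can be sketched with a routing function standing in for the router LLM. The domains and keyword rules below are illustrative stand-ins, not NVIDIA's implementation.

```python
# Minimal sketch of the router pattern: a routing function (standing in for
# the router LLM) maps a query to one of several domain expert agents, each
# backed by its own retrieval pipeline. Names here are illustrative.

EXPERTS = {
    "hr":      lambda q: f"[HR expert + RAG] {q}",
    "finance": lambda q: f"[Finance expert + RAG] {q}",
    "it":      lambda q: f"[IT expert + RAG] {q}",
}

def route(query: str) -> str:
    """Stand-in for the router LLM: pick a domain from keywords."""
    lowered = query.lower()
    if "benefit" in lowered or "leave" in lowered:
        return "hr"
    if "earnings" in lowered or "revenue" in lowered:
        return "finance"
    return "it"

def answer(query: str) -> str:
    return EXPERTS[route(query)](query)

print(answer("What are my leave benefits?"))
```

In the production system the keyword rules would be replaced by the fine-tuned router LLM whose accuracy the flywheel continuously measures.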

To select and power these expert models, a data loop is set up that builds on user feedback and production inference logs [00:08:03]. Ground truth data is continuously curated using subject matter experts and human-in-the-loop feedback [00:08:20]. NeMo Customizer and Evaluator are used to continually assess multiple models and promote the most effective one as a NIM to power the router agent [00:08:27].

Optimizing the Router Agent

The problem statement for the router agent was to accurately route user queries to the correct expert agent using a fast and cost-effective LLM [00:09:27].

  • Initial deployment of a 70B variant LLM showed a 96% baseline accuracy in routing but had a latency of 26 seconds to generate the first token response [00:09:55].
  • Smaller 8B variants showed subpar accuracy (below 14%) without fine-tuning [00:10:24].

Many enterprises mistakenly choose larger models solely based on initial high accuracy [00:10:33]. However, data flywheels enable significant improvements for smaller models [00:10:57].

Data Flywheel in Action: The Results

  1. Feedback Collection: A feedback form was circulated among NVIDIA employees to capture user feedback on query usefulness [00:11:08].
  2. Data Curation: 1,224 data points were curated, with 729 satisfactory and 495 unsatisfactory responses [00:11:24].
  3. Error Analysis: Nemo Evaluator, using LLM as a judge, investigated the 495 unsatisfactory responses, identifying 140 due to incorrect routing [00:11:44]. Further manual analysis with subject matter experts confirmed 32 actual incorrect routings [00:11:58].
  4. Ground Truth Dataset: A ground truth dataset of 685 data points was established, split 60/40 for training/fine-tuning and testing/evaluation [00:12:05].
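The figures above are internally consistent, as a quick check shows. The 411/274 train/test counts are implied by the 60/40 split rather than stated in the source.

```python
# Sanity-checking the dataset numbers reported above.
total_feedback = 729 + 495          # satisfactory + unsatisfactory responses
ground_truth = 685
train = round(ground_truth * 0.6)   # 60% for training/fine-tuning
test = ground_truth - train         # 40% for testing/evaluation

print(total_feedback)  # 1224, matching the curated data points
print(train, test)     # 411 274
```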

With just 685 data points and the data flywheel setup, the results were outstanding [00:12:27]:

  • The 70B variant maintained 96% accuracy but with 26 seconds latency [00:12:36].
  • After fine-tuning, the 8B variant, which started below 14% accuracy, matched the 70B variant’s accuracy [00:13:04].
  • A 1B variant achieved 94% accuracy, only 2% below the 70B model [00:13:22].

This demonstrates that deploying a 1B model in place of the 70B model can yield significant savings.
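As a back-of-the-envelope illustration using only the figures reported above (model sizes and accuracies), the trade-off looks like this:

```python
# Illustration using only the numbers reported above: parameter-count ratio
# and accuracy retained when swapping the 70B model for the 1B model.
params_70b, params_1b = 70e9, 1e9
acc_70b, acc_1b = 0.96, 0.94

size_reduction = params_70b / params_1b   # 70x fewer parameters
accuracy_retained = acc_1b / acc_70b      # fraction of accuracy kept

print(f"{size_reduction:.0f}x smaller, {accuracy_retained:.1%} of accuracy retained")
# 70x smaller, 97.9% of accuracy retained
```

Actual cost savings depend on serving hardware and traffic, which the source does not quantify.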

The power of data flywheels lies in building an automated, continuous cycle of periodic evaluation and fine-tuning, allowing smaller models to replace larger ones in production while continuously learning from ongoing production logs and knowledge [00:13:59].

Framework for Building Effective Data Flywheels

To build effective data flywheels, consider the following framework [00:14:43]:

1. Monitor User Feedback

  • Implement intuitive ways to collect user feedback signals [00:14:50].
  • Account for user experience, privacy compliance, and both implicit and explicit feedback signals [00:14:57].
  • This helps identify model drift or inaccuracies in the agentic system [00:15:05].

2. Analyze and Attribute Errors

  • Spend time analyzing and attributing errors or model drift to understand why the agent behaves a certain way [00:15:12].
  • Classify errors, attribute failures, and create ground truth datasets [00:15:23].

3. Plan for Improvement

  • Identify different models for experimentation [00:15:34].
  • Generate synthetic datasets and fine-tune models [00:15:36].
  • Optimize resource utilization and cost [00:15:41].

4. Execute the Plan

  • Trigger the data flywheel cycle [00:15:48].
  • Establish a regular cadence and mechanism to track accuracy, latency, and performance [00:15:53].
  • Monitor production logs and manage the end-to-end GenAI Ops pipeline [00:16:02].
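The regular-cadence tracking in step 4 amounts to comparing each run's metrics against thresholds and flagging drift. The threshold values below are illustrative, not recommendations.

```python
# Sketch of a cadence check: compare a run's accuracy and latency against
# thresholds and flag drift or regressions. Threshold values are illustrative.

def check_run(accuracy: float, latency_s: float,
              min_accuracy: float = 0.94, max_latency_s: float = 5.0) -> list[str]:
    """Return a list of alerts; an empty list means the run is healthy."""
    alerts = []
    if accuracy < min_accuracy:
        alerts.append(f"accuracy drift: {accuracy:.2%} < {min_accuracy:.2%}")
    if latency_s > max_latency_s:
        alerts.append(f"latency regression: {latency_s}s > {max_latency_s}s")
    return alerts

print(check_run(0.96, 1.2))    # []
print(check_run(0.91, 7.0))    # two alerts
```

A scheduler (cron, Airflow, or similar) would invoke such a check after every evaluation run and page the team, or trigger a retraining cycle, when alerts fire.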

By following this framework, organizations can build resilient AI workflows and ensure their agents remain relevant and efficient.