From: aidotengineer

Silendrin, from Nvidia’s generative AI platforms team, details an approach to building effective AI agents that stay relevant and helpful over time by centering on data flywheels rather than relying solely on the latest large language models (LLMs) [00:00:07]. This article explores what data flywheels are, how one was applied to an internal agent at Nvidia, and the lessons learned along the way [00:00:24].

What Are AI Agents?

AI agents are systems capable of perceiving, reasoning, and acting on an underlying task [00:01:12]. They process data, develop plans to address user queries, and utilize tools, functions, and external systems to achieve their goals [00:01:20]. Crucially, effective AI agents capture and learn from user feedback, continuously refining themselves to improve accuracy and usefulness based on user preferences and data [00:01:38].
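As a rough illustration of that perceive-reason-act loop with feedback capture, here is a minimal Python sketch; the tool registry, the hardcoded plan, and the feedback store are all simplifying assumptions, not any particular framework’s API:

```python
from typing import Callable

# A registry of tools the agent can call; each maps a name to a function.
# (Hypothetical: a real agent would register retrievers, APIs, and functions.)
TOOLS: dict[str, Callable[[str], str]] = {
    "search_docs": lambda q: f"top passages for: {q}",  # stand-in retriever
}

feedback_log: list[dict] = []  # captured feedback later feeds model refinement

def run_agent(query: str) -> str:
    # Perceive: ingest the user query (and any surrounding context).
    # Reason: a real agent would have an LLM produce a plan; we hardcode one.
    plan = {"tool": "search_docs", "input": query}
    # Act: execute the chosen tool and compose a response from its output.
    evidence = TOOLS[plan["tool"]](plan["input"])
    return f"answer grounded in: {evidence}"

def record_feedback(query: str, answer: str, helpful: bool) -> None:
    # Learn: store explicit user feedback so the agent can be refined over time.
    feedback_log.append({"query": query, "answer": answer, "helpful": helpful})
```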

Challenges in Developing AI Agents

Building and scaling agents can be challenging due to several factors [00:01:56]:

  • Rapid Data Change: Enterprise data, including business intelligence, constantly evolves [00:02:05].
  • Evolving User Preferences: Customer needs and user preferences change over time [00:02:16].
  • High Inference Costs: Deploying larger LLMs for complex use cases leads to increased inference costs, where greater usage directly translates to higher expenses [00:02:20].

Data Flywheels: A Solution

Data flywheels offer a solution to these challenges [00:02:38]. At their core, a data flywheel is a continuous cycle involving [00:02:41]:

  1. Data Processing and Curation: Gathering and organizing data.
  2. Model Customization: Adapting models for specific tasks.
  3. Evaluation: Assessing model performance.
  4. Guardrailing: Ensuring safe and secure interactions.
  5. RAG Pipelines: Building state-of-the-art Retrieval-Augmented Generation (RAG) pipelines alongside enterprise data to provide accurate responses [00:02:56].

As AI agents operate in production, this cycle continuously curates ground truth data using inference data, business intelligence, and user feedback [00:03:07]. This enables continuous experimentation and evaluation of existing and newer models, leading to the identification of smaller, more efficient models that match the accuracy of larger LLMs but offer lower latency, faster inference, and reduced total cost of ownership [00:03:20].
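The payoff of that cycle can be expressed as a simple selection rule. The sketch below is a hedged illustration, not Nvidia’s implementation: given candidate models ordered by size and a curated evaluation set, pick the smallest model that meets the accuracy target.

```python
from typing import Callable

def accuracy(predict: Callable[[str], str],
             eval_set: list[tuple[str, str]]) -> float:
    # Fraction of curated (query, expected_answer) pairs the model gets right.
    correct = sum(predict(q) == expected for q, expected in eval_set)
    return correct / len(eval_set)

def pick_smallest_adequate(candidates: list[tuple[str, Callable[[str], str]]],
                           eval_set: list[tuple[str, str]],
                           target: float) -> str | None:
    # `candidates` is ordered smallest to largest, e.g. 1B, 8B, 70B variants.
    for name, predict in candidates:
        if accuracy(predict, eval_set) >= target:
            return name  # smallest model matching the larger model's accuracy
    return None  # nothing meets the bar; keep the incumbent model
```

In production the predict callables would be inference endpoints and the scoring would run through an evaluation harness, but the selection logic is the same.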

Nvidia NeMo Microservices

Nvidia provides NeMo Microservices, an end-to-end platform designed to build powerful agentic and generative AI systems and the data flywheels around them [00:03:55]. These microservices offer components for each stage of the data flywheel loop:

  • NeMo Curator: For curating high-quality training datasets, including multimodal data [00:04:13].
  • NeMo Customizer: For fine-tuning and customizing models using techniques like LoRA, p-tuning, and full SFT [00:04:21].
  • NeMo Evaluator: For benchmarking models against academic and institutional standards, and for using LLMs as judges [00:04:34].
  • NeMo Guardrails: For guardrailing interactions to ensure privacy, security, and safety [00:04:47].
  • NeMo Retriever: For building state-of-the-art RAG pipelines [00:04:51].

These microservices are exposed as easy-to-use API endpoints, allowing users to customize, evaluate, and guardrail LLMs with minimal calls [00:05:02]. They offer deployment flexibility across on-prem, cloud, data centers, and even the edge, with enterprise-grade stability and support [00:05:14].
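As an illustration of that “minimal calls” workflow only (the endpoint paths, payload fields, and response shape below are hypothetical placeholders, not the documented NeMo Microservices API), a customization job followed by an evaluation might each be a single HTTP call:

```python
import requests

BASE = "http://nemo-platform.internal.example"  # placeholder deployment URL

# Hypothetical: launch a LoRA fine-tuning job against a curated dataset.
customization = requests.post(f"{BASE}/v1/customization/jobs", json={
    "base_model": "llama-3.1-8b-instruct",       # example model name
    "technique": "lora",
    "dataset": "router-ground-truth-train",
}).json()

# Hypothetical: benchmark the fine-tuned model with an LLM-as-judge evaluator.
evaluation = requests.post(f"{BASE}/v1/evaluation/jobs", json={
    "model": customization["output_model"],
    "evaluator": "llm-as-judge",
    "dataset": "router-ground-truth-test",
}).json()

print(evaluation)
```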

Sample Data Flywheel Architecture

A typical data flywheel architecture built with NeMo Microservices has an end user interacting with the agent’s front end, which is guardrailed for safe interactions [00:05:40]. Behind the scenes, an optimized model served as an Nvidia NIM powers the agent [00:05:55]. The data flywheel continuously curates data, stores it in the NeMo data store, and uses NeMo Customizer and NeMo Evaluator to trigger continuous retraining and evaluation [00:06:09]. Once a candidate model meets the target accuracy, an IT admin or AI engineer can promote it to power the agent’s use case [00:06:23].
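A minimal sketch of that request path, with every function a placeholder for the corresponding service call (guardrails, NIM inference, and log storage):

```python
def handle_request(query: str) -> str:
    # Guardrail the incoming query before it reaches the model.
    if not passes_guardrails(query):          # placeholder safety check
        return "Sorry, I can't help with that."
    answer = call_nim(query)                   # optimized model served as a NIM
    log_interaction(query, answer)             # production logs feed the flywheel
    return answer

def passes_guardrails(query: str) -> bool:
    return "forbidden" not in query.lower()    # toy stand-in for a real policy

def call_nim(query: str) -> str:
    return f"model answer to: {query}"         # stand-in for an inference call

def log_interaction(query: str, answer: str) -> None:
    print({"query": query, "answer": answer})  # a real system would persist this
```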

Case Study: NV Info Agent

Nvidia built a data flywheel for its internal NV Info Agent, an employee support agent that gives Nvidia employees access to enterprise knowledge across multiple domains [00:06:45]. This chatbot answers queries ranging from HR benefits, financial earnings, and IT help to product documentation [00:07:06].

NV Info Agent Architecture

The NV Info Agent’s data flywheel architecture starts with an employee submitting a query, which is guardrailed for safety [00:07:26]. A router agent, orchestrated by an LLM, directs the query to one of several underlying expert agents [00:07:37]. Each expert agent specializes in a specific domain and uses its own RAG pipeline to fetch relevant information [00:07:47].

A data flywheel loop determines which models power these expert agents [00:08:03]. The loop continuously incorporates user feedback and production inference logs [00:08:12], and ground truth data is curated with subject matter experts and human-in-the-loop feedback [00:08:22]. NeMo Customizer and NeMo Evaluator continually assess candidate models and promote the most effective one, served as a NIM, to power the router agent [00:08:27].
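A highly simplified sketch of the routing layer; the domain labels are taken from the use case above, while the keyword classifier merely stands in for the LLM-powered router:

```python
# Hypothetical expert agents, one per enterprise domain.
EXPERTS = {
    "hr": lambda q: f"HR answer to: {q}",
    "finance": lambda q: f"Finance answer to: {q}",
    "it": lambda q: f"IT answer to: {q}",
    "product": lambda q: f"Docs answer to: {q}",
}

def classify_domain(query: str) -> str:
    # In the real system, an LLM-powered router performs this classification;
    # a keyword lookup stands in for it here.
    q = query.lower()
    if "benefit" in q or "leave" in q:
        return "hr"
    if "earnings" in q or "budget" in q:
        return "finance"
    if "laptop" in q or "vpn" in q:
        return "it"
    return "product"

def route(query: str) -> str:
    # Each expert would use its own RAG pipeline to fetch relevant context.
    return EXPERTS[classify_domain(query)](query)
```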

Router Agent Problem Statement and Solution

The core challenge for the router agent was to accurately route a user query to the correct expert agent using a fast and cost-effective LLM [00:09:27].

Initially, a 70B model variant achieved a 96% baseline accuracy in routing queries to the correct expert agent without any fine-tuning [00:09:56]. Smaller variants fell far short, however: the 8B model came in at around 14% accuracy [00:10:24].

To address this, Nvidia implemented the data flywheel approach [00:11:00]:

  1. User Feedback Collection: The 70B model was run, and Nvidia employees were asked to submit queries and provide feedback on response usefulness [00:11:02].
  2. Data Curation: This led to 1,224 data points, with 729 satisfactory and 495 unsatisfactory responses [00:11:24].
  3. Error Analysis: NeMo Evaluator, using an LLM as a judge, investigated the 495 unsatisfactory responses [00:11:44]. It attributed 140 of them to incorrect routing, and manual review by subject matter experts confirmed that 32 were genuinely misrouted [00:11:52].
  4. Ground Truth Dataset: This yielded a ground truth dataset of 685 data points, split 60/40 between training (for fine-tuning the smaller models) and testing/evaluation, as in the sketch below [00:12:09].
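For concreteness, a 60/40 split of 685 examples works out to 411 for fine-tuning and 274 held out for evaluation. A minimal sketch with scikit-learn, assuming the curated data is a list of (query, correct_expert) pairs:

```python
from sklearn.model_selection import train_test_split

# Placeholder for the 685 curated (query, correct_expert) pairs.
ground_truth = [(f"query {i}", "hr") for i in range(685)]

train, test = train_test_split(ground_truth, test_size=0.4, random_state=42)
print(len(train), len(test))  # -> 411 274
```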

Results and Benefits

The results from fine-tuning with just 685 data points were significant [00:12:27]:

  • The 70B model had 96% accuracy but a latency of 26 seconds to generate the first token [00:12:36].
  • The 8B model started at around 14% accuracy, but with latency 70% lower than the 70B model’s [00:12:54].
  • After fine-tuning, the 8B model matched the 70B model’s accuracy [00:13:08].
  • Even the 1B variant achieved 94% accuracy, only 2% below the 70B model [00:13:22].
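Put concretely, a 70% reduction on the 70B model’s 26-second time to first token puts the fine-tuned 8B variant at roughly 8 seconds, with no loss in routing accuracy.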

By deploying smaller models such as the 1B variant, Nvidia achieved comparable accuracy at a fraction of the inference cost and latency of the 70B model, sharply reducing total cost of ownership [00:13:42].

This demonstrates the power of a data flywheel: continuously learning from production logs and knowledge to train smaller, more efficient models that can replace larger ones while maintaining or even improving performance [00:14:20].

Framework for Building and Improving AI Agents

To build effective data flywheels, consider this four-step framework [00:14:40]:

  1. Monitor User Feedback:

    • Implement intuitive ways to collect user feedback signals (implicit and explicit) [00:14:50].
    • Identify model drift or inaccuracies in the agent’s behavior [00:15:08].
  2. Analyze and Attribute Errors:

    • Spend time analyzing and attributing errors or model drift to their root causes [00:15:12].
    • Classify errors, attribute failures, and create ground truth datasets [00:15:21].
  3. Plan:

    • Identify different models suitable for the task [00:15:34].
    • Generate synthetic datasets and experiment with them [00:15:36].
    • Fine-tune models and optimize resource allocation and cost [00:15:41].
  4. Execute:

    • Trigger the data flywheel cycle [00:15:48].
    • Set up a regular cadence to track accuracy and latency, and to monitor performance and production logs (see the sketch after this list) [00:15:53].
    • Manage the end-to-end GenAI Ops pipeline [00:16:05].
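A hedged sketch of such a monitoring cadence; `fetch_logs`, `evaluate_accuracy`, and `alert` are assumed hooks into whatever logging and evaluation stack is in place:

```python
import statistics
import time

def monitoring_cycle(fetch_logs, evaluate_accuracy, alert,
                     target_accuracy=0.9, interval_s=24 * 3600):
    # Run on a fixed cadence: pull production logs, track accuracy and
    # latency, and alert (triggering the flywheel) when accuracy drifts.
    while True:
        logs = fetch_logs()                    # production inference logs
        acc = evaluate_accuracy(logs)          # e.g. LLM-as-judge spot checks
        latencies = [entry["latency_s"] for entry in logs]
        p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
        if acc < target_accuracy:              # illustrative drift threshold
            alert(f"accuracy dropped to {acc:.0%}; trigger the flywheel")
        print({"accuracy": acc, "p95_latency_s": p95})
        time.sleep(interval_s)
```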

This framework helps in developing and optimizing AI agents and maintaining their effectiveness over time [00:16:12].