From: aidotengineer
Evaluating AI agents and assistants is a crucial topic, especially when these agents are deployed into production environments. Understanding how they perform in the real world is paramount for their success and reliability [00:00:53]. This includes evaluating not only text-based agents but also more complex multimodal systems [00:02:29].
Evolution of AI Agents
Initially, many discussions revolved around text-based agents, such as chatbots that perform actions and figure out tasks [00:01:38]. However, the field is rapidly advancing beyond this:
- Voice AI: Already transforming call centers, which handle over a billion calls worldwide, voice is becoming a significant frontier [00:01:48]. Real-time voice APIs are enabling agents to revolutionize customer service [00:02:03]. An example is Priceline's Penny bot, a production travel agent that allows users to book entire vacations hands-free [00:02:13].
- Multimodal Agents: Beyond text and voice, agents are increasingly multimodal, requiring specific evaluation strategies due to their complexity [00:02:28].
Core Components of an AI Agent
Regardless of the framework used (e.g., LangGraph, CrewAI, LlamaIndex Workflows), AI agents typically consist of three common patterns, sketched in code after the list below [00:03:25]:
- Router: This component acts like a “boss,” deciding the next step an agent will take [00:03:03]. For e-commerce agents, a router directs user queries (e.g., “I want to make a return,” “Are there any discounts?”) to the appropriate skill [00:04:06].
- Skills: These are the logical chains that perform the actual work [00:03:10]. A skill might involve LLM calls or API calls depending on its implementation [00:05:09].
- Memory: This component stores the context and history of the conversation. It’s crucial for multi-turn interactions to prevent the agent from forgetting previous statements [00:03:14].
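To make these patterns concrete, here is a minimal sketch in plain Python. The names (Memory, route, SKILLS) and the keyword-based router are illustrative stand-ins, not any particular framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Stores conversation context so multi-turn interactions are not forgotten."""
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

def product_search(query: str) -> str:
    """Skill: might involve LLM calls or API calls depending on implementation."""
    return f"search results for {query!r}"

def customer_service(query: str) -> str:
    """Skill: handles returns, discounts, and support questions."""
    return f"support response for {query!r}"

SKILLS = {"product_search": product_search, "customer_service": customer_service}

def route(query: str) -> str:
    """Router: the 'boss' that decides the next step. In production this is
    typically an LLM call; a keyword heuristic stands in here."""
    return "customer_service" if "return" in query.lower() else "product_search"

def run_agent(query: str, memory: Memory) -> str:
    memory.add("user", query)
    skill = SKILLS[route(query)]  # router picks the skill, skill does the work
    answer = skill(query)
    memory.add("assistant", answer)
    return answer
```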
Agent Execution Example: Code-Based Agent Trace
An agent’s inner workings can be observed through “traces,” which show the sequence of actions taken [00:05:58]. For instance, in a code-based agent that responds to a question like “What trends do you see in my trace latency?” [00:06:16]:
- Router Call: The agent first calls a router to decide how to tackle the question [00:06:28]. Multiple router calls can occur as the application grows [00:06:43].
- Tool/Skill Call: The router makes a tool call to run a SQL query, collecting application traces [00:06:55].
- Second Router Call & Skill: It then returns to the router, which calls a second skill—the data analyzer. This skill takes the collected traces and application data to analyze them [00:07:11].
- Memory: Throughout this process, memory stores everything happening behind the scenes [00:07:30].
This example illustrates the interplay of the router, skills, and memory components [00:07:36].
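The trace itself is just a structured record of those steps. A hypothetical representation of the example above (span names and fields are invented for illustration, not a specific tracing schema):

```python
# Hypothetical trace for "What trends do you see in my trace latency?"
trace = [
    {"span": "router", "decision": "run_sql_query",
     "input": "What trends do you see in my trace latency?"},
    {"span": "skill:sql", "tool_call": "run_sql_query",
     "output": "collected application traces"},
    {"span": "router", "decision": "data_analyzer"},
    {"span": "skill:data_analyzer",
     "input": "collected application traces",
     "output": "latency has been trending upward week over week"},
]

# Memory stores everything happening behind the scenes.
memory = {"history": list(trace)}
```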
Evaluating AI System Performance
Every step within an agent’s execution flow is an area where it can go wrong [00:07:52]. Therefore, evaluation needs to be comprehensive and cover each component:
Evaluating the Router
For routers, teams primarily care about whether the agent called the right skill [00:07:56]. If a user asks for “leggings” but is sent to customer service, the router failed [00:08:04]. Key evaluation points (see the sketch after this list) include:
- Correct Skill Selection: Did the router correctly identify and call the appropriate skill (e.g., product search vs. customer service)? [00:08:18]
- Correct Parameter Passing: Did the router pass the correct parameters into the chosen skill (e.g., specific material, cost range for a product search)? [00:08:47]
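One simple offline check for both points is to compare the router's choices against labeled examples. This sketch assumes a `route_fn(query)` that returns the chosen skill and its parameters; the test data is invented:

```python
# Hypothetical labeled cases: expected skill and parameters per query.
TEST_CASES = [
    {"query": "I want to make a return",
     "skill": "customer_service", "params": {}},
    {"query": "cotton leggings under $50",
     "skill": "product_search",
     "params": {"material": "cotton", "max_cost": 50}},
]

def eval_router(route_fn) -> dict:
    """route_fn(query) -> (skill_name, params); signature assumed."""
    skill_hits = param_hits = 0
    for case in TEST_CASES:
        skill, params = route_fn(case["query"])
        skill_hits += skill == case["skill"]    # right skill called?
        param_hits += params == case["params"]  # right parameters passed?
    n = len(TEST_CASES)
    return {"skill_accuracy": skill_hits / n, "param_accuracy": param_hits / n}
```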
Evaluating a Skill
Evaluating a skill is complex due to its many internal components [00:09:35]. For a Retrieval-Augmented Generation (RAG) type of skill, evaluation involves the following (a code sketch follows the list):
- Relevance of Chunks: Assessing the relevance of information chunks pulled by the skill [00:09:46].
- Correctness of Answer: Verifying the correctness of the generated answer [00:09:52].
- LLM as a Judge Evals: Using Large Language Models (LLMs) to evaluate the skill’s output [00:10:00].
- Code-Based Evals: Running code-based evaluations to assess skill performance [00:10:03].
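As an illustration, an LLM-as-a-judge eval for chunk relevance can be a single classification prompt, while answer correctness can start as a code-based exact match. The judge prompt, labels, and model choice below are assumptions, using the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a RAG skill.
Question: {question}
Retrieved chunk: {chunk}
Reply with exactly one word: "relevant" or "irrelevant"."""

def judge_chunk_relevance(question: str, chunk: str) -> str:
    """LLM-as-a-judge eval: labels one retrieved chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, chunk=chunk)}],
    )
    return resp.choices[0].message.content.strip().lower()

def answer_is_correct(answer: str, reference: str) -> bool:
    """Code-based eval: naive exact match against a reference answer."""
    return answer.strip().lower() == reference.strip().lower()
```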
Evaluating Agent Path and Convergence
One of the most challenging aspects to evaluate is the agent’s execution path and its “convergence” [00:10:10]. This refers to the consistency and efficiency of the steps an agent takes to complete a task [00:10:20].
- Consistent Steps: Ideally, an agent should take a similar number of steps (e.g., five or six) each time it completes a task: querying, passing the correct parameters, and calling the necessary skill components [00:10:27].
- Reliability: Different LLM providers (e.g., OpenAI vs. Anthropic) can produce vastly different numbers of steps for the same skill [00:10:52]. The goal is succinctness and reliability in the number of steps an agent takes to consistently complete a task; a measurement sketch follows this list [00:11:04].
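Convergence can be measured empirically: run the same task repeatedly and look at the spread of path lengths. A minimal sketch, assuming a `run_fn(task)` that returns the list of steps the agent took:

```python
from statistics import mean, pstdev

def convergence_report(run_fn, task: str, trials: int = 20) -> dict:
    """Run one task many times and summarize path-length consistency.
    run_fn(task) is assumed to return the list of steps taken."""
    lengths = [len(run_fn(task)) for _ in range(trials)]
    return {
        "min_steps": min(lengths),   # most succinct path observed
        "max_steps": max(lengths),
        "mean_steps": mean(lengths),
        "std_dev": pstdev(lengths),  # lower = more consistent agent
    }

# Comparing providers (e.g., an OpenAI-backed vs. an Anthropic-backed agent)
# is then just one convergence_report call per configuration.
```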
Evaluating Voice AI Applications
Voice applications are among the most complex to deploy and require additional evaluation pieces, listed below with a code sketch after the list [00:11:54]:
- Audio Chunk Evaluation: Beyond just the text transcript, the actual audio chunk needs to be evaluated [00:12:07].
- User Sentiment: Assessing the user’s sentiment from the audio [00:12:30].
- Speech-to-Text Accuracy: Verifying the accuracy of the speech-to-text transcription [00:12:33].
- Tone Consistency: Ensuring the tone remains consistent throughout the conversation [00:12:36].
- Audio-Specific Evals: Defining evaluations specifically for audio chunks, focusing on intent, speech quality, and speech-to-text accuracy [00:12:50].
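A sketch of what an audio-chunk eval might look like, combining speech-to-text with an LLM judge for sentiment and tone. The prompts and label sets are assumptions, and exact-match is a stand-in for a proper word-error-rate metric (e.g., the jiwer package):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def eval_audio_chunk(audio_path: str, expected_transcript: str) -> dict:
    """Evaluate one audio chunk: speech-to-text accuracy, sentiment, and tone."""
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    judge = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            "Classify the speaker's sentiment (positive/neutral/negative) and "
            "tone (friendly/neutral/hostile). Answer as 'sentiment,tone'.\n\n"
            + transcript)}],
    ).choices[0].message.content

    return {
        "transcript": transcript,
        # Exact match is naive; word error rate is the usual metric.
        "transcription_exact_match":
            transcript.strip().lower() == expected_transcript.strip().lower(),
        "sentiment_and_tone": judge,
    }
```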
Comprehensive Evaluation Strategy
A key takeaway for building effective AI agents is to implement evaluations throughout the entire application trace [00:14:44]. This layered approach allows for precise debugging if something goes wrong, identifying whether the issue occurred at the router level, skill level, or elsewhere in the flow [00:14:48].
For example, a company like Arize dogfoods its own tool to evaluate its copilot agent [00:13:48]:
- Overall Response Eval: Evaluating the overall correctness of the response (e.g., for a search question) [00:14:05].
- Router Eval: Assessing if the search router picked the correct route and passed the correct arguments [00:14:21].
- Skill Completion Eval: Checking if the task or skill was completed correctly during execution [00:14:32].
By having evaluations at every step, teams can effectively identify and resolve issues, ensuring the reliability and performance of their AI agents [00:14:48].
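Put together, a layered eval suite attaches one check per level of the trace, so a failure immediately points to the component responsible. The checks below are stubs over the hypothetical trace format from earlier; in practice each would wrap an LLM-as-a-judge or code-based eval:

```python
def eval_overall_response(trace) -> bool:
    """Stub: would judge final-answer correctness (e.g., LLM as a judge)."""
    return bool(trace and trace[-1].get("output"))

def eval_router_choice(trace) -> bool:
    """Stub: would verify the chosen route and its arguments."""
    return any(span["span"] == "router" for span in trace)

def eval_skill_completion(trace) -> bool:
    """Stub: would verify each skill ran to completion."""
    return all("output" in span
               for span in trace if span["span"].startswith("skill"))

def layered_evals(trace) -> dict:
    return {
        "response": eval_overall_response(trace),
        "router": eval_router_choice(trace),
        "skill": eval_skill_completion(trace),
    }

failed = [level for level, ok in layered_evals(trace).items() if not ok]
print("debug at:", failed or "all levels passed")
```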