From: aidotengineer

Evaluating AI agents and assistants is crucial to ensuring they function correctly in the real world once deployed into production [00:00:53]. While many discussions focus on building agents and the tools available for doing so, understanding how to evaluate their performance is equally vital for all stakeholders, including leadership [00:01:08].

Evolution of AI Agents

Initially, many AI discussions centered on text-based agents such as chatbots [00:01:38]. However, the “next frontier” is voice AI, which is already revolutionizing call centers globally, with over 1 billion calls handled by voice APIs [00:01:48]. A real-world example is Priceline’s Penny bot, a voice-activated travel agent that lets users book an entire vacation hands-free [00:02:13].

This shift means conversations are no longer just about text-based agents but also multimodal agents [00:02:29]. Evaluating these types of agents requires specific considerations beyond standard agent evaluation, especially for voice and multimodal interactions [00:02:39].

Components of an AI Agent

Regardless of the framework (e.g., LangGraph, CrewAI, LlamaIndex), AI agents are typically built from a few common components [00:03:25]; a minimal sketch of how they fit together follows the list:

  • Router: This component acts as the “boss” [00:03:52], deciding the agent’s next step [00:03:04]. For an e-commerce agent, it funnels a user query (e.g., “I want to make a return”) to the appropriate skill [00:04:12].
  • Skills: These are the logical chains that perform the actual work [00:03:12]. A skill flow might involve a series of LLM calls or API calls to execute a user’s request [00:05:02].
  • Memory: This component stores the context of multi-turn conversations, ensuring the agent remembers previous interactions and maintains state [00:03:16].
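To make these pieces concrete, here is a minimal sketch of how a router, skills, and memory might fit together. The names (Agent, Skill, route) are illustrative and not tied to LangGraph, CrewAI, LlamaIndex, or any other framework, and the keyword-based router stands in for what would normally be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A "skill" is just a named callable that does the actual work
# (in a real agent it would wrap a chain of LLM calls or API calls).
Skill = Callable[[str], str]

@dataclass
class Agent:
    skills: Dict[str, Skill]                           # skill name -> implementation
    memory: List[dict] = field(default_factory=list)   # multi-turn conversation state

    def route(self, query: str) -> str:
        """Router: decide which skill should handle the query.
        A keyword lookup stands in for an LLM-based router here."""
        return "returns" if "return" in query.lower() else "general_qa"

    def run(self, query: str) -> str:
        skill_name = self.route(query)           # the router picks the skill
        answer = self.skills[skill_name](query)  # the skill does the work
        # Memory keeps the state of the conversation across turns.
        self.memory.append({"query": query, "skill": skill_name, "answer": answer})
        return answer

agent = Agent(skills={
    "returns": lambda q: "Starting the return process...",
    "general_qa": lambda q: "Here is some product information...",
})
print(agent.run("I want to make a return"))  # routed to the "returns" skill
```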

Example of Agent Workflow

An example of an agent’s inner workings, often visualized as “traces,” shows how these components interact [00:05:54]. If a user asks, “What trends do you see in my trace latency?”, the router first makes a tool call that runs a SQL query to collect the application’s traces [00:07:00]. Control then returns to the router, which calls a second skill, a data analyzer, to process that data [00:07:11]. All of these steps are stored in memory [00:07:30].
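One way to picture such a trace is as an ordered list of spans, one per step, which is also what the evaluations described below run against. The field names here are illustrative rather than taken from any particular tracing tool.

```python
# Hypothetical trace of the latency-analysis example: each span records one
# step the agent took, in order, so every step can be evaluated separately.
trace = [
    {"span": "router",        "action": "tool_call", "tool": "run_sql_query",
     "input": "What trends do you see in my trace latency?"},
    {"span": "run_sql_query", "action": "skill",     "output_rows": 1245},
    {"span": "router",        "action": "tool_call", "tool": "data_analyzer"},
    {"span": "data_analyzer", "action": "skill",
     "output": "Latency has been trending upward over the past week."},
]

# The same steps are what get written into the agent's memory,
# so later turns can refer back to them.
for step in trace:
    print(step["span"], "->", step["action"])
```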

Evaluation Points within the Agent Workflow

Every step of an agent’s workflow is a potential point of failure and thus requires evaluation [00:07:52].

1. Router Evaluation

For routers, the primary concern is whether the router called the right skill and passed the correct parameters [00:07:58]. For instance, if a user asks for leggings but is routed to customer service or discount information, the router has failed [00:08:07]. Teams must evaluate the router’s control flow to ensure it consistently directs queries to the intended skill with the appropriate arguments [00:08:41].
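A straightforward, code-based way to check this is to compare the skill the router actually selected (and the arguments it passed) against a labeled expectation. The helper below is a sketch, not part of any evaluation library.

```python
from typing import Dict, Tuple

def eval_router_call(actual: Dict, expected: Dict) -> Tuple[bool, str]:
    """Return (passed, reason) for a single routed query.

    Both arguments are dicts shaped like:
        {"skill": "product_search", "args": {"category": "leggings"}}
    """
    if actual["skill"] != expected["skill"]:
        return False, f"wrong skill: {actual['skill']} != {expected['skill']}"
    mismatched = {k: v for k, v in expected["args"].items()
                  if actual["args"].get(k) != v}
    if mismatched:
        return False, f"wrong or missing arguments: {mismatched}"
    return True, "router chose the right skill with the right arguments"

# Example: the user asked for leggings but was routed to customer service.
actual = {"skill": "customer_service", "args": {}}
expected = {"skill": "product_search", "args": {"category": "leggings"}}
print(eval_router_call(actual, expected))  # (False, 'wrong skill: ...')
```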

2. Skill Evaluation

Evaluating a skill can be complex due to its many internal components [00:09:38]. Key metrics include:

  • Relevance: For Retrieval-Augmented Generation (RAG)-style skills, whether the retrieved chunks of information are actually relevant to the query [00:09:46].
  • Correctness: Assessing the accuracy of the generated answer [00:09:52].
  • Methodology: Skills can be evaluated using LLM-as-a-judge evaluations or code-based evaluations [00:10:00]; a sketch of the former follows this list.
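As a rough sketch of the LLM-as-a-judge approach for a RAG-style skill, the snippet below asks a judge model to grade relevance and correctness. The call_llm function is a placeholder for whatever LLM client is actually in use; the prompt and labels are illustrative, not a standard rubric.

```python
JUDGE_PROMPT = """You are evaluating a retrieval-augmented answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Reply with exactly two lines:
relevance: relevant | irrelevant    (is the retrieved context on-topic?)
correctness: correct | incorrect    (is the answer supported by the context?)
"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client call here.
    return "relevance: relevant\ncorrectness: correct"

def judge_rag_skill(question: str, context: str, answer: str) -> dict:
    """LLM-as-a-judge: ask a model to grade the skill's retrieval and answer."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    labels = dict(line.split(": ", 1) for line in raw.strip().splitlines())
    return {"relevance": labels.get("relevance"), "correctness": labels.get("correctness")}

print(judge_rag_skill(
    question="What is your return window?",
    context="Returns are accepted within 30 days of purchase.",
    answer="You can return items within 30 days.",
))
```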

3. Path / Convergence Evaluation

One of the most challenging aspects to evaluate is the agent’s path or “convergence” [00:11:14]. This involves ensuring:

  • Consistency: When the same task is run repeatedly (even hundreds of times), the agent should ideally complete it in a consistent, succinct number of steps [00:10:20].
  • Reliability: Different LLM providers (e.g., OpenAI vs. Anthropic) may take wildly different numbers of steps to complete the same skill, which makes evaluating path reliability important [00:10:50].
  • Efficiency: The goal is to ensure the agent completes tasks succinctly and reliably [00:11:04]; one way to score this is sketched after the list.
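Assuming the same task can be re-run many times and the number of steps counted on each run, one simple way to put a number on convergence is to compare every run against the shortest path observed (minimum steps divided by average steps). This scoring is an illustrative choice, not a standard metric.

```python
from statistics import mean

def convergence_score(step_counts: list[int]) -> float:
    """1.0 means every run matched the shortest (most succinct) path;
    lower values mean the agent wandered on some runs."""
    return min(step_counts) / mean(step_counts)

# Hypothetical step counts for the same query, run ten times each
# against two different LLM providers.
provider_a = [3, 3, 4, 3, 3, 3, 4, 3, 3, 3]
provider_b = [3, 7, 5, 9, 4, 6, 3, 8, 5, 7]

print(f"provider A convergence: {convergence_score(provider_a):.2f}")  # ~0.94
print(f"provider B convergence: {convergence_score(provider_b):.2f}")  # ~0.53
```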

Specifics for Voice and Multimodal Applications

Voice applications are among the most complex applications ever built [00:11:55]. Evaluating voice agents requires additional considerations beyond just text:

  • Audio and Text Evaluation: It’s not just the generated transcript that needs evaluation; the audio chunk itself must also be assessed [00:12:06].
  • Post-Audio Processing: Transcripts are often generated after the audio chunk is processed, adding another dimension to evaluation [00:12:22].
  • Voice-Specific Metrics: Beyond the measures used for text, voice interactions call for metrics on the audio itself, such as tone and speech quality; one way to organize these evals is sketched below.
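Since both the audio chunk and the transcript generated from it need to be assessed, one option is to keep them side by side in a single evaluation record so that audio-level and text-level evals each target the right artifact. The fields and metric names below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class VoiceTurnEval:
    """One agent turn in a voice conversation, holding both the raw audio
    and the transcript generated after the audio chunk is processed."""
    audio_path: str     # pointer to the audio chunk itself
    transcript: str     # text generated from the audio
    audio_scores: dict  # evals run on the audio (e.g., tone, speech quality)
    text_scores: dict   # evals run on the transcript (e.g., answer correctness)

turn = VoiceTurnEval(
    audio_path="calls/call-123/turn-07.wav",            # hypothetical location
    transcript="Sure, I can book that hotel for you.",
    audio_scores={"tone": "friendly", "speech_quality": 0.92},
    text_scores={"answer_correctness": "correct"},
)
print(turn.audio_scores, turn.text_scores)
```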

Comprehensive Agent Evaluation

When building AI agents, the goal is to incorporate evaluations throughout the entire application trace [00:14:40]. This allows pinpoint debugging: determining whether an issue occurred at the router level, at the skill level, or elsewhere in the flow [00:14:52]. For example, in a co-pilot, evaluations can be run on the overall response, on the router’s skill selection, on the arguments the router passes along, and on whether the task was completed correctly within the skill execution [00:14:02].
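As a sketch of what trace-wide evaluation might look like in practice, the snippet below attaches a separate evaluator to each level of a hypothetical co-pilot trace (overall response, router choice, router arguments, and skill completion), so a failure can be pinned to the exact step where it happened. All names are illustrative.

```python
# Hypothetical evaluators, one per level of the trace they inspect.
def eval_overall_response(trace):
    return {"overall_correct": bool(trace["response"])}

def eval_router_choice(trace):
    return {"right_skill": trace["router"]["skill"] == "data_analyzer"}

def eval_router_args(trace):
    return {"right_args": "latency" in trace["router"]["args"].values()}

def eval_skill_completion(trace):
    return {"task_completed": trace["skill"]["status"] == "done"}

EVALS = {
    "response": eval_overall_response,
    "router.choice": eval_router_choice,
    "router.args": eval_router_args,
    "skill": eval_skill_completion,
}

def evaluate_trace(trace: dict) -> dict:
    """Run every level's evaluator so failures can be pinpointed to the
    router, the skill, or the final response."""
    return {level: evaluator(trace) for level, evaluator in EVALS.items()}

trace = {
    "router": {"skill": "data_analyzer", "args": {"metric": "latency"}},
    "skill": {"status": "done"},
    "response": "Latency has been trending upward this week.",
}
print(evaluate_trace(trace))
```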