From: aidotengineer
Evaluating AI agents and assistants is crucial for ensuring they function correctly in the real world once deployed into production [00:00:53] [00:01:00]. While many discussions focus on building agents and the tools available, understanding their performance post-production is equally important [00:00:47] [00:00:57].
The Rise of Voice and Multimodal Agents
Historically, discussions have often centered on text-based agents like chatbots [00:01:38] [00:01:41]. The next frontier, however, is voice AI, which is already revolutionizing call centers: voice APIs handle over a billion calls worldwide [00:01:48] [00:02:10].
Modern applications are moving beyond text alone to multimodal agents, which combine different input and output modalities [00:02:28] [00:02:29]. One example is the Priceline Penny bot, a real production application that lets users book an entire vacation hands-free, without typing [00:02:13] [00:02:23]. Evaluating these multimodal and voice AI agents requires specific approaches beyond traditional agent evaluation [00:02:33] [00:02:45].
Core Components of an AI Agent
Regardless of the framework (e.g., LangGraph, CrewAI, LlamaIndex Workflow), AI agents typically consist of common components:
- Router: Acts as the “boss,” deciding the agent’s next step [00:03:04] [00:03:07]. For instance, in an e-commerce agent, it directs a user query (like “I want to make a return” or “Are there any discounts?”) to the appropriate skill [00:04:06] [00:04:16].
- Skills: These are the actual logical chains that perform the work, often involving LLM calls or API calls [00:03:12] [00:05:09].
- Memory: Stores the context of multi-turn conversations, preventing the agent from forgetting previous interactions [00:03:16] [00:05:24].
These components can be observed through “traces,” which reveal an agent’s inner workings and execution flow and help engineers troubleshoot [00:05:58] [00:06:09].
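To make the pieces concrete, here is a minimal, framework-agnostic sketch of the three components plus a hand-rolled trace log. The class names, skill names, and keyword-based router are illustrative stand-ins, not any specific framework's API; in a real agent the router would be an LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, component: str, detail: str) -> None:
        self.steps.append((component, detail))

@dataclass
class Agent:
    skills: dict                                   # skill name -> handler
    memory: list = field(default_factory=list)     # multi-turn context
    trace: Trace = field(default_factory=Trace)

    def route(self, query: str) -> str:
        # Router: decide the next step. A real agent would make an LLM call;
        # a keyword rule stands in for it here.
        return "returns" if "return" in query.lower() else "discounts"

    def run(self, query: str) -> str:
        self.memory.append(query)
        skill = self.route(query)
        self.trace.record("router", f"routed to '{skill}'")
        answer = self.skills[skill](query)
        self.trace.record("skill", f"'{skill}' returned an answer")
        return answer

agent = Agent(skills={
    "returns": lambda q: "Sure, let's start your return.",
    "discounts": lambda q: "Here are today's discounts.",
})
print(agent.run("I want to make a return"))
print(agent.trace.steps)  # the execution flow an engineer would inspect
```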
Evaluating AI Agent Components
Every step within an agent’s operation is a potential point of failure, necessitating thorough evaluation [00:07:52].
Router Evaluation
For routers, the primary concern is whether it called the right skill with the right parameters [00:07:58] [00:08:47]. If a user asks for leggings but is routed to customer service or discounts, the router has failed [00:08:07] [00:08:14]. Teams should evaluate the router’s control flow and ensure it correctly passes arguments like material type or cost range [00:08:33] [00:09:06].
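A router eval can be run as a small labeled test set: each case records the skill and arguments the router should have produced, and the check compares them against what it actually called. The sketch below uses hypothetical names (`route_query`, `product_search`) and a stubbed router in place of a real LLM function-calling step.

```python
# Minimal router eval: did the router pick the right skill with the right arguments?

def route_query(query: str) -> dict:
    # Stand-in for the real router (normally an LLM function-calling step).
    return {"skill": "product_search",
            "args": {"item": "leggings", "material": None, "max_price": 50}}

test_cases = [
    {"query": "Do you have any leggings under $50?",
     "expected_skill": "product_search",
     "expected_args": {"item": "leggings", "max_price": 50}},
]

for case in test_cases:
    call = route_query(case["query"])
    skill_ok = call["skill"] == case["expected_skill"]
    args_ok = all(call["args"].get(k) == v
                  for k, v in case["expected_args"].items())
    print(f"{case['query']!r} | skill correct: {skill_ok} | args correct: {args_ok}")
```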
Skill Evaluation
Evaluating skills is more complex due to multiple internal components [00:09:39] [00:09:41]. Key metrics include:
- Relevance: Especially for RAG (Retrieval Augmented Generation) type skills, evaluating the relevance of the pulled information chunks [00:09:43] [00:09:51].
- Correctness: The accuracy of the generated answer [00:09:52] [00:09:55].
- Evaluation methods: LLM-as-a-judge evals or code-based evals can be used to assess skill performance [00:10:00] [00:10:04].
- Convergence: Evaluating the agent’s path consistency and the number of steps it takes to complete a task [00:10:15] [00:11:07]. The goal is succinctness and reliability in the number of steps, as different LLMs (e.g., OpenAI vs. Anthropic) can lead to vastly different path lengths for the same skill [00:11:04] [00:10:52].
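These ideas can be sketched in a few lines: an LLM-as-a-judge check for correctness (the judge call is stubbed here; swap in whichever LLM client you use) and a convergence measure over the step counts of repeated runs. The run data is fabricated for illustration.

```python
import statistics

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with exactly one word: "correct" or "incorrect"."""

def judge_llm(prompt: str) -> str:
    return "correct"   # placeholder; replace with a real LLM call

def eval_correctness(question: str, context: str, answer: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    return verdict.strip().lower() == "correct"

print("correct:", eval_correctness(
    "Are there any discounts today?",
    "Current promotions: 20% off outerwear.",
    "Yes, outerwear is 20% off today."))

# Convergence: how many steps each of several runs of the same skill took.
# Fewer steps and less spread is better; different LLM backends can diverge widely.
run_step_counts = [4, 4, 5, 4, 9]
print("mean steps:", statistics.mean(run_step_counts))
print("step spread (stdev):", round(statistics.stdev(run_step_counts), 2))
```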
Specifics of Voice Application Evaluation
Voice applications are among the most complex to deploy and require additional evaluation considerations [00:11:54] [00:11:59]. Beyond the text or transcript, the audio chunk itself needs to be evaluated [00:12:06].
Key aspects for voice-first AI evaluation include:
- User Sentiment: Assessing the user’s emotional state [00:12:30].
- Speech-to-text Transcription Accuracy: Verifying the correctness of the transcript generated from the audio [00:12:31] [00:12:34].
- Tone Consistency: Ensuring a consistent tone throughout the conversation [00:12:36].
- Intent and Speech Quality: Defining and evaluating these metrics specifically for audio [00:12:53] [00:12:57].
The challenge arises because the transcript is often generated only after the audio chunk is sent, adding a new dimension to evaluation [00:12:19] [00:12:25].
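As one concrete example, transcription accuracy can be scored with word error rate (WER) against a reference transcript. The sketch below implements WER directly so it has no dependencies; the transcripts are made up, and sentiment, tone, and speech quality would each need their own evaluators, often on the audio itself.

```python
# Word error rate (WER) between a reference transcript and the speech-to-text output.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "I want to book a flight to Boston next Friday"
hypothesis = "I want to book a flight to Austin next Friday"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # one substituted word
```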
Multi-Layered Evaluation in Practice
Effective evaluation involves setting up metrics throughout the entire application flow, not just at a single layer [00:14:40] [00:14:48]. This allows for precise debugging if an issue arises, pinpointing whether it occurred at the router level, skill level, or elsewhere in the flow [00:14:52] [00:14:59].
For example, a co-pilot feature might be evaluated at multiple points during a user’s interaction:
- An overall evaluation to check if the generated response to a search query was correct [00:14:05].
- Evaluation to confirm the correct router was selected and the right arguments were passed to it [00:14:21].
- Finally, an evaluation to ensure the task or skill was completed correctly during its execution [00:14:32].
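Putting this together, a multi-layered eval run scores each layer of the flow separately, so a failure points at the layer to debug first. The record fields and layer names below are illustrative, not a specific tool's schema.

```python
# One eval per layer of the flow; the first failing layer is where debugging starts.

def eval_router(record) -> bool:
    return record["routed_skill"] == record["expected_skill"]

def eval_skill(record) -> bool:
    return record["skill_completed"] and record["retrieved_docs"] > 0

def eval_response(record) -> bool:
    return record["response_correct"]   # e.g. the verdict of an LLM-as-a-judge eval

record = {
    "routed_skill": "search", "expected_skill": "search",
    "skill_completed": True, "retrieved_docs": 3,
    "response_correct": False,
}

layers = [("router", eval_router), ("skill", eval_skill), ("response", eval_response)]
for name, check in layers:
    print(f"{name:8s} {'pass' if check(record) else 'FAIL'}")
# Here the router and skill layers pass and the response layer fails,
# so debugging starts at response generation rather than at routing.
```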