From: aidotengineer

When building AI agents and putting them into production, it is crucial to evaluate their performance to ensure they function effectively in real-world scenarios [00:00:53]. This evaluation matters even at a leadership level, where stakeholders need to understand whether deployed agents are working as intended [00:01:10].

While many discussions focus on text-based agents, the “next frontier” is voice AI, which is already revolutionizing call centers worldwide [00:01:48]. Examples like the Priceline Penny bot demonstrate real-world applications where users can book entire vacations hands-free using voice [00:02:13]. This shift to multimodal agents demands modality-specific evaluation methods, since each modality introduces additional considerations [00:02:28].

Components of an AI Agent

Regardless of the specific framework used (e.g., LangGraph, CrewAI, LlamaIndex Workflows), AI agents are typically built with common architectural patterns [00:03:20]. The three primary components are:

  1. Router [00:03:03]
  2. Skills [00:03:12]
  3. Memory [00:03:14]

These components have different evaluation requirements [00:03:44].

The Router

The router acts like the “boss” of an AI agent, deciding the next step an agent will take [00:03:03]. Its primary goal is to determine which specific “skill” to call based on a user’s query [00:04:16].

Example: In an e-commerce agent, a user query like “I want to make a return” or “Are there any discounts?” is funneled to the router [00:04:06]. The router then decides whether to call a skill related to customer service, discounts, or product suggestions [00:04:19].

A router will not always make the correct decision, and its accuracy is vital because it dictates the subsequent path within the agent [00:04:40]. As an application grows, an agent can involve multiple router calls, with the agent repeatedly deciding its next action [00:06:41].
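To make this concrete, here is a minimal router sketch using OpenAI function calling. This is one common implementation choice, not something the source prescribes; the skill names and parameters below are hypothetical, mirroring the e-commerce example above.

```python
# Hedged sketch: a router implemented via OpenAI function calling.
# Skill names/parameters are hypothetical, not from the source talk.
import json
from openai import OpenAI

client = OpenAI()

SKILLS = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search the product catalog for items to suggest.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "material": {"type": "string"},
                    "max_price": {"type": "number"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_service",
            "description": "Handle returns, refunds, and order issues.",
            "parameters": {
                "type": "object",
                "properties": {"issue": {"type": "string"}},
                "required": ["issue"],
            },
        },
    },
]

def route(user_query: str) -> tuple[str, dict]:
    """Ask the model which skill to call and with what parameters."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}],
        tools=SKILLS,
        tool_choice="auto",
    )
    call = resp.choices[0].message.tool_calls[0]  # sketch: assumes a tool was chosen
    return call.function.name, json.loads(call.function.arguments)

print(route("I want to make a return"))  # expect ("customer_service", {...})
```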

Skills

Skills are the actual logical chains that perform the work requested by the user [00:03:12]. Once the router calls a specific skill, the agent executes a flow of actions to fulfill the user’s request [00:05:02]. These actions can involve tool calls, API calls, or LLM calls [00:05:09].

Example: If a user asks for “the best leggings to buy,” the router might call a “product search” skill [00:04:54]. This skill then executes a flow that might involve running a SQL query to collect product data and passing it to a data analyzer skill [00:07:03].
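A hedged sketch of that skill flow follows: a SQL query collects product rows, then a second step analyzes them. The database file, table, and columns are hypothetical; the analyzer is a plain code stub standing in for what could be another LLM call.

```python
# Sketch of the "product search" skill flow described above.
# Schema (products: name, price, material, rating) is an assumption.
import sqlite3

def product_search_skill(material: str, max_price: float) -> str:
    conn = sqlite3.connect("shop.db")  # hypothetical database
    rows = conn.execute(
        "SELECT name, price FROM products "
        "WHERE material = ? AND price <= ? ORDER BY rating DESC LIMIT 5",
        (material, max_price),
    ).fetchall()
    conn.close()
    # Hand the raw rows to a downstream "data analyzer" step.
    return analyze(rows)

def analyze(rows: list[tuple]) -> str:
    if not rows:
        return "No matching products found."
    name, price = rows[0]
    return f"Top pick: {name} at ${price:.2f} ({len(rows)} candidates)."
```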

Memory

Memory is a crucial component that stores the context of previous interactions [00:03:14]. Since most agent interactions are multi-turn conversations, memory prevents the agent from forgetting what was previously said [00:05:24]. It maintains the agent’s state throughout the conversation [00:05:35], which makes memory management a vital part of building effective agents.
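A minimal sketch of such multi-turn memory, assuming a simple append-only message list (a production agent would persist and trim this):

```python
# Hedged sketch: conversation memory as an append-only turn list, so
# earlier context survives into later router and skill calls.
from dataclasses import dataclass, field

@dataclass
class ConversationMemory:
    turns: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list[dict]:
        return list(self.turns)  # passed back into each LLM call

memory = ConversationMemory()
memory.add("user", "I want to make a return")
memory.add("assistant", "Sure, which order is it for?")
memory.add("user", "The leggings from last week")  # resolvable only via memory
```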

Evaluating Agent Components

Every step within an AI agent’s flow presents an opportunity for error, necessitating comprehensive evaluation [00:07:52].

Evaluating the Router

Teams should focus on whether the router calls the correct skill with the right parameters [00:07:58]. For example, if a user asks for leggings but is routed to customer service, the router made an incorrect decision [00:08:07]. Ensuring the router correctly identifies the appropriate skill and passes necessary details (e.g., material type, cost range for product search) is key [00:08:47].
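One straightforward way to run this check, sketched under the assumption that you maintain a labeled test set of queries (the cases below are hypothetical, and `route_fn` is any router like the one sketched earlier):

```python
# Hedged sketch: score the router against labeled queries, counting a case
# correct only if both the skill and the expected parameters match.
TEST_CASES = [
    {"query": "I want to make a return", "skill": "customer_service"},
    {"query": "best leggings under $50", "skill": "product_search",
     "params": {"max_price": 50}},
]

def evaluate_router(route_fn) -> float:
    correct = 0
    for case in TEST_CASES:
        skill, params = route_fn(case["query"])
        skill_ok = skill == case["skill"]
        # Parameters count only if every expected key/value was extracted.
        params_ok = all(params.get(k) == v
                        for k, v in case.get("params", {}).items())
        correct += skill_ok and params_ok
    return correct / len(TEST_CASES)

print(f"router accuracy: {evaluate_router(route):.0%}")
```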

Evaluating Skills

Evaluating skills is complex due to their multi-component nature [00:09:38]. For a Retrieval-Augmented Generation (RAG) type skill, evaluations include the relevance of the retrieved chunks and the correctness of the final generated answer [00:09:43]. Skills can also utilize LLM-as-a-judge evaluations or code-based evaluations [00:10:00].
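As one possible shape for the LLM-as-a-judge approach mentioned above, here is a minimal sketch; the prompt wording and the binary correct/incorrect grading scheme are assumptions, not the source’s prescription:

```python
# Hedged sketch: grading a RAG skill's answer with an LLM judge.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG skill's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with exactly one word: "correct" or "incorrect"."""

def judge(question: str, context: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content.strip().lower() == "correct"
```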

Evaluating the Agent Path (Convergence)

One of the most challenging aspects to evaluate is the “path” the agent takes, also known as convergence [00:10:11]. Ideally, when the same skill is called multiple times, it should consistently take a similar, succinct number of steps to complete the task [00:10:20].

However, different models (e.g., OpenAI vs. Anthropic) can lead to wildly different numbers of steps for the same task [00:10:50]. The goal is to ensure reliability and consistency in the number of steps an agent takes to complete a task [00:11:04].
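A simple way to quantify this, sketched under the assumption that your tracing setup lets you represent each run as a list of step names:

```python
# Hedged sketch: run the same task several times, count the steps in each
# trace, and report variance. Zero stdev means fully consistent paths.
import statistics

def convergence_report(traces: list[list[str]]) -> dict:
    lengths = [len(t) for t in traces]
    return {
        "min_steps": min(lengths),
        "max_steps": max(lengths),
        "mean_steps": statistics.mean(lengths),
        "stdev_steps": statistics.pstdev(lengths),
    }

runs = [
    ["router", "product_search", "respond"],
    ["router", "product_search", "respond"],
    ["router", "product_search", "data_analyzer", "respond"],
]
print(convergence_report(runs))
```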

Evaluating Voice AI Agents

Voice applications are among the most complex applications ever built and require additional evaluation pieces [00:11:54]. Beyond the text transcription, the audio chunk itself needs to be evaluated [00:12:06]. Evaluations should be defined for user intent, speech quality, and speech-to-text accuracy of the audio chunks [00:12:53].
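Of those three, speech-to-text accuracy is the most mechanical to evaluate; a common metric is word error rate (WER), sketched below with a standard edit-distance computation. Intent and speech-quality evaluations would typically use LLM- or model-based judges instead.

```python
# Hedged sketch: speech-to-text accuracy as word error rate (WER),
# i.e., word-level Levenshtein distance over the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("book me a flight to Boston",
                      "book me a fight to Boston"))  # 0.166...
```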

Comprehensive Evaluation Approach

To effectively debug and optimize AI agents, evaluations should be integrated throughout the application, not just at a single layer [00:14:40]. This means having evaluations at the router level, skill level, and throughout the entire flow [00:14:48]. This allows for pinpointing exactly where an issue occurred within the agent’s execution trace [00:14:52].
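One way to structure this layering, sketched with a hypothetical span format (observability tools expose similar trace data): register evaluators per level, score every span in a trace, and read off exactly which step failed.

```python
# Hedged sketch: evaluators attached at router, skill, and trace level,
# applied span-by-span so failures can be pinned to an exact step.
EVALUATORS = {
    "router": [lambda s: s["chosen_skill"] == s.get("expected_skill")],
    "skill": [lambda s: s["output"] is not None],
    "trace": [lambda s: s["num_steps"] <= 10],
}

def evaluate_trace(spans: list[dict]) -> list[tuple[str, bool]]:
    results = []
    for span in spans:
        for check in EVALUATORS.get(span["level"], []):
            results.append((span["name"], check(span)))
    return results

spans = [
    {"level": "router", "name": "route#1",
     "chosen_skill": "product_search", "expected_skill": "product_search"},
    {"level": "skill", "name": "product_search", "output": "Top pick: ..."},
    {"level": "trace", "name": "full_run", "num_steps": 4},
]
print(evaluate_trace(spans))  # each (step, passed) pair localizes failures
```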