From: aidotengineer

Evaluating AI agents and assistants is a critical topic for ensuring their effectiveness in real-world applications [00:00:51]. While much attention is given to building agents and the tools available, it is equally important to evaluate them once they are in production [00:00:53]. This process helps confirm that agents function as intended and provide reliable performance [00:01:00].

Types of AI Agents and Evaluation Considerations

Initially, many discussions revolved around text-based agents, such as chatbots [00:01:38]. However, the field is rapidly advancing to include voice AI [00:01:48] and multimodal agents [00:02:29]. Voice AI is already transforming call centers, handling billions of calls globally [00:01:52]. An example is the Priceline Penny bot, a real production application that lets users book an entire vacation hands-free using voice [00:02:13].

The method of evaluating these agents changes based on their modality:

  • Text-based agents require specific evaluation techniques [00:02:34].
  • Voice agents necessitate additional evaluations specific to voice [00:02:39].
  • Multimodal agents require considering further types of evaluations [00:02:43].

Components of AI Agents

Regardless of the framework used (e.g., LangGraph, CrewAI, LlamaIndex Workflows) [00:03:25], common patterns define the components of an agent [00:03:38]; a minimal code sketch of these components follows the list:

  1. Router: This component acts as the “boss,” deciding the next step an agent will take [00:03:03]. For instance, in an e-commerce agent, a router funnels a user query (e.g., “I want to make a return,” “Are there any discounts?”) to the appropriate skill [00:04:06].
  2. Skills: These are the logical chains that perform the actual work [00:03:12]. A skill flow of execution might involve LLM calls or API calls to complete a user’s request [00:05:02].
  3. Memory: This component stores past interactions and context, enabling multi-turn conversations and ensuring the agent remembers previous statements [00:03:16].
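
To make these roles concrete, below is a minimal sketch in Python of how a router, skills, and memory might fit together. The keyword-based router and the two skill functions are illustrative stand-ins for the LLM and API calls a real agent would make; none of the names come from a specific framework.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Stores past turns so the agent can hold a multi-turn conversation."""
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

def returns_skill(query: str, memory: Memory) -> str:
    """Skill: a logical chain (LLM and/or API calls) that does the actual work."""
    return f"Starting a return for: {query}"

def discounts_skill(query: str, memory: Memory) -> str:
    return "Here are today's discounts..."

SKILLS = {"returns": returns_skill, "discounts": discounts_skill}

def router(query: str) -> str:
    """Router: decides which skill handles the query. In production this is
    usually an LLM function/tool call; a keyword stub keeps the sketch runnable."""
    return "returns" if "return" in query.lower() else "discounts"

def run_agent(query: str, memory: Memory) -> str:
    memory.add("user", query)
    answer = SKILLS[router(query)](query, memory)
    memory.add("assistant", answer)
    return answer

if __name__ == "__main__":
    memory = Memory()
    print(run_agent("I want to make a return", memory))
```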

Example: An Agent’s Inner Workings (Traces)

To understand how these components interact, one can examine agent “traces,” which show the inner workings of an agent’s execution [00:05:58]. For example, when a code-based agent is asked about “trace latency trends” [00:06:16], the execution might unfold as follows (a structured sketch appears after the list):

  • The router first determines how to tackle the question [00:06:30].
  • It might make a tool call (skill) to run a SQL query and collect application traces [00:06:55].
  • The router might then call a second skill, like a data analyzer, to process the collected data [00:07:11].
  • Throughout this process, memory stores everything that occurs [00:07:30].
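
Conceptually, such a trace can be viewed as a tree of spans, one per step, which is what evaluation tooling typically consumes. The span names and fields below are assumptions made for illustration, not any particular tracing schema.

```python
# An illustrative trace for the "trace latency trends" question above.
trace = {
    "span": "agent",
    "input": "What are the trace latency trends?",
    "children": [
        {"span": "router", "decision": "run_sql_query"},
        {"span": "skill:run_sql_query", "output": "raw application traces"},
        {"span": "router", "decision": "data_analyzer"},
        {"span": "skill:data_analyzer", "output": "latency trend summary"},
        {"span": "memory.write", "stored": "all intermediate steps"},
    ],
}

def print_trace(node: dict, depth: int = 0) -> None:
    """Walk the trace tree and print each step in execution order."""
    print("  " * depth + node["span"])
    for child in node.get("children", []):
        print_trace(child, depth + 1)

print_trace(trace)
```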

How to Evaluate AI Agents

Every step within an agent’s flow is a potential point of failure, necessitating distinct evaluation strategies [00:07:52].

Evaluating the Router

For routers, the primary concern is whether it called the correct skill [00:07:57]. If a user asks for “leggings” but is routed to “customer service” or “discounts,” the router failed [00:08:07]. Evaluation should confirm the router correctly calls the right skill with the appropriate parameters, ensuring the agent passes the correct information (e.g., material type, cost range for a product search) [00:08:44].
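
A simple router evaluation can be expressed as a comparison between the tool call the router actually made and a labeled expectation. The sketch below assumes a hypothetical product_search skill and parameter names; it checks both the skill choice and the extracted parameters.

```python
def eval_router_call(actual: dict, expected: dict) -> dict:
    """Compare the router's actual tool call against a labeled expectation."""
    wrong_params = {k: v for k, v in expected["params"].items()
                    if actual["params"].get(k) != v}
    return {
        "correct_skill": actual["skill"] == expected["skill"],
        "correct_params": not wrong_params,
        "missing_or_wrong_params": wrong_params,
    }

# User asked for leggings under $50; the router chose the right skill but
# dropped the cost constraint, which this eval surfaces.
actual = {"skill": "product_search", "params": {"item": "leggings"}}
expected = {"skill": "product_search",
            "params": {"item": "leggings", "max_price": 50}}
print(eval_router_call(actual, expected))
```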

Evaluating Skills

Skill evaluation is complex due to the varied components within a skill [00:09:39]. For a Retrieval-Augmented Generation (RAG) skill, evaluations might include the following (two of these checks are sketched after the list):

  • Relevance of pulled chunks: Assessing if the retrieved information is pertinent [00:09:46].
  • Correctness of the generated answer: Verifying the accuracy of the agent’s output [00:09:52].
  • LLM-as-a-judge evaluations: Using one LLM to evaluate the output of another [00:10:00].
  • Code-based evaluations: Running specific code to assess skill performance [00:10:03].
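
The sketch below shows one way the LLM-as-a-judge and code-based checks might look. The judge prompt, the label set, and the call_llm callable are assumptions standing in for whatever client and grading rubric are actually used.

```python
JUDGE_TEMPLATE = """You are grading a retrieval system.
Question: {question}
Retrieved chunk: {chunk}
Answer only "relevant" or "irrelevant"."""

def eval_chunk_relevance(question: str, chunks: list[str], call_llm) -> float:
    """LLM-as-a-judge: fraction of retrieved chunks the judge labels relevant."""
    verdicts = [call_llm(JUDGE_TEMPLATE.format(question=question, chunk=c))
                for c in chunks]
    labels = [v.strip().lower().startswith("relevant") for v in verdicts]
    return sum(labels) / len(labels) if labels else 0.0

def eval_answer_has_citation(answer: str) -> bool:
    """Code-based eval: a deterministic check that needs no LLM at all."""
    return "[" in answer and "]" in answer
```

A code-based check like the citation test is cheap and deterministic, while the judge catches semantic problems that string matching cannot.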

Evaluating the Agent’s Path (Convergence)

One of the most challenging aspects to evaluate is the path an agent takes to complete a task [00:10:14]. Ideally, an agent should consistently converge, taking a reliable number of steps (e.g., five or six) to query, use parameters, and execute skills to generate the correct answer [00:10:20]. Different LLM providers (e.g., OpenAI, Anthropic) can lead to vastly different numbers of steps for the same skill [00:10:50]. The goal is to ensure succinctness and reliability in the number of steps an agent takes to consistently complete a task [00:11:04].
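
One practical way to measure convergence is to run the same task repeatedly, record how many steps each run took, and look at the spread. The step counts below are invented for illustration.

```python
from statistics import mean, pstdev

# Steps taken on repeated runs of the same task (illustrative numbers).
step_counts = [5, 6, 5, 9, 5]

summary = {
    "mean_steps": mean(step_counts),
    "stdev_steps": round(pstdev(step_counts), 2),
    "worst_case": max(step_counts),
}
print(summary)
# A low standard deviation with a worst case close to the mean suggests the
# agent converges reliably; a long tail (like the 9 above) flags runs where
# it wandered before completing the task.
```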

Evaluating Voice AI Applications

Voice agents are among the most complex applications being deployed today [00:11:54]. Evaluating them requires additional considerations beyond the text or transcript alone [00:11:59]; the audio chunk itself also needs to be evaluated [00:12:10]. This includes the following, one of which is sketched after the list:

  • User sentiment [00:12:30].
  • Speech-to-text transcription accuracy [00:12:31].
  • Consistency of tone throughout the conversation [00:12:36].
  • Specific evaluations for intent, speech quality, and speech-to-text accuracy of audio chunks [00:12:50].
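
As one concrete example, speech-to-text accuracy is commonly measured as word error rate (WER) against a reference transcript; the sketch below implements that metric directly. Sentiment and tone consistency would typically be separate judge- or classifier-based evaluations run over the transcript or the audio itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("book a hotel in rome", "book a hotel in roam"))  # 0.2
```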

Example: Evaluating a Co-pilot Agent

One approach to implementing evaluation platforms for AI agents is to dogfood one's own tools. For instance, the speaker's company, Arize, uses its own platform to evaluate its co-pilot [00:13:48]. This co-pilot assists users within the product with debugging, summarization, and natural language search [00:13:27].

For this co-pilot, evaluations are run at every single step of its traces in the wild [00:13:57], as sketched after the list:

  • An overall evaluation assesses the correctness of the generated response (e.g., for a search query) [00:14:05].
  • Evaluations check whether the search router picked the correct skill and passed the right arguments [00:14:21].
  • Finally, evaluations ensure the task or skill was completed correctly during execution [00:14:32].
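
A layered setup like this can be sketched as a registry that maps each span type in a trace to its own checks, so a failure can be pinned to the router, a skill, or the final response. The span types, fields, and checks below are illustrative assumptions, not Arize's actual implementation.

```python
# Each span type gets its own list of checks.
EVALS = {
    "router": [lambda s: s["chosen_skill"] == s["expected_skill"]],
    "skill": [lambda s: s["status"] == "completed"],
    "response": [lambda s: len(s["output"]) > 0],
}

def evaluate_trace(spans: list[dict]) -> list[dict]:
    """Run the registered checks for every span and record pass/fail per layer."""
    return [{"span": span["name"], "passed": bool(check(span))}
            for span in spans
            for check in EVALS.get(span["type"], [])]

spans = [
    {"type": "router", "name": "search_router",
     "chosen_skill": "nl_search", "expected_skill": "nl_search"},
    {"type": "skill", "name": "nl_search", "status": "completed"},
    {"type": "response", "name": "final_answer", "output": "Here are your traces..."},
]
print(evaluate_trace(spans))
```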

The key takeaway is that evaluations should not be limited to a single layer but should be present throughout the entire application flow [00:14:40]. This enables precise debugging to identify if issues originate at the router level, skill level, or elsewhere in the agent’s process [00:14:52].