From: aidotengineer
AI agents and assistants are becoming increasingly prevalent, moving beyond text-based interactions to multimodal capabilities like voice AI [00:01:48]. Once these agents are put into production, they require careful evaluation to ensure they function correctly in the real world [00:00:53]. Understanding the fundamental components of an AI agent is crucial for evaluating, building, and improving them effectively [00:02:55].
Core Components of an Agent
Regardless of the specific framework used (e.g., LangGraph, CrewAI, LlamaIndex Workflow), AI agents typically share three common components [00:03:30]:
- Router [00:03:04]
- Skills [00:03:12]
- Memory [00:03:16]
Different architectures for AI agents might implement these components in various ways, but their underlying functions remain consistent [00:03:21].
Router
The router acts as the “boss” of the agent, deciding the next step an agent will take [00:03:07] [00:03:52]. Its primary goal is to determine which “skill” to call based on the user’s query [00:04:16].
For example, in an e-commerce agent, a user question like “I want to make a return,” “give me an idea of what to go buy,” or “are there any discounts on this?” funnels into the router, which then decides whether to call a customer service skill, a discount suggestion skill, or a product suggestion skill [00:04:06].
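A minimal sketch of this routing pattern follows. The skill functions and the keyword-based dispatch are hypothetical stand-ins; in practice the router is usually an LLM call with tool/function definitions rather than keyword matching.

```python
# Hypothetical skills for the e-commerce example above.
def customer_service_skill(query: str) -> str:
    return f"Starting a return flow for: {query}"

def discount_skill(query: str) -> str:
    return f"Looking up current discounts for: {query}"

def product_suggestion_skill(query: str) -> str:
    return f"Suggesting products related to: {query}"

SKILLS = {
    "customer_service": customer_service_skill,
    "discounts": discount_skill,
    "product_suggestions": product_suggestion_skill,
}

def route(query: str) -> str:
    """Decide which skill to call for a user query (keyword stub)."""
    lowered = query.lower()
    if "return" in lowered:
        skill_name = "customer_service"
    elif "discount" in lowered:
        skill_name = "discounts"
    else:
        skill_name = "product_suggestions"
    return SKILLS[skill_name](query)

print(route("I want to make a return"))
```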
Evaluation of the Router
For routers, evaluation typically focuses on whether the router called the right skill [00:07:56]. It is also important to verify that the router not only chooses the correct skill but also passes the right parameters into it [00:08:47].
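One way to frame this is a small hand-labeled test set checked against the router's output. This is a minimal sketch under assumed names; the record structure and the stub router are illustrative, not a specific framework's API.

```python
def stub_router(query: str):
    """Stand-in for the router under test: returns (skill_name, params)."""
    if "return" in query.lower():
        return "customer_service", {"intent": "return"}
    return "discounts", {"intent": "discount_lookup"}

test_cases = [
    {"query": "I want to make a return",
     "expected_skill": "customer_service",
     "expected_params": {"intent": "return"}},
    {"query": "are there any discounts on this?",
     "expected_skill": "discounts",
     "expected_params": {"intent": "discount_lookup"}},
]

correct_skill = correct_params = 0
for case in test_cases:
    skill, params = stub_router(case["query"])
    correct_skill += skill == case["expected_skill"]
    # Only count parameters as correct when the skill choice is also correct.
    correct_params += (skill == case["expected_skill"]
                       and params == case["expected_params"])

n = len(test_cases)
print(f"skill accuracy: {correct_skill / n:.0%}")
print(f"param accuracy: {correct_params / n:.0%}")
```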
Skills
Skills are the actual logical chains that perform the work an agent needs to do [00:03:12]. These can involve LLM calls, API calls, or other computational steps [00:05:09]. When a router directs a query to a specific skill, that skill executes a flow to fulfill the user’s request, such as performing a product search [00:04:58] [00:05:02].
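A skill can be pictured as a short chain of steps. Below is a minimal sketch of a product-search skill, assuming hypothetical `search_products` and `call_llm` helpers in place of a real API call and a real model call.

```python
def search_products(query: str) -> list[str]:
    # Placeholder for a product-search API call.
    return ["running shoes", "trail shoes"]

def call_llm(prompt: str) -> str:
    # Placeholder for an LLM completion call.
    return f"(model answer based on: {prompt})"

def product_suggestion_skill(query: str) -> str:
    """Execute the skill's flow: retrieve candidates, then generate."""
    candidates = search_products(query)
    prompt = f"User asked: {query}\nCandidates: {', '.join(candidates)}"
    return call_llm(prompt)

print(product_suggestion_skill("give me an idea of what to go buy"))
```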
Evaluation of Skills
Evaluating a skill can be complex due to its many internal components [00:09:39]. Key evaluation points for skills include:
- Relevance of pulled chunks: For RAG-type skills, assessing the relevance of information chunks retrieved [00:09:43].
- Correctness of the generated answer: Verifying that the skill produces the correct output [00:09:52].
- Convergence/Path taken: Evaluating the path the agent takes to complete a task. Ideally, the agent should converge, meaning it consistently takes a similar, succinct number of steps to achieve its goal [00:10:15]. Different models (e.g., OpenAI vs. Anthropic) can lead to wildly different numbers of steps for the same skill [00:10:50]. This aspect is considered one of the hardest to evaluate [00:11:14]; a sketch of a simple convergence check follows this list.
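The sketch below runs the same task several times, counts the steps per run, and looks at the spread. The `run_agent` stub is a hypothetical stand-in for executing the agent and reading the steps back from its trace.

```python
import random
from statistics import mean, pstdev

def run_agent(task: str) -> list[str]:
    # Stand-in: a real run would return the actual steps from the trace.
    return ["router", "search"] + ["analyze"] * random.randint(1, 4)

step_counts = [len(run_agent("product search")) for _ in range(10)]
print(f"steps per run: {step_counts}")
print(f"mean: {mean(step_counts):.1f}, stdev: {pstdev(step_counts):.1f}")
# A low standard deviation suggests the agent converges on a consistent,
# succinct path; a high one signals wandering behavior across runs.
```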
Memory
Memory is the component responsible for storing the agent’s context and past interactions [00:03:16]. This is critical for multi-turn conversations, as agents need to remember what was previously said to maintain a coherent dialogue [00:05:24]. It ensures the agent keeps the conversation in some semblance of state [00:05:38].
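A minimal sketch of this idea is an append-only message log replayed into each turn, so the agent keeps the conversation in state. The class below is illustrative, not any framework's memory API.

```python
class ConversationMemory:
    """Store the running conversation so multi-turn context survives."""

    def __init__(self):
        self.messages: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        # In practice this might be truncated or summarized to fit
        # the model's context window.
        return list(self.messages)

memory = ConversationMemory()
memory.add("user", "I want to make a return")
memory.add("assistant", "Sure, which order?")
memory.add("user", "The shoes from last week")  # resolvable only via memory
```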
Traces: Visualizing Agent Components
Understanding the “traces” of an agent provides insight into its inner workings [00:05:58]. Traces allow engineers to see what happened behind the scenes when an agent processes a query [00:06:12].
For instance, in a code-based agent asked “what trends do you see in my Trace latency?”, the trace reveals:
- Router Call 1: The router decides how to tackle the question [00:06:28].
- Skill Execution (Tool Call): The router makes a tool call to run a SQL query and collect application traces [00:06:55].
- Router Call 2: After the first skill, the process returns to the router to decide the next step [00:07:11].
- Skill Execution (Data Analyzer): The router calls a second skill, the data analyzer, which takes the collected traces and application data and analyzes them [00:07:14].

Throughout this process, memory stores everything happening behind the scenes [00:07:30].
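Modeled as data, that trace might look like the sketch below, with one span per router decision and skill execution. The span fields are an illustrative assumption, not any particular tracing product's schema.

```python
trace = {
    "input": "what trends do you see in my trace latency?",
    "spans": [
        {"name": "router_call_1", "decision": "run_sql_query"},
        {"name": "tool_call.run_sql_query",
         "output": "collected application traces"},
        {"name": "router_call_2", "decision": "data_analyzer"},
        {"name": "skill.data_analyzer",
         "output": "latency trends over the collected traces"},
    ],
    "memory": "full conversation and intermediate state recorded here",
}

# Walking the spans in order reconstructs the agent's path.
for span in trace["spans"]:
    print(span["name"])
```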
Multimodal Agents and Additional Evaluation
The “next frontier” for AI agents includes voice AI, which is already revolutionizing call centers [00:01:48]. Examples like the Priceline Penny bot let users book entire vacations hands-free [00:02:15].
These complex applications require evaluation considerations beyond text-based interactions [00:11:54]. For voice agents, evaluations must encompass the following (a sketch of a per-turn evaluation record follows the list):
- Audio chunks: Not just the generated transcript, but the actual audio itself needs evaluation [00:12:07].
- User sentiment: Assessing the user’s emotional state [00:12:30].
- Speech-to-text transcription accuracy: Checking the quality of the transcription [00:12:31].
- Tone consistency: Ensuring the agent’s tone remains consistent throughout the conversation [00:12:36].
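One way to capture these dimensions is a per-turn evaluation record like the sketch below. The field names and score scales are illustrative assumptions.

```python
voice_turn_eval = {
    "audio_chunk_id": "turn_0042",
    "transcript": "I want to book a flight to Denver",
    "evals": {
        "audio_quality": "pass",          # judged on the audio itself
        "user_sentiment": "neutral",
        "transcription_accuracy": 0.97,   # e.g. 1 - word error rate
        "tone_consistency": "pass",       # agent tone stable across turns
    },
}
```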
Comprehensive Evaluation Strategy
It is crucial to have evaluations defined throughout the entire application trace, not just at one layer [00:14:40]. This allows for effective debugging, pinpointing whether an issue occurred at the router level, skill level, or elsewhere in the flow [00:14:52]. For example, a co-pilot might have evaluations at the top for overall response correctness, at the search router for choosing the right skill and passing correct arguments, and at the skill execution for task completion [00:14:05].
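Below is a minimal sketch of that layered setup: each layer of the trace carries its own checks so a failure can be pinpointed by layer. The check names and the `run_check` evaluator are hypothetical placeholders for LLM-as-judge or code-based evaluators.

```python
LAYERED_EVALS = {
    "overall_response": ["answer_correctness"],
    "router": ["right_skill_chosen", "correct_arguments_passed"],
    "skill": ["task_completed"],
}

def run_check(layer: str, check: str, trace: dict) -> str:
    # Stand-in for an LLM-as-judge or code-based evaluator.
    return "pass"

def evaluate_trace(trace: dict) -> dict:
    """Run every layer's checks so failures can be localized by layer."""
    return {layer: {check: run_check(layer, check, trace) for check in checks}
            for layer, checks in LAYERED_EVALS.items()}

print(evaluate_trace({"input": "example query", "spans": []}))
```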