From: aidotengineer

Agent evaluation is part art and part science, and it is essential to ensuring that AI agents perform as expected, especially when deployed in production environments [00:00:12].

The Agent Evaluation Map

The agent evaluation process can be divided neatly into two main parts: semantic and behavioral [00:00:41].

Semantic Evaluation

Semantic evaluation focuses on how an agent’s representations of reality relate to actual reality [00:00:50]. Truthfulness in semantic evaluation is ultimately achieved by grounding representations in data, often with Retrieval Augmented Generation (RAG) [00:02:06].

It is further divided into:

  • Single-Turn Items: These cover universal virtues that are not agent-specific, but are included for completeness so that the agent-specific parts stand out by contrast [00:03:10]. Examples include:
    • Coherence & Consistency: Whether the agent’s reply is coherent and internally consistent [00:03:32].
    • Safety: If the content provided is safe [00:03:40].
    • Alignment & Policies: Whether the agent’s statements align with organizational values or stakeholder policies [00:03:47].
    • RAG/Attention Management: This requires specific evaluators to measure aspects like:
      • Correctness of the retrieved context [00:04:11].
      • Comprehensive recall of the context [00:04:20].
      • Faithfulness: Whether the answer is grounded in the retrieved context (the agent’s window onto external reality), which is distinct from answer-question relevance and from general factuality [00:04:22]. RAG evaluations can take many forms and exhibit symmetries, essentially examining the relationships between parts of the RAG pipeline [00:04:51]. A minimal evaluator sketch follows this list.
  • Multi-Turn Aspects: These relate to conversational histories and reasoning [00:01:40].
    • Chat Conversation Histories: Looking for consistency and adherence to topics (or allowing changes when appropriate) [00:05:24].
    • Reasoning: Evaluating reasoning traces, such as “Chain of Thought,” which represents sequential activities and world representations before any actions are taken [00:05:54].
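To make the RAG faithfulness metric above concrete, here is a minimal LLM-as-a-judge sketch. The `call_llm` helper, the prompt wording, and the 1-5 scale are all illustrative assumptions, not details from the talk.

```python
# Minimal sketch of an LLM-as-a-judge faithfulness evaluator.
# `call_llm` is a hypothetical stand-in for any chat-completion API;
# the prompt wording and the 1-5 scale are illustrative assumptions.

FAITHFULNESS_PROMPT = """\
You are an evaluator. Rate from 1 to 5 how faithful the answer is to the
retrieved context (5 = fully supported, 1 = unsupported or contradicted).
Reply with the number only.

Context:
{context}

Answer:
{answer}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (hosted API, local model, etc.)."""
    raise NotImplementedError


def faithfulness_score(context: str, answer: str) -> int:
    """Ask the judge LLM to grade answer-context faithfulness on a 1-5 scale."""
    reply = call_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())
```

Analogous judges can be written for the other single-turn virtues (coherence, safety, policy alignment) by swapping the rubric in the prompt.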

Behavioral Evaluation

The behavioral part focuses on how an agent’s actions and tool usage contribute to achieving its goals within its environment, and the effects it has on that environment [00:01:05]. Goal achievement is realized by grounding the agent’s activities in the tools it has available [00:02:21].

It is likewise divided into:

  • Individual Tool Selection & Usage: Evaluating whether the agent picks appropriate tools and invokes them correctly, even before any chain of behaviors is considered [00:01:53].
  • Task Progression & Planning (Multi-Step): Evaluating the overall sequence of actions (see the sketch after this list) [00:01:51]:
    • Whether actions converge towards achieving the agent’s goal [00:07:04].
    • Consistency and quality of the agent’s plan [00:07:22].
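As a concrete illustration of trajectory-level behavioral metrics, the sketch below defines two simple checks over a tool-call trace. The trace format (a list of tool names in call order) and both metric definitions are assumptions made for the example.

```python
# Illustrative trajectory metrics for behavioral evaluation.
# The trace format (tool names in call order) is an assumed convention.

def tool_order_coverage(actual: list[str], expected: list[str]) -> float:
    """Fraction of the expected tool sequence found in the actual trace, in order."""
    remaining = iter(actual)
    hits = sum(1 for tool in expected if tool in remaining)
    return hits / len(expected) if expected else 1.0

def reaches_goal(actual: list[str], goal_tool: str, max_steps: int) -> bool:
    """Did the agent call the goal-completing tool within the step budget?"""
    return goal_tool in actual[:max_steps]

# Example: a flight-booking agent should search before it books.
trace = ["search_flights", "check_prices", "book_flight"]
assert tool_order_coverage(trace, ["search_flights", "book_flight"]) == 1.0
assert reaches_goal(trace, "book_flight", max_steps=5)
```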

Symmetry in Evaluation

There is an intentional symmetry between semantic and behavioral evaluations, stemming from the analogy between representations and behaviors [00:02:29]. Representing the world is considered a type of activity, meaning representations can be seen as a special case of behaviors or tools [00:02:43]. Both semantic (truthfulness) and behavioral (goal achievement/utility) aspects are grounded in external reality, with goal achievement being the ultimate metric [00:07:33]. Other metrics are often considered proxies [00:08:02].

Other Practical Considerations

Beyond the main evaluation map, several practical aspects are crucial for agent evaluations [00:08:07]:

  • Cost and Latency Optimization: Agents should progress towards their goals as quickly and cheaply as possible, including optimizing the number of steps [00:08:24].
  • Tracing and Debugging: The ability to identify where an agent went wrong [00:08:42].
  • Error Management: Specifically dealing with errors in tool usage, distinct from semantic errors during inference [00:08:49].
  • Offline vs. Online Testing: A critical distinction between evaluations during development and those performed during live agent activity [00:09:06]. These are distinct dimensions that could further complicate the evaluation map [00:09:25].
  • Special Cases & Tool-Specific Metrics: More refined and advanced evaluation methods, some still research-oriented [00:09:41]. Tool-specific metrics, often simple to implement for API calls, can be added (see the sketch after this list) [00:09:59].
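Tool-specific metrics of the kind mentioned in the last bullet are often just a few lines of code. The sketch below uses a hypothetical `ToolCall` trace record (not a format from the talk) to check argument validity and call success, and to roll up the step count, latency, and cost discussed above.

```python
# Simple tool-specific checks plus cost/latency bookkeeping.
# `ToolCall` is a hypothetical trace record, assumed for illustration.

import json
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: str      # raw JSON string emitted by the agent
    status_code: int    # HTTP status returned by the tool
    latency_s: float
    cost_usd: float

def arguments_are_valid_json(call: ToolCall) -> bool:
    """Tool-specific metric: did the agent emit well-formed JSON arguments?"""
    try:
        json.loads(call.arguments)
        return True
    except json.JSONDecodeError:
        return False

def call_succeeded(call: ToolCall) -> bool:
    """Tool-specific metric: did the API call return a 2xx status?"""
    return 200 <= call.status_code < 300

def trace_totals(trace: list[ToolCall]) -> dict:
    """Cost/latency rollup: step count, total latency, and total spend."""
    return {
        "steps": len(trace),
        "latency_s": sum(c.latency_s for c in trace),
        "cost_usd": sum(c.cost_usd for c in trace),
    }
```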

EvalOps: Optimizing the Evaluator LLM

Many measurements in AI evaluation are implemented using “LLM as a judge” techniques [00:10:26]. A common pitfall is the “single-tier” approach, where optimization focuses only on the agent’s operational flow, overlooking the cost, latency, and uncertainty of the “judge” LLM itself [00:10:40].

A “double-tier” optimization is necessary: optimizing both the operative LLM flow powering the agent and the “judge” flow powering the evaluations [00:11:10]. This complex situation is termed EvalOps [00:11:31]. EvalOps represents a distinct category of activity because evaluations can be so complicated, expensive, and slow [00:11:43]. It is a special case of LLM Ops, operating on different entities and requiring different thinking, software implementations, and resourcing [00:12:08].
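One way to make the double-tier idea concrete is to give the judge tier the same bookkeeping as the agent tier, so that evaluation cost, latency, and score uncertainty stay visible. Everything in this sketch (the record fields, the uncertainty measure) is an illustrative assumption rather than a prescribed EvalOps schema.

```python
# Double-tier bookkeeping sketch for EvalOps: track the judge LLM with the
# same rigor as the operative (agent) LLM. Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class TierStats:
    calls: int = 0
    cost_usd: float = 0.0
    latency_s: float = 0.0

@dataclass
class RunStats:
    agent: TierStats = field(default_factory=TierStats)  # operative tier
    judge: TierStats = field(default_factory=TierStats)  # evaluation tier

    def record(self, tier: str, cost_usd: float, latency_s: float) -> None:
        stats = self.agent if tier == "agent" else self.judge
        stats.calls += 1
        stats.cost_usd += cost_usd
        stats.latency_s += latency_s

def judge_uncertainty(scores: list[float]) -> float:
    """Std. dev. of repeated judge scores on one item (higher = less reliable judge)."""
    if not scores:
        return 0.0
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
```

Tracking both tiers in one place makes it obvious when the judge flow starts to dominate the run's cost or latency budget.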

The goal of these evaluations is to make agents measurable and controllable, ensuring they adhere to their intended purposes [00:13:15].