From: aidotengineer
Evaluating AI agents, assistants, and workflows is crucial for achieving stability and reliability in complex AI applications [00:00:27]. While AI agent swarms could in theory solve any knowledge-work problem, their instability as complexity grows is a significant limitation [00:00:43]. This instability stems from challenges such as imperfect observability, dynamic environments, the difficulty of comprehensively testing and optimizing agents, and trouble tracking consistent progress toward goals [00:00:57]. The general solution lies in robust evaluations [00:01:29].
Challenges in AI Agent Evaluation
Many approaches to evaluation oversimplify the process [00:01:37]. Simply applying an evaluation stack without using it systematically will not lead to significant improvement [00:01:50]. Effective evaluation must not only improve the AI agents and workflows themselves but also continuously develop the evaluation stacks and keep them aligned with business requirements [00:02:02].
Evaluating all aspects of agent behaviors and internal representations presents a complex landscape [00:02:25]. This includes:
- Agent’s Representation of Reality: How the agent models, discusses, and grounds reality (i.e., what is true) [00:02:48].
- Behavioral Aspects: Whether the agent infers the correct goals, makes progress towards them, and selects the right tools [00:03:04].
Ideally, all of these aspects would be evaluated, but starting with a consistent framework for even a subset can yield progress [00:03:18].
Setting Up Evaluators
A clear framework is essential for setting up evaluators [00:03:34]. For instance, in a hotel reservation agent scenario, evaluators might include:
- Policy adherence to the specific hotel’s reservation rules [00:03:52].
- Accuracy of agent outputs to the user [00:03:58].
- Overall appropriate behavior [00:04:04].
It is critical to have good visibility when creating, maintaining, and systematically improving large stacks of evaluators over time [00:04:17].
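As a concrete illustration, such a stack could be organized roughly as in the sketch below. The `EvalResult` and `Evaluator` structures, the 0-to-1 scoring scale, and the stubbed checks are assumptions made for illustration, not any particular library's API; real evaluators would typically be LLM judges primed with the hotel's policy.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalResult:
    score: float        # hypothetical scale: 0.0 (fails the criterion) to 1.0 (fully satisfies it)
    explanation: str    # what went right or wrong, to be fed back to the agent

@dataclass
class Evaluator:
    name: str
    evaluate: Callable[[str], EvalResult]   # scores one agent output

def policy_adherence(output: str) -> EvalResult:
    # Stub rule check; a real evaluator would judge the output against the
    # hotel's full reservation policy, usually with an LLM judge.
    ok = "competitor" not in output.lower()
    return EvalResult(1.0 if ok else 0.0,
                      "ok" if ok else "output points the guest to a competitor")

# A stack of evaluators for the hotel reservation agent.
hotel_reservation_evaluators: List[Evaluator] = [
    Evaluator("policy_adherence", policy_adherence),
    Evaluator("output_accuracy", lambda out: EvalResult(1.0, "stub accuracy judge")),
    Evaluator("appropriate_behavior", lambda out: EvalResult(1.0, "stub behavior judge")),
]
```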
The Stabilization Loop and Model Context Protocol (MCP)
The goal is to achieve a stabilization loop:
- An agent attempts a task [00:04:35].
- The task’s output is evaluated by an evaluation engine [00:04:40].
- Feedback, in the form of a numeric score and an explanation of what went right or wrong, is sent back to the agent [00:04:42].
- The agent uses this information to improve its own performance [00:04:57].
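A minimal sketch of this loop, with `agent_attempt` and `run_evaluators` as hypothetical stand-ins for your agent and evaluation engine:

```python
from typing import Optional, Tuple

def agent_attempt(task: str, feedback: Optional[str]) -> str:
    """Hypothetical: call the agent, passing along any prior evaluator feedback."""
    raise NotImplementedError

def run_evaluators(output: str) -> Tuple[float, str]:
    """Hypothetical: run the evaluator stack and return (numeric score, explanation)."""
    raise NotImplementedError

def stabilize(task: str, threshold: float = 0.9, max_rounds: int = 5) -> str:
    """Attempt -> evaluate -> feed back -> retry, until the score is good enough."""
    feedback: Optional[str] = None
    output = ""
    for _ in range(max_rounds):
        output = agent_attempt(task, feedback)       # 1. agent attempts the task
        score, explanation = run_evaluators(output)  # 2. evaluation engine scores the output
        if score >= threshold:
            break                                    # stable enough; stop iterating
        feedback = explanation                       # 3. feedback goes back to the agent
    return output                                    # 4. best output after improvement
```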
The Model Context Protocol (MCP) provides a way to attach agents to this evaluation engine [00:05:08]. Through MCP, agents can access evaluators and “charges” (collections of evaluators) that measure their outputs and provide feedback for improvement [00:06:06].
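One way to expose such an evaluation engine over MCP is sketched below, using the official `mcp` Python SDK's `FastMCP` helper. The server name, the charge registry, and the tool names `list_charges` and `measure` are all made up for illustration.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("evaluation-engine")

# Hypothetical registry: charge name -> list of (evaluator name, scoring function).
# Each scoring function returns (score, explanation); real ones would be LLM judges.
CHARGES = {
    "marketing_message_quality_judge": [
        ("persuasiveness", lambda text: (0.5, "stub persuasiveness judge")),
        ("writing_quality", lambda text: (0.5, "stub writing-quality judge")),
        ("engagingness", lambda text: (0.5, "stub engagingness judge")),
    ],
}

@mcp.tool()
def list_charges() -> list[str]:
    """List the available charges (collections of evaluators)."""
    return sorted(CHARGES)

@mcp.tool()
def measure(charge: str, text: str) -> dict:
    """Run every evaluator in a charge; return an aggregate score plus explanations."""
    results = [(name, *fn(text)) for name, fn in CHARGES[charge]]
    return {
        "score": sum(score for _, score, _ in results) / len(results),
        "feedback": {name: explanation for name, _, explanation in results},
    }

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so an agent can attach to it
```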
Practical Examples of MCP
Optimizing Text Output
A simple example involves optimizing text output, such as a marketing message [00:05:27]. By using an MCP server, an agent can:
- List available evaluators or charges [00:06:02].
- Select a relevant charge, like a “marketing message quality judge” [00:06:42].
- Measure the initial message [00:07:21].
- Figure out how to improve the score [00:07:28].
- Suggest an improved version based on scores for persuasiveness, writing quality, and engagingness [00:07:54].
This process demonstrates how an agent can iteratively improve its output based on continuous evaluation [00:08:38].
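A sketch of how an agent-side loop might drive this over MCP, assuming the hypothetical “evaluation-engine” server above is saved as `eval_server.py` and using the `mcp` Python SDK client; `revise_with_llm` is a made-up placeholder for the agent's own rewriting step.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

def revise_with_llm(message: str, eval_result) -> str:
    """Hypothetical: ask the agent's LLM to rewrite `message` using the evaluator feedback."""
    raise NotImplementedError

async def optimize_message(draft: str) -> str:
    params = StdioServerParameters(command="python", args=["eval_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()                # discover the evaluation tools
            print("tools:", [t.name for t in tools.tools])
            charges = await session.call_tool("list_charges", arguments={})  # 1. list charges
            print("charges:", charges)

            message = draft
            for _ in range(3):                                # a few measure -> improve rounds
                result = await session.call_tool(             # 2./3. measure the current message
                    "measure",
                    arguments={"charge": "marketing_message_quality_judge",
                               "text": message},
                )
                message = revise_with_llm(message, result)    # 4./5. draft an improved version
            return message

if __name__ == "__main__":
    asyncio.run(optimize_message("Book your stay today!"))
```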
Stabilizing an AI Hotel Reservation Agent
Consider a hotel reservation agent for the “Shire Hotel”, which should never recommend or even mention the “Akma Hotel” next door [00:09:47].
- Without MCP: The agent might politely offer rooms at Shire Hotel but then unacceptably suggest Akma Hotel as “a great option” [00:10:34].
- With MCP: When the MCP server is enabled, the agent calls the MCP server for evaluation [00:11:13]. During this process, evaluators like the “hotel sh booking policy evaluator” are invoked [00:11:46]. As a result, the agent stops mentioning the Akma Hotel, correcting its behavior [00:11:40].
This illustrates how an agent, through MCP, can pick relevant evaluators to assess its own performance and reliability and self-correct its behavior, even without specific instructions to use particular evaluators [00:11:53].
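The policy evaluator invoked here could, in spirit, be as simple as the following sketch; the function name and the literal string check are illustrative stand-ins for what is presumably an LLM judge primed with the Shire Hotel's booking policy.

```python
def shire_booking_policy_evaluator(agent_reply: str) -> tuple[float, str]:
    """Illustrative stand-in: the Shire Hotel agent must never mention the Akma Hotel."""
    if "akma" in agent_reply.lower():
        return 0.0, "Policy violation: the reply recommends or mentions the Akma Hotel."
    return 1.0, "Reply stays within the Shire Hotel's booking policy."
```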
Conclusion
To successfully develop and optimize AI agents and workflows:
- Ensure your evaluation library or platform is powerful enough to support diverse evaluators, maintain them over their lifecycle, and optimize them [00:12:36]. This includes optimizing both the agent itself and the evaluators, since evaluators also incur costs [00:12:53].
- Begin by running the MCP evaluation flow manually, offline, to understand how it works and gain transparency [00:13:05].
- Finally, attach evaluators to agents via MCP [00:13:24]. This approach promises more controllability, transparency, and dynamically self-improving and self-correcting agents [00:13:30].
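For the manual offline step, one option (assuming the hypothetical `eval_server.py` sketched earlier, and assuming its tool functions remain directly importable and callable) is to invoke the evaluators as plain functions, with no agent and no MCP transport, to see exactly what feedback the agent would receive:

```python
# Offline dry run of the hypothetical evaluation server: call the tool
# functions directly to inspect the scores and feedback an agent would get.
from eval_server import list_charges, measure

print(list_charges())
report = measure(charge="marketing_message_quality_judge",
                 text="Book your stay at the Shire Hotel today!")
print(report["score"], report["feedback"])
```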