From: aidotengineer
Evaluating AI agents, assistants, and workflows is crucial for achieving stability and reliability in complex AI applications [00:00:27]. While AI agent swarms could in theory solve any knowledge-work problem, their instability as complexity grows is a significant limitation [00:00:43]. This instability stems from challenges such as imperfect observability, dynamic environments, the difficulty of comprehensively testing and optimizing agents, and trouble tracking consistent progress toward goals [00:00:57]. The general solution lies in robust evaluations [00:01:29].
Challenges in AI Agent Evaluation
Many approaches to evaluation oversimplify the process [00:01:37]. Simply applying an evaluation stack without using it systematically will not lead to significant improvement [00:01:50]. Effective evaluation must not only improve the AI agents and workflows themselves but also continuously develop the evaluation stacks and keep them aligned with business requirements [00:02:02].
Evaluating all aspects of agent behaviors and internal representations presents a complex landscape [00:02:25]. This includes:
- Agent’s Representation of Reality: How the agent models, discusses, and grounds reality (i.e., what is true) [00:02:48].
- Behavioral Aspects: Whether the agent infers the correct goals, makes progress towards them, and selects the right tools [00:03:04].
Ideally, all of these aspects would be evaluated, but starting with a consistent framework for even a subset can yield progress [00:03:18].
Setting Up Evaluators
A clear framework is essential for setting up evaluators [00:03:34]. For instance, in a hotel reservation agent scenario, evaluators might include:
- Policy adherence to the specific hotel’s reservation rules [00:03:52].
- Accuracy of agent outputs to the user [00:03:58].
- Overall appropriate behavior [00:04:04].
It is critical to have good visibility when creating, maintaining, and systematically improving large stacks of evaluators over time [00:04:17].
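As a concrete illustration, such a stack could be organized roughly as in the sketch below. The `EvalResult` and `Evaluator` structures, the 0-to-1 scoring scale, and the stubbed checks are assumptions made for illustration, not any particular library's API; real evaluators would typically be LLM judges primed with the hotel's policy.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalResult:
    score: float        # hypothetical scale: 0.0 (fails the criterion) to 1.0 (fully satisfies it)
    explanation: str    # what went right or wrong, to be fed back to the agent

@dataclass
class Evaluator:
    name: str
    evaluate: Callable[[str], EvalResult]   # scores one agent output

def policy_adherence(output: str) -> EvalResult:
    # Stub rule check; a real evaluator would judge the output against the
    # hotel's full reservation policy, usually with an LLM judge.
    ok = "competitor" not in output.lower()
    return EvalResult(1.0 if ok else 0.0,
                      "ok" if ok else "output points the guest to a competitor")

# A stack of evaluators for the hotel reservation agent.
hotel_reservation_evaluators: List[Evaluator] = [
    Evaluator("policy_adherence", policy_adherence),
    Evaluator("output_accuracy", lambda out: EvalResult(1.0, "stub accuracy judge")),
    Evaluator("appropriate_behavior", lambda out: EvalResult(1.0, "stub behavior judge")),
]
```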
The Stabilization Loop and Model Context Protocol (MCP)
The goal is to achieve a stabilization loop:
- An agent attempts a task [00:04:35].
- The task’s output is evaluated by an evaluation engine [00:04:40].
- Feedback, in the form of a numeric score and an explanation of what went right or wrong, is sent back to the agent [00:04:42].
- The agent uses this information to improve its own performance [00:04:57].
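A minimal sketch of this loop, with `agent_attempt` and `run_evaluators` as hypothetical stand-ins for your agent and evaluation engine:

```python
from typing import Optional, Tuple

def agent_attempt(task: str, feedback: Optional[str]) -> str:
    """Hypothetical: call the agent, passing along any prior evaluator feedback."""
    raise NotImplementedError

def run_evaluators(output: str) -> Tuple[float, str]:
    """Hypothetical: run the evaluator stack and return (numeric score, explanation)."""
    raise NotImplementedError

def stabilize(task: str, threshold: float = 0.9, max_rounds: int = 5) -> str:
    """Attempt -> evaluate -> feed back -> retry, until the score is good enough."""
    feedback: Optional[str] = None
    output = ""
    for _ in range(max_rounds):
        output = agent_attempt(task, feedback)       # 1. agent attempts the task
        score, explanation = run_evaluators(output)  # 2. evaluation engine scores the output
        if score >= threshold:
            break                                    # stable enough; stop iterating
        feedback = explanation                       # 3. feedback goes back to the agent
    return output                                    # 4. best output after improvement
```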
The Model Context Protocol (MCP) provides a way to attach agents to this evaluation engine [00:05:08]. Through MCP, agents can access evaluators and “charges” (collections of evaluators) that measure their outputs and provide feedback for improvement [00:06:06].
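One way to expose such an evaluation engine over MCP is sketched below, using the official `mcp` Python SDK's `FastMCP` helper. The server name, the charge registry, and the tool names `list_charges` and `measure` are all made up for illustration.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("evaluation-engine")

# Hypothetical registry: charge name -> list of (evaluator name, scoring function).
# Each scoring function returns (score, explanation); real ones would be LLM judges.
CHARGES = {
    "marketing_message_quality_judge": [
        ("persuasiveness", lambda text: (0.5, "stub persuasiveness judge")),
        ("writing_quality", lambda text: (0.5, "stub writing-quality judge")),
        ("engagingness", lambda text: (0.5, "stub engagingness judge")),
    ],
}

@mcp.tool()
def list_charges() -> list[str]:
    """List the available charges (collections of evaluators)."""
    return sorted(CHARGES)

@mcp.tool()
def measure(charge: str, text: str) -> dict:
    """Run every evaluator in a charge; return an aggregate score plus explanations."""
    results = [(name, *fn(text)) for name, fn in CHARGES[charge]]
    return {
        "score": sum(score for _, score, _ in results) / len(results),
        "feedback": {name: explanation for name, _, explanation in results},
    }

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so an agent can attach to it
```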
Practical Examples of MCP
Optimizing Text Output
A simple example involves optimizing text output, such as a marketing message [00:05:27]. By using an MCP server, an agent can:
- List available evaluators or charges [00:06:02].
- Select a relevant charge, like a “marketing message quality judge” [00:06:42].
- Measure the initial message [00:07:21].
- Figure out how to improve the score [00:07:28].
- Suggest an improved version based on scores for persuasiveness, writing quality, and engagingness [00:07:54].
This process demonstrates how an agent can iteratively improve its output based on continuous evaluation [00:08:38].
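A sketch of how an agent-side loop might drive this over MCP, assuming the hypothetical “evaluation-engine” server above is saved as `eval_server.py` and using the `mcp` Python SDK client; `revise_with_llm` is a made-up placeholder for the agent's own rewriting step.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

def revise_with_llm(message: str, eval_result) -> str:
    """Hypothetical: ask the agent's LLM to rewrite `message` using the evaluator feedback."""
    raise NotImplementedError

async def optimize_message(draft: str) -> str:
    params = StdioServerParameters(command="python", args=["eval_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()                # discover the evaluation tools
            print("tools:", [t.name for t in tools.tools])
            charges = await session.call_tool("list_charges", arguments={})  # 1. list charges
            print("charges:", charges)

            message = draft
            for _ in range(3):                                # a few measure -> improve rounds
                result = await session.call_tool(             # 2./3. measure the current message
                    "measure",
                    arguments={"charge": "marketing_message_quality_judge",
                               "text": message},
                )
                message = revise_with_llm(message, result)    # 4./5. draft an improved version
            return message

if __name__ == "__main__":
    asyncio.run(optimize_message("Book your stay today!"))
```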
Stabilizing an AI Hotel Reservation Agent
Consider a hotel reservation agent for the “Shire Hotel”, which should never recommend or even mention the “Akma Hotel” next door [00:09:47].
- Without MCP: The agent might politely offer rooms at Shire Hotel but then unacceptably suggest Akma Hotel as “a great option” [00:10:34].
- With MCP: When the MCP server is enabled, the agent calls the MCP server for evaluation [00:11:13]. During this process, evaluators like the “hotel sh booking policy evaluator” are invoked [00:11:46]. As a result, the agent stops mentioning the Akma Hotel, correcting its behavior [00:11:40].
This illustrates how an agent, through MCP, can pick relevant evaluators to assess its own performance and reliability and self-correct its behavior, even without specific instructions to use particular evaluators [00:11:53].
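The policy evaluator invoked here could, in spirit, be as simple as the following sketch; the function name and the literal string check are illustrative stand-ins for what is presumably an LLM judge primed with the Shire Hotel's booking policy.

```python
def shire_booking_policy_evaluator(agent_reply: str) -> tuple[float, str]:
    """Illustrative stand-in: the Shire Hotel agent must never mention the Akma Hotel."""
    if "akma" in agent_reply.lower():
        return 0.0, "Policy violation: the reply recommends or mentions the Akma Hotel."
    return 1.0, "Reply stays within the Shire Hotel's booking policy."
```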
Conclusion
To successfully develop and optimize AI agents and workflows:
- Ensure your evaluation library or platform is powerful enough to support diverse evaluators, maintain them over their lifecycle, and optimize them [00:12:36]. This includes optimizing both the agent itself and the evaluators, since evaluators also incur costs [00:12:53].
- Begin by running the MCP evaluation flow manually, offline, to understand how it works and gain transparency [00:13:05].
- Finally, attach evaluators to agents via MCP [00:13:24]. This approach promises more controllability, transparency, and dynamically self-improving and self-correcting agents [00:13:30].
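For the manual offline step, one option (assuming the hypothetical `eval_server.py` sketched earlier, and assuming its tool functions remain directly importable and callable) is to invoke the evaluators as plain functions, with no agent and no MCP transport, to see exactly what feedback the agent would receive:

```python
# Offline dry run of the hypothetical evaluation server: call the tool
# functions directly to inspect the scores and feedback an agent would get.
from eval_server import list_charges, measure

print(list_charges())
report = measure(charge="marketing_message_quality_judge",
                 text="Book your stay at the Shire Hotel today!")
print(report["score"], report["feedback"])
```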