From: aidotengineer

Agent swarms have the potential to solve nearly any knowledge work problem, but their primary challenge is inherent instability as those problems grow more complex [00:00:36]. This instability stems from three difficulties: swarms cannot be observed perfectly, they operate in dynamic environments, and AI coding agents are hard to test and optimize comprehensively before deployment [00:00:57]. It is often unclear whether agents are consistently progressing toward their goals, especially when a goal cannot be reached in a single step [00:01:12].

The Role of Evaluations (Evals)

The general solution to stabilizing AI agents and workflows is systematic evaluation, or “evals” [00:01:27]. Simply adopting an eval stack without a systematic approach is unlikely to yield significant results [00:01:45]. Used effectively, evaluations not only drive improvement of agents and workflows but also keep the evaluation stacks themselves evolving and aligned with business requirements [00:02:02].

Evaluating all aspects of agent behaviors and internal representations presents a complex landscape [00:02:25]. This includes:

  • Representational Aspects: How the agent models reality, discusses it with the user, and its grounding to reality (truthfulness) [00:02:48].
  • Behavioral Aspects: Whether the agent infers the correct goals, makes progress towards them, and selects the appropriate tools [00:03:04].

To begin, a clear framework for setting up evaluators is necessary [00:03:31]. For instance, a hotel reservation agent might require evaluators for policy adherence, accuracy of outputs, and overall appropriate behavior [00:03:43]. It’s crucial to have visibility into creating and maintaining large stacks of evaluators, systematically improving them over time [00:04:17].
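As a rough illustration, such a stack can be treated as structured data rather than ad-hoc prompts. This is a minimal sketch for the hotel reservation case; the `Evaluator` shape, the evaluator names, and the stub scoring functions are hypothetical, not any particular platform’s API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluator:
    """One evaluator in the stack: scores an agent output in [0, 1] with an explanation."""
    name: str
    criteria: str                              # what the judge is asked to check
    score: Callable[[str], tuple[float, str]]  # e.g., an LLM-as-judge call

# A hypothetical evaluator stack for a hotel reservation agent.
hotel_evaluators = [
    Evaluator("policy_adherence", "Never recommends competitor hotels",
              score=lambda text: (1.0, "stub")),
    Evaluator("output_accuracy", "Dates, prices, and room details are correct",
              score=lambda text: (1.0, "stub")),
    Evaluator("overall_behavior", "Tone and actions fit a booking assistant",
              score=lambda text: (1.0, "stub")),
]
```

Keeping evaluators as first-class objects like this makes it easier to version, audit, and optimize a large stack over time.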

The Stabilization Loop with Model Context Protocol (MCP)

The desired outcome is a stabilization loop: an agent attempts a task, an evaluation engine scores its output, and feedback (a numeric score plus an explanation) is returned to the agent [00:04:30]. This feedback enables the agent to improve its performance [00:04:57].
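Concretely, the loop can be a handful of lines. In this minimal sketch, `run_agent` and `evaluate` are hypothetical stand-ins for the agent and the evaluation engine:

```python
def run_agent(task: str, feedback: str | None = None) -> str:
    """Stand-in for the agent: produce an attempt, optionally using prior feedback."""
    ...

def evaluate(output: str) -> tuple[float, str]:
    """Stand-in for the evaluation engine: return a numeric score and an explanation."""
    ...

def stabilize(task: str, threshold: float = 0.8, max_rounds: int = 5) -> str:
    """Re-run the agent with evaluator feedback until the score clears the threshold."""
    feedback = None
    output = ""
    for _ in range(max_rounds):
        output = run_agent(task, feedback)
        score, explanation = evaluate(output)
        if score >= threshold:
            break
        feedback = f"Score {score:.2f}: {explanation}"  # fed into the next attempt
    return output
```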

The Model Context Protocol (MCP) is the latest method for attaching agents to these evaluation engines [00:05:04].
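For example, Pydantic AI (used in the hotel example below) ships MCP client support, so an evaluation server can be attached as a set of tools. This is a minimal sketch assuming a hypothetical local evaluation server; the class and attribute names follow Pydantic AI’s documented MCP support at the time of writing and may differ across versions:

```python
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerHTTP  # MCP client support; names vary by version

# Hypothetical local endpoint for an evaluation MCP server (e.g., Root Signals).
eval_server = MCPServerHTTP(url="http://localhost:3001/sse")

agent = Agent(
    "openai:gpt-4o",
    mcp_servers=[eval_server],
    system_prompt=(
        "After drafting a reply, call the evaluation tools "
        "and revise until the score is high."
    ),
)

async def main() -> None:
    async with agent.run_mcp_servers():  # open the MCP connection around the run
        result = await agent.run("Improve this marketing message: ...")
    print(result.output)  # result.data in older versions

asyncio.run(main())
```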

Practical Examples of Stabilization

1. Text Optimization Example

Using a simple text example, like “routing MCP service awesome let you access nicely” [00:05:35], one can use an MCP server (e.g., the Root Signals MCP server [00:13:43]) via a UI like Cursor [00:05:57]. This allows listing the available evaluators or “judges” (collections of evaluators) [00:06:06]. By selecting a relevant judge, such as a “marketing message quality judge,” the system can score the initial message, identify issues, suggest improvements, and then re-score the improved version [00:06:42]. This process of scoring and re-scoring can also happen automatically within an agent, as in the sketch below [00:08:38].
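The same listing-and-scoring flow can be driven programmatically. This is a minimal sketch using the official `mcp` Python SDK; the server launch command, the tool name, and its arguments are assumptions rather than the actual Root Signals interface:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Hypothetical launch command for a local evaluation MCP server.
    params = StdioServerParameters(command="root-signals-mcp", args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # list available evaluators/judges
            print([t.name for t in tools.tools])
            # Hypothetical tool name and arguments; check the server's actual schema.
            result = await session.call_tool(
                "evaluate",
                {"evaluator": "marketing_message_quality",
                 "text": "routing MCP service awesome let you access nicely"},
            )
            print(result.content)  # numeric score plus explanation

asyncio.run(main())
```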

2. Hotel Reservation Agent Example

Consider a hotel reservation agent (e.g., built with Pydantic AI [00:09:22]) that is explicitly instructed not to recommend competitor hotels [00:09:50].

  • Without MCP: If a user subtly hints at interest in a competitor, the agent might inadvertently suggest that competitor, such as the “Akma hotel” [00:10:07].
  • With MCP: With the MCP server enabled for evaluation [00:10:51], the agent invokes evaluators such as a “hotel policy adherence evaluator” [00:11:46], even without being specifically instructed to do so [00:11:40]. This lets the agent self-correct and avoid mentioning the competitor, demonstrating how an agent can receive feedback and improve its own behavior [00:12:17] and showing the potential for scaling AI agents in production to more complex scenarios [00:12:22]; a minimal sketch of this pattern follows this list.
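The sketch below shows the self-correction pattern using Pydantic AI’s retry mechanism. The policy evaluator is stubbed as a hypothetical keyword check (in the demo it is an MCP evaluator), and the decorator and attribute names vary across Pydantic AI versions:

```python
from pydantic_ai import Agent, ModelRetry

agent = Agent(
    "openai:gpt-4o",
    system_prompt="You are our hotel's reservation agent. Never recommend competitor hotels.",
)

@agent.output_validator  # named result_validator in older Pydantic AI versions
def check_policy(reply: str) -> str:
    # Stand-in for the "hotel policy adherence evaluator" MCP tool:
    # a hypothetical keyword check instead of a real evaluator call.
    if "akma" in reply.lower():
        # The explanation is sent back to the model, which retries with it as feedback.
        raise ModelRetry("Policy violation: the reply recommends a competitor hotel.")
    return reply

result = agent.run_sync("I heard the Akma hotel nearby is nice. What do you think?")
print(result.output)  # result.data in older versions
```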

Approach for Implementing Stabilization

To effectively implement this stabilization, follow these steps:

  1. Ensure a Powerful Evaluation Platform: Your evaluation library or platform must be robust enough to support a diversity of evaluators, manage their lifecycle, and facilitate their maintenance and optimization [00:12:36]. This includes optimizing both the agent and the evaluators themselves, as running many evaluators can be costly [00:12:55].
  2. Start Manually/Offline: Begin by running the MCP manually and offline, as demonstrated with the marketing message example [00:13:05]. This helps in understanding its behavior and gaining transparency into how it works (or doesn’t) [00:13:11]; a minimal offline-judging sketch follows this list.
  3. Attach via MCP: Finally, integrate the evaluators into the agents through MCP [00:13:24].
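For step 2, one way to start is to score a few outputs whose quality you already know, before any MCP wiring. This is a minimal sketch using a plain LLM-as-judge call via the OpenAI SDK; the prompt and model choice are illustrative assumptions, not the Root Signals evaluators themselves:

```python
from openai import OpenAI

client = OpenAI()

def judge(text: str, criteria: str) -> str:
    """Run one evaluator offline as a plain LLM-as-judge call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are an evaluator. Score the text 0-10 against the "
                        "criteria and explain the score in one sentence."},
            {"role": "user", "content": f"Criteria: {criteria}\n\nText: {text}"},
        ],
    )
    return response.choices[0].message.content

# Sanity-check the evaluator on outputs where you already know the right verdict.
print(judge("routing MCP service awesome let you access nicely",
            "Clear, grammatical marketing copy"))
```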

This approach promises to make agent behavior more controllable, transparent, dynamically self-improving, and self-correcting [00:13:30]. The Root Signals MCP server is available for free, and more such servers are expected to emerge, enabling various evaluation platforms and frameworks to be integrated into AI agent stacks [00:13:43].