From: aidotengineer

Evaluating AI systems, particularly those that provide recommendations, presents unique challenges, especially within complex multi-agent architectures [00:19:09]. Determining the quality of a recommendation in a system with many interconnected engines and rounds of conversation requires robust mechanisms [00:19:13].

Evaluation Challenges and Approaches

The primary challenge lies in determining definitively whether a recommendation is good [00:19:03]. To address this, a closed-loop system incorporating human scoring, structured feedback, and iterative revision cycles is crucial [00:19:26].
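The talk does not include an implementation of this loop; the following minimal Python sketch only illustrates the idea, and every name in it (HumanScore, closed_loop_review, the rating scale, and the thresholds) is a hypothetical placeholder:

    from dataclasses import dataclass

    # Hypothetical sketch of a closed-loop review cycle: a human scores a
    # recommendation, leaves structured feedback, and the system revises
    # until the score clears a quality bar or a round limit is reached.

    @dataclass
    class HumanScore:
        rating: int    # e.g. 1 (poor) to 5 (excellent); scale is an assumption
        feedback: str  # structured notes on what to fix

    def closed_loop_review(recommendation, score_fn, revise_fn,
                           min_rating=4, max_rounds=3):
        """Iterate score -> feedback -> revision until the quality bar is met."""
        for _ in range(max_rounds):
            score = score_fn(recommendation)  # human-in-the-loop scoring step
            if score.rating >= min_rating:
                break
            recommendation = revise_fn(recommendation, score.feedback)
        return recommendation

The key property of the loop is that structured human feedback, not just a numeric score, flows back into the revision step.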

Human Evaluation

In the early stages of development, human evaluation is the most effective method for assessing AI recommendations [00:19:34], [00:19:38]. While LLM-based evaluations are useful, they often do not provide the specific insight needed to drive the necessary improvements [00:19:42]–[00:19:50].

Internal Evaluation Tool: Eagle Eye

An internal human evaluation tool, named “Eagle Eye,” was developed to facilitate this process [00:19:55], [00:19:59]. The tool allows evaluators to inspect specific cases within the AI system in detail.

Through “Eagle Eye,” evaluators conduct studies on relevance, visibility, and clarity, assigning scores that inform decisions about where the system needs improvement [00:20:22]–[00:20:32].
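The talk does not show the schema behind these studies; a hypothetical scoring record covering the three dimensions might look like this (the field names and the 1–5 scale are assumptions, not the tool's actual schema):

    from dataclasses import dataclass

    # Hypothetical Eagle Eye-style scoring record; field names and the
    # 1-5 scale are assumptions, not the tool's actual schema.

    @dataclass
    class CaseScore:
        case_id: str
        relevance: int    # 1-5: does the recommendation fit the case?
        visibility: int   # 1-5: is the reasoning behind it inspectable?
        clarity: int      # 1-5: is the recommendation clearly worded?
        notes: str = ""   # free-form evaluator feedback

    def needs_improvement(score, threshold=3):
        """Flag cases where any study dimension falls below the threshold."""
        return min(score.relevance, score.visibility, score.clarity) < threshold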

Key Learnings on AI Evaluation

Identifying Issues: Hallucination Example

The evaluation tool helps identify issues such as hallucination. In one early case, a “staff architect network security” agent asked a “requirements retriever” agent for a workshop schedule, supplying specific dates; scheduling was outside the agent's defined capabilities, so the request was a hallucinated action [00:22:05]–[00:22:26]. Handling such cases is crucial for system improvement [00:22:31].
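The talk does not describe how such cases are handled once found; one possible guard, sketched here with hypothetical agent names, capability names, and exception type, is to validate each inter-agent request against the requesting agent's declared capabilities before dispatching it:

    # Hypothetical guard: reject inter-agent requests that fall outside the
    # requesting agent's declared capabilities. Agent names, capability names,
    # and the exception type are illustrative, not from the talk.

    AGENT_CAPABILITIES = {
        "staff_architect_network_security": {"review_design", "assess_risk"},
        "requirements_retriever": {"fetch_requirements"},
    }

    class HallucinatedActionError(Exception):
        pass

    def dispatch(agent, action, payload):
        allowed = AGENT_CAPABILITIES.get(agent, set())
        if action not in allowed:
            # Surface the case for human review instead of executing it.
            raise HallucinatedActionError(
                f"agent '{agent}' attempted undeclared action '{action}'"
            )
        # ... route the validated request to the target agent ...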

Effective evaluation is an ongoing process of experimentation: learning which patterns work best with the available data and refining agent interactions and autonomy over time [00:24:16]–[00:24:48].