From: aidotengineer

Building robust conversational AI voice agents presents numerous challenges beyond those typically encountered with large language models (LLMs) [00:00:11]. LLMs on their own already pose problems with hallucination, evaluation, and latency [00:00:17]; voice agents add further complexities, such as transcription in a streaming environment [00:00:51]. This makes evaluation particularly difficult, especially given the lack of objective metrics for conversational quality [00:00:28].

Case Study: Automating Consulting-Style Interviews

A practical application for voice AI agents involves automating consulting-style interviews within large companies [00:01:26]. Consultants traditionally conduct these in-depth qualitative research interviews to understand how work is done [00:01:50]. This process is expensive, time-consuming, and inefficient due to scheduling overhead [00:01:56].

While forms might suffice in some cases, the “human touch” is often necessary [00:02:24]. A human interviewer can improvise, build trust, ask relevant follow-up questions, and encourage interviewees to speak freely, which often yields more information than typed answers [00:02:26].

The goal was to build an AI interview agent that could conduct interviews like a human, feel like a conversation rather than a form, and enable interviewing hundreds of people simultaneously without scheduling issues or high costs [00:02:49]. A key benefit would be automatic transcription for later data extraction and aggregation [00:03:12].

Development Iterations and Challenges

Initially, a monolithic prompt with the OpenAI real-time API was used to conduct the interview [00:05:01]. However, this approach lacked control, making it difficult to know which question the LLM was asking or to steer the conversation [00:05:41].
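For illustration, a minimal sketch of what that first iteration might have looked like (the question list and wording are hypothetical, and the real-time API wiring is omitted): the whole script lives in a single instructions string, so the surrounding code has no visibility into where the interview is.

```python
# Hypothetical first iteration: one monolithic prompt for the whole interview.
QUESTIONS = [
    "Walk me through a typical day in your role.",
    "Which parts of that work feel most repetitive?",
    "Where do you think AI could help?",
]

MONOLITHIC_INSTRUCTIONS = (
    "You are a friendly interviewer. Conduct a conversational interview that "
    "covers all of the following questions, in order, with natural follow-ups:\n"
    + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(QUESTIONS))
)
# This single string would be handed to the real-time voice session as its
# instructions; from the outside there is no reliable way to tell which
# question the model is currently asking, or to steer it.
```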

Introducing Tool Use for Control

To address this, the system evolved to feed one question at a time into the prompt and introduce “tool use” [00:05:53]. The LLM was given a tool to explicitly “move on to the next question” [00:06:03]. Additional prompts were injected if the user navigated questions manually, allowing the agent to acknowledge shifts (e.g., “let’s move on to XYZ”) [00:06:15].
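A minimal sketch of this pattern, assuming an OpenAI-style function/tool definition (the tool name, prompts, and InterviewState class below are illustrative, not the talk's actual code):

```python
# One question at a time, plus an explicit tool the model can call to advance.
NEXT_QUESTION_TOOL = {
    "type": "function",
    "name": "next_question",
    "description": "Move the interview on to the next scripted question.",
    "parameters": {"type": "object", "properties": {}},
}


class InterviewState:
    """Tracks which question is active and builds the per-question prompt."""

    def __init__(self, questions):
        self.questions = questions
        self.index = 0

    def current_prompt(self):
        # Only the current question is fed to the model, not the whole script.
        return (
            "You are conducting an interview. Ask the following question, "
            f"then listen and follow up naturally:\n{self.questions[self.index]}"
        )

    def advance(self, user_initiated=False):
        # Called when the model invokes next_question, or when the user
        # navigates manually in the UI.
        if self.index + 1 >= len(self.questions):
            return "All scripted questions are covered. Wrap up the interview."
        self.index += 1
        if user_initiated:
            # Extra prompt injected so the agent acknowledges the shift
            # ("let's move on to XYZ") instead of ignoring it.
            return (
                "The user chose to skip ahead. Acknowledge the change, then "
                f"ask: {self.questions[self.index]}"
            )
        return self.current_prompt()
```

Routing progression through a tool call also instruments the agent: the orchestrator gets a reliable signal of where the interview is, which the monolithic prompt could not provide.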

A new problem emerged: LLMs’ tendency to “chitchat,” ask excessive follow-up questions, and go down “rabbit holes,” making them reluctant to move to the next question [00:07:01]. Forcing progression too rigidly eliminated the LLM’s ability to improvise [00:07:25].

Drift Detector Agent

To combat “rabbit-holing,” a separate “drift detector agent” was introduced [00:07:32]. This background agent, running a non-voice, text-based LLM, listened to the conversation transcript [00:07:39]. Its task was to decide if the conversation was off-track or if the current question had been sufficiently answered, allowing it to force the main LLM to use the “next question” tool and move on [00:07:49].
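A minimal sketch of such a background check, assuming a text-only chat-completions call (the model choice, prompt wording, and should_move_on helper are illustrative):

```python
import json

from openai import OpenAI

client = OpenAI()

DRIFT_PROMPT = """You monitor an ongoing interview.
Current question: {question}

Transcript so far:
{transcript}

Decide whether the main interviewer should move on. Reply in JSON:
{{"move_on": true or false, "reason": "..."}}
"move_on" is true if the question has been sufficiently answered OR the
conversation has drifted off-track."""


def should_move_on(question: str, transcript: str) -> bool:
    # Runs in the background on the text transcript, not in the voice loop.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable text-domain model
        messages=[{
            "role": "user",
            "content": DRIFT_PROMPT.format(question=question, transcript=transcript),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["move_on"]
```

When this returns true, the orchestrator can force the main voice agent to invoke the next-question tool.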

Goals, Priorities, and the Next Question Agent

Despite these improvements, achieving human-like interview flow remained difficult [00:08:22]. The agent either followed up too little or too much and rephrased questions poorly, and the strictly linear “next question” flow felt restrictive [00:08:26].

To enhance naturalness and guide the LLM, “goals and priorities” were added as a first-class concept [00:09:04]. Instead of just the question, the LLM was informed of the “why” behind it, guiding rephrasing and follow-up questions (e.g., a high-priority goal to understand daily activities, a medium-priority goal to identify AI use cases) [00:09:16].
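A minimal sketch of goals and priorities as a first-class structure (the field names and example goals are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Goal:
    description: str
    priority: str  # e.g. "high" or "medium"


@dataclass
class InterviewQuestion:
    text: str
    goals: list[Goal]

    def to_prompt(self) -> str:
        # The "why" behind the question travels with it into the prompt,
        # guiding how the model rephrases and which follow-ups it asks.
        goal_lines = "\n".join(f"- ({g.priority}) {g.description}" for g in self.goals)
        return (
            f"Ask: {self.text}\n"
            "Goals behind this question (use them to decide how to rephrase "
            f"and which follow-ups are worth asking):\n{goal_lines}"
        )


question = InterviewQuestion(
    text="Walk me through a typical day in your role.",
    goals=[
        Goal("Understand the interviewee's daily activities", "high"),
        Goal("Identify tasks that could be candidates for AI support", "medium"),
    ],
)
```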

A further “next question agent” was added, also running in the background on the transcript [00:10:29]. This bot was “taught how to be a good interviewer” and could guide the conversation flow by determining what to ask next [00:10:38].
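One possible shape for that background agent, again as a hedged sketch (the prompt wording, model choice, and function name are assumptions):

```python
from openai import OpenAI

client = OpenAI()

NEXT_QUESTION_PROMPT = """You are an expert qualitative interviewer.
Remaining goals for this interview:
{goals}

Transcript so far:
{transcript}

Reply with the single best question to ask next, and nothing else."""


def propose_next_question(goals: str, transcript: str) -> str:
    # Runs on the transcript in the background; its suggestion is fed into
    # the main voice agent's prompt.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": NEXT_QUESTION_PROMPT.format(goals=goals, transcript=transcript),
        }],
    )
    return resp.choices[0].message.content.strip()
```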

Transcription Quality and User Experience

Transcription, a core component, posed its own challenges [00:10:54]. The OpenAI real-time API provides transcription via Whisper, but Whisper always maps audio into the text domain: when it receives non-speech sounds (e.g., silence or background noise), it can emit erroneous, hallucinated text [00:11:27].

To mitigate poor user experience from garbled transcripts, a separate agent was implemented at the UX level [00:12:37]. This agent, with full conversation context, determined whether to hide a piece of the transcript from the user if it was likely a transcription error [00:12:40]. The core model still received the raw input, allowing it to correctly identify when it didn’t understand and re-ask the question, which improved the user’s perception [00:13:06].
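A minimal sketch of that UX-level filter (the prompt and should_hide_chunk helper are illustrative; the raw chunk still goes to the core model regardless of the outcome):

```python
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = """Conversation so far:
{context}

New transcript chunk: "{chunk}"

Is this chunk likely a transcription error (background noise, silence, or
garbled audio turned into text) rather than something the user actually said?
Answer with exactly one word: "hide" or "show"."""


def should_hide_chunk(context: str, chunk: str) -> bool:
    # Decides only what the user sees in the UI; the core model still
    # receives the raw transcript and can ask for clarification itself.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": FILTER_PROMPT.format(context=context, chunk=chunk),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("hide")
```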

The Critical Role of Systematic Evaluation (Evals)

The iterative process of adding multiple “Band-Aid” agents led to a complex system with many prompts [00:13:27]. This vibes-driven development made it hard to identify the root cause of issues, determine whether fixes worked, and prevent regressions [00:13:43].

The solution is evals—a systematic way to measure system performance [00:14:20]. For conversational AI, this involves defining a set of desired attributes and creating an automated test suite [00:14:29].

Metrics were established, with an LLM acting as a “judge” to score each conversation against the desired attributes. Each metric is backed by a tuned prompt [00:14:54]. While the scores are not perfectly objective (there is no perfect ground truth) and the judge prompts can always be improved, this method significantly moves development from a purely vibes-driven style to a metrics-driven approach [00:15:01].
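A minimal sketch of one LLM-as-judge metric and a tiny suite runner (the attribute handling, 1-to-5 scale, and prompt are illustrative rather than the talk's actual rubric):

```python
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI interviewer.
Attribute to judge: {attribute}

Transcript:
{transcript}

Score the interviewer on this attribute from 1 (poor) to 5 (excellent).
Reply in JSON: {{"score": 1-5, "justification": "..."}}"""


def judge(transcript: str, attribute: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: judge model choice is up to you
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(attribute=attribute, transcript=transcript),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["score"]


def run_eval_suite(transcripts, attributes):
    # Average score per attribute across the whole test suite of transcripts.
    return {
        attr: sum(judge(t, attr) for t in transcripts) / len(transcripts)
        for attr in attributes
    }
```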

Overcoming the Lack of Ground Truth with Synthetic Conversations

Unlike systems with objective historical data or clear ground truth, conversational AI lacks straightforward objective metrics [00:15:26]. To avoid discovering issues only after deployment, synthetic conversations were introduced [00:16:13].

This involves using LLMs to act as “fake users” or interviewees [00:16:17]. Personas are created and used as prompts to guide the LLM’s responses (e.g., “snarky teenager,” various job functions, personalities) [00:16:41]. A roster of these personas allows for running many automated interviews against them [00:17:01].

After these synthetic conversations, the same eval suite can be run, providing average metrics across a broad population of expected user types [00:17:11]. This method not only helps measure anticipated performance but also automates much of the tiring manual testing process [00:17:30].
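A minimal sketch of persona-driven synthetic interviews (the personas, model choice, and the run_interview/evaluate callables are hypothetical stand-ins for the real orchestration and the eval suite above):

```python
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "You are a snarky teenager doing a summer internship; answer tersely.",
    "You are a detail-oriented finance analyst; answer thoroughly and politely.",
    "You are a busy warehouse supervisor who drifts into side stories.",
]


def fake_user_reply(persona: str, history: list[dict]) -> str:
    # A text-only model plays the interviewee according to its persona.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "system", "content": persona}, *history],
    )
    return resp.choices[0].message.content


def run_synthetic_suite(run_interview, evaluate):
    # run_interview(reply_fn) should return a finished transcript;
    # evaluate(transcripts) is the same eval suite used on real conversations.
    transcripts = [
        run_interview(lambda history, p=persona: fake_user_reply(p, history))
        for persona in PERSONAS
    ]
    return evaluate(transcripts)
```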

In conclusion, while initial development can leverage basic API calls and prompt engineering, building robust voice applications requires additional strategies:

  • Out-of-band checks: Employing separate text-domain agents for decision-making and course correction [00:17:55].
  • Tool use: A powerful method for constraining LLM behavior and instrumenting its actions [00:18:10].
  • Evals: Critical for measuring success and guiding development, even when objective ground truth is elusive [00:18:28].