From: aidotengineer
Developing AI voice agents presents unique challenges beyond those found in traditional Large Language Model (LLM) applications, particularly concerning user interfaces and the accurate handling of speech-to-text transcription. While LLMs themselves can hallucinate, are difficult to evaluate, and face latency issues, voice agents magnify these problems by operating in a streaming audio environment and requiring robust transcription capabilities [00:00:17].
Case Study: Automating Consulting-Style Interviews
To illustrate these challenges and their solutions, Fractional AI developed an AI interview agent designed to automate the process of conducting qualitative interviews within large organizations [00:01:15]. Consultants typically perform these costly and inefficient interviews to gather information about job functions or internal processes [00:01:47]. The goal was to create an AI system that could conduct these interviews naturally, feeling more like a conversation than a form, while being able to interview hundreds of people simultaneously and provide automatic transcriptions [00:02:49].
User Interface (UI) Challenges and Solutions
Initially, a naive approach was attempted: OpenAI’s Realtime API driven by a single monolithic prompt [00:05:01] containing every interview question along with instructions on how to navigate between them [00:05:10].
Enabling Navigation and Control
A key desired UI feature was a roadmap displaying interview questions, indicating the current question, and allowing users to jump between them [00:05:24]. The monolithic prompt approach proved unsuitable because:
- There was no way for the system to know which question the LLM was currently asking [00:05:42].
- It was difficult to make the LLM move to a different question if the user clicked around [00:05:47].
The solution was a more modular approach (sketched in code after the list):
- One Question at a Time: Only one question is sent to the LLM at a time [00:05:53].
- Tool Use: The LLM is given access to a tool to explicitly request moving to the next question. When this tool is called, the next question is fed deterministically [00:06:00].
- Contextual Prompts: Additional prompts are injected into the LLM’s stream when a user clicks around the roadmap, informing the agent that the user has navigated to a new or revisited question [00:06:15]. This allows the agent to respond appropriately, such as saying, “sure, let’s move on to XYZ, we can come back to that later” [00:06:45].
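A minimal sketch of this wiring is below, assuming a Python backend. The question list, the `session.send_context_message` wrapper, and the handler names are illustrative assumptions, not the actual Realtime API surface; the point is that the tool schema and the roadmap clicks both funnel into deterministic, code-controlled question selection.

```python
# Sketch: tool-based question navigation. `session.send_context_message` is a
# hypothetical wrapper around whatever mechanism the voice API exposes for
# injecting text context into the model's stream.

INTERVIEW_QUESTIONS = [
    "What are your main responsibilities in your current role?",
    "Which internal tools and processes do you rely on day to day?",
    # ...
]

# Tool schema the voice model can call to request the next question.
NEXT_QUESTION_TOOL = {
    "name": "next_question",
    "description": "Call this when the current question has been fully answered.",
    "parameters": {"type": "object", "properties": {}},
}


class InterviewState:
    def __init__(self):
        self.current_index = 0

    def on_next_question_tool_call(self, session):
        """Deterministically advance and feed only the next question to the model."""
        self.current_index += 1
        question = INTERVIEW_QUESTIONS[self.current_index]
        session.send_context_message(
            f"Ask the user the following question next: {question}"
        )

    def on_roadmap_click(self, session, clicked_index):
        """The user jumped around the roadmap: tell the agent so it responds naturally."""
        self.current_index = clicked_index
        question = INTERVIEW_QUESTIONS[clicked_index]
        session.send_context_message(
            "The user has navigated to a different question in the roadmap. "
            f"Acknowledge the jump, then ask: {question}"
        )
```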
Managing Conversation Flow with Side Agents
Even with tool use, the LLM tended to “chitchat” and “dig down rabbit holes,” reluctant to move on to the next question [00:07:01]. Pushing it too hard to advance, however, would eliminate its ability to improvise, which was undesirable [00:07:25].
To address this, a “Drift Detector Agent” was introduced [00:07:33]. This is a separate, non-voice, text-based LLM running in a side thread that listens to the conversation’s transcript [00:07:35]. Its task is to decide if the conversation is off-track, if the current question has been answered, and if it’s time to move on [00:07:51]. If the Drift Detector strongly indicates it’s time to move, it can force the main LLM to use its “next question” tool, preventing “rabbit holing” [00:08:04].
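A minimal sketch of such a check is below, using the `openai` Python client. The prompt wording, the MOVE_ON/STAY convention, and the model name are illustrative assumptions; the essential idea is that a cheap text-only call over the transcript decides whether the main loop should force the voice agent to invoke its “next question” tool.

```python
# Sketch of the drift-detector check: a plain text model reads the running
# transcript and decides whether the interview should move on.
from openai import OpenAI

client = OpenAI()

DRIFT_PROMPT = """You are monitoring a live interview.
Current question: {question}
Transcript so far:
{transcript}

Has the current question been answered, and is the conversation drifting
off-topic? Reply with exactly one word: MOVE_ON or STAY."""


def should_move_on(question: str, transcript: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable text model works here
        messages=[{
            "role": "user",
            "content": DRIFT_PROMPT.format(question=question, transcript=transcript),
        }],
    )
    return response.choices[0].message.content.strip() == "MOVE_ON"
```

When this returns True, the main loop can constrain or instruct the voice model so that its next response goes through the “next question” tool rather than another follow-up.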
To further refine the human-like interview flow:
- Goals and Priorities: Instead of just the question, the LLM is given the why behind each question as “goals” (e.g., “high priority goal of getting a clear picture of responsibilities”) [00:09:04]. This guides rephrasing and follow-up questions [00:09:16].
- Next Question Agent: Another side agent, the “Next Question Agent,” runs in the background on the transcript [00:09:27]. It is “taught how to be a good interviewer” and decides which question should be asked next, guiding the conversation [00:10:38]; a sketch of this side agent follows the list.
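The sketch below shows one way to combine the two ideas: each question carries prioritized goals, and a background text model picks the next question from that plan given the transcript so far. The goal structure, prompt wording, and model name are illustrative assumptions.

```python
# Sketch of the "Next Question Agent": a background text model that sees the
# transcript plus per-question goals and decides what to ask next.
from openai import OpenAI

client = OpenAI()

QUESTION_GOALS = [
    {
        "question": "What are your main responsibilities?",
        "goals": [
            {"priority": "high", "goal": "Get a clear picture of responsibilities"},
            {"priority": "low", "goal": "Understand how success is measured"},
        ],
    },
    # ...
]

NEXT_QUESTION_PROMPT = """You are an expert qualitative interviewer.
Interview plan (questions with goals and priorities):
{plan}

Transcript so far:
{transcript}

Which question from the plan should be asked next so that the high-priority
goals are covered without repeating what has already been answered?
Reply with the question text only."""


def pick_next_question(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": NEXT_QUESTION_PROMPT.format(
                plan=QUESTION_GOALS, transcript=transcript),
        }],
    )
    return response.choices[0].message.content.strip()
```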
Addressing Transcription Errors
The interview system displays a live transcript of the conversation [00:11:00]. OpenAI’s API provides a transcript of the user’s speech via its Whisper model, but Whisper only converts audio to text [00:11:26]. Unlike the core model, which operates directly on audio and can make sense of non-speech sounds, Whisper can produce surprising or erroneous transcriptions from silence or background noise [00:11:45]. Examples include silence being transcribed as a switch into another language [00:11:55] or background noise coming out as nonsensical words like “creeping Dippity Dippity Dippity Dippity” [00:12:07]. There are no API knobs to directly tune Whisper’s behavior here [00:12:32].
UI-Level Solution for Transcription Errors
To improve the user experience and avoid displaying embarrassing transcription errors, a separate agent was added to the system [00:12:37]. Its sole task is to look at the full conversation context and decide whether a given piece of the transcript should be hidden from the user as a likely transcription error [00:12:40]. The incorrect transcript is still captured internally; hiding it simply prevents a poor user experience [00:12:54]. Crucially, the core model still processes the original audio, so if the user’s input was unintelligible, the core model’s response (e.g., “I didn’t really get that, could you rephrase?”) still matches the user’s perception, even though the erroneous text is never displayed [00:13:06].
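A minimal sketch of this filtering agent is below. The HIDE/SHOW convention, prompt wording, and model choice are assumptions; the key point is that the decision is a cheap text-only classification made before the utterance is rendered in the UI.

```python
# Sketch of the transcript-filtering agent: given the conversation so far and a
# newly transcribed user utterance, guess whether it is a Whisper artifact
# (hallucinated from silence or noise) and should be hidden from the display.
from openai import OpenAI

client = OpenAI()

HIDE_PROMPT = """Below is an interview transcript followed by the newest
transcribed user utterance. If the utterance looks like a transcription
artifact (gibberish, a sudden language switch, repeated filler produced from
silence or background noise), reply HIDE; otherwise reply SHOW.

Transcript:
{transcript}

Newest utterance:
{utterance}"""


def should_hide(transcript: str, utterance: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": HIDE_PROMPT.format(transcript=transcript, utterance=utterance),
        }],
    )
    return response.choices[0].message.content.strip() == "HIDE"
```

The utterance is still stored in the internal transcript either way; only the UI skips rendering it when `should_hide` returns True.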
Development Challenges and Evaluation
The iterative process of adding multiple “Band-Aid” agents to fix issues creates complexity, making it difficult to know which prompt to update or if changes introduce regressions [00:13:28]. This is a common problem in LLM-based development [00:14:13].
To address this, systematic evaluation (“evals”) is critical [00:14:20]:
- Automated Test Suite: A set of metrics measures attributes such as the clarity, completeness, and professionalism of the conversation [00:14:29]. Each metric is scored by an LLM acting as a judge, run against tuned judging prompts [00:14:36]. While not perfectly objective, given the lack of ground truth, this provides a more metrics-driven way to iterate [00:15:01]; a sketch of such a judge appears after this list.
- Synthetic Conversations: To test robustness across a broad population and automate testing, LLMs are used to simulate users with specific personas (e.g., a “snarky teenager” or various job functions and personalities) [00:16:15]. The AI agent interviews these synthetic users, and the same evaluation suite can then be run over the generated conversations to get average metrics [00:17:11].
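A minimal sketch of the judge side of this setup, assuming transcripts (real or synthetic-persona) are available as plain text. The 1–5 scale, prompt wording, and model name are assumptions; the metric names come from the talk.

```python
# Sketch of the LLM-as-judge eval: score one interview transcript on clarity,
# completeness, and professionalism, and average scores across a batch of
# synthetic-persona conversations.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI-conducted interview.
Score the transcript from 1 (poor) to 5 (excellent) on: clarity,
completeness, professionalism. Reply with a JSON object like
{{"clarity": 4, "completeness": 3, "professionalism": 5}}.

Transcript:
{transcript}"""


def judge(transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(transcript=transcript)}],
    )
    return json.loads(response.choices[0].message.content)


def average_scores(transcripts: list[str]) -> dict:
    """Run the judge over many generated conversations and average the metrics."""
    scores = [judge(t) for t in transcripts]
    return {metric: sum(s[metric] for s in scores) / len(scores)
            for metric in ("clarity", "completeness", "professionalism")}
```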
Conclusion
Building robust AI voice applications goes beyond simply calling an API and doing some prompt engineering [00:17:31]. Key additions include:
- Out-of-Band Checks: Using separate, text-domain agents to make decisions about the conversation’s progress and keep it on track [00:17:55].
- Tool Use: Empowering the LLM with specific tools can constrain its behavior and allow for better instrumentation and understanding of its actions [00:18:10].
- Evaluations: Evals are critical for measuring success and guiding development, even in domains without objective ground truth [00:18:28].