From: aidotengineer

Building AI voice agents presents significant challenges, as they inherit common issues with large language models (LLMs) like hallucination and difficulty in evaluation, plus additional complexities specific to audio processing like transcription and streaming environments [00:00:17]. This case study details the development of an AI interview agent designed to automate consulting-style interviews, highlighting the practical challenges encountered and the iterative solutions implemented.

Automating Consulting-Style Interviews

Traditionally, consultants engage in resource-intensive, qualitative interviews with employees within large companies to gather information about processes and jobs [01:26:00]. This method is expensive, inefficient, and fraught with scheduling overhead [01:56:00]. While forms might seem like an alternative, the “human touch” of an interviewer is often crucial for navigating improvisation, building trust, and asking nuanced follow-up questions [02:24:00]. People also tend to provide richer, more detailed answers when speaking freely rather than typing responses [02:44:00].

The goal was to develop an AI interview agent that could replicate this human-like interaction [02:49:00]. This system would feel more like a natural conversation than a form, but with the ability to interview hundreds of people simultaneously, eliminating scheduling conflicts and high costs [02:57:00]. A key byproduct would be automatic transcription for data extraction and aggregation [03:12:00].

Iterative Development and Challenges

The development process involved several iterations to overcome fundamental limitations of LLM-based voice agents:

Initial Approach: Monolithic Prompting

The first attempt involved integrating with OpenAI’s real-time API using a single, large prompt that described the interview’s purpose and included all questions [04:51:00].
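
A minimal sketch of what that first version might look like, assuming a WebSocket connection to the Realtime API: the `QUESTIONS` list, prompt wording, and model name are illustrative, and the `session.update` event shape follows OpenAI’s published documentation rather than anything specified in the talk.

```python
# Sketch of the "monolithic prompt" approach: the whole interview script is
# packed into one instructions string and sent once at session start.
import asyncio
import json
import os

import websockets  # pip install websockets

QUESTIONS = [  # illustrative interview questions
    "Walk me through a typical day in your role.",
    "Which parts of your job feel the most repetitive?",
    "Where do you spend time waiting on other teams?",
]

MONOLITHIC_PROMPT = (
    "You are a consultant conducting a qualitative interview about this "
    "person's job. Work through the following questions, asking natural "
    "follow-ups where useful:\n"
    + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(QUESTIONS))
)

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older versions of the websockets library call this kwarg `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # One big session-level prompt -- no per-question state to track.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": MONOLITHIC_PROMPT, "voice": "alloy"},
        }))
        # ...stream microphone audio up and model audio back down here...

asyncio.run(main())
```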

Challenge: This “monolithic prompt” approach made it impossible to track which question the LLM was currently asking or to guide it to specific questions, which was crucial for a user interface featuring a roadmap allowing users to jump between questions [05:36:00].

Iteration 1: Structured Questions and Tool Use

To address navigation and tracking, the system was redesigned to feed the LLM one question at a time [05:53:00]. Tool use was introduced, allowing the LLM to signal when it was ready to move to the next question. Specific prompts were also injected whenever the user jumped to a different question in the roadmap, informing the LLM of the jump [06:00:00]. This allowed the agent to acknowledge skips and revisits naturally [06:41:00].
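
A rough sketch of this structure, assuming the Realtime API’s session-tool format; the `next_question` tool name, the per-question prompt, and the jump-notice wording are assumptions rather than the talk’s exact details.

```python
# Sketch of Iteration 1: the model sees one question at a time, and a tool
# call (rather than free text) signals that it is done with that question.
from dataclasses import dataclass

# Tool definition in the Realtime API's session-tool shape (illustrative).
NEXT_QUESTION_TOOL = {
    "type": "function",
    "name": "next_question",
    "description": "Call this when the current question has been sufficiently answered.",
    "parameters": {"type": "object", "properties": {}, "required": []},
}

@dataclass
class InterviewState:
    questions: list[str]
    index: int = 0

def instructions_for(state: InterviewState) -> str:
    """Per-question prompt, replacing the old monolithic prompt."""
    return (
        "You are conducting a consulting-style interview.\n"
        f"Current question ({state.index + 1} of {len(state.questions)}): "
        f"{state.questions[state.index]}\n"
        "Ask it conversationally, follow up briefly if needed, then call "
        "the next_question tool once it is answered."
    )

def jump_notice(state: InterviewState, new_index: int) -> str:
    """Message injected when the user clicks a different question in the roadmap."""
    old, new = state.questions[state.index], state.questions[new_index]
    state.index = new_index
    return (
        f"The user just jumped from '{old}' to '{new}'. Acknowledge the jump "
        "naturally before asking the new question."
    )
```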

Challenge: Despite the new structure, LLMs tend to “chitchat,” ask excessive follow-up questions, and “dig down rabbit holes,” making them reluctant to move on [07:01:00]. Forcing the model too hard, however, would stifle its ability to improvise [07:25:00].

Iteration 2: The Drift Detector Agent

To manage the LLM’s tendency to stray, a separate background agent, called the “drift detector,” was introduced [07:33:00]. Running as a separate text-based LLM, it listened to the conversation transcript and decided whether the interview was going off-track or whether the current question had been sufficiently answered [07:35:00]. If the drift detector strongly indicated it was time to move on, it could force the main LLM to use the “next question” tool [08:04:00].
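
A minimal sketch of this out-of-band check, assuming a plain text Chat Completions call; the prompt wording, model choice, and the high-confidence threshold are illustrative.

```python
# Sketch of the "drift detector": a separate text-only LLM reads the running
# transcript and votes on whether to force the next_question tool call.
import json

from openai import OpenAI

client = OpenAI()

def should_move_on(transcript: str, current_question: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You monitor an interview. Given the transcript and the current "
                "question, decide whether the conversation has drifted or the "
                "question has been sufficiently answered. Reply with JSON: "
                '{"move_on": true|false, "confidence": "low"|"high"}.'
            )},
            {"role": "user", "content": (
                f"Current question: {current_question}\n\nTranscript:\n{transcript}"
            )},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    # Only a high-confidence "move on" forces the main voice model's hand.
    return bool(verdict.get("move_on")) and verdict.get("confidence") == "high"
```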

Challenge: Even with the drift detector, tuning the system to interview like a human remained difficult [08:22:00]. The agent either followed up too little or too much, and rephrased questions in unhelpful ways. The linear “next question” flow was also restrictive, limiting natural conversation [08:30:00].

Iteration 3: Goals, Priorities, and the Next Question Agent

To enhance the interview quality and natural flow, two further additions were made:

  • Goals and Priorities: Instead of just providing the question, the LLM was given the “why” behind each question, including high- and medium-priority goals (e.g., “get a clear picture of this person’s regular activities” or “suss out where AI might be useful”) [09:04:00]. This guided the LLM’s rephrasing and follow-up questions [09:16:00].
  • Next Question Agent: Another side agent, the “next question agent,” was specifically tasked with determining what question should be asked next [09:27:00]. This agent, trained on principles of good interviewing, could actively guide the conversation path [10:31:00]; a combined sketch of both additions follows this list.
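
A combined sketch of both additions, with hypothetical field names, goals, and prompts (the talk does not spell out the exact structures used):

```python
# Sketch of Iteration 3: each question carries its goals, and a separate
# "next question" agent picks the path instead of a fixed linear order.
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI()

@dataclass
class Question:
    text: str
    high_priority_goals: list[str] = field(default_factory=list)
    medium_priority_goals: list[str] = field(default_factory=list)

QUESTIONS = [
    Question(
        text="Walk me through a typical day in your role.",
        high_priority_goals=["Get a clear picture of this person's regular activities."],
        medium_priority_goals=["Suss out where AI might be useful."],
    ),
    Question(text="Which parts of your job feel the most repetitive?"),
]

def question_prompt(q: Question) -> str:
    """Gives the voice model the 'why', so rephrasings and follow-ups stay on goal."""
    return (
        f"Question: {q.text}\n"
        f"High-priority goals: {'; '.join(q.high_priority_goals) or 'none'}\n"
        f"Medium-priority goals: {'; '.join(q.medium_priority_goals) or 'none'}"
    )

def pick_next_question(transcript: str, remaining: list[Question]) -> int:
    """The 'next question agent': chooses which remaining question fits best next."""
    menu = "\n".join(f"{i}: {q.text}" for i, q in enumerate(remaining))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You are an expert interviewer. Given the transcript so far and a "
                "numbered list of remaining questions, reply with only the number "
                "of the question that best continues the conversation."
            )},
            {"role": "user", "content": f"Transcript:\n{transcript}\n\nRemaining:\n{menu}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```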

Addressing Transcription and User Experience

While OpenAI’s real-time API provides user transcripts, it relies on a separate Whisper model to generate them [11:05:00]. Unlike the core model, which works with the audio directly, Whisper’s only job is to turn audio into text, and it will attempt to do so even for non-speech sounds or silence, which can produce surprising or inaccurate transcripts [11:27:00]. For example, silence might be transcribed as a non-English phrase, or background noise as random words [11:54:00].

Solution: An additional agent was introduced to improve user experience [12:37:00]. This agent takes the entire conversation context and decides whether to hide a piece of the transcript from the user if it suspects a transcription error [12:40:00]. While the incorrect transcript is still captured internally, hiding it prevents embarrassing errors from showing up in the UI [12:54:00]. Crucially, the core LLM still understands the audio input, allowing it to respond appropriately (e.g., “I didn’t really get that, what do you hope to accomplish?”) [13:06:00].
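
A minimal sketch of such a transcript-hygiene check; the prompt, the `should_hide` helper, and the `full_log`/`show_in_ui` hooks are hypothetical stand-ins for the real storage and UI layers.

```python
# Sketch of the transcript-hiding agent: given the conversation so far and a
# newly transcribed user utterance, decide whether the chunk looks like a
# Whisper artifact (noise, silence) and should be kept out of the UI.
import json

from openai import OpenAI

client = OpenAI()

def should_hide(conversation: str, new_chunk: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You review live interview transcripts. Decide whether the latest "
                "transcribed user utterance looks like a transcription artifact "
                "(random words from background noise, a non-English phrase from "
                'silence) rather than real speech. Reply with JSON: {"hide": true|false}.'
            )},
            {"role": "user", "content": (
                f"Conversation so far:\n{conversation}\n\nLatest utterance:\n{new_chunk}"
            )},
        ],
    )
    return bool(json.loads(response.choices[0].message.content).get("hide"))

def handle_transcript_chunk(conversation: str, new_chunk: str,
                            full_log: list[str], show_in_ui) -> None:
    full_log.append(new_chunk)                    # always keep the raw transcript
    if not should_hide(conversation, new_chunk):  # only clean text reaches the UI
        show_in_ui(new_chunk)
```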

The Importance of Evaluation (Eval)

The iterative process of adding multiple side agents to “Band-Aid” over issues led to increased complexity [13:28:00]. With numerous prompts and agents, it became difficult to diagnose problems, implement fixes without introducing regressions, or measure overall performance [13:48:00].

Solution: LLM-as-Judge Evals and Synthetic Conversations

To systematically measure performance, a set of metrics was developed and evaluated by an LLM acting as a judge [14:20:00]. These metrics, such as “clarity,” “completeness,” and “professionalism,” are derived from prompts that analyze conversation transcripts [14:36:00]. While not perfectly objective due to the lack of ground truth, this approach provides a metrics-driven style of iteration over purely “vibes-driven” development [15:01:00].
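
A minimal sketch of such a judge, assuming each metric is simply a rubric prompt scored 1-5 over the full transcript; the metric wording and JSON schema here are illustrative.

```python
# Sketch of an LLM-as-judge eval: each metric is a prompt that grades the
# conversation transcript, and scores are averaged across many interviews.
import json
from statistics import mean

from openai import OpenAI

client = OpenAI()

METRICS = {
    "clarity": "Were the interviewer's questions clear and easy to understand?",
    "completeness": "Did the interview cover each question's goals thoroughly?",
    "professionalism": "Did the interviewer stay courteous and on topic?",
}

def judge(transcript: str) -> dict[str, int]:
    scores = {}
    for name, rubric in METRICS.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": (
                    f"You grade interview transcripts for {name}. {rubric} "
                    'Reply with JSON: {"score": <integer 1-5>, "reason": "<one sentence>"}.'
                )},
                {"role": "user", "content": transcript},
            ],
        )
        scores[name] = json.loads(response.choices[0].message.content)["score"]
    return scores

def average_scores(transcripts: list[str]) -> dict[str, float]:
    """Turns per-transcript grades into trackable averages across a test set."""
    all_scores = [judge(t) for t in transcripts]
    return {name: mean(s[name] for s in all_scores) for name in METRICS}
```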

To further refine evaluation and prevent real-world user issues, “synthetic conversations” were introduced [16:11:00]. This involves using LLMs to create “fake users” with specific personas (e.g., “snarky teenager” or various job functions and personalities) who then interact with the AI interview agent [16:17:00]. The same evaluation suite can then be run over these synthetic conversations, allowing for automated testing and the generation of average metrics across a broad population of potential interviewees [17:00:00].
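
A sketch of how such synthetic conversations might be driven, with illustrative personas and a hypothetical `ask_question` hook standing in for the real interview agent; the resulting transcripts can then be scored by the judge sketched above.

```python
# Sketch of synthetic conversations: an LLM plays a persona-driven "fake user"
# against the interview agent, producing transcripts for automated evals.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "a snarky teenager doing a summer internship in the warehouse",
    "a detail-oriented accounts-payable clerk of twenty years",
    "a distracted sales manager who answers in one-word replies",
]

def synthetic_user_reply(persona: str, history: list[dict]) -> str:
    # From this model's point of view, the interviewer's turns are "user" messages.
    flipped = [
        {"role": "user" if m["role"] == "assistant" else "assistant",
         "content": m["content"]}
        for m in history
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": (
            f"You are {persona}, being interviewed about your job. "
            "Answer the interviewer in character, in one or two sentences."
        )}] + flipped,
    )
    return response.choices[0].message.content

def run_synthetic_interview(persona: str, ask_question, turns: int = 10) -> str:
    """ask_question(history) -> str is a hook into the interview agent (hypothetical)."""
    history: list[dict] = []
    for _ in range(turns):
        question = ask_question(history)
        history.append({"role": "assistant", "content": question})
        answer = synthetic_user_reply(persona, history)
        history.append({"role": "user", "content": answer})
    return "\n".join(f"{m['role']}: {m['content']}" for m in history)
```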

Key Takeaways

Building robust AI voice applications requires more than just basic API calls and prompt engineering [17:31:00]. Key strategies include:

  • Out-of-Band Checks: Employing separate agents operating in the text domain to make decisions and guide the main conversation [17:55:00].
  • Tool Use for Constraint: Leveraging tools to constrain LLM behavior and instrument its actions, providing insights into its decision-making process [18:10:00].
  • Critical Evals: Systematically measuring success and guiding development using evaluation metrics, even in domains without objective ground truth [18:28:00].