From: aidotengineer
Building AI voice agents, particularly for conversational user interfaces (UIs), presents significant challenges beyond those typically encountered with large language models (LLMs) alone [00:00:17]. LLMs are inherently difficult to manage, prone to hallucination, and hard to evaluate with objective metrics, especially in conversational contexts [00:00:18]. When audio is involved, the complexity increases further: the system must also handle transcription and a streaming, real-time environment [00:00:51].
This article details the AI orchestration and prompt engineering techniques used to overcome these hurdles, drawn from a case study: building an AI interview agent for automated consulting-style interviews.
Case Study: Automating Consulting-Style Interviews
The application built for this case study is an AI interview agent designed to automate “consulting-style” interviews [00:01:26]. These differ from job interviews: they involve interviewing employees within large companies (e.g., Fortune 500 firms) to learn how they perform their jobs [00:01:31].
Human-led consulting interviews are:
- Expensive and inefficient [00:01:56].
- Time-consuming for consultants and complex to schedule [00:02:05].
While forms might seem like an alternative, the human touch is crucial for these types of interviews [00:02:24]. An interviewer needs to:
- Improvise and build trust [00:02:30].
- Ask relevant follow-up questions [00:02:32].
- Encourage interviewees to speak freely and ramble, often yielding more information than typed answers [00:02:37].
The goal was to build an AI interview agent that could:
- Conduct interviews like a human [00:02:53].
- Feel like a conversation, not a form [00:02:59].
- Interview hundreds of people concurrently [00:03:03].
- Provide automatic transcription for data extraction and aggregation [00:03:12].
Initial Approach and Its Limitations
The initial development leveraged the OpenAI Realtime API with a single monolithic prompt that outlined the interview’s purpose, details, and all of the questions [00:05:01].
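As a rough illustration of this first version (not the talk’s actual prompt), the setup amounted to concatenating every question into one instruction string and handing it to the voice session; the question text and session wiring below are assumptions made for the sketch.

```python
# Hypothetical sketch of the monolithic approach: one large prompt containing
# the interview's purpose and every question, supplied once to the voice session.
QUESTIONS = [
    "What are your main responsibilities?",
    "Which tools and systems do you use day to day?",
    "Where do you spend the most time each week?",
]

MONOLITHIC_PROMPT = (
    "You are an interviewer conducting a consulting-style interview to learn "
    "how this employee does their job. Work through the questions below, "
    "asking follow-up questions where useful:\n"
    + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(QUESTIONS))
)

# With a realtime voice session, this entire string would be passed as the
# session-level instructions (exact wiring depends on the realtime client).
session_config = {"instructions": MONOLITHIC_PROMPT}
```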
However, this monolithic approach quickly revealed a limitation: the inability to display a dynamic “roadmap” of questions [00:05:23]. It was impossible to know which question the LLM was currently addressing or to steer it to a different question if a user wanted to jump around [00:05:42].
Key Techniques and Solutions
To address the challenges and enhance the AI’s conversational ability and control, several techniques were implemented:
1. Structured Questioning with Tool Use
Instead of a single large prompt, the system was redesigned (see the sketch after this list) to:
- Feed one question at a time to the LLM [00:05:53].
- Introduce tool use, allowing the LLM to signal when it wanted to move to the next question [00:06:00]. This gave programmatic control over the conversation flow [00:06:05].
- Inject additional prompts when users clicked around the roadmap, informing the LLM about the navigation change [00:06:15]. This enabled the agent to acknowledge the user’s action, e.g., “Sure, let’s move on to XYZ, we can come back to that later” [00:06:41].
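A minimal sketch of this setup, assuming a function-style tool schema; the names (`move_to_next_question`, `build_turn_prompt`, `navigation_notice`) are illustrative rather than taken from the talk.

```python
# Hypothetical tool the voice model can call to signal it is done with the
# current question, giving the orchestrator programmatic control of the flow.
NEXT_QUESTION_TOOL = {
    "type": "function",
    "function": {
        "name": "move_to_next_question",
        "description": "Call this when the current question has been "
                       "sufficiently answered and the interview should advance.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def build_turn_prompt(question: str) -> str:
    # One question at a time, instead of the whole list up front.
    return (
        "You are conducting a consulting-style interview. Current question: "
        f"{question!r}. Ask it conversationally, follow up where useful, and "
        "call move_to_next_question once it has been answered."
    )

def navigation_notice(target_question: str) -> dict:
    # Injected when the user clicks around the roadmap, so the agent can
    # acknowledge the jump ("Sure, let's move on to XYZ...").
    return {
        "role": "system",
        "content": f"The user navigated to the question {target_question!r}. "
                   "Acknowledge the change and continue from there.",
    }
```

Because the application, not the model, feeds questions in and observes the tool calls, it always knows where the interview stands on the roadmap.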
2. Drift Detector Agent for Contextual Steering
A challenge arose where the LLM, despite tool use, tended to “chitchat” and delve into “rabbit holes,” asking too many follow-up questions and being reluctant to move on [00:07:01]. Forcing it too hard would eliminate its ability to improvise [00:07:25].
The solution was to introduce a background “Drift Detector Agent” [00:07:33] (sketched after this list):
- It runs a separate side thread, listening to the conversation’s transcript [00:07:35].
- It uses a separate, non-voice text-based LLM call [00:07:41].
- Its task is to decide if the conversation is “off track” or if the current question has been sufficiently answered, prompting a move to the next question [00:07:44].
- When the Drift Detector indicates a strong need to move on, it can force the tool use (e.g., “move to next question”), preventing further deviation [00:08:04].
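A sketch of what such a detector might look like, assuming an inexpensive text model returning a JSON verdict; the model name, prompt, and `check_drift` helper are illustrative, not the talk’s implementation.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DRIFT_PROMPT = (
    "You monitor a consulting-style interview. Given the current question and "
    "the recent transcript, decide whether the conversation has drifted off "
    "track or the question is already sufficiently answered. Respond with "
    'JSON: {"should_move_on": true or false, "reason": "..."}'
)

def check_drift(current_question: str, transcript_tail: str) -> bool:
    """Run on a side thread over the transcript; returns True if the
    orchestrator should push the interview to the next question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of a cheap text model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": DRIFT_PROMPT},
            {"role": "user", "content": f"Current question: {current_question}\n"
                                        f"Recent transcript:\n{transcript_tail}"},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return bool(verdict.get("should_move_on", False))

# When check_drift returns True, the orchestrator can force the
# move_to_next_question tool call rather than waiting for the voice model.
```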
3. Goals, Priorities, and Next Question Agent
Achieving truly human-like interviews proved difficult; the AI would either follow up too little or too much, or rephrase questions in unhelpful ways [00:08:22]. The linear flow (only “dig in” or “move next”) was restrictive [00:08:45].
To enhance natural flow and guidance (see the sketch after this list):
- Goals and priorities were introduced as first-class concepts for each question [00:09:04]. Instead of just the question itself, the LLM is given the why behind the question [00:09:16]. For example, for “What are your main responsibilities?”, the goal might be “Get a clear picture of this person’s regular activities” (high priority) and “Start to suss out where AI might be useful” (medium priority) [00:10:07]. This guides rephrasing and follow-up questions [00:10:25].
- Another side agent, the “Next Question Agent,” was added [00:09:27]. This agent, running on the transcript in the background, is “taught” how to be a good interviewer and decides what should be asked next, guiding the conversation flow [00:10:29].
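A sketch of how goals and priorities might be modeled, plus a prompt for the Next Question Agent; the `Goal`/`Question` structures and `next_question_prompt` helper are hypothetical, though the example goals come from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    priority: str  # "high" | "medium" | "low"

@dataclass
class Question:
    text: str
    goals: list[Goal] = field(default_factory=list)

# Example from the talk: the question plus the "why" behind it.
responsibilities = Question(
    text="What are your main responsibilities?",
    goals=[
        Goal("Get a clear picture of this person's regular activities", "high"),
        Goal("Start to suss out where AI might be useful", "medium"),
    ],
)

def next_question_prompt(transcript: str, remaining: list[Question]) -> str:
    """Prompt for a background Next Question Agent that reads the transcript
    and decides what the interviewer should ask next."""
    outline = "\n".join(
        f"- {q.text} (goals: "
        + "; ".join(f"{g.description} [{g.priority}]" for g in q.goals)
        + ")"
        for q in remaining
    )
    return (
        "You are an expert interviewer guiding a consulting-style interview.\n"
        f"Transcript so far:\n{transcript}\n\n"
        f"Remaining questions and their goals:\n{outline}\n\n"
        "Decide whether to keep digging into the current topic or which "
        "question to ask next, and explain briefly."
    )
```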
4. Transcript Hiding Agent for User Experience
The transcription of user input, handled by OpenAI’s Whisper model (a side model alongside the core LLM), presented its own challenges [00:11:26]. Unlike the core model, which operates natively on audio, Whisper converts everything it hears into text [00:11:47]. This led to surprising, sometimes embarrassing transcriptions of silence or background noise [00:11:51].
To improve the user experience (UX) without affecting the core model’s understanding (sketched after this list):
- An entirely separate agent was added [00:12:37].
- This agent reviews the entire conversation context and decides whether to hide a piece of the transcript from the user if a transcription error is suspected [00:12:40].
- The raw transcript is still captured for backend use, but users are spared the visual display of errors [00:12:54].
- Crucially, the core model still receives and understands the actual audio, so it can respond appropriately (e.g., “I didn’t really get that, can you rephrase?”) even if the transcript is hidden from the user [00:13:06].
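A sketch of the transcript-hiding check, assuming a text-model call that returns a JSON `hide` flag; the prompt and `should_hide` helper are illustrative.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

HIDE_PROMPT = (
    "You review a live interview transcript. The transcription model sometimes "
    "produces spurious text for silence or background noise. Given the "
    "conversation so far and the latest transcribed fragment, decide whether "
    "the fragment looks like a transcription error. Respond with JSON: "
    '{"hide": true or false}'
)

def should_hide(conversation: str, fragment: str) -> bool:
    """UX-side check only: suspect fragments are hidden from the user, while the
    raw transcript is still stored and the core model still hears the audio."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": HIDE_PROMPT},
            {"role": "user", "content": f"Conversation so far:\n{conversation}\n\n"
                                        f"Latest fragment: {fragment!r}"},
        ],
    )
    return bool(json.loads(response.choices[0].message.content).get("hide", False))
```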
Overarching Development Challenges
The iterative process of adding multiple “Band-Aid” agents led to significant complexity:
- Numerous prompts to update [00:13:49].
- Difficulty in identifying which agent to update when an issue arose [00:13:54].
- Risk of introducing regressions (fixing one issue but worsening another) [00:13:58].
This highlights the challenges in AI-driven architecture design and the need for systematic evaluation.
Improving AI Evaluation Methods
To mitigate these challenges and guide development effectively, improving AI evaluation methods became critical [00:14:19]:
Automated Test Suite with LLM as Judge
- A set of metrics was defined to measure desired attributes like clarity, completeness, and professionalism [00:14:29].
- An automated test suite was developed in which another LLM acts as a “judge” to evaluate these metrics over a conversation [00:14:36]. Each metric is backed by a finely tuned prompt for evaluation [00:14:52] (see the sketch after this list).
- While not perfectly objective due to the lack of ground truth in conversational AI, this method provides a more metrics-driven approach to iteration, moving beyond purely “vibes-driven” development [00:15:01].
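A minimal LLM-as-judge sketch, assuming per-metric rubric prompts and 1-to-5 scores; the metric wording, model choice, and `judge_conversation` helper are assumptions, not the talk’s actual suite.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubrics for the metrics named in the talk; in practice each
# metric sits behind its own carefully tuned prompt.
METRICS = {
    "clarity": "Were the agent's questions clear and easy to understand?",
    "completeness": "Did the interview cover each question and its goals?",
    "professionalism": "Did the agent stay professional and on task?",
}

def judge_conversation(transcript: str) -> dict[str, int]:
    """Score a finished conversation 1-5 on each metric using an LLM judge."""
    scores: dict[str, int] = {}
    for name, rubric in METRICS.items():
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative judge model
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": (
                    f"You evaluate interview transcripts. Metric: {name}. "
                    f"Rubric: {rubric} Respond with JSON: "
                    '{"score": 1, "justification": "..."} where score is 1-5.'
                )},
                {"role": "user", "content": transcript},
            ],
        )
        scores[name] = int(json.loads(response.choices[0].message.content)["score"])
    return scores
```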
Synthetic Conversations for Comprehensive Testing
Since objective ground truth is often absent, a key technique was the use of synthetic conversations for testing [00:16:13] (sketched after the list below).
- LLMs are used to simulate “fake users” or interviewees [00:16:17].
- A “Persona” is created as a prompt to the LLM, defining the interviewee’s characteristics (e.g., “snarky teenager” [00:16:43]). In practice, a roster of diverse personas based on expected interviewees (different personalities, job functions) is used [00:16:52].
- The AI agent then interviews these synthetic personas [00:17:03].
- The same evaluation suite is run over these synthetic conversations to obtain average metrics across a broad population of user types [00:17:11]. This automates testing and helps detect edge cases before deployment [00:16:30].
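A sketch of the synthetic-interviewee side, assuming personas are plain system prompts role-played by a text model; the roster and `persona_reply` helper are illustrative (only the “snarky teenager” persona comes from the talk).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative persona roster; in practice it mirrors the expected interviewee
# population across personalities and job functions.
PERSONAS = [
    "A snarky teenager doing a summer internship on the finance team.",
    "A detail-oriented operations manager with 15 years of tenure.",
    "A distracted field technician answering between service calls.",
]

def persona_reply(persona: str, history: list[dict]) -> str:
    """Synthetic interviewee: an LLM answers in character. `history` is framed
    from the persona's perspective (agent turns as 'user', persona turns as
    'assistant')."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system",
             "content": f"You are being interviewed about your job. Persona: "
                        f"{persona} Stay in character and answer naturally."},
            *history,
        ],
    )
    return response.choices[0].message.content

# Each persona is interviewed end to end by the agent, the resulting transcript
# is scored (e.g., with judge_conversation from the previous sketch), and the
# scores are averaged across the roster to surface regressions and edge cases
# before deployment.
```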
Conclusion
Building robust AI voice agents, especially for complex conversational tasks, goes beyond simple prompt engineering against foundational APIs [00:17:31]. Key takeaways include:
- Orchestration with Out-of-Band Checks: Employing separate agents operating in the text domain to make decisions and guide the core AI back on track [00:17:53].
- Strategic Tool Use: Leveraging tools within the LLM’s capabilities to constrain behavior and provide instrumented insights into its actions [00:18:10].
- Critical Role of Evals: Implementing systematic evaluations, even without perfect objective ground truth, is essential for measuring success and guiding development [00:18:28]. This includes using LLMs as judges and generating synthetic conversations for broad testing [00:18:37].
By combining these techniques, developers can move beyond “vibes-driven” development to create more robust and effective AI-driven conversational systems.