From: aidotengineer
Eddie Seagull, CTO at Fractional AI, discusses the development of AI voice agents, focusing on the challenges encountered and solutions implemented [00:00:06]. Developing AI voice agents presents significant difficulties beyond typical Large Language Model (LLM) issues [00:00:17].
General Challenges with AI Voice Agents
Standard LLM challenges include:
- Difficulty with reliability [00:00:18]
- Hallucinations [00:00:21]
- Difficulty in evaluation [00:00:21]
- Lack of objective metrics for conversational systems [00:00:28]
- Latency, especially for fluid conversational user interfaces [00:00:37]
When dealing with audio and voice agents, these issues become “hard mode” [00:00:53]. Additional complexities include:
- Transcription challenges [00:01:00]
- Operating in a streaming environment instead of a batched back-and-forth interaction [00:01:03]
These factors make development very difficult [00:01:10].
Case Study: Automating Consulting-Style Interviews
A recent application built by Fractional AI aims to automate “consulting-style” interviews [00:01:26]. This involves interviewing employees within large companies to gather information, distinct from job interviews [00:01:31].
The Problem with Human Interviews
- Expense: Sending consultants to interview numerous employees is very costly [00:01:56].
- Inefficiency: Requires significant time to schedule and conduct interviews, leading to high overhead [00:02:05].
- Human Touch: While forms might suffice in some cases, the human element is often crucial for:
  - Improvisation and navigation [00:02:28]
  - Building trust [00:02:31]
  - Asking effective follow-up questions [00:02:32]
- Answer Quality: People answer differently when typing versus speaking freely, often providing more information when allowed to ramble [00:02:37].
Goal of the AI Interview Agent
The objective was to build an AI interview agent that could:
- Conduct interviews like a human [00:02:53].
- Feel more like a conversation than a form [00:02:59].
- Interview hundreds of people simultaneously, avoiding scheduling issues and high costs [00:03:03].
- Automatically transcribe conversations for data extraction and aggregation [00:03:12].
A demo showed the agent conducting a basic interview, illustrating the conversational flow and the agent’s ability to understand and respond to user input [00:03:24].
Initial Development Approach and Challenges
The initial approach involved a “naive” integration with the OpenAI real-time API [00:04:51].
- A large, monolithic prompt was used to explain the interview goal, provide the questions, and guide navigation [00:05:03], roughly along the lines of the sketch below.
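For concreteness, the naive setup amounted to handing the session one big instruction block. The prompt below is an illustrative reconstruction, not the actual prompt used in the project, and the questions are invented:

```python
# Illustrative reconstruction of the single "do everything" prompt;
# the real prompt and question list are not public.
MONOLITHIC_PROMPT = """
You are conducting a consulting-style interview with an employee.
Work through the following questions in order. Keep the tone
conversational, improvise follow-ups where useful, and move on once a
question has been answered.

1. Walk me through your regular weekly activities.
2. Which of those activities take most of your time?
3. Where do you rely on other teams, tools, or manual processes?

When all questions are covered, thank the interviewee and end the call.
"""
```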
A key early requirement was to display a roadmap of questions, allowing users to see current progress and jump between questions [00:05:24]. The monolithic prompt handled this poorly because:
- It was difficult to know which question the LLM was currently asking [00:05:41].
- Coaxing the LLM to move between questions if the user clicked around was challenging [00:05:47].
Iterative Improvements and Agent-Based Solutions
One Question at a Time & Tool Use
To address the roadmap issue, the system was redesigned to:
- Send only one question at a time to the LLM [00:05:53].
- Introduce “tool use” by giving the LLM a tool to signal when it wanted to move to the next question [00:06:00].
- Deterministically feed the next question once the tool was called [00:06:08].
- Inject additional prompts when users skipped or revisited questions via the roadmap, informing the LLM of the user’s action [00:06:14]. This allowed the agent to respond appropriately, e.g., “Sure, let’s move on to XYZ” [00:06:45]. A sketch of this loop appears below.
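A minimal sketch of this design, assuming a hypothetical `move_to_next_question` tool and application-side state (the tool schema is written in the standard OpenAI function-calling format; all names and the state-handling code are invented for illustration, not taken from the talk):

```python
# Hypothetical tool definition: the model signals readiness to advance,
# but never picks the next question itself.
NEXT_QUESTION_TOOL = {
    "type": "function",
    "function": {
        "name": "move_to_next_question",
        "description": "Call when the current question has been answered "
                       "and the interview should advance.",
        "parameters": {"type": "object", "properties": {}},
    },
}

class InterviewState:
    """Application-side state; only the current question is sent to the LLM."""

    def __init__(self, questions: list[str]):
        self.questions = questions
        self.index = 0

    def current_instruction(self) -> str:
        # Only the active question is exposed to the model.
        return f"Ask the interviewee: {self.questions[self.index]}"

    def handle_tool_call(self, tool_name: str) -> None:
        # Deterministic advancement: the application decides what comes next.
        if tool_name == "move_to_next_question":
            self.index = min(self.index + 1, len(self.questions) - 1)

    def handle_roadmap_jump(self, target_index: int) -> str:
        # The user clicked a different question on the roadmap; update state
        # and return a prompt to inject so the agent can acknowledge the jump.
        self.index = target_index
        return (f"The user jumped to question {target_index + 1}. Acknowledge "
                f"this briefly, then ask: {self.questions[target_index]}")
```

Because the model can only advance by calling the tool, the application always knows which question is active, which is what makes the roadmap UI workable.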
The “Rabbit Hole” Problem and Drift Detector Agent
A new problem emerged: LLMs tended to go down “rabbit holes,” chitchatting, asking too many follow-up questions, and being overly encouraging [00:07:01]. This led to a reluctance to move on to the next question [00:07:22]. Forcing it too hard would eliminate the desired improvisation [00:07:25].
To counter this, a Drift Detector Agent was introduced [00:07:33]:
- A separate background agent listens to the conversation [00:07:35].
- It runs a side thread with a non-voice, text-based LLM call [00:07:39].
- Its task is to assess from the transcript if the conversation is on or off track [00:07:44].
- It determines if the current question has been answered and if it’s time to move on [00:07:58].
- If the agent strongly indicates it’s time to move, it can force the tool use, preventing further rabbit-holing [00:08:04]. A sketch of this check follows.
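A minimal sketch of such a detector, assuming the check is a plain text-model call over the running transcript; the helper name, prompt wording, JSON shape, and model choice are illustrative, not from the talk:

```python
import json
from openai import OpenAI

client = OpenAI()

DRIFT_PROMPT = (
    "You monitor an automated interview. Given the question currently being "
    "asked and the transcript so far, decide whether the conversation has "
    "drifted off track and whether the current question has been answered "
    "well enough to move on. Respond as JSON: "
    '{"off_track": true/false, "should_move_on": true/false}.'
)

def check_drift(transcript: str, current_question: str) -> dict:
    """Side-channel, text-only drift check run in the background."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable text model; the choice is illustrative
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": DRIFT_PROMPT},
            {"role": "user",
             "content": f"Current question: {current_question}\n\nTranscript:\n{transcript}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

If `should_move_on` comes back true, the application can trigger the `move_to_next_question` tool itself rather than waiting for the voice model to volunteer it.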
Tuning Human-like Interviews
Achieving human-like interviews remains challenging [00:08:22]. Agents tend to:
- Follow up too little or too much [00:08:26].
- Rephrase questions in unhelpful ways [00:08:30].
Two structural limitations contribute to this:
- The “one question at a time” approach limits the LLM’s view of the overall interview flow [00:08:38].
- The linear flow (the only option is “next question”) restricts natural conversation [00:08:44].
To correct for this:
- Goals and Priorities: Added as a first-class concept to the interview plan [00:09:04]. The LLM is told the “why” behind each question, informing its rephrasing and follow-up questions [00:09:16]. For example, a high-priority goal might be “get a clear picture of this person’s regular activities,” and a medium-priority goal “start to suss out where AI might be useful” [00:10:07].
- Next Question Agent: Another side agent was introduced to determine which question should be asked next, running on the transcript in the background [00:09:25]. This bot is “taught” to be a good interviewer and can guide the conversation flow [00:10:37]. The sketch below shows how goals and priorities might be attached to the plan.
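A sketch of what goals and priorities could look like as first-class data on the interview plan. The field names and the sample question are invented; only the two example goals come from the talk:

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    priority: str  # e.g. "high" | "medium" | "low"

@dataclass
class PlanQuestion:
    text: str
    goals: list[Goal] = field(default_factory=list)

PLAN = [
    PlanQuestion(
        text="Walk me through your regular weekly activities.",
        goals=[
            Goal("Get a clear picture of this person's regular activities.", "high"),
            Goal("Start to suss out where AI might be useful.", "medium"),
        ],
    ),
    # ...remaining questions, each with its own goals
]
```

The Next Question Agent can then run as another text-only side call in the same style as the drift detector: it sees the transcript plus this plan and returns the question the interviewer should ask next, rather than the flow always being strictly linear.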
Transcription Challenges
The system uses Whisper for transcription, provided by OpenAI’s real-time API [00:11:27]. While the core model understands sound (e.g., claps, coughs), Whisper converts everything to text [00:11:39]. This leads to issues:
- Silence can be misinterpreted as non-English speech [00:11:55].
- Background noise can result in garbled or nonsensical transcriptions [00:12:07].
- The OpenAI API does not offer granular control or “knobs” to tune Whisper’s behavior [00:12:30].
To manage this user experience (UX) issue:
- Transcript Hiding Agent: An additional agent was added to the UX layer [00:12:38]. This agent takes the full conversation context and decides whether a piece of the transcript should be hidden from the user due to a suspected transcription error [00:12:40].
- The incorrect transcript is still captured for internal use, but it’s hidden from the user, preventing embarrassing displays [00:12:54].
- The core model still understands what’s happening (e.g., “I didn’t really get that, could you rephrase?”), and hiding the transcript improves the user experience by reducing confusion [00:13:06]. A sketch of this check appears below.
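A minimal sketch of such a check, assuming a text model screens each new utterance before it is rendered in the UI; the prompt, helper name, and model choice are illustrative:

```python
from openai import OpenAI

client = OpenAI()

HIDE_PROMPT = (
    "You review live transcripts from a voice interview. Given the "
    "conversation so far and the newest transcribed utterance, answer HIDE "
    "if the utterance is almost certainly a transcription error (garbled "
    "text, phantom speech during silence, the wrong language), or SHOW if "
    "it should be displayed to the user."
)

def should_hide(conversation: str, new_utterance: str) -> bool:
    """UX-layer check only; the raw transcript is still stored internally."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": HIDE_PROMPT},
            {"role": "user",
             "content": f"Conversation:\n{conversation}\n\nNew utterance:\n{new_utterance}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("HIDE")
```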
Development Challenges and Evaluation
The development process became complex, involving many agents added as “Band-Aids” to fix issues discovered during testing [00:13:27]. This “vibes-driven” approach, while acceptable initially, led to:
- Increased complexity and numerous prompts to update [00:13:48].
- Uncertainty about which agent to update when an issue arose [00:13:54].
- Introduction of regressions (fixing one issue, breaking another) [00:13:58].
This is a common challenge in LLM-based development, but it’s particularly acute in the voice domain [00:14:13].
Evaluation (Evals)
To systematically measure performance, evals are crucial [00:14:22].
- A set of metrics was developed to measure various attributes [00:14:29].
- An automated test suite runs over conversations, asking an LLM acting as a “judge” to measure attributes like clarity, completeness, and professionalism [00:14:36]. Each metric is backed by a tuned prompt [00:14:52].
- This approach, while not perfectly objective, moves development from a purely “vibes-driven” style to a “metrics-driven” one [00:15:01]. A sketch of such a judge appears below.
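A sketch of an LLM-as-judge eval, assuming one rubric per metric and a 1–5 score; the metric prompts, JSON shape, and model choice are illustrative, not the tuned prompts from the real system:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative rubrics; in the real system each metric sits behind its own
# tuned prompt.
METRICS = {
    "clarity": "Were the interviewer's questions clear and unambiguous?",
    "completeness": "Were all planned questions and goals actually covered?",
    "professionalism": "Did the interviewer stay professional and on task?",
}

def judge_conversation(transcript: str) -> dict[str, int]:
    """Score one conversation on each metric with an LLM acting as judge."""
    scores: dict[str, int] = {}
    for metric, rubric in METRICS.items():
        system_prompt = (
            "You are grading the performance of an automated interviewer. "
            + rubric
            + ' Respond as JSON: {"score": <integer 1-5>, "reason": "<short explanation>"}.'
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": transcript},
            ],
        )
        scores[metric] = json.loads(response.choices[0].message.content)["score"]
    return scores
```

Scores like these are not ground truth, but tracked over time they turn “did that change help?” into a measurable question.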
Lack of Ground Truth & Synthetic Conversations
For systems like this, there’s no perfect ground truth or historical data against which to objectively measure success [00:15:26]. To avoid shipping a system that annoys users or breaks on unanticipated edge cases [00:15:56]:
- Synthetic Conversations were introduced [00:16:13].
- LLMs are used to simulate various “users” (interviewees) and conduct fake interviews [00:16:17].
- Personas are created (e.g., “snarky teenager in charge of a Fortune 500 company”) and used as prompts for the LLM playing the interviewee [00:16:41]. A roster of different personality types and job functions is maintained [00:16:51].
- The interview agent interacts with these personas, and the same eval suite is run afterward [00:17:11]. This makes it possible to measure average metrics across a broad population of expected users [00:17:14] and reduces the need for manual testing [00:16:32]. A sketch of the simulation loop follows.
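A sketch of the simulation loop, assuming a text model plays the interviewee and the real interview agent is plugged in behind a simple callable; the persona texts, helper names, and interface are invented for illustration:

```python
from typing import Callable
from openai import OpenAI

client = OpenAI()

# Hypothetical persona roster; the real roster mixes personality types and
# job functions.
PERSONAS = [
    "A snarky teenager who is somehow in charge of a Fortune 500 company.",
    "A meticulous operations manager who answers in long, detailed paragraphs.",
    "A distracted field technician who gives short, vague answers.",
]

def persona_reply(persona: str, transcript: list[str]) -> str:
    """A text-only LLM stands in for the human interviewee."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You are being interviewed about your work. "
                        f"Stay in character: {persona}"},
            {"role": "user", "content": "\n".join(transcript) or "(interview starting)"},
        ],
    )
    return response.choices[0].message.content

def simulate_interview(persona: str,
                       next_question: Callable[[list[str]], str | None],
                       max_turns: int = 20) -> str:
    """Run one fake interview. `next_question` stands in for the interview
    agent: given the transcript so far, it returns the interviewer's next
    utterance, or None when the interview is finished."""
    transcript: list[str] = []
    for _ in range(max_turns):
        question = next_question(transcript)
        if question is None:
            break
        transcript.append(f"Interviewer: {question}")
        transcript.append(f"Interviewee: {persona_reply(persona, transcript)}")
    return "\n".join(transcript)
```

Each simulated transcript can then be fed through the same judge-style eval suite, with scores averaged across the persona roster.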
Key Takeaways for AI Agent Development
The development of AI voice agents, as demonstrated by this case study, highlights that simply calling an OpenAI API with voice capabilities is insufficient for a robust application [00:17:31].
Key lessons learned include:
- Prompt Engineering: While helpful for initial development, it’s not enough for robustness [00:17:40].
- Out-of-Band Checks with Separate Agents: Using separate agents operating in the text domain (not audio) to make decisions and guide the conversation is crucial for getting the system back on track [00:17:54].
- Tool Use: Highly powerful for constraining LLM behavior and instrumenting it to understand its actions, as the LLM must call specific tools to perform desired actions [00:18:10].
- Evals are Critical: Essential for measuring success and guiding development, even when there’s no objective source of truth [00:18:28]. They enable a more robust development process [00:18:41].