From: aidotengineer

Nick, an engineer at Superdial, highlights the exciting yet challenging landscape of voice AI in 2025. While new smart, fast, and affordable Large Language Models (LLMs) support complex conversational use cases, and low-latency, realistic text-to-speech models are available, there are still significant hurdles to overcome to build reliable voice AI agents [00:00:51]. Key challenges include dealing with audio hallucinations from generative text-to-speech models, managing pronunciation and spelling, and the current immaturity of speech-to-speech models for production applications due to their unreliable output [00:01:09].

Superdial’s Approach: Agents as a Service

Superdial operates on an “Agents as a Service” model, focusing on the “Last Mile Problem” of ensuring reliability once an MVP (Minimum Viable Product) voice agent is built [00:02:04]. The company specializes in automating phone calls for mid to large-sized healthcare administration businesses, particularly the often-annoying calls to insurance companies [00:02:22].

Business Model and Agentic Contract

Superdial’s platform allows customers to design conversation scripts and send calls via CSV, API, or EHR software integrations [00:02:39]. The core of their “agentic contract” with customers is payment for results: customers specify who to call and what questions to ask, and Superdial provides the answers in a structured format [00:03:05].

The Internal Agentic Loop

Internally, Superdial employs an agentic loop to ensure call completion and learning [00:03:13]:

  1. Call Attempts: The system waits for offices and call centers to open before attempting calls with the voice bot [00:03:17].
  2. Human Fallback: If the voice bot cannot complete the call after a certain number of attempts, the call is seamlessly handed off to a human fallback team [00:03:30]. This ensures the call gets made regardless of whether a human or bot completes it, providing reliable answers to customers [00:03:40].
  3. Continuous Learning: The system learns from every call by updating office hours for specific phone numbers and refining phone tree traversal strategies to improve future call efficiency [00:04:01]. Random call audits are also conducted to ensure system integrity, especially given the sensitive nature of healthcare phone calls [00:04:15].
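
To make the loop concrete, here is a minimal sketch, assuming hypothetical helpers (office_is_open, attempt_bot_call, handoff_to_human, update_call_metadata) that stand in for Superdial’s internal services; it illustrates the pattern, not their implementation:

```python
# Illustrative sketch of the call-completion loop described above.
# Helper functions and field names are hypothetical stand-ins, not Superdial's API.
from dataclasses import dataclass, field

MAX_BOT_ATTEMPTS = 3

@dataclass
class CallResult:
    completed: bool
    answers: dict = field(default_factory=dict)

def office_is_open(phone_number: str) -> bool: ...        # check learned office hours
def attempt_bot_call(request) -> CallResult: ...           # place the call with the voice bot
def handoff_to_human(request) -> CallResult: ...           # route to the human fallback team
def update_call_metadata(request, result) -> None: ...     # learn office hours / phone-tree paths

def run_call(request) -> dict:
    for _ in range(MAX_BOT_ATTEMPTS):
        if not office_is_open(request["phone_number"]):
            continue                                        # reschedule until the office opens
        result = attempt_bot_call(request)                  # 1. bot attempt
        update_call_metadata(request, result)               # 3. learn from every call
        if result.completed:
            return result.answers                           # structured answers for the customer
    result = handoff_to_human(request)                      # 2. human fallback: the call still gets made
    update_call_metadata(request, result)
    return result.answers
```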

Superdial’s system has saved over 100,000 hours of human phone-calling time, with projections to save millions more [00:06:15]. This was achieved with a lean team of four engineers handling the full stack, EHR integrations, and bot development [00:06:23].

The Role of the Voice AI Engineer

The success of Superdial is attributed to embracing the role of the voice AI engineer [00:06:41].

Unique Constraints

Voice AI engineers must deal with:

  • Multimodal data: Including MP3s, audio bytes, and transcripts [00:06:55].
  • Real-time latency: This becomes a critical factor for application performance [00:07:07].
  • Product constraint: The application’s core function is a voice conversation, which sets high expectations for natural interaction [00:07:18].

Guiding Principles

Superdial’s internal sayings for grappling with these challenges in building reliable AI agents are:

  • “Say the right thing at the right time” [00:07:44].
  • “Build this plane while we fly it” [00:07:46].

What makes a voice bot most distinctive is its conversational content and design, along with the vertical integrations that make the agent’s work valuable, rather than just its voice or interruption handling [00:08:22].

Ethical Considerations

Given the lack of AI regulation in the US, the onus is on AI engineers and leaders to consider potential biases in voice AI apps against certain accents or dialects, as well as the “spooky” effect of realistic voices saying strange things [00:09:07]. AI development should be accessible and collaborative so that the work AI does serves everyone; this requires choosing tooling and infrastructure that lets a diverse set of stakeholders be involved from the start [00:09:42].

Addressing the Last Mile Problems in Voice AI

While it’s easy to build an MVP for voice AI applications, scaling them surfaces immediate challenges, many of which are not new: voice UI is a field with 20 years of conversation design experience behind it [00:10:18].

Conversation Design Paradigms

A significant shift in voice UI development is from prescriptive to descriptive development [00:11:00]. Instead of mapping every possible conversational direction, developers describe what they want the bot to do and hope it happens (a sketch contrasting the two approaches follows the list below) [00:11:08].

  • Open-ended vs. Constrained Questions: For existing conversations like healthcare calls, Superdial found it’s often better to go general and hope the call center representative provides ample information, then adapt to whatever they say, rather than trying to prevent “wrong” responses [00:11:30].
  • Conversation Designers: Hiring conversation designers is recommended as they are experts in this field [00:11:56].
  • Table Reads: A practical tip for voice AI engineers is to perform “table reads,” where one person pretends to be the bot and another a user, reading out a script. This immediately reveals gaps and awkwardness in the conversation [00:12:11].
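
To make the prescriptive-versus-descriptive shift concrete, here is a hypothetical contrast: a prescriptive design enumerates every branch up front, while a descriptive design states the goal and lets the model adapt to whatever the representative says. Neither snippet reflects Superdial’s actual prompts or flow definitions.

```python
# Hypothetical illustration of prescriptive vs. descriptive conversation design.

# Prescriptive: every branch is mapped ahead of time.
PRESCRIPTIVE_FLOW = {
    "ask_eligibility": {
        "prompt": "Is the member eligible for this service?",
        "on": {"yes": "ask_copay", "no": "end_call", "unclear": "repeat_question"},
    },
    "ask_copay": {
        "prompt": "What is the copay amount?",
        "on": {"amount_given": "confirm_and_end", "unclear": "repeat_question"},
    },
}

# Descriptive: state the goal and constraints, then adapt to whatever the rep says.
DESCRIPTIVE_PROMPT = """
You are calling an insurance company on behalf of a healthcare provider.
Goals: confirm member eligibility and the copay amount for the requested service.
Ask open-ended questions, adapt to whatever the representative says,
and capture any extra information they volunteer.
"""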

Orchestration Frameworks

Superdial found its stride by using PipeCat for voice AI orchestration [00:12:39]. PipeCat is an open-source framework, easy to extend, and allows for self-hosting and scaling, which is crucial for long phone calls (up to 1.5 hours) [00:12:41].
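
For orientation, here is a conceptual sketch of the speech-to-text → LLM → text-to-speech loop that a framework like PipeCat orchestrates; the classes and helper functions below are illustrative stand-ins, not PipeCat’s actual API.

```python
# Conceptual sketch of a voice pipeline; components are stand-ins,
# not PipeCat's actual classes or module paths.
import asyncio

def transcribe(audio) -> str: ...        # placeholder STT call
def generate_reply(text: str) -> str: ...  # placeholder LLM call
def synthesize(text: str) -> bytes: ...    # placeholder TTS call

class Processor:
    async def process(self, frame): ...

class SpeechToText(Processor):
    async def process(self, frame):
        return {"type": "transcript", "text": transcribe(frame)}

class LLMResponder(Processor):
    async def process(self, frame):
        return {"type": "reply", "text": generate_reply(frame["text"])}

class TextToSpeech(Processor):
    async def process(self, frame):
        return {"type": "audio", "bytes": synthesize(frame["text"])}

async def run_pipeline(audio_frames, processors):
    # Each inbound audio frame flows through STT -> LLM -> TTS in order.
    async for frame in audio_frames:
        for p in processors:
            frame = await p.process(frame)
        yield frame  # synthesized audio to play back on the call
```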

LLM Integration and Tooling

For LLM work, Superdial made specific tooling choices:

  • Owning the OpenAI endpoint: Fronting LLM calls with their own OpenAI-style endpoint provides a better interface with new voice AI tools and allows latency-sensitive responses to be routed to different models (see the sketch after this list) [00:13:27].
  • TensorZero: All generative responses are routed through TensorZero, an open-source tool that provides structured and typed LLM endpoints for production experimentation [00:13:44].
  • Logging and Observability: Superdial self-hosts Langfuse for logging and observability [00:14:11]. Self-hosting is also preferred for HIPAA compliance, especially given how rapidly the space is growing [00:14:16]. This setup facilitates anomaly detection, evaluations, and dataset management [00:14:26].
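
As a rough illustration of owning the OpenAI endpoint, the sketch below assumes a small FastAPI proxy that exposes the standard /v1/chat/completions path and routes latency-sensitive requests to a faster model; the header, model names, and routing rule are hypothetical, and this is not Superdial’s or TensorZero’s implementation.

```python
# Hypothetical OpenAI-compatible proxy; model names and routing logic are illustrative.
# (Streaming responses are omitted for brevity.)
from fastapi import FastAPI, Request
import httpx

app = FastAPI()
FAST_MODEL = "gpt-4o-mini"     # example low-latency model
SMART_MODEL = "gpt-4o"         # example higher-quality model
UPSTREAM = "https://api.openai.com/v1/chat/completions"

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Route latency-sensitive turns to a faster model, based on a hypothetical header.
    latency_sensitive = request.headers.get("x-latency-sensitive") == "true"
    body["model"] = FAST_MODEL if latency_sensitive else SMART_MODEL
    async with httpx.AsyncClient(timeout=30) as client:
        upstream = await client.post(
            UPSTREAM,
            json=body,
            headers={"Authorization": request.headers.get("authorization", "")},
        )
    return upstream.json()
```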

Text-to-Speech Customization

A significant challenge is ensuring the Text-to-Speech (TTS) system accurately pronounces specific information, like names or long character strings (e.g., member IDs) [00:14:36].

  • LLM Output vs. TTS Input: What the LLM outputs is not always what should be fed directly into the TTS engine, and neither may match the final recording [00:14:52].
  • Pronunciation and Spelling: Tools like Rime allow pronunciations to be spelled out explicitly (e.g., using SSML syntax) and pauses and breaks to be managed for long words [00:15:22]. Audio recordings are reviewed in addition to transcripts to ensure correct output [00:15:46].
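
As one way to bridge the gap between LLM output and TTS input, the sketch below rewrites long alphanumeric strings (such as member IDs) into character-by-character SSML with short pauses. The <say-as interpret-as="characters"> and <break> tags are standard SSML elements, but provider support varies (check your TTS vendor, e.g., Rime, for the exact syntax it accepts), and the regex heuristic is an assumption.

```python
# Sketch: turn an LLM-produced sentence containing a member ID into TTS-friendly SSML.
# Verify the exact SSML your TTS provider supports before relying on this.
import re

def spell_out_ids(text: str) -> str:
    """Wrap long alphanumeric strings (e.g., member IDs) so they are read character
    by character, with short pauses every three characters."""
    def replace(match: re.Match) -> str:
        chars = match.group(0)
        chunks = [chars[i:i + 3] for i in range(0, len(chars), 3)]
        return '<break time="200ms"/>'.join(
            f'<say-as interpret-as="characters">{c}</say-as>' for c in chunks
        )
    # Treat runs of 6+ uppercase letters/digits as IDs; tune this heuristic for your data.
    return re.sub(r"\b[A-Z0-9]{6,}\b", replace, text)

print(spell_out_ids("The member ID is ABC123456."))
```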

Operational Best Practices (Mini Last Mile Problems)

Several practical “mini Last Mile” problems arise when deploying voice AI agents [00:15:54]:

  • Persona: Choose a bot persona carefully. A previous bot name, “Billy,” caused confusion over the phone due to similar-sounding pronunciations [00:16:09].
  • Leveraging Existing Tools: Don’t build from scratch initially. The bot’s uniqueness comes from conversation, and tools like PipeCat provide a quick jump start [00:16:30].
  • Latency Monitoring: Track latency everywhere; “Time to First Byte” is now the most important metric for real-time voice applications (see the measurement sketch after this list) [00:16:40].
  • Upgrade Paths: Ensure clear upgrade paths for critical components. For speech-to-text, Superdial uses Deepgram and can fine-tune models for improved accuracy [00:16:51].
  • Fallbacks: Have fallbacks ready for every part of the stack (e.g., for LLM outages). Tools like TensorZero can help set this up [00:17:09].
  • End-to-End Testing: This is particularly unique for voice AI. Telephony is often used as a boundary layer for testing [00:17:26]. Methods include:
    • Fake Phone Numbers: Creating a fake number that plays an MP3 to test basic bot interaction [00:17:42].
    • Simulated Voice Trees: Creating a simulated phone tree for the bot to navigate [00:17:52].
    • Bot-to-Bot Testing: Using generative voice-agent testing services such as Coval to have the bot converse with another bot [00:18:01].
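
One simple way to track Time to First Byte is to time a streaming response until the first content token arrives. The sketch below uses the OpenAI Python client as an example; the model name is a placeholder, and the same pattern applies to streaming TTS or speech-to-text calls.

```python
# Measure time-to-first-byte on a streaming chat completion.
# Model name is a placeholder; adapt the pattern to whichever streaming service you use.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
ttfb = None
for chunk in stream:
    if ttfb is None and chunk.choices and chunk.choices[0].delta.content:
        ttfb = time.perf_counter() - start  # time until the first content token arrives
print(f"time to first byte: {ttfb:.3f}s" if ttfb else "no content received")
```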

Key Takeaways

For voice AI engineers, key takeaways for building and improving AI agents are:

  • Choose your stack wisely: Better decisions here allow focus on unique conversational experiences [00:18:12].
  • Laser focus on the Last Mile: This is where significant value can be added and agents can be put to work [00:18:23].
  • Ride the wave: Stay current with new models and advancements to integrate them quickly and safely [00:18:30].