From: aidotengineer
Nick from Superdial discussed the evolving landscape of voice AI, highlighting key challenges in developing AI agents and practical solutions for building reliable voice AI applications, drawing from Superdial’s experience as an “agents as a service” platform [00:00:44].
Voice AI in 2025: Current Landscape
The field of voice AI is dynamic and exciting [00:00:51]. Key developments and ongoing challenges in building AI voice agents include:
- Large Language Models (LLMs): New, smart, fast, and affordable LLMs are emerging, supporting more complex conversational use cases [00:00:54]. However, turning a chat agent into a voice agent still requires specific techniques [00:01:04].
- Text-to-Speech (TTS) Models: While low-latency, realistic, and highly generative TTS models exist, they can suffer from audio hallucinations and issues with pronunciation and spelling [00:01:09].
- Infrastructure and Tooling: There’s an explosion in voice AI infrastructure, tooling, and evaluation systems, leading to questions about what components are truly worth owning [00:01:21].
- Speech-to-Speech/Voice-to-Voice Models: These models are not yet production-ready for many applications due to their tendency to output non-speech or unreliable conversational elements [00:01:33]. Superdial prioritizes reliability over realism for these models [00:01:58].
Superdial’s “Agents as a Service” Approach
Superdial specializes in automating “annoying phone calls,” specifically to insurance companies for healthcare administration businesses [00:02:26].
Platform Features
Superdial’s platform allows customers to:
- Build conversational scripts to ask necessary questions [00:02:39].
- Send call requests via CSV, API, or EHR software integrations [00:02:47] (a hypothetical API request sketch follows this list).
- Receive structured results within hours or a day [00:02:54].
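For illustration only, here is a minimal sketch of what submitting a call request over an API like this might look like. The endpoint, payload fields, and auth header are assumptions for the sake of the example, not Superdial's documented API.

```python
# Hypothetical sketch of submitting a call request and fetching results.
# The base URL, payload fields, and auth header are illustrative assumptions,
# not Superdial's documented API.
import requests

API_BASE = "https://api.example-superdial.invalid/v1"  # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

call_request = {
    "phone_number": "+15551234567",          # insurance company's number
    "script_id": "eligibility-check-v2",     # conversational script to run
    "questions": [
        "Is the member's plan active?",
        "What is the remaining deductible?",
    ],
    "metadata": {"member_id": "ABC123456789"},
}

resp = requests.post(f"{API_BASE}/calls", json=call_request, headers=HEADERS, timeout=30)
resp.raise_for_status()
call_id = resp.json()["call_id"]

# Results arrive asynchronously (hours later); poll or register a webhook.
result = requests.get(f"{API_BASE}/calls/{call_id}", headers=HEADERS, timeout=30).json()
print(result.get("status"), result.get("answers"))
```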
Agentic Loop and Human Fallback
Superdial employs an internal agentic loop:
- Calls are made when offices or call centers are open [00:03:17].
- A voice bot attempts the call [00:03:26].
- If the bot cannot complete the call after a certain number of attempts, it’s sent to a human fallback team [00:03:30]. This ensures calls are completed regardless of bot or human involvement, providing reliable answers in a structured format [00:03:50].
- The system learns from each call, updating office hours and optimizing phone-tree traversal for future attempts [00:04:01]. Random audits ensure system reliability for sensitive healthcare calls [00:04:15]. A minimal sketch of this retry-and-fallback loop follows.
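The sketch below illustrates the loop described above; the retry limit, function names, and data shapes are assumptions for illustration, not Superdial's implementation.

```python
# Illustrative sketch of the retry-then-human-fallback loop (not Superdial's code).
from dataclasses import dataclass

MAX_BOT_ATTEMPTS = 3  # assumed retry limit

@dataclass
class CallRequest:
    phone_number: str
    script_id: str
    attempts: int = 0

def wait_until_office_open(call: CallRequest) -> None:
    ...  # only dial when the office or call center is open

def bot_attempt(call: CallRequest) -> dict | None:
    ...  # run the voice bot; return structured answers, or None on failure

def update_learned_schedule(call: CallRequest) -> None:
    ...  # learn office hours and phone-tree paths for the next attempt

def send_to_human_team(call: CallRequest) -> dict:
    ...  # hand off to the human fallback team; still returns structured answers

def complete_call(call: CallRequest) -> dict:
    """Try the bot a few times; fall back to a human so the call always completes."""
    while call.attempts < MAX_BOT_ATTEMPTS:
        wait_until_office_open(call)
        call.attempts += 1
        answers = bot_attempt(call)
        if answers is not None:
            return answers             # bot succeeded: structured answers
        update_learned_schedule(call)  # the system learns from each failed attempt
    return send_to_human_team(call)    # human fallback guarantees completion
```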
Impact
Superdial has saved over 100,000 hours of human phone and calling time, with projections to save millions more [00:06:15]. This was achieved with a lean team of four engineers building the full stack, EHR integrations, and the bot, while rapidly onboarding customers and supporting new use cases [00:06:23].
The Role of a Voice AI Engineer
The success of Superdial is partly attributed to the team embracing the unique role of a voice AI engineer [00:06:39].
Unique Characteristics
A voice AI engineer typically deals with:
- Multimodal Data: Including MP3s, audio bites, and transcripts [00:06:55].
- Real-time Latency: Crucial for real-time applications [00:07:08].
- Asynchronous Programming: Heavy use of async Python [00:07:12].
- Product Constraints: Voice conversations carry high user expectations [00:07:18].
Core Principles
Superdial’s key principles for voice AI engineering include:
- “Say the right thing at the right time” [00:07:44].
- “Build this plane while we fly it” [00:07:46].
- Focus on conversational content, design, and vertical integrations to make agents valuable [00:08:32].
Ethical Considerations
Given the nature of generative AI, ethical concerns are significant [00:08:41]. Voice AI apps can be biased against certain accents or dialects, or can feel “spooky” when they sound highly realistic yet say strange things [00:08:57]. In the absence of strong AI regulation, the onus is on engineers and leaders to prioritize:
- Accessibility: Ensuring AI development is accessible and collaborative [00:09:42].
- Inclusivity: Designing AI that works for everyone by choosing tooling and infrastructure that allows diverse stakeholders to be involved from the start [00:09:46].
Last Mile Problems in Voice AI
Building a basic voice AI MVP is relatively easy, but making it reliable and production-ready involves significant “last mile” challenges [00:10:03].
Conversation Design
One of the biggest changes in voice UI development is the shift from prescriptive design (mapping every possible conversational direction in advance) to descriptive design (describing the desired behavior and trusting the generative model to execute it) [00:11:00].
- Challenge: Deciding between open-ended questions or constraining user choices [00:11:30].
- Solution: For existing conversations, Superdial found it better to stay general, gather as much information as the call-center representative offers, and adapt to their responses rather than trying to prevent “wrong” answers [00:11:39] (a short illustration follows this list).
- Recommendation: Hire a conversation designer, or perform “table reads” (role-playing the bot and user) to identify conversational gaps and awkwardness [00:11:56].
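To make the prescriptive-vs-descriptive distinction concrete, here is a hedged sketch: instead of enumerating every branch of the call, the script describes the bot's goal and lets the model adapt to whatever the representative says. The prompt text and field names are illustrative, not Superdial's actual scripts.

```python
# Prescriptive design: every branch is mapped ahead of time.
prescriptive_flow = {
    "ask_plan_active": {
        "yes": "ask_deductible",
        "no": "confirm_termination_date",
        "unclear": "repeat_question",
    },
    # ...one node per anticipated turn
}

# Descriptive design: describe the desired behavior and let the LLM adapt.
descriptive_system_prompt = """
You are calling an insurance company on behalf of a dental office.
Goal: confirm whether the member's plan is active and what deductible remains.
Ask general, open-ended questions, capture everything the representative
volunteers, and adapt to their answers. Never guess; if a value is not stated,
ask a follow-up question instead.
"""
```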
Orchestration Framework
- Challenge: Dealing with technical debt from initial, scrap-together pipelines [00:12:32].
- Solution: Using PipeCat, an open-source framework from Daily that is easy to extend and hack on, and that supports self-hosting and scaling, which is crucial for long phone calls (up to 1.5 hours) [00:12:39]. A conceptual sketch of such a pipeline follows.
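As a rough illustration of what an orchestration framework coordinates, here is a conceptual async loop. This is not PipeCat's actual API; the object interfaces are assumptions that only sketch the STT → LLM → TTS frame flow.

```python
# Conceptual sketch of the STT -> LLM -> TTS loop a framework like PipeCat coordinates.
# This is NOT PipeCat's actual API; the object interfaces here are assumptions.
async def run_voice_pipeline(audio_in, audio_out, stt, llm, tts):
    """Stream caller audio through STT -> LLM -> TTS and back out over telephony."""
    async for audio_chunk in audio_in:                 # audio frames from the phone call
        transcript = await stt.transcribe(audio_chunk)
        if not transcript:
            continue                                   # silence or partial audio: keep listening
        reply_text = await llm.respond(transcript)     # one conversational turn
        async for speech_chunk in tts.synthesize(reply_text):
            await audio_out.send(speech_chunk)         # stream speech back as soon as it's ready
```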
LLM Backbone (General AI Engineering)
While not unique to voice AI, choices for the LLM backbone are critical:
- LLM Endpoint: Superdial hosts its own OpenAI-compatible endpoint so it can interface with new voice AI tools and route requests to latency-sensitive models [00:13:27] (a routing sketch follows this list).
- Generative Responses: All generative responses are routed through TensorZero, which provides structured and typed LLM endpoints for experimentation in production [00:13:44].
- Logging and Observability: Langfuse is self-hosted for logging and observability, enabling anomaly detection, evaluations, and dataset management; self-hosting is essential for HIPAA compliance on healthcare calls [00:14:11].
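A hedged sketch of the kind of routing an owned, OpenAI-compatible endpoint enables: pick a fast model when the latency budget is tight and a stronger model otherwise. Model names and the threshold are illustrative assumptions; TensorZero and Langfuse integration details are omitted.

```python
# Illustrative model router for latency-sensitive voice turns.
# Model names and the latency threshold are assumptions, not Superdial's configuration.
from openai import OpenAI

client = OpenAI()  # in practice, pointed at a self-hosted OpenAI-compatible gateway

FAST_MODEL = "gpt-4o-mini"   # low-latency model for mid-call turns
SMART_MODEL = "gpt-4o"       # stronger model for offline / non-realtime steps

def generate_turn(messages, latency_budget_ms: int) -> str:
    """Route to a fast model when the caller is waiting on the line."""
    model = FAST_MODEL if latency_budget_ms < 1500 else SMART_MODEL
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```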
Text-to-Speech (TTS) System
- Challenge: Ensuring the LLM output, TTS engine output, and actual recording match, especially for sensitive data like member IDs (e.g., 12-digit strings) [00:14:52]. Pronouncing names correctly (e.g., “Kotus” vs. “Koutus”) is a common issue [00:15:09].
- Solution: Using custom pronunciation syntax (e.g., from Rime) to spell out exact pronunciations, and “spell functions” to manage pauses and breaks for long words or character strings [00:15:22]; a sketch of such a spell function follows this list. Reviewing audio recordings is crucial [00:15:46].
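A minimal sketch of a “spell function” of the kind described above: it breaks a long member ID into small character groups with explicit pauses so the TTS engine reads it slowly and verifiably. The chunk size and SSML-style pause markup are assumptions; the actual markup depends on the TTS provider.

```python
# Sketch of a "spell function" for long IDs; pause markup is TTS-provider-specific.
def spell_out(value: str, group_size: int = 3, pause: str = " <break time='300ms'/> ") -> str:
    """Turn 'ABC123456789' into 'A B C <pause> 1 2 3 <pause> ...' for TTS."""
    groups = [value[i:i + group_size] for i in range(0, len(value), group_size)]
    return pause.join(" ".join(group) for group in groups)

print(spell_out("ABC123456789"))
# A B C <break time='300ms'/> 1 2 3 <break time='300ms'/> 4 5 6 <break time='300ms'/> 7 8 9
```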
Mini Last Mile Problems and Solutions
- Bot Persona: Avoid names that are easily misunderstood over the phone (e.g., “Billy” vs. “Billi”) [00:16:06]. Dial in the bot’s persona early [00:16:27].
- Building from Scratch: Don’t build from scratch; leverage existing tools like PipeCat for a quick start, as the bot’s uniqueness lies in the conversation [00:16:30].
- Latency Tracking: Track “Time to First Byte” everywhere, as it’s a critical metric for real-time voice AI [00:16:40].
- Upgrade Paths: Plan for system improvements. Superdial uses Deepgram for speech-to-text, knowing they can fine-tune models for better transcription accuracy [00:16:51].
- Fallbacks: Have fallbacks ready for each part of the stack (e.g., for when OpenAI goes down) to prevent service interruptions [00:17:09]. Tools like TensorZero can help set this up [00:17:20]; a minimal fallback sketch follows this list.
- End-to-End Testing: This is unique to voice AI; telephony often serves as the test boundary. Methods include:
- Calling a fake phone number that plays an MP3 [00:17:42].
- Creating a simulated voice tree for the bot to navigate [00:17:53].
- Using generative simulation services (such as Coval) to have the bot talk to another bot [00:18:01].
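As referenced in the list above, here is a hedged sketch of per-component fallbacks combined with rough latency logging. The provider callables and logging hook are placeholders, not a specific vendor API; in practice a gateway like TensorZero can handle this routing for the LLM layer.

```python
# Illustrative fallback wrapper with rough latency logging (not a specific vendor API).
import logging
import time

logger = logging.getLogger("voice_stack")

def with_fallback(primary, backup, component: str):
    """Call the primary provider; on failure, log it and use the backup."""
    def call(*args, **kwargs):
        start = time.monotonic()
        try:
            result = primary(*args, **kwargs)
        except Exception:
            logger.warning("%s primary failed, falling back", component, exc_info=True)
            result = backup(*args, **kwargs)
        elapsed_ms = (time.monotonic() - start) * 1000
        # For streaming APIs, measure the first chunk instead to get true time-to-first-byte.
        logger.info("%s latency: %.0f ms", component, elapsed_ms)
        return result
    return call

# Usage sketch: wrap each part of the stack (STT, LLM, TTS) the same way, e.g.
# generate = with_fallback(primary_generate, backup_generate, component="llm")
```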
Takeaways for Voice AI Engineers
For those in vertical voice AI engineering:
- Choose your stack wisely: Good decisions here allow focus on unique conversational experiences [00:18:12].
- Laser focus on the last mile: This is where significant value can be provided and agents put to work reliably [00:18:23].
- Ride the wave: Stay current with new models to use them quickly and safely [00:18:30].