From: aidotengineer

Conversation is our oldest interface, and voice our original API, mastered even before fire [00:00:11]. While humans converse with each other live, AI has historically been locked out of assisting those conversations directly [00:00:29]. This article explores how increasingly powerful AI agents can keep humans in the loop through voice, their most natural interface [00:00:45].

The Horizon of Voice AI and Agents

The convergence of two significant developments makes this future possible:

  1. Highly Specialized Agents: These agents are becoming capable of performing complex tasks over longer time horizons [00:01:11]. This includes better RAG systems, multi-step tool calling, and enhanced agent orchestration [00:02:16].
  2. The Voice AI Wave: Conversational agents are making AI highly accessible, allowing users to interact via calls, search for information, and receive responses [00:01:21]. Improvements include reduced time to first token and latency, with full-duplex speech-to-speech models on the horizon [00:02:32].

While these “explosions” are ongoing, the user experience for ambient agents that respond to events rather than text chats is still being defined [00:01:39].

Voice-first AI Overlays

A voice-first AI overlay sits alongside human-to-human calls, providing real-time assistance without becoming a third speaker [00:05:33]. It is native to voice, enhancing and augmenting dialogue passively [00:05:45]. Unlike typical voice AI interactions where a human speaks directly with an AI, the overlay paradigm involves an AI operating between two humans [00:06:08].

How Overlays Work

The overlay passively listens to the natural dialogue and surfaces relevant help in specific contexts, such as language or phrase suggestions and definitions [00:06:18]. It remains out of the way until needed, effectively enabling an ambient agent that is conversationally aware [00:06:35].
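As a concrete illustration of that passive, context-gated behavior, the sketch below shows one way such a trigger might work. The caption event shape, the user's list of unfamiliar terms, and the eight-second silence threshold are all assumptions for illustration, not details from the talk.

```typescript
// Conversationally aware trigger (sketch). Stays silent by default and only
// returns a hint when a recognizable context appears in the live captions.

type Hint =
  | { kind: "definition"; term: string }
  | { kind: "phrase-suggestion"; context: string };

interface CaptionEvent {
  speaker: "me" | "other";
  text: string;
  timestampMs: number;
}

export function maybeTrigger(latest: CaptionEvent, recent: CaptionEvent[]): Hint | null {
  const text = latest.text.toLowerCase();

  // Context 1: the other speaker used a term the user has flagged as unfamiliar.
  const unfamiliarTerms = ["amortization", "escrow"]; // hypothetical user dictionary
  const term = unfamiliarTerms.find((t) => text.includes(t));
  if (term && latest.speaker === "other") return { kind: "definition", term };

  // Context 2: it is the user's turn and they have been silent for a while,
  // so a phrase suggestion may help them respond.
  const lastMine = [...recent].reverse().find((e) => e.speaker === "me");
  const silenceMs = latest.timestampMs - (lastMine?.timestampMs ?? 0);
  if (latest.speaker === "other" && silenceMs > 8000) {
    return { kind: "phrase-suggestion", context: recent.map((e) => e.text).join("\n") };
  }

  return null; // remain passive: no third speaker, no interruption
}
```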

Demonstration Example

A demo illustrates real-time conversational assistance in a live foreign-language call, where the user may not speak the language fluently [00:03:02]. The overlay pipeline involves (a minimal sketch follows the list):

  • Caption scraping [00:03:13]
  • Smart debouncing [00:03:17]
  • Managing context so the LLM returns relevant foreign-language suggestions [00:03:17]
  • An LLM pipeline behind the suggestion and translation endpoints [00:03:27]
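A minimal sketch of those stages is below, assuming a browser context where captions are scraped from the call UI. The `/api/suggest` and `/api/translate` endpoints, the payload shapes, and the 600 ms debounce window are hypothetical.

```typescript
const DEBOUNCE_MS = 600;
const CONTEXT_TURNS = 12;

const context: string[] = []; // rolling window of recent caption lines
let pending: ReturnType<typeof setTimeout> | undefined;

// Called by the caption scraper whenever a new caption line appears.
export function onCaption(line: string): void {
  context.push(line);
  if (context.length > CONTEXT_TURNS) context.shift();

  // Smart debouncing: wait for the caption stream to settle so breath pauses
  // and partial captions do not trigger redundant LLM requests.
  if (pending !== undefined) clearTimeout(pending);
  pending = setTimeout(() => void requestAssistance(), DEBOUNCE_MS);
}

async function requestAssistance(): Promise<void> {
  const payload = JSON.stringify({ context: context.join("\n"), targetLanguage: "es" });

  // Suggestion and translation endpoints (hypothetical) are queried in parallel.
  const [suggestion, translation] = await Promise.all([
    fetch("/api/suggest", { method: "POST", body: payload }).then((r) => r.json()),
    fetch("/api/translate", { method: "POST", body: payload }).then((r) => r.json()),
  ]);

  renderOverlay(suggestion, translation);
}

function renderOverlay(suggestion: unknown, translation: unknown): void {
  // Placeholder: surface results in a glanceable, dismissible panel.
  console.log({ suggestion, translation });
}
```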

Placement in the AI Landscape

Overlays fit into the AI stack by sitting above core speech models (speech-to-text, text-to-speech) and below intent and agent frameworks [00:06:58]. They decide when and where an agent’s help surfaces, without being concerned with the agent’s internal workings [00:07:07]. Unlike meeting bots or notetakers that operate after a conversation, overlays function during live interactions [00:07:27]. They do not participate directly in the dialogue but amplify the humans in the room [00:07:43].
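One way to picture that placement is as a thin layer that depends on speech primitives below it and treats the agent above it as a black box. The interface names here are illustrative, not an established API.

```typescript
// Where the overlay sits: above core speech models, below the agent framework.

interface SpeechLayer {
  transcribe(audio: ArrayBuffer): Promise<string>; // speech-to-text
  synthesize(text: string): Promise<ArrayBuffer>;  // text-to-speech
}

// The overlay treats the agent as a black box that can produce help on demand.
interface AgentLayer {
  assist(context: string): Promise<string>;
}

// The overlay only decides when and where help surfaces; it never joins the
// dialogue and never inspects the agent's internals.
export class Overlay {
  constructor(private speech: SpeechLayer, private agent: AgentLayer) {}

  async onAudioChunk(audio: ArrayBuffer, show: (hint: string) => void): Promise<void> {
    const text = await this.speech.transcribe(audio);
    if (!this.shouldSurface(text)) return;  // stay passive during the live call
    show(await this.agent.assist(text));    // amplify the humans, do not speak for them
  }

  private shouldSurface(text: string): boolean {
    return text.trim().length > 0; // placeholder timing/relevance policy
  }
}
```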

Design Challenges

Designing effective overlays requires significant UX research, focusing on cognitive load, overlay design, and timing [00:07:55]. The challenge lies in ensuring that assistance is provided optimally in a live conversation:

  • Too Early: If help arrives too early, it becomes an interruption [00:08:40].
  • Too Late: If it comes too late, the opportunity for highest value is missed, making it useless [00:08:49].
  • Wrong Context: If help is loaded with the wrong context, it becomes spam [00:08:57].
  • Derailment: Even timely, relevant help is unusable if it derails the conversation, because it has not respected the conversational flow [00:09:06].
  • Latency: Throughout all of this, latency must still be well-managed [00:09:20].

In summary, the key challenges are timing, relevance, attention, and latency [00:09:26].
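A gating function along the following lines captures those four constraints in code. The thresholds (ten-second staleness window, 0.6 relevance score, one-second latency budget) are assumed values for illustration, not figures from the talk.

```typescript
// Gate that encodes the four challenges: timing, relevance, attention, latency.

interface HintCandidate {
  relevance: number;   // 0..1, scored by the suggestion model
  createdAtMs: number; // when the triggering utterance ended
  readyAtMs: number;   // when the suggestion came back from the model
}

interface ConversationState {
  speakerIsMidTurn: boolean; // someone is actively talking right now
  visibleHints: number;      // hints already on screen
}

export function shouldSurface(hint: HintCandidate, state: ConversationState, nowMs: number): boolean {
  if (state.speakerIsMidTurn) return false;                    // too early: an interruption
  if (nowMs - hint.createdAtMs > 10_000) return false;         // too late: the moment has passed
  if (hint.relevance < 0.6) return false;                      // wrong context: reads as spam
  if (state.visibleHints >= 1) return false;                   // attention: avoid derailing the call
  if (hint.readyAtMs - hint.createdAtMs > 1_000) return false; // latency budget exceeded
  return true;
}
```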

Engineering Challenges: The Four Horsemen of Overlay Engineering

When building these systems, developers will likely encounter four main technical hurdles [00:10:47]:

  1. Jitterbug Input: Speech-to-text output can stall and stutter when speakers pause for breath, producing inconsistent, fragmented input. Debouncing is crucial to manage these moments [00:10:59].
  2. Context Repair: Live assistance runs against a sub-second speed limit, so context repair and the rest of the processing pipeline must be highly optimized [00:11:15].
  3. Premature Interrupt or No Show: Help can arrive too early (premature interrupt) or too late/not at all (no show) [00:11:28]. Good conversational awareness is needed to know the right moment to intervene [00:11:40].
  4. Glanceable Ghost: Hints or suggestions tax a user’s attention [00:11:53]. The overlay should not obstruct the view, and must be flexible and dismissible, adhering to strong user interface principles (a minimal UI sketch follows this list) [00:12:07].
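For the fourth hurdle, a glanceable hint can be kept non-obstructive, dismissible, and self-expiring. The sketch below assumes a browser overlay; the placement, styling, and eight-second auto-dismiss are arbitrary choices, not the talk's design.

```typescript
// Glanceable, dismissible hint (sketch): kept at the screen edge so it never
// covers the speakers, removable with one click, and self-expiring so stale
// help does not linger.
export function showHint(text: string, ttlMs = 8_000): void {
  const hint = document.createElement("div");
  hint.textContent = text;

  Object.assign(hint.style, {
    position: "fixed",
    bottom: "16px",
    right: "16px",
    maxWidth: "320px",
    padding: "8px 12px",
    background: "rgba(0, 0, 0, 0.75)",
    color: "white",
    borderRadius: "8px",
    cursor: "pointer",
  });

  const dismiss = () => hint.remove();
  hint.addEventListener("click", dismiss); // dismissible
  setTimeout(dismiss, ttlMs);              // self-expiring

  document.body.appendChild(hint);
}
```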

Design Principles for Overlays

For any overlay, several principles should be considered:

  • Transparency and Control: Users should be able to decide how much the overlay intervenes in the conversation [00:09:44].
  • Minimum Cognitive Load: Even the most intelligent system is unusable if it overloads speakers or derails their conversation [00:09:56].
  • Progressive Autonomy: The system should allow users to moderate the amount of help they receive over time, supporting learning and natural progression [00:10:18].
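These principles could surface as user-facing settings along the following lines. The level names, the hint cap, and the acceptance-rate heuristic are illustrative assumptions.

```typescript
// How the three principles could appear as overlay settings (sketch).

export type AssistanceLevel = "off" | "on-request" | "ambient" | "proactive";

export interface OverlaySettings {
  level: AssistanceLevel;    // transparency and control: the user decides how much the overlay intervenes
  maxHintsPerMinute: number; // minimum cognitive load: cap what competes for attention
  showWhyTriggered: boolean; // transparency: explain why a hint appeared
}

// Progressive autonomy: if the user rarely accepts hints, step the overlay
// down toward less intervention so help fades as fluency grows.
export function stepDown(settings: OverlaySettings, acceptanceRate: number): OverlaySettings {
  if (acceptanceRate >= 0.3) return settings;
  const order: AssistanceLevel[] = ["proactive", "ambient", "on-request", "off"];
  const next = order[Math.min(order.indexOf(settings.level) + 1, order.length - 1)];
  return { ...settings, level: next };
}
```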

Exciting Aspects of the Space

Several aspects make this field particularly promising:

  • Latency Improvements: Round-trip calls to fast LLM providers can now complete within 500-700 milliseconds, with very low time to first token (a budget-enforcement sketch follows this list) [00:12:25].
  • Privacy by Design: Models are becoming increasingly capable while being smaller, raising the possibility of running them entirely on-device to ensure privacy by default [00:12:49].
  • Strong User Experience Ethos: Bringing a UX stance that values and respects human conversation, treating it as something native that needs protection [00:13:10].
  • Voice as a Linkable Surface: The speculative idea of ambient agents in live human conversations being linked or orchestrated in new ways [00:13:37].
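On the latency point, one way to hold a 500-700 ms budget is to make every request abortable and to measure time to first token as the stream arrives. The endpoint and the 700 ms default below are assumptions for illustration.

```typescript
export async function suggestWithinBudget(context: string, budgetMs = 700): Promise<string | null> {
  const controller = new AbortController();
  const deadline = setTimeout(() => controller.abort(), budgetMs);
  const started = performance.now();

  try {
    const response = await fetch("/api/suggest", {  // hypothetical endpoint
      method: "POST",
      body: JSON.stringify({ context }),
      signal: controller.signal,
    });

    // Read the streamed body so time to first token can be observed.
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let text = "";
    let firstTokenAt: number | null = null;

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      if (firstTokenAt === null) firstTokenAt = performance.now() - started;
      text += decoder.decode(value, { stream: true });
    }

    console.debug(`time to first token: ${firstTokenAt?.toFixed(0)} ms`);
    return text;
  } catch {
    return null; // budget exceeded or request failed: show nothing rather than stale help
  } finally {
    clearTimeout(deadline);
  }
}
```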

Open Questions and Future Directions for Voice-first AI

Several challenges and areas for future exploration remain:

  • ASR Errors Cascading: Automatic Speech Recognition (ASR) errors can lead to incorrect advice (e.g., “don’t” transcribed as “do”) [00:14:05]. Pairing ASR output with extensive conversational context could be a mitigation (a prompt sketch follows this list) [00:14:27].
  • Prosody and Timing Complexity: Human ears are hardwired to detect micro-intonation signals, which are lost when converting speech straight to text [00:14:36]. Whether relevant assistance is still possible despite this information loss needs further investigation [00:15:01].
  • Security Surface: Agents interacting in live conversations introduce a new security surface that requires careful consideration [00:15:10].
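As a sketch of that mitigation, the prompt sent to the model can carry the broader conversation alongside the latest, possibly noisy line, so a dropped negation stands out as a contradiction. The wording below is an assumption, not the talk's prompt.

```typescript
// Pair the latest, possibly noisy ASR line with broader conversational context
// so the model can catch contradictions such as "don't" transcribed as "do".
export function buildContextAwarePrompt(latestAsrLine: string, recentTurns: string[]): string {
  return [
    "You are assisting quietly in a live call. Captions may contain ASR errors.",
    "If the latest line contradicts the earlier context (for example, a dropped",
    "negation), prefer the interpretation consistent with the context and flag it briefly.",
    "",
    "Earlier context:",
    ...recentTurns,
    "",
    `Latest (possibly noisy) line: ${latestAsrLine}`,
    "Suggested response for the user:",
  ].join("\n");
}
```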

Extensions and Future Directions

  • Full-Duplex Speech Models: These models, now on the horizon, would process raw audio directly rather than first converting it to text, potentially offering contextual suggestions based on audio features [00:15:36].
  • Multimodal Understanding: Integrating live video alongside audio could provide additional information to make AI interaction more helpful [00:15:59].
  • Speculative Execution and Caching: These techniques could further enhance responsiveness and efficiency [00:16:12].
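A small sketch of how speculative execution and caching could combine: start a suggestion request as soon as the context settles and key the result on the recent turns, so a repeated context is served instantly. The endpoint and cache key are assumptions.

```typescript
const suggestionCache = new Map<string, Promise<string>>();

// Key the cache on the last few turns of conversation.
function contextKey(turns: string[]): string {
  return turns.slice(-4).join("|");
}

// Speculative execution: kick off a suggestion as soon as the context settles,
// so if the user wants it moments later the answer is already in flight.
export function prefetchSuggestion(turns: string[]): void {
  const key = contextKey(turns);
  if (suggestionCache.has(key)) return;

  const request = fetch("/api/suggest", {  // hypothetical endpoint
    method: "POST",
    body: JSON.stringify({ context: turns.join("\n") }),
  }).then((r) => r.text());

  suggestionCache.set(key, request);
}

// Caching: a repeated context is served from the map instead of a new call.
export function getSuggestion(turns: string[]): Promise<string> | undefined {
  return suggestionCache.get(contextKey(turns));
}
```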

The field is dynamic and interesting, with the future appearing conversational given the explosion of voice AI [00:16:19]. While the technology seems ready, the interfaces are still evolving [00:16:31].