From: redpointai
The future of human-computer interaction is moving beyond traditional keyboards and mice towards more natural, human-like interfaces, primarily driven by advancements in voice AI [00:00:07]. This shift envisions computers interacting with users through “eyes” (cameras), “ears” (microphones), and “mouths” (speakers), mirroring human communication [00:00:14].
Current State and Key Applications
Companies like OpenAI are at the forefront of this transformation, with products such as ChatGPT Voice demonstrating the potential of voice interfaces [00:00:27]. LiveKit powers ChatGPT Voice, acting as the “nervous system” connecting human users with machines [00:00:43].
Russ D’sa, CEO of LiveKit, personally uses ChatGPT Voice as a tutor, asking questions about frontier technology like quantum theory or basic science topics like how lightning works [00:01:21]. This provides a “judgment-free” environment to ask “dumb questions” and access the “entire world’s knowledge” [00:02:17].
How LiveKit Powers Voice AI
LiveKit employs an SDK on the user’s device to access the camera and microphone [00:03:00]. When a user speaks, the speech is captured and sent over LiveKit’s global Edge Network to a backend agent [00:03:08].
For traditional voice mode (a minimal sketch follows the list):
- Audio from the user’s device is converted to text [00:04:18].
- This text is sent to a Large Language Model (LLM) [00:04:24].
- As tokens stream from the LLM, they are converted back into speech [00:04:30].
- The speech is sent back over LiveKit’s network to the client device and played out [00:04:35].
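The cascaded pipeline above can be summarized in a few lines. The following is a minimal sketch only, not LiveKit’s or OpenAI’s actual SDK; `transcribe`, `stream_llm_tokens`, `synthesize`, and `send_to_client` are hypothetical stand-ins for real STT, LLM, and TTS services.

```python
import asyncio

# --- Hypothetical stand-ins for real STT / LLM / TTS providers. ---
async def transcribe(audio_frames: bytes) -> str:
    return "how does lightning work?"            # pretend STT result

async def stream_llm_tokens(prompt: str):
    for token in ["Lightning ", "is ", "a ", "static ", "discharge."]:
        yield token                              # pretend streamed LLM tokens

async def synthesize(text: str) -> bytes:
    return text.encode()                         # pretend synthesized audio

async def send_to_client(audio: bytes) -> None:
    print(f"-> playing {len(audio)} bytes on the device")

# --- The cascaded pipeline itself: audio -> text -> LLM -> speech -> device. ---
async def handle_turn(audio_frames: bytes) -> None:
    user_text = await transcribe(audio_frames)           # 1. audio to text
    buffer = ""
    async for token in stream_llm_tokens(user_text):     # 2. text to LLM tokens
        buffer += token
        if buffer.endswith((".", "?", "!")):             # flush on sentence end
            await send_to_client(await synthesize(buffer))  # 3-4. TTS, then send back
            buffer = ""
    if buffer:
        await send_to_client(await synthesize(buffer))

asyncio.run(handle_turn(b"\x00" * 320))
```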
For advanced voice (like GPT-4o; a sketch follows the list):
- Speech goes directly from the client device over the network to the agent [00:04:51].
- The agent sends the audio directly via a real-time API (websocket) to a GPU machine [00:04:54].
- The audio is directly processed by GPT-4o, which is trained with joint embeddings between text and speech [00:05:05].
- Speech is generated directly by GPT-4o and returned through the network to the device [00:05:22].
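A rough sketch of this speech-to-speech path, where raw audio frames are relayed over a websocket with no intermediate text step. The endpoint URL, authentication, and message framing below are assumptions, not the schema of any real realtime API; `mic_frames` and `play_frame` are placeholders for the device’s microphone and speaker paths.

```python
import asyncio
import websockets  # pip install websockets

# Hypothetical endpoint: real speech-to-speech APIs define their own URL,
# auth, and message schema.
REALTIME_URL = "wss://example.invalid/realtime"

async def relay_audio(mic_frames, play_frame):
    """Forward raw audio frames upstream and play whatever audio comes back."""
    async with websockets.connect(REALTIME_URL) as ws:
        async def uplink():
            async for frame in mic_frames():      # PCM frames from the device
                await ws.send(frame)              # no STT step in between

        async def downlink():
            async for message in ws:              # model-generated speech
                play_frame(message)               # straight to playback

        await asyncio.gather(uplink(), downlink())
```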
Emergent and Existing Use Cases
Emergent use cases for voice AI include:
- Voice interfaces for information lookup [00:19:08].
- Tutors or therapists [00:19:11].
- New capabilities like Anthropic’s “computer use API” [00:19:39].
Existing, large-scale applications where voice AI is driving margin optimization:
- Telephony and telecom space [00:20:20].
- Customer support, replacing IVR phone tree systems with AI [00:20:32].
- Automated tasks like insurance eligibility lookups, where AI agents can both receive and make calls to humans [00:30:18].
The Vision: AGI and Human-like Interaction
If companies are building AGI (Artificial General Intelligence) – “tool builders building tool builders” [00:07:17] – the primary interface will shift from keyboards and mice to microphones and cameras [00:00:29]. This is because a computer resembling a human brain would naturally interact like a human [00:07:53]. LiveKit aims to be the “nervous system” that transports sensory information (from cameras/microphones) to the AI “brain” and delivers the brain’s output back out [00:08:11].
The Office of the Future
The nature of work will change drastically [00:08:54]. Instead of traditional offices, future interactions might resemble “Jarvis” from Iron Man, with voice-controlled interfaces permeating everything [00:08:58]. Creative tools will become more voice-based, interactive, and multimodal, with AI serving as the orchestrator or “Maestro” for tasks, reducing the mechanical work for humans [00:09:18].
The future will likely see a hybrid of “co-pilots” and “agents,” mimicking human collaboration in the workplace, where some tasks are owned autonomously by agents, while others involve pairing or collaborative meetings [00:10:15].
Challenges and Hybrid Modalities
While voice is powerful, it won’t entirely replace other modalities [00:11:25]. Text will retain its place for certain scenarios (e.g., privacy, personal preference for reading, messaging apps) [00:11:34]. Computer vision will also be crucial as humans are visual creatures [00:11:58].
When Voice Makes Sense
Voice is a natural modality for:
- Hands-free interfaces (driving, cooking) [00:13:12].
- Interactions with devices that are far away (like Siri or Alexa) [00:14:42].
- Situations where reading an entire menu, for example, would be a poor user experience [00:13:00].
The “Thin Client Dream” and Fluid Modalities
The idea of a “Thin Client” – a less powerful device relying on cloud processing – could manifest in a universal chat interface [00:14:21]. Just as humans communicate frequently via chat (texting, Telegram, WhatsApp), a chat interface could serve as a single UI for all applications, incorporating voice, on-the-fly generated UI, and text [00:15:15].
Future AI interfaces will treat modalities more fluidly [00:17:29]. This is akin to human pair programming, where multiple modalities are mixed on the fly: looking at a screen (computer vision), typing (text), and asking questions or giving instructions verbally (voice) [00:17:52]. This blending of experiences, especially with computer vision integration, is expected to become prevalent [00:18:28].
Technical Hurdles in Voice AI Deployment
Latency in voice AI has improved dramatically: conversational turn-around time has dropped from roughly 4 seconds in early 2023 to around 320 milliseconds on average with GPT-4o and LiveKit, close to human-level conversational speed (~300ms) [00:23:43]. In some cases, inference can be so fast (e.g., Cerebras at ~100ms) that models respond too quickly and interrupt users [00:24:26].
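To make those numbers concrete, here is an illustrative latency budget. Only the end-to-end figures (~4 s, ~320 ms, ~300 ms, ~100 ms) come from the conversation; the per-stage split below is purely an assumption for illustration.

```python
# Illustrative only: the per-stage numbers are assumptions, chosen so that a
# cascaded pipeline sums to roughly the quoted ~320 ms end-to-end figure.
budget_ms = {
    "end-of-speech detection": 100,
    "network (device <-> edge <-> agent)": 60,
    "model time-to-first-token/audio": 120,
    "speech synthesis + playback start": 40,
}
total = sum(budget_ms.values())
print(total, "ms vs ~300 ms human turn-taking")  # 320 ms
```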
However, the main challenge isn’t latency but rather systems integration [00:25:06]. AI models are not yet perfect; they can hallucinate, requiring human-in-the-loop oversight [00:22:52]. Integrating AI into existing, often bespoke, backend systems (like updating records in Salesforce or tracking tickets) presents a significant hurdle [00:25:28].
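A hedged sketch of what that integration work can look like: a tool the agent calls to write back into a ticketing or CRM system, with a human-in-the-loop gate for low-confidence results. The endpoint, payload shape, and confidence threshold are placeholders, not any particular vendor’s API.

```python
import requests  # pip install requests

# Hypothetical integration glue: the endpoint, auth, and payload shape are
# placeholders; real deployments target bespoke Salesforce/ticketing backends.
CRM_ENDPOINT = "https://example.invalid/api/tickets"

def commit_agent_result(ticket_id: str, summary: str, confidence: float) -> str:
    # Human-in-the-loop gate: models can hallucinate, so low-confidence
    # updates are queued for review instead of being written directly.
    if confidence < 0.9:
        return f"queued ticket {ticket_id} for human review"
    resp = requests.post(
        CRM_ENDPOINT,
        json={"ticket_id": ticket_id, "summary": summary, "source": "voice-agent"},
        timeout=10,
    )
    resp.raise_for_status()
    return f"updated ticket {ticket_id}"
```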
Giving AI the Sense of “Touch”
LiveKit is working on giving AI the ability to “touch” applications, going beyond seeing, hearing, and speaking [00:28:40]. This involves allowing AI agents to interact with virtualized browser instances in the cloud, using a Playwright interface to load web pages, click buttons, and fill forms [00:26:49]. If an agent gets stuck (e.g., needing a password or user choice), it can stream the browser video to a human user, who can then interact by clicking on “video pixels” to unblock the agent [00:27:27]. This capability aligns with Anthropic’s “computer use API” [00:28:22].
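A minimal sketch of this kind of browser “touch” using Playwright’s Python API. The selectors, form values, and the `notify_human` hand-off callback are hypothetical; per the description above, the actual hand-off streams the browser as video so the human can click to unblock the agent.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def run_form_task(url: str, notify_human) -> None:
    """Sketch of an agent 'touching' a web app; notify_human is a placeholder
    for streaming the browser view to a person for hand-off."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # virtualized browser in the cloud
        page = browser.new_page()
        page.goto(url)
        page.fill("#member-id", "A12345")           # hypothetical selector and value
        page.click("text=Check eligibility")        # hypothetical button text

        # If the agent hits something it cannot resolve (a password, a user
        # choice), hand control to a human instead of guessing.
        if page.query_selector("input[type=password]"):
            notify_human(page)                      # human clicks the streamed pixels
        browser.close()
```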
Architecture: On-Device vs. Cloud Inference
The split between on-device and cloud model inference is crucial. In humanoid robotics, for example, planning and reasoning might occur in the cloud, while “reflex action” and movement (kinematics) run on-device to ensure immediate responses to physical world events [00:32:53].
Drawing an analogy to humans, who don’t possess all world information in their heads, AI models will likely always have some inference occurring in the cloud for knowledge lookup or complex problem-solving [00:34:11]. Ideally, both local and cloud models could run in parallel, with the fastest accurate responder being used [00:35:41].
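A sketch of that “run both in parallel, take the fastest acceptable answer” idea, with `local_infer` and `cloud_infer` as hypothetical stand-ins for an on-device model and a hosted one.

```python
import asyncio

# Hypothetical stand-ins: a small on-device model and a larger cloud model.
async def local_infer(prompt: str) -> str:
    await asyncio.sleep(0.05)                 # fast, but may be less capable
    return "local answer"

async def cloud_infer(prompt: str) -> str:
    await asyncio.sleep(0.30)                 # slower round trip, more knowledge
    return "cloud answer"

async def answer(prompt: str, good_enough=lambda a: a is not None) -> str:
    """Run local and cloud inference in parallel; return the first acceptable result."""
    tasks = [asyncio.create_task(c(prompt)) for c in (local_infer, cloud_infer)]
    for fut in asyncio.as_completed(tasks):
        result = await fut
        if good_enough(result):
            for t in tasks:
                t.cancel()                    # drop the slower path
            return result
    return "no acceptable answer"

print(asyncio.run(answer("where is the nearest charger?")))
```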
Privacy-sensitive use cases may push for purely local, on-device processing [00:36:45]. However, even with initiatives like Apple Intelligence, data often still goes to a highly secure cloud for advanced processing [00:36:59]. Data will generally still be sent to the cloud for legal reasons, to generate updated training data, and to address erroneous examples [00:36:36].
Emerging Trends and Predictions
- Multimodal Models: The advent of fully multimodal models like GPT-4o, trained on speech and text, capable of taking any combination of modalities as input and outputting any combination, is a significant development [00:31:18].
- Underhyped Architectures: While Transformers are currently overhyped, “spiking neural networks” are underhyped and under-researched [00:37:20]. These analog neural networks are modeled more closely after the human brain and are potentially a perfect fit for audio and video signal processing, though they are harder to train [00:37:40] (a toy spiking-neuron sketch follows this list).
- AI Penetration Speed: The penetration of AI into everyday usage has been slower than anticipated, despite rapid growth, with many people still not using contemporary AI forms like ChatGPT [00:40:02].
- AI in Video Games: Voice AI applied to video games is underhyped [00:44:21]. The future of video games will feature open worlds filled with dynamic, lifelike characters that players can interact with using natural human inputs, leading to infinite possibilities and changing storylines [00:43:42].
- Tesla’s Self-Driving: Tesla’s self-driving technology is seen as a “marvel of technology” and a “visceral experience,” demonstrating science fiction-like capabilities that don’t receive enough recognition [00:41:52].
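To make “spiking” concrete, here is a toy leaky integrate-and-fire neuron: it integrates input current, leaks toward a resting potential, and emits discrete spikes when a threshold is crossed. The parameter values are arbitrary, and this says nothing about how such networks would actually be trained or deployed.

```python
import numpy as np

# Toy leaky integrate-and-fire neuron: the membrane potential leaks toward
# rest, integrates input current, and fires a discrete spike at threshold.
def lif_spikes(current, dt=1e-3, tau=0.02, v_rest=0.0, v_thresh=1.0):
    v, spikes = v_rest, []
    for i, inp in enumerate(current):
        v += dt * (-(v - v_rest) / tau + inp)   # leak + integrate
        if v >= v_thresh:
            spikes.append(i * dt)               # record spike time (seconds)
            v = v_rest                          # reset after firing
    return spikes

# Constant input current; units chosen arbitrarily so the neuron fires a few times.
print(lif_spikes(np.full(200, 80.0)))
```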