From: redpointai

The way humans interface with advanced, human-like computers is likely to mirror how humans communicate with each other: using eyes, ears, and mouths [00:00:02]. In this analogy, the computer’s eye is a camera, its ears are microphones, and its mouth is a speaker [00:00:14]. Russ D’Sa, founder and CEO of LiveKit, discusses the future of voice AI and how LiveKit facilitates these interactions, powering applications like ChatGPT Voice [00:00:20].

Current Applications of Voice AI

LiveKit currently powers sophisticated applications, including OpenAI’s ChatGPT Voice [00:00:43]. Russ D’Sa uses ChatGPT Voice personally, often while driving with a single earbud in for safety [00:01:01]. He treats it as a personal tutor for topics ranging from frontier tech such as quantum theory, quantum entanglement, quantum computers, and CRISPR to basic phenomena like how lightning works [00:01:21]. This makes learning judgment-free: users can ask any “dumb question” without embarrassment and tap into vast knowledge [00:01:53].

How LiveKit Powers Voice AI

LiveKit’s technology underpins voice AI interactions through a specific workflow:

  1. Device-side SDK: A LiveKit SDK sits on the user’s device, accessing the camera and microphone [00:02:55].
  2. Audio Transmission: When the user speaks, the SDK captures the speech and sends it over LiveKit’s global Edge Network [00:03:07]. This network consists of servers worldwide that communicate to form a mesh fabric [00:03:16].
  3. Agent Processing: The audio is transmitted through the network to an “agent” on the backend [00:03:32]. LiveKit’s agents framework functions like an application server [00:03:38].
  4. Speech-to-Text (Traditional Voice Mode): In the traditional voice mode (before advanced voice), the agent converts the audio to text [00:04:14]. This text is then sent to a Large Language Model (LLM) [00:04:26].
  5. Text-to-Speech (Traditional Voice Mode): As tokens stream out of the LLM, they are converted back into speech and sent back over LiveKit’s network to the client device, where the SDK plays the audio [00:04:30].
  6. Direct Audio Inference (Advanced Voice): With advanced voice, speech goes directly from the client, over LiveKit’s network, to the agent, which forwards it via a real-time API (a websocket connection) to a GPU machine [00:04:48]. The audio is fed directly into a model like GPT-4o, which is trained on joint embeddings of text and speech [00:05:05]. Inference runs directly on the audio embeddings, GPT-4o generates speech, and that speech is sent back through LiveKit’s network and played back on the device [00:05:22].
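
The cascaded flow in steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration only, not LiveKit’s actual agents API: the DummySTT, DummyLLM, and DummyTTS classes are placeholders standing in for real speech-to-text, LLM, and text-to-speech components.

```python
"""Minimal sketch of the cascaded voice pipeline (steps 4-5 above).
Not LiveKit's actual agents API: the Dummy* classes are placeholders for
real speech-to-text, LLM, and text-to-speech components."""

import asyncio
from typing import AsyncIterator


class DummySTT:
    async def transcribe(self, audio: bytes) -> str:
        # A real STT service would return the user's transcribed speech.
        return "How does lightning work?"


class DummyLLM:
    async def stream(self, prompt: str) -> AsyncIterator[str]:
        # Tokens stream out of the LLM as they are generated.
        for token in ["Lightning ", "is ", "a ", "giant ", "electrical ", "spark."]:
            yield token


class DummyTTS:
    async def synthesize(self, text: str) -> bytes:
        # A real TTS engine would return audio frames for this text chunk.
        return text.encode()


async def traditional_voice_turn(user_audio: bytes) -> AsyncIterator[bytes]:
    """Speech -> text -> streamed LLM tokens -> speech, sent back to the client SDK."""
    stt, llm, tts = DummySTT(), DummyLLM(), DummyTTS()
    text = await stt.transcribe(user_audio)
    async for token in llm.stream(text):
        yield await tts.synthesize(token)


async def main() -> None:
    # In "advanced voice" mode (step 6), all three stages collapse into a single
    # realtime speech-to-speech model reached over a websocket connection.
    async for audio_chunk in traditional_voice_turn(b"\x00\x01"):
        print(audio_chunk)


asyncio.run(main())
```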

LiveKit as the “Nervous System” for AI

LiveKit is described as the “nervous system” for AI [00:05:38]. Initially, LiveKit focused on connecting humans to humans via video conferencing and live streaming [00:05:43]. However, with voice mode, the focus shifted to connecting humans with machines [00:05:58].

If companies like OpenAI, Anthropic, and Google (Gemini) are building AGI (Artificial General Intelligence) – which can be understood as “tool builders building tool builders” or a synthetic human brain – then LiveKit provides the interface [00:06:05]. Just as humans use their eyes, ears, and mouths to communicate, a computer that acts like a human brain will also use these senses for interaction [00:07:44].

  • Eyes: Camera [00:08:00]
  • Ears: Microphones [00:08:03]
  • Mouth: Speaker [00:08:05]

If foundational model companies are building the “brain” of AI, LiveKit is building the “nervous system.” Its role is to transport information from these senses to the AI brain and then transport the brain’s output back out to the world [00:08:11].

The Future of Human-Computer Interaction

The traditional keyboard and mouse interfaces are expected to be replaced by microphones and cameras [00:00:29]. The future of the office, and work itself, will drastically change [00:08:54]. Interfaces like Iron Man’s Jarvis system, where users interact with AI via voice and multimodal inputs, are expected to become common [00:09:01].

Creative Tools and AI Roles

Creative tools will become more voice-based, interactive, and multimodal [00:09:18]. Users will act as “orchestrators” or “maestros,” shaping outputs while the AI handles the mechanical work [00:09:52].

The debate between AI “co-pilots” and “agents” will likely resolve into a hybrid model, similar to how humans collaborate [00:10:15]. Some AI will operate autonomously (agents), while others will pair with humans (co-pilots) for collaborative tasks, design reviews, and multi-stakeholder meetings [00:10:31]. The key difference will be an increased reliance on AI interaction over human interaction [00:11:03].

Modality Integration and Hybrid Interfaces

While voice will be dominant, text will retain its place for privacy or when information is better retained by reading [00:11:34]. The future will involve a hybrid of text, voice, and computer vision [00:11:48].

For instance, ordering food from a new restaurant might use voice for the initial intent but present a visual menu (text + UI) for detailed selection, since reading an entire menu aloud is poor UX [00:12:40]. Hands-free scenarios, like driving or cooking, are obvious applications where voice is a natural modality [00:13:12]. Assistants like Siri and Alexa already demonstrate these use cases [00:13:47]. The future will likely blend voice, text, and dynamically generated UI [00:13:58].
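
One way to picture such a blended response is a small data structure that carries an optional spoken portion alongside an optional generated UI payload. The schema below is purely illustrative; the field names and menu example are assumptions, not any particular product’s API.

```python
"""Illustrative sketch of a blended voice + generated-UI response.
The schema and field names are hypothetical, not a real product API."""

from dataclasses import dataclass, field
from typing import Any


@dataclass
class MultimodalResponse:
    spoken_text: str | None = None        # short confirmation read aloud
    display_text: str | None = None       # longer content better read than heard
    ui_payload: dict[str, Any] = field(default_factory=dict)  # dynamically generated UI


# Ordering food: voice acknowledges the intent, while the menu is rendered as UI.
response = MultimodalResponse(
    spoken_text="Sure, here's the menu. What would you like?",
    ui_payload={
        "type": "menu",
        "items": [
            {"name": "Margherita pizza", "price_usd": 14},
            {"name": "Carbonara", "price_usd": 16},
        ],
    },
)
```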

The “Thin Client” Dream and Chat Interfaces

Historically, there has been an oscillation between powerful client devices and “thin clients” where computing is largely server-side [00:14:13]. The modern chat interface, as seen in texting, Telegram, WhatsApp, Twitter, and Slack, could represent the “thin client dream” for AI interaction [00:15:15]. This single UI for all applications could incorporate voice, on-the-fly generated UI, and text, making complex interactions familiar and accessible [00:15:48].

Current AI applications often operate in distinct “modes” (e.g., voice for car/couch, text for crowded spaces or coding in VS Code) [00:16:47]. However, future UI explorations will treat modalities more fluidly. An example is pair programming with an AI: it would involve mixing modalities like looking at a screen (computer vision), typing, and speaking, with the AI seamlessly taking over tasks or offering corrections [00:17:52]. This “mixed modality” interaction is a significant area for future development [00:18:16].

Applications and Use Cases of Voice AI

OpenAI, Google’s Gemini Live, Character AI, and Perplexity are building voice interfaces for their systems, enabling users to look up information, ask questions, or use them as tutors or therapists [00:18:55]. These are considered “emergent use cases,” pushing the envelope of interaction and capabilities [00:19:18].

Alongside these emergent applications, voice AI is being integrated into “low-hanging fruit” areas that already operate at massive scale with large budgets, such as telephony [00:19:53]. Companies like Sierra and Parloa are disrupting this sector by integrating AI to reduce costs [00:20:41]. The telephone, a voice-native system that has existed for over 50 years, carries billions of calls every month, making it an immediate, high-penetration use case for AI-based voice [00:20:56].

Challenges and Considerations

Latency

Chatbots are far less latency-sensitive than real-time voice conversations [00:22:00], and although chatbots have exploded, voice AI in customer support hasn’t fully taken off for reasons that go beyond latency alone [00:22:07]. Historically, voice AI latency was high (roughly four seconds of conversational turnaround in early 2023) [00:23:43]. With GPT-4o and real-time APIs, however, latency has dropped to an average of about 320 milliseconds with LiveKit, approaching human conversational speed (around 300 milliseconds) [00:23:56]. In some cases, models served on very fast inference platforms like Cerebras respond so quickly (around 100 milliseconds) that the AI interrupts the user [00:24:27]. Latency, in other words, is no longer the primary blocker for voice AI in many use cases [00:24:55].
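
Here, “conversational turnaround” means the gap between the end of the user’s speech and the first audio the agent plays back. A rough sketch of how such a measurement might be wired up follows; the end-of-speech and first-audio hooks are hypothetical stand-ins for a real pipeline’s voice-activity-detection and playback events.

```python
"""Rough sketch of measuring conversational turn latency.
The on_end_of_user_speech / on_first_agent_audio hooks are hypothetical;
a real pipeline would wire these to its VAD and audio-playback events."""

import time

HUMAN_TURN_MS = 300      # typical human response gap cited above
TOO_FAST_MS = 100        # fast enough that the agent starts interrupting

_end_of_speech_at: float | None = None


def on_end_of_user_speech() -> None:
    global _end_of_speech_at
    _end_of_speech_at = time.monotonic()


def on_first_agent_audio() -> None:
    if _end_of_speech_at is None:
        return
    turn_ms = (time.monotonic() - _end_of_speech_at) * 1000
    if turn_ms < TOO_FAST_MS:
        print(f"{turn_ms:.0f} ms: faster than a human; risks interrupting the user")
    elif turn_ms <= HUMAN_TURN_MS + 50:
        print(f"{turn_ms:.0f} ms: roughly human conversational speed")
    else:
        print(f"{turn_ms:.0f} ms: noticeably slower than a human turn")
```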

Systems Integration and Human-in-the-Loop

The bigger challenge is systems integration [00:25:10]. Replacing existing, large-scale, functional telephone systems with AI poses a significant risk to customer satisfaction (NPS) [00:22:27]. AI models are not yet perfect; they can hallucinate and require “bulletproofing” [00:22:52]. This necessitates a “human in the loop”: human contact centers cannot be fully shut off, and humans must be on standby to correct or take over for the AI agent [00:23:09]. Integrating AI into existing backend systems for updating records (e.g., Salesforce or custom tracking systems) is a further complex challenge, though the hope is that AI models, with capabilities like Anthropic’s Computer Use API, will eventually handle these tasks [00:25:26].
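
A minimal sketch of the human-in-the-loop pattern described above, under the assumption that the agent can score its own confidence and that a queue of standby humans exists; both are illustrative abstractions, not a specific contact-center or LiveKit API.

```python
"""Sketch of a human-in-the-loop handoff for a voice support agent.
The confidence score and the human queue are assumed abstractions,
not any particular contact-center or LiveKit API."""

from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # below this, a standby human takes over


@dataclass
class AgentReply:
    text: str
    confidence: float  # assumed self-estimate; real systems may use other signals


def handle_turn(reply: AgentReply, human_queue: list[str], caller_id: str) -> str:
    """Speak the AI reply if it looks safe; otherwise escalate to a human."""
    if reply.confidence >= CONFIDENCE_THRESHOLD:
        return reply.text
    human_queue.append(caller_id)          # human on standby corrects or takes over
    return "Let me connect you with a teammate who can help with that."


# Example: a low-confidence answer gets escalated rather than spoken.
queue: list[str] = []
print(handle_turn(AgentReply("Your router model is...", 0.42), queue, "caller-123"))
print(queue)
```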

On-Device vs. Cloud Models

The split between on-device and cloud processing for AI models is crucial, particularly in areas like robotics [00:32:38]. For humanoid robots, critical functions like “reflex actions” and immediate movements (kinematics) must run on the device to ensure safety and responsiveness (e.g., avoiding a car while walking) [00:32:51]. Higher-level functions like planning and reasoning might still occur in the cloud [00:33:00].

Drawing an analogy to humans, no single human possesses all the world’s information [00:34:11]. Humans offload knowledge retrieval to external tools (like phones) and problem-solving to experts (like calling a contact center for router issues) [00:34:27]. Similarly, AI systems will always involve some cloud inference for comprehensive knowledge and complex problem-solving, alongside on-device processing for immediate, local tasks [00:35:01].

The ideal scenario, assuming no resource constraints, would be to process in parallel both locally and in the cloud and adopt whichever real answer arrives first [00:35:41]. Data will generally still be sent to the cloud, whether for legal reasons or to generate training data (e.g., collecting erroneous examples for labeling) [00:36:12]. While some privacy-sensitive use cases might prefer purely local, on-device processing, even newer efforts like Apple Intelligence still send data to a highly secure cloud [00:36:45].
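
The “process in parallel, adopt the fastest real answer” idea maps naturally onto racing two inference calls and keeping the first usable result. A minimal sketch under that assumption, with stand-in local and cloud calls rather than real APIs:

```python
"""Sketch of racing on-device and cloud inference, keeping the first
acceptable answer. Both inference functions are stand-ins, not real APIs."""

import asyncio


async def local_inference(prompt: str) -> str | None:
    await asyncio.sleep(0.05)                  # small on-device model: fast
    return None if "world knowledge" in prompt else f"local answer to {prompt!r}"


async def cloud_inference(prompt: str) -> str | None:
    await asyncio.sleep(0.30)                  # network round trip + larger model
    return f"cloud answer to {prompt!r}"


async def answer(prompt: str) -> str:
    """Run both paths in parallel; adopt the fastest non-empty ("real") answer."""
    tasks = {
        asyncio.create_task(local_inference(prompt)),
        asyncio.create_task(cloud_inference(prompt)),
    }
    try:
        while tasks:
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                result = task.result()
                if result is not None:         # first usable answer wins
                    return result
        return "no answer available"
    finally:
        for task in tasks:
            task.cancel()                      # drop whichever path lost the race


print(asyncio.run(answer("set a timer for 10 minutes")))
print(asyncio.run(answer("a question that needs world knowledge")))
```

In practice the losing path might be cancelled (as here) or allowed to finish in the background, for example to log examples for training data as described above.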

Overhyped and Underhyped AI

  • Overhyped: Transformers [00:37:20]
  • Underhyped/Under-researched: Spiking Neural Networks [00:37:29]. These are analog neural networks modeled more closely after the human brain and how neurons interact [00:37:40]. They are particularly well-suited for audio and video signal processing, although harder to train than Transformers [00:38:13].

Changing Perspectives on AI Penetration and Moats

A year ago, there was an expectation that AI, particularly tools like ChatGPT, would penetrate the market much more rapidly and deeply [00:40:02]. Despite ChatGPT’s reported 200 million-plus weekly active users, many people have heard of AI but do not use it in their daily work, which was a surprising observation [00:40:15].

The initial belief that AI applications would quickly develop “moats” (sustainable competitive advantages) has changed [00:38:32]. The underlying models evolve so rapidly that defensible moats are hard to build [00:38:46]. The focus has shifted to the importance of fast teams deeply embedded with their customers, understanding their needs, and building very quickly [00:38:53]. While unique data assets still hold some value, the primary competitive edge comes from rapid adaptation and deep customer understanding [00:39:08]. Application-layer moats will be similar to traditional ones: applications that are hard to build and hard to scale to many users [00:39:24].

Exciting AI Startups and Future Applications

Russ D’Sa is most excited about Tesla, particularly its self-driving capabilities and the Tesla Bot [00:41:04]. He describes Tesla’s self-driving as “magical” and a “marvel of technology,” a visceral way to experience the tech firsthand [00:41:52].

If starting a new AI application, Russ would build a video game with a novel way of interacting with Non-Player Characters (NPCs) [00:43:17]. He envisions expansive open worlds filled with dynamic, lifelike characters and infinitely permutable storylines, akin to a “Choose Your Own Adventure” with endless possibilities [00:43:42]. The application of voice AI to video games, allowing natural human inputs to interact with characters, is an underhyped area he finds incredibly promising [00:44:15].

Learn More About LiveKit

To learn more about LiveKit and its work, visit the LiveKit website and documentation.