From: redpointai

The future of interacting with computers is envisioned to shift from traditional keyboards and mice to more natural human communication methods, primarily voice and camera interfaces [00:00:07]. This shift is driven by the development of human-like AI systems that will respond and act similarly to humans [00:07:30]. The eyes will be cameras, the ears microphones, and the mouth a speaker, creating a new form of human-computer interaction [00:00:14].

LiveKit’s Role as a “Nervous System” in AI Integration

LiveKit, founded by Russ D’Sa, serves as a “nervous system” for AI models like OpenAI’s ChatGPT voice [00:19:05]. Initially, LiveKit’s open-source project focused on connecting humans to other humans through video conferencing and live streaming [00:05:43]. However, with the emergence of voice mode, their focus shifted to connecting human beings with machines [00:05:56]. If foundational model companies like OpenAI and Anthropic are building the “brain” (AGI), LiveKit aims to build the “nervous system” to transport sensory information (from cameras and microphones) to the brain and deliver the brain’s responses back out (through speakers) [00:08:07].

How LiveKit Works with ChatGPT Voice

LiveKit’s process for voice AI interaction involves several steps [00:02:52]:

  1. SDK on Device: A LiveKit SDK on the user’s device accesses the camera and microphone [00:02:55].
  2. Audio Transmission: When the user speaks, the SDK captures the speech and sends it over LiveKit’s global Edge Network [00:03:07].
  3. Agent Processing: The audio reaches an “agent” (application server, e.g., built by OpenAI using LiveKit’s framework) [00:03:32].
  4. Traditional Voice Mode: In older voice modes, audio is converted to text, sent to an LLM, and the LLM’s text output is converted back to speech before being returned to the client device [00:04:14]; this cascaded path is sketched after this list.
  5. Advanced Voice (e.g., GPT-4o): For advanced voice, speech goes directly to the GPU machine via a real-time API (websocket connection) and is processed by models like GPT-4o, which are trained with joint embeddings of text and speech tokens [00:04:48]. Speech is then generated directly by GPT-4o and sent back through LiveKit’s network to the device [00:05:22].
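
To make the difference between these two paths concrete, here is a minimal Python sketch of the server-side agent. The transcribe, generate_reply, and synthesize stubs are hypothetical stand-ins for real STT, LLM, and TTS services, not LiveKit’s or OpenAI’s actual APIs, and the realtime path is summarized in comments only.

```python
import asyncio

# Minimal sketch of the two server-side paths described above. The
# transcribe/generate_reply/synthesize stubs are hypothetical stand-ins for
# real STT, LLM, and TTS services, not LiveKit's or OpenAI's actual APIs.

async def transcribe(audio: bytes) -> str:
    return "how does lightning work"                        # pretend STT output

async def generate_reply(prompt: str) -> str:
    return "Lightning is a giant electrostatic discharge."  # pretend LLM output

async def synthesize(text: str) -> bytes:
    return text.encode()                                    # pretend TTS audio

async def cascaded_voice_agent(user_audio: bytes) -> bytes:
    """Traditional voice mode: speech -> text -> LLM -> text -> speech."""
    text = await transcribe(user_audio)
    reply = await generate_reply(text)
    return await synthesize(reply)     # audio travels back over the edge network

# In the advanced-voice path there are no STT/TTS hops: the agent holds a
# single realtime (websocket) session with a speech-native model, streams the
# user's audio frames in, and streams model-generated audio frames back out.

if __name__ == "__main__":
    audio = asyncio.run(cascaded_voice_agent(b"...raw PCM frames..."))
    print(len(audio), "bytes of synthesized speech")
```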

Current and Emergent Use Cases of Voice AI

Personal Learning and Tutoring

One common use case is using advanced voice AI like ChatGPT voice as a personal tutor [00:01:53]. Users can ask any question, from frontier-tech topics like quantum theory and CRISPR to basic concepts like how lightning works [00:01:23]. This provides a non-judgmental environment for asking “dumb questions” and tapping into the world’s knowledge [00:02:14].

Creative Tools

Creative tools are expected to become more voice-based, interactive, and multimodal [00:09:18]. Users will act as “orchestrators” or “maestros,” shaping assets while the AI handles the mechanical work [00:09:56]. The interface could resemble Iron Man’s interaction with Jarvis, using natural voice commands to manipulate digital assets [00:09:29].

Telephony and Customer Support

A significant area for AI integration is the telephony/telecom space, where existing processes already run at massive scale [00:20:00]. Companies like Sierra and Parloa are disrupting this space by integrating AI to reduce customer support costs [00:20:41]. Billions of calls happen globally every month, making telephony one of the highest-volume targets for voice AI [00:21:05]. Examples include automated insurance eligibility lookups, where the AI can even call out to humans [00:30:51].

Robotics

Humanoid robotics will heavily rely on a split between cloud and on-device AI models [00:32:38]. While planning and reasoning might occur in the cloud, reflex actions and immediate movements must happen on-device for safety and responsiveness [00:32:57]. For example, a robot walking needs on-device processing to react instantly to hazards like an approaching car [00:33:51].
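
A rough sketch of how such a split might look in code, assuming a simple asynchronous control loop: cloud_plan, on_device_reflex, and all timings here are illustrative assumptions, not a real robotics stack.

```python
import asyncio
import random

# Illustrative control loop for the cloud/on-device split described above.
# cloud_plan, on_device_reflex, and all timings are assumptions, not a real
# robotics stack.

async def cloud_plan(goal: str) -> str:
    await asyncio.sleep(0.8)                       # network + large-model latency
    return f"walk to {goal} via the sidewalk"

def on_device_reflex(sensors: dict) -> str | None:
    if sensors.get("obstacle_distance_m", 99.0) < 1.0:
        return "stop"                              # immediate, purely local decision
    return None

async def control_loop(goal: str) -> None:
    plan_task = asyncio.create_task(cloud_plan(goal))
    plan = "stand still"                           # safe default until the cloud replies
    for tick in range(20):                         # ~1 second of a 20 Hz loop
        sensors = {"obstacle_distance_m": random.uniform(0.5, 5.0)}
        reflex = on_device_reflex(sensors)         # checked every tick, never waits on the cloud
        if plan_task.done():
            plan = plan_task.result()
        print(f"tick {tick:02d}: {reflex or plan}")
        await asyncio.sleep(0.05)

asyncio.run(control_loop("the crosswalk"))
```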

Autonomous Vehicles

Beyond robotics, self-driving cars like Tesla’s are considered a “marvel of technology” [00:42:19]. The experience of a car navigating complex environments autonomously highlights the visceral impact of integrated AI [00:42:05].

Video Games

The future of video games is expected to feature expansive open worlds populated with dynamic, lifelike characters that users can interact with using natural human inputs, particularly voice [00:43:47]. This could transform games into “Choose Your Own Adventure” experiences with infinite possibilities [00:43:58].

Evolution of AI Interfaces and Challenges

From Modes to Hybrid Modalities

Currently, AI interfaces often operate in distinct “modes” (e.g., voice mode in a car, text mode in a crowded space) [00:16:47]. In the future, modalities will blend more fluidly, much like human collaborative work [00:17:29]. That means combining computer vision (looking at a screen), typing, and spoken questions to an AI co-pilot, which might even take control of the keyboard to correct mistakes [00:17:57].

The “Thin Client” Dream

The concept of a “Thin Client” – where a powerful device isn’t needed locally – aligns with the evolution of AI interfaces [00:14:24]. Chat interfaces, which humans already use daily (texting, WhatsApp, Slack), could become the universal UI for all applications, integrating voice, on-the-fly generated UI, and text [00:15:48].

AI Gaining “Senses”

LiveKit’s development extends beyond hearing and speaking for AI. By enabling agents to control virtualized browser instances and stream them to users, AI is gaining the ability to “touch” applications, mimicking human interaction with touchscreens [00:28:46]. This allows AI to manipulate applications, fill out forms, and be unblocked by human input when stuck (e.g., for passwords or unclear choices) [00:27:14].
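
A minimal sketch of the form-filling and human-handoff pattern, using Playwright as a stand-in for the virtualized browser (LiveKit’s actual approach streams the browser to the user over its network); the URL, CSS selectors, and ask_human hook are hypothetical.

```python
from playwright.sync_api import sync_playwright

def ask_human(question: str) -> str:
    # Placeholder handoff channel; in a real agent this would surface the
    # question to the user (e.g., over the existing voice/chat session).
    return input(f"[agent is stuck] {question}: ")

def fill_signup_form(url: str, name: str, email: str) -> None:
    # The URL and CSS selectors used here are hypothetical examples.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.fill("#name", name)                    # the agent "touches" form fields
        page.fill("#email", email)
        password = ask_human("What password should I use?")  # unblock via human input
        page.fill("#password", password)
        page.click("button[type=submit]")
        browser.close()

# Example usage (against a hypothetical page):
# fill_signup_form("https://example.com/signup", "Ada", "ada@example.com")
```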

Latency and Systems Integration

Latency in conversational AI has improved drastically, from about 4 seconds to roughly 320 milliseconds with GPT-4o and LiveKit [00:23:48], approaching the ~300 milliseconds typical of human conversational turn-taking. The main hurdle to widespread adoption is now systems integration [00:25:01]. AI models aren’t perfect yet and still hallucinate, requiring human-in-the-loop oversight [00:22:52], and integrating with bespoke or obscure backend systems (e.g., Salesforce, ticket trackers) remains a significant challenge [00:25:40].
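
For intuition about where roughly 320 milliseconds can go in one conversational turn, here is a purely illustrative budget; the per-stage numbers are assumptions, and only the ~320 ms total and the ~300 ms turn-taking target come from the discussion above.

```python
# Purely illustrative latency budget for one conversational turn. Only the
# ~320 ms total and the ~300 ms human turn-taking target come from the
# discussion above; the per-stage numbers are assumptions.

BUDGET_MS = {
    "device capture + encode": 20,
    "edge network uplink": 40,
    "speech-to-speech inference (time to first audio)": 200,
    "edge network downlink": 40,
    "decode + playback start": 20,
}

total = sum(BUDGET_MS.values())
for stage, ms in BUDGET_MS.items():
    print(f"{stage:<50} {ms:>4} ms")
print(f"{'end-to-end':<50} {total:>4} ms  (human turn-taking ~300 ms)")
```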

On-Device vs. Cloud Inference

The balance between on-device and cloud-based AI inference remains a key consideration [00:32:09].

  • On-device: Crucial for real-time reflex actions, especially in robotics where immediate physical reactions are necessary [00:33:51]. It also suits privacy-sensitive use cases [00:36:45].
  • Cloud: Essential for accessing vast amounts of information (like a human pulling out a phone to look something up) or for complex reasoning that requires broad knowledge (e.g., troubleshooting a router) [00:34:27]. Without resource constraints, the ideal setup would run local and cloud models in parallel and deliver whichever usable answer arrives first, as in the sketch after this list [00:35:41]. Data will generally still be sent to the cloud for legal reasons or to update training data [00:36:36].
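
A minimal sketch of the “run both, deliver the fastest usable answer” idea, assuming hypothetical local_model and cloud_model stand-ins; a real system would also decide what still gets uploaded for compliance or to improve training data.

```python
import asyncio

# Sketch of racing a local model against a cloud model and returning the first
# usable answer. local_model and cloud_model are hypothetical stand-ins.

async def local_model(prompt: str) -> str | None:
    await asyncio.sleep(0.05)                  # fast, but limited knowledge
    return None if "router" in prompt else f"local answer to: {prompt}"

async def cloud_model(prompt: str) -> str:
    await asyncio.sleep(0.6)                   # slower, but broad knowledge
    return f"cloud answer to: {prompt}"

async def answer(prompt: str) -> str:
    pending = {asyncio.create_task(local_model(prompt)),
               asyncio.create_task(cloud_model(prompt))}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            result = task.result()
            if result is not None:             # first *usable* answer wins
                for t in pending:
                    t.cancel()
                return result
    return "no model produced an answer"

print(asyncio.run(answer("how do I troubleshoot my router?")))
```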

Slowdown in AI Penetration

Despite rapid advancements, the penetration of AI, particularly contemporary forms like ChatGPT, into daily work and consumer habits has been slower than anticipated [00:40:02]. Many people have heard of AI but do not actively use it [00:40:26].

Underhyped AI Technology

Spiking Neural Networks (SNNs) are considered an “underhyped” and under-researched area of AI [00:37:33]. These analog neural networks are modeled more closely on how neurons in the human brain interact [00:37:43]. While harder to train than Transformers, they hold significant promise for processing audio and video signals [00:38:13].
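
For a sense of what “modeled after the brain’s neurons” means in practice, here is a toy leaky integrate-and-fire neuron, the basic building block of an SNN; the constants are arbitrary and this is not a trainable network.

```python
# Toy leaky integrate-and-fire neuron, the basic unit of a spiking neural
# network. Constants are arbitrary; this illustrates the behavior, it is not
# a trainable SNN.

def lif_neuron(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Integrate input over time, leak toward rest, and emit discrete spikes."""
    v, spikes = 0.0, []
    for i in input_current:
        v += dt * (-v / tau + i)     # leaky integration of the incoming signal
        if v >= v_thresh:            # membrane potential crosses the threshold
            spikes.append(1)
            v = v_reset              # reset after firing
        else:
            spikes.append(0)
    return spikes

# A steady input produces a regular spike train; audio samples or video-frame
# features could be fed in the same way, which is why SNNs look promising for
# those signals.
print(lif_neuron([0.08] * 30))
```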