From: aidotengineer
ChatGPT, despite being one of the fastest-growing applications in history, presents a confusing user experience, particularly in its interaction methods [00:00:00], [00:00:08]. While it offers both voice-to-text and voice-to-voice options, these feel like separate applications built by different companies [00:00:20], [00:01:18]. This phenomenon, termed “shipping the org chart” by Scott Hanselman, occurs when internal organizational structures are reflected in a disjointed product experience [00:01:22], [00:01:42]. The result is a “science fair full of potential options” that lack cohesion [00:01:59].
Enhancing Multimodal Interaction
To address these issues, two key changes are proposed for apps like ChatGPT:
- Concurrent Voice and Text Interaction: allowing users to interact using both voice and text simultaneously [00:02:14].
- Intelligent Model Selection: smartly choosing the appropriate underlying AI model based on the user’s request [00:02:16].
These improvements can be achieved using off-the-shelf tools and AI-native development patterns [00:02:20]. For instance, “GPT-4o Realtime” can provide the live audio chat, while tool calls can manage the rest [00:02:23], [00:02:26]. This allows the AI to send text for longer details like links or drafts, or to hand off complex requests to a smarter model for a more detailed response [00:02:29], [00:02:34], as sketched below.
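A minimal sketch of what those tool definitions could look like, using OpenAI-style function-calling JSON schemas. The tool names (`send_chat_message`, `handoff_to_reasoning_model`) and their descriptions are illustrative assumptions, not the talk’s exact code:

```typescript
// Hypothetical tool definitions exposed to a realtime voice session.
// The voice model decides when to call each one based on its description.
const tools = [
  {
    type: "function",
    name: "send_chat_message",
    description:
      "Send long-form details (links, drafts, lists) to the user's text panel " +
      "instead of reading them aloud.",
    parameters: {
      type: "object",
      properties: {
        markdown: { type: "string", description: "Text to show in the chat panel" },
      },
      required: ["markdown"],
    },
  },
  {
    type: "function",
    name: "handoff_to_reasoning_model",
    description:
      "Delegate a complex request that needs planning or detailed pros and cons " +
      "to a slower, smarter reasoning model.",
    parameters: {
      type: "object",
      properties: {
        request: { type: "string", description: "The user's request, restated" },
      },
      required: ["request"],
    },
  },
];
```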
Redesigned User Interface
An improved interface for multimodal AI interaction would feature a dedicated voice button that transitions into a voice mode [00:02:40], [00:02:42]. Within this mode, alongside standard call controls (mute, end call), a new “chat” button would pull up a panel similar to a messaging app like iMessage [00:02:49], [00:02:51]. This lets users text while still on the voice call, with call controls at the top, a reminder of the initial query, and a text response area for detailed outputs like email drafts [00:02:56], [00:03:00].
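As a rough illustration of that layout, here is a hypothetical React/TypeScript component for the voice mode with an optional chat panel; all component, prop, and class names are assumptions made for the sketch:

```tsx
import { useState } from "react";

type Message = { role: "user" | "assistant"; text: string };

// Voice mode with call controls on top and an optional chat panel below.
export function VoiceMode({
  initialQuery,
  messages,
}: {
  initialQuery: string;
  messages: Message[];
}) {
  const [chatOpen, setChatOpen] = useState(false);

  return (
    <div className="voice-mode">
      {/* Standard call controls stay visible while texting */}
      <header>
        <button>Mute</button>
        <button>End call</button>
        <button onClick={() => setChatOpen((open) => !open)}>Chat</button>
      </header>

      {chatOpen && (
        <section className="chat-panel">
          {/* Reminder of what the user originally asked by voice */}
          <p className="initial-query">{initialQuery}</p>
          {/* Text responses (links, email drafts) land here while the call continues */}
          {messages.map((m, i) => (
            <p key={i} className={m.role}>
              {m.text}
            </p>
          ))}
        </section>
      )}
    </div>
  );
}
```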
Adaptive AI Model Selection
The system’s ability to adapt to the complexity of a user’s query is crucial when building user experiences with AI [00:03:10]. For instance, in a developer tool like Warp Terminal, simple requests (e.g., “undo my last commit”) can be handled by a specific coding agent that runs commands [00:03:14], [00:03:21]. However, more complex questions (e.g., “refactor this entire codebase to use Flutter”) would trigger a reasoning model to generate a plan and ensure the code still works [00:03:30], [00:03:37]. Using heuristics, this pattern can hand off to a reasoning model for tasks that call for detailed pros and cons, indicate how long the model is thinking, and return a comprehensive response [00:03:44].
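A toy sketch of such a heuristic router; the keyword list and thresholds are invented for illustration and are not taken from Warp’s implementation:

```typescript
type Route = "coding_agent" | "reasoning_model";

// Naive heuristic: short, single-step commands go to the fast coding agent;
// anything that implies planning across many files goes to a reasoning model.
function routeRequest(request: string): Route {
  const planningKeywords = ["refactor", "entire", "architecture", "migrate", "pros and cons"];
  const needsPlanning =
    planningKeywords.some((k) => request.toLowerCase().includes(k)) ||
    request.split(/\s+/).length > 30;
  return needsPlanning ? "reasoning_model" : "coding_agent";
}

// Usage: "undo my last commit"                          -> "coding_agent"
//        "refactor this entire codebase to use Flutter" -> "reasoning_model"
```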
Implementation with APIs
Multimodal models and multimodal AI agents can be implemented with off-the-shelf APIs [00:03:57]. In one example, a user asks by voice for a link to a park, then asks about its history; the AI responds with the link in text and a deeper historical overview, also in text [00:04:00], [00:04:10], [00:04:18], [00:04:25]. This relies on a “send chat message” tool, which sends details via text when appropriate based purely on its description, without complex system prompts [00:04:37], [00:04:46]. Similarly, for reasoning models, another tool is invoked when a user wants to “go deeper” on a topic; it sends the details to the reasoning model and lets it respond or dump information directly into the client [00:04:57], [00:05:01]. The source code for such an implementation is available on GitHub under “fix gpt” [00:05:13].
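Building on the tool definitions sketched earlier, a hypothetical client-side dispatcher could route each tool call either to the chat panel or to a reasoning model. The helper `appendToChatPanel` and the chosen model name are assumptions, not the repository’s actual code:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Stand-in for the real chat-panel renderer in the client.
function appendToChatPanel(markdown: string): void {
  console.log("[chat panel]", markdown);
}

// Called whenever the realtime voice session emits a tool call.
export async function handleToolCall(
  name: string,
  args: { markdown?: string; request?: string },
): Promise<string> {
  if (name === "send_chat_message") {
    // Long-form details (links, drafts) go to the text panel, not the voice stream.
    appendToChatPanel(args.markdown ?? "");
    return "sent";
  }

  if (name === "handoff_to_reasoning_model") {
    // "Go deeper" requests are forwarded to a slower, smarter model and the
    // result is dumped straight into the chat panel.
    const response = await openai.chat.completions.create({
      model: "o1", // assumed reasoning model; substitute whatever is available
      messages: [{ role: "user", content: args.request ?? "" }],
    });
    appendToChatPanel(response.choices[0].message.content ?? "");
    return "handed off to reasoning model";
  }

  return "unknown tool";
}
```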