From: aidotengineer

ChatGPT has become one of the fastest-growing applications in history, with hundreds of millions of daily users, yet its interface remains complex and confusing [00:00:00] [00:00:06]. A key area for improvement is multimodal interaction, particularly the seamless integration of voice and text.

Current Challenges in Multimodal AI Systems

Current iterations of AI applications, such as ChatGPT, often present a disjointed user experience when attempting multimodal interaction.

Disconnected Voice and Text Functionality

When interacting with ChatGPT by voice, the system offers two distinct buttons: a voice-to-text option and a voice-to-voice option [00:00:20]. In a voice call, the system can produce well-worded content, such as an email draft, but it can only deliver it aloud [00:01:01]. Users cannot collaborate on the written output directly [00:01:03]; the only way to get a written version is to end the voice call and locate the formatted transcript [00:01:07].

This segregated approach makes it feel as though the voice and text functionalities were developed by two different companies [00:01:17].

“Shipping the Org Chart”

Scott Hanselman coined the term “shipping the org chart” for this phenomenon, in which a product’s user interface reflects the company’s internal structure rather than a cohesive user experience [00:01:20]. His example is an electric vehicle dashboard whose displays (map, climate, speedometer) use different fonts because they are in fact separate Android tablets chained together, exposing the company’s internal divisions [00:01:26]. OpenAI is similarly “guilty” of this: technical improvements ship without integrated marketing or user experience design, resulting in a “science fair full of potential options” rather than a unified product [00:01:47].

Proposed Solutions for Enhanced User Experience

To address these challenges and improve user experience design with AI, two key changes are proposed for applications like ChatGPT:

  1. Allowing simultaneous voice and text interaction [00:02:11].
  2. Smartly choosing the right model depending on the user’s request [00:02:16].

These improvements can be implemented using off-the-shelf tools [00:02:19].

Integrated Multimodal Interaction

A redesigned interface would feature a voice button that transitions from a dormant state to a voice mode, similar to existing functionality [00:02:40]. Crucially, it would include a new “chat” button alongside mute and end call options [00:02:48]. This chat button would pull up an iMessage-like panel, allowing users to text while on a voice call, much like texting a friend during a FaceTime call [00:02:51]. This panel could display call controls, a reminder of previous questions, and a text response for details like email drafts [00:02:59].

This approach embodies truly multimodal interaction: users can fluidly switch between communication modalities, or combine them, within a single conversation.
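To make this concrete, here is a minimal sketch of the state such a call screen could share between modes. All names are hypothetical, not taken from the talk; the point is that voice and text append to one shared transcript rather than living in separate features:

```typescript
// Minimal sketch of the proposed call-screen state (hypothetical names).
// The key idea: voice and chat are not separate screens but one session
// with a toggleable text panel.

type CallState = "dormant" | "connecting" | "live";

interface Message {
  role: "user" | "assistant";
  modality: "voice" | "text"; // how it was produced, not where it lives
  content: string;
}

interface MultimodalSession {
  call: CallState;
  chatPanelOpen: boolean; // the iMessage-like panel over the call UI
  transcript: Message[];  // shared history for voice *and* text
}

// Both input paths append to the same transcript, so an email draft the
// model "speaks" can also arrive as editable text in the panel.
function addTextMessage(s: MultimodalSession, content: string): MultimodalSession {
  return {
    ...s,
    chatPanelOpen: true,
    transcript: [...s.transcript, { role: "user", modality: "text", content }],
  };
}
```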

Intelligent Model Selection and Information Delivery

The system should intelligently select the appropriate AI model or tool based on the complexity or nature of the user’s request.

  • Handling Detailed Requests:

    • For requests requiring longer details, such as links or drafts, the system could utilize “tool calls” to send text directly [00:02:25].
    • The “send chat message” tool can be used to send details that are easier to explain via text [00:04:37]. A simple tool description is enough: given one, the model intelligently decides when to send the right information as text [00:04:47] (see the tool-definition sketch after this list).
  • Complex Reasoning with Specialized Models:

    • For more complex questions, a research tool could hand off the query to a “smarter” reasoning model [00:02:33].
    • Warp, an AI-powered terminal for developers, demonstrates this pattern: its coding agent directly handles simple commands like “undo my last commit” [00:03:13], but for a complex task like “refactor this entire codebase to use Flutter,” it detects the complexity and first writes a plan with a reasoning model to ensure the code keeps working [00:03:30].
    • This “reasoning model” pattern can be applied using heuristics; for instance, if a user asks for details, pros, and cons, the system could hand off to a reasoning model, indicate it’s processing, and then return a more detailed response [00:03:44].
    • A dedicated tool for reasoning models can be used when a user wants to “go deeper” on a topic, sending the details to that model and then responding back to the user or directly to the client [00:04:57] (see the handler sketch after this list).
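To make the tool-call pattern concrete, here is a minimal sketch of how the two tools might be described to the model, in the function-tool shape the OpenAI Realtime API accepts. The names (`send_chat_message`, `deep_research`) and descriptions are illustrative assumptions, not the talk’s exact definitions:

```typescript
// Hypothetical tool definitions in the Realtime API's function-tool shape.
// The descriptions do the heavy lifting: they tell the model *when* text is
// the better channel, and the model picks the tool on its own.
const tools = [
  {
    type: "function",
    name: "send_chat_message",
    description:
      "Send details that are easier to read than to hear (links, email " +
      "drafts, lists) to the user's chat panel while the voice call continues.",
    parameters: {
      type: "object",
      properties: {
        message: { type: "string", description: "Markdown text to display." },
      },
      required: ["message"],
    },
  },
  {
    type: "function",
    name: "deep_research",
    description:
      "Hand off a complex question (pros and cons, comparisons, requests to " +
      "'go deeper') to a slower but smarter reasoning model.",
    parameters: {
      type: "object",
      properties: {
        question: { type: "string", description: "The question to research." },
      },
      required: ["question"],
    },
  },
];
```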
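A matching handler could then route those calls. This is a sketch under stated assumptions: `renderInChatPanel` and `tellUser` stand in for app-specific UI hooks, and the reasoning model is assumed to be reachable through the standard Chat Completions endpoint:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

declare function renderInChatPanel(markdown: string): void; // hypothetical UI hook
declare function tellUser(text: string): void; // spoken filler so the call isn't silent

// Route tool calls from the realtime model. Simple requests never leave it;
// "deep_research" hands off to a reasoning model and returns its answer.
async function handleToolCall(
  name: string,
  args: { message?: string; question?: string }
): Promise<string> {
  switch (name) {
    case "send_chat_message":
      renderInChatPanel(args.message ?? "");
      return "sent";
    case "deep_research": {
      tellUser("Give me a moment to think about that...");
      const res = await openai.chat.completions.create({
        model: "o1", // any reasoning model would work here
        messages: [{ role: "user", content: args.question ?? "" }],
      });
      return res.choices[0].message.content ?? "";
    }
    default:
      return `unknown tool: ${name}`;
  }
}
```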

Off-the-Shelf Tools

These improvements can be built using existing APIs. Live audio chat can run on OpenAI’s GPT-4o Realtime API [00:02:22], and the overall system can then rely on tool calls to manage interactions [00:02:25].
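Wiring the tool definitions into a live session is then a small amount of glue. This sketch uses the WebSocket flavor of the Realtime API; the event names follow the beta API reference and should be checked against the current documentation:

```typescript
import WebSocket from "ws";

declare const tools: object[]; // the tool definitions sketched earlier

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

// Register the tools once the session opens; "tool_choice: auto" lets the
// model decide between speaking, texting, and handing off.
ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: { tools, tool_choice: "auto" },
    })
  );
});
```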

The source code for these proposed solutions is available on GitHub under “fix gpt” [00:05:13].