From: aidotengineer

Despite being one of the fastest-growing applications in history with hundreds of millions of daily users, ChatGPT presents significant design challenges that lead to a confusing user experience [00:00:00].

Current Design Problems

The primary issue stems from the disjointed integration of its functionalities, particularly between voice and text interactions.

Disconnected Voice and Text Interfaces

ChatGPT features separate buttons for voice-to-text and voice-to-voice interaction [00:00:20]. While it can respond through voice, collaborating on a written output such as an email requires ending the call and digging up the voice transcript [00:01:03]. This lack of true multimodal interaction makes it feel as if “these two apps were built by two different companies” [00:01:18].

“Shipping the Org Chart”

This design flaw is analogous to what Scott Hanselman describes as “shipping the org chart” [00:01:21]. He illustrates this with an electric vehicle in which the map, climate controls, and speedometer each use different fonts and look like three separate Android tablets chained together, revealing the internal organizational structure rather than a unified product [00:01:26]. OpenAI is similarly “guilty of this” [00:01:48]: technical improvements are shipped without marketing consultation, resulting in a “science fair full of potential options” [00:01:59] rather than a cohesive user experience.

Proposed Solutions and Improvements

To address these issues, two key changes are proposed:

  1. Allowing simultaneous voice and text interaction [00:02:11].
  2. Smartly choosing the appropriate model based on the user’s request [00:02:16].

These improvements can be achieved using “off-the-shelf tools” [00:02:20].

Integrated Multimodal Experience

A redesigned interface could feature a voice button that activates a voice mode, operating similarly to existing voice assistants but with integrated controls for mute, end call, and a new “chat” button [00:02:40]. This “chat” button would pull up an iMessage-like panel, allowing users to text while on a call, with call controls at the top and text responses for detailed information or email drafts [00:02:51].
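As a rough illustration of how this could work with off-the-shelf tools, the sketch below opens a single GPT-4o Realtime session with both audio and text modalities enabled, so a message typed in the chat panel lands in the same conversation as the live call. The endpoint, event shapes, and the websockets keyword argument reflect one version of the Realtime API and may differ in newer releases.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def start_call() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer websockets releases rename extra_headers -> additional_headers.
    async with websockets.connect(REALTIME_URL, extra_headers=headers) as ws:
        # Enable BOTH modalities so the model can answer on the call in voice
        # and in the chat panel in text, instead of one or the other.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["text", "audio"]},
        }))
        # A message typed in the chat panel mid-call becomes a normal
        # conversation item in the same session as the live audio.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text",
                             "text": "Draft that email we just discussed."}],
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.text.delta":
                print(event["delta"], end="", flush=True)  # stream into chat panel

asyncio.run(start_call())
```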

Smart Model Selection

The system could intelligently hand off tasks to different models based on complexity (a sketch follows the list):

  • Simple Requests: For straightforward commands like “undo my last commit,” a coding agent could directly run commands [00:03:21].
  • Complex Questions: For detailed requests like “refactor this entire codebase to use Flutter instead,” the system could detect complexity and engage a “reasoning model” to write a plan, ensuring the code functions correctly [00:03:30]. This pattern is effective when asking for details or pros and cons, allowing the system to take longer to think and provide a more comprehensive response [00:03:44].
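A minimal sketch of that handoff, assuming a fast model acts as the complexity classifier and an o1-class model as the planner; the model names and the run_coding_agent helper are illustrative, not from the talk:

```python
from openai import OpenAI

client = OpenAI()

def run_coding_agent(instructions: str) -> str:
    # Stub standing in for a real coding agent that would run commands.
    return f"[agent executing] {instructions[:80]}"

def is_complex(request: str) -> bool:
    # Cheap one-word verdict from a fast model; any heuristic would do here.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer YES if this request needs multi-step planning, else NO."},
            {"role": "user", "content": request},
        ],
    )
    return "YES" in verdict.choices[0].message.content.upper()

def handle(request: str) -> str:
    if not is_complex(request):
        # Simple request, e.g. "undo my last commit": run it directly.
        return run_coding_agent(request)
    # Complex request, e.g. "refactor this codebase to use Flutter":
    # have a reasoning model write a plan first, then hand that plan off.
    plan = client.chat.completions.create(
        model="o1",  # any reasoning-class model
        messages=[{"role": "user",
                   "content": f"Write a step-by-step plan for: {request}"}],
    ).choices[0].message.content
    return run_coding_agent(plan)

print(handle("undo my last commit"))
```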

Technical Implementation

Implementing these improvements can leverage existing APIs:

  • Live Audio Chat: Handled by the GPT-4o Realtime API [00:02:23].
  • Tool Calls: Can handle various tasks, including sending text for longer details, links, or drafts [00:02:26].
  • “send chat message” tool: Used for sending details that are easier to convey via text [00:04:37]. Its tool description is what lets the model decide which information is better delivered as text [00:04:46].
  • “reasoning model” tool: Activated when a user wants to delve deeper into a topic, letting the system gather details and either compose a thorough response or stream it directly into the chat client [00:04:57].
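Reconstructing those two tools in the flat function schema the Realtime API expects inside session.update might look like the following; the exact names, descriptions, and parameters are assumptions based on the talk's summary:

```python
# Hedged sketch of the two tool definitions described above. Names,
# descriptions, and parameters are reconstructed, not copied from the repo.
SESSION_TOOLS = [
    {
        "type": "function",
        "name": "send_chat_message",
        "description": (
            "Send details that are easier to read than to hear: links, "
            "email drafts, code, or long lists. Use while staying on the call."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string",
                         "description": "Markdown message for the chat panel."}
            },
            "required": ["text"],
        },
    },
    {
        "type": "function",
        "name": "reasoning_model",
        "description": (
            "Escalate when the user wants depth: pros and cons, plans, or a "
            "detailed write-up. Takes longer to think and can stream the "
            "answer straight into the chat panel."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "prompt": {"type": "string",
                           "description": "Full question plus relevant context."}
            },
            "required": ["prompt"],
        },
    },
]

# Attached when configuring the session, e.g.:
# await ws.send(json.dumps({"type": "session.update",
#                           "session": {"modalities": ["text", "audio"],
#                                       "tools": SESSION_TOOLS}}))
```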

The source code for a potential solution is available on GitHub under “fix gpt” [00:05:13].