From: aidotengineer
While ChatGPT is one of the fastest-growing applications in history, with hundreds of millions of daily users, its user experience can be confusing [00:00:00]. A key issue is the disjointed interaction between its voice and text functionalities, making them feel as if they were developed by separate companies [00:01:18]. This phenomenon, termed “shipping the org chart” by Scott Hanselman, describes how internal organizational structures can inadvertently manifest as fragmented user experiences [00:01:20].
Identifying Key User Experience Issues
The current ChatGPT interface presents two separate buttons for voice interaction: a voice-to-text option and a voice-to-voice option [00:00:20]. The voice interface can respond to prompts like writing an email [00:00:41], but it can only respond through voice [00:01:03]. To collaborate on a written email, users must end the call and look for the voice transcript, which often has formatting applied only at the end [00:01:07]. An ideal experience would be multimodal, combining text and voice seamlessly [00:01:14]. Without that cohesion, the product resembles a “science fair full of potential options” rather than a unified product [00:01:59].
Proposed Enhancements for AI Applications
Two primary changes can significantly improve the user experience:
- Simultaneous Voice and Text Interaction: Allowing users to interact using both voice and text at the same time [00:02:14].
- Intelligent Model Selection: Automatically choosing the most appropriate AI model based on the user’s query [00:02:16].
Leveraging Off-the-Shelf Tools
These enhancements can be achieved using off-the-shelf tools and APIs [00:02:20]. For instance, “4o Realtime” (OpenAI's GPT-4o realtime model) can facilitate live audio chat, while tool calls can manage the rest [00:02:23].
- Sending Text Details: An application can be designed to send text for longer details such as links and drafts [00:02:29].
- Smarter Model Handoff: A research tool could hand off complex queries to a more capable model to generate a detailed answer [00:02:34].
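The two tool calls listed above could be declared roughly as follows. This is a minimal sketch, not the talk's actual code: the tool names `send_chat_message` and `research_topic`, their parameters, and the descriptions are all assumptions made for illustration, and the exact envelope a given realtime API expects for tool definitions may differ.

```python
import json

# Hypothetical tool definitions written as JSON-Schema-style dicts.
# The source only says that text details and research hand-offs are
# exposed as tool calls; every name and field below is an assumption.
SEND_CHAT_MESSAGE = {
    "type": "function",
    "name": "send_chat_message",
    "description": (
        "Send text that is easier to read than to hear, "
        "such as links and drafts."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Message body shown in the chat panel.",
            },
        },
        "required": ["text"],
    },
}

RESEARCH_TOPIC = {
    "type": "function",
    "name": "research_topic",
    "description": (
        "Hand a complex question to a more capable model "
        "and return a detailed written answer."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The question to research in depth.",
            },
        },
        "required": ["question"],
    },
}

TOOLS = [SEND_CHAT_MESSAGE, RESEARCH_TOPIC]


def tool_names(tools: list) -> list:
    """Return the declared tool names in order."""
    return [tool["name"] for tool in tools]


def tools_as_json(tools: list) -> str:
    """Serialize the tool list, e.g. for a session configuration payload."""
    return json.dumps(tools, indent=2)
```

Keeping each description to a single sentence mirrors the talk's point that short tool descriptions can be enough to steer the voice model.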
Enhanced User Interface Concept
Imagine an updated interface where a voice button transitions the app to voice mode, complete with mute, end call, and a new “chat” button [00:02:40]. This chat button would reveal a panel similar to iMessage, allowing users to text while on a call, with call controls at the top, a reminder of past queries, and a text response area for detailed outputs like email drafts [00:02:51].
Handling Complex Queries with Reasoning Models
For queries requiring more detail, a “reasoning model” pattern can be employed. This concept is explored in developer tools like Warp Terminal, which enables writing code in any environment [00:03:14].
- Simple Actions: For simple tasks, such as “undo my last commit,” the system hands off to a coding agent that runs commands in the terminal [00:03:21].
- Complex Actions: For complex requests, like “refactor this entire codebase to use Flutter instead,” the system detects complexity and uses a reasoning model to formulate a plan, ensuring the code functions correctly [00:03:30].
This pattern relies on heuristics: when the user asks for details or for pros and cons, the system hands off to a reasoning model, indicates that it is thinking, and then returns a comprehensive response [00:03:44].
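One way such a heuristic could look is sketched below. The cue phrases, the word-count threshold, and the model labels are all illustrative assumptions; the talk does not specify which heuristics it uses.

```python
import re
from dataclasses import dataclass

# Hypothetical model labels; substitute whatever fast and reasoning
# models are actually available.
FAST_MODEL = "fast-voice-model"
REASONING_MODEL = "reasoning-model"

# Illustrative cues only: phrases suggesting the user wants depth,
# trade-offs, or a multi-step plan.
DEPTH_CUES = re.compile(
    r"\b(pros and cons|trade-?offs?|in detail|step by step|compare|refactor|plan)\b",
    re.IGNORECASE,
)


@dataclass
class Route:
    model: str
    show_thinking: bool  # whether the UI should indicate thinking time


def route_query(query: str, max_simple_words: int = 40) -> Route:
    """Pick a model from simple surface cues in the query text."""
    is_long = len(query.split()) > max_simple_words
    needs_depth = bool(DEPTH_CUES.search(query)) or is_long
    if needs_depth:
        return Route(model=REASONING_MODEL, show_thinking=True)
    return Route(model=FAST_MODEL, show_thinking=False)
```

With these assumed cues, "undo my last commit" stays on the fast path, while "refactor this entire codebase to use Flutter instead" is routed to the reasoning model with a thinking indicator. In practice the voice model itself can make this decision by choosing whether to call a research tool, so a keyword router like this is only one possible stand-in.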
Practical Implementation with Off-the-Shelf APIs
Building these features with off-the-shelf APIs is straightforward [00:03:57]. For instance, when asked for a park link and then its history, the system can provide a link and then elaborate on the history, prompting the user to “check the chat for more details” [00:04:08].
A “send chat message” tool can be used to send details that are more easily explained via text [00:04:37]. This can be achieved with simple tool descriptions, without extensive system prompts, demonstrating the power of simple prompts in modern AI development [00:04:46]. For reasoning models, another tool can be used to go deeper into a topic: it sends the details to the reasoning model, which can either respond or write its output directly into the client [00:04:57].
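On the client side, handling these two tool calls might look like the sketch below. It assumes tools named `send_chat_message` and `research_topic` (hypothetical names), arguments delivered as a JSON string, and a `research` callable standing in for the call to a reasoning model; none of these details come from the source.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChatPanel:
    """Stand-in for the text panel shown alongside the voice call."""
    messages: list = field(default_factory=list)

    def post(self, text: str) -> None:
        self.messages.append(text)


def handle_tool_call(
    name: str,
    arguments_json: str,
    panel: ChatPanel,
    research: Callable[[str], str],
) -> str:
    """Run one tool call and return a short result for the voice model.

    `research` is a placeholder for a request to a reasoning model.
    """
    args = json.loads(arguments_json)
    if name == "send_chat_message":
        panel.post(args["text"])
        return "Message sent to chat."
    if name == "research_topic":
        # Write the long answer straight into the client and give the
        # voice model only a brief note it can speak aloud.
        panel.post(research(args["question"]))
        return "Detailed answer posted; ask the user to check the chat."
    raise ValueError(f"unknown tool: {name}")
```

A quick usage example under the same assumptions:

```python
panel = ChatPanel()
handle_tool_call(
    "send_chat_message",
    '{"text": "https://example.com/park"}',
    panel,
    research=lambda question: f"(long answer about: {question})",
)
print(panel.messages)
```

Returning only a short string to the voice model keeps the spoken reply brief while the long-form content lands in the chat panel.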
The source code for these enhancements is available on GitHub under “fix gpt” [00:05:13].