Using off-the-shelf tools for app enhancement

From: aidotengineer

While ChatGPT is one of the fastest-growing applications in history, with hundreds of millions of daily users, its user experience can be confusing [00:00:00]. A key issue is the disjointed interaction between its voice and text functionalities, making them feel as if they were developed by separate companies [00:01:18]. This phenomenon, termed “shipping the org chart” by Scott Hanselman, describes how internal organizational structures can inadvertently manifest as fragmented user experiences [00:01:20].

Identifying Key User Experience Issues

The current ChatGPT interface presents two separate buttons for voice interaction: a voice-to-text option and a voice-to-voice option [00:00:20]. The voice interface can handle prompts such as writing an email [00:00:41], but it can only respond through voice [00:01:03]. To work on the written email, users must end the call and find the draft in the voice transcript, where formatting is often applied only at the end [00:01:07]. An ideal experience would be multimodal, combining text and voice seamlessly [00:01:14]. Instead, the product resembles a “science fair full of potential options” more than a unified product [00:01:59].

Proposed Enhancements for AI Applications

Two primary changes can significantly improve the user experience:

Simultaneous Voice and Text Interaction: Allowing users to interact using both voice and text at the same time [00:02:14].
Intelligent Model Selection: Automatically choosing the most appropriate AI model based on the user’s query [00:02:16].

Leveraging Off-the-Shelf Tools

These enhancements can be built with off-the-shelf tools and APIs [00:02:20]. For instance, 4o Realtime can handle the live audio chat, while tool calls handle the rest [00:02:23].

Sending Text Details: An application can be designed to send text for longer details such as links and drafts [00:02:29].
Smarter Model Handoff: A research tool could hand off complex queries to a more capable model to generate a detailed answer [00:02:34].
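Below is a minimal sketch of how the text-sending and hand-off tools might be registered with a realtime voice session. It assumes the OpenAI Realtime API's `session.update` client event and its flat function-tool format (`type`, `name`, `description`, `parameters`). The tool names `send_chat_message` and `ask_reasoning_model` are hypothetical, and field names may differ between API versions, so check them against the current reference.

```python
import json


def build_session_update(instructions: str) -> dict:
    """Build a Realtime-style session.update event that registers two function tools."""
    # Hypothetical tool: pushes text (links, drafts) into the user's chat panel.
    send_chat_message = {
        "type": "function",
        "name": "send_chat_message",
        "description": (
            "Send text to the user's chat panel. Use for links, drafts, "
            "and other details that are easier to read than to hear."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string", "description": "Message to show in chat."}
            },
            "required": ["text"],
        },
    }
    # Hypothetical tool: hands a complex question to a more capable model.
    ask_reasoning_model = {
        "type": "function",
        "name": "ask_reasoning_model",
        "description": (
            "Hand off a question that needs detail, pros and cons, or planning "
            "to a more capable reasoning model."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string", "description": "The full question to research."}
            },
            "required": ["question"],
        },
    }
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "tools": [send_chat_message, ask_reasoning_model],
            "tool_choice": "auto",  # let the voice model decide when to call a tool
        },
    }


if __name__ == "__main__":
    event = build_session_update("You are a helpful voice assistant.")
    # This JSON string is what a client would send over the Realtime WebSocket.
    print(json.dumps(event, indent=2))
```

The short tool descriptions do most of the work here. The voice model reads them to decide when a reply belongs in text rather than speech.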

Enhanced User Interface Concept

Imagine an updated interface where a voice button transitions the app to voice mode, complete with mute, end call, and a new “chat” button [00:02:40]. This chat button would reveal a panel similar to iMessage, allowing users to text while on a call, with call controls at the top, a reminder of past queries, and a text response area for detailed outputs like email drafts [00:02:51].
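As an illustration only, the call screen described above could be backed by a small client-side state model. The class and method names below are hypothetical and do not come from the talk's source code.

```python
from dataclasses import dataclass, field


@dataclass
class CallScreen:
    """Hypothetical state for a voice-call screen with an optional chat panel."""

    in_call: bool = False
    muted: bool = False
    chat_open: bool = False
    past_queries: list[str] = field(default_factory=list)    # reminder of what the user asked
    text_responses: list[str] = field(default_factory=list)  # detailed outputs such as email drafts

    def start_call(self) -> None:
        self.in_call = True

    def toggle_mute(self) -> None:
        if self.in_call:
            self.muted = not self.muted

    def toggle_chat(self) -> None:
        # The chat panel is only available while a call is active.
        if self.in_call:
            self.chat_open = not self.chat_open

    def end_call(self) -> None:
        # Ending the call resets the controls but keeps the conversation history.
        self.in_call = False
        self.muted = False
        self.chat_open = False

    def record_query(self, query: str) -> None:
        # Spoken or typed queries both land in the same history.
        self.past_queries.append(query)

    def show_text(self, text: str) -> None:
        self.text_responses.append(text)
```

Keeping the history outside the call state means the user can reopen the panel later and still see drafts produced during an earlier call.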

Handling Complex Queries with Reasoning Models

For queries requiring more detail, a “reasoning model” pattern can be employed. This concept is explored in developer tools like Warp Terminal, which enables writing code in any environment [00:03:14].

Simple Actions: For simple tasks, such as “undo my last commit,” the system hands off to a coding agent that runs commands in the terminal [00:03:21].
Complex Actions: For complex requests, like “refactor this entire codebase to use Flutter instead,” the system detects complexity and uses a reasoning model to formulate a plan, ensuring the code functions correctly [00:03:30].

Using simple heuristics, the system hands off to a reasoning model when the user asks for details or pros and cons, signals that it is taking time to think, and then returns a comprehensive response [00:03:44].
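The talk does not spell out which heuristics are used. Below is a minimal sketch that assumes keyword cues plus a length threshold. The patterns are illustrative guesses, and `fast_agent` and `reasoning_model` are placeholder handler names, not real model IDs.

```python
import re

# Illustrative cue phrases suggesting the user wants depth rather than a quick answer.
COMPLEX_PATTERNS = [
    re.compile(p)
    for p in (
        r"\bin detail\b",
        r"\bmore details?\b",
        r"\bpros and cons\b",
        r"\bcompare\b",
        r"\brefactor\b",
        r"\bentire codebase\b",
    )
]


def needs_reasoning_model(query: str, word_limit: int = 40) -> bool:
    """Return True when a simple heuristic flags the query as complex."""
    text = query.lower()
    if any(p.search(text) for p in COMPLEX_PATTERNS):
        return True
    # Very long requests are also treated as complex.
    return len(re.findall(r"\w+", text)) > word_limit


def route(query: str) -> str:
    """Pick a handler name for the query (placeholder names, not model IDs)."""
    return "reasoning_model" if needs_reasoning_model(query) else "fast_agent"
```

With these cues, "undo my last commit" routes to `fast_agent`, while "refactor this entire codebase to use Flutter instead" routes to `reasoning_model`. A real client would also show a "thinking" indicator while the reasoning model runs.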

Practical Implementation with Off-the-Shelf APIs

Building these features with off-the-shelf APIs is straightforward [00:03:57]. For example, when a user asks for a link to a park and then about its history, the system sends the link, elaborates on the history, and tells the user to “check the chat for more details” [00:04:08].

A “send chat message” tool sends details that are easier to explain in text [00:04:37]. The tool needs only a short description rather than an extensive system prompt, which shows how far simple prompts go in modern AI development [00:04:46]. A second tool sends a topic to a reasoning model for a deeper look; the reasoning model's output can either go back to the voice model as a response or be dumped directly into the client [00:04:57].
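Here is a sketch of the client-side dispatch for these two tools. It assumes the Realtime API emits a `response.function_call_arguments.done` event carrying `name`, `call_id`, and a JSON-string `arguments` field. It also assumes the client answers with a `conversation.item.create` event containing a `function_call_output` item, followed by `response.create`. Verify these event shapes against the current API reference. The tool names and `fake_reasoning_model` are hypothetical stand-ins; a real app would call an actual reasoning model there.

```python
import json
from typing import Callable

chat_panel: list[str] = []  # stands in for the client's text panel


def fake_reasoning_model(question: str) -> str:
    """Hypothetical placeholder for a call to a more capable reasoning model."""
    return f"Detailed answer to: {question}"


def send_chat_message(args: dict) -> str:
    chat_panel.append(args["text"])
    return "Message shown in chat."


def ask_reasoning_model(args: dict) -> str:
    answer = fake_reasoning_model(args["question"])
    # Dump the long answer straight into the client instead of reading it aloud.
    chat_panel.append(answer)
    return "Detailed answer sent to chat. Tell the user to check the chat for more details."


HANDLERS: dict[str, Callable[[dict], str]] = {
    "send_chat_message": send_chat_message,
    "ask_reasoning_model": ask_reasoning_model,
}


def handle_function_call(event: dict) -> list[dict]:
    """Turn a completed function-call event into the client events that answer it."""
    handler = HANDLERS.get(event["name"])
    if handler is None:
        output = f"Unknown tool: {event['name']}"
    else:
        output = handler(json.loads(event["arguments"]))
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": output,
            },
        },
        # Ask the voice model to continue now that the tool has run.
        {"type": "response.create"},
    ]
```

The string returned to the voice model stays short on purpose. It lets the assistant say something like "check the chat for more details" while the full text sits in the panel.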

The source code for these enhancements is available on GitHub under “fix gpt” [00:05:13].
