From: aidotengineer
ChatGPT is one of the fastest-growing applications in history, but its current voice and text interaction features present significant design flaws [00:00:00]. These issues can make the app confusing and limit its utility as a comprehensive conversational system [00:00:08].
Current Limitations of ChatGPT
Currently, ChatGPT features two separate buttons for voice interaction: a voice-to-text option and a voice-to-voice option [00:00:20]. While the voice-to-voice feature can provide nicely worded responses, it’s limited to voice output only [00:01:01]. This means users cannot collaborate on written content directly within the voice interface [00:01:05]. To access a voice transcript or any formatting, users must end the call [00:01:07].
This segregated approach makes the voice and text features feel like they were built by “two different companies,” lacking a seamless multimodal experience [00:01:18]. This phenomenon, termed “shipping the org chart” by Scott Hanselman, occurs when an organization’s internal structure is inadvertently reflected in the product’s disjointed user experience [00:01:22].
Proposed Improvements and Solutions
To address these problems, two main changes are proposed for ChatGPT:
- Simultaneous Voice and Text Interaction: Allowing users to interact using both voice and text at the same time [00:02:11].
- Intelligent Model Selection: Smartly choosing the appropriate AI model based on the complexity and nature of the user’s request [00:02:16].
These improvements can be implemented using off-the-shelf tools [00:02:20]. For instance, “GPT-4o Realtime” can provide live audio chat, while “tool calls” can handle everything else, such as sending text or handing off to another model [00:02:23].
Enhanced User Interface for Multimodal Interaction
A redesigned voice mode interface would enable a more integrated experience:
- Persistent Voice Mode: A voice button transitions the system into a continuous voice mode [00:02:40].
- Integrated Chat Panel: A new “chat” button would pull up an iMessage-like panel [00:02:51]. This allows users to text a “friend” while on a “FaceTime call,” with call controls at the top and a text response area for more detailed outputs like email drafts [00:02:56].
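One way to picture this interface is as a single session in which voice and text turns share one history, so opening the chat panel never ends the call. The sketch below is only an illustration under that assumption; the `Turn` and `Session` names and their fields are hypothetical and do not come from the talk or its source code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    channel: str   # "voice" or "text"
    content: str


@dataclass
class Session:
    """Hypothetical session state: one call, one shared history."""
    voice_active: bool = False
    chat_open: bool = False
    history: List[Turn] = field(default_factory=list)

    def start_voice(self) -> None:
        # The voice button switches the session into continuous voice mode.
        self.voice_active = True

    def toggle_chat(self) -> None:
        # The chat button shows or hides the text panel without ending the call.
        self.chat_open = not self.chat_open

    def add_turn(self, role: str, channel: str, content: str) -> None:
        # Voice and text turns land in the same history.
        self.history.append(Turn(role, channel, content))

    def chat_panel(self) -> List[str]:
        # Only text turns are rendered in the chat panel.
        return [t.content for t in self.history if t.channel == "text"]


session = Session()
session.start_voice()
session.toggle_chat()
session.add_turn("user", "voice", "Draft an email to my landlord")
session.add_turn("assistant", "text", "Subject: Lease renewal")
```

Keeping one history for both channels is the design choice that matters here: the text draft appears while the call is still active, and neither channel loses context from the other.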
Intelligent Model Handoff for Complex Queries
For queries requiring more detail or complex processing, a “reasoning model” can be employed [00:03:10]. This pattern is demonstrated by Warp Terminal, a developer tool that uses an AI agent to handle coding tasks [00:03:14].
- Simple Tasks: A coding agent can execute commands directly (e.g., “undo my last commit”) [00:03:21].
- Complex Tasks: For intricate requests (e.g., “refactor this entire codebase to use Flutter”), the system detects complexity and uses a reasoning model to formulate a plan [00:03:30].
- Heuristics for Handoff: If a user asks for “details and pros and cons,” the system can hand off to a reasoning model, indicate thinking time, and then return a more detailed response [00:03:46].
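The handoff decision above can be sketched as a small routing function. In the talk the model itself decides when to hand off, based on a tool description, so the keyword check below is only a stand-in to make the idea concrete. The cue list, the word-count threshold, and the function names are all assumptions made for illustration.

```python
# Hypothetical cue phrases. The talk gives "details and pros and cons"
# as one example; the rest are illustrative guesses.
HANDOFF_CUES = ("pros and cons", "details", "in detail", "step by step", "refactor")


def needs_reasoning_model(request: str, max_simple_words: int = 40) -> bool:
    """Return True when a request looks too complex for a quick voice reply."""
    text = request.lower()
    if any(cue in text for cue in HANDOFF_CUES):
        return True
    # Assumption: very long requests are treated as complex.
    return len(text.split()) > max_simple_words


def route(request: str) -> str:
    """Pick a model tier label: "realtime" for quick replies, "reasoning" otherwise."""
    return "reasoning" if needs_reasoning_model(request) else "realtime"


print(route("undo my last commit"))
print(route("Give me details and pros and cons of each option"))
```

A fixed keyword list is brittle. Describing the handoff tool in plain language and letting the realtime model choose, as the talk suggests, is likely to generalize better.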
Implementation Details
These solutions can be built using off-the-shelf APIs:
- Send Chat Message Tool: A “send chat message” tool, powered by tool calls, allows the system to send details that are easier to convey via text [00:04:37]. This tool is effective with a simple description, enabling the model to smartly determine when to send text [00:04:46].
- Reasoning Model Tool: Another tool is designated for “reasoning models.” When a user wishes to delve deeper into a topic, this tool sends the details to a more capable model, which then processes the information and sends a detailed response back to the client [00:04:57].
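The two tools might be declared and dispatched roughly as follows. This is a minimal sketch, not the code from the talk's repository: the tool names, descriptions, and parameter fields are hypothetical, the definitions follow the general JSON Schema style used for function-calling tools (exact field names depend on the API in use), and the reasoning model is passed in as a plain callable so no specific client library is assumed.

```python
import json
from typing import Callable, List

# Hypothetical tool definitions in a generic function-calling style.
SEND_CHAT_MESSAGE_TOOL = {
    "type": "function",
    "name": "send_chat_message",
    "description": "Send text to the chat panel when content is easier "
                   "to read than to hear, such as drafts, links, or code.",
    "parameters": {
        "type": "object",
        "properties": {"message": {"type": "string"}},
        "required": ["message"],
    },
}

ASK_REASONING_MODEL_TOOL = {
    "type": "function",
    "name": "ask_reasoning_model",
    "description": "Hand a question to a more capable model when the user "
                   "wants to go deeper into a topic.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}


def handle_tool_call(
    name: str,
    arguments_json: str,
    chat_panel: List[str],
    reasoning_model: Callable[[str], str],
) -> str:
    """Run one tool call and return a short status for the voice model."""
    args = json.loads(arguments_json)
    if name == "send_chat_message":
        chat_panel.append(args["message"])
        return "sent"
    if name == "ask_reasoning_model":
        # The detailed answer goes to the text panel, not the audio stream.
        chat_panel.append(reasoning_model(args["question"]))
        return "answered"
    raise ValueError(f"unknown tool: {name}")
```

For example, calling `handle_tool_call("send_chat_message", '{"message": "Draft email"}', panel, my_model)` appends the draft to `panel`, where `my_model` is whatever function wraps the reasoning model. Returning only a short status keeps the voice model's spoken reply brief while the full content appears as text.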
The source code for these improvements is available on GitHub under “fix gpt” [00:05:13]. Together, these patterns show that combining multimodal interaction with intelligent model routing makes for a more reliable conversation system for voice agents, one that moves beyond ChatGPT’s current limitations.