From: aidotengineer

ChatGPT is one of the fastest-growing applications in history, but its current voice and text interaction features present significant design flaws [00:00:00]. These issues can make the app confusing and limit its utility as a comprehensive conversational system [00:00:08].

Current Limitations of ChatGPT

Currently, ChatGPT features two separate buttons for voice interaction: a voice-to-text option and a voice-to-voice option [00:00:20]. While the voice-to-voice feature can provide nicely worded responses, it’s limited to voice output only [00:01:01]. This means users cannot collaborate on written content directly within the voice interface [00:01:05]. To access a voice transcript or any formatting, users must end the call [00:01:07].

This segregated approach makes the voice and text features feel like they were built by “two different companies,” lacking a seamless multimodal experience [00:01:18]. This phenomenon, termed “shipping the org chart” by Scott Hanselman, occurs when an organization’s internal structure is inadvertently reflected in the product’s disjointed user experience [00:01:22].

Proposed Improvements and Solutions

Two main changes are proposed to address these limitations:

  1. Simultaneous Voice and Text Interaction: Allowing users to interact using both voice and text at the same time [00:02:11].
  2. Intelligent Model Selection: Smartly choosing the appropriate AI model based on the complexity and nature of the user’s request [00:02:16].

These improvements can be implemented using off-the-shelf tools [00:02:20]. For instance, “GPT-4o Realtime” can provide live audio chat, while “tool calls” can manage other functionalities [00:02:23].

Enhanced User Interface for Multimodal Interaction

A redesigned voice mode interface would enable a more integrated experience:

  • Persistent Voice Mode: A voice button transitions the system into a continuous voice mode [00:02:40].
  • Integrated Chat Panel: A new “chat” button would pull up an iMessage-like panel [00:02:51]. The experience resembles texting a friend while on a FaceTime call: call controls sit at the top, and a text response area below holds more detailed outputs such as email drafts [00:02:56].

Intelligent Model Handoff for Complex Queries

For queries requiring more detail or complex processing, a “reasoning model” can be employed [00:03:10]. This pattern is demonstrated by Warp Terminal, a developer tool that uses an AI agent to handle coding tasks [00:03:14].

  • Simple Tasks: A coding agent can execute commands directly (e.g., “undo my last commit”) [00:03:21].
  • Complex Tasks: For intricate requests (e.g., “refactor this entire codebase to use Flutter”), the system detects complexity and uses a reasoning model to formulate a plan [00:03:30].
  • Heuristics for Handoff: If a user asks for “details and pros and cons,” the system can hand off to a reasoning model, indicate thinking time, and then return a more detailed response [00:03:46].
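
The handoff heuristic above can be sketched as a small routing function. This is only an illustration, not the implementation from the talk: the cue phrases (other than “pros and cons” and “details”), the word-count threshold, and the names `needs_reasoning` and `route` are all assumptions made for this example.

```python
import re

# Hypothetical cue phrases. The source only mentions "details and pros and cons";
# the rest are illustrative guesses at what might signal a deeper request.
REASONING_CUES = ("pros and cons", "details", "in detail", "step by step", "trade-offs")


def needs_reasoning(utterance: str, max_simple_words: int = 40) -> bool:
    """Return True when a request looks complex enough for a slower reasoning model."""
    text = utterance.lower()
    if any(cue in text for cue in REASONING_CUES):
        return True
    # Treat very long requests as complex (an arbitrary, illustrative threshold).
    return len(re.findall(r"\w+", text)) > max_simple_words


def route(utterance: str) -> str:
    """Pick a model tier: 'realtime' for quick replies, 'reasoning' for deep dives."""
    return "reasoning" if needs_reasoning(utterance) else "realtime"
```

With this sketch, a request like “undo my last commit” stays on the fast path, while “give me details and pros and cons” is handed off. In practice the decision can also be left to the model itself through a tool description, as described under Implementation Details.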

Implementation Details

These solutions can be built using off-the-shelf APIs:

  • Send Chat Message Tool: A “send chat message” tool, powered by tool calls, allows the system to send details that are easier to convey via text [00:04:37]. A simple tool description is enough for the model to decide on its own when text is the better channel [00:04:46].
  • Reasoning Model Tool: Another tool is designated for “reasoning models.” When a user wishes to delve deeper into a topic, this tool sends the details to a more capable model, which then processes the information and sends a detailed response back to the client [00:04:57].
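
The two tools above might be declared and dispatched roughly as follows. This is a minimal sketch under stated assumptions: the tool names, descriptions, and parameter names are hypothetical, the definitions follow the JSON Schema style of function tools that many LLM APIs accept (exact field names vary by provider), and the reasoning model is a plain callable stand-in rather than a real API client.

```python
import json
from typing import Callable

# Hypothetical tool definitions in a JSON Schema "function tool" style.
TOOLS = [
    {
        "type": "function",
        "name": "send_chat_message",
        "description": "Send text that is easier to read than to hear, such as drafts, links, or code.",
        "parameters": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
            "required": ["message"],
        },
    },
    {
        "type": "function",
        "name": "ask_reasoning_model",
        "description": "Hand off to a slower, more capable model when the user wants more depth.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
]


def handle_tool_call(
    name: str,
    arguments_json: str,
    chat_panel: list[str],
    reasoning_model: Callable[[str], str],
) -> str:
    """Run one tool call emitted by the voice model and update the chat panel."""
    args = json.loads(arguments_json)
    if name == "send_chat_message":
        chat_panel.append(args["message"])
        return "sent"
    if name == "ask_reasoning_model":
        # The detailed answer goes to the text panel, not to the audio stream.
        chat_panel.append(reasoning_model(args["question"]))
        return "answered"
    raise ValueError(f"unknown tool: {name}")
```

The short status string returned by `handle_tool_call` is what would be passed back to the voice model, so it can acknowledge the action aloud while the full text appears in the chat panel.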

The source code for these improvements is available on GitHub under “fix gpt” [00:05:13]. Together, multimodal interaction and intelligent model routing show how a voice agent can become a more reliable conversational system than ChatGPT’s current design allows.