From: aidotengineer

The Problem: Lack of Multimodality in Chatbots

ChatGPT is recognized as one of the fastest-growing applications in history, used by hundreds of millions daily [00:00:00]. Despite its popularity, the current design of its voice interaction can be confusing [00:00:08]. The application presents two distinct buttons for voice interaction: a voice-to-text option and a voice-to-voice option [00:00:22].

When using the voice-to-voice option, the system responds with a well-worded email, but it can only communicate through voice [00:01:03]. To collaborate on the written email, users must end the call and dig the text out of a voice transcript that lacks any formatting [00:01:07]. This highlights a critical lack of multimodal interaction where text and voice work together seamlessly [00:01:14].

“Shipping the Org Chart”

This disjointed user experience is likened to “shipping the org chart,” a concept explained by Scott Hanselman [00:01:22]. Hanselman described sitting in an electric vehicle where the map, climate controls, and speedometer all displayed different fonts, revealing them to be three Android tablets chained together [00:01:26]. This indicated the organizational structure of the large international auto company [00:01:42].

Similarly, OpenAI is seen as guilty of this: a technical improvement shipped by a “whiz kid” may be exactly what consumers want, but marketing is never consulted on how it should fit into the product [00:01:48]. The result is a “science fair full of potential options” rather than a cohesive product [00:01:59].

Proposed Solutions for Improved Integration

To fix these issues in conversational AI applications, two key changes are proposed for ChatGPT:

  1. Allowing voice and text interaction simultaneously [00:02:14].
  2. Smartly choosing the right model based on the user’s request [00:02:16].

These improvements can be achieved using off-the-shelf tools [00:02:20].

  • GPT-4o Realtime can provide live audio chat [00:02:23], serving as the voice-first layer (a configuration sketch follows this list).
  • Tool calls can handle the rest, enabling the system to send texts for longer details like links and drafts [00:02:26].
  • A research tool could hand off to a smarter model and return with a detailed answer [00:02:34].
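
As a rough illustration of wiring these pieces together, the sketch below opens a GPT-4o Realtime session over WebSocket and registers two such tools. It is a minimal sketch, not the project's actual code: the tool names (send_chat_message, research) are illustrative, and the event shapes follow the beta Realtime API, which may change.

```typescript
import WebSocket from "ws";

// Open a Realtime session (beta WebSocket endpoint; model name is an assumption).
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Register tools so the voice model can push text and hand off hard questions.
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["audio", "text"],
      tools: [
        {
          type: "function",
          name: "send_chat_message",
          description:
            "Send links, drafts, or other long-form details to the user's chat panel as text.",
          parameters: {
            type: "object",
            properties: { message: { type: "string" } },
            required: ["message"],
          },
        },
        {
          type: "function",
          name: "research",
          description:
            "Hand a complex question to a slower, smarter model and return a detailed answer.",
          parameters: {
            type: "object",
            properties: { question: { type: "string" } },
            required: ["question"],
          },
        },
      ],
    },
  }));
});
```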

Enhanced User Interface Concept

An improved interface would feature a voice button that transitions the app to a voice mode [00:02:40]. This mode would include mute and end call controls, along with a new “chat” button [00:02:49]. Activating the chat button would pull up an iMessage-like panel, allowing users to text while on a call, similar to a FaceTime call [00:02:51]. This panel would show call controls at the top, a reminder of what was asked, and a text response for details like email drafts [00:02:59].
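
As a loose sketch of that layout, the state below models a call with an optional in-call chat panel; the type and field names are hypothetical, not taken from any shipped app.

```typescript
// Hypothetical UI state for a voice mode with an in-call chat panel.
type ChatEntry = { role: "user" | "assistant"; text: string };

interface VoiceModeState {
  inCall: boolean;          // voice button pressed, call active
  muted: boolean;           // mute control
  chatPanelOpen: boolean;   // iMessage-like panel toggled by the "chat" button
  lastUserRequest: string;  // reminder of what was asked, shown at the top
  chatEntries: ChatEntry[]; // text responses: links, email drafts, details
}

// Ending the call collapses the voice controls but keeps the chat history.
function endCall(state: VoiceModeState): VoiceModeState {
  return { ...state, inCall: false, muted: false };
}
```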

Smart Model Selection

The system could also adapt to complex queries. For instance, in a developer tool like Warp Terminal for writing code, a simple request like “undo my last commit” can be handled by a coding agent running commands [00:03:14]. For more complex questions, such as “refactor this entire codebase to use Flutter instead,” the system detects the complexity and writes a plan using a “reasoning model” to ensure the code works [00:03:30].

This pattern can be applied using heuristics: if a user asks for details or pros and cons, the system could hand off to a reasoning model, indicate its thinking time, and then return a more detailed response [00:03:44]. This represents a significant step toward the future of voice-first AI.
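
A minimal sketch of such a heuristic routing step is shown below; the cue list and the model names (o3-mini, gpt-4o-mini) are assumptions for illustration, not details from the talk.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Crude heuristic: requests that ask for depth or trade-offs go to a reasoning model.
function needsReasoningModel(request: string): boolean {
  const cues = ["pros and cons", "go deep", "in detail", "compare", "plan", "refactor"];
  const lower = request.toLowerCase();
  return cues.some((cue) => lower.includes(cue));
}

async function answer(request: string): Promise<string> {
  // Route to a slower reasoning model for complex asks (and show "thinking" in the UI),
  // otherwise let the fast conversational model reply directly.
  const model = needsReasoningModel(request) ? "o3-mini" : "gpt-4o-mini";
  const completion = await openai.chat.completions.create({
    model,
    messages: [{ role: "user", content: request }],
  });
  return completion.choices[0].message.content ?? "";
}
```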

Demonstration of Enhanced Integration

A practical demonstration illustrates this enhanced integration of the OpenAI API with voice agents:

  • A user asks, “Can you send me a link to a park that I should go visit?” [00:04:12].
  • The system responds verbally, “I’ll send you a link to a popular park in California,” and simultaneously sends the link via text in the chat panel [00:04:15].
  • When asked, “Can you tell me more about its history? Go deep,” the system verbally provides an overview and then directs the user to “Check the chat for more details” [00:04:20]. This shows the voice application delivering audio and text responses together.

Technical Implementation

This multimodal interaction is achieved using tool calls to handle the text channel [00:04:37]. A “send chat message” tool sends details that are easier to explain via text [00:04:39]. This is implemented without a specific system prompt; a short tool description is enough for the model to decide when to send relevant information as text [00:04:46].
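
Continuing the earlier session sketch, a handler for that tool might look like the following; the event names follow the beta Realtime API and may differ across versions, and appendToChatPanel stands in for whatever renders the chat panel.

```typescript
// Stand-in for the app's chat panel rendering.
function appendToChatPanel(text: string): void {
  console.log("[chat]", text);
}

// Forward completed tool calls from the Realtime socket into the chat panel.
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());

  // Emitted when the model finishes streaming a tool call's arguments.
  if (
    event.type === "response.function_call_arguments.done" &&
    event.name === "send_chat_message"
  ) {
    const { message } = JSON.parse(event.arguments);
    appendToChatPanel(message);

    // Tell the model the text was delivered, then let it continue speaking.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: event.call_id, output: "delivered" },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```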

For “reasoning models,” another tool is used [00:04:57]. When a user wishes to delve deeper into a topic, this tool sends details to the reasoning model, which then responds back to the main model or directly to the client [00:05:01]. This approach leverages the power of simple prompts to achieve complex behaviors [00:04:52]. The source code for this “fix gpt” project is available on GitHub [00:05:13].
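
A sketch of that handoff, reusing the openai client and ws socket from the earlier snippets and the hypothetical research tool, could look like this:

```typescript
// Handle the "research" tool: ask a reasoning model, then return the answer
// either to the voice model (as tool output) or straight to the chat panel.
async function handleResearch(question: string, callId: string): Promise<void> {
  const completion = await openai.chat.completions.create({
    model: "o3-mini", // assumption: any reasoning-class model would do
    messages: [{ role: "user", content: question }],
  });
  const detail = completion.choices[0].message.content ?? "";

  // Option 1: hand the detail back to the realtime model so it can summarize aloud.
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: callId, output: detail },
  }));
  ws.send(JSON.stringify({ type: "response.create" }));

  // Option 2: skip the voice model and push the detail directly to the client.
  // appendToChatPanel(detail);
}
```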