From: aidotengineer

ChatGPT quickly became one of the fastest-growing applications in history [00:00:00]. Despite its popularity, its user experience can be confusing [00:00:08]. The application often feels like its different components were built by separate teams, lacking a cohesive multimodal interface where text and voice interact seamlessly [01:18:00].

The “Shipping the Org Chart” Problem

This issue is akin to what Scott Hansselman described as “shipping the org chart” [01:22:00]. He illustrated this by describing the experience of being in an electric vehicle where the map, climate controls, and speedometer all had different fonts, revealing that they were essentially three Android tablets chained together, reflecting the company’s internal organizational structure rather than a unified user experience [01:26:00]. OpenAI exhibits a similar pattern, where rapid technical improvements are shipped without full integration, leading to a “science fair full of potential options” [01:50:00].

Enhancing AI Interaction with Tool Calls and Reasoning Models

To address these limitations, two key improvements for applications like ChatGPT are proposed:

  1. Allowing simultaneous voice and text interaction [02:14:00].
  2. Smartly choosing the appropriate AI model based on the user’s request [02:16:00].

These enhancements can be achieved using off-the-shelf tools and functions [02:20:00].

The Power of Tool Calls

Tool calls enable AI models to perform specific actions or access external information. For instance, a real-time audio chat can leverage tool calls to:

  • Send text messages for longer details, links, or drafts [02:29:00].
  • Hand off to a smarter model for more complex queries and then return with an answer [02:34:00].

An example of a specific tool is the send chat message tool, which allows the AI to provide details that are easier to convey via text [04:41:00]. This can be achieved with simple prompts and descriptive instructions [04:52:00].

Leveraging Reasoning Models

For more detailed or complex inquiries, reasoning models are crucial. These models are designed to:

  • Handle complex questions by breaking them down [03:30:00].
  • Write a plan or process to ensure the desired outcome, particularly in tasks like code refactoring [03:37:00].
  • Provide more detailed responses, even informing the user about the thinking process involved [03:50:00].

A common pattern involves using heuristics: if a user asks for details, pros, or cons, the request can be handed off to a reasoning model [03:46:00]. Another specific tool for reasoning models sends the conversation details to the reasoning model whenever a user wants to delve deeper into a topic [04:57:00].

Practical Application: An Enhanced AI Experience

An improved AI experience can feature a voice interface with an integrated chat panel. This allows users to speak while simultaneously viewing or typing text, similar to texting a friend during a FaceTime call [02:51:00].

Consider the following interaction:

  1. User asks via voice: “Can you send me a link to a park that I should go visit?” [04:08:00]
  2. AI responds via voice and sends text via a tool call: “Of course. I’ll send you a link to a popular park in California.” [04:12:00] (A link appears in the chat panel).
  3. User asks for more detail via voice: “I’ve heard about Yosemite. Can you tell me more about its history? Go deep.” [04:18:00]
  4. AI responds via voice and sends detailed text via a reasoning model tool call: “Yosemite National Park’s history is rich, beginning with its ancient Native American heritage and leading to its establishment as a national park. Check the chat for more details.” [04:25:00] (Extensive historical information appears in the chat panel).

This seamless integration of voice and text, facilitated by tool calls for specific actions and reasoning models for deeper analysis, creates a far more intuitive and powerful user experience [03:05:00]. The source code for such an implementation is available on GitHub under “fix gpt” [05:13:00].