From: redpointai
Logan Kilpatrick, OpenAI's first developer relations hire, discusses the breadth of OpenAI's offerings and future developments, including the increasing significance of multimodal AI models [00:00:19].
Current State of Multimodal Models
While there will be many multimodal developments, Logan believes that vision use cases are still in their early stages [00:03:22]. Current vision models remain limited across many domains [00:03:30]. For the most impactful applications, the model needs a detailed understanding of the positional relationships between objects in an image [00:03:40].
Kilpatrick likens the current state of vision models to the "GPT-3.5 Vision era" [00:04:07]. Once the next level of robustness is achieved, similar to the jump from GPT-3.5 to GPT-4, many more use cases will be unlocked [00:03:55], [00:04:10]. For example, the text rendered in DALL·E images today is not always correct, and fixing it often requires significant skill in tools like Photoshop [00:43:49], [00:44:11].
Implications and Future Potential
Enhanced User Experiences
The ability to combine different modalities (like drawing and text) with AI models can greatly empower users. For instance, DALL·E has inspired Logan Kilpatrick to expand his creative imagination, making him feel less constrained by his own artistic skills [00:41:48], [00:41:50].
Transformative Applications
- tldraw: A notable example of a multimodal application is tldraw, which converts user sketches into functional applications. It leverages various OpenAI components and was built quickly by a developer, demonstrating how accessible the technology is (a minimal sketch of the underlying pattern follows this list) [00:40:07], [00:40:41], [00:40:47].
- Creative Tools: Logan envisions tools that can take a sketch and use a generative image model to create polished art, allowing users to explore different possibilities [00:41:17], [00:42:07].
- Design Platforms: A key future use case would be integrating vision models into platforms like Canva, allowing users to reformat designs or manipulate movable objects with high precision. This would require the model to perfectly understand spatial relationships, something current models struggle with [00:42:55], [00:43:03], [00:43:27].
- OCR and Document Processing: For tasks like Optical Character Recognition (OCR) on spreadsheets or receipts, current vision models may get most things right but still miss structural details or misinterpret positional information (see the structured-extraction sketch after this list) [00:43:09], [00:43:16].
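The sketch-to-application pattern mentioned above boils down to sending an image of a drawing to a vision-capable model and asking for runnable code back. The following is a minimal illustration using the OpenAI Python SDK; the model name, prompt, and file name are assumptions for illustration, not how tldraw is actually implemented.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sketch_to_html(image_path: str) -> str:
    """Send a hand-drawn wireframe to a vision-capable model and return a single-file HTML app."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "This is a hand-drawn wireframe of a web app. "
                            "Return one self-contained HTML file that implements it."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

# Example: print(sketch_to_html("wireframe.png"))
```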
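For the OCR case, the same call can be constrained to return structured output, which is exactly where today's models tend to slip on positional details. Below is a hedged sketch, again assuming the OpenAI Python SDK; the schema and prompt are illustrative, and how well row order and positions are preserved depends on the model.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

def extract_receipt(image_path: str) -> dict:
    """Ask a vision-capable model for a receipt's line items as JSON, keeping row order intact."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model
        response_format={"type": "json_object"},  # JSON mode keeps the output parseable
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Extract this receipt as JSON: "
                            '{"merchant": str, "items": [{"row": int, "description": str, "price": float}], '
                            '"total": float}. Preserve the printed top-to-bottom order of the rows.'
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return json.loads(response.choices[0].message.content)
```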
Competition and Market Impact
Google Gemini, a multimodal model, represents a significant push in the ecosystem [00:23:51], [00:24:14]. The demonstration of someone drawing a duck while Gemini understands it in real time as it is drawn is particularly impressive, pointing towards a future where such multimodal capabilities will be ubiquitous [01:05:50], [01:06:00]. This competition is positive for consumers, as companies like Google and Apple, with their massive user bases, can introduce the broader public to the possibilities of AI technology [00:57:25], [00:57:52].
Logan anticipates that multimodal models will be a major theme in 2024, as people are just beginning to explore the potential of inputting images or videos into these models and generating combinations of text, images, or videos [01:05:31], [01:05:41].