From: redpointai
Multimodal AI, and video models in particular, is a significant area of innovation in artificial intelligence. The field integrates multiple data types, such as vision and audio, into AI models to enable richer interactions and creations [00:18:38].
The Rise of Multimodal Models
Around 2018, following the invention of large language models (LLMs), it became apparent that Transformer architectures and related techniques could be adapted to other modalities, including vision and audio [00:18:26]. Initially, these capabilities appeared as standalone models such as DALL-E and Whisper [00:18:46]. The eventual goal is for these diverse modalities to be integrated into a single, comprehensive model [00:18:51].
One modality that has historically presented significant challenges is video [00:18:56].
Video Model Breakthroughs: Sora
Sora, developed by OpenAI, is noted as one of the first models to successfully demonstrate advanced video generation [00:19:00]. While other companies such as Runway have also released video models, Sora stands out for its product approach [00:20:06].
Two key factors distinguish video models from other modalities such as image generation:
- Extended Sequences and User Interface: Unlike an image, which is often created from a single prompt, a video is an extended sequence of events, so it needs a dedicated user interface for managing the story as it unfolds over time [00:19:49]. Sora’s product team built a storyboard capability that lets users set checkpoints, for example every few seconds or every ten seconds, to guide the video generation process (a rough sketch of this checkpoint idea follows this list) [00:21:56].
- High Cost: Video models are expensive to both train and run [00:20:21]. Despite the cost, Sora’s quality and broad distribution, including availability to OpenAI Plus account holders, set a high bar for competitors [00:20:55].
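The interview describes the storyboard only at the product level, so the following is a minimal sketch of how such a checkpoint list might be represented, not Sora’s actual implementation. All names here (`Checkpoint`, `Storyboard`, `segments`, the example prompts) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """A prompt pinned to a moment in the video timeline."""
    time_s: float  # position in the video, in seconds
    prompt: str    # what should be happening at this moment


@dataclass
class Storyboard:
    """An ordered set of checkpoints guiding one long-form generation."""
    duration_s: float
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def add(self, time_s: float, prompt: str) -> None:
        # Keep checkpoints sorted by time so segments stay well-ordered.
        self.checkpoints.append(Checkpoint(time_s, prompt))
        self.checkpoints.sort(key=lambda c: c.time_s)

    def segments(self):
        """Yield (start_s, end_s, prompt): the spans a generator would
        condition on in turn."""
        times = [c.time_s for c in self.checkpoints] + [self.duration_s]
        for cp, end in zip(self.checkpoints, times[1:]):
            yield cp.time_s, end, cp.prompt


board = Storyboard(duration_s=30.0)
board.add(0.0, "A kite rises over a foggy harbor at dawn")
board.add(10.0, "The camera follows the kite above the rooftops")
board.add(20.0, "The kite drifts out over open water")

for start, end, prompt in board.segments():
    print(f"{start:>4.0f}s-{end:>4.0f}s  {prompt}")
```

The point of the structure is simply that a long generation is broken into prompt-anchored spans ("every few seconds or every ten seconds"), which is what makes a timeline-style interface necessary in a way that single-prompt image generation is not.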
Future of Video Models
The trajectory of video models is expected to mirror that of LLMs [00:21:23]. Over the next few years, they are expected to show:
- Improved Quality: While moment-to-moment quality is already high, the main area for improvement is extended, coherent generation [00:21:31]. Scaling from a few seconds of video to an hour is a hard problem [00:22:10].
- Reduced Cost: Just as a GPT-3-quality token became 100 times cheaper than it was when GPT-3 was released, Sora-quality videos are anticipated to become practically free (a back-of-the-envelope projection follows this list) [00:22:25].
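To make the cost analogy concrete, here is a back-of-the-envelope projection of a 100x price decline. The launch price and the three-year horizon are illustrative assumptions, not figures from the interview; only the 100x factor comes from the GPT-3 comparison.

```python
# Illustrative only: project per-video cost under a GPT-3-like 100x decline.
launch_cost = 10.0       # assumed $ per Sora-quality clip at release (not sourced)
total_reduction = 100.0  # the 100x decline the GPT-3 analogy cites
years = 3                # assumed horizon for that decline (not sourced)

# A uniform 100x drop over 3 years implies ~4.64x cheaper each year.
annual_factor = total_reduction ** (1 / years)

for year in range(years + 1):
    cost = launch_cost / annual_factor ** year
    print(f"year {year}: ~${cost:.2f} per clip")
```

Under these assumptions, a $10 clip falls to roughly $2.15, then $0.46, then $0.10, which is the sense in which such videos become "practically free."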
AI-generated full-length movies that win awards are predicted within two years [00:23:08]. However, the true value of such films would stem from a director’s creative vision using the AI model as a tool, rather than from the AI itself being the sole creative force [00:23:17]. Once a technology can be demoed effectively, progress tends to accelerate rapidly [00:23:55].
The Rise of Multimodal Models and Their Implications
Progress in multimodal AI, including the trends in training and deploying video models, reflects a broader shift in the challenges and advances of AI technology. OpenAI’s strategy, for instance, pivoted significantly over the years, moving from a non-profit focused on writing papers to an organization building products and APIs, demonstrating the adaptability required to develop and apply AI models in the tech industry [00:41:01]. These changes, including partnerships and consumer products like ChatGPT, were necessary adaptations to push the technology forward and make it accessible [00:43:07].