From: hu-po
This article explores two prominent AI models, Sora and Gemini 1.5, which were released in close succession by OpenAI and Google DeepMind, respectively. While they serve different purposes, both represent significant advancements in their fields [00:02:21].
Sora: State-of-the-Art Video Generation
Sora is OpenAI’s new state-of-the-art video generation model [00:02:56] [01:36:25]. OpenAI released a “technical report” rather than a traditional paper, featuring videos demonstrating the model’s capabilities [00:03:07] [01:37:51].
Capabilities and Impact
Sora generates videos that are significantly longer, higher quality, and more coherent in motion than those of previous models [00:05:50] [01:37:36]. This jump in quality is described as a "step function" improvement that could carry over into 3D generative content, such as text-to-3D methods like Gaussian Splatting or NeRFs, with significant gains in generative 3D expected within roughly six months [00:06:42] [00:07:54] [00:07:56].
Key Contributors and Their Backgrounds
Sam Altman, CEO of OpenAI, credited three individuals for Sora’s development:
- Tim Brooks: A research scientist at OpenAI with a PhD from the University of California, Berkeley [00:08:29]. He is known for InstructPix2Pix, an instruction-based image-editing diffusion model whose training dataset was generated by using GPT-3 as a "prompt engineer" to write edit instructions and Stable Diffusion to synthesize the corresponding image pairs [00:09:03] [00:11:15].
- Bill Peebles (William Peebles): Also a research scientist at OpenAI and a Berkeley PhD [00:12:27] [00:12:35]. He is known for the "Scalable Diffusion Models with Transformers" paper, which replaces the U-Net backbone of latent diffusion models with a Transformer [00:13:02] [00:13:13] [00:16:08]. This architecture is called a Diffusion Transformer (DiT) [00:16:26]; a minimal sketch of a DiT block follows this list.
- Aditya Ramesh (Model Mechanic): A more senior figure with extensive citations, known for papers on image generation, including “Hierarchical Text-Conditional Image Generation with CLIP Latents” and “Zero-Shot Text-to-Image Generation” [00:19:42] [00:20:14]. His work includes using Transformers that autoregressively model images and text as tokens in a single data stream, often employing a discrete variational autoencoder (VQ-VAE) [00:21:39] [00:21:49].
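To make the DiT idea concrete, here is a minimal NumPy sketch of a single Diffusion Transformer block in the style of the published paper's adaLN conditioning: the conditioning vector (e.g., diffusion timestep plus caption embedding) regresses scale, shift, and gate values that modulate an otherwise standard attention + MLP Transformer block. This is an illustrative reconstruction of the DiT paper, not Sora's actual implementation; all shapes, parameter names, and the toy initialization are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm without learned affine params; scale/shift come from the conditioning.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_qkv, w_out, n_heads):
    n, d = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)              # each [n, d]
    dh = d // n_heads
    q = q.reshape(n, n_heads, dh).transpose(1, 0, 2)       # [heads, n, dh]
    k = k.reshape(n, n_heads, dh).transpose(1, 0, 2)
    v = v.reshape(n, n_heads, dh).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # [heads, n, n]
    out = (att @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ w_out

def dit_block(tokens, cond, p, n_heads=4):
    """One DiT block with adaLN conditioning: `cond` regresses per-block
    shift/scale/gate vectors that modulate the attention and MLP sub-layers."""
    mod = cond @ p["w_adaln"] + p["b_adaln"]                # [6*d]
    sh_a, sc_a, g_a, sh_m, sc_m, g_m = np.split(mod, 6)

    h = layer_norm(tokens) * (1 + sc_a) + sh_a
    tokens = tokens + g_a * self_attention(h, p["w_qkv"], p["w_out"], n_heads)

    h = layer_norm(tokens) * (1 + sc_m) + sh_m
    tokens = tokens + g_m * (np.maximum(h @ p["w1"], 0.0) @ p["w2"])  # ReLU MLP
    return tokens

# Tiny usage example with random weights (hypothetical sizes).
d, rng = 64, np.random.default_rng(0)
p = {"w_adaln": rng.normal(0, 0.02, (d, 6 * d)), "b_adaln": np.zeros(6 * d),
     "w_qkv": rng.normal(0, 0.02, (d, 3 * d)),   "w_out": rng.normal(0, 0.02, (d, d)),
     "w1": rng.normal(0, 0.02, (d, 4 * d)),      "w2": rng.normal(0, 0.02, (4 * d, d))}
out = dit_block(rng.normal(size=(16, d)), rng.normal(size=d), p)  # 16 noisy latent patches
```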
Underlying Architecture and Data
Sora is widely believed to be a latent diffusion model in which denoising occurs in a "spacetime latent space" [00:44:00]. It likely uses a Transformer-based autoencoder that encodes entire videos at once rather than frame by frame [00:44:08]. The model turns visual data into "spacetime patches," or visual tokens, which are then fed into a Transformer; this patching allows Sora to train on and generate videos of variable resolution and length [00:31:50] [00:32:01].
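As a rough illustration of the patching idea, the sketch below cuts an encoded video latent into spacetime patches and flattens them into a token sequence; because the sequence length simply tracks the number of patches, videos of different durations and resolutions produce different-length sequences for the same Transformer. The latent shape and patch sizes here are assumptions, not details from OpenAI's report.

```python
import numpy as np

def spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Cut an encoded video latent [T, H, W, C] into non-overlapping
    pt x ph x pw spacetime patches and flatten each patch into one token.
    Variable T/H/W simply change the resulting sequence length."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)          # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)        # [num_patches, patch_dim]

# A hypothetical 16-frame, 32x32, 4-channel latent -> 2048 tokens of dimension 32.
tokens = spacetime_patches(np.zeros((16, 32, 32, 4)))
print(tokens.shape)  # (2048, 32)
```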
A key training technique is "recaptioning": a vision-language model such as GPT-4V generates long, detailed captions for training videos, and at generation time short user prompts are similarly expanded into longer, more detailed captions [00:35:37] [01:39:39]. There is speculation that Sora's training data includes video game footage, which might explain some "video game-looking" artifacts in generated content, since the model would implicitly learn the physics present in those games [00:37:26] [02:05:00]. However, OpenAI itself likely did not explicitly generate synthetic data with game engines, given the specialized skill sets that would require [00:38:38].
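As a rough sketch of the prompt-expansion half of this, the snippet below rewrites a short user prompt into a detailed caption using a general-purpose chat model via the OpenAI Python client. The model name, the system instruction, and the idea that this mirrors OpenAI's internal Sora pipeline are all assumptions for illustration.

```python
# Minimal sketch (assumed, not OpenAI's actual Sora pipeline) of expanding a
# short user prompt into a long, detailed caption before video generation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name; the report does not name one
        messages=[
            {"role": "system",
             "content": "Rewrite the user's request as a single, highly detailed "
                        "video caption: describe subjects, setting, lighting, "
                        "camera motion, and visual style."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```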
Limitations and “World Simulators”
While impressive, Sora is not a perfect “world simulator” [00:41:01]. It exhibits “weirdness,” such as illogical physics or perspective errors [00:42:19]. An example highlighted is a person’s legs “flipping” unnaturally in a generated video [00:42:34]. The “physics” it learns are implicit, derived from the vast amount of video data it consumes, which can include unrealistic video game physics or drawn content [01:24:09] [01:25:21].
Gemini 1.5: Multimodal Understanding and Long Context
Gemini 1.5 is Google DeepMind’s latest multimodal model, focused on understanding rather than generation [01:40:10].
Capabilities and Benchmarks
Gemini 1.5 is described as a “highly compute-efficient multimodal mixture of experts” [00:45:46]. It excels at reasoning over extremely long contexts, achieving near-perfect recall on long context retrieval tasks [00:46:53]. It can consume up to 10 million tokens, a generational leap over existing models like GPT-4 Turbo, which has a 128k token limit [00:47:15] [00:47:50] [01:40:24].
This capability is demonstrated through the “Needle in the Haystack” benchmark, where the model must find specific information (the “needle”) within a vast amount of data (the “haystack”) [00:49:05]. For example, it can analyze an entire novel (like Les Misérables) and answer questions about specific images from the book, correlating visual tokens with text [00:48:12] [00:50:45]. It also demonstrates state-of-the-art speech recognition, marginally outperforming Whisper V3 [00:56:31].
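The benchmark itself is simple to reproduce in outline: plant a known fact at controlled depths inside a long distractor context, ask the model to retrieve it, and score recall across depths (and, in the full benchmark, across context lengths). The sketch below assumes a generic `ask_model(prompt)` callable standing in for whatever API serves the model; the needle text and filler are made up for illustration.

```python
NEEDLE = "The magic number Arthur mentioned is 42178."
QUESTION = "What is the magic number Arthur mentioned? Answer with the number only."

def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle at a given relative depth inside the filler text."""
    pos = int(len(filler_sentences) * depth_fraction)
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def needle_in_haystack_eval(ask_model, filler_sentences,
                            depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return retrieval accuracy over several needle depths."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(filler_sentences, NEEDLE, depth) + "\n\n" + QUESTION
        answer = ask_model(prompt)
        hits += "42178" in answer
    return hits / len(depths)

# Usage with a placeholder model call (ask_model is an assumption, not a real API).
filler = [f"Filler sentence number {i} about nothing in particular." for i in range(10_000)]
accuracy = needle_in_haystack_eval(lambda prompt: "42178", filler)
print(accuracy)  # 1.0 for this trivially correct stand-in
```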
Technical Speculation
Google DeepMind’s technical report provides limited detail on the model’s architecture or training data [00:51:11] [01:41:20]. However, a few key aspects are mentioned or can be speculated upon:
- Mixture of Experts (MoE): Gemini 1.5 is explicitly described as a "mixture of experts" model [00:57:52]. Given a recent Google DeepMind paper on MoE for deep reinforcement learning, it is hypothesized that Gemini 1.5 uses "Soft MoE" [00:58:15] [01:00:50]. In Soft MoE, a router network sends weighted averages of input tokens to the different "experts" (e.g., MLPs) rather than making hard assignments [01:00:59]; a minimal Soft MoE sketch follows this list.
- Ring Attention: The unprecedented long context window of Gemini 1.5 is hypothesized to stem from a technique similar to "Ring Attention" [01:05:11]. A separate paper, "World Model on Million-Length Video and Language with Ring Attention," released around the same time, details a Transformer that scales context size arbitrarily and achieves strong performance on needle-retrieval tasks with multimodal data [01:05:25] [01:08:01]. Ring Attention splits a sequence into blocks, distributes them across a "ring of hosts" (GPUs), and shuffles key-value blocks between hosts to compute attention efficiently [01:10:52] [01:12:01]; a simplified single-machine simulation appears after this list.
- Multimodality: Gemini 1.5 consumes multiple modalities (text, image, audio, video) by turning raw data into sequences of tokens [00:46:36]. Special tokens (e.g., an "end of text" or "start of image" token) are introduced to explicitly signal changes of modality to the model [01:09:42].
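As a concrete (and purely speculative, since Google has not published Gemini 1.5's architecture) illustration of the Soft MoE idea, the NumPy sketch below computes dispatch weights that mix all input tokens into a small number of expert "slots", runs each expert on its slots, and then mixes the slot outputs back into per-token outputs. The dimensions, number of experts, and expert definitions are assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(tokens, phi, experts):
    """Soft MoE: every slot receives a weighted average of *all* tokens
    (no hard routing), each expert processes its slots, and every token
    receives a weighted average of all slot outputs."""
    n_experts = len(experts)
    logits = tokens @ phi                      # [n_tokens, n_experts * slots_per_expert]
    dispatch = softmax(logits, axis=0)         # per slot: weights over tokens
    combine = softmax(logits, axis=1)          # per token: weights over slots
    slot_in = dispatch.T @ tokens              # [n_slots, d] weighted token averages
    slots_per_expert = slot_in.shape[0] // n_experts
    slot_out = np.concatenate([
        experts[i](slot_in[i * slots_per_expert:(i + 1) * slots_per_expert])
        for i in range(n_experts)
    ])
    return combine @ slot_out                  # [n_tokens, d]

# Toy usage: 2 experts (simple MLPs), 2 slots each, 8 tokens of dim 16 (all hypothetical).
d, rng = 16, np.random.default_rng(0)
def make_expert():
    w1, w2 = rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2
experts = [make_expert() for _ in range(2)]
phi = rng.normal(0, 0.1, (d, 2 * 2))           # d x (n_experts * slots_per_expert)
out = soft_moe_layer(rng.normal(size=(8, d)), phi, experts)
print(out.shape)  # (8, 16)
```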
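To show the mechanics of Ring Attention without the distributed machinery, the sketch below simulates the ring on one machine: the sequence is split into blocks (one per "host"), key/value blocks rotate around the ring, and each host accumulates attention for its query block with a numerically stable streaming softmax, so no host ever materializes the full attention matrix. This is a conceptual reconstruction of the published Ring Attention idea, not Gemini's or the paper's actual code.

```python
import numpy as np

def ring_attention(q, k, v, n_hosts):
    """Simulate Ring Attention: split the sequence into `n_hosts` blocks,
    rotate K/V blocks around the ring, and let each host accumulate attention
    for its own query block with a streaming (online) softmax."""
    d = q.shape[-1]
    q_blocks = np.array_split(q, n_hosts)
    k_blocks = np.array_split(k, n_hosts)
    v_blocks = np.array_split(v, n_hosts)

    outputs = []
    for i, q_i in enumerate(q_blocks):                 # each host owns one query block
        num = np.zeros_like(q_i)                       # running numerator  sum(p @ V)
        den = np.zeros(q_i.shape[0])                   # running denominator sum(p)
        m = np.full(q_i.shape[0], -np.inf)             # running max for stability
        for step in range(n_hosts):                    # K/V blocks arrive one hop at a time
            j = (i + step) % n_hosts
            s = q_i @ k_blocks[j].T / np.sqrt(d)       # scores against this K block
            m_new = np.maximum(m, s.max(axis=-1))
            corr = np.exp(m - m_new)                   # rescale previous partial sums
            p = np.exp(s - m_new[:, None])
            num = num * corr[:, None] + p @ v_blocks[j]
            den = den * corr + p.sum(axis=-1)
            m = m_new
        outputs.append(num / den[:, None])
    return np.concatenate(outputs)

# Sanity check against ordinary full attention (shapes are hypothetical).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
s = q @ k.T / np.sqrt(32)
w = np.exp(s - s.max(-1, keepdims=True))
full = (w / w.sum(-1, keepdims=True)) @ v
print(np.allclose(ring_attention(q, k, v, n_hosts=4), full))  # True
```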
Compute and Data Transparency
Gemini 1.5 Pro was trained on "multiple 4096-chip PODs of Google TPU V4 accelerators" [00:51:37]. Google's TPUs (Tensor Processing Units) are custom in-house AI accelerators, playing the role GPUs play elsewhere but designed specifically for machine-learning workloads [00:51:43]. Like OpenAI, Google DeepMind remains vague about its training data, citing only "a variety of multimodal and multilingual data" [00:52:14]. This lack of transparency is noted as a concerning trend in AI, potentially driven by legal concerns over copyrighted material [00:53:16].
Overall Comparison and Impact
Both Sora and Gemini 1.5 represent significant milestones. Their release timing was notable, with OpenAI’s Sora announcement closely following Google’s Gemini 1.5, interpreted as a strategic move by Sam Altman to “steal their thunder” [00:03:49] [01:45:02].
While Sora’s video generation is visually stunning and creates immediate hype, Gemini 1.5’s technical advances in long-context understanding and near-perfect recall are argued to be more fundamentally “transformative” [01:44:06] [01:44:08]. Gemini’s ability to process massive amounts of information could significantly disrupt fields reliant on Retrieval Augmented Generation (RAG) systems, as a single model could soon handle an individual’s entire digital history as context [01:47:21]. Sora, on the other hand, is poised to revolutionize video and 3D content creation [01:48:51].
In essence, Gemini 1.5 is seen as technically more impressive, while Sora generated more public hype thanks to its visual modality [01:47:08]. Both models underscore the growing importance of scale in compute and data for state-of-the-art AI, a level of scale currently within reach mainly of tech giants like OpenAI and Google [00:35:01] [01:50:30].