From: hu-po
Apple’s release of the MM1 model paper in March 2024, detailing their approach to building multimodal large language models (MLLMs), offers significant insights into the role of model architecture and data scale in achieving performant AI systems [03:50:00]. This paper, unusually transparent for Apple, outlines crucial design lessons and highlights ongoing challenges in AI model development [04:21:00], particularly concerning architecture, data, and computational constraints [05:56:00].
Apple’s MM1 Model: A Deep Dive into MLLM Development
MM1 (Multimodal One) is a family of multimodal models, scaling up to 30 billion parameters, that process both image and text data to produce text [07:28:00] [11:23:00] [16:10:00]. While Apple prefers the term “multimodal large language models,” the community often refers to these as Vision Language Models (VLMs) [07:13:00]. The paper is extensive, detailing ablations, experiments, learning rates, and data sets, providing a comprehensive look at their methodology [04:09:00] [09:00:00].
The MM1 models demonstrate impressive capabilities, including:
- Counting objects in images [12:24:00].
- Optical Character Recognition (OCR), even with small, handwritten, or shifted text [12:30:00].
- Multi-image reasoning and enhanced in-context learning [11:55:00].
- The crucial ability to “say no” or indicate when information is not present, counteracting common hallucination tendencies in LLMs [13:22:00] [13:57:00].
- Interleaving text and images to answer complex questions [14:17:00].
Core Design Decisions and Their Impact
Apple’s research on MM1 identifies three major axes of design decisions for MLLMs [21:27:00]:
- Architecture: Investigating different pre-trained image encoders and connection methods to LLMs [21:35:00].
- Data: Considering various data types and their relative mixture weights [21:46:00].
- Training Procedure: Determining which parts of the model to train at each stage [22:03:00].
Key Lessons from Ablation Studies
Apple’s extensive ablations led to several crucial design lessons [09:09:00]:
Image Resolution is Paramount
Image resolution has the highest impact on final model performance, followed by model size and training data composition [18:31:00] [42:11:00]. Increasing image resolution from 224x224 to 336x336 significantly improves performance [42:18:00] [58:58:00]. The more visual tokens (patches) extracted from an image, the better the performance, though this raises the computational cost, especially for multi-image inputs [45:15:00] [46:28:00].
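To make the resolution-to-token relationship concrete, here is a minimal Python sketch. It assumes a ViT-style encoder with a patch size of 14 (the patch size is an illustrative assumption, not a figure from the paper); the point is simply that going from 224x224 to 336x336 more than doubles the number of visual tokens the LLM must attend to per image.

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of image patches (visual tokens) a ViT-style encoder produces
    for a square image, before any pooling in the connector."""
    assert resolution % patch_size == 0, "resolution must be divisible by patch size"
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2

# Raising resolution from 224 to 336 more than doubles the visual token
# count, and hence the LLM sequence length contributed by each image.
print(num_visual_tokens(224))  # 256 tokens
print(num_visual_tokens(336))  # 576 tokens
```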
Vision Language Connector Design is Less Critical
Surprisingly, the design of the vision-language connector (or projector), which bridges visual features to the LLM, has a comparatively negligible impact on performance [10:28:00] [29:28:00] [49:40:00]. This finding supports the "bitter lesson" in AI, suggesting that complex architectural innovations in this component matter less than raw scale and data [49:53:00]. Simple linear projectors work almost as well as more complex convolutional or deformable attention-based abstractors [51:25:26].
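As a rough illustration of the "simple" end of that design space, below is a minimal linear-projector sketch in PyTorch. The hidden sizes and tensor shapes are assumptions for illustration, not MM1's actual dimensions.

```python
import torch
import torch.nn as nn

class LinearConnector(nn.Module):
    """Minimal vision-language connector: one linear layer that maps each
    visual token from the image-encoder width to the LLM embedding width.
    Dimensions are illustrative assumptions, not MM1's actual sizes."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, vision_dim)
        # returns:       (batch, num_patches, llm_dim), ready to be
        #                interleaved with text token embeddings
        return self.proj(visual_tokens)

connector = LinearConnector()
fake_patches = torch.randn(2, 576, 1024)   # e.g. a 336x336 image, patch size 14
print(connector(fake_patches).shape)       # torch.Size([2, 576, 4096])
```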
Careful Data Mixture for Optimal Performance
For large-scale multimodal pre-training, a careful mix of data types is crucial [09:36:00] [52:47:00]. Apple uses:
- 45% captioned images (short text, high relevance to image) [26:18:00] [53:44:00].
- 45% interleaved image-text documents (longer, more diverse text, less relevance to surrounding image, like news articles) [26:20:00] [53:50:00].
- 10% text-only data [26:23:00].
Interleaved data is instrumental for few-shot and text-only performance, while captioning data is crucial for zero-shot performance [54:27:00]. The inclusion of text-only data prevents the language model from "forgetting" how to read text and keeps it attending to text tokens [56:56:00].
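Here is a minimal sketch of what such a weighted mixture can look like at the data-loader level, assuming per-batch source sampling; the source names and sampling scheme are assumptions for illustration, and only the 45/45/10 weights come from the paper.

```python
import random

# Mixture weights from the paper; everything else here is illustrative.
MIXTURE = {
    "captioned_images": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_source(rng: random.Random = random) -> str:
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Rough check that empirical frequencies match the mixture weights.
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)
```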
Synthetic Data Provides a Non-Trivial Boost
Synthetic captions, such as those from the VeCap-300M dataset, significantly improve performance, particularly for few-shot learning [42:29:00] [58:16:00]. Even a relatively small proportion (7%) of synthetic data yields a 2-4% boost, greater than the impact of increasing the image encoder capacity [58:34:00] [59:06:00].
Innovations in Training Procedures
Apple employs a two-stage training pipeline: pre-training followed by supervised fine-tuning (SFT) [21:58:00].
Unfrozen Encoders During Training
Unlike many other VLM papers that freeze pre-trained image encoders, Apple trains both the LLM and the visual encoder fully unfrozen during both pre-training and supervised fine-tuning [47:03:00] [15:57:00] [01:01:06] [01:35:56]. This allows gradients to propagate all the way down into the image encoder weights, potentially leading to better overall performance. It does, however, require substantial computational resources [47:27:00].
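The difference between this and the more common frozen-encoder recipe comes down to which parameters receive gradients. Below is a hedged PyTorch sketch using toy stand-in modules rather than real encoder/LLM architectures; only the freezing logic is the point.

```python
import torch.nn as nn

# Toy stand-ins for the three components; architectures are purely illustrative.
image_encoder = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 256))
connector = nn.Linear(256, 512)
llm = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Common VLM recipe: keep the pre-trained image encoder frozen.
set_trainable(image_encoder, False)

# MM1-style recipe as described in the paper: everything stays unfrozen in
# both pre-training and SFT, so gradients reach the image encoder weights.
for module in (image_encoder, connector, llm):
    set_trainable(module, True)

num_trainable = sum(p.numel()
                    for m in (image_encoder, connector, llm)
                    for p in m.parameters() if p.requires_grad)
print(f"trainable parameters: {num_trainable:,}")
```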
Mixture of Experts (MoE) for Scalability
MM1 includes Mixture-of-Experts (MoE) variants up to 30 billion parameters, with 64 experts in some configurations [11:32:00] [20:09:00]. MoE models scale the total number of parameters while keeping the activated parameters (those used during inference) constant, typically by routing each token through only a small subset of experts (e.g., the top two) [01:10:42] [01:12:50]. While MoE increases model capacity without a proportional inference cost, it introduces challenges around memory management and ensuring balanced expert utilization during training [01:13:00] [01:14:00]. Apple's MoE models uniformly outperform their dense counterparts, indicating potential for further scaling [01:32:04].
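To make the "constant activated parameters" idea concrete, here is a hedged sketch of top-2 expert routing in PyTorch. The expert count, layer sizes, and routing details are illustrative assumptions, not MM1's implementation; production MoE layers also add a load-balancing loss to encourage the balanced expert utilization mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Sketch of a top-2 mixture-of-experts feed-forward layer: many experts
    in total, but only two run for any given token."""

    def __init__(self, d_model: int = 128, d_ff: int = 256,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                          # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # pick top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    w = weights[mask][:, k:k + 1]
                    out[mask] += w * expert(x[mask])
        return out

moe = Top2MoE()
tokens = torch.randn(16, 128)
print(moe(tokens).shape)  # torch.Size([16, 128]); only 2 of 8 experts ran per token
```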
Computational and Data Challenges
Hardware Secrecy and Computational Constraints
A notable omission from the MM1 paper is any mention of the hardware used for training [27:30:00]. This secrecy, likely due to Apple’s desire to eventually use their own chips for AI training, implies they currently rely on external hardware like Nvidia GPUs or Google’s TPUs [28:17:17] [01:03:05]. Training large MLLMs demands substantial resources, and even Apple uses a smaller base configuration (e.g., 3 billion parameters) for most ablation studies to manage costs, extrapolating findings to larger models (e.g., 30 billion parameters) [22:42:00] [23:19:00]. The assumption that hyperparameter sweeps on small models directly apply to large models remains an open question in the community [23:32:00] [01:09:24].
The "Bitterness" of Scale
The paper reinforces Rich Sutton's "bitter lesson": architectural design choices for components like the vision-language connector have negligible impact compared to data scale and model capacity [10:31:00] [49:50:00] [01:34:25]. This implies an ongoing "arms race" driven by computational resources rather than novel algorithmic ideas [11:11:00] [50:18:00].
Data Provenance and Openness
While the MM1 paper is largely open, it acknowledges an internal, non-publicly available text-only supervised fine-tuning dataset, similar to ShareGPT [01:22:50]. Furthermore, a significant innovation is the use of GPT-4 Vision-generated data sets for supervised fine-tuning, effectively “distilling” the capabilities of GPT-4 Vision into MM1 [01:19:08] [01:20:46]. This practice, common across many VLM models, raises complex questions about data ownership, copyright, and the tangled chain of AI model dependencies [01:21:12] [01:21:37].
Performance and Future Outlook
MM1 achieves state-of-the-art results compared to other published pre-training results from models like Flamingo and EMU2 [09:43:00] [01:17:16]. However, when compared to top closed models like GPT-4 Vision and Gemini Ultra, MM1’s performance is competitive but not superior [01:30:52] [01:31:40].
The scaling laws continue to hold, indicating that increased data, longer pre-training, and higher image resolution lead to better performance [01:33:30] [01:49:17]. This suggests that future advancements in model architecture and scaling will primarily depend on continued investment in computational resources and data acquisition [01:49:43].
Apple’s foray into transparent AI research with MM1 is a positive development for the open-source community [01:50:50]. By sharing detailed methodologies and lessons learned, they contribute to the collective understanding of effective model architecture and data quality in AI, potentially accelerating the development of more powerful and accessible open-source models [01:51:11].