From: hu-po
Model ensembling, also referred to as a mixture of models, is a strategy that combines multiple models to improve final performance [01:36:00]. The technique is particularly popular in Kaggle competitions, where participants aim to squeeze out the last one or two percent of performance [01:42:49]. In such competitions, many different versions of the same model might be trained on slightly different subsets of the data, producing more variety in the final output for the same input [01:53:07]. Generally, picking the best output from several models is superior to relying on a single “best” model [02:08:44].
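As a minimal illustration of the idea (the toy models and function names below are hypothetical, not anything described in the source), an ensemble can simply average the class probabilities of several classifiers trained on different data subsets and return the most likely class:

```python
from typing import Callable, List, Sequence

def ensemble_predict(models: Sequence[Callable[[list], List[float]]],
                     x: list) -> int:
    """Average per-class probabilities from each model and return the argmax class."""
    prob_sums: List[float] = []
    for model in models:
        probs = model(x)                      # each model returns class probabilities
        if not prob_sums:
            prob_sums = [0.0] * len(probs)
        prob_sums = [s + p for s, p in zip(prob_sums, probs)]
    avg = [s / len(models) for s in prob_sums]
    return max(range(len(avg)), key=avg.__getitem__)

# Toy usage: three "models" that disagree slightly on a 3-class problem.
models = [
    lambda x: [0.6, 0.3, 0.1],
    lambda x: [0.4, 0.5, 0.1],
    lambda x: [0.5, 0.4, 0.1],
]
print(ensemble_predict(models, x=[1.0, 2.0]))  # -> class 0
```

Averaging is only one combination rule; picking the single highest-confidence output, as described above, is another.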
GPT-4’s Ensemble Structure
Recent reports, including a tweet by Soumith Chintala based on a podcast appearance by George Hotz, suggest that GPT-4 is not a single model [00:49:10]. Instead, GPT-4 is reportedly an ensemble of eight models, each with approximately 220 billion parameters [01:01:07]. These models likely differ slightly in what they were fine-tuned on [01:10:48]. This structure would explain both GPT-4’s performance and the high inference costs frequently mentioned by Sam Altman [02:20:00].
Inference Cost Implications
When a user interacts with GPT-4, it is suggested that 16 inference passes are performed [03:17:01]. In contrast, Google’s Bard performs inference on just one model [03:09:47]. This means OpenAI’s inference cost could be 16 times higher than Bard’s for a single query [03:31:02]. The output presented to the user is the result of these 16 inference passes across the eight models, likely with a value-function model selecting the best among them [03:26:07].
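A sketch of what such best-of-N selection could look like, under the assumption that a separate value model scores each candidate; `generate` and `value_model` below are hypothetical stand-ins, not OpenAI’s actual components:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in for one inference pass of one ensemble member."""
    rng = random.Random(seed)
    return f"{prompt} ... candidate #{seed} (quality={rng.random():.2f})"

def value_model(prompt: str, candidate: str) -> float:
    """Stand-in value function that scores how good a candidate looks."""
    return float(candidate.split("quality=")[1].rstrip(")"))

def best_of_n(prompt: str, n: int = 16) -> str:
    """Run n inference passes and let the value model pick the best output."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: value_model(prompt, c))

print(best_of_n("Explain model ensembling", n=16))
```

The key cost implication is visible in the structure: the user sees one answer, but `n` full forward passes (plus the scoring step) are paid for behind the scenes.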
Historical Precedent at OpenAI
The use of model ensembling is not new for OpenAI. A 2021 paper for Codex, an OpenAI model, describes a similar strategy [04:15:28]. For solving LeetCode-style problems, Codex would generate 100 samples per problem and then filter them down [04:40:02]. Additionally, the paper mentions fine-tuning Codex on training problems to produce “a set of supervised fine-tuned models,” referred to as Codex-S or “a set of Codex models” [04:54:19]. Given the timing of this paper, it is likely that ensembling was also applied to GPT models [05:14:52].
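A rough sketch of that sample-and-filter idea (the helper functions are hypothetical; the Codex paper itself filters by passing visible unit tests and otherwise ranks samples by model confidence):

```python
import random
from typing import Callable, Tuple

def sample_and_filter(
    sample_fn: Callable[[str], Tuple[str, float]],   # returns (program, confidence score)
    passes_tests: Callable[[str], bool],
    prompt: str,
    num_samples: int = 100,
) -> str:
    """Draw many candidate programs, keep one that passes the visible tests,
    otherwise fall back to the sample the model was most confident about."""
    samples = [sample_fn(prompt) for _ in range(num_samples)]
    passing = [program for program, _ in samples if passes_tests(program)]
    if passing:
        return passing[0]
    return max(samples, key=lambda s: s[1])[0]

# Toy usage with a fake sampler and a fake test harness.
fake_sampler = lambda p: (f"def solve(): return {random.randint(0, 3)}", random.random())
print(sample_and_filter(fake_sampler, lambda prog: prog.endswith("2"), "add two numbers"))
```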
Relevance to Model Performance
The use of model ensembling provides context for the observed “random jump in performance” seen with ChatGPT and explains much of the hype surrounding it [05:37:25]. By combining eight slightly different models, OpenAI effectively created what appears to be a single, more powerful model [05:46:17], and this combination accounts for a significant share of its overall performance.
Related Concepts in Self-Supervised Learning
The principles of ensembling and diverse model outputs are also relevant in other areas of machine learning, such as self-supervised learning for image representations. While not directly an ensemble, the concept of Joint Embedding Predictive Architectures (JEPA) emphasizes predicting representations from different parts of an image [09:15:46]. This approach aims to produce highly semantic image representations without relying on handcrafted data augmentations [07:34:04], a key difference from traditional invariance-based methods that introduce biases [01:18:24].
JEPA’s “multi-block masking strategy” samples large target blocks and uses spatially distributed context blocks to predict representations [10:04:40]. This method predicts missing information in an abstract representation space rather than pixel or token space [11:17:28]. This distinction is crucial for learning more semantic features, as it avoids wasting model capacity on unnecessary pixel-level details like texture and exact color [02:24:26]. This also significantly reduces total computation needed for self-supervised training [03:22:15].
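A deliberately simplified PyTorch sketch of that representation-space objective (the real I-JEPA uses Vision Transformers and predicts per-patch target representations conditioned on position; here, patches are flat vectors, representations are pooled, and all layer sizes are arbitrary assumptions):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_dim, embed_dim = 48, 64

context_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                nn.Linear(embed_dim, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))
# The target encoder is an EMA copy of the context encoder; no gradients flow into it.
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_loss(context_patches, target_patches):
    """Predict the target-block representation from the context-block representation."""
    ctx = context_encoder(context_patches).mean(dim=1)    # pool context patches
    pred = predictor(ctx)                                 # predicted target representation
    with torch.no_grad():
        tgt = target_encoder(target_patches).mean(dim=1)  # actual target representation
    return F.mse_loss(pred, tgt)                          # loss in representation space, not pixels

@torch.no_grad()
def ema_update(momentum: float = 0.996):
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

# Toy step: batch of 8 images, 10 context patches and 4 target patches each.
ctx_patches = torch.randn(8, 10, patch_dim)
tgt_patches = torch.randn(8, 4, patch_dim)
loss = jepa_loss(ctx_patches, tgt_patches)
loss.backward()
ema_update()
print(loss.item())
```

Because the loss is computed between pooled embeddings rather than reconstructed pixels, the model is never asked to spend capacity on texture or exact color, which is the point the section above makes.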
In conclusion, model ensembling techniques are powerful methods to boost performance and address challenges in model training, even impacting the cost and perceived capabilities of advanced AI models like GPT-4. The underlying principles, such as leveraging multiple perspectives or abstract representations, extend across various AI architectures and modalities.