From: hu-po

Introduction to Mixtures of Experts

Transformer architectures can scale model capacity without substantial increases in training or inference costs through the use of Mixtures of Experts (MoEs) [00:04:06]. Model capacity here refers to the size of a model, typically measured by its number of parameters [00:04:16]. While larger models normally demand more computational resources and time for training and inference, MoEs offer a way to keep these costs in check [00:04:30].

Genealogy of MoE Research

The research into Mixtures of Experts has evolved over several years, primarily driven by Google DeepMind [00:01:15].

Motivation for MoEs

The core idea behind MoEs stems from the observation that many neurons in traditional neural networks, particularly in fully connected layers, are inactive or contribute very little to the final prediction for any given input [00:11:22], [00:11:43]. By activating only a subset of the model’s parameters for a given input, MoEs aim to achieve:

  • Increased Model Capacity: Allowing for larger models [00:04:12].
  • Reduced Computational Cost: Lower training and inference expenses compared to dense models of similar capacity [00:04:14], [00:12:59]. This efficiency comes from leveraging the inherent sparsity of neural networks, where only a few pathways truly matter [00:13:16].

Sparse Mixture of Experts (Sparse MoE)

Sparse MoE architectures replace dense feed-forward network (FFN) layers within a Transformer block with a “mixture of independent FFNs,” referred to as experts [00:09:43], [00:10:01]. Each expert is a small, independent feed-forward neural network (typically a multi-layer perceptron or MLP) [00:18:11], [00:30:14].

A “router” determines which experts process which input tokens [00:10:20], [00:18:00]. This routing is typically framed as a discrete optimization problem, e.g., assigning each token to its top-k experts [00:18:00].
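To make the hard-routing step concrete, here is a minimal top-1 token-choice router sketched in JAX. It is illustrative only: the function names, the single-matrix “expert FFNs,” and the apply-all-experts-then-select trick are assumptions chosen for readability, not how production sparse MoEs are implemented.

```python
import jax
import jax.numpy as jnp

def top1_route(x, router_w, expert_ws):
    """Minimal top-1 token-choice routing sketch (illustrative only).

    x:         [num_tokens, d]        input tokens
    router_w:  [d, num_experts]       router projection
    expert_ws: [num_experts, d, d]    one weight matrix per (toy) expert FFN
    """
    logits = x @ router_w                        # [num_tokens, num_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    expert_idx = jnp.argmax(probs, axis=-1)      # hard, non-differentiable choice
    gate = jnp.max(probs, axis=-1, keepdims=True)

    # Apply every expert to every token, then select the routed output.
    # (Real systems gather/scatter tokens per expert instead; this is just for clarity.)
    all_out = jnp.einsum('td,edh->teh', x, expert_ws)                        # [tokens, experts, d]
    routed = jnp.take_along_axis(all_out, expert_idx[:, None, None], axis=1)[:, 0]
    return gate * routed                          # gating prob keeps a gradient path to the router
```

Note that the argmax makes the expert choice itself non-differentiable; only the gating probability provides a gradient path back to the router, which is one source of the training difficulties listed below.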

Issues with Sparse MoE

Despite their benefits, Sparse MoEs suffer from several challenges:

  • Training Instability [00:04:55].
  • Token Dropping: Some input tokens may not be routed to any expert [00:04:56], [00:19:58], [00:23:39].
  • Expert Imbalance: Some experts receive disproportionately more tokens than others [00:26:56], [00:29:10].
  • Inability to Scale Number of Experts: Hard routing can be challenging with many experts [00:04:58], [02:01:46].
  • Ineffective Fine-tuning [00:05:00].
  • Non-differentiability: Most sparse MoE approaches use discrete routing mechanisms, making them non-differentiable [00:52:46]. This complicates training as gradients cannot be smoothly backpropagated through the routing decisions [00:53:54].
  • Batch Effects at Inference: In sparse MoEs, inputs within a batch can compete for limited expert capacity, leading to non-deterministic outputs because the prediction for one input depends on others in the batch [00:27:58], [01:00:51].

Soft Mixture of Experts (Soft MoE)

Soft MoE addresses the limitations of Sparse MoE by introducing a “soft assignment” mechanism, replacing the discrete router with weighted combinations of input tokens [00:15:36], [00:21:26].

How Soft MoE Works

Instead of routing tokens to specific experts, Soft MoE passes different weighted combinations of all input tokens to each expert [00:06:05], [00:23:21].

  1. Input Slots: Each expert processes a fixed number of input “slots” (P) [00:33:42].
  2. Learnable Parameters (Phi): Each slot has a corresponding learnable parameter vector of the model dimension (collectively, Phi) [00:49:00], [00:49:51].
  3. Dispatch Weights (D): The input tokens (X) and the slot parameters (Phi) are multiplied to produce token–slot logits, and a softmax over the tokens yields the “dispatch weights” (D) [00:35:09], [00:37:39]. These weights define, for each slot, a convex combination of all the input tokens [00:34:42].
  4. Expert Processing: The weighted token combinations are fed into the experts (FFNs) [00:47:44].
  5. Combine Weights (C): The outputs of all expert slots are then mixed back into per-token outputs using “combine weights” (C), obtained from the same token–slot logits with a softmax over the slots [00:36:59], [00:37:39]. The final output (Y) has the same shape and dimensionality as the input [00:48:20].
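The following condensed JAX sketch ties steps 1–5 together. It mirrors the algorithm described above but is not the paper's reference code; the function names, the vmapped expert interface, and the single-example (unbatched) shapes are assumptions, and the L2 normalization discussed later is omitted for brevity.

```python
import jax
import jax.numpy as jnp

def soft_moe_layer(x, phi, expert_params, expert_fn):
    """Minimal Soft MoE sketch for one example (batching via jax.vmap is omitted).

    x:             [m, d]       m input tokens of dimension d
    phi:           [d, n, p]    learnable per-slot parameters (n experts, p slots each)
    expert_params: pytree of per-expert FFN parameters with leading axis n
    expert_fn:     applies one expert FFN: (params, [p, d]) -> [p, d]
    """
    logits = jnp.einsum('md,dnp->mnp', x, phi)               # token-slot logits

    # Dispatch weights: softmax over tokens -> one convex combination of tokens per slot.
    d_weights = jax.nn.softmax(logits, axis=0)               # [m, n, p]
    slot_inputs = jnp.einsum('mnp,md->npd', d_weights, x)    # [n, p, d]

    # Each expert processes its own p slots.
    slot_outputs = jax.vmap(expert_fn)(expert_params, slot_inputs)   # [n, p, d]

    # Combine weights: softmax over all n*p slots -> one convex combination of slot outputs per token.
    c_weights = jax.nn.softmax(logits.reshape(logits.shape[0], -1), axis=1)          # [m, n*p]
    y = jnp.einsum('ms,sd->md', c_weights, slot_outputs.reshape(-1, slot_outputs.shape[-1]))
    return y                                                  # [m, d], same shape as the input
```

With P = 1, phi holds exactly one d-dimensional vector per expert, which is the per-expert parameterization discussed under the hyperparameters below.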

Benefits of Soft MoE

  • Fully Differentiable: All operations in Soft MoE layers are continuous and fully differentiable, allowing for seamless gradient backpropagation and efficient learning of the weighting parameters [00:53:05], [00:54:19].
  • Immunity to Token Dropping and Expert Imbalance: Since every slot receives a weighted average of all tokens (even if some weights are very small), there is no token dropping or imbalance [00:56:38], [02:00:06].
  • Scalability: Soft MoE can scale to thousands of experts [00:52:52], and its cost is primarily determined by the total number of slots, not experts, enabling greater flexibility [00:57:47].
  • Per-Example Determinism: By combining all tokens within each input sequence, Soft MoE ensures per-example determinism, eliminating batch effects seen in Sparse MoE [01:06:13], [01:04:05].
  • Faster Inference: Significantly faster than most sparse MoEs due to avoiding slow sorting or top-K operations [00:58:35].

Similarities to Attention Mechanisms

Soft MoE shares conceptual similarities with attention mechanisms, especially in its use of weighted averages and the softmax function [01:13:12], [01:22:37]. Both approaches leverage the idea that every part of the input should have a way to contribute to the output, or that experts should have access to information from all tokens [01:03:16], [01:03:50].

However, key distinctions exist:

  • In multi-headed attention, each head processes a smaller, divided portion of the input’s dimensionality (e.g., D/H) [01:23:01].
  • In Soft MoE, experts are non-linear and combine vectors of the full dimensionality (D) at their input and output [01:24:27]. Every expert has a path to every single part of the input sequence due to the weighted aggregation [01:23:56].
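Writing the weights out makes the analogy, and the difference, concrete. With tokens X of shape m×d and per-slot parameters Phi of shape d×(n·p), as in the steps above, both weight matrices are softmaxes of the same logits X·Phi, taken along different axes (this restates the construction in equation form):

$$
D_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m} \exp\big((X\Phi)_{i'j}\big)},
\qquad
C_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{j'=1}^{n\cdot p} \exp\big((X\Phi)_{ij'}\big)}
$$

The slot inputs are $\tilde{X} = D^{\top}X$ and the layer output is $Y = C\,\tilde{Y}$, where $\tilde{Y}$ stacks the expert outputs. This resembles attention, with Phi playing a role similar to keys, except that the “values” here are produced by full-dimensionality, non-linear expert FFNs rather than by a linear projection into a D/H-sized head.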

Implementation Details

Soft MoE layers typically replace the MLP (FFN) blocks in the second half of the Transformer's blocks [00:38:52], and for Vision Transformers they are applied in the encoder [01:17:46].

  • Jax and Einsum: The paper provides a Jax implementation, utilizing jnp.einsum for efficient tensor operations [00:40:48], [00:41:10]. Einsum (Einstein summation) is a concise notation for complex tensor multiplications [00:41:52].
  • L2 Normalization: L2 normalization is applied to input (X) and learnable parameters (Phi) to maintain stability, especially when scaling model dimensions or increasing learning rates [01:13:52], [01:15:18].
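As a rough sketch of how that normalization might slot into the layer sketched earlier (the exact placement and form of the learnable scale are assumptions; consult the paper's reference code for the precise recipe):

```python
import jax
import jax.numpy as jnp

def l2_normalize(v, axis, eps=1e-6):
    """Scale vectors to unit L2 norm along `axis`."""
    return v * jax.lax.rsqrt(jnp.sum(v * v, axis=axis, keepdims=True) + eps)

def stable_logits(x, phi, scale):
    """Token-slot logits computed from normalized inputs.

    x:     [m, d]      input tokens
    phi:   [d, n, p]   per-slot parameters
    scale: scalar      learnable temperature (an assumption about the exact form)
    """
    x_n = l2_normalize(x, axis=-1)      # normalize each token over the model dimension d
    phi_n = l2_normalize(phi, axis=0)   # normalize each slot's parameters over d
    return scale * jnp.einsum('md,dnp->mnp', x_n, phi_n)
```

The point of normalizing both factors is that the softmax logits no longer blow up as the model dimension or the learning rate grows.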

Hyperparameters and Hardware Considerations

Key hyperparameters in Soft MoE include the number of experts (N) and the number of slots per expert (P) [00:33:09].

  • Total Slots and Cost: The total number of slots (N * P) is the primary factor determining the computational cost (FLOPs) of a Soft MoE layer [00:57:47].
  • Optimal Configuration: Experiments suggest that using one slot per expert (P=1) is often the optimal choice for performance [01:31:37], [01:54:55]. This configuration allows per-slot learnable parameters to effectively act as per-expert parameters, learning specific intricacies of each expert [01:32:50].
  • Hardware Impact: The choice of N and P depends heavily on the available hardware (e.g., GPUs, TPUs) and its interconnects, influencing memory and time complexity [00:39:21], [01:18:50].
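A tiny, purely illustrative calculation of the cost-versus-slots point (the configurations are hypothetical, not taken from the paper):

```python
# Illustrative only: the expert-MLP compute of a Soft MoE layer scales with the
# total number of slots (N * P), not with the number of experts alone.
def total_slots(num_experts: int, slots_per_expert: int) -> int:
    return num_experts * slots_per_expert

# Two hypothetical configurations with the same expert-MLP cost per layer:
assert total_slots(num_experts=128, slots_per_expert=1) == total_slots(num_experts=32, slots_per_expert=4)
```

Because both the expert-MLP compute and the dispatch/combine products depend on the total slot count N·P (times the sequence length for the latter), adding experts while reducing slots per expert leaves the layer's FLOPs essentially unchanged.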

Performance and Evaluation

Soft MoE models were extensively evaluated against dense Vision Transformer (ViT) models and against sparse MoE variants with Token Choice and Expert Choice routing [01:35:29], [01:35:54].

Training and Evaluation Methods

  • Dataset: Models were trained on JFT-4B, an internal Google dataset containing over 4 billion images across 29,000+ classes [01:37:16], [01:38:07].
  • Metrics: Evaluation focused on:
    • Upstream validation precision on JFT-4B [01:38:46].
    • ImageNet 10-shot accuracy (freezing model weights, replacing head, and training on 10 images per class) [01:38:51].
    • ImageNet fine-tuning accuracy (fully fine-tuning the entire model) [01:52:50].
  • Training Scale: Over 106 models were trained, ranging from 1 billion to 54 billion parameters, for up to 300K steps with a batch size of 4096 [01:42:50], [01:49:42].

Performance and Scalability

Soft MoE consistently “dominates” (outperforms in every metric [01:38:23], [01:38:35]) other approaches across various model sizes, training budgets, and evaluation metrics [01:43:28]:

  • Superior Performance: Achieves better accuracy and precision for the same amount of training and model size [01:44:43], [01:53:15].
  • Extended Training: Soft MoE models can be trained for significantly longer without overfitting, continuing to improve [01:55:58]. A Soft MoE at a smaller scale (e.g., ViT-S sized) can even match the quality of a much larger dense model (ViT-L) when trained for longer [01:57:21].
  • Cost Efficiency: Soft MoE models are computationally cheaper at inference time, offering significant wall-clock time reductions.
  • Robustness: Even with simplified routing mechanisms (like identity or uniform averaging), Soft MoE still outperforms dense Transformers [02:05:50].
  • Contrastive Learning: Soft MoE’s learned representations are also significantly better for other tasks, such as image-language contrastive learning, outperforming standard vision Transformers when used as an image encoder in a CLIP-like model [02:06:42], [02:07:38].

Conclusion

In summary, Soft MoE presents a robust and efficient alternative to traditional Transformer architectures and previous sparse MoE variants, offering superior performance and computational advantages by using a fully differentiable soft assignment mechanism [02:22:12], [02:22:50]. This innovation has the potential to significantly improve inference speed for large models, which is crucial for applications like robotics and fast control loops [02:21:12].