From: hu-po

Introduction to Mixtures of Experts

Transformer architectures can scale model capacity without substantial increases in training or inference costs through the use of Mixtures of Experts (MoEs) [00:04:06]. Model capacity here refers to the size of a model, typically measured by its number of parameters [00:04:16]. While larger models normally demand more computational resources and time for training and inference, MoEs offer a way to keep these costs in check [00:04:30].

Genealogy of MoE Research

The research into Mixtures of Experts has evolved over several years, primarily driven by Google DeepMind [00:01:15].

Motivation for MoEs

The core idea behind MoEs stems from the observation that many neurons in traditional neural networks, particularly in fully connected layers, are inactive or contribute very little to the final prediction for any given input [00:11:22], [00:11:43]. By activating only a subset of the model’s parameters for a given input, MoEs aim to achieve:

  • Increased Model Capacity: Allowing for larger models [00:04:12].
  • Reduced Computational Cost: Lower training and inference expenses compared to dense models of similar capacity [00:04:14], [00:12:59]. This efficiency comes from leveraging the inherent sparsity of neural networks, where only a few pathways truly matter [00:13:16].

Sparse Mixture of Experts (Sparse MoE)

Sparse MoE architectures replace dense feed-forward network (FFN) layers within a Transformer block with a “mixture of independent FFNs,” referred to as experts [00:09:43], [00:10:01]. Each expert is a small, independent feed-forward neural network (typically a multi-layer perceptron or MLP) [00:18:11], [00:30:14].

A “router” determines which experts process which input tokens [00:10:20], [00:18:00]. This routing is typically framed as a discrete optimization problem, e.g., assigning each token to its top-k experts [00:18:00].
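To make the hard-routing step concrete, here is a minimal top-1 token-choice router sketched in JAX. It is illustrative only: the function names, the single-matrix “expert FFNs,” and the apply-all-experts-then-select trick are assumptions chosen for readability, not how production sparse MoEs are implemented.

```python
import jax
import jax.numpy as jnp

def top1_route(x, router_w, expert_ws):
    """Minimal top-1 token-choice routing sketch (illustrative only).

    x:         [num_tokens, d]        input tokens
    router_w:  [d, num_experts]       router projection
    expert_ws: [num_experts, d, d]    one weight matrix per (toy) expert FFN
    """
    logits = x @ router_w                        # [num_tokens, num_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    expert_idx = jnp.argmax(probs, axis=-1)      # hard, non-differentiable choice
    gate = jnp.max(probs, axis=-1, keepdims=True)

    # Apply every expert to every token, then select the routed output.
    # (Real systems gather/scatter tokens per expert instead; this is just for clarity.)
    all_out = jnp.einsum('td,edh->teh', x, expert_ws)                        # [tokens, experts, d]
    routed = jnp.take_along_axis(all_out, expert_idx[:, None, None], axis=1)[:, 0]
    return gate * routed                          # gating prob keeps a gradient path to the router
```

Note that the argmax makes the expert choice itself non-differentiable; only the gating probability provides a gradient path back to the router, which is one source of the training difficulties listed below.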

Issues with Sparse MoE

Despite their benefits, Sparse MoEs suffer from several challenges:

  • Training Instability [00:04:55].
  • Token Dropping: Some input tokens may not be routed to any expert [00:04:56], [00:19:58], [00:23:39].
  • Expert Imbalance: Some experts receive disproportionately more tokens than others [00:26:56], [00:29:10].
  • Inability to Scale Number of Experts: Hard routing can be challenging with many experts [00:04:58], [02:01:46].
  • Ineffective Fine-tuning [00:05:00].
  • Non-differentiability: Most sparse MoE approaches use discrete routing mechanisms, making them non-differentiable [00:52:46]. This complicates training as gradients cannot be smoothly backpropagated through the routing decisions [00:53:54].
  • Batch Effects at Inference: In sparse MoEs, inputs within a batch can compete for limited expert capacity, leading to non-deterministic outputs because the prediction for one input depends on others in the batch [00:27:58], [01:00:51].

Soft Mixture of Experts (Soft MoE)

Soft MoE addresses the limitations of Sparse MoE by introducing a “soft assignment” mechanism, replacing the discrete router with weighted combinations of input tokens [00:15:36], [00:21:26].

How Soft MoE Works

Instead of routing tokens to specific experts, Soft MoE passes different weighted combinations of all input tokens to each expert [00:06:05], [00:23:21].

  1. Input Slots: Each expert processes a fixed number of input “slots” (P) [00:33:42].
  2. Learnable Parameters (Phi): Each slot has a corresponding learnable parameter vector of the model dimension (collectively, Phi) [00:49:00], [00:49:51].
  3. Dispatch Weights (D): The input tokens (X) and the slot parameters (Phi) are multiplied to produce token–slot logits, and a softmax over the tokens yields the “dispatch weights” (D) [00:35:09], [00:37:39]. These weights define, for each slot, a convex combination of all the input tokens [00:34:42].
  4. Expert Processing: The weighted token combinations are fed into the experts (FFNs) [00:47:44].
  5. Combine Weights (C): The outputs of all expert slots are then mixed back into per-token outputs using “combine weights” (C), obtained from the same token–slot logits with a softmax over the slots [00:36:59], [00:37:39]. The final output (Y) has the same shape and dimensionality as the input [00:48:20].
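The following condensed JAX sketch ties steps 1–5 together. It mirrors the algorithm described above but is not the paper's reference code; the function names, the vmapped expert interface, and the single-example (unbatched) shapes are assumptions, and the L2 normalization discussed later is omitted for brevity.

```python
import jax
import jax.numpy as jnp

def soft_moe_layer(x, phi, expert_params, expert_fn):
    """Minimal Soft MoE sketch for one example (batching via jax.vmap is omitted).

    x:             [m, d]       m input tokens of dimension d
    phi:           [d, n, p]    learnable per-slot parameters (n experts, p slots each)
    expert_params: pytree of per-expert FFN parameters with leading axis n
    expert_fn:     applies one expert FFN: (params, [p, d]) -> [p, d]
    """
    logits = jnp.einsum('md,dnp->mnp', x, phi)               # token-slot logits

    # Dispatch weights: softmax over tokens -> one convex combination of tokens per slot.
    d_weights = jax.nn.softmax(logits, axis=0)               # [m, n, p]
    slot_inputs = jnp.einsum('mnp,md->npd', d_weights, x)    # [n, p, d]

    # Each expert processes its own p slots.
    slot_outputs = jax.vmap(expert_fn)(expert_params, slot_inputs)   # [n, p, d]

    # Combine weights: softmax over all n*p slots -> one convex combination of slot outputs per token.
    c_weights = jax.nn.softmax(logits.reshape(logits.shape[0], -1), axis=1)          # [m, n*p]
    y = jnp.einsum('ms,sd->md', c_weights, slot_outputs.reshape(-1, slot_outputs.shape[-1]))
    return y                                                  # [m, d], same shape as the input
```

With P = 1, phi holds exactly one d-dimensional vector per expert, which is the per-expert parameterization discussed under the hyperparameters below.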

Benefits of Soft MoE

  • Fully Differentiable: All operations in Soft MoE layers are continuous and fully differentiable, allowing for seamless gradient backpropagation and efficient learning of the weighting parameters [00:53:05], [00:54:19].
  • Immunity to Token Dropping and Expert Imbalance: Since every slot receives a weighted average of all tokens (even if some weights are very small), there is no token dropping or imbalance [00:56:38], [02:00:06].
  • Scalability: Soft MoE can scale to thousands of experts [00:52:52], and its cost is primarily determined by the total number of slots, not experts, enabling greater flexibility [00:57:47].
  • Per-Example Determinism: By combining all tokens within each input sequence, Soft MoE ensures per-example determinism, eliminating batch effects seen in Sparse MoE [01:06:13], [01:04:05].
  • Faster Inference: Significantly faster than most sparse MoEs due to avoiding slow sorting or top-K operations [00:58:35].

Similarities to Attention Mechanisms

Soft MoE shares conceptual similarities with attention mechanisms, especially in its use of weighted averages and the softmax function [01:13:12], [01:22:37]. Both approaches leverage the idea that every part of the input should have a way to contribute to the output, or that experts should have access to information from all tokens [01:03:16], [01:03:50].

However, key distinctions exist:

  • In multi-headed attention, each head processes a smaller, divided portion of the input’s dimensionality (e.g., D/H) [01:23:01].
  • In Soft MoE, experts are non-linear and combine vectors of the full dimensionality (D) at their input and output [01:24:27]. Every expert has a path to every single part of the input sequence due to the weighted aggregation [01:23:56].
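Writing the weights out makes the analogy, and the difference, concrete. With tokens X of shape m×d and per-slot parameters Phi of shape d×(n·p), as in the steps above, both weight matrices are softmaxes of the same logits X·Phi, taken along different axes (this restates the construction in equation form):

$$
D_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m} \exp\big((X\Phi)_{i'j}\big)},
\qquad
C_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{j'=1}^{n\cdot p} \exp\big((X\Phi)_{ij'}\big)}
$$

The slot inputs are $\tilde{X} = D^{\top}X$ and the layer output is $Y = C\,\tilde{Y}$, where $\tilde{Y}$ stacks the expert outputs. This resembles attention, with Phi playing a role similar to keys, except that the “values” here are produced by full-dimensionality, non-linear expert FFNs rather than by a linear projection into a D/H-sized head.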

Implementation Details

Soft MoE layers typically replace the MLP (FFN) blocks in the second half of the Transformer's blocks [00:38:52], and for Vision Transformers they are applied in the encoder [01:17:46].

  • Jax and Einsum: The paper provides a Jax implementation, utilizing jnp.einsum for efficient tensor operations [00:40:48], [00:41:10]. Einsum (Einstein summation) is a concise notation for complex tensor multiplications [00:41:52].
  • L2 Normalization: L2 normalization is applied to input (X) and learnable parameters (Phi) to maintain stability, especially when scaling model dimensions or increasing learning rates [01:13:52], [01:15:18].
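As a rough sketch of how that normalization might slot into the layer sketched earlier (the exact placement and form of the learnable scale are assumptions; consult the paper's reference code for the precise recipe):

```python
import jax
import jax.numpy as jnp

def l2_normalize(v, axis, eps=1e-6):
    """Scale vectors to unit L2 norm along `axis`."""
    return v * jax.lax.rsqrt(jnp.sum(v * v, axis=axis, keepdims=True) + eps)

def stable_logits(x, phi, scale):
    """Token-slot logits computed from normalized inputs.

    x:     [m, d]      input tokens
    phi:   [d, n, p]   per-slot parameters
    scale: scalar      learnable temperature (an assumption about the exact form)
    """
    x_n = l2_normalize(x, axis=-1)      # normalize each token over the model dimension d
    phi_n = l2_normalize(phi, axis=0)   # normalize each slot's parameters over d
    return scale * jnp.einsum('md,dnp->mnp', x_n, phi_n)
```

The point of normalizing both factors is that the softmax logits no longer blow up as the model dimension or the learning rate grows.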

Hyperparameters and Hardware Considerations

Key hyperparameters in Soft MoE include the number of experts (N) and the number of slots per expert (P) [00:33:09].

  • Total Slots and Cost: The total number of slots (N * P) is the primary factor determining the computational cost (FLOPs) of a Soft MoE layer [00:57:47].
  • Optimal Configuration: Experiments suggest that using one slot per expert (P=1) is often the optimal choice for performance [01:31:37], [01:54:55]. This configuration allows per-slot learnable parameters to effectively act as per-expert parameters, learning specific intricacies of each expert [01:32:50].
  • Hardware Impact: The choice of N and P depends heavily on the available hardware (e.g., GPUs, TPUs) and its interconnects, influencing memory and time complexity [00:39:21], [01:18:50].
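A tiny, purely illustrative calculation of the cost-versus-slots point (the configurations are hypothetical, not taken from the paper):

```python
# Illustrative only: the expert-MLP compute of a Soft MoE layer scales with the
# total number of slots (N * P), not with the number of experts alone.
def total_slots(num_experts: int, slots_per_expert: int) -> int:
    return num_experts * slots_per_expert

# Two hypothetical configurations with the same expert-MLP cost per layer:
assert total_slots(num_experts=128, slots_per_expert=1) == total_slots(num_experts=32, slots_per_expert=4)
```

Because both the expert-MLP compute and the dispatch/combine products depend on the total slot count N·P (times the sequence length for the latter), adding experts while reducing slots per expert leaves the layer's FLOPs essentially unchanged.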

Performance and Evaluation

Soft MoE models were extensively evaluated against dense Vision Transformer (ViT) models and against sparse MoE variants with Token Choice and Expert Choice routing [01:35:29], [01:35:54].

Training and Evaluation Methods

  • Dataset: Models were trained on JFT-4B, an internal Google dataset containing over 4 billion images across 29,000+ classes [01:37:16], [01:38:07].
  • Metrics: Evaluation focused on:
    • Upstream validation precision on JFT-4B [01:38:46].
    • ImageNet 10-shot accuracy (freezing model weights, replacing head, and training on 10 images per class) [01:38:51].
    • ImageNet fine-tuning accuracy (fully fine-tuning the entire model) [01:52:50].
  • Training Scale: Over 106 models were trained, ranging from 1 billion to 54 billion parameters, for up to 300K steps with a batch size of 4096 [01:42:50], [01:49:42].

Performance and Scalability

Soft MoE consistently “dominates” (outperforms in every metric [01:38:23], [01:38:35]) other approaches across various model sizes, training budgets, and evaluation metrics [01:43:28]:

  • Superior Performance: Achieves better accuracy and precision for the same amount of training and model size [01:44:43], [01:53:15].
  • Extended Training: Soft MoE models can be trained for significantly longer without overfitting, continuing to improve [01:55:58]. A Soft MoE at a smaller scale (e.g., ViT-S sized) can even match the quality of a much larger dense model (ViT-L) when trained for longer [01:57:21].
  • Cost Efficiency: Soft MoE models are computationally cheaper at inference time, offering significant wall-clock time reductions.
  • Robustness: Even with simplified routing mechanisms (like identity or uniform averaging), Soft MoE still outperforms dense Transformers [02:05:50].
  • Contrastive Learning: Soft MoE’s learned representations are also significantly better for other tasks, such as image-language contrastive learning, outperforming standard vision Transformers when used as an image encoder in a CLIP-like model [02:06:42], [02:07:38].

Conclusion

In summary, Soft MoE presents a robust and efficient alternative to traditional Transformer architectures and previous sparse MoE variants, offering superior performance and computational advantages by using a fully differentiable soft assignment mechanism [02:22:12], [02:22:50]. This innovation has the potential to significantly improve inference speed for large models, which is crucial for applications like robotics and fast control loops [02:21:12].