From: hu-po

Efficiently managing computational resources is a critical challenge in training and deploying large language models, especially those based on Transformer architectures. The “Mixture of Depths” (MoD) paper proposes a novel approach to dynamically allocate compute, significantly enhancing Transformer block efficiency during both training and inference [00:03:00].

Transformer Block Computation

A standard Transformer block consists of two main parts:

  1. Multi-head self-attention mechanism: This component treats every token in an input sequence equally, computing interactions between all positions in the sequence [00:04:18]. The self-attention operation is computationally expensive, scaling quadratically (O(N²)) with the input sequence length [00:14:23], [00:35:50].
  2. Feed-forward network (MLP): This component likewise applies the same computation to every token [00:14:35].

Traditional Transformer models expend the same amount of computation (FLOPs – floating-point operations) per token in a forward pass [00:04:02], [00:14:01].
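
For reference, here is a minimal PyTorch sketch of a standard pre-norm Transformer block (the module names and the 4x MLP width are common conventions chosen for illustration, not details from the paper). Every token flows through both sub-layers, so the per-token cost is uniform:

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Standard pre-norm Transformer block: every token gets the same compute."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention mixes information across all N positions: O(N^2) cost.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # The MLP is applied identically to every token: O(N) cost.
        x = x + self.mlp(self.norm2(x))
        return x
```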

Mixture of Depths (MoD) for Efficiency

MoD is an efficiency technique designed to intelligently route tokens, allowing them to skip unnecessary computations [00:03:16]. Instead of uniformly spreading FLOPs, MoD allows Transformers to learn to dynamically allocate compute to specific positions in a sequence [00:04:51].

This dynamic allocation is achieved by:

  • Conditional Skipping: Tokens can either participate in a block’s computation or pass through a residual connection, remaining unchanged and saving compute [00:23:30]. The residual connection is computationally inexpensive compared to the attention and MLP blocks [00:43:13].
  • Routing Mechanism: A learnable “router” determines which tokens will participate. This router emits a scalar weight for each token, expressing its preference for that token to participate or to route around the block [00:36:56].
  • Top-K Selection: The router uses a top-K mechanism to select tokens for computation, ensuring a predefined number of tokens always go through the block [00:37:59]. This means if 100 tokens are input and the capacity is 50%, exactly 50 tokens will go through, and 50 will skip [00:54:54] (a minimal code sketch follows this list).
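
A minimal sketch of this routing, reusing the TransformerBlock above and assuming a 50% capacity (the router parameterization, gating, and gather/scatter details are simplified guesses, not the paper's exact implementation):

```python
class MoDBlock(nn.Module):
    """Mixture-of-Depths wrapper: only the top-k scoring tokens enter the block;
    the rest ride the residual stream unchanged."""

    def __init__(self, d_model: int, n_heads: int, capacity: float = 0.5):
        super().__init__()
        self.block = TransformerBlock(d_model, n_heads)
        self.router = nn.Linear(d_model, 1)   # scalar preference per token
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        k = max(1, int(N * self.capacity))             # fixed token budget
        scores = self.router(x).squeeze(-1)            # (B, N)
        top = torch.topk(scores, k, dim=-1).indices    # indices of participating tokens
        idx = top.unsqueeze(-1).expand(-1, -1, D)      # (B, k, D)
        chosen = torch.gather(x, 1, idx)               # gather only the selected tokens
        # Weight the block's contribution by the router score so routing is learnable.
        gate = torch.sigmoid(torch.gather(scores, 1, top)).unsqueeze(-1)
        updated = chosen + gate * (self.block(chosen) - chosen)
        out = x.clone()                                # skipped tokens stay unchanged
        out.scatter_(1, idx, updated)
        return out
```

Because k is fixed before the forward pass, the shapes of the gathered and processed tensors are known ahead of time, which is the static-graph property discussed below.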

Benefits of MoD

MoD provides significant advantages:

  • Compute Efficiency: By capping the number of tokens that participate (e.g., to 50%), the self-attention computation, which is O(N²), becomes approximately 25% as intensive (because (N/2)² = N²/4; a quick numerical check follows this list) [00:40:59], [01:09:09]. Overall, MoD can lead to inference speed improvements of up to 50% [00:38:39], [01:41:50].
  • Static Computation Graph: A key innovation is that MoD maintains a static computation graph with known tensor sizes [00:06:24], [00:38:12]. This is crucial for efficient execution on hardware, as dynamic graph changes can lead to underutilization [00:38:51].
  • Improved Quality (Lower Loss): MoD models not only train faster but also achieve lower loss (better performance) than baseline models using equivalent FLOPs [01:06:06], [01:09:16]. This suggests that intelligently skipping layers can benefit the learning process, possibly by allowing information at different levels of abstraction to be selectively processed [00:32:00], [00:42:27].
  • KV Cache Size Reduction: MoD models can get by with a smaller KV cache during autoregressive sampling, as less information needs to be stored for the attention mechanism [01:13:30].
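
A quick numerical check of that quadratic saving (the 2·N²·d estimate for the two attention matmuls is a common rule of thumb, not a figure from the paper):

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough FLOPs for the QK^T and attention-weighted V matmuls: ~2 * N^2 * d."""
    return 2 * n_tokens ** 2 * d_model

N, d = 1024, 512
full = attention_flops(N, d)          # all tokens attend
half = attention_flops(N // 2, d)     # 50% capacity: only N/2 tokens attend
print(half / full)                    # 0.25 -> roughly a quarter of the attention cost
```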

MoD and Hybrid Architectures

MoD builds upon and relates to other architectural innovations:

  • Relation to Mixture of Experts (MoE): MoD is seen as a “pun on” or variant of Mixture of Experts [00:03:36]. While MoE uses a router to select among multiple specialized “experts” (typically MLPs) to improve quality, MoD uses a router to decide whether to send a token through a block or skip it, primarily for efficiency [00:34:14], [00:35:09].
  • Integration with MoE: MoD can be combined with existing MoE models.
    • Staged MoD: A new router is added before the entire Transformer block to decide whether to skip the attention and MLP [01:17:18]. This offers the full computational benefits by skipping the expensive self-attention [01:20:02].
    • Integrated MoD: An existing MoE router can be modified to include a “noop” (no operation) expert, which effectively skips the MLP component [01:18:38]. This simplifies the routing machinery but doesn’t allow skipping the self-attention [01:19:55] (see the sketch after this list).
  • Application to Mamba and Jamba: While Transformers have O(N²) attention complexity, Mamba models use state-space layers whose cost scales linearly with sequence length [01:11:05]. The Jamba model is a hybrid architecture that combines Transformer blocks and Mamba blocks [01:10:43]. MoD’s principles of conditional compute allocation could potentially be applied to Mamba or hybrid architectures, although the efficiency gains might be less pronounced for already-efficient Mamba blocks [01:10:53].
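
A simplified sketch of the “noop” expert idea behind Integrated MoD, assuming top-1 routing and reusing the imports from the first sketch (the expert count, gating, and residual handling are illustrative choices rather than the paper’s exact design):

```python
class MoEWithNoop(nn.Module):
    """MoE MLP layer whose router has one extra slot meaning 'skip the MLP'."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # n_experts real experts + 1 logit for the no-op (skip) choice.
        self.router = nn.Linear(d_model, n_experts + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)             # (B, N, n_experts + 1)
        gates = logits.softmax(dim=-1)
        choice = logits.argmax(dim=-1)      # top-1 routing decision per token
        out = x.clone()                     # default: no-op, token passes through unchanged
        for e, expert in enumerate(self.experts):
            mask = choice == e              # tokens routed to expert e
            if mask.any():
                gate = gates[mask, e].unsqueeze(-1)   # keep the gate differentiable
                out[mask] = x[mask] + gate * expert(x[mask])
        return out  # tokens whose argmax hit the extra slot skip the MLP entirely
```

Since the self-attention in the surrounding block still runs for every token, this integrated variant only saves MLP compute, matching the trade-off noted above.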

Training and Inference Considerations

During training, MoD uses “expert choice” routing, in which the router sees the entire sequence and selects the top-K tokens [00:50:53]. However, during autoregressive sampling (inference), the model cannot see future tokens. To address this, MoD employs a small auxiliary MLP predictor (a “second router”) that predicts whether a token will be among the top-K [00:57:52]. This causal prediction approach leads to minimal performance degradation compared to the non-causal training scheme [01:21:14].
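
A rough sketch of such an auxiliary predictor, reusing the imports and routing sketch above (the hidden size, loss, and training signal are assumptions made for illustration):

```python
import torch.nn.functional as F


class CausalTopKPredictor(nn.Module):
    """Tiny MLP that predicts, from a token's own representation only,
    whether the (non-causal) top-k router would have selected it."""

    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No look-ahead: each token is scored independently, so the prediction
        # remains valid during autoregressive decoding.
        return torch.sigmoid(self.net(x)).squeeze(-1)   # (B, N) selection probabilities


def predictor_loss(pred: torch.Tensor, topk_indices: torch.Tensor) -> torch.Tensor:
    """Supervise against the non-causal router's actual top-k decisions."""
    targets = torch.zeros_like(pred)
    targets.scatter_(1, topk_indices, 1.0)   # 1 for selected tokens, 0 otherwise
    return F.binary_cross_entropy(pred, targets)
```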

Depth vs. Width

The research also indicates that when adding FLOPs to a model, it is empirically better to add depth (more layers) rather than width (more neurons per layer) [01:12:25]. This finding aligns with the general intuition in deep learning that deeper networks can learn richer hierarchies of features [01:11:51]. MoD allows for this by freeing up computational budget that can then be reinvested into adding more layers.

MoD is anticipated to be widely adopted due to its simplicity and significant benefits in both speed and performance [01:42:36].