From: hu-po

The xLSTM (Extended Long Short-Term Memory) is a new architecture introduced in a paper published on May 7, 2024 [03:12]. It is an evolution of the traditional Long Short-Term Memory (LSTM) network. The primary author, Sep Hochreiter, is also the original author of the LSTM [03:28]. The development of xLSTM aims to re-establish LSTMs as a relevant and scalable alternative in the landscape of modern deep learning architectures, particularly challenging the dominance of Transformers [03:38].

Historical Context: The Original LSTM

The LSTM architecture emerged in the 1990s [03:50], with foundational papers appearing in 1991 and 1997 [04:09]. LSTMs introduced core ideas like the “constant error carousel” and “gating” mechanisms [03:54].

Key Characteristics and Successes

LSTMs are a type of recurrent neural network (RNN) [09:01]. They process information sequentially, where the computation at a given time step relies on information passed forward from the previous time step [09:12]. This recurrent nature makes them particularly well-suited for modeling time-series data [10:39].

LSTMs have been central to numerous deep learning successes:

  • AlphaStar (StarCraft 2) [08:29]
  • OpenAI Five (DOTA) [08:32]
  • Magnetic controller for the Tokamak reactor (DeepMind) [08:35]

They also constituted the first Large Language Models (LLMs), though these early LSTM-based LLMs were not as performant as modern Transformer-based models like ChatGPT [04:36].

Core Components of LSTM

The original LSTM’s architecture involves several gates and a cell state (C_t):

  • Cell State (Constant Error Carousel): C_t carries information across time steps [15:18].
  • Forget Gate (F_t): Controls what information from the previous cell state (C_{t-1}) should be “forgotten” or erased [15:56].
  • Input Gate (I_t): Determines what new information from the current input (X_t) should be “added” to the cell state [16:47].
  • Output Gate (O_t): Controls what part of the cell state is exposed as the hidden state (H_t) [21:30].
  • Activation Functions: Typically sigmoid (for gates, outputs 0-1) and tanh (for cell input, outputs -1 to 1) [18:00].

All components of the memory cell in an LSTM are crucial for its operation [20:33].
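For reference, here is a minimal NumPy sketch of a single time step of the classic LSTM described above. The weight names (the W, U, and b dictionaries) are illustrative placeholders, not the paper's notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a vanilla LSTM. W, U, b are dicts of per-gate parameters."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    z_t = np.tanh(W["z"] @ x_t + U["z"] @ h_prev + b["z"])   # cell input
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = f_t * c_prev + i_t * z_t                           # constant error carousel
    h_t = o_t * np.tanh(c_t)                                 # hidden state
    return h_t, c_t
```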

Limitations of Original LSTMs

Despite their strengths, traditional LSTMs faced limitations:

  • Inability to revise storage decisions: Once information is “forgotten” or “overwritten” in the cell state, it’s lost [12:18].
  • Limited storage capacity: The cell state is a fixed-length vector, which limits the amount of historical information that can be retained over very long sequences [12:49].
  • Lack of parallelizability: Due to their recurrent nature, computation at each time step depends on the output of the previous step. This sequential dependency prevents parallel computation, which is critical for scaling Large Language Models on modern hardware like GPUs [13:42]. This is in contrast to Transformers, which benefit from parallel self-attention [05:09].
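To make the parallelization constraint concrete, the following generic sketch shows why any such recurrence must be unrolled step by step: each output depends on the state produced by the previous step, so the time loop cannot be computed in parallel the way self-attention over a full sequence can. The step function is a placeholder for any recurrent cell (for example, the LSTM step sketched earlier):

```python
def run_recurrent(step_fn, xs, state):
    """Sequential scan: step t cannot begin until step t-1 has produced its state."""
    outputs = []
    for x_t in xs:                      # inherently serial over time
        y_t, state = step_fn(x_t, state)
        outputs.append(y_t)
    return outputs, state
```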

Innovations in xLSTM

xLSTM addresses the limitations of traditional LSTMs through two main innovations: exponential gating and modified memory structures.

1. Exponential Gating

To give the network the ability to revise storage decisions, xLSTM replaces the sigmoid activation functions in the input and forget gates with exponential activation functions [26:00].

  • Challenge: Exponential functions can lead to very large or very small values, causing numerical overflow or underflow when represented on computer hardware (e.g., float16, bfloat16 data types) [29:49].
  • Stabilization: To mitigate this, xLSTM introduces an additional stabilizer state (M_t), analogous to how softmax normalization prevents numerical instability with exponentials [30:00]. This state tracks a running maximum (in log space) of the gate pre-activations, which is subtracted before exponentiation so the numbers remain within a representable range [35:11] (see the sketch below).
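A minimal sketch of this stabilization idea, assuming an exponential forget gate (the paper also allows a sigmoid forget gate) and paraphrasing the update from memory rather than reproducing the paper's exact equations:

```python
import numpy as np

def stabilized_exp_gates(i_preact, f_preact, m_prev):
    """Exponential input/forget gates with a running stabilizer state m_t.
    Subtracting m_t in log space keeps exp() within float16/bfloat16 range,
    much like subtracting the max logit before a softmax."""
    m_t = np.maximum(f_preact + m_prev, i_preact)   # running max in log space
    i_t = np.exp(i_preact - m_t)                    # stabilized input gate
    f_t = np.exp(f_preact + m_prev - m_t)           # stabilized forget gate
    return i_t, f_t, m_t
```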

2. Modified Memory Structures (sLSTM and mLSTM)

To enhance storage capacity and enable parallel training, xLSTM introduces two variants: the sLSTM and the mLSTM.

  • sLSTM (scalar-memory, scalar-update LSTM):

    • This variant primarily incorporates exponential gating and normalization/stabilization techniques [38:53].
    • It retains the memory mixing via recurrent connections from the hidden state (H_{t-1}) to the memory cell input [39:27].
    • Due to this memory mixing, the sLSTM is not parallelizable during training, as each step's computation depends on the previous step's output [06:29]. The authors did, however, develop a fast CUDA kernel for it [06:29].
  • mLSTM (Matrix-memory LSTM):

    • This is the more significant innovation regarding parallelizability.
    • It expands the LSTM memory cell from a scalar cell state (c_t) into a matrix (C_t) [44:13]. This is somewhat analogous to the KV cache in Transformers [44:57].
    • The mLSTM reuses Transformer terminology, calling its projections Keys (K_t), Values (V_t), and Queries (Q_t) [45:09].
    • Crucially, the mLSTM eliminates memory mixing in its recurrence [48:18]: its gates depend only on the current input (X_t) rather than on the previous hidden state (H_{t-1}), and in the block architecture the output gate (O_t) is externalized [16:11], [16:22].
    • This design allows the recurrence to be reformulated in a parallel form [48:21], enabling efficient parallel training on GPUs (a sketch of the recurrent form follows below).
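Below is a rough NumPy sketch of one recurrent step of a matrix-memory cell of this kind, following the description above: a matrix state C_t, key/value/query projections, and gates that see only the current input. The normalizer state n_t, the key scaling, and the parameter names are assumptions in the spirit of the paper rather than a verbatim reproduction, and the stabilizer from the earlier sketch is omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x_t, C_prev, n_prev, params):
    """One recurrent step of a matrix-memory LSTM-style cell (illustrative)."""
    d = params["W_k"].shape[0]
    q_t = params["W_q"] @ x_t                      # query
    k_t = (params["W_k"] @ x_t) / np.sqrt(d)       # key (scaling is an assumption)
    v_t = params["W_v"] @ x_t                      # value
    i_t = np.exp(params["w_i"] @ x_t)              # exponential input gate (scalar)
    f_t = sigmoid(params["w_f"] @ x_t)             # forget gate (scalar)
    o_t = sigmoid(params["W_o"] @ x_t)             # output gate, input-only (no H_{t-1})
    C_t = f_t * C_prev + i_t * np.outer(v_t, k_t)  # matrix memory update
    n_t = f_t * n_prev + i_t * k_t                 # normalizer state (assumption)
    h_tilde = C_t @ q_t / max(abs(n_t @ q_t), 1.0) # memory readout
    h_t = o_t * h_tilde                            # gated hidden state
    return h_t, C_t, n_t
```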

xLSTM Architecture Design

xLSTM models are constructed by residually stacking building blocks [06:56]. This means that xLSTM blocks (like Transformer blocks) are stacked one on top of the other, with residual connections (or skip connections) allowing gradients to flow around the blocks [07:32].
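In code, this residual stacking is the same pattern used for Transformer blocks; a minimal sketch with placeholder block and normalization callables:

```python
def residual_stack(x, blocks, norms):
    """Stack blocks with skip connections so gradients can flow around each block."""
    for block, norm in zip(blocks, norms):
        x = x + block(norm(x))          # pre-norm residual (skip) connection
    return x
```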

Two main designs for these blocks are proposed:

  1. Residual Block with Post-Up Projection (primarily for sLSTM):

    • Employs a pre-layer normalization and residual structure [59:09].
    • Input can optionally pass through a causal convolution with a Swish activation function [59:09]. Causal convolutions are useful for time-series data as they ensure that the output at a given time step only depends on current and past inputs, not future ones [01:00:09].
    • The cell input, input gate, forget gate, and output gate are fed through a block-diagonal linear layer with four diagonal blocks (or “heads”) [01:02:23]. This limits connectivity to within each head, promoting local processing [01:10:57].
    • The hidden state then passes through a group normalization layer [01:02:23].
    • The output is up-projected to a higher dimension and then down-projected back, using a gated MLP with a GELU activation function. This up-projection is motivated by Cover’s Theorem, which suggests that patterns may be more linearly separable in higher-dimensional spaces [00:57:24].
  2. Residual Block with Pre-Up Projection (primarily for mLSTM; a structural sketch follows after this list):

    • Also uses a pre-layer norm residual structure [01:12:56].
    • The input is up-projected first (e.g., by a factor of two) before being fed into the mLSTM cell [01:13:06].
    • It features an “externalized output gate” which is separate from the main cell calculation and only depends on the input, enabling parallelization [01:13:16].
    • The mLSTM cell input is dimension-wise causally convolved with a Swish activation [01:13:36].
    • Queries (Q) and Keys (K) are obtained via block-diagonal projection matrices [01:17:17]. Values (V) skip the convolution part and are fed directly [01:17:22].
    • After sequence mixing, outputs are normalized via group normalization [01:17:35].
    • A learnable skip input is added, and the result is combined with the external output gate, followed by a down-projection [01:17:56].
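The following structural sketch wires together the pre-up-projection block exactly as listed above. Every entry of the parameter dictionary is a placeholder callable (or tensor, for the learnable skip); only the composition is meant to be informative, not the paper's implementation details:

```python
def mlstm_block(x, p):
    """Structural sketch of the pre-up-projection block described above."""
    residual = x
    h = p["layer_norm"](x)                       # pre-LayerNorm
    up = p["up_proj"](h)                         # up-projection (e.g., 2x wider)
    gate = p["output_gate"](up)                  # externalized output gate (input-only)
    conv = p["swish"](p["causal_conv"](up))      # causal conv + Swish on the cell input
    q, k = p["q_proj"](conv), p["k_proj"](conv)  # block-diagonal Q/K projections
    v = p["v_proj"](up)                          # values skip the convolution
    h = p["mlstm_cell"](q, k, v)                 # sequence mixing via the matrix memory
    h = p["group_norm"](h)                       # group normalization
    h = h + p["learnable_skip"] * conv           # learnable skip input
    h = gate * h                                 # combine with the external output gate
    return residual + p["down_proj"](h)          # down-projection + residual connection
```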

Hybrid Models

Notably, a full xLSTM model is often composed of a mix of sLSTM blocks and mLSTM blocks [01:19:08]. For example, an “xLSTM[7:1]” means that out of every eight blocks, seven are mLSTM and one is sLSTM [01:19:15]. This approach of combining different architectural blocks (as in the Jamba model, which combines Transformer and Mamba blocks) suggests a potential future in which optimal models are hybrids, leveraging the strengths of different recurrent and parallel architectures [01:20:02].
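As an illustration only (the exact positions of the sLSTM blocks within the stack are a separate design choice and may differ from this), a 7:1 mix could be assembled like so:

```python
def build_hybrid_stack(num_blocks, make_mlstm_block, make_slstm_block, group=8):
    """Illustrative 7:1 interleaving: one sLSTM block in every group of eight."""
    return [make_slstm_block() if i % group == group - 1 else make_mlstm_block()
            for i in range(num_blocks)]
```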

Performance and Comparison

The paper presents experiments comparing xLSTM to Transformers (specifically Llama 1), State Space Models (like Mamba), and other recurrent architectures (like RWKV) on a dataset of 15 billion tokens (SlimPajama) [01:24:34]. The largest model trained had 1.3 billion parameters [01:25:02].

Key Findings:

  • Perplexity: xLSTM models compared favorably, often achieving slightly better perplexity scores than Llama 1 and Mamba/RWKV models of comparable size [01:27:01].
  • Memory Capacity (Associative Recall):
    • Transformers (e.g., Llama) generally maintain perfect recall because attention can look back over the entire stored context, the same property that makes their cost quadratic in sequence length; they effectively “store” all information [01:28:57] (see the back-of-envelope comparison after this list).
    • Recurrent models like Mamba, RWKV, and xLSTM, while having linear complexity (and thus being more memory efficient), can be “forgetful” due to their fixed-size hidden state limiting information retention over very long contexts [01:29:32]. Larger xLSTM models, however, perform better at associative recall than smaller ones [01:30:13].
  • Ablation Studies: Confirmed that both exponential gating and the matrix memory contribute significantly to xLSTM’s improved performance over traditional LSTMs [01:30:49].
  • Computational Efficiency: While the mLSTM is parallelizable in theory, the current CUDA kernels for xLSTM are not yet optimized and run about four times slower than highly optimized Transformer implementations such as FlashAttention [01:07:03]. This highlights the gap between theoretical architectural advances and practical engineering implementations [01:05:50].
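Relating to the memory-capacity point above, a back-of-envelope comparison (the model dimension d and context length T are generic symbols, not the paper's settings):

```latex
\begin{align*}
\text{Transformer KV cache:} &\quad O(T \cdot d) \ \text{memory, grows with context length} \\
\text{mLSTM matrix state:}   &\quad O(d^2) \ \text{memory, fixed regardless of context length}
\end{align*}
```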

Future Potential:

The authors suggest that scaling laws indicate larger xLSTM models could become serious competitors to current LLMs built with Transformer technology [01:36:32]. Beyond general Large Language Models, xLSTM holds significant potential in fields that benefit from efficient handling of sequential data and long histories.

The success of xLSTM at scale will depend on further engineering efforts to optimize its computational kernels, similar to how FlashAttention boosted Transformer performance [01:38:08].