From: hu-po

LongRope is a recent innovation in extending the context length of transformer-based Large Language Models (LLMs), drawing significant attention due to its potential connection to techniques used in Google’s Gemini 1.5, which boasts state-of-the-art context lengths [00:02:26]. This approach, developed by Microsoft Research, is described as a “remix on a remix on a remix” of existing positional embedding concepts [00:02:02] [01:22:15].

The Role of Positional Embeddings

Transformer models, being sequence-to-sequence architectures, require explicit positional information to understand the order of input tokens [00:06:13] [02:29:51]. This is typically provided through positional embeddings – small vectors added to the token embeddings that encode each token’s position within the sequence [00:06:38] [00:10:05].
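
As a concrete illustration (not taken from the talk), here is a minimal sketch of the classic fixed sinusoidal positional embeddings that are simply added to the token embeddings; the function name and toy shapes are illustrative. RoPE, described next, replaces this additive scheme with rotations.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal positional embeddings (Vaswani et al., 2017)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs                            # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Positional information is injected by simple addition to the token embeddings.
token_embeddings = np.random.randn(16, 64)   # toy sequence: 16 tokens, dim 64
inputs = token_embeddings + sinusoidal_positions(16, 64)
```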

Rotary Positional Embeddings (RoPE)

A significant advancement in positional embeddings came with RoPE (Rotary Position Embeddings), introduced in 2021 [01:22:22] [01:22:54]. RoPE is a hand-designed method that aims to model the dependency between elements at different positions in the sequence [01:37:37]. It encodes absolute position within a rotation matrix and incorporates relative position dependency into the self-attention formulation [01:16:16]. Key properties of RoPE include:

  • Flexibility with sequence length: It can be expanded to longer sequences [01:16:28].
  • Decaying inter-token dependency: The similarity (dot product) between tokens decreases as their relative distance increases [01:16:30]. This aligns with the intuition that closer tokens should have a stronger connection [01:17:20].

The underlying mechanism of RoPE rotates the query and key vectors that carry each token’s semantic content, using position-dependent rotation angles derived from sine and cosine functions. Different dimensions of the embedding are tied to different rotation frequencies: in the standard parameterization, lower dimension indices rotate at high frequencies and higher indices at low frequencies [00:44:07] [01:37:54].
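
To make the rotation concrete, below is a minimal NumPy sketch of applying RoPE to a single query or key vector. The function name and shapes are illustrative; real implementations apply this to batched query/key tensors inside each attention layer.

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a query/key vector `x` by position-dependent angles (RoPE sketch).

    Each consecutive pair of dimensions (x[2i], x[2i+1]) is rotated by
    angle position * theta_i, where theta_i = base ** (-2i / dim).
    Low pair indices rotate quickly (high frequency); high indices slowly.
    """
    dim = x.shape[-1]
    theta = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) per-pair frequencies
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos - x_odd * sin
    rotated[1::2] = x_even * sin + x_odd * cos
    return rotated

# Dot products of rotated vectors depend only on relative position,
# and (on average) decay as the relative distance grows.
q = rope_rotate(np.random.randn(64), position=5)
k = rope_rotate(np.random.randn(64), position=9)
score = q @ k
```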

Limitations of Prior Context Extension Methods

Extending the context length of pre-trained LLMs beyond their original training limit (e.g., 4096 tokens for Llama 2) is challenging [00:49:03]. Naive extrapolation of RoPE often leads to poor performance.

Prior methods, such as “Position Interpolation” (PI) [02:23:24] and its successors like “NTK-based interpolation” and “YaRN,” attempted to address this by interpolating the existing RoPE values so that longer sequences map back into the positional range the model was trained on [02:26:26]. However, these methods had limitations:

  • Crowded position information: Simple interpolation can make positional information too dense, hindering the model’s ability to distinguish closely positioned tokens [02:50:50].
  • Human-designed heuristics: Methods like YaRN divide RoPE dimensions into frequency-based groups, each with a different interpolation strategy [02:54:10]. These groupings and strategies are based on human-led empirical experiments and arbitrary “nonlinear” boundaries, which may be suboptimal for new LLMs [02:57:59].
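
For reference, the core of linear Position Interpolation can be written in a few lines: every position is rescaled by the same factor train_len / target_len, so that positions in the extended window map back into the trained range. This is a hedged sketch of the general idea rather than any particular implementation; NTK-based interpolation and YaRN replace the single shared scale with frequency-dependent schemes.

```python
def pi_angle(position: int, pair_index: int, dim: int,
             train_len: int = 4096, target_len: int = 8192,
             base: float = 10000.0) -> float:
    """Linear Position Interpolation: one shared scale for every dimension."""
    theta = base ** (-2 * pair_index / dim)   # per-dimension RoPE frequency
    scale = train_len / target_len            # e.g. 4096 / 8192 = 0.5
    return position * scale * theta           # rescaled rotation angle
```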

LongRope’s Innovation: Evolutionary Search for Optimal Interpolation

LongRope improves upon these methods by replacing human-designed rules with an efficient evolutionary search that discovers optimal “non-uniform positional interpolations” [02:58:11].

The search targets two key forms of non-uniformity in positional interpolation [01:19:55]:

  1. Fixed initial tokens (n_hat): The first n_hat positions in the sequence are explicitly not interpolated, retaining their original positional embeddings [01:01:06] [01:02:17]. This is hypothesized to be beneficial because initial tokens often receive large attention scores [01:02:25].
  2. Dimension-dependent rescale factors (Lambda_i): The scaling of RoPE’s rotation angles (Theta) for subsequent tokens (beyond n_hat) varies across different dimensions of the embedding [00:44:23] [01:07:56]. This allows for differential interpolation based on the frequency characteristics of each dimension (e.g., low-frequency dimensions interpolated differently from high-frequency ones) [02:54:52].
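
Putting the two forms of non-uniformity together, the rescaled rotation angles can be sketched roughly as below; the exact formulation in the paper differs in detail, and the function and variable names are illustrative.

```python
import numpy as np

def longrope_angles(position: int, dim: int,
                    lambdas: np.ndarray, n_hat: int,
                    base: float = 10000.0) -> np.ndarray:
    """Non-uniform RoPE rescaling (illustrative sketch of the idea above).

    lambdas : per-dimension rescale factors found by the evolutionary search
    n_hat   : number of initial positions left un-interpolated
    """
    theta = base ** (-np.arange(0, dim, 2) / dim)   # original RoPE frequencies
    if position < n_hat:
        return position * theta        # first n_hat tokens keep original angles
    return position * theta / lambdas  # later tokens: per-dimension interpolation
```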

The Evolutionary Search Process

Instead of guessing these parameters, LongRope uses an evolutionary search algorithm [00:58:50]:

  1. Initial Population: Start with a population of potential rescale factors and n_hat values, including those derived from prior methods like PI, NTK, and YaRN [01:05:51].
  2. Mutation: Randomly mutate these parameters [01:06:03].
  3. Evaluation: Compute the LLM’s perplexity (the exponentiated average negative log-likelihood on held-out text; lower is better) for each candidate set of parameters [01:06:06].
  4. Selection & Reproduction: The top-performing individuals (those with low perplexity) become “parents” and are used to create variants (children) for the next generation [01:06:09]. Poor-performing individuals are discarded [01:06:22].
  5. Iteration: This process is repeated iteratively, allowing the search to converge towards optimal values for the rescale factors and n_hat [01:09:14].
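
The loop below is a toy rendition of that search, assuming a caller-supplied evaluate_perplexity(candidate) that applies a candidate’s (lambdas, n_hat) to the model’s RoPE and measures perplexity on long validation samples. The paper’s search uses more constrained mutation and crossover rules, so treat this purely as a sketch of the control flow.

```python
import random

def evolutionary_search(init_population, evaluate_perplexity,
                        generations: int = 40, parents: int = 8,
                        children_per_parent: int = 4, sigma: float = 0.05):
    """Toy evolutionary search over (lambdas, n_hat) candidates.

    evaluate_perplexity(candidate) is a stand-in for applying the candidate's
    rescaled RoPE and measuring perplexity on long validation samples
    (lower is better). No fine-tuning happens inside this loop.
    """
    population = list(init_population)            # seeded with PI / NTK / YaRN
    for _ in range(generations):
        scored = sorted(population, key=evaluate_perplexity)
        survivors = scored[:parents]              # keep the lowest-perplexity set
        population = list(survivors)
        for lambdas, n_hat in survivors:          # mutate parents into children
            for _ in range(children_per_parent):
                child_lambdas = [max(1.0, l * (1 + random.gauss(0, sigma)))
                                 for l in lambdas]
                child_n_hat = max(0, n_hat + random.choice([-1, 0, 1]))
                population.append((child_lambdas, child_n_hat))
    return min(population, key=evaluate_perplexity)
```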

This computationally intensive search requires significant resources, with experiments conducted using 8 to 16 A100 GPUs [01:09:48]. Models like Llama 2 7B and Mistral 7B are fine-tuned on datasets like RedPajama, with documents chunked into long segments for training and evaluation [01:08:26].
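
For context, perplexity over such long segments is typically computed as the exponential of the mean per-token negative log-likelihood. The helper below assumes a hypothetical nll_per_token(chunk) that runs one forward pass and returns the average NLL for a chunk; it is a sketch of the bookkeeping, not of any specific evaluation harness.

```python
import math

def perplexity_on_chunks(token_ids, nll_per_token, chunk_len: int = 8192):
    """Chunk a long token stream and compute perplexity = exp(mean NLL)."""
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids) - chunk_len + 1, chunk_len)]
    mean_nll = sum(nll_per_token(c) for c in chunks) / max(1, len(chunks))
    return math.exp(mean_nll)
```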

Performance and Implications

LongRope demonstrates impressive results:

  • Extended Context: It extends LLM context windows to an “unprecedented 2048k tokens” (over 2 million tokens) [01:19:47].
  • Perplexity: Achieves low perplexity even at very high context lengths (2048k tokens) on Llama 2 and Mistral, unlike other methods which show exploding perplexity [00:40:43].
  • PassKey Retrieval Accuracy: Maintains near 100% accuracy in “needle in the haystack” tasks, where a specific five-digit number is hidden within a very long document (a toy version of this setup is sketched after this list) [01:11:40].
  • Progressive Extension: It allows for a progressive fine-tuning strategy, starting with smaller context lengths and gradually increasing [03:04:01].
  • Zero-Shot Extension: Crucially, LongRope can extend the context window by up to 8 times without any fine-tuning [01:21:23].
  • Benchmark Retention: While there is a slight drop in performance on traditional short-context benchmarks (like MMLU or HellaSwag), it’s not catastrophic, and the models retain most of their original capabilities [01:14:13].
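
The passkey task itself is easy to reproduce in miniature: bury a five-digit key inside repeated filler text and ask the model to recall it. The filler sentences and length heuristic below are illustrative, not the exact prompts used in the paper.

```python
import random

def build_passkey_prompt(context_tokens_approx: int = 4000) -> tuple[str, str]:
    """Build a toy 'needle in a haystack' passkey prompt (illustrative only)."""
    passkey = str(random.randint(10000, 99999))            # the five-digit needle
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    n_repeats = max(1, context_tokens_approx // 12)         # rough token estimate
    haystack = filler * n_repeats
    insert_at = random.randint(0, len(haystack))             # bury the key anywhere
    prompt = (haystack[:insert_at]
              + f" The pass key is {passkey}. Remember it. "
              + haystack[insert_at:]
              + "\nWhat is the pass key? The pass key is")
    return prompt, passkey
```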

This advancement suggests that models with extremely long context windows may eventually reduce the need for techniques like Retrieval Augmented Generation (RAG) for certain applications, as all necessary information could potentially fit within the LLM’s context [01:17:19].

The “Bitter Lesson” and Future of Positional Embeddings

This approach, while highly effective, highlights an ongoing debate in AI research encapsulated by Rich Sutton’s “Bitter Lesson” [01:22:55]. Sutton posits that general methods leveraging computation (like search and learning) ultimately outperform human-engineered heuristics in the long term [01:23:30].

The history of computer vision provides an analogy: early methods relied on hand-designed filters (e.g., Gabor filters) to detect features like edges [01:24:37]. However, convolutional neural networks (CNNs) eventually surpassed these by learning optimal filters directly from data [01:25:30].

Similarly, RoPE and its “remixes” like LongRope represent increasingly complex human-designed heuristics for positional embeddings [01:27:11]. While effective now, the long-term trend suggests that positional embeddings will eventually be entirely learned by the models themselves, similar to how token embeddings are learned [01:27:35]. Although past attempts to learn positional embeddings directly have not outperformed hand-designed ones [01:30:38], the “Bitter Lesson” implies that with sufficient scale and computation, learned methods will ultimately prevail [01:30:46].