From: hu-po

Neural networks, particularly Transformers, store and process information in high-dimensional vector spaces. The geometry of these spaces is what allows models to encode and retrieve vast amounts of knowledge.

The Nature of Information Storage

Traditionally, within a Transformer block, the Multi-Layer Perceptron (MLP) or feed-forward network (FFN) was believed to be the primary location for storing factual knowledge [00:35:36]. This idea stemmed from research showing that specific neurons within the MLP could be modified to change a model’s factual understanding, such as the capital of a country or the location of a landmark [00:35:52]. In contrast, the attention mechanism was thought to facilitate communication between tokens [00:36:40].
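One common reading of why the MLP can hold facts treats it as a key-value memory: the rows of its first weight matrix act like keys that match patterns in the incoming token vector, and the columns of its second matrix act like values that are written back in proportion to the match. A minimal numpy sketch of that reading (the sizes, ReLU activation, and initialization are illustrative, not tied to any particular model):

```python
import numpy as np

d_model, d_ff = 512, 2048          # illustrative hidden sizes

rng = np.random.default_rng(0)
W_in = rng.standard_normal((d_ff, d_model)) * 0.02   # rows ~ "keys" (patterns to match)
W_out = rng.standard_normal((d_model, d_ff)) * 0.02  # columns ~ "values" (stored content)

def ffn(x):
    """Feed-forward block read as a memory: match the token against d_ff key
    patterns, then return a weighted mix of the corresponding value vectors."""
    scores = np.maximum(W_in @ x, 0.0)   # ReLU "match strength" against each key row
    return W_out @ scores                # weighted sum of value columns

x = rng.standard_normal(d_model)         # a token's hidden vector
print(ffn(x).shape)                      # (512,)
```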

However, recent research challenges this strict division of labor. The TokenFormer paper suggests that the attention mechanism itself can replace all linear projections in a Transformer block, including those in the MLP and the Query (Q), Key (K), and Value (V) projections within the attention mechanism [00:33:02]. This implies that both attention and MLPs might be performing similar functions:

“It seems like both of them are really kind of doing the same thing. They’re using the property of high dimensional spaces to encode information.” [00:55:36]

High-Dimensional Spaces and Orthogonality

A core concept underlying information storage in neural networks is the use of high-dimensional vector spaces. Each token (e.g., a word) is represented as a vector in this space, with its position and direction encoding its meaning [00:42:14]. As a token passes through layers of a Transformer, its vector is adjusted, continuously refining its meaning based on context [00:43:19].
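One way to picture this adjustment is the residual stream of a standard pre-norm Transformer block, where each sub-layer adds an update to the token's vector. A minimal sketch (attn, ffn, and norm are stand-ins for the real sub-layers):

```python
import numpy as np

def transformer_block(x, attn, ffn, norm):
    # Each sub-layer reads the token's current vector and *adds* an update,
    # so the vector's position in the high-dimensional space is nudged toward
    # a more context-specific meaning at every layer.
    x = x + attn(norm(x))   # fold in information from other tokens
    x = x + ffn(norm(x))    # fold in information stored in the layer's weights
    return x

# Toy usage with stand-in sub-layers, just to show the vector being refined.
x = np.ones(8)
x = transformer_block(x, attn=lambda v: 0.1 * v, ffn=lambda v: -0.2 * v, norm=lambda v: v)
print(x)   # the token's vector after one block
```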

A key property that enables vast information storage in these spaces is the Johnson-Lindenstrauss Lemma:

“The number of vectors that are between 89 and 91 degrees apart, so almost orthogonal, is not just the dimensionality of the space but grows exponentially with the number of dimensions.” [00:49:57]

In simpler terms, as the dimensionality of a space increases, the number of “almost orthogonal” directions available to encode distinct pieces of information grows exponentially [00:51:58]. This means a higher-dimensional space has significantly more “real estate” for storing different concepts without them interfering with each other [00:52:51].
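A quick way to see the underlying concentration effect empirically is to sample random unit vectors in a high-dimensional space and measure their pairwise angles: almost all of them land within a degree or two of 90. A short numpy sketch (the dimension and vector count are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 4096, 2000   # dim is in the range of a large model's hidden size

# Random unit-length directions in a 4096-dimensional space.
V = rng.standard_normal((n_vectors, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Angles between every distinct pair of vectors.
cos = np.clip(V @ V.T, -1.0, 1.0)
angles = np.degrees(np.arccos(cos[np.triu_indices(n_vectors, k=1)]))

print(f"mean pairwise angle: {angles.mean():.1f} degrees")            # ~90
print(f"share within 89-91 degrees: {(np.abs(angles - 90) <= 1).mean():.1%}")
```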

This phenomenon is crucial for the scaling laws observed in neural networks:

“This might literally be the reason that neural nets are capable of storing so much information… The higher the dimension of the space, the more information you can encode in there, but it’s not even just proportional, it’s exponential with the number of dimensions. So this is why as Transformers as models get bigger and bigger and bigger, they’re literally the amount of information that they can store is exponentially growing with the size of that.” [00:52:09]

Attention as a Dynamic Information Retriever

The Pattention layer in TokenFormer exemplifies this idea by using “learnable tokens” to represent model parameters [00:49:03]. Input tokens act as queries, while the model parameters act as keys and values [00:28:46]. This lets the model “attend” to its own internal parameters, effectively querying and retrieving information dynamically from its latent space [00:28:50].
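A minimal sketch of that retrieval pattern, with illustrative sizes and an ordinary softmax standing in for TokenFormer's normalization (the paper replaces the standard softmax with a modified scheme):

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
d_in, d_out, n_param_tokens = 512, 512, 1024   # illustrative sizes

# Learnable "parameter tokens": the layer's weights stored as key/value pairs.
K_param = rng.standard_normal((n_param_tokens, d_in)) * 0.02
V_param = rng.standard_normal((n_param_tokens, d_out)) * 0.02

def pattention(X):
    # Each input token is a query over the model's own parameter tokens: it
    # retrieves a weighted mix of value tokens instead of being multiplied
    # by a fixed weight matrix, as an ordinary linear projection would do.
    scores = X @ K_param.T / np.sqrt(d_in)   # (n_tokens, n_param_tokens)
    return softmax(scores, axis=-1) @ V_param

X = rng.standard_normal((16, d_in))          # 16 input tokens
print(pattention(X).shape)                   # (16, 512)
```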

Queries, Keys, and Values: The Mechanism of Retrieval

  • Query (Q): “What am I looking for?” or “What information am I interested in?” [00:19:15]
  • Key (K): “What do I contain?” or “What information do I have?” [00:19:20]
  • Value (V): The actual information or “value” associated with a key [00:20:50].

The attention mechanism computes an “agreement” score (a dot product) between a query and every key [00:20:17]. High agreement means the key holds information relevant to the query. The values of the strongly matching keys are then weighted by that agreement and added to the token’s representation, enriching its meaning based on context [00:23:36].
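In code, this “agreement then weighted sum” is the familiar scaled dot-product attention. A compact numpy sketch (single head, no masking, arbitrary sizes):

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    # "Agreement" between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Turn agreement into weights, then take a weighted sum of the values:
    # each token's output is enriched with the values of the keys it matched.
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 8, 64, 64
Q = rng.standard_normal((n_tokens, d_k))   # "what am I looking for?"
K = rng.standard_normal((n_tokens, d_k))   # "what do I contain?"
V = rng.standard_normal((n_tokens, d_v))   # the information itself
print(attention(Q, K, V).shape)            # (8, 64)
```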

Analogies for Understanding Information Storage

  • Memory Palaces: Just as humans use memory palaces (method of loci) to store information by associating it with locations in a familiar spatial environment [00:45:28], neural networks might be “filling up this very high dimensional space with information” [00:56:52]. The “treasure map” to this information is the specific sequence of tokens [01:04:04].
  • Language as an Indexing System: Language itself can be seen as an abstract and powerful indexing system for retrieving memories [01:09:56]. Large Language Models (LLMs) leverage this by using language to navigate and retrieve information from these high-dimensional spaces [01:30:47].
  • Anamorphic Sculptures: An anamorphic sculpture appears as a tangle of wires from most angles, but reveals a distinct image (e.g., an elephant or giraffes) when viewed from a specific perspective [00:53:27]. Similarly, high-dimensional spaces can encode multiple pieces of information that become apparent when “queried” from the right “direction” [00:53:57].

“What LLMs are doing is they’re storing billions of little tiny pieces of information inside a very high dimensional space.” [01:32:28]

Implications for Model Scaling and Hallucinations

TokenFormer’s ability to add parameters incrementally by appending new “model tokens” allows for progressive and efficient scaling: training can start with a small model and grow it over time [01:27:57]. This can save compute and time compared to training a large model from scratch [01:28:47]. The resulting uniformity of the architecture, with the same attention-style computation used everywhere, also makes performance easier to optimize at scale [00:39:57].
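A hedged sketch of what growing such a layer might look like, reusing the Pattention layout from the earlier sketch: new key/value parameter tokens are appended and zero-initialized so the enlarged layer starts out close to the old one (with the plain softmax used here the old behavior is only approximately preserved; the paper's own scheme is designed to let the grown model pick up where the smaller one left off):

```python
import numpy as np

def grow_parameter_tokens(K_param, V_param, n_new):
    # Append freshly initialized key/value parameter tokens to an existing
    # attention-over-parameters layer. Zero-initializing the new pairs keeps
    # the enlarged layer's outputs close to the original layer's outputs.
    d_in, d_out = K_param.shape[1], V_param.shape[1]
    K_new = np.zeros((n_new, d_in))
    V_new = np.zeros((n_new, d_out))
    return np.vstack([K_param, K_new]), np.vstack([V_param, V_new])

# Example: grow a layer from 1024 to 1536 parameter tokens mid-training.
rng = np.random.default_rng(0)
K_param = rng.standard_normal((1024, 512))
V_param = rng.standard_normal((1024, 512))
K_param, V_param = grow_parameter_tokens(K_param, V_param, n_new=512)
print(K_param.shape, V_param.shape)   # (1536, 512) (1536, 512)
```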

While this architecture improves scaling efficiency, it may not inherently solve the problem of hallucinations. On this view, hallucinations occur when a model “interpolates” in regions of the high-dimensional space where no actual data was stored during training [01:12:04]. The model simply samples from these Swiss-cheese-like gaps, unable to distinguish true facts from interpolated ones [01:12:54].

Entropy and Variance in Predictions

Some theories suggest that the “variance” or “entropy” of a token’s prediction can indicate the model’s certainty. Low variance might mean the model has “been there before” in the high-dimensional space, indicating a factual prediction from the dataset. High variance might suggest the model is in an “unfamiliar” part of the space, leading to a hallucination [01:14:17].
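One concrete way to operationalize that intuition is to measure the entropy of the model's next-token distribution. The link to hallucination is speculative, as above, but the measurement itself is straightforward; a short sketch with made-up logits:

```python
import numpy as np
from scipy.special import softmax

def next_token_entropy(logits):
    # Shannon entropy (in bits) of the model's next-token distribution.
    # Low entropy: probability mass concentrates on a few tokens, the model is
    # "sure". High entropy: mass is spread out, which *might* signal that the
    # model is in an unfamiliar region of its latent space.
    p = softmax(logits)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

confident = np.array([12.0, 1.0, 0.5, 0.2])   # one token dominates
uncertain = np.array([2.0, 1.9, 2.1, 2.0])    # near-uniform logits
print(next_token_entropy(confident))          # close to 0 bits
print(next_token_entropy(uncertain))          # close to 2 bits (log2 of 4 options)
```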