From: 3blue1brown

Overview of Transformers in Large Language Models

Transformers are a core technology within large language models (LLMs) and other modern AI tools [00:00:04]. First introduced in the 2017 paper “Attention is All You Need” [00:00:10], their primary goal in models like GPT-3 is to take text and predict the next word [00:00:30].

Input Processing and Embeddings

Input text is broken down into “tokens,” often words or pieces of words [00:00:36]. Each token is initially associated with a high-dimensional vector, known as its embedding [00:00:51]. These embeddings encode both the word’s meaning and its position within the text [00:05:05], [00:05:07]. The transformer’s objective is to progressively adjust these initial, context-free embeddings to bake in richer contextual meaning [00:01:28], [00:01:32].
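
As a rough illustration, here is a minimal NumPy sketch of this step, assuming a learned token table and a learned position table whose entries are simply added. Both tables below are random placeholders, and the sizes are toy stand-ins for GPT-3's roughly 50,000-token vocabulary and 12,288-dimensional embeddings.

    import numpy as np

    # Toy sizes standing in for GPT-3's ~50,000-token vocabulary and
    # 12,288-dimensional embeddings; both tables are random placeholders
    # for values a real model would learn.
    vocab_size, d_embed, context = 10, 8, 5
    rng = np.random.default_rng(0)

    token_embedding    = rng.normal(size=(vocab_size, d_embed))
    position_embedding = rng.normal(size=(context, d_embed))

    token_ids = np.array([3, 7, 1, 4])     # hypothetical tokenized input text

    # Each token's starting vector encodes its context-free meaning plus its position.
    x = token_embedding[token_ids] + position_embedding[:len(token_ids)]
    print(x.shape)                         # (4, 8): one embedding per token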

The Attention Mechanism

The attention mechanism is a critical component within a transformer [00:01:40]. It allows information encoded in one embedding to be moved to another, even if they are far apart in the text [00:03:32]. This is crucial for refining word meanings based on context, such as distinguishing between different meanings of “mole” based on surrounding words [00:02:06], [00:02:09].

Single Head of Attention

A single head of attention processes information using three distinct matrices filled with tunable weights (a numerical sketch follows the list):

  1. Query (WQ) matrix: Multiplied by each embedding to produce a “query” vector [00:06:42], [00:06:46]. This vector encodes what information the current word is “looking for” in other words [00:06:32].
  2. Key (WK) matrix: Multiplied by each embedding to produce a “key” vector [00:07:47], [00:07:51]. Key vectors conceptually “answer” queries [00:07:59].
    • Both query and key matrices map the high-dimensional embeddings (e.g., 12,288 dimensions in GPT-3 [00:16:09]) down to a smaller dimensional space (e.g., 128 dimensions in GPT-3 [00:06:36], [00:16:15]). Each of these matrices for GPT-3 has approximately 1.5 million parameters [00:16:20], [00:16:24].
  3. Value (WV) matrix: Multiplied by each embedding to produce a “value” vector [00:13:44], [00:13:53]. Value vectors represent what should be added to an embedding if another word finds it relevant [00:14:10].
    • The value matrix outputs vectors in the original high-dimensional embedding space (e.g., 12,288 dimensions) [00:14:02].
    • In practice, to improve efficiency, the value map is typically factored into two smaller matrices: a “value down” matrix (mapping to the smaller key/query space) and a “value up” matrix (mapping back to the embedding space) [00:17:06], [00:17:27], [00:17:43].
    • A single attention head (with these four matrices) has approximately 6.3 million parameters [00:18:11], [00:18:16].
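
The sketch below spells out one head's four maps at the sizes quoted above and checks the per-matrix and per-head parameter counts. The matrices are random placeholders rather than learned weights.

    import numpy as np

    # One attention head's four maps at GPT-3's stated sizes; the matrices
    # here are random placeholders rather than learned weights.
    d_embed, d_head = 12_288, 128
    rng = np.random.default_rng(0)

    W_Q      = rng.normal(size=(d_head, d_embed))   # query map
    W_K      = rng.normal(size=(d_head, d_embed))   # key map
    W_V_down = rng.normal(size=(d_head, d_embed))   # value map, factored down...
    W_V_up   = rng.normal(size=(d_embed, d_head))   # ...and back up to embedding space

    e = rng.normal(size=d_embed)        # one token's context-free embedding
    query = W_Q @ e                     # 128-dim: what this token is "looking for"
    key   = W_K @ e                     # 128-dim: what this token can "answer"
    value = W_V_up @ (W_V_down @ e)     # 12,288-dim: what it would add if attended to

    params_per_matrix = d_embed * d_head
    print(f"{params_per_matrix:,}")     # 1,572,864  (~1.5 million per matrix)
    print(f"{4 * params_per_matrix:,}") # 6,291,456  (~6.3 million per head)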

Attention Pattern Computation

  • Dot Products: Dot products are computed between all possible key-query pairs, indicating how well each key matches each query [00:08:27], [00:08:30].
  • Softmax: The resulting grid is normalized with a softmax applied column-wise, producing an “attention pattern” whose values lie between 0 and 1 and whose columns each sum to 1 [00:09:40], [00:09:52], [00:10:00]. This pattern indicates how relevant each word is to updating every other word’s meaning [00:09:21].
  • Masking: During training, to prevent later words from influencing earlier ones (which would “give away” answers), entries where a later token would influence an earlier one are set to negative infinity before the softmax, so they become zero after normalization while the columns still sum to 1 [00:11:49], [00:12:13], [00:12:19], [00:12:11]. (See the sketch after this list.)
  • Context Size: The attention pattern has one entry per key-query pair, so its size grows with the square of the context size, making context size a significant bottleneck for LLMs [00:12:45], [00:12:49].
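
A minimal NumPy sketch of this computation with four tokens and random placeholder weights. Following the video, the key-query dot products are also divided by the square root of the key-query dimension before the softmax.

    import numpy as np

    def column_softmax(scores):
        """Softmax down each column so that every column sums to 1."""
        scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
        exp = np.exp(scores)
        return exp / exp.sum(axis=0, keepdims=True)

    rng = np.random.default_rng(0)
    n, d_embed, d_head = 4, 12_288, 128                # 4 tokens for illustration
    E   = rng.normal(size=(n, d_embed))                # embeddings, one row per token
    W_Q = rng.normal(size=(d_head, d_embed)) * 0.01    # random placeholder weights
    W_K = rng.normal(size=(d_head, d_embed)) * 0.01

    Q = E @ W_Q.T                                      # one query per token
    K = E @ W_K.T                                      # one key per token

    # Grid of all key-query dot products: entry [i, j] scores how well
    # key i matches query j, scaled by 1/sqrt(d_head).
    scores = (K @ Q.T) / np.sqrt(d_head)

    # Masking: a key at a later position may not influence a query at an
    # earlier one, so those entries go to -inf before the softmax.
    mask = np.tril(np.ones((n, n), dtype=bool), k=-1)  # key position > query position
    scores = np.where(mask, -np.inf, scores)

    attention_pattern = column_softmax(scores)
    print(attention_pattern.round(2))                  # masked entries are 0; columns sum to 1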

Updating Embeddings

To update an embedding, each value vector is multiplied by its corresponding weight from the attention pattern [00:14:42], [00:14:45]. These rescaled value vectors are then summed to produce a “change” vector (delta-e), which is added to the original embedding, resulting in a more contextually rich embedding [00:15:06], [00:15:13], [00:15:16].
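
Continuing in the same toy setting, the sketch below applies a stand-in attention pattern to value vectors and adds the resulting change vectors back onto the embeddings; all weights are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_embed, d_head = 4, 12_288, 128

    E        = rng.normal(size=(n, d_embed))              # current embeddings, one row per token
    W_V_down = rng.normal(size=(d_head, d_embed)) * 0.01  # value map, factored down...
    W_V_up   = rng.normal(size=(d_embed, d_head)) * 0.01  # ...and back up to embedding space

    # Stand-in attention pattern: entry [i, j] is the weight with which token i's
    # value vector contributes to updating token j (each column sums to 1).
    A = np.full((n, n), 1.0 / n)

    V = E @ W_V_down.T @ W_V_up.T     # one value vector per token, in embedding space

    # delta-e for position j is the attention-weighted sum of all value vectors,
    # which is then added onto the original embedding.
    delta_E = A.T @ V
    E_updated = E + delta_E
    print(E_updated.shape)            # (4, 12288): contextually enriched embeddings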

Multi-Headed Attention

A full attention block within a transformer consists of “multi-headed attention,” where many single attention heads run in parallel [00:20:35], [00:20:38]. Each head has its own distinct key, query, and value maps, allowing the model to learn multiple ways context can influence meaning [00:20:05], [00:20:08], [00:20:11].

For example, GPT-3 utilizes 96 attention heads within each block [00:20:47]. Each head produces a proposed change to an embedding, and these changes are summed together and added to the original embedding to produce the refined output embedding for that position [00:21:17], [00:21:27], [00:21:32].
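
The sketch below shrinks the sizes (GPT-3 uses 96 heads, 12,288-dimensional embeddings, and 128-dimensional heads) and omits masking for brevity, but shows the shape of the computation: each head proposes a change, and the summed changes are added to the original embeddings. All weights are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_embed, d_head, n_heads = 4, 64, 8, 3   # toy sizes

    E = rng.normal(size=(n, d_embed))           # current embeddings, one row per token

    def head_update(E):
        """One head's proposed change (delta-e) for every embedding; weights are placeholders."""
        W_Q      = rng.normal(size=(d_head, d_embed)) * 0.1
        W_K      = rng.normal(size=(d_head, d_embed)) * 0.1
        W_V_down = rng.normal(size=(d_head, d_embed)) * 0.1
        W_V_up   = rng.normal(size=(d_embed, d_head)) * 0.1

        # Key-query grid, scaled, then column-wise softmax (masking omitted here).
        scores = (E @ W_K.T) @ (E @ W_Q.T).T / np.sqrt(d_head)
        A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

        V = E @ W_V_down.T @ W_V_up.T           # value vectors in embedding space
        return A.T @ V                          # weighted sums of values = delta-e per token

    # Each head's proposed changes are summed and added to the original embeddings.
    E_out = E + sum(head_update(E) for _ in range(n_heads))
    print(E_out.shape)                          # (4, 64): refined embeddings after one block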

Parameter Counts in GPT-3

  • Per Multi-Headed Attention Block: With 96 heads, each containing its own four matrices (query, key, value-down, value-up), one block of multi-headed attention has approximately 600 million parameters [00:22:03], [00:22:07], [00:22:10].
  • Total Attention Parameters: GPT-3 includes 96 distinct layers (or blocks) [00:24:16], [00:24:21]. Multiplying the per-block attention parameters by 96 brings the total to just under 58 billion parameters devoted to all the attention heads [00:24:27], [00:24:31]. (The arithmetic is sketched after this list.)
  • Overall Parameters: While substantial, these 58 billion parameters account for only about one-third of GPT-3’s total of 175 billion parameters [00:24:34], [00:24:37], [00:24:41]. The majority of parameters come from other blocks within the network, such as multi-layer perceptrons, which are interspersed between attention blocks [00:23:28], [00:24:44].
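
These counts follow from simple arithmetic on the sizes quoted earlier; the sketch below just multiplies them out.

    # Back-of-the-envelope arithmetic using the sizes quoted above.
    d_embed, d_head = 12_288, 128
    heads_per_block, n_blocks = 96, 96

    per_matrix = d_embed * d_head             # 1,572,864      (~1.5 million)
    per_head   = 4 * per_matrix               # 6,291,456      (~6.3 million)
    per_block  = heads_per_block * per_head   # 603,979,776    (~600 million)
    attention_total = n_blocks * per_block    # 57,982,058,496 (just under 58 billion)

    print(f"{attention_total:,}")             # ~58B of GPT-3's 175B total parameters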

Parallelizability

A key factor in the success of the attention mechanism is that it is extremely parallelizable, allowing a huge number of computations to run quickly on GPUs [00:24:58], [00:25:02]. Since one of the big lessons of deep learning over the last decade or two is that scale alone yields significant qualitative improvements in performance, this parallelizability is what makes it practical to keep scaling these models up [00:25:09], [00:25:13].