From: 3blue1brown

Word embeddings are a foundational concept for understanding how models like ChatGPT process language. They represent pieces of text, images, or sound as lists of numbers, allowing the model to perform computations on the data and capture relationships between different pieces of it [03:30:00].

Tokens and Their Vector Representation

The input to a transformer model is first broken down into small pieces called tokens [03:19:00]. For text, these tokens are typically words, parts of words, or common character combinations [03:22:00]. If images or sound are involved, tokens can be small patches of an image or chunks of sound [03:30:00].

Each of these tokens is then associated with a vector, which is a list of numbers [03:37:00]. This vector is designed to encode the meaning of that token [03:42:00].
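
A minimal sketch of this lookup, using a made-up six-token vocabulary and 4-dimensional vectors (real models use tens of thousands of tokens and thousands of dimensions):

    import numpy as np

    # Hypothetical toy vocabulary; real tokenizers have ~50k entries.
    vocab = ["once", "upon", "a", "time", "king", "queen"]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}

    embedding_dim = 4  # GPT-3 uses 12,288
    rng = np.random.default_rng(0)
    # Stored here as one row per token for convenient indexing; the video
    # describes the embedding matrix as having one column per word.
    embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

    def embed(tokens):
        """Look up each token's vector by indexing into the matrix."""
        ids = [token_to_id[t] for t in tokens]
        return embedding_matrix[ids]

    vectors = embed(["once", "upon", "a", "time"])
    print(vectors.shape)  # (4, 4): one 4-dimensional vector per token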

Meaning Encoded in Vectors

The “meaning” of a token or word is entirely encoded in the entries of its corresponding vector [04:22:00].

High-Dimensional Space

These vectors can be thought of as giving coordinates in a very high-dimensional space [03:45:00]. Words with similar meanings tend to have vectors that are located close to each other in this space [03:50:00].

For GPT-3, these word embeddings have 12,288 dimensions [13:52:00]. The high dimensionality allows for many distinct directions, which is important for encoding semantic meaning [13:55:00]. While visualizing such high-dimensional spaces is difficult, a three-dimensional slice can be used for animation purposes [14:01:00].

Semantic Meaning and Relationships

During training, models adjust their weights to create embeddings where directions in the space carry semantic meaning [14:21:00]. For example:

  • Words close to “tower” in the embedding space tend to have similar “tower-ish” vibes [14:37:00].
  • The vector difference between “woman” and “man” is similar to the difference between “queen” and “king”, suggesting a “gender information” direction [14:58:00], [15:46:00]. This allows for analogies like king + (woman - man) ≈ queen [15:15:00], as sketched in the code after this list.
  • Italy - Germany + Hitler is close to Mussolini, associating directions with “Italian-ness” and “WWII axis leaders” [15:56:00].
  • Germany - Japan + sushi can be close to bratwurst [16:16:00].
  • “Cat” was found to be close to “beast” and “monster” in one model [16:27:00].
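
A toy illustration of this vector arithmetic, with hand-picked three-dimensional vectors (real embeddings are learned and far higher-dimensional):

    import numpy as np

    # Hand-picked toy vectors purely for illustration; real embeddings
    # are learned during training, not assigned.
    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.8, 0.9]),
        "man":   np.array([0.1, 0.2, 0.1]),
        "woman": np.array([0.1, 0.2, 0.9]),
    }

    def nearest(vector, exclude=()):
        """Return the word whose embedding best aligns with `vector`."""
        def cosine(u, v):
            return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
        candidates = {w: v for w, v in emb.items() if w not in exclude}
        return max(candidates, key=lambda w: cosine(candidates[w], vector))

    # king + (woman - man) should land near queen.
    print(nearest(emb["king"] + emb["woman"] - emb["man"],
                  exclude=("king", "woman", "man")))  # queen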

Dot Product and Similarity

The dot product of two vectors can be used to measure how well they align or how similar they are [16:37:00].

For instance, the vector cats - cat might represent a “plurality direction”. Computing its dot product with singular versus plural nouns shows consistently higher values for plural nouns, indicating better alignment with this “plurality” direction [17:09:00].
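
A toy illustration of this measurement, with made-up two-dimensional vectors chosen so the effect is easy to see:

    import numpy as np

    # Hypothetical toy vectors; in a real model these come from the
    # learned embedding matrix and have thousands of entries.
    emb = {
        "cat":  np.array([0.5, 0.1]),
        "cats": np.array([0.5, 0.9]),
        "dog":  np.array([0.4, 0.1]),
        "dogs": np.array([0.4, 0.8]),
    }

    plurality = emb["cats"] - emb["cat"]  # candidate "plurality direction"

    for word in ["dog", "dogs"]:
        print(word, round(float(emb[word] @ plurality), 2))
    # dog 0.08   (singular: low alignment with the plurality direction)
    # dogs 0.64  (plural: noticeably higher alignment)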

Contextualization of Vectors

Initially, each vector is simply plucked from an embedding matrix and encodes only the meaning of a single word [19:24:00]. The primary goal of the network (here, a transformer) is to enable these vectors to “soak in context” [18:39:00].

As vectors pass through the network’s attention blocks and feed-forward layers, they are progressively updated to incorporate information from their surroundings [18:47:00]. This allows a vector that started as “king” to evolve into a more nuanced representation encoding details like “a king who lived in Scotland, murdered the previous king, and is described in Shakespearean language” [18:51:00].
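
Purely as a schematic of this progressive refinement (the update functions below are invented stand-ins; real attention and feed-forward blocks use learned weight matrices):

    import numpy as np

    def attention_block(vectors):
        # Stand-in: nudge each vector toward information from the
        # rest of the sequence.
        return 0.1 * (vectors.mean(axis=0) - vectors)

    def feed_forward_block(vectors):
        # Stand-in: a per-vector nonlinear update.
        return 0.1 * np.tanh(vectors)

    vectors = np.random.default_rng(0).normal(size=(5, 4))  # 5 tokens, 4 dims
    for _ in range(3):  # alternate the two kinds of blocks a few times
        vectors = vectors + attention_block(vectors)
        vectors = vectors + feed_forward_block(vectors)
    # Each vector now mixes in information from its surroundings.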

The Embedding Matrix

The embedding matrix is the first set of “weights” in a transformer model [17:54:00]. It has one column for each word in the model’s predefined vocabulary, and that column serves as the word’s initial vector [12:55:00]. Its values are learned during the training process [13:15:00].

For GPT-3:

  • Vocabulary size: 50,257 tokens (not strictly words) [18:00:00].
  • Embedding dimension: 12,288 [18:10:00].
  • Total weights: Approximately 617 million [18:14:00].
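
The total is simply the product of the first two numbers:

    vocab_size = 50_257     # tokens in GPT-3's vocabulary
    embedding_dim = 12_288  # entries per embedding vector

    total_weights = vocab_size * embedding_dim
    print(f"{total_weights:,}")  # 617,558,016, i.e. about 617 million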

The Unembedding Matrix

At the very end of the network, after the vectors have absorbed context, another matrix called the Unembedding matrix (WU) is applied [21:45:00]. It maps the very last vector in the sequence to a list of values, one for each token in the vocabulary [20:58:00].

For GPT-3, the Unembedding matrix adds another 617 million parameters, bringing the total count to just over a billion so far [21:56:00].
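
A sketch of that final matrix multiplication, using tiny stand-in sizes (the variable names are illustrative, not from any library):

    import numpy as np

    # Tiny sizes for illustration; for GPT-3 the shapes would be
    # vocab_size = 50_257 and d_model = 12_288.
    vocab_size, d_model = 6, 4

    rng = np.random.default_rng(0)
    W_U = rng.normal(size=(vocab_size, d_model))  # unembedding matrix
    last_vector = rng.normal(size=d_model)        # final, context-rich vector

    logits = W_U @ last_vector  # one raw score ("logit") per vocabulary token
    print(logits.shape)         # (6,); for GPT-3 this would be (50257,)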

Softmax Function

The output of the Unembedding matrix is a list of “logits” [25:37:00]. To convert these raw values into a probability distribution (where each value is between 0 and 1, and all values sum to 1), the softmax function is applied [21:08:00], [22:24:00].

Softmax works by:

  1. Raising e to the power of each number in the input list, making all values positive [23:13:00].
  2. Dividing each resulting term by the sum of all terms, normalizing them to sum to 1 [23:21:00].

If one input value (logit) is significantly larger than the rest, its corresponding output probability will dominate the distribution [23:30:00]. However, it remains “soft” in that other similarly large values also retain meaningful weight [23:42:00].
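
A minimal sketch of those two steps (subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result):

    import numpy as np

    def softmax(logits):
        # Step 1: exponentiate, making every value positive.
        exps = np.exp(logits - np.max(logits))
        # Step 2: divide by the sum, so the results add up to 1.
        return exps / np.sum(exps)

    logits = np.array([2.0, 1.0, 0.1, -1.0])
    probs = softmax(logits)
    print(probs.round(3))  # largest logit dominates, but others keep weight
    print(probs.sum())     # 1.0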

Temperature Parameter

When a model like ChatGPT uses this distribution to generate the next word, a “temperature” parameter (T) can be introduced into the softmax function [23:59:00].

  • Larger T: Gives more weight to lower values, resulting in a more uniform and diverse distribution, leading to less predictable text [24:14:00].
  • Smaller T: Causes larger values to dominate more aggressively, leading to more predictable text [24:22:00].
  • T = 0: All weight goes to the maximum value, leading to the most predictable word every time [24:26:00].
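
Following the video’s description of putting T into the denominator of the exponents, a sketch:

    import numpy as np

    def softmax_with_temperature(logits, T):
        if T == 0:  # limiting case: all weight on the maximum logit
            probs = np.zeros_like(logits, dtype=float)
            probs[np.argmax(logits)] = 1.0
            return probs
        exps = np.exp((logits - np.max(logits)) / T)
        return exps / np.sum(exps)

    logits = np.array([2.0, 1.0, 0.5])
    for T in [0, 0.5, 1.0, 2.0]:
        print(T, softmax_with_temperature(logits, T).round(3))
    # Small T concentrates probability on the top logit; large T
    # flattens the distribution toward uniform.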

Temperature in Story Generation

With a seed text like “once upon a time there was a” [24:33:00]:

  • Temperature zero: Generates a predictable, trite story like a derivative of Goldilocks [24:43:00].
  • Higher temperature: Starts off more original (e.g., a story about a young web artist) but risks devolving into nonsense [24:53:00]. The API typically does not allow setting T above 2, to avoid overly nonsensical outputs [25:06:00].