From: 3blue1brown

Introduced by a team of Google researchers in 2017, the Transformer architecture revolutionized language models by enabling parallel processing of text [00:04:36]. Prior to 2017, most language models processed text one word at a time [00:04:32].

How Transformers Process Language

Transformers “soak in” text all at once, in parallel, rather than reading it from start to finish [00:04:46]. This parallelization is crucial to the efficiency of modern large language models, because it allows for the staggering scale of computation involved in their training [00:03:13]. That training relies on special computer chips called GPUs, which are optimized for running many operations in parallel [00:04:18].
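
To make the contrast concrete, here is a minimal NumPy sketch (not from the video; the sizes and names like seq_len and d_model are made up for illustration). Transforming every token’s list of numbers with one matrix multiplication gives the same result as looping over tokens one at a time, but as a single bulk operation of the kind GPUs are built for:

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
seq_len, d_model = 6, 8                        # 6 tokens, each encoded as 8 numbers
rng = np.random.default_rng(0)

tokens = rng.normal(size=(seq_len, d_model))   # one row of numbers per token
W = rng.normal(size=(d_model, d_model))        # a stand-in for a learned transformation

# Sequential view: transform one token at a time.
one_at_a_time = np.stack([t @ W for t in tokens])

# Parallel view: the same transformation applied to every token at once
# with a single matrix multiplication.
all_at_once = tokens @ W

assert np.allclose(one_at_a_time, all_at_once)
```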

The internal workings of a Transformer involve several key steps:

  1. Word Association with Numbers: Each word is initially associated with a long list of numbers [00:04:54]. This is because the training process operates only with continuous values, so language must first be encoded as numbers, and these lists of numbers may somehow encode the meaning of the corresponding word [00:05:02] (see the first sketch after this list).
  2. Attention Mechanism: Transformers are distinguished by their reliance on a special operation known as attention [00:05:10]. This operation allows all of these lists of numbers to interact and refine the meanings they encode based on the surrounding context, all performed in parallel [00:05:16]. For example, the numbers encoding the word “bank” might change based on context to represent the meaning of “riverbank” [00:05:27] (see the attention sketch after this list).
  3. Feed-Forward Neural Networks: Transformers also incorporate feed-forward neural networks, which give the model extra capacity to store the language patterns it learns during training [00:05:37] (see the feed-forward sketch after this list).
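
As a rough illustration of step 1, the word-to-number association can be pictured as a lookup table with one learned row of numbers per word. This is a hedged sketch only: the toy vocabulary, the random values, and the embed helper are made up for illustration, not taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy vocabulary; real models use tens of thousands of tokens.
vocab = ["the", "river", "bank", "money", "flows"]
d_model = 8                                   # length of each word's list of numbers

# One vector (row) per word; random stand-ins for what training would produce.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(words):
    """Turn a list of words into a matrix with one row of numbers per word."""
    return np.stack([embedding_table[vocab.index(w)] for w in words])

vectors = embed(["the", "river", "bank"])
print(vectors.shape)                          # (3, 8): three words, eight numbers each
```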
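
For step 2, a single-head version of the scaled dot-product attention used in Transformers can be sketched as follows. Every word’s vector scores every other word’s vector, the scores become weights, and the vectors are mixed accordingly, all positions at once. The weights here are random stand-ins and the sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over all positions in parallel."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word attends to each other word
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # each row becomes a context-refined vector

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))       # e.g. vectors for "money in the bank"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

refined = attention(X, Wq, Wk, Wv)
print(refined.shape)                          # (4, 8): same shape, context baked in
```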
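
For step 3, the feed-forward network is a small two-layer network applied to each position’s vector independently. Again, this is a minimal sketch with made-up sizes and random stand-in weights:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """A two-layer network applied independently to each position's vector."""
    hidden = np.maximum(0, X @ W1 + b1)       # ReLU-style nonlinearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32         # the hidden layer is typically wider
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

print(feed_forward(X, W1, b1, W2, b2).shape)  # (4, 8): same shape going out
```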

Data flows repeatedly through many iterations of these two fundamental operations [00:05:49]. The aim is for each list of numbers to be enriched with information necessary for an accurate prediction of the next word in a passage [00:05:56].
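
Here is a deliberately simplified sketch of that repeated flow, assuming random stand-in weights, collapsing attention to a single matrix per layer, and omitting details such as layer normalization that real Transformers use; only the alternation of the two operations is the point:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_layers = 4, 8, 6          # real models stack many more layers

X = rng.normal(size=(seq_len, d_model))       # the initial word vectors

for _ in range(n_layers):
    # Attention step: vectors exchange information with one another.
    W = rng.normal(size=(d_model, d_model)) * 0.1
    scores = (X @ W) @ X.T / np.sqrt(d_model)
    X = X + softmax(scores) @ X               # residual update, as in real Transformers

    # Feed-forward step: each vector is refined on its own.
    W1 = rng.normal(size=(d_model, d_model)) * 0.1
    X = X + np.maximum(0, X @ W1)

print(X.shape)                                # still (4, 8), but progressively enriched
```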

Prediction and Emergent Behavior

At the end of the process, a final function is applied to the last vector in the sequence, which by this point has been shaped by all of the context from the input text as well as the knowledge gained during training [00:06:07]. The result is a prediction of the next word, expressed as a probability for every possible next word [00:06:22].
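
A minimal sketch of that final step, assuming a toy vocabulary and a random output projection: the last vector is mapped to one score per word, and a softmax turns those scores into a probability for every possible next word.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "river", "bank", "money", "flows"]   # toy vocabulary
d_model = 8

last_vector = rng.normal(size=d_model)               # final vector in the sequence
W_unembed = rng.normal(size=(d_model, len(vocab)))   # stand-in for the learned output map

logits = last_vector @ W_unembed                     # one score per possible next word
probs = softmax(logits)                              # scores turned into probabilities

for word, p in sorted(zip(vocab, probs), key=lambda wp: -wp[1]):
    print(f"{word:6s} {p:.3f}")
```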

While researchers design the framework for each step, the specific behavior of a Transformer is an emergent phenomenon arising from how its hundreds of billions of parameters are tuned during training [00:06:28]. This makes it challenging to pinpoint why a model makes the exact predictions it does [00:06:42]. When large language models use these predictions to autocomplete a prompt, the generated words are notably fluent, fascinating, and useful [00:06:48].

For more details on transformers and attention, refer to the video’s suggested series on deep learning or a related talk given by the speaker [00:07:05].