From: 3blue1brown
Prior to 2017, most language models processed text one word at a time [00:04:32]. The Transformer, introduced by a team of Google researchers in 2017, revolutionized language modeling by enabling parallel processing of text [00:04:36].
How Transformers Process Language
Transformers “soak in” text all at once, in parallel, rather than reading it from start to finish [00:04:46]. This parallelization is crucial to the efficiency of modern large language models: the staggering scale of computation involved in their training [00:03:13] relies on special computer chips called GPUs, which are optimized for running many operations in parallel [00:04:18].
The internal workings of a Transformer involve several key steps (a toy sketch of these pieces follows the list):
- Word Association with Numbers: Each word is initially associated with a long list of numbers [00:04:54]. The training process operates only on continuous values, so language must first be encoded as numbers, the idea being that these numbers somehow encode the meaning of the corresponding word [00:05:02].
- Attention Mechanism: Transformers are unique due to their reliance on a special operation known as attention [00:05:10]. This operation allows all these lists of numbers to interact and refine the meanings they encode based on the surrounding context, all performed in parallel [00:05:16]. For example, the numbers encoding the word “bank” might change based on context to represent “riverbank” [00:05:27].
- Feed-Forward Neural Networks: Transformers also incorporate feed-forward neural networks, which provide additional capacity for the model to store more language patterns learned during training [00:05:37].
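The following is a minimal, untrained sketch of these pieces in Python/NumPy, not the actual architecture of any production model: a toy embedding table that maps each word to a list of numbers, a single-head attention step that lets those vectors exchange context, and a small feed-forward network. All names, dimensions, and random weights are hypothetical stand-ins for parameters that a real model learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table: each word maps to a list of numbers.
vocab = {"the": 0, "river": 1, "bank": 2}
d_model = 8                                   # length of each word's vector (tiny here)
embedding = rng.normal(size=(len(vocab), d_model))

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every vector looks at every other vector
    in parallel and mixes in context-dependent information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how strongly each word attends to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context
    return weights @ v

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network: extra capacity applied to each vector independently."""
    return np.maximum(0, x @ w1) @ w2         # ReLU nonlinearity between two linear layers

# Encode the phrase "the river bank" as vectors and run one attention + feed-forward pass.
tokens = ["the", "river", "bank"]
x = embedding[[vocab[t] for t in tokens]]     # shape: (3 words, d_model numbers each)

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
w1 = rng.normal(size=(d_model, 4 * d_model))
w2 = rng.normal(size=(4 * d_model, d_model))

x = x + attention(x, w_q, w_k, w_v)           # "bank" can now be nudged toward "riverbank"
x = x + feed_forward(x, w1, w2)
print(x.shape)                                # still (3, 8): same vectors, now context-enriched
```

Real Transformers use many attention heads, normalization layers, and vectors with thousands of dimensions, but the flow of data is the same: vectors go in, and context-enriched vectors come out.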
Data flows repeatedly through many iterations of these two fundamental operations [00:05:49]. The aim is for each list of numbers to be enriched with information necessary for an accurate prediction of the next word in a passage [00:05:56].
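Continuing the hypothetical sketch above (it reuses attention, feed_forward, rng, d_model, and x defined there), repeating these two operations is just a loop over layers, each with its own set of weights; the layer count here is made up for illustration.

```python
def transformer_block(x, params):
    """One attention step followed by one feed-forward step, with residual connections."""
    x = x + attention(x, params["w_q"], params["w_k"], params["w_v"])
    x = x + feed_forward(x, params["w1"], params["w2"])
    return x

def make_params():
    """Fresh random weights for one layer (a real model learns these during training)."""
    return {"w_q": rng.normal(size=(d_model, d_model)),
            "w_k": rng.normal(size=(d_model, d_model)),
            "w_v": rng.normal(size=(d_model, d_model)),
            "w1": rng.normal(size=(d_model, 4 * d_model)),
            "w2": rng.normal(size=(4 * d_model, d_model))}

# Stack several blocks: the vectors pass through the same two operations again and again.
layers = [make_params() for _ in range(6)]    # real models use many more layers
for params in layers:
    x = transformer_block(x, params)
```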
Prediction and Emergent Behavior
At the end of the process, a final function is performed on the last vector in the sequence, which by this point has been shaped by all the context of the input text and by the knowledge gained during training [00:06:07]. This produces a prediction of the next word, expressed as a probability assigned to every possible next word [00:06:22].
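In the terms of the earlier sketch, that final step might look like multiplying the last vector by an output matrix and applying a softmax, turning one raw score per vocabulary word into a probability. This again extends the hypothetical NumPy example; unembedding is an assumed name, and with untrained random weights the resulting probabilities are meaningless.

```python
def next_word_probabilities(x, unembedding, vocab):
    """Turn the final vector of the sequence into a probability for every possible next word."""
    logits = x[-1] @ unembedding                  # one raw score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax: scores -> probabilities summing to 1
    return {word: float(probs[i]) for word, i in vocab.items()}

unembedding = rng.normal(size=(d_model, len(vocab)))   # hypothetical output matrix
print(next_word_probabilities(x, unembedding, vocab))  # a probability for each word in the toy vocabulary
```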
While researchers design the framework for each step, the specific behavior of a Transformer is an emergent phenomenon resulting from how its hundreds of billions of parameters are tuned during training [00:06:28]. This makes it challenging to pinpoint why a model makes the exact predictions it does [00:06:42]. When large language models use these predictions to autocomplete a prompt, the generated words are notably fluent, fascinating, and useful [00:06:48].
For more details on transformers and attention, refer to the video’s suggested series on deep learning or a related talk given by the speaker [00:07:05].