From: 3blue1brown
The game Wordle has gained significant popularity, and it serves as an excellent example for a lesson in information theory, specifically entropy [00:00:08]. Programmers, including the speaker, have written algorithms to play the game as optimally as possible, and the algorithm described here centers on the concept of entropy [00:00:28].
Wordle Game Mechanics
The goal of Wordle is to guess a mystery five-letter word within six chances [00:00:58]. After each guess, players receive feedback:
- Grey box: The letter is not in the actual answer [00:01:10].
- Yellow box: The letter is in the word but not in that specific position [00:01:14].
- Green box: The letter is in the word and in the correct position [00:01:18].
An example game using the bot started with “crane” [00:01:00], receiving grey, yellow, green, grey, grey [00:01:27]. The bot’s next suggestion was “shtik” [00:01:35], followed by “shard” [00:02:07], solving the puzzle in three guesses [00:02:11]. Four guesses is considered “par” and three is a “birdie” [00:01:20].
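To make the feedback mechanics concrete, here is a minimal Python sketch (names like `wordle_pattern` are ours, not from the video) of how a bot can compute the colour pattern for a guess. The two-pass approach handles repeated letters: greens are assigned first, and only the remaining answer letters are consumed for yellows. Run against the example game above, “crane” versus the eventual answer “shard” reproduces the grey, yellow, green, grey, grey feedback described.

```python
from collections import Counter

GREY, YELLOW, GREEN = 0, 1, 2

def wordle_pattern(guess: str, answer: str) -> tuple:
    """Colour feedback Wordle would show for `guess` against `answer`.

    Illustrative sketch; names and encoding are not from the video.
    """
    pattern = [GREY] * 5
    leftovers = Counter()            # answer letters not already matched green
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = GREEN
        else:
            leftovers[a] += 1
    for i, g in enumerate(guess):    # second pass: yellows for unmatched letters
        if pattern[i] != GREEN and leftovers[g] > 0:
            pattern[i] = YELLOW
            leftovers[g] -= 1
    return tuple(pattern)

# The example game above: "crane" against the eventual answer "shard".
print(wordle_pattern("crane", "shard"))  # (0, 1, 2, 0, 0) = grey, yellow, green, grey, grey
```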
Initial Approaches and Challenges
Early Wordle solving strategies involved looking at the relative frequencies of letters in English to find an opening guess that hits many common letters [00:02:50]. An initial favorite combination was “other” followed by “nails” [00:02:59]. While hitting green or yellow feels good, even all grey boxes provide information by eliminating possibilities [00:03:09]. However, this approach lacked systematic consideration of letter order [00:03:21]. The need for a quantitative score to judge guess quality became apparent [00:03:39].
The Wordle Lexicon
There are approximately 13,000 words considered valid guesses in Wordle [00:03:54]. However, the actual answers are drawn from a smaller, human-curated list of about 2,300 common words [00:04:10]. To create a more robust algorithm that plays against anyone, not just the official website, the goal was to avoid incorporating prior knowledge of this specific answer list [00:04:25]. Using the official list, which is visible in the source code, would be considered cheating [00:04:56]. Instead, the approach favored using more universal data, such as general relative word frequencies, to capture the intuition for preferring common words [00:05:03].
Information Theory and Entropy: The Core Concept
The value of a guess lies in its ability to reduce the space of possible answers. A pattern that yields a lot of information is, by its very nature, unlikely to occur [00:06:04]. For example, if a guess of “weary” resulted in a specific pattern (grey, yellow, green, green, grey), it would narrow down the 13,000 words to only 58, a huge reduction [00:05:31]. Conversely, a very common pattern, like all greys for “weary,” is far less informative, still leaving about 1,400 possible matches [00:06:16]. The most likely outcomes are the least informative [00:06:30].
Bits as a Unit of Information
The standard unit of information is the bit [00:08:10]:
- An observation that cuts the space of possibilities in half provides one bit of information [00:08:17].
- Cutting by a factor of four yields two bits [00:08:34].
- Cutting by a factor of eight yields three bits [00:08:45].
The formula for information (in bits), given the probability P of an occurrence, is:
Information = log₂(1/P) = −log₂(P)
[00:09:17]. This logarithmic expression is useful because information adds together [00:09:57]. For example, two bits followed by three bits yields five bits of information [00:10:14].
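As a quick sanity check, the formula can be evaluated directly. The 58-word and 1,400-word figures below are the “weary” numbers quoted earlier; the function name is just for illustration.

```python
from math import log2

def information_bits(p: float) -> float:
    """Bits of information from observing an event of probability p (illustrative helper)."""
    return log2(1 / p)

# Halving the space = 1 bit, a factor of 4 = 2 bits, a factor of 8 = 3 bits.
print(information_bits(1 / 2), information_bits(1 / 4), information_bits(1 / 8))  # 1.0 2.0 3.0

# The "weary" examples above: cutting ~13,000 words to 58 is worth
# log2(13000 / 58) ≈ 7.8 bits; the common all-grey pattern leaving
# ~1,400 matches is worth only log2(13000 / 1400) ≈ 3.2 bits.
print(information_bits(58 / 13000), information_bits(1400 / 13000))
```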
Entropy
The quality of a guess is measured by the expected value of information, known as entropy [00:10:45]. This is calculated by summing the product of each pattern’s probability of occurring and the information it provides [00:10:48].
For example, the guess “weary” yields an average of 4.9 bits of information [00:10:54]. By contrast, “slate” results in a flatter distribution of outcomes, with even the most probable all-grey pattern having only a ~6% chance [00:11:15], and yields an average of 5.8 bits [00:11:32]. As noted above, the higher the probability of a pattern, the lower the information it provides [00:10:39].
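Below is a minimal sketch of the entropy calculation, assuming the possible answers have already been bucketed by the feedback pattern a guess would produce (for example with a helper like `wordle_pattern` above). The bucket counts are toy numbers, not the real “weary” or “slate” distributions.

```python
from math import log2

def entropy(pattern_counts: dict) -> float:
    """Expected information of a guess, in bits, from its pattern distribution."""
    total = sum(pattern_counts.values())
    return sum((n / total) * log2(total / n) for n in pattern_counts.values())

# Toy buckets over 16 possible answers: a flatter distribution carries
# more expected information than a lumpy one.
lumpy = {"all grey": 10, "one yellow": 4, "one green": 2}
flat = {"all grey": 4, "one yellow": 4, "one green": 4, "two greens": 4}
print(entropy(lumpy))  # ≈ 1.30 bits
print(entropy(flat))   # 2.0 bits
```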
The name “entropy” was suggested to Claude Shannon by John von Neumann, both because the same formula was already used in statistical mechanics and because nobody really knows what entropy is, so using the term would give an advantage in debates [00:12:10]. For Wordle, entropy simply means the expected information value of a guess [00:12:45].
Entropy measures:
- How flat the distribution is: A uniform distribution has higher entropy [00:12:54]. With 3^5 = 243 possible patterns, the maximum possible entropy is log₂(243) ≈ 7.92 bits [00:13:06] (verified in the snippet after this list).
- The number of possibilities: More patterns generally mean higher entropy [00:13:17].
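The 7.92-bit ceiling is simply the entropy of a perfectly uniform distribution over all 243 patterns, which is easy to check:

```python
from math import log2

# A perfectly uniform spread over all 3**5 = 243 feedback patterns.
uniform = [1 / 243] * 243
print(sum(p * log2(1 / p) for p in uniform))  # ≈ 7.92 bits
print(log2(243))                              # the same number, log2(3**5)
```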
Wordle Solver Version 1: Maximizing Entropy
The first version of the Wordle bot implements this strategy (sketched in code after the list):
- It iterates through all ~13,000 possible guesses [00:13:57].
- It calculates the entropy for each guess, specifically the entropy of the distribution of all possible color patterns [00:14:01].
- It selects the guess with the highest entropy, as this is expected to reduce the space of possibilities the most [00:14:09].
- This process is repeated for subsequent guesses, playing the same game on the reduced set of possible words [00:14:21].
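A compact sketch of this loop, reusing the hypothetical `wordle_pattern` helper from earlier; it treats every remaining candidate as equally likely, which is exactly the version-1 assumption.

```python
from math import log2
from collections import Counter

# Illustrative sketch; assumes the wordle_pattern helper defined earlier.

def entropy_of_guess(guess: str, candidates: list) -> float:
    """Expected information of `guess`, with all candidates equally likely."""
    buckets = Counter(wordle_pattern(guess, ans) for ans in candidates)  # pattern -> count
    total = len(candidates)
    return sum((n / total) * log2(total / n) for n in buckets.values())

def best_guess(allowed_guesses: list, candidates: list) -> str:
    """Version 1: pick the guess whose pattern distribution has maximum entropy."""
    return max(allowed_guesses, key=lambda g: entropy_of_guess(g, candidates))

def filter_candidates(guess: str, observed: tuple, candidates: list) -> list:
    """After seeing the real feedback, keep only the consistent candidates."""
    return [w for w in candidates if wordle_pattern(guess, w) == observed]
```

Each turn the bot would call `best_guess`, play it, observe the colours, filter with `filter_candidates`, and repeat on the smaller set.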
An example run of the bot shows “tares” as the top opening suggestion [00:15:00]. The bot also displays the expected information versus the actual information gained, and the current uncertainty (entropy) of the remaining possible words [00:15:16].
Simulations of Version 1 against all 2,315 actual Wordle answers yielded an average score of about 4.124 [00:17:54]. This was considered “not bad,” but left room for improvement [00:17:55].
Wordle Solver Version 2: Incorporating Word Frequency
The “low-hanging fruit” for improvement was to incorporate how common a word is [00:18:08].
Word Frequency Data
The algorithm utilizes word frequency data, such as the Google Books English Ngram public dataset, to assign probabilities to words [00:18:26]. Instead of making a word’s probability directly proportional to its frequency, a sigmoid function is applied over the frequency-sorted list of words, so that sufficiently common words get a probability near 1 and obscure words near 0, with a smooth transition in between; this reflects the intuition that most common words are about equally plausible as answers [00:19:21].
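A rough sketch of how such a prior could be built; the function name, the 3,000-word pivot, and the transition width are illustrative assumptions, not values from the video.

```python
from math import exp

def word_priors(words_by_frequency: list, n_common: int = 3000, width: float = 500.0) -> dict:
    """Map each word to a prior probability of being a Wordle answer.

    `words_by_frequency` is sorted from most to least common. A sigmoid over
    the rank gives values near 1 for common words and near 0 for obscure ones,
    with a smooth transition rather than a hard cutoff.
    n_common and width are illustrative knobs, not the video's values.
    """
    return {
        word: 1.0 / (1.0 + exp((rank - n_common) / width))
        for rank, word in enumerate(words_by_frequency)
    }
```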
This non-uniform distribution of word probabilities affects the entropy calculation for both the expected information from a guess and the remaining uncertainty among possible words [00:20:19]. For instance, if 16 words match a pattern, but 12 are very obscure (low probability), the actual uncertainty (2.11 bits) is much closer to what it would be for only the 4 common words (2 bits), rather than 4 bits (log₂16) [00:21:02].
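The same effect in miniature, with illustrative weights (the exact 2.11-bit figure in the video depends on the real frequency data):

```python
from math import log2

def uncertainty_bits(weights: list) -> float:
    """Entropy of the remaining candidates, weighted by how plausible each is."""
    total = sum(weights)
    return sum((w / total) * log2(total / w) for w in weights if w > 0)

# 16 matching words: 4 plausible, 12 very obscure (made-up weights).
print(uncertainty_bits([1.0] * 4 + [0.005] * 12))  # ≈ 2.1 bits, close to log2(4) = 2
print(log2(16))                                    # 4 bits if all 16 were equally likely
```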
Refinements in Version 2
Version 2 incorporates two main differences [00:23:22]:
- Refined Entropy Calculation: The entropy (expected information) for guesses now incorporates the probability that a given word would actually be an answer [00:23:25].
- Expected Game Score Decision: The algorithm models the probability of each word being the actual answer and incorporates this into its decision-making [00:23:44]. Instead of purely maximizing information, it aims to minimize the expected score of the game [00:25:23]. It does this by predicting the probability that a guess is itself the correct answer and estimating the number of remaining guesses needed from the uncertainty expected to be left after the current guess. That estimate uses a function f, derived from past simulation data, which maps uncertainty (in bits) to the expected number of future guesses [00:26:08] (see the sketch after this list).
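One way to cash out this decision rule as code, as a sketch rather than the bot’s actual implementation; it assumes the `uncertainty_bits` and `entropy_of_guess` helpers sketched earlier, a `priors` dict like the one produced by `word_priors`, and some fitted function `f`.

```python
# Illustrative sketch; reuses uncertainty_bits, entropy_of_guess, and a priors dict
# from the earlier snippets. f maps leftover bits to expected extra guesses.

def expected_score(guess, candidates, priors, guesses_so_far, f):
    """Estimated final game score if `guess` is played now (lower is better)."""
    total = sum(priors[w] for w in candidates)          # assumes a non-empty candidate set
    p_answer = priors.get(guess, 0.0) / total           # chance this guess is the hidden word
    current = uncertainty_bits([priors[w] for w in candidates])
    # Expected uncertainty left after seeing this guess's feedback. For simplicity this
    # reuses the uniform entropy_of_guess; version 2 also weights the patterns by the priors.
    leftover = max(current - entropy_of_guess(guess, candidates), 0.0)
    # If the guess is right the game ends now; otherwise f estimates how many
    # more guesses the leftover uncertainty will cost on average.
    return (p_answer * (guesses_so_far + 1)
            + (1 - p_answer) * (guesses_so_far + 1 + f(leftover)))

def best_guess_v2(allowed_guesses, candidates, priors, guesses_so_far, f):
    """Version 2: minimize expected score instead of maximizing information."""
    return min(allowed_guesses,
               key=lambda g: expected_score(g, candidates, priors, guesses_so_far, f))
```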
Simulations of Version 2 showed an average score of around 3.6 [00:28:04]. This version sometimes loses (requires more than six guesses), presumably because it makes tradeoffs to prioritize solving the puzzle over merely maximizing information [00:28:06].
Further Optimizations and Limitations
Further improvements are possible, including:
- Incorporating the true Wordle answer list: This can achieve a performance of around 3.43 on average [00:28:29].
- Two-step forward search: The algorithm can search for the expected information two steps ahead instead of just one [00:28:52]. Initial simulations suggest “crane” as a strong opener for this approach [00:29:06].
Even with optimal play, it is likely impossible to consistently achieve an average score as low as three: there may not be enough room to gain sufficient information after only two guesses to guarantee the answer on the third guess every single time [00:29:41]. The maximum expected information after the first two guesses is around 10 bits, which means that at best roughly one bit of uncertainty would remain, equivalent to being down to about two possible answers [00:29:26].
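The arithmetic behind that limit, using the 2,315-word answer list and the roughly 10 bits quoted above:

```python
from math import log2

print(log2(2315))       # ≈ 11.18 bits of uncertainty before the first guess
print(log2(2315) - 10)  # ≈ 1.18 bits typically left after two ideal guesses,
                        # i.e. roughly two candidates still in play
```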