Back to Blog List

Solving Wordle with Information Theory

2025-02-26

Solving Wordle with Information Theory

Introduction

I recently saw a video on YouTube about solving Wordle using Information Theory.
The video was on the famous YouTube channel "3Blue1Brown" and the video is linked here.
I've been intrigued by Information Theory ever since I first learned about it, so I thought it would be a fun exercise to implement the solver myself.
In this blog post, I'll talk about how to use Information Theory concepts such as entropy to solve a seemingly simple problem like Wordle.

Information Theory

Information theory is a branch of mathematics that deals with the study of information and its properties.
It is a fundamental concept in computer science and mathematics, and it has applications in various fields such as cryptography, coding theory, and machine learning.
Its inception can be traced back to Claude Shannon's 1948 paper "A Mathematical Theory of Communication".
Shannon wanted to figure out the optimal way to send information over a noisy channel, and he also wanted to quantify the amount of information in a message.
He realized that information is tightly linked to uncertainty, and that the amount of information in a message is inversely proportional to the uncertainty of what is conveyed.
For example, let's assume I walk into your house with a wet umbrella. You don't need me to tell you that it has been raining, because you can see it.
In other words you don't need any additional information to confirm that it has been raining.
However, if I tell you that I have an umbrella, but don't tell you whether it is wet or dry, then you need to ask for more information to confirm that it has been raining.
In this case, the uncertainty is higher, and hence if you wanted to know whether it has been raining or not, you would need to ask for additional information.
Uncertainty is tightly coupled with probability, and hence we can use probability theory to quantify uncertainty.
Shannon wanted a function to quantify information with the following properties:

  1. It should be a continuous function of the probability of the message.
  2. It should be additive, meaning that the information in a message should be the sum of the information in the individual symbols in the message.
  3. The information in a message should be inversely proportional to the probability of the message.

He came up with the following formula for the amount of information in a message:
$ I(x) = \log_2 \left( \frac{1}{P(x)} \right) $
Where $P(x)$ is the probability of the message $x$.

Information is measured in units of bits, and hence the base of the logarithm is 2.
Let's take an example to understand this better. Imagine that we have a biased coin that comes up heads 75% of the time.
The probability of heads is 0.75, and the probability of tails is 0.25.

The amount of information in heads is:
$ I(heads) = \log_2 \left( \frac{1}{0.75} \right) = 0.415 $ bits
And the amount of information in tails is:
$ I(tails) = \log_2 \left( \frac{1}{0.25} \right) = 2 $ bits

However, if the coin were fair, then the probability of heads and tails both be 0.5.
The amount of information in heads and tails would then be:
$ I(heads) = \log_2 \left( \frac{1}{0.5} \right) = 1 $ bit
$ I(tails) = \log_2 \left( \frac{1}{0.5} \right) = 1 $ bit

How do we quantify that the first biased coin conveys more information than the second fair coin?
For that Shannon came up with the concept of entropy.
Entropy is the average amount of information in a message.

It is given by the formula:
$ H(X) = \sum_{i=1}^{n} P(x_i) \log_2 \left( \frac{1}{P(x_i)} \right) $
Where $X$ is the set of all possible messages, and $P(x_i)$ is the probability of the message $x_i$.

We can interpret entropy as the amount of uncertainty. The higher the entropy, the higher the uncertainty.
This concludes the math part of this blog post.

Using Entropy to Solve Wordle

You might be wondering how we can use entropy to solve Wordle.
Well, let's take a step back and think about the problem we are trying to solve.
We are trying to guess a 5-letter word in 6 tries.
Each guess is a 5-letter word, and each letter can be any of the 26 letters in the English alphabet.
When we guess a word, we get feedback in the form of colored tiles.
Green means that the letter is in the correct position, yellow means that the letter is in the word but in the wrong position, and gray means that the letter is not in the word.

We can use the feedback to reduce the set of possible words for our next guess.
For example, if we guess the word "CRANE" and get a green tile for the first letter, a yellow tile for the second letter, and a gray tile for the third letter, then we know that the word we are looking for starts with "C" and has an "R" somewhere, and should not have the letter "A" at all.
If we did not have any restrictions on the number of guesses we could make, we could just try random words, use the feedback to narrow down the set of possible words, and then repeat.
However, we are limited to 6 guesses, so we need to be a bit more clever.

We want to use words whose generated patterns with the rest of the words are as uniform as possible.
By uniformly random, I mean that the patterns divide the rest of the words in our dictionary into as many equally likely groups as possible.
Why is this advantageous?
Because when we receive the feedback for a guess we can reduce the set of possible words for our next guess by a factor of the number of possible patterns.

In Wordle, the ‘message’ is the pattern of green, yellow, and gray tiles that inform us about the letters in the hidden word.
We want each guess to generate a pattern that reduces the uncertainty (entropy) as much as possible.

Let's assume that we have a dictionary of 100 words, and we found that the word "CRANE" has the highest entropy of all the words in the dictionary with respect to the patterns it generates with them.
What does this mean?
This means that the patterns that "CRANE" generates with the rest of the words are as uniformly random as possible.
This is good news, because it means that no matter what the feedback is, we can substantially reduce the set of possible words for our next guess.
If "CRANE" had lower entropy,it would mean that the patterns that "CRANE" generates with the rest of the words are not as uniformly random as possible.
For example if there are many more words that generate a pattern Green-Yellow-Gray-Gray-Gray with "CRANE" than that generate other patterns.
There is a skew in the patterns, and this is bad news, because it means that we will not be able to reduce the set of possible words by as much as we can if the patterns were uniform.

The key idea here is to find the word that generates patterns with the rest of the words that have the highest entropy.
Then use this word as our guess and we see what the feedback is.
Once we get the feedback, we can reduce the set of possible words for our next guess and repeat the process.

How will we implement this?

  1. We will start with a dictionary of all words.
  2. We will calculate the entropy of all words in the dictionary with respect to the patterns they generate with the rest of the words in the dictionary.
  3. We will choose the word with the highest entropy as our guess.
    3.1 To do that we will generate all possible patterns that any word can generate with the rest of the words in the dictionary.
    3.2 We will count the probability of each pattern
    3.3 We will then plug the probability into the entropy formula to get the entropy of each word.
  4. We will get the feedback for our guess.
  5. We will reduce the set of possible words for our next guess based on the feedback. We will only keep words that generate the same pattern as our guess because the actual word is one of them.
  6. We will repeat the process until we have guessed the word.

Implementing the solver

I downloaded the WordNet 3.0 dictionary from here and used it as the dictionary of words.
This dictionary is very large and contains more words than the actual Wordle dictionary. As a result, some of the calculations take longer. Especially the first guess.
To do efficient calculations of the patterns I represented each pattern as an integer number, using two bits for each color - Gray(00), Yellow(01), Green(10).
With this scheme, the pattern "Green-Yellow-Gray-Gray-Gray" is represented as $0b1001000000$
The core parts of the logic can be seen below:


// GeneratePattern: Generates a feedback pattern for a given guess against the actual word.
// 'wordSize' indicates the number of letters in the word.
static Pattern GeneratePattern(const string &guess, const string &other, const size_t wordSize) {
    Pattern p = 0;                         // Initialize pattern to 0 (default Gray)
    unordered_map<char, int> freq;         // Frequency map of characters in the actual word

    // Count the frequency of each letter in the actual word.
    for (auto &c : other) {
        freq[c]++;
    }

    // First pass: Mark Green (correct letter in the correct position).
    // Note: We fill from rightmost position (wordSize - 1 - i) for consistency.
    for (int i = 0; i < guess.size(); ++i) {
        if (guess[i] == other[i]) {
            freq[guess[i]]--;             // Consume one occurrence of the letter.
            SetColor(p, wordSize - 1 - i, Green);
        }
    }

    // Second pass: For letters not marked Green, mark Yellow if the letter exists,
    // or Gray otherwise.
    for (int i = 0; i < guess.size(); ++i) {
        char c = guess[i];

        if (GetColor(p, wordSize - 1 - i) == Green) {
            continue;                     // Skip letters already marked Green.
        }

        if (freq[c]) {
            freq[c]--;                    // Consume an occurrence for Yellow.
            SetColor(p, wordSize - 1 - i, Yellow);
        } else {
            SetColor(p, wordSize - 1 - i, Gray);
        }
    }

    return p;
}

// CalculateEntropy: Computes the Shannon entropy for a distribution of patterns.
// 'counts' maps a Pattern to the number of occurrences.
static double CalculateEntropy(const unordered_map<Pattern, int> &counts) {
    int sumCounts = 0;
    for (auto &p : counts) {
        sumCounts += p.second;
    }
    double entropy = 0.0;
    for (auto &c : counts) {
        double p = static_cast<double>(c.second) / sumCounts;
        if (p > 0.0) {
            entropy += p * log2(p);
        }
    }
    return -entropy;
}

// CalculateEntropies: For each word in 'words', simulate comparisons against all other words
// to compute its entropy. Results are stored in 'entropies' (map keyed by entropy value).
static void CalculateEntropies(const vector<string> &words, const size_t wordSize, EntropiesMap<CompareDouble> &entropies) {
    const size_t N = words.size();
    for (size_t i = 0; i < N; i++) {
        unordered_map<Pattern, int> pCounters;
        for (int j = 0; j < N; ++j) {
            if (i != j) {
                pCounters[GeneratePattern(words[i], words[j], wordSize)]++;
            }
        }
        entropies[CalculateEntropy(pCounters)].push_back(words[i]);
    }
}

// FilterAllWordsBestOnPattern: Filters validWords to only those words that produce
// the pattern 'p' when compared with the given guess.
auto FilterAllWordsBestOnPattern(const Pattern &p, const string &guess, const vector<string> &validWords, const size_t wordSize) -> vector<string> {
    vector<string> result;

    copy_if(begin(validWords), end(validWords), back_inserter(result), [&p, &guess, wordSize](const string &w) {
        return GeneratePattern(guess, w, wordSize) == p;
    });

    return result;
}

The full code to the solver is available here.

Testing the solver

I tested the solver by simulating the game against randomly generated words from the dictionary.
In addition, I added an interactive mode where the user can play against and external Wordle game, and use the solver to help them win.

Conclusion

After I finished the solver, I showed it off to my kids. Seeing them excited to type in the guesses and watch the solver do its magic was priceless
Working on this gave me a deeper understanding of Information Theory and the concept of entropy. I highly recommend learning mathematical concepts by implementing them,
as it is only when you actually play with them that you fully internalize the concepts.

Share this post