In this lesson, we introduce Wordle as a running example for GRPO. Wordle is a simple game, but it requires planning, hypothesis testing, and step-by-step reasoning to play well. This makes it a good example to see how an LLM can learn to plan, analyze feedback, and improve its strategy over time through reinforcement fine-tuning. Let's dive in.

Let's start by reviewing the rules of the game. The goal is to identify a secret five-letter word in at most six guesses. After each guess, you receive feedback on every letter in your guess. Green means the letter is correct and in the right position. Yellow indicates that the letter appears in the word, but in a different position. And gray means that the letter does not appear in the word at all. Because we're feeding this into an LLM, we represent those colors with text symbols: a checkmark to indicate green, a dash to indicate yellow, and a cross mark to indicate gray.

Now let's head to the notebook and see how we can frame Wordle as a reinforcement fine-tuning problem. We'll start by importing some necessary packages. We'll point the OpenAI SDK to a model hosted on Predibase by giving it a different base URL, and for this lesson, we'll be looking at Qwen 2.5 7B Instruct. Once we initialize our client, we can use the Transformers package to load the tokenizer associated with this model.

Once we load the tokenizer, let's set up the system prompt that we will pass to the model to play the game of Wordle. The system prompt has a few key components. The first is that we tell the LLM it is playing Wordle, which is a word-guessing game. The second part of the system prompt gives it the three game rules that we just discussed. The third part of the system prompt tells the model how it's going to receive feedback. Once we give it these basic pieces of information, we'll also give it an example with a secret word, a guess, and the corresponding feedback. In this case, we'll say the secret word is brisk, and let's say we've made the guess storm. We'll give it feedback in the format of a symbol for each letter: S is in the word brisk but in the wrong position, so we give it a dash, and O, T, and M are not in the word at all, so they get cross marks. And finally, we'll tell the model what response format we want. Specifically, we're going to ask it to use chain-of-thought reasoning to explain its thought process and return that within think tags, and then to return the guessed word between guess tags.

Next, we'll work on defining some helper classes and methods. We can import some additional dependencies to help us define these. We'll define an enum that indicates the feedback for each letter in a guess. We'll also define a data class called GuessWithFeedback that will be used throughout the course; it contains a guess, which is a string, and a feedback attribute, which is a list of these enum values. We'll also define a string representation, which converts the guess and its feedback into a string that we can add to our prompt. Now that we have a way to represent this feedback, we need a method that captures all of it in a user prompt that we can pass to the model. We'll always start with the base prompt, "Make a new five-letter word guess," then use the list of past guesses to create feedback strings from the GuessWithFeedback objects and include them in the user prompt.
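To make these helpers concrete, here is a minimal sketch of what they might look like in Python. The names (LetterFeedback, GuessWithFeedback, render_user_prompt) and the exact prompt wording are illustrative assumptions, not necessarily the code used in the course notebook.

```python
from dataclasses import dataclass
from enum import Enum


class LetterFeedback(Enum):
    """Per-letter feedback symbols used in the prompt."""
    CORRECT = "✓"       # green: right letter, right position
    WRONG_POS = "-"     # yellow: letter is in the word, wrong position
    WRONG_LETTER = "✗"  # gray: letter is not in the word


@dataclass
class GuessWithFeedback:
    """A single guess plus the per-letter feedback it received."""
    guess: str
    feedback: list[LetterFeedback]

    def __repr__(self) -> str:
        # e.g. "STORM -> Feedback: S(-) T(✗) O(✗) R(-) M(✗)"
        feedback_str = " ".join(
            f"{letter}({fb.value})" for letter, fb in zip(self.guess, self.feedback)
        )
        return f"{self.guess} -> Feedback: {feedback_str}"


def render_user_prompt(past_guesses: list[GuessWithFeedback]) -> str:
    """Build the user prompt: the base instruction plus any past feedback."""
    prompt = "Make a new 5-letter word guess."
    if past_guesses:
        prompt += "\n\nHere is some previous feedback:"
        for i, guess in enumerate(past_guesses):
            prompt += f"\nGuess {i + 1}: {guess}"
    return prompt
```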
Next, we need a way to combine the system prompt and the user prompt with feedback, and also to give the model a little preamble as the starting point for its step-by-step reasoning. So we'll define a messages object that has the system prompt, the fully rendered user prompt, and this preamble, and then we'll use the tokenizer to format it with the right chat template tokens, so that the model gets the input in the format it expects. Finally, we'll define a generate_stream function that takes a prompt and an optional adapter ID. It calls the OpenAI completions create endpoint with the prompt, temperature, and max new tokens, and then streams the output as it's being generated. It's important to note that we're setting temperature to zero to produce deterministic responses, since we're trying to evaluate the model's quality. We'll see a sketch of these prompt and generation helpers in code at the end of this section.

Now that we have these helper methods defined, let's take a look at how the formatted data looks with our prompts. Let's assume that the secret word we want to guess is craft, and so far the model has made two guesses: crane and crash. We can create instances of the GuessWithFeedback class that contain each guess along with detailed feedback for each letter. When we pass these into the render prompt method, we'll see that our prompt has everything from our system prompt, along with the formatted feedback and the preamble to start making a guess.

Next, we can see what happens when we send this prompt to the base model. The base model understands a lot of the feedback: that C, R, and A are in correct positions, while N, E, S, and H are not in the word. Yet it decides to repeat its original guess, crane, which is a pretty suboptimal guess. Now we can see how a fine-tuned model does on the same prompt. Note that we're passing in an adapter ID here, which points to the weights of a model we trained using the reinforcement fine-tuning process that we will continue to explore throughout the rest of this course. We fine-tuned our model using a technique called LoRA, which allows us to add and update only a small set of low-rank adapter weights instead of modifying all the weights in the base model. As it starts to produce a response, we can see that it understands that C, R, and A are in the correct position and that N and E are not in the word. Similarly, it understands the same is true for the guess crash. Next, it thinks about possible words and eliminates them step by step. After producing this large chain of thought, it decides that craft is an optimal guess based on all the criteria it has left. The fine-tuned model actually uses the past feedback to correctly guess our secret word in three guesses.

Now that we've seen how the base model and fine-tuned model do on a single turn, we can try to simulate an entire game. For this, we can define two useful helper methods. The get_feedback method takes a guess and a secret word as input and assigns feedback for each letter in the guess, using the criteria we defined above: if the letter matches in the exact position, we give it a correct symbol; if it's in the word but in the wrong position, we give it a dash; and if it's not in the word at all, we mark it as a wrong letter. It then returns a list of these individual letter feedbacks as output. We can also define a function to simulate gameplay turn by turn, which we'll call next_turn. It takes three arguments as input: past guesses, the secret word, and an optional adapter ID.
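Before walking through next_turn, here is a rough sketch of the client setup and the prompt rendering and streaming helpers described above, building on the earlier classes. The endpoint URL, API key, abbreviated system prompt, reasoning preamble, and the assumption that the adapter ID is passed as the model name are all placeholders; adjust them to match your own deployment.

```python
from openai import OpenAI
from transformers import AutoTokenizer

# Placeholders: point the OpenAI SDK at your own Predibase deployment.
client = OpenAI(base_url="https://<your-predibase-endpoint>/v1", api_key="<your-api-key>")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Abbreviated stand-in for the full system prompt described above.
SYSTEM_PROMPT = (
    "You are playing Wordle, a word-guessing game. ... "
    "Explain your reasoning inside <think></think> tags and give your guess inside <guess></guess> tags."
)


def render_prompt(past_guesses: list[GuessWithFeedback]) -> str:
    """Format system + user messages with the chat template and add a reasoning preamble."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": render_user_prompt(past_guesses)},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Preamble (assumed wording) that starts the model's step-by-step reasoning.
    return prompt + "Let me solve this step by step.\n<think>"


def generate_stream(prompt: str, adapter_id: str = "") -> str:
    """Stream a completion from the endpoint and return the full text."""
    response = client.completions.create(
        # Assumption: an empty adapter_id means "use the base model"; otherwise
        # the adapter name is passed as the model identifier to the endpoint.
        model=adapter_id if adapter_id else "Qwen2.5-7B-Instruct",
        prompt=prompt,
        temperature=0.0,  # deterministic responses for evaluation
        max_tokens=2048,
        stream=True,
    )
    completion = ""
    for chunk in response:
        text = chunk.choices[0].text or ""
        print(text, end="", flush=True)
        completion += text
    print()
    return completion
```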
The next_turn function starts by taking the list of past guesses and generating the rendered prompt we saw above. Next, it sends this to the model to generate an output. Once we have the response, we use regex matching to extract the word between the guess tags. If the regex match succeeds, we have the model's guess and we can assign it feedback using the get_feedback method we defined above. We add this to our list of past guesses and continue the process. Finally, the function prints all the past guesses up to this point; if the guess matches the secret word, we mark it as a success, and if we've made six guesses without finding it, we say that the model did not succeed. We'll see a sketch of these two gameplay helpers in code at the end of this section.

With all of these helper methods defined, let's get to the fun part. For the gameplay, we'll define a secret word, brick, which is a rather easy word for this model to guess. We'll start with no past guesses as our history, and we'll set the adapter ID to an empty string so that we guess with the base model first. Next, we can invoke the next_turn function with the past guesses, secret word, and adapter ID and see what it produces as output. For the first guess, the model decides that it's a good idea to guess a common word that has popular vowels and consonants, so it guesses the word crane and accordingly gets some feedback. Let's see how it incorporates that feedback in the next guess. If we look at the model's chain of thought on the second guess, we can see that it utilized some of the feedback, such as R being in the correct place, but it also concluded that C, A, and N are not in the word at all, which is incorrect. If you read the rest of the chain of thought, you'll see that it decided to take a random guess and guessed the word brick, so it happens to get the word correct.

Now let's see how the fine-tuned model does for the same secret word. Once again, we'll define our secret word and our past guesses as an empty list, but this time we'll set the adapter ID to the same fine-tuned model we saw above. Then we can invoke the next_turn function, just like we did before. The fine-tuned model decides that it wants to pick a first word that contains common letters, has vowels, and has minimal repeated letters. It comes up with a set of reasonable candidates, such as rise, stair, or crane, and then decides that stair is a good first guess because it has a lot of common letters in it. For this guess, it receives the following feedback. Let's see how it utilizes this feedback in its next guess. The model starts by analyzing its first guess, and it correctly learns that S, T, and A are not in the secret word. It also acknowledges that R is in the word, but in the wrong position. Based on this, it comes up with a strategy where it thinks about common letters it hasn't tried yet, and then thinks about how to use this information. Since it knows that R is in the word but in the wrong position, it thinks that R should probably be in the second position, and it comes up with a list of possible words. From this list, it eliminates words such as print, because it knows that T is not in the word. As it continues on, it decides that proud is a good guess because it has multiple new letters, which will eliminate a lot of words. For the guess proud, it learns that R is indeed in the second position, but P, O, U, and D are not in the word. Now, if you think about this for a moment, it has guessed A, I, O, and U, four of the five vowels that exist. So, for its next guess, it should actually try to make a guess with the letter I.
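Before we look at the third guess, here is a rough sketch of the two gameplay helpers just described, get_feedback and next_turn, again building on the hypothetical helpers above rather than reproducing the notebook code exactly.

```python
import re


def get_feedback(guess: str, secret_word: str) -> list[LetterFeedback]:
    """Score each letter of the guess against the secret word."""
    feedback = []
    for i, letter in enumerate(guess):
        if letter == secret_word[i]:
            feedback.append(LetterFeedback.CORRECT)       # right letter, right spot
        elif letter in secret_word:
            feedback.append(LetterFeedback.WRONG_POS)     # in the word, wrong spot
        else:
            feedback.append(LetterFeedback.WRONG_LETTER)  # not in the word
    return feedback


def next_turn(
    past_guesses: list[GuessWithFeedback],
    secret_word: str,
    adapter_id: str = "",
) -> None:
    """Play one turn: render the prompt, query the model, extract and score its guess."""
    prompt = render_prompt(past_guesses)
    completion = generate_stream(prompt, adapter_id)

    # Pull the guessed word out of the <guess>...</guess> tags.
    match = re.search(r"<guess>\s*(.*?)\s*</guess>", completion, re.DOTALL)
    if not match:
        raise RuntimeError("No <guess> tags found in the model output.")
    guess = match.group(1).strip().upper()

    # Normalize case so the comparison and scoring are consistent.
    secret_word = secret_word.upper()
    feedback = get_feedback(guess, secret_word)
    past_guesses.append(GuessWithFeedback(guess, feedback))

    print("\n".join(str(g) for g in past_guesses))
    if guess == secret_word:
        print("SUCCESS!")
    elif len(past_guesses) >= 6:
        print("The model did not guess the secret word in six tries.")
```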
Let's see what it does in its third guess. Once again, it starts by analyzing the two previous guesses and uses them to think about what words follow the pattern of a question mark, then R, followed by three question marks, and it comes up with a list of candidate words. It also correctly eliminates words that are not valid, that are too short, or that are too long, and it eventually reaches a point where it decides there are only three valid options that would be good candidates for the next guess: brick, drink, and grime. It decides to go with brick because it introduces new letters that we haven't tested yet, and then it spends a moment verifying that this guess is valid based on all the criteria from the earlier feedback. And it turns out that brick is indeed the correct guess.

Now, one thing you'll notice compared to the base model is that the fine-tuned model iteratively thought through its reasoning process and had a much more strategic approach to solving the game of Wordle. This is actually one of the benefits of reinforcement fine-tuning: because we ask the model to emit its chain of thought before providing a response, it can learn to iteratively refine that reasoning during the training process and arrive at more sound reasoning that gets good results.

This would be a great moment to try other secret words, to see how the base model and the fine-tuned model compare, and in particular to get a good understanding of how the fine-tuned model comes up with consistent, sound reasoning as it works toward guessing the secret word; the short loop sketched below makes it easy to play full games. And when you're done, you can join Travis in the next lesson, where he'll show you how to define reward functions for the game of Wordle.
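If you'd like to experiment before moving on, a small loop like this one, built on the hypothetical helpers sketched earlier, plays a full game end to end; the adapter ID shown is a placeholder.

```python
def play_game(secret_word: str, adapter_id: str = "") -> None:
    """Play up to six turns, stopping early once the secret word is found."""
    past_guesses: list[GuessWithFeedback] = []
    for _ in range(6):
        next_turn(past_guesses, secret_word, adapter_id)
        if past_guesses and past_guesses[-1].guess == secret_word.upper():
            break


# Compare the base model and the fine-tuned LoRA adapter on the same word.
play_game("brick")                                   # base model
play_game("brick", adapter_id="wordle-rft-adapter")  # placeholder adapter ID
```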