With all the details of reward functions and the GRPO loss in hand, let's get to the fun part: setting up an RFT run to train an LLM to play Wordle. You'll see how to set up an RFT training job using the Predibase SDK, and then compare the resulting model's Wordle abilities to some other LLMs. Finally, you'll also see how GRPO can be combined with a supervised fine-tuning warmup stage for even better performance.

We're going to see how we can train a model for Wordle using RFT and Predibase. We'll start by writing out our system and user prompts. As you saw in lessons two and three, the system prompt lays out the game's rules, the format of the feedback, and an example of a valid response. The user prompt includes the current game state, the previous guesses, the feedback received for those guesses, and clear instructions to make a new guess. Once we have these prompts defined, we pass the complete prompt to the LLM, which is Qwen 2.5 7B Instruct, and have it generate 16 candidate responses using temperature-based sampling.

Each of these guesses is then scored using three distinct reward functions. These functions are quite a bit more sophisticated than the reward functions you've seen so far, and they were developed as we iteratively worked on improving our model's ability to play Wordle. The first, the output format check, ensures that the model's response includes the correct think and guess tags and that it outputs a valid five-letter English word from the dictionary. The uses-previous-feedback function evaluates how well the new guess incorporates feedback from earlier attempts, rewarding guesses that logically build on prior clues. The guess-value reward function scores how effective a guess is at eliminating possibilities: the more candidate words a guess rules out from the set of all possible five-letter English words, the higher the reward. If you're interested in seeing how these functions are implemented, please take a look at the utils file associated with this lesson. Finally, we use these reward scores to compute advantages, apply clipping to prevent training instability, and calculate the GRPO loss to update the model. Over time, this loop nudges the model toward more strategic and successful Wordle play.

Now that you've seen the roadmap for doing RFT for Wordle, let's jump into the code to see how we can build this with Predibase. We start by importing Predibase and the various config classes we'll use to train the model, as well as the datasets library from Hugging Face. Next, since we're training with Predibase, you'll need to sign up for a Predibase account, and you can sign in to the Predibase SDK by providing your API token like so. Now that we've signed in to Predibase, let's get started. The first thing we want to do is load a dataset for GRPO training. We can do this by loading a dataset from Hugging Face. This dataset is hosted as Predibase's Wordle GRPO dataset, and we created it by taking a set of seed five-letter words from past Wordle games and having strong models like Claude 3.7 Sonnet with thinking simulate gameplay. We discarded the actual outputs produced by the model but kept the intermediate guesses it made as it worked toward the solution. Once we've done that, we can upload this dataset to Predibase directly from pandas by calling pb.datasets.from_pandas_dataframe.
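As a rough sketch of this setup step, the dataset loading and upload might look like the code below. The Hugging Face dataset path and the exact SDK call signatures are assumptions based on the narration, so treat this as illustrative rather than the exact notebook code.

```python
from datasets import load_dataset
from predibase import Predibase

# Sign in to the Predibase SDK with your API token.
pb = Predibase(api_token="<YOUR_PREDIBASE_API_TOKEN>")

# Load the Wordle GRPO dataset from Hugging Face and convert it to pandas.
# The dataset path below is an assumption based on the narration.
wordle = load_dataset("predibase/wordle-grpo", split="train")
df = wordle.to_pandas()

# Upload the DataFrame to Predibase so it can be used for GRPO training.
dataset = pb.datasets.from_pandas_dataframe(df, name="wordle_grpo")
```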
Once we've uploaded our dataset to Predibase, the next step is to create a new repository. A repository is just like a GitHub repository, except that you use it to track all of your training experiments in the platform. In our case, we'll create a repository called Wordle. Don't worry about this warning. Once you've created a repository and uploaded the data, we're ready to set up our training run. As you know, in GRPO we need to define our reward functions, and we've done that in our utils file: guess value, output format check, and uses previous feedback.

With our reward functions set up, we can now define the fine-tuning job that we want to run. As you can see, the fine-tuning job consists of four parts: a config that defines what we want to train with, including the reward functions; a dataset; the repository; and an optional description. Let's zoom in on the GRPO config, which defines the configuration for our GRPO training run. We can specify the base model, which is Qwen 2.5 7B Instruct. Next, we define our set of reward functions using the reward functions config. This consists of two attributes we can set: the runtime and the set of functions, which is a mapping from a human-readable name to the actual function definition. The reward functions are executed on Predibase's servers, so if they need optional dependencies such as pandas, or perhaps OpenAI if you're doing LLM-as-a-judge, those need to be specified within this runtime config. After defining the reward functions, we also have the option of setting sampling parameters. These can include things like max tokens, temperature, top-k and top-p sampling, and so on. In this case, we want to give the model enough tokens to develop its chain of thought, so we'll set max tokens to 4096. Finally, we can set num generations to 8, 16, or an even larger number depending on the compute budget we want to give it. These are all the components required to set up a valid GRPO config (a rough sketch of this configuration follows the benchmark results below). Once our fine-tuning job is set up, we can run the cell to kick off the training job. You won't actually be able to run this in the notebook, but if you set up your own Predibase API key, running this cell will produce output very similar to what you're seeing on your screen right now. If you want to try this yourself, you can get started with $25 worth of free credits on Predibase today.

We used this setup to train a model to play Wordle throughout the duration of this course. Let's take a look at how this model performed on a set of games it has never seen before. We benchmarked both closed-source and open-source models on ten games of Wordle, and specifically we measured two metrics: the number of games the model could solve and the average number of guesses in those solved games. We found that GPT-4o-mini is only able to solve one game, while Claude 3.5 Sonnet is able to solve about eight out of the ten games, which is pretty good. Claude 3.7 Sonnet with thinking is able to solve all ten games with fewer than four guesses on average, but it does this only when we give it a thinking budget of 8,000 tokens. The base Qwen model actually fails to solve a single game. When we use GRPO to do reinforcement fine-tuning, the Qwen model solves three out of ten games with an average of four guesses in the games it solves. This is actually pretty incredible for a model of this size, and it clearly demonstrates gains in strategic play and efficiency from purely reward-driven optimization.
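To make the configuration described above concrete, here is a minimal sketch of how the repository, reward functions, GRPO config, and fine-tuning job could be wired together. The class names, field names, SDK calls, and the reward-function signature shown here are assumptions modeled on the narration; the real reward functions (guess_value, output_format_check, uses_previous_feedback) live in the lesson's utils file, and the notebook imports the actual config classes from the Predibase SDK.

```python
import re

from predibase import Predibase
# The config class names below (GRPOConfig, RewardFunctionsConfig, SamplingParams)
# are assumptions based on the narration; check the notebook for the exact imports.
from predibase import GRPOConfig, RewardFunctionsConfig, SamplingParams
from utils import guess_value, uses_previous_feedback  # lesson utils (assumed module path)

pb = Predibase(api_token="<YOUR_PREDIBASE_API_TOKEN>")

# Create a repository to track training experiments (like a GitHub repo for runs).
repo = pb.repos.create(name="wordle", exists_ok=True)

# Illustrative stand-in for the output format check: require <think>/<guess> tags
# and a five-letter alphabetic guess. The real implementation is in utils.
def output_format_check(prompt: str, completion: str, example: dict) -> float:
    match = re.search(r"<think>.*?</think>\s*<guess>\s*([A-Za-z]+)\s*</guess>",
                      completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if len(match.group(1)) == 5 else 0.1

config = GRPOConfig(
    base_model="qwen2-5-7b-instruct",  # assumed identifier for Qwen 2.5 7B Instruct
    reward_fns=RewardFunctionsConfig(
        # Optional dependencies (e.g. pandas, or openai for LLM-as-a-judge) would
        # be declared in the runtime part of this config.
        functions={
            "output_format_check": output_format_check,
            "uses_previous_feedback": uses_previous_feedback,
            "guess_value": guess_value,
        },
    ),
    # Give the model room to develop its chain of thought.
    sampling_params=SamplingParams(max_tokens=4096, temperature=1.0),
    num_generations=16,
)

adapter = pb.adapters.create(
    config=config,
    dataset=dataset,  # the dataset uploaded earlier
    repo=repo,
    description="GRPO run: Qwen 2.5 7B Instruct learns to play Wordle",
)
```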
We can also combine supervised and reinforcement fine-tuning to get the best of both worlds. In step one, we start by having Claude 3.7 Sonnet play 35 games of Wordle and capture the reasoning traces it generates for each intermediate guess. These prompt-completion pairs form an SFT dataset, which teaches the model how to think through its guesses step by step in a logical way. The resulting SFT checkpoint gives us a strong initialization for further optimization: essentially, a model that mimics good reasoning. Then in step two, we use this SFT model as the starting point for GRPO. We run the same reinforcement fine-tuning process described earlier: generating completions, scoring them with reward functions, computing advantages, and updating the model. This produces our final GRPO checkpoint, now optimized not just to imitate reasoning but to solve Wordle more efficiently. By combining supervised fine-tuning with reinforcement fine-tuning, our Qwen 2.5 model was now able to solve seven out of ten games correctly, which is over a 2x improvement in its performance.

One thing to remember about GRPO, and RL in general, is that it is an on-policy algorithm: it helps the model refine its own knowledge to do better on a downstream task. When you do SFT using outputs from a strong model and then use GRPO to refine that knowledge, we very often find that small models are actually able to beat these larger models on the same task. If you're interested in training a model using SFT, or using a combination of SFT and GRPO, we've made the code to do this in Predibase available toward the end of the notebook.
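For orientation, here is a rough sketch of how that two-stage recipe could look with the same SDK objects used above. The SFTConfig class name, the adapter field used to start GRPO from the SFT checkpoint, and the dataset variable names are all assumptions for illustration; the actual code is provided at the end of the notebook.

```python
# Step 1 (assumed shape): supervised fine-tuning on prompt/completion pairs
# captured from Claude 3.7 Sonnet playing 35 games of Wordle.
sft_adapter = pb.adapters.create(
    config=SFTConfig(base_model="qwen2-5-7b-instruct"),  # assumed class/field names
    dataset=sft_dataset,   # the dataset of reasoning traces (assumed variable)
    repo=repo,
    description="SFT warmup on Claude 3.7 Sonnet reasoning traces",
)

# Step 2 (assumed shape): run GRPO starting from the SFT checkpoint rather than
# the raw base model, reusing the same reward functions and sampling parameters.
grpo_adapter = pb.adapters.create(
    config=GRPOConfig(
        base_model="qwen2-5-7b-instruct",
        adapter=sft_adapter,            # assumed: initialize from the SFT checkpoint
        reward_fns=config.reward_fns,   # same reward functions as the earlier run
        sampling_params=SamplingParams(max_tokens=4096),
        num_generations=16,
    ),
    dataset=dataset,
    repo=repo,
    description="GRPO on top of the SFT warmup checkpoint",
)
```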