Now that you've seen the basics of how reinforcement learning works, let's discuss how RL as a fine-tuning technique can benefit your work and which tasks are best suited for this training method. Let's look at the concrete advantages GRPO delivers in practice. The first is that it doesn't actually require labeled data. All you need is a means to verify correctness, either through programmable reward functions, LLM-as-a-judge, or other methods that we'll talk about throughout the course. It works with as few as ten examples, but scales as you increase the number of prompts you show the model during training. It is also a lot more flexible than supervised fine-tuning, because it learns actively from feedback during the training process, rather than from a fixed set of labeled examples. And because of this, it enables reasoning models to organically discover better strategies for solving really complex problems by improving their internal chain of thought.

At Predibase, we wanted to see how GRPO-trained models really perform on a tough real-world task: translating PyTorch code into highly optimized GPU kernels written in Triton. Using Predibase's reinforcement fine-tuning, built on top of GRPO, we were able to create a state-of-the-art Triton kernel generation model, starting from an open-weights Qwen-2.5-7B-Instruct model. And this beat models like Claude 3.7 Sonnet (with extended thinking), DeepSeek-R1, and even OpenAI's o1 model. This all underscores how reinforcement fine-tuning with programmable rewards can push LLMs well beyond supervised or preference-based training methods.

So when should you actually use reinforcement fine-tuning? Well, it can work really well in three situations. The first is when you don't have labeled data, but you can verify the correctness of the output the model is producing, such as code or simple agentic workflows that have an absolute output. The second is when you have limited labeled data, but not enough for supervised fine-tuning on its own. And this is usually when you have fewer than, let's say, a thousand labeled examples. The third is when chain-of-thought reasoning improves performance. Now, chain-of-thought reasoning is a process where you ask the model to produce tokens that tell us how it's thinking about the answer before actually telling us what the answer is. And it turns out that tasks that improve when you apply chain of thought are also very well suited to RFT.

What are some tasks that are especially well suited for reinforcement fine-tuning? There are many, and here are three examples. The first is mathematical problem solving. In this case, RFT lets the model generate and verify detailed solution steps, and it refines its chain of thought until the calculation checks out. Code generation and debugging is also a great use case for RFT. The model learns by being scored against test cases or linting rules, learning to produce correct, idiomatic code and to iteratively fix errors. And RFT also lends itself to logical and multi-step reasoning tasks, such as agentic workflows. When a task requires a sequence of decisions, RFT encourages the model to self-critique and improve each step based on the final outcome. In each scenario, the ability to learn actively from programmatic or outcome-based rewards unlocks far richer, more reliable behaviors than static supervised fine-tuning alone.
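To make the idea of a programmable reward function concrete, here is a minimal sketch in Python for the math use case. It assumes a GRPO-style setup where the trainer calls a reward function on each sampled completion; the function name, the `<think>`/`<answer>` tag format, and the `expected_answer` argument are illustrative assumptions, not the API of any particular library.

```python
import re

def math_correctness_reward(prompt: str, completion: str, expected_answer: str) -> float:
    """Reward 1.0 if the final answer inside <answer>...</answer> matches the
    verified result, plus a small bonus for producing an explicit reasoning
    section. (Illustrative sketch; tag names and weights are assumptions.)"""
    reward = 0.0

    # Small shaping bonus if the model wrote out its chain of thought.
    if "<think>" in completion and "</think>" in completion:
        reward += 0.1

    # Extract the final answer and compare it to the known ground truth.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == expected_answer.strip():
        reward += 1.0

    return reward
```

During GRPO training, a function like this would be applied to every completion in a sampled group, and the relative rewards within the group drive the policy update; no human-labeled responses are needed, only a way to verify the final answer.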
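For the code-generation use case, a reward can similarly be computed by executing the model's output against test cases. The sketch below is a simplified, hypothetical example: it assumes the completion defines a function named `solution`, and a real pipeline would run the code in a sandbox with a timeout rather than calling `exec` directly.

```python
def code_reward(completion: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Run the generated code and return the fraction of test cases it passes.
    Assumes the completion defines a function called `solution`; execution is
    unsandboxed here purely for illustration."""
    namespace: dict = {}
    try:
        exec(completion, namespace)          # run the generated code
        solution = namespace["solution"]
    except Exception:
        return 0.0                           # unrunnable code earns zero reward

    passed = 0
    for args, expected in test_cases:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass                             # runtime errors count as failures
    return passed / len(test_cases) if test_cases else 0.0
```

For example, `code_reward("def solution(x, y):\n    return x + y", [((1, 2), 3), ((5, 5), 10)])` returns 1.0, while a completion that raises an error or fails a test receives a proportionally lower score, giving the model a graded signal to improve against.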
If you're deciding whether to use reinforcement fine-tuning, start by checking how much labeled data you have. With ample labeled data, upwards of 100,000 rows, supervised fine-tuning is usually your fastest path to a good model. Now, when you have moderate labeled data, say under 100,000 rows but on the order of a thousand rows or more, you should ask yourself whether chain of thought or other reasoning prompts improve initial performance. If they do, RFT can amplify those reasoning gains by rewarding correct reasoning steps. If not, you will likely get the most from using SFT. Next, if you have no labeled data, you should think about task verifiability. If you can verify the outputs and assign them a score, you can use RFT with programmatic reward functions. However, if your task is nonverifiable, you will need to use other algorithms like RLHF or DPO, by first gathering preference labels. A rough sketch of this decision flow appears at the end of this lesson.

In the next lesson, we'll demonstrate how we use GRPO to train a model to play Wordle. Although Wordle is a game, it provides an ideal sandbox for exploring every component of the GRPO algorithm, and seeing firsthand why this approach excels at reinforcement fine-tuning.
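Here is the decision flow above summarized as a small illustrative helper. The function name, arguments, and the 100,000 and 1,000 row thresholds are rules of thumb taken from the discussion, not hard cutoffs from any library.

```python
def choose_training_method(num_labeled_rows: int,
                           cot_helps: bool,
                           output_is_verifiable: bool) -> str:
    """Illustrative decision helper mirroring the guidance above."""
    if num_labeled_rows >= 100_000:
        return "SFT"                          # ample labels: supervised fine-tuning
    if num_labeled_rows >= 1_000:
        # Moderate labels: RFT pays off when chain-of-thought reasoning helps.
        return "RFT" if cot_helps else "SFT"
    # Little or no labeled data: it comes down to whether outputs can be verified.
    if output_is_verifiable:
        return "RFT with programmatic rewards"
    return "RLHF or DPO with preference labels"
```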