Welcome to Reinforcement Fine-Tuning LLMs with GRPO, built in partnership with Predibase. In this course, you'll take a deep technical dive into reinforcement fine-tuning, or RFT, a training technique that uses reinforcement learning to improve the performance of LLMs on tasks that require multi-step reasoning, such as math or code generation. By harnessing an LLM's ability to reason through problems, to think step by step, reinforcement fine-tuning guides the model to discover solutions to complex tasks on its own, rather than relying on preexisting examples as in traditional supervised learning. This approach lets you adapt models to complex tasks with much less training data, say just a couple dozen examples, than you typically need for successful supervised fine-tuning.

I'm delighted to introduce your instructors for this course. Travis Addair is co-founder and CTO at Predibase, and Arnav Garg is Senior Machine Learning Engineer and Machine Learning Lead at the company. Both have worked closely with many customers to solve practical business problems using RFT.

Thanks, Andrew. We're excited to be here. In this course, you'll explore how RFT works using a fun example: training a small LLM to play Wordle, a popular word puzzle game in which the player has to guess a five-letter word in six tries or fewer. You'll start by prompting the Qwen-2.5-7B model to play the game, analyze its performance, and develop a reward function that can be used to help the model learn how to do better over time. This reward function is the key component of Group Relative Policy Optimization, or GRPO, the learning algorithm developed by DeepSeek to carry out reinforcement learning on reasoning tasks. In GRPO, an LLM produces multiple responses to a single prompt, which are then scored using a reward function based on verifiable metrics like correct formatting or functioning code. This use of a reward function is a key difference between GRPO and other RL algorithms: if you've heard of algorithms like PPO or DPO, those rely on human feedback or complex multi-model systems to assign rewards.

After developing a reward function for the Wordle example, you'll learn some general principles for writing good reward functions that you can apply to a wide range of problems. You'll also explore ways to avoid reward hacking, where a model learns behaviors that maximize rewards without actually solving the problem at hand. Next, you'll take a close look at the technical details of how loss is calculated during RFT. You'll see how the seemingly complex pieces of the GRPO algorithm, like clipping, KL divergence, and the loss function, are actually simpler than you might think once you implement them in code. Finally, you'll wrap up the course by seeing how you can carry out RFT using the Predibase API with your own data and your own custom reward functions.

Many people have worked to develop this course. From Predibase, I'd like to thank Michael Ortega, and from DeepLearning.AI, Tommy Nelson.

LLMs that can reason well are critical components of many agentic systems, and RFT will let smaller models work well in agentic workflows. There's a lot of excitement around this capability of LLMs, and RL itself is, I think, a very powerful and important technique that is still very mysterious to many people. So this is a great time to learn how RL works and how to use it to tune your own custom reasoning models. I think you'll find learning these things really rewarding.
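To make the GRPO ideas mentioned above a bit more concrete, here is a minimal sketch in Python of a verifiable reward function for a Wordle-style guess and of GRPO's group-relative scoring of multiple responses. The function names, the toy format check, and the example completions are illustrative assumptions, not the course's actual code.

```python
# Illustrative sketch only (assumed helper names, not the course's code):
# a verifiable reward for a Wordle-style guess, plus GRPO's group-relative scoring.
import math

def format_reward(completion: str) -> float:
    """Toy verifiable reward: 1.0 if the completion is a single five-letter
    alphabetic guess, 0.0 otherwise."""
    guess = completion.strip().lower()
    return 1.0 if len(guess) == 5 and guess.isalpha() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each of the responses sampled for the same prompt relative to
    the group: subtract the group's mean reward and divide by its standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one Wordle prompt:
completions = ["crane", "I think the answer is crane", "slate", "adieu!"]
rewards = [format_reward(c) for c in completions]   # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))           # [1.0, -1.0, 1.0, -1.0]
```

And since the course will walk through clipping, KL divergence, and the loss function, here is a hedged sketch of a per-token GRPO-style loss. The coefficient values are illustrative defaults, not prescribed settings.

```python
def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, clip_eps: float = 0.2,
                    kl_coef: float = 0.04) -> float:
    """Per-token GRPO-style loss: a PPO-like clipped surrogate plus a KL penalty
    toward a frozen reference model. clip_eps and kl_coef are illustrative values."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # KL estimator used in the GRPO paper: r - log(r) - 1, with r = p_ref / p_new
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - kl_coef * kl)
```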
Let's go on to the next video, where you'll learn about the major differences between RFT and supervised fine-tuning.