Learn the foundations of reinforcement learning and how to use the Group Relative Policy Optimization (GRPO) algorithm to improve reasoning in large language models.
Instructors: Travis Addair, Arnav Garg
Design effective reward functions, and learn how rewards are converted into advantages to steer models toward high-quality behavior across multiple use cases; a minimal sketch of this conversion follows below.
Learn to use an LLM as a judge for subjective tasks, overcome reward hacking with penalty functions, and calculate the loss function in GRPO.
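To make the "group relative" idea concrete, here is a minimal sketch of how the rewards for a group of sampled completions might be converted into advantages by normalizing within the group. The function name and the use of the sample standard deviation are illustrative assumptions, not code from the course.

```python
# A minimal sketch of how GRPO turns rewards into advantages: each sampled
# completion's reward is normalized against the other completions sampled
# for the same prompt. Names here are illustrative, not from the course.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize a group of rewards to zero mean and unit variance."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All completions scored the same, so none stands out from the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four completions sampled for the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```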
Join Reinforcement Fine-Tuning LLMs with GRPO, built in collaboration with Predibase, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.
Reinforcement Fine-Tuning (RFT) is a technique for adapting LLMs to complex reasoning tasks like mathematics and coding. RFT leverages reinforcement learning (RL) to help models develop their own strategies for completing a task, rather than relying on pre-existing examples as in traditional supervised fine-tuning. One RL algorithm, Group Relative Policy Optimization (GRPO), is well suited to tasks with verifiable outcomes and can work well even when you have fewer than 100 training examples. Using RFT to adapt small, open-source models can lead to competitive performance on reasoning tasks, giving you more options for your LLM-powered applications.
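A "verifiable outcome" means the reward can be computed programmatically rather than judged by a human. Below is a toy sketch of such a reward function; the `<answer>` tag format and function name are assumptions made for illustration, not the course's actual setup.

```python
# A toy example of a "verifiable outcome" reward of the kind GRPO works well
# with: the answer can be checked programmatically, so no labeled reasoning
# traces are needed. The expected answer format is an illustrative assumption.
import re

def math_answer_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the tagged final answer matches the expected string."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer at all
    return 1.0 if match.group(1).strip() == expected else 0.0

print(math_answer_reward("Reasoning... <answer>42</answer>", "42"))  # 1.0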
In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn how to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks.
In detail, you’ll learn how reinforcement fine-tuning differs from traditional supervised fine-tuning, how to design reward functions for tasks with verifiable outcomes, how to use an LLM as a judge to score subjective tasks, how to detect and penalize reward hacking, and how the loss is calculated in GRPO. A minimal judge-plus-penalty sketch follows this paragraph.
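As a taste of the judge-plus-penalty idea, here is a hedged sketch: `ask_judge` is a hypothetical stand-in for a real LLM API call, and the word limit and penalty weight are arbitrary illustrative choices rather than the course's values.

```python
# A sketch of combining an LLM-as-a-judge score with a penalty term to
# discourage reward hacking (here, padding answers to game a judge that
# favors verbosity). `ask_judge` is a hypothetical stand-in for an LLM call.
def ask_judge(prompt: str, completion: str) -> float:
    """Hypothetical judge call: in practice, prompt an LLM to score the
    completion (e.g., 0-10) and parse the number from its reply."""
    return 7.0  # placeholder score for illustration

def judged_reward(prompt: str, completion: str, max_words: int = 150) -> float:
    score = ask_judge(prompt, completion) / 10.0  # normalize to [0, 1]
    # Penalty function: subtract for overly long completions, a common
    # reward-hacking pattern when the judge rewards verbosity.
    overflow = max(0, len(completion.split()) - max_words)
    return score - 0.01 * overflow

print(judged_reward("Summarize the article.", "A short summary."))  # 0.7
```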
By the end of this course, you’ll be able to fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.
This course is for anyone who wants to fine-tune LLMs for complex reasoning tasks without relying on large labeled datasets. It's ideal for those interested in reinforcement learning, LLM reasoning, and improving the performance of small, open-source models.
Introduction
Introduction to reinforcement learning
Benefits of reinforcement fine-tuning
Can a large language model master Wordle?
Reward functions
Reward functions with LLM as a judge
Reward hacking
Calculating loss in GRPO
Putting it all together: Training Wordle
Conclusion
Appendix – Tips, Help, and Download
Course access is free for a limited time during the DeepLearning.AI learning platform beta!