In this lesson, you will learn basic concepts about online reinforcement learning, including the method, common use cases, and principles for high-quality data curation in RL. Let's dive in.

Let's first take a look at a slight difference in reinforcement learning for language models in terms of online learning versus offline learning. In online learning, the model learns by generating new responses in real time: it iteratively collects new responses and their corresponding rewards, and uses those responses and rewards to update its weights and reinforce better responses as the model further learns and updates itself. In contrast, in offline learning, the model learns purely from pre-collected prompt, response, and reward tuples, and no fresh responses are generated during the learning process. By online reinforcement learning, we usually refer to reinforcement learning methods in the online learning setting.

Let's give a slightly more zoomed-in overview of how online reinforcement learning works. It usually works by letting the model explore better responses by itself. We can start from a batch of prompts, send them to an existing language model, and the language model will generate corresponding responses for those prompts. After we get the prompt-response pairs, we send them to a reward function, which is responsible for labeling a reward for each prompt and response. Then we have tuples of prompts, responses, and rewards, and we use them to update the language model. Here, the language model update can use different algorithms. In this lesson we'll go over two of them: proximal policy optimization, or PPO, and group relative policy optimization, or GRPO.

One thing I want to highlight here is the different choices of reward function in online reinforcement learning. The first option is a trained reward model. Here you can have multiple responses, generated by the model or collected from different sources, and then judged by a human, who will say they prefer one response over the other. During the training process, we have a reward model, ideally trained from this data, that calculates a reward for each of the responses. We can design a loss based on these rewards and the human label: the loss, which is the log of the sigmoid of the difference between the two rewards, is used to update the reward model. Essentially, when the human labeler says response j is better than response k, we design the loss so that we encourage a higher reward for response j and discourage a higher reward for response k.
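To make that loss concrete, here is a minimal PyTorch-style sketch of the pairwise preference loss described above. The function name and the toy reward values are just for illustration and are not from the lesson.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_j: torch.Tensor, reward_k: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for reward model training.

    reward_j: rewards for the responses the human labeler preferred.
    reward_k: rewards for the less preferred responses.
    Minimizing -log(sigmoid(r_j - r_k)) pushes r_j above r_k.
    """
    return -F.logsigmoid(reward_j - reward_k).mean()

# Illustrative usage with made-up reward values:
r_preferred = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.7, 0.9])
loss = pairwise_reward_loss(r_preferred, r_rejected)  # backpropagate through the reward model in practice
```

Minimizing this loss raises the reward of the preferred response relative to the rejected one, which is exactly the ordering we want the reward model to learn.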
In this way, we can train a model such that the more preferred responses always receive a higher reward than the less preferred responses. The reward model is usually initialized from an existing instruct model and then trained on very large-scale human or machine-generated preference data. Such a reward model works for any open-ended generation. It's also great for improving chat capabilities or safety-related domains, but it can be less accurate for correctness-based domains like hard coding questions, math questions, or function calling use cases.

This is where the second option comes in: one can design verifiable rewards for those correctness-based domains. For example, in the domain of math, one can check whether the response matches the ground truth, assuming a ground truth answer exists. So if we have a prompt and a corresponding response, we can check whether the exact answer provided by the response matches the provided ground truth or not. For a coding question, we can verify the correctness of the code by running unit tests. If a prompt gives a coding question and the response writes the code, we can provide a large number of unit tests in the form of test inputs and expected test outputs, then run the code and check whether the execution results match the provided expected outputs.

A verifiable reward usually requires more effort in preparation: ground truth answers for a math dataset, unit tests for coding, or a very good sandboxed execution environment for multi-turn agent behavior. However, this effort really pays off by giving us a more reliable reward function that can be even more precise than a reward model in those domains. This approach is also used more often for training reasoning models, which are expected to be really good at questions like coding, math, and so on.
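As a sketch of what such verifiable rewards can look like, here are two hypothetical reward functions, one for math (exact answer match) and one for coding (unit tests). The answer-extraction format and the `run_fn` sandbox helper are assumptions for illustration, not part of the lesson.

```python
def math_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches the ground truth, else 0.0.

    Assumes the response ends with a line like 'Answer: 42'; real graders
    typically do more robust answer extraction and normalization.
    """
    answer = response.strip().split("Answer:")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


def code_reward(candidate_code: str, unit_tests: list[tuple[str, str]], run_fn) -> float:
    """Fraction of unit tests passed by executing the candidate code.

    run_fn(code, test_input) is a placeholder for a sandboxed execution
    environment that returns the program's output as a string.
    """
    passed = sum(
        1 for test_input, expected in unit_tests
        if run_fn(candidate_code, test_input).strip() == expected.strip()
    )
    return passed / max(len(unit_tests), 1)
```

Both functions return a simple scalar that can be plugged into the online RL loop in place of a trained reward model.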
Next, let's dive deeper into a comparison of two popular online reinforcement learning algorithms. The first one is proximal policy optimization, or PPO, which was used in the creation of the very first version of ChatGPT. The second one is group relative policy optimization, or GRPO, which was proposed by DeepSeek and used in most of the DeepSeek training.

Let's first take a look at PPO. Usually, we start from a set of queries q and send them to a policy model. Here, the policy model is essentially just the language model itself, which is what we'd like to update and train. In the diagram, the yellow blocks refer to trainable models whose weights are updatable, and the blue blocks are frozen models whose weights won't be updated during the process. Once we send the queries to the policy model, the model generates outputs, or responses, denoted O here, and those responses are provided to three different models. The first is a reference model, which is a frozen copy of the original model, mostly used to calculate a KL divergence that keeps the language model from drifting too far from the original weights. The second is a reward model, which takes the query and the output as input and produces a reward used to update the policy model. The third is a trainable value model, or critic model, which tries to assign a value to each individual token so that the response-level reward can be decomposed into token-level rewards. After we have the reward and the value model's output, we use a technique called generalized advantage estimation to estimate a quantity called the advantage, A, which characterizes the credit for each individual token, or the contribution of each individual token to the entire response. By looking at the individual advantages, we can use them as a signal to guide the update of the policy model.

In PPO, you are essentially trying to maximize the return, or the advantage, for your current policy pi_theta. But since you are not able to directly sample from the most recent model pi_theta, there's an importance sampling trick in the PPO objective. Essentially, we want to maximize an expected advantage A_t, where the expectation is taken over pi_theta, but we only have data from a previous step of the language model, pi_theta_old. So we take the expectation over responses generated by pi_theta_old and introduce an importance ratio, pi_theta over pi_theta_old, where pi_theta_old is the previous step's language model and pi_theta is the current step's language model. In this way, you're essentially maximizing the expected advantage for the current policy pi_theta. There are some more tricks in the PPO loss function that keep this ratio from becoming too large or too small during training: it takes the minimum of the raw ratio times the advantage and a clipped ratio times the advantage. As a result, PPO uses an importance-sampling-based method to maximize the advantage for the current policy pi_theta. That's essentially most of the details about PPO.

Now, let's turn to GRPO. GRPO is actually very similar to PPO in that it also uses the advantage and maximizes the exact same objective to update the language model. The main difference is the way you calculate the advantage. Similar to PPO, you still start from a query q and send it to the policy model. The policy model generates multiple responses for that query, O_1 to O_G, as a group, and you still use the reference model and reward model to calculate the KL divergence and the reward for each response. So for the same query you get multiple outputs and multiple rewards. Then you use a group computation to calculate the relative reward for each output, and that relative reward becomes the advantage for every individual token of that output.
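As a rough sketch of these two pieces, here is how the group-relative advantage and the shared clipped surrogate objective might look in PyTorch. The function names are hypothetical, and the per-token KL penalty against the reference model is omitted for brevity.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize each response's reward within its group.

    group_rewards: shape (G,), rewards for the G responses sampled for one query.
    The same scalar advantage is then reused for every token of that response.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)


def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """The clipped objective shared by PPO and GRPO (per-token, KL term omitted).

    logp_new / logp_old: log-probabilities of the sampled tokens under the current
    and old policies. advantages: per-token advantages (from GAE for PPO, or the
    broadcast group-relative reward for GRPO).
    """
    ratio = torch.exp(logp_new - logp_old)              # importance ratio pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # negate so that minimizing maximizes the surrogate
```

For PPO, `advantages` would instead come from generalized advantage estimation using the value model, which is why each token can receive a different value.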
In this way, you get a more brute-force estimation of the advantage for each token, and you use that advantage to update the policy model. As you can see, after getting the advantage, PPO and GRPO are very similar. The main difference lies in the way of estimating the advantage: PPO relies on an actual value model that needs to be trained during the entire process, whereas GRPO gets rid of this value model and thus can be more memory efficient. The cost of removing the value model is that the advantage estimation is more brute force and stays the same for every token in the same response, whereas for PPO the advantage can be different for each individual token.

In short summary, what PPO does is use an actual value model, or critic model, to assign credit to each individual token. In this way, each word or token in the entire generation has a different advantage value, which shows which tokens are more important and which are less important. In GRPO, because we got rid of the value model or critic model, each token has the same advantage as long as it belongs to the same output. So PPO usually gives more fine-grained advantage feedback for each individual token, while GRPO gives a more uniform advantage to the tokens in the same response.

Lastly, I'd like to give a more detailed comparison of the use cases of GRPO versus PPO. Both GRPO and PPO are very effective online reinforcement learning algorithms. The design of GRPO is more well-suited for binary or correctness-based rewards, and it usually requires a larger amount of samples due to the nature of only assigning credit to full responses instead of individual tokens. However, it also requires less GPU memory, since no value model is needed. In contrast, PPO usually works well with both a reward model and binary rewards, and it can be more sample efficient with a well-trained value function. However, it might require more GPU memory because of the additional value model.

So in this lesson, we have learned about the difference between offline reinforcement learning and online reinforcement learning, and dived deeper into the two algorithms GRPO and PPO. In the next lesson, we will use GRPO to improve the math capability of an instruct model. Excited to see you there.