Now that you've explored how you can assign rewards and calculate advantages, let's take a closer look at how loss is calculated in the GRPO algorithm. This is the key step that drives your LLM to learn from its experiments during training. The DeepSeek R1 paper introduced the GRPO algorithm, and at first glance the equation looks rather complex. However, it turns out that this big equation can be broken down into four key components that we'll walk through today. The first component is called the policy loss, and it represents the ratio of token probability distributions in your model with and without an adapter. The second component is something you should be familiar with: the advantages that we computed from our reward functions. The third component is called the clipping objective, and it is used to make sure that we don't have large loss values for any individual step. The final term is called the KL divergence, and it is used to make sure that during the training process, the model that we're training doesn't deviate too much from the baseline knowledge that it already has.

With that, let's take a look at how this loss function is actually implemented in code. We'll start by importing transformers and a handful of utility functions, as well as initializing a model called Baby Llama together with its weights and the associated tokenizer. Next, we can take a look at what this model looks like. Here you can see that it's a Llama model that has the embedding layer, a set of transformer layers with attention and MLPs, along with the LM head that is responsible for producing a probability distribution.

Now that we've loaded our model in memory, let's see how we can use it to generate some tokens. We'll start by defining a prompt such as "The quick brown fox jumped over the". This prompt is then tokenized using the tokenizer associated with the model. Once we have tokens, we can pass them into the model using the model's generate method and specify that we want it to produce two new output tokens. We can then convert the output tokens back into text using the tokenizer's decode method and see what they look like. So let's run this cell. What you can see here is that the model predicted that the next two tokens are "icy ground".

In GRPO we use two different models to guide the learning process. The first is called the reference model. This is just the base model without a LoRA, and it remains frozen throughout the training process. The second is the policy model, and this is the model that we will be training using a set of LoRA weights that are constantly updated throughout the training process. For the reference model, we'll just create a copy of Baby Llama and call it ref model. For the policy model, which we'll refer to as model in our code, we'll first define a configuration for the LoRA weights that we want to add to the model. This consists of some common parameters such as the rank and the target modules that we want to insert the weights into. Once we have the LoRA config, we can just call the get_peft_model method, which adds the LoRA weights to the base model. Just to see what this looks like, we can print out the policy model itself. You can see that in the q_proj and v_proj layers we now have the LoRA modules, lora_A and lora_B, and these are the weights that we will keep updating during training.
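To ground that setup, here's a minimal sketch of the same workflow, assuming the Hugging Face transformers and peft libraries. The checkpoint name, LoRA hyperparameters, and variable names below are illustrative stand-ins; the notebook's exact "Baby Llama" checkpoint and settings aren't reproduced here.

```python
import copy
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical small Llama-style checkpoint standing in for the lesson's "Baby Llama".
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Generate two new tokens from the prompt and decode them back to text.
prompt = "The quick brown fox jumped over the"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = base_model.generate(**inputs, max_new_tokens=2, do_sample=False)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:]))

# Reference model: a frozen copy of the base model (no LoRA).
ref_model = copy.deepcopy(base_model)
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

# Policy model: the base model with trainable LoRA adapters on q_proj and v_proj.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
policy_model = get_peft_model(base_model, lora_config)
print(policy_model)
```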
Now that we have our reference model and policy model initialized, let's start implementing the loss function. We'll start by creating a prepare_inputs method that takes a prompt and a completion. The first step is to tokenize the prompt and the completion using the tokenizer. Next, we need to combine the prompt tokens and the completion tokens into a single tensor so that we can pass them to the model. Alongside the input IDs, we also need to produce an attention mask so that the model knows which tokens it should attend to during the forward pass. Next, we'll capture some metadata: the length of the prompt, the length of the completion, and the total length. This will be useful as we produce something called the completion mask. The idea here is that we only want to produce a loss value for the tokens that are associated with the completion. The way this mask works is that we create a tensor of zeros whose length equals the combined length of the prompt and completion tokens, and then set to one every position that follows the prompt tokens. Finally, once we have our input IDs, attention mask, and completion mask set up, we can just return all three of these values.

Once we've prepared our inputs, we'll define a method called compute_log_probs that gathers the probabilities the model assigns to each token in the output. We do this by passing the input IDs and attention mask from the prepare_inputs function into the model so that it produces an output. These outputs contain an attribute called logits, which are the unnormalized raw outputs produced by the model. When we apply the log softmax function to these logits, we get the log probabilities associated with each token. However, what we really care about is the probability assigned to the token that was actually produced in the output, and that is what we extract with a gather step, where we keep only the log probability of the token that was produced.
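As a rough sketch of those two helpers, the code below follows the description above; the function names, the handling of special tokens, and the one-position shift between logits and targets are assumptions about how the notebook implements them.

```python
import torch
import torch.nn.functional as F

def prepare_inputs(prompt, completion, tokenizer):
    # Tokenize the prompt and the completion, then join them into one sequence.
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    completion_ids = tokenizer(completion, return_tensors="pt", add_special_tokens=False)["input_ids"]
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)

    # Attend to every token in the concatenated sequence.
    attention_mask = torch.ones_like(input_ids)

    # Completion mask: zeros over the prompt positions, ones over the completion
    # positions, so only completion tokens contribute to the loss later.
    completion_mask = torch.zeros_like(input_ids)
    completion_mask[:, prompt_ids.shape[1]:] = 1
    return input_ids, attention_mask, completion_mask

def compute_log_probs(model, input_ids, attention_mask):
    # Raw, unnormalized logits from a forward pass.
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # The logits at position t predict the token at position t + 1, so shift by one.
    logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    # log_softmax turns logits into log-probabilities; gather keeps only the
    # log-probability of the token that actually appears at each position.
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
```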
Now with both our prepare_inputs and compute_log_probs functions defined, we can go ahead and implement the GRPO loss function. Into the GRPO loss function we pass the model, which is our policy model carrying the LoRA that we're training; the reference model, which is the frozen base model; the prompt; the completion; and, as you know, the advantage associated with this completion, which we can compute ahead of time with programmatic reward functions. The first step is to prepare these inputs by passing in the prompt and completion. Next, we generate the log probabilities associated with the completion from both the trained policy model and the reference model. Once we have these two components, we can focus on the core policy loss implementation. In the equation we saw earlier, the policy loss is the ratio of the probabilities assigned by the policy model divided by the probabilities assigned by the reference model. That is mathematically equivalent to exponentiating the difference of the log probabilities, so this effectively computes the per-token ratio. The intuition for computing this ratio is that we're trying to see whether our policy model assigns a higher or lower probability to each token compared to the reference model. Once we have the ratios, the next part is to scale them by the advantage assigned to the completion.

This tells us that if the tokens the model produced led to a positive advantage, we want to reinforce them as part of the loss, and if the advantage is negative, we want to penalize them. Finally, most optimizers are built to minimize loss, but in our case we actually want to maximize reward, so we flip the sign of the policy loss, since minimizing the negated objective is equivalent to maximizing it. The last step is to compute the loss over just the tokens we care about, which are the output tokens. If you remember, the completion mask is a set of zeros followed by a set of ones, so the loss coming from the input tokens gets zeroed out and we only consider the loss coming from the completion tokens. We sum the loss across all the tokens and divide by the length of the completion, and this produces the policy loss in the GRPO equation; you'll see these steps pulled together in code below.

With this function implemented, let's see what happens when we pass in the model, the reference model, our initial input prompt, "The quick brown fox jumped over the...", and an example completion that the model may have created, "fence and". Let's assume that our reward functions give this an advantage of 2.0. You'll see here that it computes a loss of -1.6, which is pretty good. One thing to think about, though, is that during the first step of training, the model and the reference model are exactly the same. This means that the ratio in our policy loss is exactly one. If that's true, then how does the model actually start to learn? It's because in the loss function we multiply the ratio by the advantage, so during the first step of training the loss is just the advantage (with the sign flipped). Then, as the weights get updated during training, a difference emerges between the policy model and the reference model, which produces a ratio that is no longer equal to one. That, along with the advantages, continues to drive training. This is also why it's really important to define reward functions that produce a wide range of reward scores. If the reward scores are all constant, we end up with an advantage of zero, which would cause the loss to be zero and would prevent the training process from actually kickstarting.

Since the ratio effectively captures the difference between the probabilities that the reference model and the policy model assign to each token in the completion, we can see what these values look like by computing the ratio directly. One thing you'll notice is that some of these values are significantly larger than others. In this case, one value is about 1370, far larger than the rest. If the advantage assigned to this output is extremely high, the loss effectively becomes a very large negative value, and this can lead to very unstable training during GRPO or any RL algorithm. So we need a way to prevent overly large loss values from being generated during any single training step. The mechanism for this is called ratio clipping, which is the third term in the GRPO loss function.
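Before adding clipping, here's a minimal sketch of the unclipped loss assembled from the steps above, reusing the prepare_inputs and compute_log_probs helpers sketched earlier; the exact signature and masking details are assumptions.

```python
def grpo_loss(model, ref_model, prompt, completion, advantage, tokenizer):
    input_ids, attention_mask, completion_mask = prepare_inputs(prompt, completion, tokenizer)

    # Per-token log-probabilities from the trainable policy (with LoRA)
    # and from the frozen reference model.
    policy_log_probs = compute_log_probs(model, input_ids, attention_mask)
    with torch.no_grad():
        ref_log_probs = compute_log_probs(ref_model, input_ids, attention_mask)

    # exp(log p_policy - log p_ref) is the per-token probability ratio.
    ratio = torch.exp(policy_log_probs - ref_log_probs)

    # Scale by the advantage, then flip the sign so that minimizing the loss
    # maximizes the reward.
    per_token_loss = -ratio * advantage

    # Keep only the completion tokens (the mask is shifted by one to line up
    # with the shifted log-probabilities) and average over their count.
    mask = completion_mask[:, 1:].float()
    return (per_token_loss * mask).sum() / mask.sum()
```

Calling it with the policy model, the reference model, the example prompt and completion, and an advantage of 2.0 mirrors the call described above.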
Let's see how we can add this clipping objective to the loss function we just implemented. We'll keep prepare_inputs the same, keep the parts where we generate the log probabilities the same, and continue to calculate the ratio. The goal of the clipping objective is to prevent any single ratio from being too large or too small, so we calculate two different versions of the loss. The first is the original loss we computed, currently called policy loss; let's rename it unclipped. We also compute an additional loss term called clipped, using a method called torch.clamp, where we pass in the original ratio along with two additional arguments: one minus epsilon and one plus epsilon. The way this clamp function works is that for each value in the ratio tensor, it checks whether the value falls below one minus epsilon or exceeds one plus epsilon. If it's lower than one minus epsilon, it clamps it to one minus epsilon, and if it exceeds one plus epsilon, it clamps it to one plus epsilon. So effectively it makes sure that everything stays within these two bounds, and in practice a good value for epsilon is 0.2. For any token where the ratio exceeds these bounds, it gets clamped to within them, and then we multiply the result by the advantage. Once we have the unclipped loss and the clipped loss, we pick whichever is lower, because that helps ensure we don't have overly large loss values. So our policy loss is no longer just ratio times advantage; it's the minimum of the unclipped and the clipped loss. Once we have the policy loss, we keep everything else the same: we invert the sign so that we can minimize it, and then we compute the loss with the completion mask. To differentiate this from the initial loss function we implemented, let's rename it grpo_loss_with_clip.

Let's see how this updated loss function behaves in practice. We'll take the same set of inputs as before, the model, the reference model, the same prompt, the same generated output, and the advantage, plus this epsilon term. With the clipping objective added to the loss, we can see that the loss is much lower than the GRPO loss without the clipping function. Similar to before, during the first step, when the models are the same, the clipping objective doesn't actually change the loss value. We still get minus two, because the ratio is one and it's within the bounds of one minus epsilon and one plus epsilon. One important thing to remember is that the clipping happens on each token's probability ratio, not on the overall loss.

One question we might ask ourselves is how often tokens actually get clipped. We can see this by taking the same prompt and completion, producing the log probabilities from the reference model and the policy model, computing the unclipped and clipped ratios as we did above, and then visualizing them. Since we're only concerned with the completion tokens, as they're also the ones used to calculate the loss, let's see how clipping affected the ratios for those tokens. For the first token, the unclipped ratio is significantly larger than the bounds of one minus epsilon and one plus epsilon, so the clipped ratio is 1.2. Since the clipped ratio is much smaller than the unclipped ratio, we end up using the clipped ratio in the loss value, and the same applies to the second output token as well. The full clipped loss function is sketched below.
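Here's a sketch of that clipped variant, again reusing the earlier helpers; the default epsilon of 0.2 follows the value mentioned above, while the rest of the signature is an assumption.

```python
def grpo_loss_with_clip(model, ref_model, prompt, completion, advantage, tokenizer, epsilon=0.2):
    input_ids, attention_mask, completion_mask = prepare_inputs(prompt, completion, tokenizer)
    policy_log_probs = compute_log_probs(model, input_ids, attention_mask)
    with torch.no_grad():
        ref_log_probs = compute_log_probs(ref_model, input_ids, attention_mask)

    ratio = torch.exp(policy_log_probs - ref_log_probs)

    # Unclipped term: the raw ratio scaled by the advantage.
    unclipped = ratio * advantage
    # Clipped term: clamp each ratio to [1 - epsilon, 1 + epsilon] before scaling.
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage

    # Take the smaller of the two objectives per token, then negate for minimization.
    per_token_loss = -torch.min(unclipped, clipped)

    mask = completion_mask[:, 1:].float()
    return (per_token_loss * mask).sum() / mask.sum()
```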
One common observation during reinforcement learning is that as we update the policy model, it starts to deviate, in the tokens it produces, from the reference model. This is not necessarily a bad thing, but we often introduce a penalty term called the KL divergence to prevent the policy model from drifting too far from the reference model, and this is the final term in the GRPO loss function.

Let's take the GRPO loss function we just implemented with the clipping objective. Since the KL divergence is a penalty term, it is additive to the loss we've computed so far, so we'll keep that as the policy loss and then work out how to compute the KL divergence term. We'll start by renaming this function grpo_loss_with_kl to differentiate it from the previous implementation, and then we can add the KL divergence, which comes from this equation in the paper. At first glance, this equation is a bit opaque, but the idea behind this term is that it's a way of measuring how much the distributions of the policy model and the reference model deviate from each other. We define delta as the difference between the log probabilities produced by the policy model and those produced by the reference model. When delta is positive, it means the policy model is more confident than the reference model about the token it's producing; when it's negative, the policy model is less confident than the reference model. Once we calculate our per-token KL divergence using this equation, we can rewrite our new per-token loss as the original policy loss that we computed before, which comes from the clipped objective, minus a scaling parameter, which we'll call beta, times the per-token KL divergence. The idea is that this acts as a penalty, and beta can be as small or as large as you'd like. Its value often depends on two things: the task you're actually trying to train the model for, and how much of the reference model's capabilities you want to keep in the new model that you're training. We'll look at the effects of beta in just a moment. The last thing we need to do is add the beta term to our loss function definition, so here we can add a new parameter called beta and set it to 0.1 to start with. This gives us our complete GRPO loss function.

Now let's understand how the KL divergence term works by visualizing its effect in the GRPO loss function. The KL divergence is ultimately a function of the delta term we just talked about, so let's look at how the KL divergence behaves when delta ranges from negative six to six. At the point where delta is zero, the policy model and the reference model assign the same probabilities to the output, so they are equally confident in the tokens being produced. Now, say the policy model assigns higher confidence to the tokens being produced than the reference model does. In that case, the KL divergence takes on a small positive value on the right side of zero. Essentially, what the KL divergence term does in the loss function is tell the policy model that it's doing great and the outputs it's producing are favorable, but that it shouldn't get too far ahead of itself, pulling it back very slowly toward the reference model. In the scenario where the policy model is trying out different strategies but is underconfident in those predictions, you can see that the KL penalty increases very rapidly as this difference grows, which very quickly tells the policy model that it is off track and needs to course correct by returning closer to the reference model.
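Putting the pieces together, here's a sketch of the complete loss with the clipped objective and the beta-weighted KL penalty. The per-token KL estimator below, exp(-delta) + delta - 1, is the form used for this term in the GRPO paper's loss, written in terms of the delta described above; as before, the signature and defaults are assumptions.

```python
def grpo_loss_with_kl(model, ref_model, prompt, completion, advantage, tokenizer,
                      epsilon=0.2, beta=0.1):
    input_ids, attention_mask, completion_mask = prepare_inputs(prompt, completion, tokenizer)
    policy_log_probs = compute_log_probs(model, input_ids, attention_mask)
    with torch.no_grad():
        ref_log_probs = compute_log_probs(ref_model, input_ids, attention_mask)

    # Clipped policy objective, exactly as before.
    ratio = torch.exp(policy_log_probs - ref_log_probs)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    policy_objective = torch.min(unclipped, clipped)

    # Per-token KL estimator, with delta = log p_policy - log p_ref:
    # kl = exp(-delta) + delta - 1 (zero at delta = 0, grows fast for negative delta).
    delta = policy_log_probs - ref_log_probs
    per_token_kl = torch.exp(-delta) + delta - 1

    # Subtract the beta-weighted KL penalty from the objective, then negate.
    per_token_loss = -(policy_objective - beta * per_token_kl)

    mask = completion_mask[:, 1:].float()
    return (per_token_loss * mask).sum() / mask.sum()
```

Sweeping beta over values like 0.0, 0.1, and 0.5 with this function is one way to see the comparison discussed next.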
Let's also look at the effect of the beta term, which weights how much of the KL divergence we actually add to the loss. We can try three different values of beta, 0, 0.1, and 0.5, and see the effect each has on the loss. When beta is equal to zero, the loss skips the KL divergence term entirely. As we increase the value of beta, you can see that the loss value starts to decrease, because we're telling the model that it shouldn't deviate too far from the reference model. With beta equal to 0.5, the loss value decreases even more, preventing overly large deviations from the reference model. At Predibase we've found that a beta value of 0.1 is usually a safe bet for ensuring that the model has the room to learn the task you care about. However, if you're optimizing for generality while still learning some aspects of the new task, you may want to set a higher beta value of 0.2 or even up to 0.5.

So, in summary, loss in reinforcement learning seems complicated because of the big equation, but it's really similar to the loss used in pre-training or in supervised fine-tuning. The key difference is that each text sample is weighted by how much reward it receives; in other words, better answers get higher advantages, and clipping and the KL divergence are there to put the brakes on big changes. Companies like Predibase are building training systems that implement these algorithms, so you don't have to do it yourself. Let's move on to the final lesson to see how the Wordle model was tuned using the RFT service from Predibase.