In this lesson, you will build a pipeline for group relative policy optimization, or GRPO, one of the popular online RL methods. Let's have some fun. As you remember, online reinforcement learning lets the model itself explore better responses. In the lab, we'll start by curating a set of math problems, send them to the current language model, and let the model generate multiple responses. We'll create a reward function, a verifiable reward that checks whether a response matches the ground truth or not. Then we get tuples of prompt, responses, and reward, and we use GRPO to update the language model. Great. Let's see all of this in code. For online reinforcement learning, as usual, we start with importing libraries. Everything here is very similar to DPO and SFT, except that from TRL we're using GRPOTrainer and GRPOConfig to set up the training environment for GRPO. Unlike the previous two coding lessons, where we only tested the model on a few example prompts, here let's prepare an evaluation dataset for math, starting with GSM8K. Let's first set use GPU to false, and feel free to turn that to true if you run this on your own GPU machine. We also need to set a persistent system prompt saying that you are a helpful assistant that solves problems step by step and always include the final numeric answer inside \boxed{}. This sentence is critical in making the model output its final response in a good format, so that later we can easily extract the answer and compare it with the ground truth. Next, let's define our reward function, which is useful and important both for training with online RL and for evaluation on GSM8K. It takes the model's generated completions and the ground truth. What we're doing here is first trying a regular expression match to capture the content inside \boxed{}, as we instructed in the system prompt. If there are matches, we just take the very first match as the model's answer; if there's no match, we treat the model's output as empty. Next, we directly compare that content with the ground truth. If the content is the same as the ground truth, the reward is one; otherwise, the reward is zero. Now that we have a reward function defined, let's test how it works. Assume we have a sample prediction from the model saying something like: first, there are a few steps to calculate the answer, followed by a final answer of \boxed{72}, and assume that the ground truth is also 72. Then, when we calculate the reward, the positive sample's reward will be one. Next, let's see a negative example where the sample prediction is off by one: the content inside the box is 71, while the ground truth is 72. Then, if you execute the reward function, the reward will be zero.
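Here is a minimal sketch of that kind of verifiable reward function. The function name, the `ground_truth` argument name, and the handling of TRL's message-style completions are illustrative assumptions rather than the exact lab code:

```python
import re

def reward_func(completions, ground_truth, **kwargs):
    """Binary verifiable reward: 1.0 if the \\boxed{} answer equals the ground truth, else 0.0."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # TRL passes conversational completions as a list of messages;
        # plain strings are also accepted so we can test the function directly.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        # Capture the content of the first \boxed{...} in the completion.
        match = re.search(r"\\boxed\{(.*?)\}", text)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == str(gt).strip() else 0.0)
    return rewards

# Sanity checks mirroring the positive and negative examples above.
print(reward_func(["A few steps... \\boxed{72}"], ["72"]))  # [1.0]
print(reward_func(["A few steps... \\boxed{71}"], ["72"]))  # [0.0]
```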
Now that we have the reward function, we're ready to load the evaluation dataset. We'll load the dataset from openai/gsm8k and take the test split. We'll select only the first five examples to speed up the process, setting the number of samples here to five. We can display the dataset and see what it looks like. You'll see it comes with some questions along with some answers as ground truth. In this case, the final answer is always hidden after the #### marker, so we need to extract that as the ground truth. Now that we have a dataset with questions and answers, we can define a post-processing function that first splits the answer on the #### marker and takes what follows as the ground truth. In this way, we not only get the ground truth, but also rebuild the prompt so it includes both the system prompt we defined before, which instructs the model to put the answer in \boxed{}, and the user prompt, which is the question itself. Then we're ready to map the post-processing function over the dataset and get the new evaluation dataset. Let's take a look at how the new dataset looks. You'll see that after post-processing, the dataset has only two columns. One is the ground truth, which is exactly the ground-truth number extracted from the original responses. The second is the prompt, which is always the system prompt followed by the question. Now that we have the dataset post-processed, we're ready to load the model and evaluate it. We load the Qwen2.5-0.5B-Instruct model and evaluate it on the five prompts from the GSM8K test set. To evaluate this model, we start from empty lists of predictions and ground-truth labels. We go through the post-processed dataset, take the input prompt and ground truth, and generate responses using our generate responses helper, feeding in the model, the tokenizer, and the full messages. Then we append the predictions, append the labels, and print the response and the ground truth for you to take a look. Eventually, we use the reward function to calculate how many responses match the ground truth and report the accuracy. This generation process might take a while, so we'll speed it up in the post edits. Now that the evaluation on the five prompts is done, we're ready to check whether the responses match the ground truths. For the first answer, we see that there is no \boxed{} provided in the answer, so the output doesn't follow the instructed format and cannot be matched to the ground truth. For the second answer, we see that the model puts \boxed{3} in its answer, which matches the ground truth. For the third one, unfortunately, the model hasn't finished due to the token limit, so we still don't see any match with the ground truth. For the fourth one, we see \boxed{180}, which doesn't match the ground truth.
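A rough sketch of this data preparation and evaluation loop is shown below. The system prompt wording, the generation settings, and the direct use of `apply_chat_template` in place of the lab's generation helper are assumptions, and `reward_func` is the sketch from earlier:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM_PROMPT = (
    "You are a helpful assistant that solves problems step by step. "
    "Always include the final numeric answer inside \\boxed{}."
)

def post_process(example):
    # GSM8K stores the final number after the '####' marker.
    ground_truth = example["answer"].split("####")[-1].strip()
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["question"]},
    ]
    return {"prompt": prompt, "ground_truth": ground_truth}

# Only five test examples, to keep the demo fast.
eval_dataset = load_dataset("openai/gsm8k", "main", split="test").select(range(5))
eval_dataset = eval_dataset.map(post_process, remove_columns=["question", "answer"])

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

predictions, labels = [], []
for example in eval_dataset:
    input_ids = tokenizer.apply_chat_template(
        example["prompt"], add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(input_ids, max_new_tokens=300)
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    predictions.append(response)
    labels.append(example["ground_truth"])

accuracy = sum(reward_func(predictions, labels)) / len(labels)
print(f"Accuracy on {len(labels)} samples: {accuracy:.0%}")
```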
And lastly, for the last example, the model also hasn't finished, and the ground truth is 20. So in total, only one out of five examples matches the ground truth, and the evaluation accuracy here is 20%. In practice, we would recommend allowing a much larger maximum number of tokens in generation and also evaluating on the full dataset, since evaluating on only a few samples can come with very large variance. Now that we've finished designing the evaluation process, let's first go through the training process and leave the evaluation of our fully trained model for the end. First, let's start with loading the training dataset. We'll again load the dataset from GSM8K, this time taking the train split instead of the test split. Then we apply the same post-processing function to the training dataset and remove unnecessary columns. If we're not using a GPU, we only select the first ten examples for training. We print the first example here so that we can see what the ground truth and the prompt look like. Now we are ready to kick off our GRPO training. As usual, we need to set up a GRPOConfig first, which includes the batch-size-related hyperparameters, the number of epochs, the learning rate, and the logging steps. The key hyperparameter specific to GRPO is the number of generations. Remember that in GRPO we generate multiple responses for the same prompt, and the number of generations controls exactly how many responses are generated per prompt. Here we set it to four so that we can speed up the training. In practice, you can set it as high as 64 or even 128 so that there are diverse enough responses to compare within the group. Now that we have the GRPOConfig, the dataset, and the reward function defined, we're ready to kick off the GRPO training. Since training the 0.5B model with GRPO can take very long on a CPU machine, right now we're only using a small HuggingFace model to speed up the process. Then we pass the model, the config, the reward function, and the training dataset to GRPOTrainer to kick off the training. This might take a very long time, so we'll speed it up in the post edits. Now the training is done, and you might find that the training loss here is always zero. The reason behind this is that we're starting from a very small model, which cannot get most of the questions correct. That's why in GRPO the relative reward within each group is all zero: the model never gets the answers correct.
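A sketch of this training setup could look like the following, reusing `post_process` and `reward_func` from the sketches above; the small model id and all hyperparameter values are illustrative placeholders rather than the lab's exact configuration:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Train split of GSM8K, post-processed the same way as the eval set;
# only ten examples here so the CPU demo finishes quickly.
train_dataset = load_dataset("openai/gsm8k", "main", split="train").select(range(10))
train_dataset = train_dataset.map(post_process, remove_columns=["question", "answer"])

config = GRPOConfig(
    output_dir="grpo-gsm8k",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    logging_steps=2,
    num_generations=4,          # GRPO-specific: completions sampled per prompt
    max_completion_length=300,  # cap generation length during rollouts
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",  # assumed small stand-in model
    args=config,
    reward_funcs=reward_func,
    train_dataset=train_dataset,
)
trainer.train()
```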
When you switch to a larger model, like the Qwen 2.5 model, you'll see a meaningful training loss and meaningful improvement from the GRPO training process. Now that we have finished the training process, let's take a look at the evaluation results of the fully trained Qwen model. I set the fully-trained-Qwen flag to true so that we can load a model I previously trained with a larger amount of resources using GPUs. Feel free to set it to false and evaluate the small HuggingFace model trained by our small GRPO run on the smaller dataset. Now we are generating the evaluation results for the fully trained Qwen model. It might take some time, so we'll speed it up in the post edits. This evaluation is now complete. Let's take a look at the first response. The response has \boxed{20}, though the ground truth is 18, so it's a mismatch. For the second one, the response is \boxed{3} and the ground truth is 3, so it's a match. The third one still hasn't finished, so there's no match there. For the fourth one, the model is also able to get the answer, 540, which is correct. And for the last one, the boxed answer is 40, though the ground truth is 20. The total evaluation accuracy is 40%. To have a fully meaningful comparison between the trained model and the previous model, please run on the entire GSM8K test set instead of only these five samples. The result here was for a Qwen model I previously trained using GRPO with larger computational resources, including GPUs, and with slightly different config parameters. Please feel free to change the fully-trained-Qwen flag to false to see the results of the small model that we trained with GRPO on a very small dataset, which speeds up the training and gives you the chance to see the full GRPO training loop without waiting too long on the limited computational resources we have here. In this lesson, we went over the entire process of building a math evaluation dataset, designing a verifiable reward function, setting up the evaluation, and going through the full GRPO cycle on top of an existing instruct model to improve its math capability.