This lesson is all about building the DPO pipeline on a small training dataset. Let's get coding. As you remember, DPO is a contrastive learning method that learns from both positive and negative samples. In this lab, we'll start from a small Qwen instruct model that has its own identity as Qwen: when the user asks "Who are you?", it answers "I'm Qwen." We then create comparison data for identity questions in which we change the name from Qwen to Deep Qwen, using the Deep Qwen answer as the positive sample and the Qwen answer as the negative sample. We create this comparison data at a larger scale and run DPO on top of the existing instruct model. After that, we'll have a fine-tuned Qwen model with a new identity, and when the user asks "Who are you?", hopefully the assistant will respond "I'm Deep Qwen." Okay, let's see all of that in code.

For the implementation of DPO, we'll start by importing the relevant libraries used in the DPO coding part. These include torch, pandas, and transformers classes such as AutoTokenizer and AutoModelForCausalLM, as we discussed before. From TRL, we'll also import DPOTrainer and DPOConfig for training with DPO. From datasets, we import load_dataset and the Dataset type. We also have the helper functions we implemented last time, which include generate_responses, test_model_with_questions, and load_model_and_tokenizer.

Next, let's load the instruct model and test it on some simple identity-related questions. We'll set USE_GPU to False since we'll mostly be operating on CPU machines, but on your own machine, feel free to set it to True. For questions, we include "What is your name?", "Are you ChatGPT?", and "Tell me about your name and organization" to test the model's knowledge of its identity. We then load the model and tokenizer from Qwen2.5-0.5B-Instruct, which is the instruct model, and test it with the questions listed here. As you can see from the model outputs, for an identity question like "What's your name?", the model says "I'm Qwen, a language model trained by Alibaba Cloud." For "Are you ChatGPT?" it likewise says it's Qwen, and similarly for the remaining question. So the model has a clear identity as Qwen and knows it was created by Alibaba Cloud.

Next, let's check the results of a DPO-trained model. I have a trained model, Qwen2.5-0.5B-DPO, and we can test its responses after DPO. For that training, I curated data that changes the identity from Qwen to Deep Qwen by adding "Deep Qwen" to most of the responses, and you'll see that after post-training with DPO, the model changes its identity from Qwen to Deep Qwen in each of these answers. Next, you'll see how we go through the entire DPO procedure to change the model's identity. We'll run the whole procedure with HuggingFace SmolLM, a slightly smaller model; if you're doing this on your own GPU, feel free to start from Qwen 2.5 and reproduce the exact results we have here. We will start by loading a small model for training without GPUs.
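As a rough sketch of that setup, the cell below loads the instruct model and probes it with the identity questions. The model name and questions come from the lesson; the generation settings and the inline chat templating (used here instead of the course's test_model_with_questions helper, whose exact signature isn't shown) are assumptions.

```python
# Minimal sketch: load the instruct model and probe its identity.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

USE_GPU = False  # set True if you have a GPU available
device = "cuda" if USE_GPU and torch.cuda.is_available() else "cpu"

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

questions = [
    "What is your name?",
    "Are you ChatGPT?",
    "Tell me about your name and organization.",
]

for q in questions:
    # Build a chat-formatted prompt and generate a short answer.
    messages = [{"role": "user", "content": q}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(device)
    outputs = model.generate(inputs, max_new_tokens=100)
    answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(f"Q: {q}\nA: {answer}\n")
```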
Next, let's prepare the DPO dataset needed for changing the identity. We start from an identity dataset from HuggingFace, which contains prompts and responses for different identity-related questions. As shown here, the conversations include questions like "Who are you?", to which the assistant responds "I'm an assistant, a helpful AI created by a developer," and so on. It may also include multi-turn conversations about the identity and the developer of the model. After loading the identity dataset, we have a handful of prompts that query the model about its own identity.

Now let's set a few parameters so we can change the original name from Qwen to Deep Qwen, along with a system prompt to replace the original Qwen 2.5 system prompt, since the original system prompt already contains the model's identity and developer. The replacement simply says the model is a helpful assistant, so the system prompt no longer tells the model anything about its identity or its developer. If we're not using a GPU and only operating on CPU, we select just the first five samples from the original dataset to speed up the process and avoid waiting a very long time.

Next, let's define a function that builds the actual DPO dataset. A DPO dataset requires a preferred and a less preferred answer, which we call "chosen" and "rejected" here. To generate such a dataset, we start from the existing conversations in the identity dataset and extract the last "human" turn as the prompt. We then generate a response to that prompt with the current model; if generation fails, we catch and print the error. We always use the model's own generation as the rejected (less preferred) response, because we want to change the model's own identity, and for the chosen response we replace every occurrence of the original name, Qwen, with the new name, Deep Qwen, in the text the model itself generated. In this way we arrive at chosen and rejected conversations: the chosen one is composed of the system prompt, the original prompt from the dataset, and the response with Qwen replaced by Deep Qwen, while the rejected one is always the model's own original response. This gives us the preferred responses as "chosen" and the less preferred responses as "rejected."

Next, let's map the build_dpo_chatml function over the raw dataset and remove the unnecessary columns. Since we're operating only on CPU, we map just the five samples of this raw dataset. This step uses the model to generate the rejected responses, which takes some time, so for the original full-size dataset of 1,000 samples, you would need considerably longer to finish the generation. I'm therefore also providing a fully mapped dataset, which turns Qwen's own responses into the Deep Qwen identity, and you can see the mapped results here.
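A minimal sketch of this curation step might look like the following. The dataset id is a placeholder, the "conversations" schema with "from"/"value" keys is inferred from the narration, and generate_responses is a simplified stand-in for the course's helper; it reuses the model and tokenizer loaded in the previous snippet.

```python
# Sketch of the chosen/rejected construction, assuming `model` and
# `tokenizer` from the previous snippet are already loaded.
from datasets import load_dataset

ORIGINAL_NAME = "Qwen"
NEW_NAME = "Deep Qwen"
SYSTEM_PROMPT = "You're a helpful assistant."  # replaces Qwen's own system prompt
USE_GPU = False

raw_ds = load_dataset("your-identity-dataset", split="train")  # placeholder dataset id
if not USE_GPU:
    raw_ds = raw_ds.select(range(5))  # only 5 samples on CPU to keep it fast

def generate_responses(model, tokenizer, prompt, max_new_tokens=120):
    """Simplified stand-in for the lesson's generation helper."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

def build_dpo_chatml(example):
    msgs = example["conversations"]
    # Use the last human turn as the prompt.
    prompt = next(m["value"] for m in reversed(msgs) if m["from"] == "human")
    try:
        rejected = generate_responses(model, tokenizer, prompt)  # model's own answer
    except Exception as e:
        print(f"Generation failed: {e}")
        rejected = ""
    chosen = rejected.replace(ORIGINAL_NAME, NEW_NAME)  # swap in the new identity
    return {
        "chosen": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": chosen},
        ],
        "rejected": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": rejected},
        ],
    }

dpo_ds = raw_ds.map(build_dpo_chatml, remove_columns=raw_ds.column_names)
```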
The chosen conversations always answer with Deep Qwen as the identity, and the rejected ones always say Qwen, and that's the only difference across all the conversations in this DPO dataset. Now that we've finished the curation part, let's kick off the real DPO training. First, if we're not using a GPU, we take only the first 100 samples to speed up the process.

We also need a DPOConfig. Similar to the SFTConfig, it has a per-device train batch size, gradient accumulation steps, number of training epochs, learning rate, and logging steps. Everything is the same as the SFT config except for one new hyperparameter, beta, which we discussed in the original DPO formula: beta decides how much weight the log-probability differences carry, and it's an important hyperparameter to tune together with the learning rate for the best DPO performance.

Now that we have both the config and the dataset ready, we can kick off DPO training. We set the model to the one we loaded here, and we set the reference model to None so that a copy of the original model is created automatically as the reference model with its weights frozen. The args are the config we set before, the processing class is the tokenizer, and the train dataset is the DPO dataset we just built. Now we're ready to train; a sketch of this setup follows below.
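Here is a minimal sketch of that training setup with TRL, continuing from the model, tokenizer, and dpo_ds of the previous snippets. The batch size of eight, the single epoch, and the 100-sample cap come from the lesson; the remaining values (learning rate, logging steps, beta) are illustrative defaults, not the exact ones used in the course.

```python
# Sketch of the DPO training setup described above.
from trl import DPOConfig, DPOTrainer

if not USE_GPU:
    dpo_ds = dpo_ds.select(range(min(100, len(dpo_ds))))  # first 100 samples on CPU

config = DPOConfig(
    output_dir="./dpo-output",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-5,      # illustrative value
    logging_steps=2,
    beta=0.1,                # weights the chosen-vs-rejected log-prob difference
)

trainer = DPOTrainer(
    model=model,             # the policy being fine-tuned
    ref_model=None,          # None: TRL clones the model as a frozen reference
    args=config,
    processing_class=tokenizer,
    train_dataset=dpo_ds,
)
trainer.train()
```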
As you can see, we have 100 samples in total, trained for one epoch with a batch size of eight, so there are only a handful of steps to finish the DPO process. As we discussed before, since we're training a smaller model on a smaller dataset that only changes Qwen to Deep Qwen, this training isn't expected to have the same effect as the results I showed earlier. Now that the DPO training is done on a smaller dataset with a smaller model, changing its behavior and identity from Qwen to Deep Qwen, I'll also provide a code snippet that shows the result of a complete training run of Qwen2.5-0.5B on the same dataset at full scale. You'll see that after such training, the model's identity changes to Deep Qwen while everything else stays the same, including its developer, its own knowledge, and so on. So feel free to swap in the fully trained Qwen here to see those results; on the smaller model we ran DPO with a very small dataset to speed up training and give you a chance to see the full DPO loop without waiting too long on the limited computational resources we have here.

In this lesson, we went through the DPO process of data curation and then ran the full DPO cycle on a smaller model, comparing the identity of the Qwen 2.5 model before and after DPO training. In the next lesson, you'll learn the basics of online reinforcement learning. I'll see you there.