In this lesson, you'll learn the basic concepts of direct preference optimization (DPO), including the method itself, common use cases, and principles for high-quality data curation in DPO. All right, let's go.

Let's take a look at the detailed formulation of DPO. DPO can be considered a contrastive learning method that learns from both positive and negative responses. Like SFT, we can start from any LLM, though it's usually recommended to start from an instruct LLM, where the model can already answer basic questions from the user. Let's say the user asks, "Who are you?" and the assistant says, "I'm Llama." In this scenario, we'd like to change the model's identity by curating comparison data prepared by a labeler. That labeler can be a human labeler or even a model-based labeler that curates the dataset for us. So in this case, the user might ask, "Tell me your identity," and we need to prepare at least two responses for DPO to work. We can prepare one response saying "I'm Athene" and another saying "I'm Llama," where "I'm Athene" is labeled as the preferred response and "I'm Llama" is labeled as the less preferred response. In this way, we encourage the model to say "I'm Athene" over "I'm Llama" when responding to identity-related questions.

After collecting such comparison data, you are ready to perform DPO on top of this language model using the prepared data and a particular loss function, which we will dive into soon in this lesson. After performing DPO on top of the language model, we'll get a fine-tuned LLM that hopefully learns from both the positive and negative samples curated here. In this case, it will try to imitate the preferred samples, so if the user asks again, "Who are you?" the assistant will hopefully answer "I'm Athene" rather than "I'm Llama." In this way, we get to change the identity of the model using the DPO approach.

Let's take a closer look at the loss function and what DPO is really doing. DPO can be viewed as minimizing a contrastive loss, which penalizes the negative response and encourages the positive response. The DPO loss is actually a cross-entropy loss on the reward difference of a reparameterized reward model, which we'll dive deeper into here. The DPO loss is a negative log of a sigmoid function of a log difference, where sigma is the sigmoid function and beta is a very important hyperparameter that we can tune during DPO training: the higher the beta, the more important this log difference becomes. Inside the big parentheses, we have two log ratios, one for the positive sample and one for the negative sample.

Let's look at the top one first. It is the log of the ratio of two probabilities. The numerator, pi theta, is the fine-tuned model: for the fine-tuned model, what's the probability of the positive response given the prompt? The denominator is the reference model, which is a copy of the original model with its weights fixed. The reference model is not tunable; we only look at the probability of the original model generating the positive response given the prompt. Similarly, for the negative sample, we also have a log ratio, where pi theta is your fine-tuned model, theta is the weights you'd like to tune, and pi reference is the fixed reference model, which can be a copy of the original model.

Essentially, each of these log-ratio terms can be viewed as a reparameterization of a reward model. Viewed that way, the DPO loss is a sigmoid-based loss on the reward difference between the positive sample and the negative sample, and DPO is essentially trying to maximize the reward for the positive sample and minimize the reward for the negative sample. For details on why such a log ratio can be viewed as a reparameterization of a reward model, I recommend you read the original DPO paper.
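To make this concrete, here is a minimal PyTorch sketch of the DPO loss just described. It assumes you have already computed the summed log-probabilities of each response under the model being tuned and under the frozen reference copy; the function and tensor names are illustrative placeholders, not taken from any particular training library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta * log(pi_theta(y | x) / pi_ref(y | x)) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward difference, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy summed log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]))
print(loss)  # a scalar tensor
```

Notice how beta scales the implicit rewards, so a larger beta makes the log-ratio difference matter more, matching the description above.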
There are some best use cases for DPO as well. The first and most important use case is changing model behavior. DPO works really well when you want to make small modifications to the model's responses. This includes changing the model's identity, improving the model's multilingual responses or instruction-following capability, or changing some safety-related responses of the model. The second use case is improving model capabilities. DPO, when done right, can be better than SFT at improving model capabilities due to its contrastive nature of seeing both good and bad samples; especially when DPO is done online, it can be even better for improving capabilities.

Here are a few principles of data curation for DPO. There are a few common methods for high-quality DPO data curation. The first is a correction method, where you generate responses from the original model, take each response as a negative sample, and make some enhancements to turn it into a positive response. The simplest example would be changing the identity of the model: you start from a negative example generated by the current model itself, which might say "I'm Llama" for a question like "Who are you?" You can then edit that response directly, replacing "Llama" with any model identity you want. In this case, we want the model to say "I'm Athene" for the same question, so we take the edited response as the positive sample. In this way, you can automatically create large-scale, high-quality contrastive data for DPO training using this correction-based method.

The second method can be considered a special case of online or on-policy DPO, where you generate both the positive and negative examples from your model's own distribution. Essentially, you generate multiple responses from the current model you want to tune for the same prompt, then collect the best response as the positive sample and the worst response as the negative sample. To determine which response is better and which is worse, you can use a reward function or human judgment.
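As a rough illustration of this second, on-policy method, here is a minimal sketch in which `generate` and `reward` are hypothetical stand-ins for sampling from the current model and scoring a response (with a reward model or human judgment); they are not real library calls.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical placeholder: sample one response from the current model.
    return random.choice(["I'm Llama.", "I'm Athene.", "I am an AI assistant."])

def reward(prompt: str, response: str) -> float:
    # Hypothetical placeholder: score a response with a reward model or a human label.
    return float("Athene" in response)

def best_worst_pair(prompt: str, num_samples: int = 8) -> dict:
    """Sample several on-policy responses; keep the best as chosen, the worst as rejected."""
    responses = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(responses, key=lambda r: reward(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

print(best_worst_pair("Who are you?"))
```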
Another thing to pay attention to is avoiding overfitting during DPO. Because DPO is essentially doing reward learning, it can easily overfit to shortcuts: the preferred answers might contain some shortcut to learn compared with the non-preferred answers. One example would be when the positive samples always contain a few special words while the negative samples do not. Training on such a dataset can be very fragile, and it might require much more hyperparameter tuning to get DPO working.

In this lesson, we've gone through the details of DPO training and some principles of DPO data curation. In the next lesson, we'll dive into a coding practice for DPO that changes the model's identity. Excited to see you there!