In this lesson, you will learn basic concepts about supervised fine-tuning, including the method, common use cases, and principles for high-quality data curation in SFT. Let's dive in.

In short, SFT can be considered as imitating example responses. You can start from any language model you want, as long as it can predict a response given a prompt. It can be a base model, where when a user asks a question, the base model might just predict the most likely next tokens. So it might simply follow up with very similar questions instead of answering the question.

In order to perform SFT on such a base model, you'll need to create some labeled data in the format of user questions and ideal assistant responses. The data might be in the format of "Tell me about your identity," and the assistant will respond saying, "I'm Llama," or any model that you want it to be. The user might also ask, "How are you?" and the assistant can say, "I'm doing great." By preparing a large dataset of such labeled data, we are ready to do SFT and imitate those example responses provided in the labeled data.

The way SFT works is by minimizing the negative log likelihood of the response given the prompt, where you take the sum over all the labeled data. We'll go deeper into this loss function in the next slide. After performing SFT on a base model, what you get is a fine-tuned model, or an instruct model, which is able to respond to any user query properly if done correctly.

So let's take a closer look at the formula here. In SFT, we minimize the negative log likelihood of the responses, where minimizing the negative log likelihood is equivalent to maximum likelihood, and we use a cross-entropy loss here. For any data point of index i, which is just a specific prompt-response pair, the SFT loss is the negative log probability of the response given the prompt. It can be further written as a negative log likelihood where the likelihood is a product of the probabilities of the tokens in the response given all the prior tokens, including the prompt tokens. So in this way, we train the model to maximize the probability of outputting your provided response given the prompt. That's why SFT is trying to imitate those example responses here.
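Written out, the loss described above is, for the $i$-th prompt-response pair $(x^{(i)}, y^{(i)})$:

$$\mathcal{L}_{\text{SFT}}^{(i)} = -\log p_\theta\big(y^{(i)} \mid x^{(i)}\big) = -\sum_{t} \log p_\theta\big(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}\big)$$

Below is a minimal PyTorch sketch of this per-example loss, not the exact code from the course; the function name `sft_loss`, the tensor shapes, and the `prompt_len` argument are illustrative assumptions. It masks out the prompt tokens so that only the response tokens contribute, which is exactly the "negative log probability of the response given the prompt" described above.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # logits:    (seq_len, vocab_size) model outputs for one prompt+response sequence
    # input_ids: (seq_len,) token ids of that same sequence (prompt, then response)
    # prompt_len: number of prompt tokens; only response tokens contribute to the loss

    # Position t predicts token t+1, so shift logits and targets by one.
    shift_logits = logits[:-1]
    shift_targets = input_ids[1:]

    # Keep only the positions that predict response tokens, not prompt tokens.
    positions = torch.arange(shift_targets.size(0), device=shift_targets.device)
    response_mask = positions >= (prompt_len - 1)

    # Summed cross-entropy = -sum_t log p(y_t | prompt, y_<t),
    # the negative log likelihood of the response given the prompt.
    return F.cross_entropy(shift_logits[response_mask],
                           shift_targets[response_mask],
                           reduction="sum")
```

In practice you would average this over a batch of labeled examples and take gradient steps on the model parameters, which is the "sum over all labeled data" mentioned above.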
So there are a few best, or most appropriate, use cases for supervised fine-tuning. The first one is when you want to jump-start a new model behavior. It might be the case where you want to turn a pre-trained language model into an instruct model, or the case where you want to turn a non-reasoning model into a reasoning model. Or there might be a specific scenario where you want the model to use certain tools without providing the tool descriptions in the prompt, so the model just assumes it already has access to the tools and calls them in its responses. In those cases, SFT is very ideal for jump-starting such model behaviors.

A second use case is to improve certain model capabilities. One scenario I'd like to highlight here is distilling capabilities into a smaller model by training on high-quality synthetic data generated by a larger model. In this case, you're essentially distilling a larger model's capability into a smaller model using supervised fine-tuning.

There are some principles, or recommended ways, to do supervised fine-tuning data curation. The common methods for high-quality SFT data curation include the following few examples:

The first one is distillation. As we discussed before, one can generate responses from a stronger and larger instruct model, and let a smaller model imitate those generated responses.

The second one is best-of-k, or rejection sampling, where one generates multiple responses from the same original model that you want to train, and selects the best among them using either a reward function or some other automatic method. In this way, one can get the best response and train the model to imitate the best responses generated by the model itself.

The third one is filtering, where you start from a very large-scale SFT dataset, collected from HuggingFace or from your internal database, and then filter it according to both the quality of the responses and the diversity of the prompts, to get a smaller-scale SFT dataset that has higher quality and is diverse enough.

Besides the common methods mentioned here, I'd also like to highlight that in SFT data curation, quality is usually much more important than quantity for improving capabilities. If you have 1,000 really high-quality and diverse examples, that can usually outperform the SFT results of 1 million mixed-quality examples. The rationale behind this is that SFT usually requires imitating all the data provided by you. If there are some really bad responses in the mixed-quality data, the model will be forced to imitate such responses, thus degrading the performance. So data quality here can be really important for the success of SFT.

Lastly, I'd like to highlight one direction in model tuning that's completely parallel and orthogonal to any post-training method: the choice of full fine-tuning versus parameter-efficient fine-tuning. In full fine-tuning, let's say we have one layer of the neural network, where h is the layer's output, W is the original weight matrix of that layer, and x is the layer's input. What we usually do for fine-tuning is add some delta W to the weights, where this delta W comes from gradient descent, and the delta W has the exact same size as the original weights. So in this way, you have to introduce an additional d-by-d matrix in order to do the model updates.

There's an alternative method called parameter-efficient fine-tuning, where we still have the original layer output h, the layer input x, and the original weights of that layer W. But instead of directly adding a delta W of the same size as the original weight W, we add a multiplication of two smaller matrices, B times A, where B is a d-by-r matrix and A is an r-by-d matrix, and r is usually much smaller than d. In this case, the effective number of parameters to update is only the total number of parameters in B and A, which can be much smaller than the size of the original weights.
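To make the parameter counting concrete, here is a minimal PyTorch sketch of one such low-rank adapted layer, in the spirit of LoRA but not the exact implementation from any library; the class name `LoRALinear`, the initialization scale, and the square 1024-dimensional layer are illustrative assumptions (real LoRA implementations also include a scaling factor, among other details).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer computing h = W x + B A x, where the original weight W
    is frozen and only the low-rank factors B (d_out x r) and A (r x d_in)
    are trained, with r much smaller than d."""

    def __init__(self, d_in: int, d_out: int, r: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # freeze the original weights W
        # B starts at zero, so B A = 0 and training begins from the
        # original model's behavior.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(d_in=1024, d_out=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 1024 * 8 = 16,384 trainable parameters
```

For this 1024-by-1024 layer with r = 8, the adapter trains 16,384 parameters instead of the roughly 1.05 million a full delta W would require, which is where the savings discussed next come from.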
In this way, you save a lot of memory during such calculations, and it also makes them more efficient to compute.

I'd like to mention here that both full fine-tuning on the left and parameter-efficient fine-tuning on the right can be used in combination with any of the training methods we'll be discussing here, including supervised fine-tuning, direct preference optimization, and online reinforcement learning. So it's up to you whether you want to go with full fine-tuning or parameter-efficient fine-tuning in any of the methods here. With a parameter-efficient fine-tuning method like LoRA, you save a lot of memory, but on the other hand, the model also learns less while forgetting less, because there are just fewer parameters to tune.

In this lesson, you have learned the details of supervised fine-tuning and the differences between full fine-tuning and parameter-efficient fine-tuning. In the next lesson, we'll do some coding practice on supervised fine-tuning, turning a base model into an instruct model. See you there.