In this course, you've learned about several popular post-training methods and where they're most commonly used. Let's take another look at all of this.

For supervised fine-tuning, or SFT, the principle is to imitate the example responses by maximizing the probability of the response. It comes with a simple implementation and is great for jumpstarting new model behavior. However, it might degrade performance on tasks that are not included in the training data.

For online reinforcement learning, the principle is to maximize the reward function for the response. It's better at improving model capabilities without degrading performance on unseen tasks. However, it comes with the most complex implementation and requires well-designed reward functions to work really well.

For direct preference optimization, the idea is to encourage the good answer while discouraging the bad answer that is provided. It trains the model in a contrastive fashion and is really good at fixing wrong behaviors and improving targeted capabilities. However, it can be prone to overfitting, and its implementation complexity sits in between SFT and online RL.

Lastly, I want to discuss one point on why online reinforcement learning might degrade performance less than SFT. Usually, when you send a prompt to a language model and let it generate its own answers R1, R2, and R3, online reinforcement learning gets a reward for each of these responses from the model's own generation, feeds that back to the language model, and updates the model weights based on that signal. Essentially, online reinforcement learning tweaks the model's behavior within the model's own native manifold.

On the other hand, for supervised fine-tuning, you send the prompt to the language model, where the model might still generate different responses. However, the provided example response to imitate can be extremely different from all the responses the model would generate on its own. In this case, SFT might drag the model onto an alien manifold and risk unnecessary changes to the model weights.

This concludes the lesson and the whole course on post-training of language models. I really look forward to what you build next.
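For reference, here is a minimal sketch of the three training objectives discussed above, written in PyTorch. It assumes a Hugging Face-style causal language model whose forward pass returns `.logits`; the helper `response_logprob`, the REINFORCE-style RL loss, and names such as `beta` and `reward` are illustrative choices for this sketch, not code from the course.

```python
import torch
import torch.nn.functional as F

def response_logprob(model, prompt_ids, response_ids):
    """Sum of log-probabilities of the response tokens given the prompt.

    Assumes a Hugging Face-style causal LM whose forward pass returns .logits.
    prompt_ids and response_ids are 1-D LongTensors of token ids.
    """
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]           # logits predicting each next token
    targets = input_ids[0, 1:]
    token_logps = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[prompt_ids.numel() - 1:].sum()  # keep only the response positions

def sft_loss(model, prompt_ids, example_ids):
    # SFT: imitate the example response by maximizing its probability
    # (equivalently, minimizing its negative log-likelihood).
    return -response_logprob(model, prompt_ids, example_ids)

def reinforce_loss(model, prompt_ids, sampled_ids, reward):
    # Online RL (REINFORCE-style sketch): sampled_ids is a response the model
    # itself generated for this prompt, so the update stays on the model's
    # own distribution; the scalar reward scales the log-likelihood term.
    return -reward * response_logprob(model, prompt_ids, sampled_ids)

def dpo_loss(model, ref_model, prompt_ids, chosen_ids, rejected_ids, beta=0.1):
    # DPO: contrastive objective that pushes up the chosen answer and pushes
    # down the rejected answer, measured relative to a frozen reference model.
    pi_chosen = response_logprob(model, prompt_ids, chosen_ids)
    pi_rejected = response_logprob(model, prompt_ids, rejected_ids)
    with torch.no_grad():
        ref_chosen = response_logprob(ref_model, prompt_ids, chosen_ids)
        ref_rejected = response_logprob(ref_model, prompt_ids, rejected_ids)
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)
```

The contrast between `reinforce_loss` and `sft_loss` mirrors the point above: the RL loss is computed on responses the model sampled itself, while the SFT loss is computed on an external example that may sit far from anything the model would generate.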