Welcome to Post-training of LLMs, taught by Banghua Zhu, who is an Assistant Professor at the University of Washington as well as a co-founder of NexusFlow. Banghua has pre-trained and post-trained many models, and I'm delighted that he is the instructor for this class.

Thanks Andrew. I'm excited to be here.

Training a large language model has two phases. Pre-training is where a model learns to predict the next word or token. From a compute and cost point of view, this is the bulk of training and may require training on trillions or tens of trillions of tokens of text. For very large models, this could take months. Then comes post-training, where the model is further trained to perform more specific tasks, such as answering questions. This phase usually uses much smaller datasets and is also much faster and cheaper.

In this course, you'll learn about three common ways to post-train and customize LLMs, and in fact, you're going to download a pre-trained model and post-train it yourself in a relatively computationally affordable way. You'll learn about three techniques: Supervised Fine-Tuning, or SFT; Direct Preference Optimization, also called DPO; and online reinforcement learning.

Supervised fine-tuning trains a model on labeled prompt-response pairs, and the model learns to follow instructions or use tools by replicating that input-prompt-to-response relationship. Supervised fine-tuning is especially effective for introducing new behaviors or making major changes to the model. In one of the lessons, you'll fine-tune a small Qwen model to follow instructions.

Direct Preference Optimization, or DPO, teaches a model by showing it both good and bad answers. DPO gives the model two options for the same prompt, one preferred over the other. Through a contrastive loss, DPO pushes the model closer to the good responses and away from the bad ones. For example, if the model says "I'm your assistant" but you want it to say "I'm your AI assistant," you label "I'm your assistant" as the bad response and "I'm your AI assistant" as the good response. You will use DPO on a small Qwen instruct model to change its identity.

With online reinforcement learning, the third of the three techniques, you give the LLM prompts, it generates responses, and then a reward function scores the quality of the answers. The model then gets updated based on these reward scores. One way to get a reward model to give reward scores is to start with human judgments of the quality of responses. Then you can train a function to assign scores to the responses in a way that's consistent with the human judgments. The most common algorithm for this is probably Proximal Policy Optimization, or PPO. Another way to come up with rewards is via verifiable rewards, which applies to tasks with objective correctness measures, like math or coding. You can use math checkers, or unit tests for coding, to measure in an objective way whether generated math solutions or code are actually correct. This measure of correctness then gives you the reward function. A powerful algorithm for using these reward functions is GRPO, or Group Relative Policy Optimization, which was introduced by DeepSeek. In this course, you'll use GRPO to train a small Qwen model to solve math problems.

Many people have helped in creating this course. I'd like to thank Oleksii Kuchaiev from Nvidia and Jiantao Jiao from UC Berkeley. From DeepLearning.AI, Esmaeil Gargari also contributed to this course. The first lesson will be an overview of post-training methods.
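To make the three techniques above concrete, here is a minimal sketch, assuming PyTorch-style tensors of per-sequence log-probabilities. This is not the course's actual notebook code; the function names and dummy inputs are illustrative assumptions, showing the SFT cross-entropy objective, the DPO contrastive loss, and the within-group reward normalization used by GRPO.

```python
# Minimal sketch of the three post-training objectives (illustrative only,
# not the course notebooks). Assumes PyTorch; inputs are dummy tensors.
import torch
import torch.nn.functional as F

def sft_loss(token_logits, target_token_ids):
    """Supervised fine-tuning: cross-entropy on the response tokens."""
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        target_token_ids.view(-1),
    )

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: contrastive loss that pushes the policy toward the chosen
    response and away from the rejected one, relative to a frozen
    reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def grpo_advantages(rewards):
    """GRPO: score a group of sampled responses to the same prompt and
    normalize the rewards within the group to get advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

if __name__ == "__main__":
    # Tiny usage example with random numbers in place of real model outputs.
    logits = torch.randn(1, 4, 32)           # (batch, seq_len, vocab_size)
    targets = torch.randint(0, 32, (1, 4))   # response token ids
    print("SFT loss:", sft_loss(logits, targets).item())

    print("DPO loss:", dpo_loss(
        torch.tensor([-12.0]), torch.tensor([-15.0]),
        torch.tensor([-13.0]), torch.tensor([-14.0])).item())

    # e.g. verifiable rewards: 1.0 if the sampled answer was correct, else 0.0
    print("GRPO advantages:", grpo_advantages(
        torch.tensor([1.0, 0.0, 1.0, 0.0])))
```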
In this lesson, you learned when you should do post-training, as well as the menu of post-training options you can choose from. Let's go on to the next video to get started.