In this lesson, you will learn the basic concepts of post-training methods. Let's dive in.

Let's first see what post-training is. Usually, when people train a language model, we start from a randomly initialized model and do pre-training first. Here we try to learn knowledge from everywhere, including Wikipedia, Common Crawl, which is crawled from data across the internet, or GitHub for coding data. After pre-training, we get a base model that is able to predict the next word or token, where each token is a subword, as highlighted in the figure here. Starting from this base model, we do post-training as the next step, which tries to learn responses from curated data. This includes chat data, tool-use data, or agent data. After this procedure, we arrive at an instruct model or chat model, which is able to respond to instructions or talk to the user. When there is a question like "What is the capital of France?", the model will be able to answer it, saying the capital of France is Paris. After this step, we can go even further and continue post-training, which tries to change the model's behavior or enhance certain capabilities of the model. After this, we arrive at a customized model that is specialized in certain domains or has specific behaviors. In this example, it might be able to write a better SQL query for any instruction here.

Let's take a look at the methods used during LLM training. To better understand post-training methods, let's first start from the pre-training method, which is usually considered unsupervised learning. One usually starts from a very large-scale unlabeled text corpus, which includes Wikipedia, Common Crawl, GitHub, etc. One can usually extract more than 2 trillion tokens from this corpus and train on all of them. Usually, we train on a few paragraphs or sentences at a time. As a minimal example, one might see a sentence like "I like cats." In this case, we are trying to minimize the negative log probability of each token conditioned on all the previous tokens. So we first minimize the negative log probability of "I", then the negative log likelihood of "like" given "I", and then of "cats" given "I like". In this way, we train the model to predict the next token given all the previous tokens it has seen.

After pre-training, this is followed by different post-training methods. One of the simplest and most popular post-training methods is supervised fine-tuning, or SFT. It is considered supervised learning or imitation learning, where we create a dataset of labeled prompt-response pairs: the prompt is usually the instruction to the model, and the response is the ideal response the model should produce. In this case, we really only need from 1,000 to 1 billion tokens, which is much less than the scale of pre-training. The biggest difference in the training loss is that we only train on the tokens of the response, not the tokens of the prompt.
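As a minimal sketch of what we just described, the code below computes the next-token prediction loss with Hugging Face transformers, first over all tokens (pre-training style) and then with the prompt tokens masked out (SFT style). The model name is just a placeholder for any small causal LM, and the example is illustrative rather than part of the course's official code.

```python
# Minimal sketch: next-token prediction loss vs. SFT-style loss
# where only the response tokens contribute to the loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is the capital of France?"
response = " The capital of France is Paris."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Pre-training style: every token is a label (predict the next token).
pretrain_labels = full_ids.clone()

# SFT style: set prompt positions to -100 so the cross-entropy ignores them;
# only the response tokens are learned.
sft_labels = full_ids.clone()
sft_labels[:, : prompt_ids.shape[1]] = -100

with torch.no_grad():
    pretrain_loss = model(input_ids=full_ids, labels=pretrain_labels).loss
    sft_loss = model(input_ids=full_ids, labels=sft_labels).loss

print(f"loss over all tokens:      {pretrain_loss.item():.3f}")
print(f"loss over response tokens: {sft_loss.item():.3f}")
```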
So besides supervised fine-tuning, we also have other, more advanced post-training methods. The second one is Direct Preference Optimization, or DPO. In DPO, we create a dataset in the format of a prompt plus a good and a bad response. For any given prompt, one can generate multiple responses and select one that is considered good and another that is considered bad. We then train the model so that it pushes away from the bad response and learns from the good response. In this case, we also only need from 1,000 to 1 billion tokens. There is a more sophisticated loss function for Direct Preference Optimization, which we will go over in its dedicated lesson later.

The third method in post-training is online reinforcement learning. For online reinforcement learning, we only need to prepare the prompts and a reward function. Starting from a prompt, we ask the language model itself to generate a response, compute a reward for that response using the reward function, and use that signal to update the model. In this case, we have from 1,000 to maybe 10 million or more prompts, and the target is to maximize the reward over the prompt and response, where the response is generated by the language model itself.

Usually, post-training requires getting three elements correct. The first one is a good co-design of data and algorithm. As we discussed, there are different choices of post-training algorithms, including SFT, DPO, or different online reinforcement learning algorithms like REINFORCE/RLOO, GRPO, or PPO. Each of them requires a slightly different data structure to prepare. A good co-design of data and algorithm will be really important for the success of your post-training. The second element is a reliable and efficient library that implements most of the algorithms correctly. This includes Hugging Face TRL, which is one of the first libraries; it is simple to use and implements most of the algorithms mentioned here. Throughout this course, we will be using TRL for most of the coding practice. Besides Hugging Face TRL, I would also recommend trying out more sophisticated and memory-efficient libraries, including OpenRLHF, veRL, and NeMo RL.
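As a small preview of how TRL is used, here is a minimal sketch of a DPO run: a tiny preference dataset of prompt / chosen / rejected triples fed to TRL's DPOTrainer. The model name, data, and hyperparameters are placeholders, and exact argument names can differ slightly across TRL versions.

```python
# Minimal sketch of DPO with TRL; argument names may vary across TRL versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: use the model you are post-training
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO data: for each prompt, one preferred ("chosen") and one dispreferred
# ("rejected") response. A real run would use thousands of such pairs.
train_dataset = Dataset.from_list([
    {
        "prompt": "Write a SQL query that counts users per country.",
        "chosen": "SELECT country, COUNT(*) FROM users GROUP BY country;",
        "rejected": "SELECT * FROM users;",
    },
])

training_args = DPOConfig(output_dir="dpo-demo", per_device_train_batch_size=1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```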
The third element would be an appropriate evaluation suite. One needs to decide, before and after post-training, which evaluations to track so that we can monitor model performance and ensure the model keeps performing well. Here we have an incomplete list of popular language model evaluations that one can use for tracking. The first one, Chatbot Arena, is a human-preference evaluation for chat, where people can vote for which model is better according to their own taste. As a surrogate for human preferences, there are also different LLM-as-a-judge evaluations for chat models; this includes AlpacaEval, MT-Bench, or Arena-Hard. There are also different static benchmarks for instruct LLMs, where LiveCodeBench is one of the popular coding benchmarks, and AIME 2024 and 2025 are recent, popular math evaluation datasets with hardcore math questions. There are also knowledge- and reasoning-related datasets like GPQA or MMLU-Pro, and instruction-following evaluation datasets like IFEval. For function calling and agents, there are also different evaluation datasets, including BFCL, NexusBench, TauBench, or ToolSandbox, where both TauBench and ToolSandbox focus more on multi-turn tool-use situations. By listing all these evaluations here, I'd like to mention that it's easy to improve on any one benchmark, but it can be much harder to improve a benchmark or change certain model behavior without degrading other domains. Throughout this course, we'll be exploring which methods give the best improvement without degrading other domains.

Lastly, I want to mention that you do not necessarily have to post-train your model for every use case. There are different scenarios where other methods might be more appropriate. For example, if you just want the model to follow a few instructions, like "do not discuss something sensitive" or "do not compare our company with some other company", one can easily do prompting to make this happen. Such prompting methods are usually simple yet brittle: in some cases, the model may not always follow all the instructions you provide. A second use case might be querying a real-time database or knowledge base, in which case retrieval-augmented generation or search-based methods could work better, since they can adapt to a rapidly changing knowledge base. There are also scenarios where you would like to create a domain-specific model, like a medical language model or a cybersecurity language model. In those cases, what usually matters is continual pre-training followed by a more standard post-training, so the model first learns the knowledge and then learns how to talk to the user. For continual pre-training, we usually inject very large-scale domain knowledge that was not seen in the pre-training dataset, and ideally that domain knowledge should amount to more than 1 billion tokens. And lastly, if your use case is about following 20 or more instructions tightly, or you really want to improve some targeted capabilities, like creating a strong SQL model, a function-calling model, or a reasoning model, this is where post-training can be most helpful. It can reliably change the model's behavior and improve targeted capabilities. However, if post-training is not done correctly, it might degrade other capabilities that you didn't train on.

So in this lesson, you have learned what post-training is, how to do post-training, and when to do post-training. In the next lesson, we'll have a deep dive into the first method of post-training, which is supervised fine-tuning. All right. See you there.