Welcome to this course on supervised fine-tuning and reinforcement learning for training large language models. Both of these fall under the broader umbrella of post-training, an important family of techniques that are really useful both for training frontier models and for developers who want to get their applications to work better. Our third instructor for this course is Sharon Zhou, who's an old friend and also my former PhD student from Stanford. She is VP of AI at AMD, and she was formerly co-founder and CEO of the startup Lamini. Great to have you here, Sharon. So excited to be back. Thank you, Andrew. So Sharon has worked for many years on generative AI, including specifically fine-tuning, reinforcement learning, and post-training, and I think these have become increasingly important techniques for developers to know about to get your own applications to work really well. When LLMs came about, a lot of people had to learn to prompt engineer effectively, and now more and more people know how to do that. But to go beyond just prompt engineering, I think there are a lot of businesses and a lot of applications that would be well served today by knowing how to fine-tune a model and use these more advanced post-training techniques to get an application to work. That's right. And I think these post-training techniques are really exciting because they are a way to steer the models and to align them to different preferences. Those can be the human-based preferences that the frontier labs are optimizing for, but they can also be business preferences that you might have for your models. One of the things I've seen a lot of teams do is start with prompt engineering, because that is often the first thing to try. But sometimes you prompt engineer and prompt engineer, and performance reaches a plateau. Especially for agentic workloads, I see a lot of applications where, after two weeks or a month of prompt engineering, it's just not yet accurate enough: you get 92%, then 95%, then 95.5%, 95.7% accuracy. And to reach that next threshold of performance, you just have to fine-tune the model. Yeah. And I think we've seen this also with reasoning, right? Reasoning has really taken off in the frontier models: it's basically a capability where these models can think more step by step and actually arrive at a more accurate answer as a result of thinking more. But reasoning originally arose in pre-training, and while it was inherent in these models, it wasn't very fleshed out; it wasn't a deep type of thinking that the models were doing. These post-training techniques enabled the models to do much deeper thinking, to arrive at answers much more accurately, and to solve much harder problems in both math and coding. But I think this can also apply to other really interesting domains where we can verify whether the model's output is correct. For example, in materials science, checking whether we're actually producing a valid molecular structure, or, something I've been exploring, deeper code generation and whether these models can generate code that is really high performance across different devices.
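To make the idea of a verifiable output concrete, here is a minimal Python sketch of a reward check for code generation: run the model's candidate code against a few unit tests and return 1 for pass, 0 for fail. The function name `solution`, the test cases, and the 0/1 scoring are illustrative assumptions, not specifics from the course.

```python
# A minimal sketch of a verifiable reward for code generation, assuming the
# model's output is a self-contained Python function named `solution`.
# The function name, test cases, and scoring scheme are illustrative only.

def code_reward(generated_code: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Return 1.0 if the generated code passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)          # define the candidate function
        solution = namespace["solution"]
        for args, expected in test_cases:
            if solution(*args) != expected:      # wrong answer on any test -> no reward
                return 0.0
        return 1.0                               # verified correct on all tests
    except Exception:                            # syntax errors, crashes, missing function
        return 0.0

# Example: reward a model-written `solution(a, b)` that should add two numbers.
candidate = "def solution(a, b):\n    return a + b"
print(code_reward(candidate, [((1, 2), 3), ((-1, 1), 0)]))  # -> 1.0
```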
So one of the exciting breakthroughs was when DeepSeek came up with the GRPO algorithm, which allowed a more efficient way of doing reinforcement learning: the algorithm tries multiple rollouts, multiple attempts, and then within that group of attempts, in the case of coding, figures out which ones actually worked and uses that to automatically score them, giving a reward signal that lets the model be fine-tuned to generate more of the correct code. Fine-tuning has been around for a long time and has matured. I think what's been very exciting more recently is the development of a lot of very efficient fine-tuning techniques, which make it very easy for developers across many domains to create these LoRA adapters, a smaller set of weights they have to change in the model to adapt it to different tasks. They can do that with far less compute and far less data, and the models are able to switch between these tasks very efficiently without needing the huge amounts of compute that the frontier labs have. So one of the things that may surprise you if you haven't fine-tuned a lot of models yet is that a lot of the challenges of fine-tuning are the same kinds of data engineering, data-centric AI practices that you may have seen if you've worked on supervised learning. A lot of the time it is: get the dataset, train the model, see where it doesn't work (we call that error analysis), and then go and fix the data. Knowing how to drive a disciplined loop where you train the model, evaluate it, see where you can fix the data, and do that efficiently over and over, which Sharon talks about in the course, that's how you actually get these models to work. I think this is one of the most important topics in AI and in improving AI, not only for fine-tuning, but even for the prompt engineering that you've explored. In fine-tuning, error analysis and evaluation can be seen not just as a measure of how well the model is doing today, but more as a North Star: where should I actually be focusing my training efforts? Most of the effort should actually be on evaluation and on understanding how good this model is and where I can take it to the next set of capabilities. One of the things about this error analysis process is that it sometimes doesn't feel like the most exciting thing to be doing, because you look at the data, you're guided by the data, you do the work, and then your system works better. Maybe that's less exciting in some ways than trying things at random, but if you look at what a lot of frontier labs are doing to build cutting-edge models, as well as what a lot of businesses that are not frontier labs are doing to build practical applications, this is what works. It's very practical. You do it, and it just kind of works. I think error analysis underlies the skills of the AI research scientists and researchers who are pushing the boundaries of these models, but I think everyone can develop these skills. It really is about finding patterns in the problems of these models, finding these failure patterns, and then taking a targeted approach to improving the models through the data and also through the algorithms themselves. People think a lot about what AI can automate in the future.
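For the efficient fine-tuning mentioned above, here is a minimal sketch of attaching a LoRA adapter to a causal language model using the Hugging Face transformers and peft libraries. The base model name, rank, and target module names are placeholder choices for illustration, not recommendations from the course.

```python
# A minimal LoRA setup sketch: the base weights stay frozen and only the small
# adapter matrices are trained. Model name and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # any small causal LM

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor for the adapter
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)   # wraps the base model with adapters
model.print_trainable_parameters()          # typically well under 1% of total parameters
```

Because only the adapter weights change, you can train several adapters for different tasks against the same frozen base model and swap between them cheaply.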
One of the reasons I think error analysis is one of the hardest things to automate, so learn this skill and your job will be safe for a long time, is that error analysis is a human using their insight to figure out what you can do that AI cannot yet do. Almost by definition, the AI can't do that yet. I find this to be a really valuable skill, and it's actually what I spend a lot of my time doing when building practical machine learning systems. I think that's exactly right. It's by definition the gap that we're trying to fill, and we're always going to have that gap if we want to continually improve the model. And then beyond fine-tuning, the other exciting post-training technique that a lot of people talk about is reinforcement learning, with multiple flavors including PPO and GRPO. This is a more cutting-edge technique that's harder to apply, but Sharon's going to talk about that too. Yes. It's a bit of a wild west with RL research on LLMs specifically, but it is an exciting one. These are some of the techniques that underlie a lot of the new agentic behavior inside the frontier models, as well as the reasoning behavior inside these models. In this course, you'll delve into a bit of the specific mathematics underlying PPO and GRPO, and also the intuitions behind rewards and reward functions: how they are different from fine-tuning, how they're kind of similar, and how they all fit under this umbrella of post-training. Take a reasoning model. We may give it a complex puzzle, maybe a math puzzle or a coding puzzle, and we want it to take many steps of reasoning in order to arrive at a hopefully correct conclusion. We don't want to specify the one way to reason correctly to get that outcome, and it turns out that reinforcement learning is a great fit: it lets you specify a reward function that measures whether or not the final output is correct, and then lets the algorithm try lots of different reasoning traces, do whatever it wants, and just be measured on whether it gets the correct final output. This has proved to be a somewhat finicky, but, when you get it to work, really effective way to train reasoning models, as well as, more generally, other systems like computer use, where we have an LLM try to use a web browser. There are lots of ways to successfully carry out a task in a web browser. You don't necessarily want to specify the one way; instead, let the model try out some stuff in a safe environment and then reward it when it does well. And so there's a lot of exciting, cutting-edge research being done on reinforcement learning to train these kinds of systems right now. That's right. One of my favorite analogies we show in the course is around cooking. With fine-tuning, you're following the steps your grandma uses to cook her famous recipe, and you need to follow her steps one by one; you're graded, you're assessed, on every single step and how closely you adhere to it. But in reinforcement learning, you don't have to adhere to her steps. You just have to produce a final outcome that matches her pasta dish, for example, and you can do any wacky thing in between. The model is allowed to do any wacky thing in between to get there. As a result, the model can find more efficient paths to creating the same pasta dish. But it can also find weird patterns: it might think that it needs to throw all the pasta in the air and associate that with creating a good pasta dish.
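To connect the reward-function idea to the GRPO discussion earlier, here is a minimal numpy sketch of group-relative scoring: sample a group of rollouts for the same prompt, score each with a reward function (such as the verifiable check above), and use how much each rollout beats the group average as its advantage. This shows only the intuition, not the full algorithm, which also involves clipped policy-gradient updates and a KL penalty; the reward values below are made up.

```python
# A sketch of the group-relative advantage idea behind GRPO: generate a group
# of rollouts for the same prompt, score each one, and normalize each reward
# against the group's mean and standard deviation.
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Compare each rollout's reward to its own group, not to a learned value model."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 attempts at the same coding problem; only two passed the tests.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # positive for passing rollouts, negative for failing ones
# Rollouts with positive advantage are reinforced (made more likely);
# those with negative advantage are pushed down.
```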
Reinforcement learning, as a result, can enable some superhuman capabilities, which makes it very attractive, but it's also very unstable today. A lot of the algorithms you'll learn are focused on how to make it more and more stable, so that we can run more training steps and the model doesn't collapse. One unique aspect of this course is that you'll learn a bit both about what the frontier labs are doing and about what an individual or a team at a business that's not a frontier lab could do, in a very practical way, to build applications that work better. Yeah. And I think the purpose of understanding how the frontier labs have been doing this is, one, to look under the hood and understand what the magic was to steer something like ChatGPT, but also to see which pieces of that you can use to steer a model toward your business direction and the business needs that actually matter to you. Because OpenAI and the other frontier labs may not necessarily know what those needs are for you, but you do. And now, with the same tools they use to steer and align their models, you can steer a model toward what you need. So knowing how to carry out post-training, including fine-tuning, reinforcement learning, and so on, is a very valuable skill today. It's certainly one that many of my teams use to build practical applications. I hope you take this course, learn these skills, and go build some cool things with them. Let's go on to the next video to get started.