Quick Guide & Tips

💻   Accessing Utils File and Helper Functions

In each notebook, on the top menu:

1:   Click on "File"

2:   Then, click on "Open"

You will be able to see all the notebook files for the lesson, including any helper functions used in the notebook, in the left sidebar.


💻   Downloading Notebooks

In each notebook, on the top menu:

1:   Click on "File"

2:   Then, click on "Download as"

3:   Then, click on "Notebook (.ipynb)"


💻   Uploading Your Files

After following the steps shown in the previous section ("File" => "Open"), click on the "Upload" button to upload your files.


📗   See Your Progress

Once you enroll in this course—or any other short course on the DeepLearning.AI platform—and open it, you can click on 'My Learning' at the top right corner of the desktop view. There, you will be able to see all the short courses you have enrolled in and your progress in each one.

Additionally, your progress in each short course is displayed at the bottom-left corner of the learning page for each course (desktop view).


📱   Features to Use

🎞   Adjust Video Speed: Click on the gear icon (⚙) on the video and then from the Speed option, choose your desired video speed.

🗣   Captions (English and Spanish): Click on the gear icon (⚙) on the video and then from the Captions option, choose to see the captions either in English or Spanish.

🔅   Video Quality: If you do not have access to high-speed internet, click on the gear icon (⚙) on the video and then from Quality, choose the quality that works the best for your Internet speed.

🖥   Picture in Picture (PiP): This feature allows you to continue watching the video when you switch to another browser tab or window. Click on the small rectangle shape on the video to go to PiP mode.

√   Hide and Unhide Lesson Navigation Menu: If you do not have a large screen, you may click on the small hamburger icon beside the title of the course to hide the left-side navigation menu. You can then unhide it by clicking on the same icon again.


🧑   Efficient Learning Tips

The following tips can help you have an efficient learning experience with this short course and other courses.

🧑   Create a Dedicated Study Space: Establish a quiet, organized workspace free from distractions. A dedicated learning environment can significantly improve concentration and overall learning efficiency.

📅   Develop a Consistent Learning Schedule: Consistency is key to learning. Set aside specific times in your day for study and make it a routine. Consistent study times help build a habit and improve information retention.

Tip: Set a recurring event and reminder in your calendar, with clear action items, to get regular notifications about your study plans and goals.

☕   Take Regular Breaks: Include short breaks in your study sessions. The Pomodoro Technique, which involves studying for 25 minutes followed by a 5-minute break, can be particularly effective.

💬   Engage with the Community: Participate in forums, discussions, and group activities. Engaging with peers can provide additional insights, create a sense of community, and make learning more enjoyable.

✍   Practice Active Learning: Don't just read the material, run the notebooks, or watch the videos passively. Engage actively by taking notes, summarizing what you learn, teaching the concepts to someone else, or applying the knowledge in your practical projects.


📚   Enroll in Other Short Courses

Keep learning by enrolling in other short courses. We add new short courses regularly. Visit the DeepLearning.AI Short Courses page to see our latest courses and begin learning new topics. 👇

👉👉 🔗 DeepLearning.AI – All Short Courses


🙂   Let Us Know What You Think

Your feedback helps us know what you liked and didn't like about the course. We read all of your feedback and use it to improve this course and future courses. Please submit your feedback by clicking on the "Course Feedback" option at the bottom of the lessons list menu (desktop view).

Also, you are more than welcome to join our community 👉👉 🔗 DeepLearning.AI Forum



Conversation between Sharon Zhou and Andrew Ng (Transcript)

Welcome to this course on supervised fine-tuning and reinforcement learning for training large language models. Both of these are techniques under the broader umbrella of post-training, an important family of algorithms that are really useful both for training frontier models and for developers who want to get their applications to work better. Our third instructor for this course is Sharon Zhou, who's an old friend and also my former PhD student from Stanford. She is VP of AI at AMD, and she was formerly co-founder and CEO of the startup Lamini. Great to have you here, Sharon. So excited to be back. Thank you, Andrew.

Sharon has worked for many years on generative AI, including specifically fine-tuning, reinforcement learning, and post-training, and I think this has become one of the techniques that is increasingly important for developers to know about to get your own LLM-based applications to work really well. When LLMs came about, a lot of people had to learn to prompt engineer effectively, and by now more and more people know how to do that. But to go beyond just prompt engineering, I think there are a lot of businesses and a lot of applications that would be well served today by knowing how to fine-tune a model and use these more advanced post-training techniques to get an application to work.

That's right. And I think these post-training techniques are really exciting because they are a way to steer the models and align them to different preferences. Those can be the human-based preferences that the frontier labs optimize for, but they can also be business preferences that you might have for your models.

One of the things I've seen a lot of teams do is start with prompt engineering, because that is often the first thing to try. But sometimes you prompt engineer and prompt engineer, and performance reaches a plateau. Especially for agentic workloads, I see a lot of applications where, after two weeks or a month of prompt engineering, it's just not yet accurate enough: you get 92%, then 95%, then 95.5%, 95.7% accuracy. To reach that next threshold of performance, you just have to fine-tune the model.

Yeah. And I think we've seen this also with reasoning. Reasoning has really taken off in the frontier models; it's basically a capability where the models can think more step by step and arrive at a more accurate answer as a result of thinking more. Reasoning actually arose in pre-training; it was inherent in these models, but it wasn't very fleshed out, and it wasn't a deep type of thinking. These post-training techniques enabled the models to do much deeper thinking, to arrive at answers much more accurately, and to solve much harder problems, both in math and in coding. But I think this can also apply to other really interesting domains where we can verify whether the model's output is correct, for example, in materials science, whether we're actually producing a valid molecular structure, or, something I've been exploring, deeper code generation and whether these models can generate code that's really high performance across different devices.
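To make the "verifiable output" idea concrete, here is a minimal, illustrative sketch of checking a generated function against unit tests. The candidate code and the tests are made-up placeholders, not part of the course materials.

```python
# Minimal sketch of a verifiable-output check for generated code (illustrative).
# The candidate source stands in for a model completion; the point is that
# correctness can be checked automatically, which is what makes these domains
# a good fit for post-training with verifiable rewards.

candidate = """
def add(a, b):
    return a + b
"""

tests = [((2, 3), 5), ((-1, 1), 0), ((10, 7), 17)]

def verify(candidate_src: str, tests) -> bool:
    namespace = {}
    try:
        exec(candidate_src, namespace)               # define the candidate function
        return all(namespace["add"](*args) == expected for args, expected in tests)
    except Exception:
        return False                                  # crashes count as incorrect

print(verify(candidate, tests))                       # True -> can be used as a reward of 1
```

A pass/fail check like this is exactly the kind of automatic signal that post-training methods can turn into a reward.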
So one of the exciting breakthroughs was when DeepSeek came up with the GRPO algorithm, which allowed a more efficient way of doing reinforcement learning: the algorithm can try multiple rollouts, multiple attempts, and then, within that group of attempts, in the case of coding, figure out which ones actually worked and use that to automatically score them, giving a reward signal so the model can be fine-tuned to generate more of the correct code.

Fine-tuning has been around for a long time and has matured. What's been very exciting more recently is the development of a lot of very efficient fine-tuning techniques, which make it very easy for developers across many domains to create LoRA adapters, these smaller sets of weights that they change in the model to adapt it to different tasks. They can do that with far less compute and far less data, and the models can then switch between these tasks very efficiently without needing the huge amounts of compute that the frontier labs have.

One of the things that may surprise you if you haven't fine-tuned a lot of models yet is that a lot of the challenges of fine-tuning are the same kind of data engineering, data-centric AI practices that you may have seen if you've used supervised learning. A lot of the time it is: get the dataset, train the model, see where it doesn't work (we call that error analysis), and then go and fix the data. Knowing how to drive a disciplined loop where you train the model, evaluate it, see where you can fix the data, and do that efficiently over and over, which Sharon talks about in the course, is how you actually get these models to work.

I think this is one of the most important topics in AI and in improving AI, not only for fine-tuning but even for the prompt engineering that you've explored. In fine-tuning, error analysis and evaluation can be seen not just as a measure of how well the model is doing today, but more like a North Star: where should I actually be focusing my training efforts? Most of the effort should be on evaluation and on understanding how good the model is and where I can take it to the next set of capabilities.

One thing about this error analysis process is that it sometimes doesn't feel like the most exciting thing to be doing: you look at the data, you're guided by the data, you do it, and then your system works better. Maybe that's less exciting in some ways than trying things at random, but if you look at what a lot of frontier labs are doing to build cutting-edge models, as well as what a lot of businesses that are not frontier labs are doing to build practical applications, this is what works. It's very practical. You do it, and it just kind of works.

I think error analysis underlies the skills of the AI researchers who are pushing the boundaries of these models, but everyone can develop these skills. It really is about finding patterns in the problems of these models, finding these failure patterns, and then taking a targeted approach to improving the models through the data and also through the algorithms themselves.
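As a rough illustration of the evaluate-and-fix-the-data loop described above, here is a minimal sketch. The evaluation results and the failure tags below are hypothetical placeholders, not course data.

```python
# Illustrative sketch of the error-analysis step: evaluate, then bucket the
# failures by pattern so you know which data to go fix first.
from collections import Counter

eval_results = [
    {"input": "Summarize ticket #1", "correct": True,  "failure": None},
    {"input": "Summarize ticket #2", "correct": False, "failure": "too_verbose"},
    {"input": "Extract the date",    "correct": False, "failure": "wrong_format"},
    {"input": "Extract the total",   "correct": False, "failure": "wrong_format"},
    {"input": "Classify intent",     "correct": True,  "failure": None},
]

failures = Counter(r["failure"] for r in eval_results if not r["correct"])
accuracy = sum(r["correct"] for r in eval_results) / len(eval_results)

print(f"accuracy: {accuracy:.0%}")               # 40%
print("failure patterns:", failures.most_common())
# -> fix or add training data for the biggest bucket ("wrong_format"),
#    fine-tune again, re-evaluate, and repeat the loop.
```

The human judgment is in assigning those failure tags; the counting just tells you where your next round of data work will pay off most.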
People think a lot about what AI will be able to automate in the future. One of the reasons I think error analysis is one of the hardest things to automate, so learning this skill should keep your job safe for a long time, is that error analysis is a human using their insight to figure out what they can do that AI cannot yet do. Almost by definition, the AI can't do that yet. I find this to be a really valuable skill, and it's actually what I spend a lot of my time doing when building practical machine learning systems.

I think that's exactly right. It is, by definition, the gap we're trying to fill, and we're always going to have that gap if we want to keep improving the model.

And then, beyond fine-tuning, the other exciting post-training technique that a lot of people talk about is reinforcement learning, which comes in multiple flavors, including PPO and GRPO. This is a more cutting-edge technique that's harder to apply, but Sharon's going to talk about that too.

Yes. It's a bit of a wild west with RL research on LLMs specifically, but it is an exciting one. These are some of the techniques that underlie a lot of the new agentic behavior inside the frontier models, as well as their reasoning behavior. In this course, you'll delve into a bit of the specific mathematics underlying PPO and GRPO, and also the intuitions behind rewards and reward functions: how they differ from fine-tuning, how they are similar, and how they all fit under this umbrella of post-training.

Take a reasoning model. We may give it a complex puzzle, maybe a math puzzle or a coding puzzle, and we want it to take many steps of reasoning in order to arrive at a hopefully correct conclusion. We don't want to specify the one way to reason correctly to reach the outcome, and it turns out that reinforcement learning is a great fit here: it lets you specify a reward function that measures whether the final output is correct, and then lets the algorithm try lots of different reasoning traces, do whatever it wants, and simply be measured on whether it gets the correct final answer. This has proved to be a somewhat finicky, but, when you get it to work, really effective way to train reasoning models, as well as, more generally, other systems like computer use, where we have an LLM try to use a web browser. There are lots of ways to successfully carry out a task in a web browser; you don't necessarily want to specify the one way, but instead let the model try things out in a safe environment and reward it when it does well. There's a lot of exciting, cutting-edge research being done on reinforcement learning to train these kinds of systems right now.

That's right. One of my favorite analogies we show in the course is around cooking. For fine-tuning, you're following the steps your grandma uses to cook her famous recipe, and you need to follow them step by step; you're graded, you're assessed, on every single step and how closely you adhere to it. But in reinforcement learning, you don't have to adhere to her steps. You just have to produce a final outcome that matches her pasta dish, for example, and you can do any wacky thing in between. So the model is allowed to do any wacky thing in between to get there, and as a result, the model can find more efficient paths to creating the same pasta dish. But it can also find weird patterns: it might decide it needs to throw all the pasta in the air and associate that with creating a good pasta dish.
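Following the cooking analogy, here is a toy sketch (with made-up steps and outcomes) contrasting what the two objectives actually grade: supervised fine-tuning scores the model against every demonstrated step, while outcome-based RL only checks the final result.

```python
# Toy illustration of the cooking analogy (made-up steps, not course code).
# Supervised fine-tuning grades every step against the demonstration;
# outcome-based RL only grades whether the final dish matches.

demonstration = ["boil water", "add pasta", "drain", "add sauce"]   # grandma's steps
target_dish = "pasta with sauce"

def sft_score(model_steps):
    # fraction of positions where the model imitated the demonstrated step
    matches = sum(m == d for m, d in zip(model_steps, demonstration))
    return matches / len(demonstration)

def rl_reward(model_outcome):
    # only the final outcome is checked; anything can happen in between
    return 1.0 if model_outcome == target_dish else 0.0

wacky_attempt = ["throw pasta in the air", "boil water", "add pasta", "drain", "add sauce"]
print(sft_score(wacky_attempt))            # 0.0 -- the steps don't line up with the demo
print(rl_reward("pasta with sauce"))       # 1.0 -- the outcome still matches
```

The same freedom that lets the RL-trained model find shortcuts is also what lets it pick up the "weird patterns" mentioned above, which is why the reward design and stability issues discussed next matter so much.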
Reinforcement learning, as a result, can enable some superhuman capabilities, which makes it very attractive, but it's also very unstable today. A lot of the algorithms you'll learn are focused on making it more and more stable, so that we can run more training steps without the model collapsing.

One unique aspect of this course is that you learn a bit both about what the frontier labs are doing and about what an individual, or a team in a business that's not a frontier lab, could do in a very practical way to build applications that work better.

Yeah. And I think the purpose of understanding how the frontier labs have been doing this is, one, to look under the hood and understand what the magic was behind steering something like ChatGPT, but also to see which of those pieces you can use to steer a model towards your business direction and the business needs that actually matter to you. OpenAI and the other frontier labs may not necessarily know what those needs are for you, but you do. And now, with the same tools they use to steer and align their models, you can steer a model towards what you need.

So knowing how to carry out post-training, including fine-tuning, reinforcement learning, and so on, is a very valuable skill today. It's certainly one that many of my teams use to build practical applications. I hope you take this course, learn these skills, and go build some cool things with them. Let's go on to the next video to get started.
Course Detail

Fine-tuning & RL for LLMs: Intro to Post-training

Module 1: Post-Training Overview
  • Conversation between Sharon Zhou and Andrew Ng (Video, 10 mins)
  • Background (Video, 5 mins)
  • Where post-training (fine-tuning and RL) fits into LLM training (Video, 6 mins)
  • Intuitions behind fine-tuning and RL (Video, 4 mins)
  • Key components to making fine-tuning and RL work (Video, 10 mins)
  • Post-training example: Reasoning (Video, 5 mins)
  • Post-training example: Safety and security (RLAIF) (Video, 4 mins)
  • Post-training in the wild (Video, 4 mins)
  • Module 1: Quiz (Graded Quiz, 30 mins)
  • Module 1: Graded Lab (Graded Code Assignment, 1 hour)
  • Join the DeepLearning.AI Forum to ask questions, get support, or share amazing ideas! (Reading, 5 mins)
  • Module 1 Lecture Notes (Reading, 1 min)

Next: Module 2: Core techniques in Fine-Tuning and RL