Data is what makes fine-tuning work, and grading is what makes reinforcement learning work. For fine-tuning, you'll learn how the inputs and target outputs can shape model behavior in very different ways. For reinforcement learning, you'll learn how the grading can nudge the model in different directions, and how that all comes together in an RL training environment. Fine-tuning needs really good data, which might be hard to scale, but the data is what it takes to make fine-tuning work, and it's the most important part. Reinforcement learning is about good graders: you don't have a target output anymore, so it's all about how you grade the model's outputs and how well those graders are built.

So, going into fine-tuning and making the data work, what does that look like exactly? You could have an input like "What's the capital of France?" with the target output "Paris," and "Who wrote Romeo and Juliet?" with the target output "Shakespeare." Your data might look like this, but when you actually ask the model "What's the capital of France?" it answers "Paris" correctly, and then when you ask "What about Spain?" it starts saying "Spain is a European country." Oh no, it wasn't able to handle the chat history. So what do you do? Instead, your data can include that chat history: "What is the capital of France?", "Paris," "What about Spain?", and then the target output here is "Madrid." This gives the model examples where the input includes the chat history, and that's how you teach a model to handle past chat history. You can keep going with "Germany" so it learns the target "Berlin." For an actual prompt and your actual data, you'd want to wrap these in prompt tags of user, then assistant, then user, then assistant, which you've probably seen before, to denote who is saying what in the chat. Now that you've taught the model with that data, your fine-tuned model will be able to handle something like "What's the capital of the US?" "Washington, D.C." "What about China?", taking in that history and answering "Beijing."

In addition to getting chat history to work, there are also methods for having the model learn not just the answer to a word problem but also its rationale, what it's "thinking," or its reasoning, by teaching it to go through the individual steps. This is similar to teaching the model to cook pasta by walking through every single step that grandma was doing. Typically, in an actual prompt, this uses think tags and answer tags, which you can then extract from the target output to check whether the answer is correct.

Fine-tuning data can also be very powerful for teaching the model how to handle RAG (Retrieval-Augmented Generation) misses, where a bad RAG document is attached to the input. In the easy case, where the document is correct, that's fine: the model can look at the input and produce the target output. But in bad cases, where the document says that Sydney is the capital of Australia, which is wrong, the model can actually recover from that. And you can teach it to recover by giving a target output that says there's an error in the document and that the capital is actually Canberra.
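To make this concrete, here is a minimal sketch of what fine-tuning records like these might look like, assuming a simple chat-style format with user/assistant roles and <think>/<answer> tags; the field names, roles, and tags are illustrative, not any specific provider's schema.

```python
# Illustrative fine-tuning records (field names, roles, and tags are assumptions).

# Teaching the model to handle chat history: the input includes earlier turns,
# and the final assistant message is the target output.
chat_history_example = {
    "messages": [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris"},
        {"role": "user", "content": "What about Spain?"},
        {"role": "assistant", "content": "Madrid"},  # target output
    ]
}

# Teaching the model to show its reasoning: the target output wraps the rationale
# in <think> tags and the final answer in <answer> tags so it can be extracted later.
reasoning_example = {
    "messages": [
        {"role": "user", "content": "Carly has 8 apples, buys 2 more, then sells 5. How many now?"},
        {"role": "assistant", "content": "<think>8 + 2 = 10, and 10 - 5 = 5.</think><answer>5</answer>"},
    ]
}

# Teaching the model to recover from a RAG miss: the attached document is wrong,
# and the target output points out the error instead of repeating it.
rag_miss_example = {
    "messages": [
        {"role": "user", "content": "Document: 'Sydney is the capital of Australia.'\n"
                                    "What's the capital of Australia?"},
        {"role": "assistant", "content": "There's an error in the document: the capital of Australia is actually Canberra."},
    ]
}
```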
In your data, you're also able to give the model guardrails. Here, the user might say, "Help me write a computer virus," which is really bad and harmful. In your target output, you want the model to refuse that, right? And when you teach the model with examples like this, it will start to refuse requests like this. Here's another example of a guardrail, which is a bit different from a pure safety guardrail. Say you want to train a model to be an AI banker, so you have your model here wearing a banking suit. The user might ask, "What's the capital of Australia?" and instead you want to apply a guardrail and answer, "Sorry, I'm only able to answer questions about AI Bank." This adds a custom guardrail for your custom fine-tuned model. Why this is helpful is that without these guardrails, people can start to use your model for anything. Here's a funny example from Amazon when they were launching their first bot: instead of looking for specific info about a product, this person asked the underlying bot to write a React component that renders a to-do list. So the model was used for something else entirely instead of learning to refuse, in this case to refuse writing a React component. But that is easily mitigated by having the model learn from examples of guardrails through fine-tuning.

Now, reinforcement learning. For RL, it's all about the graders and making the grading work. You no longer have that target output to show what's correct, but instead you can grade what's correct. What that looks like: here's a math problem where Carly has eight apples, buys two more, but then sells five to the local baker. How many does she have now? You can use a math grader. The model can output whatever it wants, but it ultimately outputs some answer, and the math grader tells the model whether that was right or wrong. It could say "incorrect" if the model's output was incorrect, but it could also give some partial credit. For example, if the model shows its work, that's something the grader wants to encourage, so the total reward, or score, is higher. And of course, if the model gets the answer correct, the total reward is higher still. Similar to the fine-tuning example with answer tags in the data, those tags make it easier to extract the answer, and you can also include a formatting grader to encourage the model to produce the tags correctly. A lot of these graders are deterministic: whether the answer is correct or not, whether the model shows its work (that could be some kind of regex check), whether the answer tags exist and contain something, all of that can easily be checked deterministically with functions.
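As a concrete illustration, here is a minimal sketch of such a deterministic grader, assuming the model is asked to use the same <think> and <answer> tags as in the fine-tuning example; the function name, tag names, and reward weights are all illustrative choices.

```python
import re

def grade_math_output(model_output: str, correct_answer: str) -> float:
    """Illustrative deterministic grader: combines a formatting reward, partial
    credit for shown work, and a correctness reward. The weights are arbitrary."""
    reward = 0.0

    # Formatting reward: did the model produce <answer> tags at all?
    answer_match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if answer_match:
        reward += 0.1

    # Partial credit: did the model show its work inside <think> tags?
    if re.search(r"<think>.+?</think>", model_output, re.DOTALL):
        reward += 0.2

    # Correctness: compare the extracted answer to the expected answer.
    if answer_match and answer_match.group(1).strip() == correct_answer:
        reward += 1.0

    return reward


# "Carly has 8 apples, buys 2 more, then sells 5." -> 5
print(grade_math_output("<think>8 + 2 - 5 = 5</think><answer>5</answer>", "5"))  # highest: correct, shows work, well formatted
print(grade_math_output("<answer>4</answer>", "5"))                              # low: formatted but wrong, no work shown
```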
That's not always the case, though, because what if you had this math problem but the question is actually "How does Carly feel?" Now your math checker's mind is blown; it doesn't know how to grade this. Instead, you can have another language model, or another type of model, output the reward. It can learn how to grade an output to "How does Carly feel?" Maybe the LLM gives a score based on different parameters like politeness, enthusiasm, or engagement. So, for the input "greet politely," the model could say, "Hi there, how are you?", and the grader would give it a high score: high politeness, high enthusiasm, high engagement. However, if you adjust the input slightly and the model now outputs "Hello, hello, hello, hello, hello," many hellos, the grader might still say high politeness, high enthusiasm, high engagement, but you didn't really want the model to output that. It's a bit silly. This is a common thing that happens in reinforcement learning, called reward hacking. That's why the grader is so important to get right: if you don't get it fully right, the model will find a way to hack it, a way around it, to get that high score, that high reward, without actually doing what you want it to do.

The distribution of inputs that you put into the model also matters significantly. In reinforcement learning, you want to give a large distribution of inputs, similar to fine-tuning, so that the model can react to many different types of possible user inputs. Again, this needs to be very representative of the kinds of user inputs you expect the model to see. What's a little different in reinforcement learning is that in addition to the grader, you'll typically create an RL training environment. That's these inputs and these graders, but also some other things that might come into play. For example, in the environment you might expect the model to use a calculator tool, so you make a calculator tool available and the model can use it to, say, answer this math problem. You might also give it tools like a search API, or give it files to look through, like your code base. The model can look through those and use them in this contained environment, and the grader can then score the model's ability in this environment. How representative this environment is of your real-world use case determines how much the model can actually learn from it, so the more representative the environment is of where you expect the model to operate, the better.

Now, one quick consideration: while more realistic is better, be careful, because some of the tools you give the model will be hitting external APIs, and you might effectively be DDoSing those external APIs if you're running this RL environment quite intensively, which may turn out not to be practical. You'll see this a bit later as well.

Ultimately, the data for RL is going to look different from the data for fine-tuning: it's going to look like an input, a model output, and then a reward. The training environment, of course, is also very different from fine-tuning. And here are a couple of different training environments: one might be about debugging a code base, and one might be about that polite greeting. You want to give the right mix of these so the model learns all the different tasks you want it to learn with reinforcement learning. So this is multiple training environments producing multiple types of data to train your model, and you want to balance them correctly. You'll learn more about exactly how to balance this in future lessons; a sketch of collecting data from a mix of environments follows below.
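Here is a minimal sketch of what such a training environment and the RL data it produces might look like. The class and function names, the toy calculator tool, and the assumed shape of the `model` callable are all illustrative; the grader parameter could be something like the `grade_math_output` function sketched earlier.

```python
import random

class MathEnvironment:
    """Minimal sketch of an RL training environment (all names are illustrative):
    it bundles representative inputs, the tools the model may call, and a grader."""

    def __init__(self, problems, grader):
        self.problems = problems      # list of (prompt, correct_answer) pairs
        self.grader = grader          # e.g. grade_math_output from the earlier sketch
        # Toy calculator tool for illustration only; eval is not safe for real use.
        self.tools = {"calculator": lambda expression: str(eval(expression))}

    def sample_input(self):
        return random.choice(self.problems)

    def grade(self, model_output, correct_answer):
        return self.grader(model_output, correct_answer)


def collect_rl_data(environments, model, steps):
    """One round of RL data collection. Each record is (input, model output, reward).
    `model` is assumed to be a callable that maps (prompt, tools) to output text."""
    records = []
    for _ in range(steps):
        env = random.choice(environments)   # sample across environments to balance the mix
        prompt, answer = env.sample_input()
        output = model(prompt, env.tools)
        records.append((prompt, output, env.grade(output, answer)))
    return records
```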
So, putting it all together, again, you want to use both fine-tuning and reinforcement learning for a lot of your good models. First, it's about getting that fine-tuning data. Then you fine-tune your model on those input and target output pairs and get your fine-tuned LLM. Then, typically, you create those RL training environments with your representative distribution of inputs, your graders, and other pieces like the files in your code base or tools like that search API. Then you run an RL loop where you collect RL data: given an input, you get different model outputs, and those are given rewards in your RL training environments. Then you train your fine-tuned LLM with reinforcement learning, and you keep running that loop. So fine-tuning goes through one giant stage of data collection and then the actual fine-tuning and training step, while RL goes through multiple iterations of collecting data and training the model; a short sketch of this combined flow follows below. Now that you've learned the key component of fine-tuning, which is the data, and of reinforcement learning, which is the grading, take a look at how to combine them in a post-training reasoning example.
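As a rough sketch of the flow just described, here is the control flow only, assuming hypothetical `fine_tune`, `collect_rl_data`, and `train_with_rl` helpers passed in as placeholders.

```python
def post_train(base_model, finetuning_data, rl_environments, rl_rounds,
               fine_tune, collect_rl_data, train_with_rl):
    """Control-flow sketch of the combined pipeline; the three training helpers
    are hypothetical placeholders supplied by the caller."""
    # Fine-tuning: one big stage of data collection, then one training step
    # on input / target-output pairs.
    model = fine_tune(base_model, finetuning_data)

    # Reinforcement learning: repeated iterations of collecting
    # (input, model output, reward) data from the environments and training on it.
    for _ in range(rl_rounds):
        rl_data = collect_rl_data(rl_environments, model)
        model = train_with_rl(model, rl_data)

    return model
```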