Data is one of the most important cornerstones of post-training, for both fine-tuning and reinforcement learning. In this section, you'll get an overview of the data for each, and then look at how to prevent leakage across your data splits, so that in the end you can trust how your LLM is actually doing. Not to be dramatic, but post-training lives or dies on your data.

For fine-tuning, each example is a pair of an input and a target output; for reasoning, that target output also contains thinking tokens alongside the answer. As a concrete example, the input might be "Alice has three apples and buys two more. How many now?" and the target output is five, with some reasoning. For a reasoning model, you wrap that reasoning in think tags and put the final answer of five between answer tags.

For RL, the data is tuples of input, model output, and reward from the environment. You prepare the lists of inputs, the model produces the outputs, and graders assign the rewards. A sample RL data point might use that same input, with the model itself producing the thinking and the final answer; that generated response is called a rollout. Then there is a reward, which can come from a checker (for math, a program that verifies the answer) or from another model, called a reward model, that scores how good the rollout was. The final tuple of input, model output, and reward is called a trajectory.

One more data type, which you only need if you're using a reward model, is preference data: a tuple of an input, two model outputs (A and B), and a preference for which one is better. For example, with that same input, output A contains the thinking and the correct answer, while output B just says "hi", so the preference is obviously A. That preference can be labeled by a person or by another LLM, and the final tuple is input, output A, output B, and preference.

That's the full picture of the data. The next step, and one of the most important, is splitting it so that you can actually trust that your LLM is doing well. For fine-tuning, the typical split is a training set (what you actually train the model on, and your biggest set), a validation set (used for hyperparameter tuning, which you'll see later), and an evaluation or test set. That last one is truly held out: you never use it for tuning, and you only test on it at the very end to see how well the model has actually done.

In code, you load your dataset, and in HuggingFace the training and test sets are specified separately; within the training set, a random subset can then be chosen for validation. For the reward model, the split looks very similar, with a training set, a test set, and optionally a validation set inside the training data, except that each record is an input, two model outputs, and a preference rather than an input/target pair.
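To make these record formats and splits concrete, here is a minimal sketch in Python. The dictionary field names and the dataset identifier are illustrative placeholders, not something from this lesson; the split uses the HuggingFace `datasets` library's `train_test_split` to carve a validation subset out of the training set.

```python
from datasets import load_dataset  # HuggingFace datasets

# --- Example records (field names are illustrative) ---

# Fine-tuning pair: input plus target output with thinking and answer tags
sft_example = {
    "input": "Alice has three apples and buys two more. How many now?",
    "target": "<think>3 + 2 = 5</think><answer>5</answer>",
}

# RL trajectory: input, model rollout, and reward from a checker or reward model
rl_trajectory = {
    "input": "Alice has three apples and buys two more. How many now?",
    "model_output": "<think>She starts with 3 and adds 2.</think><answer>5</answer>",
    "reward": 1.0,  # e.g., a math checker verified the answer
}

# Preference record for reward-model training: input, two outputs, and a preference
preference_example = {
    "input": "Alice has three apples and buys two more. How many now?",
    "output_a": "<think>3 + 2 = 5</think><answer>5</answer>",
    "output_b": "hi",
    "preference": "a",  # labeled by a person or another LLM
}

# --- Loading and splitting (dataset id is a hypothetical placeholder) ---
dataset = load_dataset("your-org/your-sft-dataset")
train_and_val = dataset["train"]
test_set = dataset["test"]  # held out: only used for the final evaluation

# Carve a random validation subset out of the training set for hyperparameter tuning
split = train_and_val.train_test_split(test_size=0.05, seed=42)
train_set = split["train"]
validation_set = split["test"]
```

The important part is that `test_set` stays untouched until the very end, after all hyperparameter tuning has been done against `validation_set`.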
For your RL trajectories, you also split into training and test. During training, your reward model and your verifiers (checkers, like a math checker) provide the rewards. For test, though, you need to be really careful: if you're using a reward model, use a new one trained on different preference data. Otherwise the policy can game it, and things it effectively already saw during training show up again at test time, which is a form of leakage, and you'll look at leakage in a moment.

Finally, one last piece of evaluation that I highly recommend: beyond all of the evaluation and test sets you've held out and spent a lot of time curating, build an additional evaluation set of strictly unseen inputs. Mix in long-tail and out-of-distribution inputs that the model will almost never have seen in your original data, just to see whether it handles them correctly, exactly as you expect.

To take stock before looking at leakage across these datasets: you have your fine-tuning data, your reward model data, your RL data (rollouts and trajectories), and then the final evaluation inputs you'll need for all of these steps.

So what does leakage really mean, and how do you prevent it? Leakage doesn't require identical examples; even similar examples, anything close in distribution, can ruin your splits. You control this with heavy deduplication. People at frontier labs spend a lot of time deduping datasets just so they can trust that the model is genuinely generalizing to the test set, so dedup is a really important step. A common tool here is MinHash, which flags examples as the same or similar, and keeps flagging them even if one is perturbed a little. You also want to avoid randomly splitting: a random split is tempting, but it will likely leak, because similar examples can end up on both sides of the split.

One last best practice worth knowing is splitting your data by time. At first that might seem unfair, but you want your model to generalize over time: a model trained on 2019 data should still be able to handle 2020 scenarios like COVID, even though it never saw them. If you care about your model generalizing into the future, a time-based split is the way to measure that, and it also avoids some of the biases that may have snuck in by accident when you collected the dataset in the first place.

To summarize: get really paranoid about data splitting and contamination. It's essential to trusting your model, trusting that the results you're getting are genuinely good, and knowing that this is something you can deploy into the world. This is why data preparation is such a massive effort.
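To make the deduplication idea concrete, here is a small, self-contained sketch of MinHash-style near-duplicate detection between training and test inputs. The shingle size, number of hash functions, and similarity threshold are illustrative choices, not values from this lesson.

```python
import hashlib

NUM_HASHES = 128     # number of hash functions in the MinHash signature (illustrative)
SHINGLE_SIZE = 3     # overlapping word shingles of length 3 (illustrative)
THRESHOLD = 0.5      # estimated Jaccard similarity above which we flag a near-duplicate

def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Break text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items: set[str], num_hashes: int = NUM_HASHES) -> list[int]:
    """For each salted hash function, keep the minimum hash value over the shingle set."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

train_inputs = [
    "Alice has three apples and buys two more. How many now?",
    "A train leaves the station at 9am traveling 60 mph toward the city.",
]
test_inputs = [
    # A lightly perturbed near-duplicate of the first training example
    "Alice has three apples and buys two more. How many does she have now?",
]

train_sigs = [minhash_signature(shingles(t)) for t in train_inputs]
for test_text in test_inputs:
    test_sig = minhash_signature(shingles(test_text))
    for train_text, train_sig in zip(train_inputs, train_sigs):
        if estimated_jaccard(test_sig, train_sig) >= THRESHOLD:
            print(f"Leakage risk: {test_text!r} is too similar to {train_text!r}")
```

In practice you would typically reach for a library such as `datasketch` and pair MinHash with locality-sensitive hashing so you aren't comparing every pair, but the idea is the same: anything in the test set that looks too similar to a training example gets flagged and removed.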
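The time-based split is even simpler to sketch. This assumes each record carries a timestamp from when it was collected; the field name and cutoff date here are hypothetical.

```python
from datetime import datetime

# Each record carries a collection timestamp (field name is illustrative).
examples = [
    {"input": "Question collected in 2019 ...", "target": "...", "collected_at": "2019-06-01"},
    {"input": "Question collected in 2020 ...", "target": "...", "collected_at": "2020-03-15"},
]

# Split by time instead of randomly: train on everything before the cutoff and
# evaluate on everything after it, so the test set measures generalization into
# the "future" relative to the training data.
cutoff = datetime(2020, 1, 1)
train_set = [ex for ex in examples if datetime.fromisoformat(ex["collected_at"]) < cutoff]
test_set = [ex for ex in examples if datetime.fromisoformat(ex["collected_at"]) >= cutoff]
```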
Inside these frontier labs, entire teams work on data alone. Some interesting heuristics have come out of that work: a lab might use only around 1% of its data, because that 1% is the cleanest, most pristine data, and pulling in more of the other 99% would actually degrade model performance. Getting your data really good is that important.

In the next modules, you'll see how those held-out evaluation sets and RL test environments come into play: they can save you from running more than ten times as many experiments while still getting you to a better model, so this is a very important subset of your data to get right. You'll also learn how much data you actually need: how much it takes to reach the next good model, and how much it's worth investing to get to the next task and the next frontier you want the model to learn.

Now that you've learned how to prepare your data, next you'll look at how to take that text data and turn it into tokens, which is something your model can actually process.