There are two important considerations for RL data. One is the rollouts, and the other is the preference data for your reward model. Let's take a look at both.

First, a quick recap of the data you need for RL. You need a list of inputs, a nice, varied set, and each input ultimately produces a trajectory: the input, the model's output, and the reward from your graders, including graders that look across an entire RL environment. Optionally, if one of your graders is a reward model, that model is probably trained with preference learning, on tuples of an input, two possible outputs from the model, and a preference. So concretely, a trajectory is an input, a model output, and a reward assigned by a reward model or a verifier; the preference data is an input, two model outputs, A and B, and the preference for which one is better, in this case A.

To recap the pipeline: you curate a set of inputs that are varied and give high coverage of the task you need. You collect rollouts, meaning the model generates outputs for each input, possibly multiple outputs per input. Those rollouts then get graded. If you're using a reward model, you collect preference data and train the reward model on it; you can also use a verifier. The graders apply rewards to your rollouts, which gives you trajectories, and you use those trajectories to train your LLM with RL to maximize that reward. All of this runs in a loop.

One question is how much rollout data you need. A rollout set is a certain number of inputs times however many outputs you generate from the model per input. A good starting point is eight outputs per input, which is the default in HuggingFace's GRPO trainer: for every single input, you sample eight possible model responses. Start there, but note that increasing the number of outputs per input lets you explore a wider diversity of outputs that earn different kinds of rewards. We'll look at that tradeoff, but having more outputs per input, while it can help, ultimately leads to diminishing returns, because you start getting outputs that are either very similar or scored in ways that don't actually make a difference to your RL training.

So what does this look like without a reward model? Take a very basic verifier for writing a haiku: the prompt asks for a haiku, the model outputs one, and the grader just counts the syllables. Your dataset could be 10,000 rollouts, which sounds like a lot and is probably enough to train a model that respects syllable count, but it can be just 1,000 inputs asking for haikus on different topics, each with 10 model outputs. That's a lot less preparation than fine-tuning, which I find really interesting: you only need 1,000 inputs, the verifier is easy to write, and there's nothing to train for the grading step.
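To make that concrete, here is a minimal sketch of what that verifier-only setup could look like with TRL's GRPOTrainer. The syllable counter, the prompt file, and the model choice are placeholders I'm assuming for illustration, and config fields can shift between TRL versions, but `num_generations` is the knob for how many outputs get sampled per input, and eight is the library default mentioned above.

```python
# Sketch: GRPO where the only grader is a rule-based syllable verifier.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

VOWELS = "aeiouy"

def count_syllables(text: str) -> int:
    # Crude heuristic: count runs of consecutive vowels in each word.
    total = 0
    for word in text.lower().split():
        runs = 0
        prev_vowel = False
        for ch in word:
            is_vowel = ch in VOWELS
            if is_vowel and not prev_vowel:
                runs += 1
            prev_vowel = is_vowel
        total += max(runs, 1)
    return total

def haiku_reward(completions, **kwargs):
    # Grader: full reward for a 5-7-5 pattern, partial credit as the counts drift.
    rewards = []
    for text in completions:
        lines = [line for line in text.strip().splitlines() if line.strip()]
        if len(lines) != 3:
            rewards.append(0.0)
            continue
        errors = sum(abs(count_syllables(line) - target)
                     for line, target in zip(lines, (5, 7, 5)))
        rewards.append(max(0.0, 1.0 - 0.1 * errors))
    return rewards

# Placeholder file: ~1,000 varied prompts in a "prompt" column of plain strings.
dataset = load_dataset("json", data_files="haiku_prompts.json", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any small instruct model works for a first run
    reward_funcs=haiku_reward,            # the verifier is the only grader here
    args=GRPOConfig(
        output_dir="haiku-grpo",
        num_generations=8,                # outputs sampled per input; 8 is the default
    ),
    train_dataset=dataset,
)
trainer.train()
```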
Cool. So what does reward-model-guided training with a small rollout set look like? You could still have those 1,000 inputs as diverse instructions, and for each one you generate several model outputs, here four. But now a reward model scores every single one of those rollouts, and you run PPO for one or two epochs over your entire rollout set. The result is that you'll see noticeable style shifts: the model can become more polite, more helpful. This is probably where you want to start, with a small set. If you have a good reward model, you can get away with fewer rollouts; the two are closely related. When the reward model's scores are clear and reliable across rollouts, a small set is enough, whereas if the reward model is itself noisy, you'll need more samples to get the signal across. In other words, you'll need more rollouts if your reward model isn't trained very well.

As you scale up the number of rollouts, whether because your reward model isn't great or because you're doing heavier alignment tuning, the frontier is looking at hundreds of thousands of rollouts, if not a million. For example, 160,000 rollouts could be 20,000 inputs with eight outputs each. Practically, you'll probably see diminishing returns for any extra rollouts beyond this, and you'll also start to hit compute and memory limits.

You've seen a lot of different settings here, so here's a summary of when small is good enough and when you need large. On the order of thousands, maybe up to 20,000 rollouts, you're probing your reward model and shaping LLM behavior; that's good for experiments and ablations. For midsize updates, you're looking at roughly 20,000 to 100,000 rollouts, which is usually enough coverage without seeing collapse across your data. And if you need something more general and much larger, or your reward model is brittle and noisy, you need a lot of redundancy in your data to see the signal through the noise. This is all empirical, so test things out and, as always, start small.

Now, a few considerations about rollouts. Not all of them will be useful. You're generating a ton of rollouts, but many will be redundant or very similar to one another, and those aren't informative and don't really help with training. There's a lot of research on filtering rollouts down to only those that are informative for training. That's similar to the synthetic data pipelines you learned about: generate, then filter. This way you train only on high-signal examples, which reduces cost and can make learning faster and more effective. It's also important to show the model examples it can actually improve on: examples that are too easy or too hard produce rewards that don't really distinguish among the different rollouts. You saw that previously as well; it was essentially the whole point of the advantage calculation. Taking that a step further, you can focus on moderately difficult inputs, where you get a good spread of rewards, by oversampling them, as in the sketch that follows.
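One simple way to express that idea in code is to group graded rollouts by prompt and keep only the prompts whose outputs received a meaningful spread of rewards, since a prompt where every output scores the same contributes no advantage signal. This is a framework-agnostic sketch; the dictionary fields and the threshold are assumptions, not any particular library's API.

```python
# Sketch of reward-spread filtering over a batch of graded rollouts.
# Each rollout is assumed to be a dict with "prompt", "output", and "reward" keys.
from collections import defaultdict
from statistics import pstdev

def filter_informative_rollouts(rollouts, min_reward_std=0.05):
    """Keep only rollouts whose prompt produced a nontrivial spread of rewards."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt"]].append(r)

    kept = []
    for prompt, group in by_prompt.items():
        rewards = [r["reward"] for r in group]
        # Near-zero spread means every output was graded about the same,
        # so the advantage for this prompt carries no learning signal.
        if len(rewards) > 1 and pstdev(rewards) >= min_reward_std:
            kept.extend(group)
    return kept

# Example: only the second prompt survives, because its rewards actually differ.
batch = [
    {"prompt": "p1", "output": "a", "reward": 1.0},
    {"prompt": "p1", "output": "b", "reward": 1.0},
    {"prompt": "p2", "output": "c", "reward": 0.2},
    {"prompt": "p2", "output": "d", "reward": 0.9},
]
print(len(filter_informative_rollouts(batch)))  # -> 2
```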
You can also update your reward model during training, so what counts as moderately difficult can change over time. And again, filter out noise in the data where possible. This filtering can use the reward model's own uncertainty, which is a really valuable signal for deciding which rewards are actually high quality.

At this point, you might be thinking: wow, that is just like fine-tuning. In many ways it is, and a lot of researchers have drawn parallels between this type of RL for LLMs and fine-tuning. Many of the same principles apply: data quality matters a lot, and the data is just formulated a little differently this time, but those underlying principles matter just as much. RL is even closer to fine-tuning if all the examples carry positive reward signals, and there's some discussion that this may actually be the most effective way to teach the model with RL as well. You can also use LoRA for RL to make it more efficient; all you're doing is updating fewer parameters of your final model. And again, data quality and diversity both matter as much as in fine-tuning. You're just getting them a different way.
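As a rough sketch of that LoRA point: RL trainers built on the HuggingFace stack generally accept a PEFT configuration so that only the adapter weights are updated. The reward function, prompt file, rank, and target modules below are illustrative assumptions rather than recommendations, and the exact argument names can vary by TRL version.

```python
# Sketch: the same kind of RL setup, but updating only LoRA adapter weights via PEFT.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def politeness_reward(completions, **kwargs):
    # Placeholder grader; in practice this would be your reward model or verifier.
    return [1.0 if ("please" in text.lower() or "thank" in text.lower()) else 0.0
            for text in completions]

# Placeholder file with a "prompt" column of plain-text instructions.
dataset = load_dataset("json", data_files="prompts.json", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=politeness_reward,
    args=GRPOConfig(output_dir="rl-lora", num_generations=8),
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=16,                                 # adapter rank: a small fraction of the model's parameters
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention projections are a common, cheap choice
        task_type="CAUSAL_LM",
    ),
)
trainer.train()
```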