A good production system has many gates and promotion rules between development, staging, and production. Here you're going to home in on what that looks like specifically for RL, because it can get a little tricky. You've seen this RL test environment before: it's frozen. For the production check, before you feel comfortable putting anything into production, you should make sure that re-running the environment gives you identical metrics, so that it's a trustworthy test environment. If you're also using RAG, you will need to freeze that too.

To see what runs inside this environment, you'll eventually divide the RL test environment into what you'll see called slices, but start with one example: debugging your search API. One input in the test environment might be a burst of 500s in the search API logs, and the model's output is a set of steps to debug that issue. The code compiles, the code runs, and it seems to work logically, but it is not pinning the required version, and it is using a legacy lookup that is deprecated in the most recent version.

Laid out quantitatively across the different metrics: the reward model says it passed, because the output is relevant and fluent; the unit tests pass, because the code compiles and runs inside that frozen environment; but the version check fails, and the deprecated-function check fails. The output violates those rules, and the first run is not a success. After some fixes and post-training, the model outputs the correct steps with a version pin, and now it passes everywhere. Putting the quantitative metrics together: before the fix, the model favored fluent but wrong-version outputs; after the fix, it prioritizes explicit version pins.

That was just one example running inside your environment. To generalize so you can monitor this at scale, you look at slices. A slice aggregates results across an area or behavior you really care about. Debugging the search API could be a slice covering anything related to the search API; others might cover updating the user profile, initializing the cache, or generating a report. You want pass rates within each slice so you can see which slices are doing well and which are not, rather than judging from a single example. Version pinning could also be its own slice, and slices can overlap; if that behavior really matters to you, you monitor it before you feel comfortable promoting the model from development to staging. A small sketch of these checks and slice pass rates follows below.
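The sketch below scores one example with several independent checks and then aggregates per-slice pass rates. It is a minimal sketch, not code from the course: the function names, the reward threshold, the version-pin regex, and the `legacy_lookup` string are all illustrative assumptions.

```python
import re
from collections import defaultdict

def check_debug_search_api(output: str, run_unit_tests, reward_score) -> dict:
    """Pass/fail for each check on one example in the 'debug the search API' slice.
    `run_unit_tests` and `reward_score` are hypothetical callables supplied by
    your frozen test environment."""
    return {
        "reward_model":  reward_score(output) >= 0.7,              # relevant and fluent
        "unit_tests":    run_unit_tests(output),                   # code compiles and runs
        "version_pin":   bool(re.search(r"==\d+\.\d+", output)),   # explicit version pin
        "no_deprecated": "legacy_lookup" not in output,            # avoids the deprecated call
    }

def slice_pass_rates(results: list[tuple[str, dict]]) -> dict:
    """`results` holds (slice_name, checks) for every test example; an example
    passes its slice only if every one of its checks passes."""
    passed, total = defaultdict(int), defaultdict(int)
    for slice_name, checks in results:
        total[slice_name] += 1
        passed[slice_name] += all(checks.values())
    return {name: passed[name] / total[name] for name in total}
```

The design choice to notice is that the reward model is only one check among several, so a fluent answer with no version pin still fails its slice.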
To ground this in familiar examples from this course: a headache red-flags slice, where an example input is just "I have a headache" and you check that the model explicitly recommends urgent care and stays concise. A math slice, where you check that the math is correct and the answer tag is present. A division slice, where you check that the division was done correctly within some numerical tolerance. And the debug-the-search-API slice, where you check that the unit tests pass and the pinned API version is cited. These are the slices of behaviors that matter to you, and they are what you monitor to decide which stage you can promote your model to.

Here is the overall flow. First comes the experimentation loop, where you've spent most of this course, training the model to get better and better. Then you reach test evaluation, where you test on your held-out environments. This is where you want pass and fail promotion rules, which you'll see in a moment: across all your slices, what quantitatively constitutes passing and what constitutes failing, which then automatically gates which models can go into staging. Staging is where you compare against your production model; you can try the candidate on a small percentage of live traffic, or at least on canary traffic, basically shadow-monitoring how it does relative to the model you already have in production. And in production you want plenty of observability and, ideally, user feedback feeding a data flywheel back into the experimentation loop. This looks very similar to other kinds of software; you just have to remember that the model is not going to be stable the whole time. Fine-tuning is closer to regular software, where you deploy a frozen artifact; with RL, things are not always frozen, you haven't always caught all the reward hacks, and deployment is a little less stable.

Here is an example of promotion rules. You could have an aggregate quality gate at the very top, and then certain slices where you do not allow any regressions because they really matter to you, say the headache slice and the debug-search-API slice (admittedly a strange model if it covers all of these slices, but the idea stands). If the candidate ever does worse on any of those, it's a no-go. Safety violations have to stay under a strict cap. Format compliance and math correctness have to clear a fixed bar, with a confidence interval on the pass rate. The division slice has to improve meaningfully; maybe that's a capability your users actually want, so if the model isn't better there, there's no point deploying it. Tool calls have to be correct within some range. And finally, the model has to be efficient and meet your cost parameters. That is what promotion rules from development to staging could look like. If all of these checks pass, you can promote to staging; a minimal sketch of such a gate follows below.
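This sketch turns the rules above into an automatic development-to-staging gate. The slice names, thresholds, cost budget, and dictionary shapes are illustrative assumptions, not code from the course.

```python
SLICES = ("headache_red_flags", "math_format", "division",
          "debug_search_api", "safety", "tool_calls")
NO_REGRESSION = ("headache_red_flags", "debug_search_api")   # any drop is a no-go
MUST_IMPROVE = {"division": 0.02}                            # must gain at least 2 points

def can_promote_to_staging(candidate: dict, baseline: dict,
                           cost_per_1k_requests: float) -> tuple[bool, list[str]]:
    """`candidate` and `baseline` map slice name -> pass rate in [0, 1];
    `baseline` holds the current production model's numbers."""
    failures = []

    # Aggregate quality gate at the very top.
    if sum(candidate[s] for s in SLICES) / len(SLICES) < 0.80:
        failures.append("aggregate quality below 0.80")

    # No regressions allowed on the slices you care most about.
    for s in NO_REGRESSION:
        if candidate[s] < baseline[s]:
            failures.append(f"regression on {s}")

    # Safety violations under a strict cap (at most 1% failures here).
    if candidate["safety"] < 0.99:
        failures.append("safety below the strict cap")

    # Format compliance and math correctness above a fixed bar.
    if candidate["math_format"] < 0.90:
        failures.append("math/format below 0.90")

    # Some slices have to improve meaningfully, not just hold steady.
    for s, margin in MUST_IMPROVE.items():
        if candidate[s] < baseline[s] + margin:
            failures.append(f"{s} did not improve meaningfully")

    # Tool calls correct within an acceptable range.
    if candidate["tool_calls"] < 0.95:
        failures.append("tool calls below 0.95")

    # Efficiency / cost budget.
    if cost_per_1k_requests > 1.50:
        failures.append("over the cost budget")

    return (len(failures) == 0, failures)
```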
All right, suppose all of those checks pass and you promote to staging; that's exciting. What could happen in staging? Staging is where you look at a slice of possible production traffic. The model does not necessarily interact with users; it could be a shadow deployment. For example, you could take at least 5,000 requests over the next 24 hours and compare the candidate's results with your actual production model, or, if you don't have a production model yet and people are currently answering some of those requests, compare against their responses. Either way, you are comparing it against something and monitoring how it does. If it looks good, you then apply additional promotion rules between staging and production, on that canary traffic, before you actually release it into production; a small sketch of this shadow comparison closes out this section. Once your model is humming along in production, the next step is to look at how the data you gather there can feed back into the experimentation loop.
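As referenced above, here is a minimal sketch of the staging shadow comparison. The request volume, helper names, and grading function are illustrative assumptions; the key point is that the candidate sees the same live traffic as the production model, but its answers are only logged and graded, never served.

```python
import statistics

def shadow_compare(requests, production_model, candidate_model, grade) -> dict:
    """`grade(request, answer)` scores an answer in [0, 1]; both models see the
    same traffic, so the comparison is apples to apples."""
    prod_scores, cand_scores = [], []
    for request in requests:                  # e.g. at least 5,000 requests over 24 hours
        served = production_model(request)    # what the user actually receives
        shadow = candidate_model(request)     # logged and graded, never served
        prod_scores.append(grade(request, served))
        cand_scores.append(grade(request, shadow))
    return {
        "production_mean": statistics.mean(prod_scores),
        "candidate_mean":  statistics.mean(cand_scores),
        "candidate_win_rate": statistics.mean(
            c > p for c, p in zip(cand_scores, prod_scores)
        ),
    }
```

Summaries like these would then feed into your staging-to-production promotion rules before the candidate ever answers a real user.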