Time to take a look at evaluation test sets, as well as metrics like pass@k, calibration, and uncertainty. In pre-training, the loss and metrics like perplexity are super important for making sure the model is doing well. Those still matter for the stability of training in post-training, but evals are arguably even more important there. They're massive datasets, and they take massive prep work, but they help us understand how well the model is actually changing its behavior.

The reason loss isn't sufficient in post-training is that, while it's great for measuring next-token prediction like in pre-training, it's fairly meaningless to user experience, and user experience is what really matters in usable intelligence for post-training. Loss can go down, but that's mainly an indicator that training is stable.

So what exactly are evals? Evals encompass a bunch of things. One of them is evaluation sets, or test sets: held-out datasets that the language model hasn't seen before, representing the desired behavior you want the model to have. In fine-tuning, an eval set looks like the fine-tuning data: essentially, there's an input and a desired output, and you can compare the desired output to what the model actually produces. In preference learning it's very similar: there's an input, two model outputs, and a preference label, and you can compare that label to what your reward model outputs and see how close they are. In this case, it's a match.

Beyond eval test sets, you can also evaluate models in different ways using different metrics. For example, you could get relative scores between two different models using Elo ratings, which are common in chess, by looking at wins, losses, and draws across a lot of examples. If you're dealing with rankings and preferences, there are rank correlation metrics that tell you how close one ranking is to another. Basically, agreement on rankings. When correctness is verifiable, you can do something similar to your graders in RL and use eval graders that check for correctness or formatting. Here's another example with JSON output.

Another way to evaluate is to sample, say, three outputs from the model and check whether any of them is correct. This is called pass@3: out of the three samples, at least one was correct. It's more lenient, and it suggests that if you sample more from the model, it can do well on your task. You can generalize this to pass@k, and you'll see it reported a lot in papers.
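Since pass@k shows up everywhere, here's a rough sketch of how it's commonly estimated in practice, using the unbiased estimator popularized by OpenAI's Codex paper: sample n completions per problem, count how many your grader marks correct, and estimate the chance that a random set of k samples contains at least one correct one. The per-problem counts below are made up just to show the shape of the computation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem:
    n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up results: for each problem, how many of 10 samples the grader marked correct.
correct_counts = [3, 0, 10, 1, 7]
k = 3
score = sum(pass_at_k(10, c, k) for c in correct_counts) / len(correct_counts)
print(f"pass@{k} = {score:.3f}")
```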
Beyond just correctness, you want your model to behave well under uncertainty, and calibration is one of those measures. It's about making sure that the probability the model predicts actually lines up with that probability in reality. The famous example of calibration is a model that predicts an 80 percent chance of rain: does it actually rain 80 percent of the time? You want to know whether model confidence is trustworthy, especially if you use those token probabilities in downstream applications or for ensembling models, because 70 percent might mean different things for different models. I'm sure you know people who are overconfident versus underconfident. You want to be able to measure how well calibrated your model is.

There are different ways to evaluate calibration. One, you could just look at whether the token probability matches the empirical token occurrence, so how often the token actually occurs. You could also do something interesting with multiple choice: with, say, four answer tokens A, B, C, D, you can extract the token probability for the chosen multiple-choice token. You can also aggregate across the entire sequence and normalize over the token probabilities there to get a sense of calibration at the sequence level. And finally, metrics like expected calibration error help you quantify how well calibrated your model is. Fun fact about calibration: weather forecasters are actually often not calibrated, and that's so they don't disappoint you when it actually rains.

Now, more on uncertainty beyond just calibration. If you can get the model to say "I don't know," that's probably better than it hallucinating. This is called a refusal. One way to assess this is to set a confidence threshold, like 0.7, and if the model's probability is under that, it just says "I don't know" instead of outputting what it would have said. Say you test it on a hundred math questions. If you don't allow it to say "I don't know," so no refusals, just the regular way of testing, it might answer 65 percent correctly. But if you allow it to refuse, and maybe it refuses 20 of them, it can actually get a higher score on the questions it does answer. This is really interesting: it means your model can do well, and you can use those confidence thresholds to filter when the model should actually output something or not.
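To make expected calibration error a little more concrete, here's a minimal sketch of the usual recipe: bucket predictions by confidence, compare each bucket's average confidence to its empirical accuracy, and average the gaps weighted by bucket size. The ten equal-width bins and the toy data are just illustrative choices, not the only way to do it.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bucket predictions by confidence, then average the gap between
    each bucket's mean confidence and its empirical accuracy, weighted by
    the fraction of predictions that fall in the bucket."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Hypothetical results: the model's probability for its chosen answer,
# and whether that answer was actually right.
confs   = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.85]
correct = [1,    1,    1,    0,    1,    0,    0,    1]
print(f"ECE = {expected_calibration_error(confs, correct):.3f}")
```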
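Here's a similarly small sketch of the confidence-threshold refusal idea from the math-questions example: any answer whose confidence falls below the threshold becomes an "I don't know," and accuracy is then computed only over the questions the model actually answers. The 0.7 threshold and the results are hypothetical.

```python
# Hypothetical (confidence, is_correct) pairs for ten math questions.
results = [(0.95, True), (0.30, False), (0.88, True), (0.60, False),
           (0.92, True), (0.45, False), (0.75, True), (0.20, False),
           (0.81, True), (0.66, True)]

THRESHOLD = 0.7  # below this, the model says "I don't know" instead of answering

# No refusals allowed: accuracy over every question.
plain_accuracy = sum(ok for _, ok in results) / len(results)

# With refusals: only questions at or above the threshold get answered.
answered = [(conf, ok) for conf, ok in results if conf >= THRESHOLD]
refusal_rate = 1 - len(answered) / len(results)
selective_accuracy = sum(ok for _, ok in answered) / len(answered)

print(f"accuracy without refusals: {plain_accuracy:.0%}")
print(f"refusal rate at {THRESHOLD}: {refusal_rate:.0%}")
print(f"accuracy on answered questions: {selective_accuracy:.0%}")
```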
Finally, you can't really skip efficiency metrics. There are a bunch of ways to evaluate how fast you can actually get an output from a model, which is latency. A couple of really important latency measures are time to first token, the time it takes to get that first output token, and time per output token, or TPOT, which is the steady-state decoding speed for producing the rest of the sequence. There are also throughput measures: tokens per second per GPU, or aggregate throughput across concurrent requests for scalability. Cost per token is really important too, and cost per sample as well; input and output tokens are often priced differently across APIs. Token efficiency also matters: how many effective tokens are you getting from the model, and how verbose versus concise is it in getting to the answer you need?

Wow, that's a lot of metrics. It turns out that evaluating on all of them is best practice, so you get a holistic sense of how well the model is actually doing. HELM, from Stanford, is one example: it stands for Holistic Evaluation of Language Models, and it evaluates models across many different dimensions. HELM covers diverse tasks like question answering, summarization, and classification, like spam detection. And these aren't isolated tests: they represent 42 different scenarios spanning domains like medical, legal, news, and finance. What makes HELM nice is that it's a living benchmark; they're continually updating it with new scenarios.

While HELM is great, and there are a bunch of composite benchmarks like it, traditional benchmarks have, until recently, mostly tested models on simple tasks: exam questions, coding puzzles, lab scenarios, shorter prompts. They're relatively toy benchmarks, and that was useful for progress tracking, especially in the early days of language models. But we're starting to measure real-world work now, and GDPval from OpenAI moves evaluation to professional-level tasks. It includes real deliverables, real reference files, and real expert professionals who contributed to the eval.

Specifically, GDPval has a ton of tasks, more than 1,300, drawn from 44 different types of jobs across nine industries. The contributors are professionals with over 14 years of experience in their industry, which is huge. The deliverables are real: documents, slides, blueprints, care plans that actually matter in their jobs. The environment is real too, with the complexity of real files and real work context. And they try to make the grading fair: when they released it, they did blind expert grading comparing the AI and human work, and those blind expert graders are, again, professionals who hadn't previously seen the outputs.

What's really interesting is that OpenAI ran GPT-4o and GPT-5 on GDPval, and performance more than tripled in a year. That's a sign that these models are handling real-world complexity better and better. Expert graders compared the generated deliverables to human deliverables, and the models are approaching the quality of work produced by industry experts. They also found that Claude Opus 4.1 produced outputs rated as good as or better than the human work in nearly half the tasks. As GDPval saturates, we'll need to devise new benchmarks for harder real-world tasks, and I'm excited to see what those are, because I think we'll constantly be producing new benchmarks to meet the next frontier of these models.

Now that you've seen some of the metrics and eval benchmarks for fine-tuning and preference learning, it's time to take a look at RL test environments and how they work.