In this lesson, you'll write a reward function for a more subjective task: creating a summary of a call transcript. You'll see how you can use an LLM as a proxy for human judgment, and create reward functions that produce learning signals in situations where the outcomes are not easily verifiable in code. Let's take a look.

Let's start by importing our standard dependencies. For this lesson, we're going to use a different use case than Wordle: summarizing earnings call transcripts. So let's load this dataset from HuggingFace and take a look at one of the example transcripts. You can see that these transcripts tend to be quite long, and in this case we've even truncated them to a limited number of characters. Let's assume that for the purposes of this task, our goal is to create a summary that would be useful for someone like a financial analyst who just wants the high-level picture: based on the earnings call, what were the key takeaways about the health of the company?

Let's construct the prompt that we want to use to generate these summaries. In this case the prompt is pretty simple: generate a concise summary of the information in the following earnings call transcript, respond with only the summary, and do not include any extraneous text. We give it the transcript as a variable. Let's define a function that takes a transcript as input, along with the number of samples we want to generate, and generates the summaries. To do this we take our summarize prompt and insert the transcript, convert that into the chat API format, and, using an OpenAI-compatible SDK, generate a completion given these messages. We set the temperature to 0.9 to ensure some randomness. Now let's generate a summary for the transcript that we pulled from the dataset above.

As you can see, the model does generate a summary, and it is a lot shorter than the original transcript. But it still has a lot of unnecessary language in it, like "Here is a concise summary of the earnings call transcript." In general, there are some things here that may not be necessary for our financial analyst. So the next step is to think about how we can construct a reward function to help steer the generated summaries more in the direction of what our analyst would be looking for in their work.

One way to create such a reward function is to use an LLM as a proxy for our analyst's judgment, asking it to rate the summary on a scale from 1 to 10, and then using that final score as the reward. So here the prompt says: rate this on a scale from 1 to 10, where 1 is very poor and 10 is very good, and output the final score between score tags. It takes the transcript and the summary as input. We then put this all behind a reward function that takes as input the transcript, the summary, and a judge model (in this case GPT-4o-mini, though this could be any model) and returns a float value at the end. I know this looks pretty long, but it's actually quite straightforward. We take the prompt we defined above, insert the transcript and the summary, turn that into messages in the chat format as we did before, and then have our judge model generate a response. Just one response, with the temperature set to zero.
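Here's a minimal sketch of what such a judge reward function might look like, assuming an OpenAI-compatible client; the prompt wording, the <score> tag format, and the function names are illustrative rather than the notebook's exact code:

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

# Illustrative judge prompt; the notebook's exact wording may differ.
JUDGE_PROMPT = """Rate the following summary of an earnings call transcript on a scale
from 1 to 10, where 1 is very poor and 10 is very good. Explain your reasoning, then
output the final score between <score></score> tags.

Transcript:
{transcript}

Summary:
{summary}"""


def judge_reward(transcript: str, summary: str, judge_model: str = "gpt-4o-mini") -> float:
    """Score a summary with an LLM judge, normalized to the range [0, 1]."""
    messages = [
        {
            "role": "user",
            "content": JUDGE_PROMPT.format(transcript=transcript, summary=summary),
        }
    ]
    try:
        response = client.chat.completions.create(
            model=judge_model,
            messages=messages,
            temperature=0.0,  # the judge's single best response, not a random sample
        )
        text = response.choices[0].message.content
        match = re.search(r"<score>\s*(\d+)\s*</score>", text)
        return int(match.group(1)) / 10.0  # normalize the 1-10 rating to 0-1
    except Exception:
        return 0.0  # if the call or parsing fails, fall back to zero reward
```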
Temperature zero means the judge will give us what it believes to be its best response rather than something a little more random. We then extract the final score using a regular expression, convert it to an integer, and divide by ten, so we get a nice normalized value between 0 and 1. And if anything goes wrong along the way, we just return a score of zero.

Let's apply our judge reward function to our summary and our transcript. You can see that the judge model provides some reasoning, which we can use to audit its judgment and get a sense for whether or not it is reasonable, and then provides the final score at the end, which is 0.9.

Now let's try scaling this up to eight different samples instead of just one, to get a sense for the diversity of reward scores our judge model produces. We generate eight different summaries from our original transcript, and then use the judge model to score each of these summaries with the judge reward function we wrote above. This may take a second to run in your notebook, as the judge model needs to generate a lot of tokens as part of its reasoning process, but we've sped it up here in the video. As you can see, the scores are generally quite high: 0.8, 0.7, and so on. Importantly, the judge never really finds anything particularly wrong with any of the summaries, but it also never goes so far as to say that any of them are perfect. This is a general problem with using an LLM as a judge in this very straightforward way: it tends to say that things are generally good, because it doesn't want to be called out for being explicitly wrong. And that is ultimately a problem for us, because we want the reward to be very opinionated about whether a particular response is good or bad, so that we can more clearly direct the learning process toward what we want the model to do.

So how do we address this problem? One way is to ground the judgment in something a little more objective. Instead of just asking the model whether a summary is good or bad, we can generate a multiple-choice quiz based on the information in the transcript that we think is most relevant to the financial analyst who will be reading these summaries. So we have questions like "What was the Q1 earnings per share?" with choices A, B, C, or D, along with the answer key at the end. The idea is that because all the information we care about is in the original transcript, constructing this quiz should be a relatively straightforward and objective task for the LLM. Then during the learning process, we can refer back to the quiz as a way of scoring a summary, checking whether the information captured in the quiz was retained by the summary. This is a technique that one of our customers actually came up with for their summarization problem.

Now, we could generate this quiz by having the model produce free-form text and then coming up with a way to parse it. But one nice property of modern LLMs is that they very commonly support something called structured generation, which can make use of a pydantic schema that defines the output structure we're looking for. So in this case, we don't just want to generate text; we want to generate a quiz that consists of these questions.
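Here is a rough sketch of that schema and the structured-generation call, assuming the OpenAI Python SDK's parse API; the class, method, and prompt names are illustrative assumptions rather than the notebook's exact definitions:

```python
import random
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured


class Question(BaseModel):
    question: str
    options: list[str]
    answer: int  # index of the correct option

    def shuffle_options(self) -> None:
        """Shuffle the options in place and update the answer index to match."""
        correct = self.options[self.answer]
        random.shuffle(self.options)
        self.answer = self.options.index(correct)

    def to_string(self) -> str:
        letters = "ABCD"  # assumes four options per question
        lines = [self.question]
        lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)


class Quiz(BaseModel):
    questions: list[Question]

    def shuffle_options(self) -> None:
        for q in self.questions:
            q.shuffle_options()

    def to_string(self) -> str:
        return "\n\n".join(
            f"Question {i + 1}:\n{q.to_string()}" for i, q in enumerate(self.questions)
        )


# Illustrative quiz-generation prompt; the notebook's exact wording may differ.
QUIZ_PROMPT = """Generate a 10-question multiple-choice quiz covering the facts in the
following earnings call transcript that a financial analyst would care about.

Transcript:
{transcript}"""


def create_quiz(transcript: str, model: str = "gpt-4o-mini") -> Quiz:
    """Use structured generation to build a Quiz object directly from the transcript."""
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[{"role": "user", "content": QUIZ_PROMPT.format(transcript=transcript)}],
        response_format=Quiz,
        temperature=0.7,  # some variety, in case we want multiple quiz variants
    )
    quiz = completion.choices[0].message.parsed
    quiz.shuffle_options()  # counteract the model's bias toward a particular option slot
    return quiz
```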
Every question has the question text, the question options, and then the answer, which is an index indicating which of the options is correct. We also define a couple of helper methods, like one that shuffles the different options (we'll come back to why that's important) and one that renders a question as a string. Now that we've defined the question class, we can define the quiz class, which wraps the questions: it's just a list of questions. Again, we have a helper that shuffles the options of every question, and a helper that renders the quiz itself as a string so that it can be inserted directly into a prompt.

Putting this all together, we define a helper function called create_quiz. It takes a string as input, which is the transcript. We use our quiz prompt to tell the model to generate a quiz from this transcript, and then we use the completions parse API, passing the quiz class as the response format and using a temperature of 0.7, since we might want to play around with different variations of the quiz. Once we run this function, we get back one of these quiz objects, and we then shuffle the options for every question in the quiz. The reason we shuffle the options is that LLMs tend to be pretty predictable about where they put the right answer. Oftentimes the right answer ends up being B, perhaps because that's a very common guess for humans who don't know the answer. So to account for this implicit bias in the model, we shuffle the options so that they're a little more random than what an LLM would produce if left to its own devices.

Let's create a quiz from our transcript as before, and print it out. As you can see, the quiz consists of many different questions, they all look pretty relevant to this particular earnings call transcript, and the numbers all look pretty reasonable as well. One question you might have is: how do we know this quiz is actually correct? We won't show it explicitly in this lesson in the interest of time, but what we can do is have the LLM take the quiz with the original transcript available, see which answers are correct, and then discard any questions where the answer the quiz claims is right is inconsistent with what the transcript says.

Now that we've generated the quiz, we're going to write the helper function that allows the judge model to take the quiz using the summary. Let's define our prompt for this use case: use the provided summary of a transcript to answer the following quiz. The prompt takes the quiz as input as well as the summary, and this is where our quiz-to-string function comes in handy. We tell the model to respond with just a list of answers and no additional text, and that it must provide an answer to all ten questions, so if it doesn't know, it should answer with zero. This is because, for the purposes of this problem, we don't want the model to take a random guess. If the model legitimately doesn't know the answer to a particular question because the information isn't in the summary, it should explicitly say so. Now let's define our take_quiz function.
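A minimal sketch of what that quiz-taking helper might look like is below; again, the prompt wording and the function name are illustrative assumptions, and the quiz argument is the Quiz object from the structured-generation sketch above:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

# Illustrative quiz-taking prompt; the notebook's exact wording may differ.
TAKE_QUIZ_PROMPT = """Use the provided summary of a transcript to answer the following quiz.
Respond with only a list of answers, like [A, B, 0, D, ...], and no additional text.
You must answer all questions. If the summary does not contain the information needed
for a question, answer 0 rather than guessing.

Quiz:
{quiz}

Summary:
{summary}"""


def take_quiz(summary: str, quiz, judge_model: str = "gpt-4o-mini") -> list[str]:
    """Have the judge model answer the quiz using only the summary.

    quiz is a Quiz object from the structured-generation sketch above."""
    prompt = TAKE_QUIZ_PROMPT.format(quiz=quiz.to_string(), summary=summary)
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content.strip()
    # Expected format: "[A, B, 0, D, ...]" -> strip the brackets and split on commas
    return [answer.strip() for answer in text.strip("[]").split(",")]
```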
It takes the summary as input as well as the quiz. We generate the quiz string, insert the quiz string and the summary into the prompt, and then prompt our judge model, again GPT-4o-mini, with temperature zero: give us your best answers to these questions, based on the summary. We get its response, and remember that the response is expected to be a list of answers surrounded by brackets, so we strip out those brackets and split on commas, which gives us a list of letters as answers. Let's run it and see what we get. As you can see, there's a good variety of different answers here, and at least one occasion where the model was not able to answer a question based on the information in the summary.

Finally, we need to score the answers that came out of the take_quiz function above. Let's write a helper function, score_quiz_answers, that takes the answers as well as the quiz. We do a simple sanity check to make sure the number of answers equals the number of quiz questions, and then we iterate over every answer and every question; each time they match, we add one to the number of correct answers. Dividing by the total number of questions gives us the fraction of quiz questions answered correctly. Let's run that, and we get a score of 0.7.

That was computing the score for just one of our summaries, but now let's run it on all of the summaries we generated previously. We simply iterate over every summary, take the quiz, record the answers, score those answers using our scoring function, and keep track of the results (a sketch of this scoring step appears at the end of the lesson). Printing out the rewards and the advantages, we can see that our quiz-based approach provides a decent amount of variety in the scores, and therefore in the advantages, compared to what we got from the LLM-as-a-judge method. As a result, we can expect a nice amount of learning from this process, because we now have this diversity of rewards and advantages.

In the next lesson, we'll take a closer look at this particular use case and think about some ways our reward function might be exploited by the model being trained, encouraging bad behavior, so-called reward hacking. And in the lesson after that, we'll come back to the idea of putting all of this together into a loss function.
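For reference, here's a minimal sketch of the scoring helper and the reward-and-advantage loop described above. The names (score_quiz_answers, take_quiz, summaries, quiz), the letter-to-index comparison, and the mean-centered advantage computation are assumptions for illustration; dividing by the standard deviation of the group is another common variant:

```python
import statistics


def score_quiz_answers(answers: list[str], quiz) -> float:
    """Fraction of quiz questions answered correctly from the summary alone."""
    assert len(answers) == len(quiz.questions), "answer count must match question count"
    letters = "ABCD"  # assumes four options per question
    correct = 0
    for answer, question in zip(answers, quiz.questions):
        # question.answer is an index; the model responds with a letter (or 0 for "don't know")
        if answer.strip().upper() == letters[question.answer]:
            correct += 1
    return correct / len(quiz.questions)


# Score every sampled summary against the quiz, then turn rewards into advantages
# by centering on the group mean. `summaries`, `quiz`, and `take_quiz` come from
# the earlier steps in this lesson.
rewards = [score_quiz_answers(take_quiz(s, quiz), quiz) for s in summaries]
mean_reward = statistics.mean(rewards)
advantages = [r - mean_reward for r in rewards]
print(rewards, advantages)
```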