Optimizing to production-level accuracy is one of the biggest pain points we've seen developers experience when working with LLMs. With so much guidance out there for prompt engineering, RAG, and fine-tuning, figuring out which optimization you need to hill-climb on your evals can be a difficult problem to frame and solve. Luckily, it appears to be one of the use cases that o1 is very capable at. In this session, we'll focus on how to use o1-mini to work with a set of evals to optimize our prompt for a task and improve our eval scores. Let's dive in.

Welcome to your final lesson on meta prompting, the practice of using an intelligent model like o1 to iteratively improve the instructions of a less intelligent model. The use case we're going to focus on here is routine generation. This is a common situation where existing knowledge base articles are written for human consumption and are difficult for LLMs to follow reliably. We're going to use o1 to read in these knowledge base articles and convert them into a routine, which is optimized for LLM usage. We're then going to feed that routine to a 4o model, which will try to pass an eval using that routine. We're going to do this a number of times, and after every iteration we're going to hand the eval results and the routine we used back to o1, which will try to improve the routine to solve the issues it saw in the eval. This iterative process is known as meta prompting: we use a more intelligent model to improve the instructions or guidance for a less intelligent model, and then use the eval results to iteratively improve them.

There are a few steps there, so before jumping in, I want to summarize. We're going to take a flight cancellation and changes policy as our example. Step one is to generate a 4o routine: we'll read in a policy that was written for human consumption and give it to o1 to generate a routine that implements that policy. The routine contains a prompt and also a set of tools, which we've given to o1 to let it know that it can use some external actions to carry out the policy. Step two is to evaluate 4o's performance at carrying out that policy on an eval set. That will let us know how well our conversion process has worked. Step three, the critical step, is where we'll perform the meta prompting: we'll do a series of iterations where 4o will try to carry out the routine, the resulting eval will be given to o1, and o1 will then try to improve the routine and give it back to 4o. We'll try this three times, and hopefully one of these iterations will end up being a superior routine, which we can then adopt and take into production.

Let's begin with step one. You'll begin, as always, with your imports. We'll ignore warnings, load up our OpenAI API key, and import our libraries. We'll also bring in a GPT model and an o1 model, which we've defined in a config file. Feel free to change these if you'd like to try more intelligent models, but in this case we're going to use GPT-4o-mini as our GPT model, and for our o1 model we're going to use the mainline o1 model so that we can take advantage of its ability to use structured outputs and tool calling. We're also going to import a function that will compare two string inputs and show the differences. This will come in handy as we go through the meta prompting loop, because we may want to review o1's work as it reviews routines and updates them.
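As a rough sketch, the setup might look something like this. The config names (GPT_MODEL, O1_MODEL) and the diff helper are assumptions based on what's described here, not the course code verbatim:

```python
# Minimal setup sketch: client, model config, and a diff helper for reviewing
# routine changes. Assumes OPENAI_API_KEY is available in the environment.
import warnings
import difflib
from openai import OpenAI

warnings.filterwarnings("ignore")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GPT_MODEL = "gpt-4o-mini"  # the model that will follow the routine
O1_MODEL = "o1"            # the model that will write and improve the routine

def show_diff(old: str, new: str) -> str:
    """Return a unified diff so we can eyeball how o1 changed a routine."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous_routine", tofile="updated_routine", lineterm="",
    )
    return "\n".join(diff)
```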
What you may find during this meta prompting is that o1 overfits to the eval results it sees, and to check that it's not doing that, we want to do some visual checking of the routines that come out of the meta prompting process.

You're now ready to begin with step one, where we'll generate a routine from our original human-written policy. We'll begin by printing the policy. You can see here the purpose, a few notes on the tone the agent should use, and then a table of contents. The main thing to take away is that nowhere does it explain to the LLM which actions are external and which it can carry out directly with the contents of this policy. As well, the details are spread throughout the document, so the model will have to jump up and down the document to answer questions, using a lot of logic as it works things out. We're going to rewrite this with o1 so it will be much simpler for the model to follow these instructions reliably.

Here's a prompt for o1 to carry out this conversion process. This conversion prompt has an objective at the top, where we tell the model what its purpose is, and then some instructions on how to carry out the process. You can see that we're using the advice from earlier lessons: setting out a structured format it should use, and also specifying items like only using the available set of functions. We're importing a list of tools from a config file, which are the external actions the model will be able to use to carry out this policy. We then ask it to convert the following policy into a formatted routine, ensuring that it's easy to follow and execute programmatically, and we remind it to only use the functions provided and not to create net-new functions. This is something to think about, because in real life we've seen customers use this approach, and sometimes you do actually want the model to suggest new functions, which you might then need to create and add to your list of function calls for the eventual LLM agent to carry out. But in this case, we're going to restrict it to the available functions.

I'll save that prompt, and we'll move on to create a function to carry this out. This is the generate routine function, a simple function that takes a single argument, the policy, and uses that conversion prompt. In this case we're going to use o1-mini to give us our baseline quickly. Save that function, and we're ready to generate our routine, which I'll do with this command. Congratulations, you have your first LLM-generated routine. Let's run this command to take a look at it. You'll see that the model has followed our instructions: it has broken out the various processes within that policy into distinct, demarcated sets of instructions, and it has specified which function the LLM should call to execute each one in order. This will make it simpler for the model to work through the logic, the if-then-else, of carrying out each of the instructions contained within the routine.

The last step before we evaluate this is to check that there are no data quality issues with the routine our LLM has generated. I'm going to bring in a function here that will let us check for overlap between the functions we defined at the start and the routine that's been generated. This function will perform some regex matching between the two lists to find out which items exist in one and not the other.
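Here's a minimal sketch of those two helpers, assuming the conversion prompt and the tool definitions are loaded from the config as CONVERSION_PROMPT and TOOLS; the names and the regex are illustrative, not the exact course code:

```python
import json
import re

# `client` comes from the setup cell above.

def generate_routine(policy: str) -> str:
    """Ask o1-mini to rewrite the human-written policy as an LLM-friendly routine."""
    prompt = (
        CONVERSION_PROMPT
        + "\n\nPOLICY:\n" + policy
        + "\n\nAVAILABLE FUNCTIONS:\n" + json.dumps(TOOLS, indent=2)
    )
    # o1-mini takes a single user message (no system prompt).
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def check_function_overlap(routine: str, tools: list) -> None:
    """Compare function names referenced in the routine with our tool definitions."""
    defined = {tool["function"]["name"] for tool in tools}
    # Anything that looks like a call, e.g. "verify_identity(...)", plus any
    # defined name that simply appears in the routine text.
    mentioned = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*(?=\()", routine))
    mentioned |= {name for name in defined if name in routine}
    print("Referenced in routine but not defined:", sorted(mentioned - defined))
    print("Defined but never referenced in routine:", sorted(defined - mentioned))
```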
Running that check gives us what we're looking for: there are no items in our routine that are not in our list of functions. That's great; we don't want any of those. There are three functions in our list of function definitions that weren't used in the routine, but that could actually be okay. We'll run the evaluation and check whether these should have been included, and if so, we'll need to make some adjustments. For now, we've generated our routine from a human-written policy and completed step one, so we're ready to move on to step two, which is evaluation.

For step two, evaluation, I want to go through a little bit of theory about what's going on under the hood before we dive into the code. For each eval example, we've recorded a request the customer is going to start with, a bunch of information the customer can use, and an expected response. That's the correct response this customer needs to receive from our assistant if they're going to get what they want out of this process. The reason we've designed our eval set like this is because of the difficulties of evaluating a multi-turn interaction like these customer service routines.

To recap what's happening here: we start off with our o1-generated routine, which is given to 4o as the policy it needs to follow. Then we begin with the customer request, which will be something hardcoded like "I would like to cancel my flight, please." Now, our 4o agent might need to do some verification of that customer before actually getting to the refund stage. So, to enable the conversation to continue, we also spin up a 4o customer, who uses the request and information we've defined to continue the conversation for multiple turns until we reach a function call, at which point we'll terminate. Certain function calls we're going to ignore, like check ticket type, because those are generic function calls the 4o agent will need to carry out to figure out whether, for example, the customer is eligible for a refund. So a multi-turn interaction could look like this: the customer provides their booking reference, their name, and their flight number, and the 4o agent goes back and forth with that customer for a couple of turns before selecting an exit tool. We then compare that to our eval set, and the result will be either true or false. We evaluate two things for each record: first, whether the right tool is called, and second, whether the correct arguments are provided. This is what each example in our eval set is going to look like.

Again, this is a generic framework. It isn't unique to meta prompting; it's actually a useful pattern for multi-turn evaluation that's usable in any context, even if you aren't going to do meta prompting. But our hope is that in meta prompting, it will give the model that's optimizing the routine a few more options to optimize, to enable the agent to perform better on subsequent iterations. So now we'll jump into the code and try this out for real.

To begin the evaluation step, we need to define a few functions. The first one is the agent response function, which enables our agent to respond as the conversation continues. The inputs to this function are a transcript, which is all the user and assistant messages so far; a policy, the routine the agent should be following; and the model we're going to use. We also pass in the tools we provided at the start, and we disable parallel tool calls.
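A minimal sketch of that agent response function might look like this, using the client, TOOLS, and GPT_MODEL from the setup above:

```python
def agent_response(transcript: list, policy: str, model: str = GPT_MODEL):
    """Ask the agent model for its next turn, given the routine (as the system
    prompt) and the conversation so far. Any tool call comes back on the
    returned message."""
    messages = [{"role": "system", "content": policy}] + transcript
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=TOOLS,
        parallel_tool_calls=False,  # one tool call per turn, for simplicity
    )
    return response.choices[0].message
```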
Disabling parallel tool calls is a choice for simplicity here, to make this easier to run through. In your use case you may actually want to enable parallel tool calls and allow the model to, for example, call a verification tool and the refund tool together, and then decide externally what order you execute those in. But to keep this simpler, at the expense of more turns, we're going to turn parallel tool calls off. We'll save that function and define a couple more.

The next two are, firstly, a simple function to handle any null strings that we get from each record in the test rows we're going to send through, and then the more important function, called process row, which we run on every step in the eval set. For every one of our eval examples, the key attributes are extracted, and we also stand up a customer prompt, where the synthetic customer who's going to talk to our agent is defined with the variables we've input from the eval data. There are a few that we instantiate in the system prompt so that they're present in every interaction for the customer. But for others, like the ticket type or fare rules, our agent will have to call the right function to ensure those are present in the conversation for it to respond correctly.

We then initialize a transcript, which will be the back and forth of the agent and the customer, and loop through a number of iterations. We've put a breakpoint in here just in case it starts looping forever; in that case we fail that iteration and finish with none for the outputs. If we do get a response, we continue the conversation. If we don't get tool calls, we just get our synthetic customer to respond. Otherwise, we make a decision based on what the tool call is. For tools like verify identity, ask clarification, check ticket type, and similar functions, we continue the conversation. For some we generate a synthetic customer message, which lets us continue; for others we synthesize a tool response, which usually means inputting some of the values for that eval row. For example, if we call check ticket type, we then interpolate the ticket type and fare rules for that customer's booking. This is really the critical function for executing our eval tests. We then wrap it in the evaluate function calls function, which lets us use a thread pool executor to process these in parallel. So I'll save that, and we're now ready to execute the evaluation.

But before we do that, you should have a look at the data we've provided. Here's the eval dataset. The key things to pull out are that each customer has a name, a booking reference, a flight number, an expected function, and expected inputs to that function. The expected function and expected inputs are the two fields we're going to evaluate for each of our evaluations. We've also included a few details: for some of them we have a refund amount, for some we have fare rules, a ticket type, and so on, which may be critical for the agent to extract with certain tool calls to answer the customer's request successfully.
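To make that concrete, here's an illustrative eval record; the field names and values are assumptions based on the columns just described, not rows from the actual dataset:

```python
# A hypothetical eval row, showing the shape of the data the evaluation uses.
example_row = {
    "name": "Jane Doe",
    "booking_reference": "ABC123",
    "flight_number": "OA101",
    "ticket_type": "Economy Flex",                      # surfaced only via a tool call
    "fare_rules": "Free changes up to 24 hours before departure",
    "refund_amount": None,                              # only present for refund cases
    "expected_function": "process_change_no_fee",       # the tool call we expect
    "expected_inputs": {"booking_reference": "ABC123"}, # the arguments we expect
    "request": "I need to change my flight to later next week, please.",
    "context": "Prefers a morning departure; travelling for work.",
}
```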
The request is the verbatim request the customer will use to initialize the interaction with the agent, and the context is generic context that goes into the customer's system prompt, to enable them to carry on a multi-turn conversation with a little more information if they need it. These are pretty basic, but you can imagine a real interaction where there's a ton of supporting data the customer may need the agent to know, which you would want to interpolate into the context to enable a multi-turn conversation to take place. These are our eval examples.

The last function we'll define before we run the evaluations is a display row function, which displays a single row with the full transcript, so we can see exactly what the back and forth interactions look like in practice. With that, we're ready to run our evaluation. Let's kick that off.

You now have some baseline results. We can have a look at this data frame to see exactly how we performed. Our accuracy is 64%, and by looking across a couple of these rows we can see where the model has gone wrong. For example, our first example was supposed to be a partial refund, but instead we offered a flight credit, which is incorrect, so that has been assessed as a fail. The second one looks correct: we should have applied a change fee for booking reference DEF456, and we have. Great. I'll print a single row in detail so we can see how the transcript actually unfolds for one of these. This was our first failed record. You can see that we actually had quite a few turns. The user said "I want to cancel my flight because of a medical emergency," and we can see that the agent decided to verify the identity; once it had verified it, it offered a full cancellation but asked for a medical certificate. It asked for the details, and then our model failed to continue the conversation. We can see a failure on the customer side, and in the end this probably hit the ten-turn limit and bailed out. So, a failed example; hopefully our meta prompting can fix this. With that, we've got our baseline run. To recap, we scored 65%. That's not great, but hopefully that leaves a lot of room for our meta prompting runs to improve.

Now you'll go through step three, the most critical step of this process, where we improve the 4o routine using meta prompting. The first step is to define a simple function to get a response from the chat completions API. The one nuance here is that we've added a response format parameter: if we do supply a response format, then we want to use structured outputs. The reason for that will become clear in a minute, but the summary is that the structured output schema we're looking for is a final answer, which will contain an updated routine from the meta prompter. We'll see why that's important in a second.

The next step is to define our prompt, and here we have it. This is the user message we'll insert for o1. It has instructions first of all, where we tell it that it's an agent responsible for improving the quality of the instructions it's presented with. We also give it the criteria: it's meant to analyze the existing instructions, understand any issues, improve them to address the gaps, and ensure that any changes are compliant. We repeat the instruction from above that it should only use the tools that are provided, and we've also suggested that it can change the format.
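Here's a minimal sketch of that response helper and the structured output schema; the class and field names are assumptions, and `client` is the OpenAI client from the setup:

```python
from pydantic import BaseModel

class FinalAnswer(BaseModel):
    # The full improved routine, exactly as it should be handed to the agent.
    updated_routine: str

def get_openai_response(messages: list, model: str, response_format=None):
    """Call the chat completions API; if a schema is supplied, use structured
    outputs so the reply is parsed into that schema for us."""
    if response_format is not None:
        completion = client.beta.chat.completions.parse(
            model=model, messages=messages, response_format=response_format
        )
        return completion.choices[0].message.parsed
    completion = client.chat.completions.create(model=model, messages=messages)
    return completion.choices[0].message.content
```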
That suggestion about changing the routine's format is just an example: we've seen with meta prompting that whether a routine is written in XML or markdown can actually make quite a big difference for certain datasets. The thing to consider here is what specific guidance you would give o1 for your use case; there may be particular optimizations you suspect will be useful for your dataset, and this is the place to give them. We then explain to the model that it's been given four items: the ground truth policy it began with, the list of available functions, the routine instructions (just the latest version), and a set of eval results. We may also provide it with a history of edits and evaluations, which just means we may give it multiple previous routines and evals to show how well each of them performed. Then we give it the data it needs: first the original policy, which will always be static here, and then the available functions. We finish with a conclusion, which is to return the improved policy exactly as written within the defined JSON. This is where we'll use that structured output schema: the meta prompter generates a new routine given all the previous iterations and these instructions, and we extract it with structured outputs into the format we need to use as the routine for the next iteration. So I'll run that and move on to the next step, which is to import tiktoken and set it up to count the tokens in a list of messages. This will be useful for our meta prompter: we'll try to insert as many previous examples as we can, so that the model doesn't repeat mistakes it made in previous runs of meta prompting, but we're obviously limited by the token limits, and this is what we've put in to deal with that.

The next and primary step is to execute the meta prompting. We start with a few lists: the first is a list of routines; the second is a list of results, the data frames of eval results; and the third is the accuracies, which we'll plot at the end to figure out which of these routines we want to keep. Then we have the o1 messages, which start as just the o1 message from above, and the maximum number of tokens we're willing to spend appending previous eval runs. Then we start our meta prompting. We're going to loop three times. That's just what we're doing here to minimize cost; in fact, with this approach the model will try all sorts of different methods. It typically will not just gradually converge on the correct answer; it will often try completely different things a few times, and results may get worse before they get better. So it may be worthwhile to do five, ten, or twenty iterations to get this right.

What we do in the loop is: if we have any previous eval results, we start appending the previous eval messages, provided we have the tokens for them. If not, we do nothing. We did leave a placeholder here to add some truncation logic. We chose not to for this course, but if you wanted to truncate them and still include them in some way, perhaps summarizing them rather than including the entirety of the previous routines, this is where you'd do it. In this case, we simply leave them out if we don't have tokens for them. Then we get the updated routine from the meta prompting assistant: we call our get OpenAI response function with our structured output schema and get back a routine JSON.
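Putting those pieces together, a minimal sketch of the loop might look like this, reusing the get_openai_response helper and FinalAnswer schema sketched above. Names such as META_PROMPT, evaluate_function_calls, eval_df, the is_correct column, and the baseline variables stand in for objects described in this lesson and are assumptions, not the exact course code:

```python
import tiktoken

def count_tokens(messages, encoding_name: str = "o200k_base") -> int:
    """Rough token count so we know how much history we can hand back to o1."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(str(m.get("content", "")))) for m in messages)

routines = [baseline_routine]       # every routine tried, starting with the o1-mini baseline
results = [baseline_results_df]     # the eval dataframe for each routine
accuracies = [baseline_accuracy]    # one accuracy score per routine
MAX_CONTEXT_TOKENS = 50_000         # budget for previous routine/eval history

for i in range(3):  # three meta prompting iterations, to keep costs down
    # Build the o1 message list: the meta prompt, then as many previous
    # routine/eval pairs as the token budget allows.
    o1_messages = [{"role": "user", "content": META_PROMPT}]
    for past_routine, past_df in zip(routines, results):
        candidate = {
            "role": "user",
            "content": f"Routine:\n{past_routine}\n\nEval results:\n{past_df.to_string()}",
        }
        if count_tokens(o1_messages + [candidate]) < MAX_CONTEXT_TOKENS:
            o1_messages.append(candidate)
        # (placeholder: truncation or summarization logic could go here instead of skipping)

    # Ask o1 for an improved routine via structured outputs.
    parsed = get_openai_response(o1_messages, model=O1_MODEL, response_format=FinalAnswer)
    new_routine = parsed.updated_routine
    routines.append(new_routine)

    # Re-run the evaluation with the new routine and record how it did.
    df = evaluate_function_calls(eval_df, new_routine)
    accuracy = df["is_correct"].mean()
    results.append(df)
    accuracies.append(accuracy)
    print(f"Iteration {i + 1}: accuracy {accuracy:.0%}, "
          f"failed rows: {list(df[~df['is_correct']].index)}")
```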
We then turn that routine JSON into a string, load it into a new routine, and append it to our routines object. Then we're ready to repeat the evaluation from above: we load in our data frame again, get our results and our accuracy, and append those, along with a list of the IDs of the failed rows. That's useful as a sanity check on the sorts of things our model is getting wrong. Finally, we start building up a new set of o1 messages that includes the most recent routine, plus any other ones we can fit. That's the logic for the meta prompting, so I'm going to execute it, and let's see how our model does.

The meta prompting has completed, so we'll dive in and have a look at our results. Our first iteration didn't do too well, at 71%, but if we scroll down you can see the errors being logged, and we can see a couple of examples. We'll just look at the three results before we dive in. Iteration two looks pretty good: 88%. That's not bad at all, with only a couple of errors. Iteration three is still pretty good at 82%, but not as good as iteration two. If we take a look at a couple of the results, for example this one, the request was "I need to change my flight within the next seven days." You can see that the model did actually get the right function, process change no fee, in both cases, but the arguments didn't match, so we got a failure. This may actually be an issue with our eval function being a bit too strict, or maybe we haven't given the synthetic customer enough context to properly supply this information. Whatever the cause, the result still looks pretty encouraging: from our baseline, we've improved by more than 10% with a couple of runs of meta prompting. Pretty good.

Just to chart those out, I'm going to make a quick plot here, and here we have it. Our baseline was just over 70%, it stayed roughly consistent for the first meta prompting run, then we got all the way up to 88%, and then down to 82%. So the second iteration is the run we'll go with as our best routine. I've got a bit of code here to print it so we can get an idea of what that routine looks like, and here it is. It looks like the model figured out a slightly different way of annotating the different sub-bullets, and for whatever reason this seemed to work pretty well. Fairly interesting, and an example of some of the less traditional approaches the model will take during meta prompting; in this case, it produced a positive performance boost. So potentially we can use this as our golden routine going forward and do further iterations, either manually or with more targeted meta prompting, to see how much performance we can squeeze out of it.
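If you'd like to reproduce that chart and the best-routine printout, a few lines like these will do it, assuming the accuracies and routines lists from the loop above:

```python
import matplotlib.pyplot as plt

# Plot baseline plus each meta prompting iteration.
labels = ["baseline"] + [f"iteration {i}" for i in range(1, len(accuracies))]
plt.plot(labels, accuracies, marker="o")
plt.ylabel("Eval accuracy")
plt.title("Eval accuracy across meta prompting iterations")
plt.show()

# Keep the routine with the highest accuracy as our candidate golden routine.
best_routine = routines[accuracies.index(max(accuracies))]
print(best_routine)
```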
So that's the completion of our meta prompting lesson. To recap what we did: we started off with a policy written for a knowledge base, for human agents to use to answer customers' questions. We used o1 to convert that into a routine optimized for LLM consumption. Then we created an evaluation where we could run multi-turn evaluations, in which an AI customer takes a request and plays off against the AI agent we built with our routine. We weren't entirely happy with those results, so we then did three rounds of meta prompting, the third step, where we had o1 look at each set of eval results and iterate to try to get the best possible routine. In doing so, we got a boost from our baseline of about 70% all the way up to 88% accuracy on our eval.

We hope this lesson on meta prompting has been informative and has stimulated ideas about where it could fit into your workflows. Where could you use this? We've used it for customer service here, but it applies anywhere you have multi-turn conversations, a number of variables you want to iterate over, or a bunch of tools you want to iterate through. There's a huge number of applications for this approach. We're extremely excited to see how all of you use these capabilities, and I hope that you've enjoyed the course as a whole. We look forward to seeing what you build, and we'll see you in the next one.