In this final notebook, you can take everything you've learned and compose it together into one large experiment, so you can test all the different parts of your agent at once. You'll combine some of the evaluators you created in previous notebooks with the experiment structure to create an easy way to iterate on your agent.

As before, import the different libraries you're going to use. These should look familiar at this point: your LLM client, some of the experiment functions you'll use, and then, from utils, some functions including the agent you've been working on so far. You're going to run a few different evaluators, so it's helpful to define an eval model at the beginning, saying this is the LLM-as-a-judge model you'll use throughout. You're also going to be accessing the Phoenix client, so you can go ahead and add that in as well.

You're going to use one big experiment to structure this code, and the first step is creating the dataset that will be run through that experiment. So to start out, you can create the dataset here, and it's a fairly large one. You have the questions that are being input, some ground truth data like the SQL results you expect for those questions, as well as some ground truth data about the SQL generated for those questions. Because you're combining multiple evaluators here, you have multiple kinds of ground truth data being added. As before, you'll use the same syntax to create a DataFrame out of those examples and then upload that DataFrame into Phoenix under the name Overall Experiment Inputs.

One thing that's important to note is that because you have multiple keys this time, when you upload the dataset into Phoenix you define which of those keys are input values, in this case just the question, and which of them are expected output values, your SQL result and SQL generated keys. Run that, and your dataset will be uploaded into Phoenix. You can take a look in Phoenix under the Datasets section; you may need to refresh Phoenix for it to appear. You'll now see the Overall Experiment Inputs dataset, and if you click into any of its examples, you'll see that each one has an input key for the question and expected output keys for the SQL result and SQL generated.
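As a rough sketch, the dataset creation and upload might look something like this. The example row is only illustrative and the ground-truth values are elided; it assumes Phoenix's upload_dataset accepts input_keys and output_keys, as in the earlier notebooks:

```python
import pandas as pd
import phoenix as px

# Illustrative example only; your real dataset has several questions
# plus the ground-truth values you collected for them.
overall_experiment_questions = [
    {
        "question": "What was the most popular product SKU?",
        "sql_result": "...",      # expected SQL result (elided here)
        "sql_generated": "...",   # expected generated SQL (elided here)
    },
]
overall_experiment_df = pd.DataFrame(overall_experiment_questions)

px_client = px.Client()
dataset = px_client.upload_dataset(
    dataframe=overall_experiment_df,
    dataset_name="Overall Experiment Inputs",
    # because there are multiple keys, tell Phoenix which columns are
    # inputs and which are expected outputs
    input_keys=["question"],
    output_keys=["sql_result", "sql_generated"],
)
```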
Now, you might be asking why there's only ground truth data for the SQL result and SQL generated. The reason is that you can still use LLM-as-a-judge when you don't have expected ground truth data to compare against, so for some of the other tools you'll use an LLM-as-a-judge approach that doesn't need an expected value.

On the subject of LLM-as-a-judge, it's now time to define all of the different evaluators you're going to use. There are two LLM-as-a-judge prompts you'll want here, and they should look familiar from the previous notebooks. One is your LLM-as-a-judge prompt for clarity of the response, and the other is for what we're calling entity correctness. Entity correctness is another type of LLM-as-a-judge check that you can run on the output of your data analysis step, to make sure that any variables that are mentioned, any SQL columns or anything like that, are mapped correctly from the input, through the data used during the run, and into the output. This is basically making sure that your agent doesn't start referring to SKU columns as store IDs, or something along those lines.

Now you can define your evaluator functions, going one by one, starting with the router. For your router, you're doing a function-calling eval. Because you're using this in the context of an experiment, you can expect it to be passed an input and an output, and in some cases an expected output as well. In this case you don't have ground truth data for function calling, so you just have the input, which is the original input your agent received, and the output, which comes from running your agent. From that output you'll grab the tool calls that were made, which will be one of the entries in the output, and then construct a DataFrame using the question as one column and each of the function calls as the rows. The reason for constructing a DataFrame is, one, that it works nicely with the llm_classify method, and two, that you might actually have multiple tool calls per question. You might have a case where a question came in and your agent decided to call two different tools; this structure will then create a DataFrame with two rows, each with the question and a tool call column. Then you use your LLM-as-a-judge via the llm_classify method again, with that DataFrame and the tool-calling prompt template, exactly the same code you ran a couple of notebooks ago for the function-calling eval. You can define that method, and at the end return the mean score, because again, your agent might have called two different functions, so you need some way of reporting back an overall score for the run.
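Here's a minimal sketch of what that router evaluator might look like. It assumes the tool-calling prompt template from the earlier notebook (with question, tool_call, and tool_definitions placeholders), a judge model you've already defined, and that the processed agent output exposes its tool calls under a tool_calls key; the exact names may differ in your code:

```python
import pandas as pd
from phoenix.evals import OpenAIModel, TOOL_CALLING_PROMPT_TEMPLATE, llm_classify

eval_model = OpenAIModel(model="gpt-4o")  # whichever judge model you defined above

# Assumed list of the agent's tools, filled into the template's {tool_definitions}
TOOL_DEFINITIONS = "lookup_sales_data, analyze_sales_data, generate_visualization"

def function_calling_eval(input, output):
    if output is None:
        return 0.0
    tool_calls = output.get("tool_calls", [])  # assumed key in the processed output
    if not tool_calls:
        return 0.0
    # one row per tool call, since a single question can trigger multiple calls
    eval_df = pd.DataFrame({
        "question": [input.get("question")] * len(tool_calls),
        "tool_call": tool_calls,
        "tool_definitions": [TOOL_DEFINITIONS] * len(tool_calls),
    })
    rails = ["correct", "incorrect"]
    result = llm_classify(
        dataframe=eval_df,
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        rails=rails,
        model=eval_model,
        provide_explanation=True,
    )
    result["score"] = (result["label"] == "correct").astype(int)
    # report the mean so multiple tool calls roll up into one score for the run
    return result["score"].mean()
```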
Now you can start evaluating your tools. One of those tools is your database lookup tool, so you can write a method to evaluate it; in this case it's evaluating the SQL result. If you think back to the DataFrame, you'll remember that you have some ground truth data here, some expected responses for the SQL. So in this case you still have the output of your agent, but you're also going to ask for the expected value, the expected ground truth responses that you defined in your dataset. The function first checks that the output is there, and then grabs the SQL result from the actual output that was run. To do that, it loops through all the tool calls made in that particular run of the agent and looks for the one whose name matches lookup_sales_data. The idea is that you only want to look at the tool call responses for the lookup_sales_data tool, not your other two tools. Once you have that, you grab the response from that particular tool call, so all of this code is just getting you to the point where the SQL result equals the result your agent produced when it ran.

Then there's one more step, which is to pull out just the numerical values from that SQL result and from the expected, target SQL result, which is the expected value in your dataset. The reason to pull out just the numbers is that in SQL you can name the columns in your responses using different names, and this makes sure you don't penalize a case where your agent used a slightly different column name in its response but still got the correct numerical result. Finally, the function returns true or false depending on whether those numbers match.

One thing you'll notice here is that you're using a code-based evaluator to evaluate your SQL generation. If you think back to a couple of notebooks ago, you actually used an LLM-as-a-judge to evaluate your SQL response. Either of those approaches can be used. As you saw in the slides, LLM-as-a-judge will work but may not be 100% accurate, so the method you're looking at here, comparing against ground truth, is generally a more accurate kind of evaluation. At the same time, it relies on you having that ground truth data, those expected values, which you may not have in large amounts or at large scale.
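A minimal sketch of that code-based evaluator, assuming the processed output stores tool calls as dictionaries with name and response keys (adjust to match your actual output format):

```python
import re

def extract_numbers(text):
    # pull out just the numeric values, so different column names
    # in otherwise-correct results don't cause a false mismatch
    return re.findall(r"\d+\.?\d*", str(text))

def evaluate_sql_result(output, expected):
    if output is None:
        return False
    # find the response from the lookup_sales_data tool, ignoring the other tools
    sql_result = None
    for tool_call in output.get("tool_calls", []):
        if tool_call.get("name") == "lookup_sales_data":
            sql_result = tool_call.get("response")
            break
    if sql_result is None:
        return False
    # compare only the numbers against the ground-truth sql_result from the dataset
    return extract_numbers(sql_result) == extract_numbers(expected.get("sql_result"))
```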
Now, moving on to your second tool, you have your data analysis tool, and there are actually two different evaluations you can run for it, matching the two prompts you saw above. One is to evaluate the clarity of the response. In this case you can use an evaluate_clarity method that takes in the output of your agent and the original input to your agent, constructs a quick DataFrame out of those using query and response columns, because that's what your LLM-as-a-judge prompt expects, and then calls llm_classify to classify the result of that particular run of your agent. This one is exactly the same as the clarity LLM-as-a-judge you ran a couple of notebooks ago, just now wrapped in a function that knows to expect things like the input and the output from the different steps. Similarly, you can add another evaluator for entity correctness. It's the same kind of flow: you have a method, you create a DataFrame out of the input and the output, and then you call llm_classify using your LLM-as-a-judge prompt.

So now you've got evaluators for your router, your first tool, and your second tool. The last piece is your third tool, and here again you can use an evaluator you've used previously, which checks whether the generated code is runnable. For this code_is_runnable evaluation you don't need the input, and you don't have any expected values to compare against; you just have the output. What you do is take that output and again pull out the tool call responses, similar to what you did for the database lookup tool, so all of this code is just getting you to the point where the generated code equals the code produced when the tool name equals generate_visualization. Then you strip away any extra characters, and at the end you simply execute that generated code: if it executes correctly, return true; if not, return false. This is the same evaluation you used a couple of notebooks ago, just with some extra manipulation to pull out the specific output it needs.

Great. Now you have your dataset as well as all the evaluators that you want to run in your experiment. The last piece you need is your task, so you can define a task method here, in this case run_agent_task. It's going to take the example, grab the input question from it, and then run your agent on that. You can define that, and one other thing to call out is the process_messages method that's called; it's also in your utils file, and it just takes the output of your agent and processes the messages into a slightly cleaner format to read.

With your dataset, your task, and all of your evaluators defined, you're ready to run your giant experiment. You can use the run_experiment method, passing in the dataset, the run_agent_task, and, for evaluators, all of your different evaluator functions. What will happen is that each row of the dataset will be run through the agent task, and then the outputs of that will be run through all of your different evaluators.
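Putting that together, a sketch of the task and the experiment call might look like this; run_agent, process_messages, and the evaluator names follow the course utils and the sketches above, so treat them as assumptions:

```python
from phoenix.experiments import run_experiment
from utils import run_agent, process_messages  # helpers assumed from the course utils

def run_agent_task(example):
    # the example's input carries the "question" key defined when the dataset was uploaded
    question = example.input.get("question")
    messages = run_agent(question)
    # tidy the raw message history into a cleaner format for the evaluators to read
    return process_messages(messages)

experiment = run_experiment(
    dataset,
    run_agent_task,
    evaluators=[
        function_calling_eval,
        evaluate_sql_result,
        evaluate_clarity,             # assumed names for the LLM-as-a-judge evaluators
        evaluate_entity_correctness,
        code_is_runnable,
    ],
    experiment_name="overall-experiment",
)
```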
Go ahead and run that; it will take a moment. Once it completes, you'll see that you've run seven different tasks through your agent, and then you'll see a lot of output around all of the different evaluations. You're running 35 evaluations in total, because it's five evaluations for each of the seven cases, so there will be quite a bit of output; that's all normal and it's all looking good. You'll have some graphs, and then finally some scores at the bottom for those different runs. If you jump over into Phoenix, you'll see a detailed view there as well. Make sure you go to your Overall Experiment Inputs dataset, and you'll see columns for all of those different evals. It looks like all the code is runnable, the analysis is clear, and the entities are correct, but there are some issues with SQL generation and with function calling. You could jump into this experiment to see it in more detail, look at all the tool calls and the format of the responses for each of these pieces, and click into any of them if you wanted to.

Now the question becomes: what do you do with this information? How do you improve your agent? You've really got two options. One is in code: you can go in, start making changes to your agent, and run different versions of it using the experiment. SQL generation was a little rough there, so why not start by changing the SQL generation prompt? This is the base SQL generation prompt being used by the agent, and you can add something like "think before you respond." You might make bigger changes; this is just a basic one you could try out. There's a helper function here to update that SQL generation prompt; in practice you would go into your agent code and make the change there directly, but here the helper does the same kind of thing.

Then you can rerun your experiment. Because you've updated your agent, the task is now a new task, so you'll want to change the experiment name to something like V2, evaluating the overall experiment with changes to the SQL prompt. Run that experiment again and you'll get a comparable run.
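Continuing from the earlier sketch, that V2 run might look something like the following; the update_sql_gen_prompt helper and the SQL_GENERATION_PROMPT variable are hypothetical stand-ins for however your agent code stores its prompt:

```python
# Hypothetical: append an instruction to the agent's SQL generation prompt
SQL_GENERATION_PROMPT_V2 = SQL_GENERATION_PROMPT + "\nThink before you respond."
update_sql_gen_prompt(SQL_GENERATION_PROMPT_V2)  # assumed helper from the course utils

# Rerun the same dataset, task, and evaluators under a new name so Phoenix
# shows the two experiments side by side for comparison
experiment_v2 = run_experiment(
    dataset,
    run_agent_task,
    evaluators=[
        function_calling_eval,
        evaluate_sql_result,
        evaluate_clarity,
        evaluate_entity_correctness,
        code_is_runnable,
    ],
    experiment_name="v2-sql-prompt-change",
)
```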
If you scroll through those results, you'll see similar kinds of output as before, and as you continue down you'll see all the results that were run and the same kinds of scores. You can jump over into Phoenix to see them in detail; you may have to refresh your page, and you should then see a new entry for your second experiment, your V2 with changes to the SQL prompt. Looking at the scores, it seems SQL generation didn't really improve the way you wanted, so you might have to make more changes from there. It also looks like some of the responses that came back are less clear or correct this time. That could be due to the change you made, or it could simply be because you're running with seven test cases, which is a fairly small amount, so it's always good to do multiple runs of each experiment to get more statistically significant data back as you test your agent's performance. You can continue to make changes inside your agent, to one part or to multiple parts, to try to boost these scores over time, and keep making those code changes until you get the results you want from your agent.

There's one other approach you can use as well, which is in Phoenix. There's a tool called Prompt Playground: you can click the playground button, and that will bring your dataset into a playground environment where you can run whatever queries you want over your dataset. You could take that same SQL generation prompt and run it directly there on your data if you wanted to: copy the prompt over from your notebook and paste it in. In this case, you might want it to pull in specific variables from your inputs and outputs, so if you add curly braces around a name like question, it will pull in the question from that particular row. If you then ran this, you would get results for each row that you could look at. You can also use the compare button to add a second iteration of the prompt; here you can pick which model each prompt runs against, for example GPT-4o mini, keeping the settings consistent on both sides so the comparison is fair. You could then make changes to your prompts directly in the UI, and this will give you a side-by-side comparison of the responses.

So you can always make changes to your agent in the code itself and run them as a new experiment, or you can use the prompt playground tool to try changes directly in the UI and get a slightly faster iteration cycle. Whichever of these techniques you use, you can make changes to your agent and use the experiment you've created as an easy, apples-to-apples way to compare the actual effects of each of those changes, so that you're not just moving things around without understanding what each change does. You now have a structure that you can use to iterate on your agent and compare all of those different iterations to each other in a structured way.