In this lesson, you're going to take the agent that you've already built and traced and add some evaluations to measure its performance. Similar to before, you'll start by importing some libraries. Some of them you've seen before, like Phoenix, and some are new. Specifically, from phoenix.evals you're importing an LLM-as-a-judge prompt template that will be used here, along with methods like llm_classify that help you run LLM-as-a-judge evaluations. One other import worth calling out is nest_asyncio, which is used to run multiple calls asynchronously and simultaneously to speed up your evaluations. You'll see where that comes in in a second.

Again, Phoenix has a concept of projects, so to keep things separate from the previous notebook, you may want to use a different project name here; in this case you could use "evaluating-agent". To keep this notebook simple, the entire agent you set up previously has been provided in the utils file attached to this notebook, and you're going to import a few methods from there. If you run this, you'll see the same output you saw in the previous notebook saying that tracing has been connected and set up. Again, this is all code you created in the last notebook; it was already getting pretty long, so it's been separated out into a utils file for you. The agent is ready to run and connected to the Evaluating Agent project, so anything you run through the run_agent or start_main_span methods will be traced and sent into the Evaluating Agent project within Phoenix. It just lives in that separate utils file to keep things clean here. You've also imported tools, which will be used for one of your evaluations as well.

Now, when you're evaluating your agent, there are two different approaches you can take. One starts from a dataset of examples that you run through your agent; you'll learn more about that in the next modules, where you'll cover experiments. The other is to run real-world examples through your agent, trace those examples, and then evaluate the performance of the agent using those traces. To do that, you can set up some basic questions for your agent, things like "What was the most popular product SKU?", "What was the total revenue across stores?", and a few other queries along those lines. Then you can loop through those questions and call the start_main_span method you imported above with each one, so each question runs through your agent and gets traced in Phoenix (you'll see a rough sketch of this setup in a moment). This will take a minute or two to run, so go ahead and kick it off.

Great. Once your calls have all finished running, you can jump back into Phoenix and see the traces for all of those runs. If you start from the projects view, you should now have a new project called Evaluating Agent with six traces in it. You'll see a row for each run of your agent, and you can click into those and see more details on what the agent executed for a given run, including calls that use multiple different tools.
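Putting those pieces together, a rough sketch of the imports, project setup, and question loop might look like the following. This is only a sketch, not the notebook's exact code: the utils import names (run_agent, start_main_span, tools), the project name string, and the start_main_span argument shape follow the lesson's description, so adjust them to match your own utils file.

```python
import nest_asyncio
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, llm_classify, OpenAIModel  # used later for the judges

# The agent, its tools, and the tracing setup from the previous lesson
from utils import run_agent, start_main_span, tools

# Lets the async evaluation calls run inside the notebook's existing event loop
nest_asyncio.apply()

PROJECT_NAME = "evaluating-agent"

agent_questions = [
    "What was the most popular product SKU?",
    "What was the total revenue across stores?",
    # ...a few more queries along the same lines
]

for question in agent_questions:
    # Each call is traced and sent to the Evaluating Agent project in Phoenix.
    # Adjust the argument shape to whatever your start_main_span expects.
    start_main_span([{"role": "user", "content": question}])
```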
Now, in order to evaluate your agent, there are a few different pieces that make sense to evaluate, as you learned in the previous slides. One of those is the router: you could evaluate the router call that decides which particular tool to call. Or you could evaluate some of the skills, like the lookup sales data skill or the generate visualization skill. The typical evaluation pattern in Phoenix involves exporting spans from your Phoenix instance, using either an LLM-as-a-judge or a code-based eval to add labels to each row of those spans, and then uploading the labels back into Phoenix.

So let's start by evaluating the router itself, using an LLM as a judge. You'll want to export some of these spans from Phoenix. You can do that with code, but first it helps to see how you would filter down to the spans you care about. For the router, what you really care about is the first LLM call made inside the router. If you look at that first LLM call, you can see the user's question, and the response is one of the tools; in this case the lookup sales data tool was the response from the router. That's the information you want to grab and export from Phoenix. You can look at the attributes on that span to see how to identify it: there's an OpenInference span kind of LLM, which is one way you could filter, or you could use any of the other information inside the attributes to filter for this particular kind of span.

Now, back in your notebook, to evaluate your router with an LLM as a judge, there's a template provided as part of the Phoenix library for this kind of evaluation, because it's a very common one to run. This tool calling prompt template is an LLM-as-a-judge template you can use to evaluate the function calling of your router. If you look through it, you can see all of the instructions given to the judge, and you'll notice there are two variables up top and one variable further down that need to be passed into the template: first the question the user asked, and then the tool call that was made. It also looks for the tool definitions, the set of all possible tools and their definitions that could be called, so the LLM judge can compare against them.

Next, you can export the required spans from Phoenix using some of the filters you just saw. First, you can use the span query method to set a filter that narrows down the spans in your project and exports any spans that match. Here you can use span_kind == 'LLM', which will give you all of the LLM spans, the router calls as well as other LLM spans, so you'll do a little more filtering in a second. Then you can choose which columns or attributes to export as part of that call. Here you want to export input.value and llm.tools, using the headers question and tool_call. Those headers are important because, if you scroll up a bit, they match what your template is going to look for.
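In code, that export might look something like the sketch below. The where and select strings follow the attribute names described here, and the project name matches the one used earlier; adjust them if your setup differs.

```python
import phoenix as px
from phoenix.trace.dsl import SpanQuery

# Filter to LLM spans and export the two attributes the judge template needs,
# renamed to the headers the template expects (question and tool_call)
query = (
    SpanQuery()
    .where("span_kind == 'LLM'")
    .select(question="input.value", tool_call="llm.tools")
)

tool_calls_df = px.Client().query_spans(query, project_name="evaluating-agent")

# Keep only the LLM spans that actually made a tool call (i.e., the routing calls)
tool_calls_df = tool_calls_df.dropna(subset=["tool_call"])
```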
So question and tool_call are the headers, and they're filled from input.value and llm.tools, which are attribute names inside Phoenix. If you jump over to Phoenix one more time and look at the attributes, you can see input, and input.value gives you the full messages object that was the input there. If you scroll down further, you can see llm.tools, which shows the tools that could be chosen. So now you can export those values: run the query on your Phoenix client and it will export those particular spans. Notice that the project name is also being passed in. One other thing is being done here to filter down further: because this pulls up all of the LLM spans, even ones that have nothing to do with tool calling or routing, you can use the dropna method to remove any exported spans that have no value for llm.tools. That's an easy way to narrow down to just the LLM spans that have to do with routing.

If you run the cell now, you should see your tool calls dataframe. It should have two columns, question and tool_call, which again match what the template above is going to look for, along with the question from the user and the tool calls that were made. You'll also have a context.span_id column; this corresponds to the particular span inside Phoenix, and it's what allows you to upload the data back into Phoenix and have it match up in a little bit.

Now that you have this dataframe, the next step is to add labels to each row based on your template above. There are a few things going on here. First, llm_classify is a method supplied by Phoenix that takes each row of your dataframe and runs it through a provided template, in this case your tool calling prompt template. It then takes the response and snaps it to either a "correct" or "incorrect" value, which ensures you end up with very specific labels. Sometimes the LLM might respond with "Correct" with a capital C instead of lowercase, and the rails argument snaps it so all the values are consistent. You also provide a model that will actually run the prompt; this is your judge model, in this case GPT-4. And the provide_explanation flag asks the judge to explain why it chose each particular label.

There are a couple of other things going on in this code. One is the suppress_tracing context manager at the top. This has been added because, if you remember, your agent is set up to trace any calls made to OpenAI. If you didn't have this wrapper saying "turn off tracing for everything happening inside here," you would see spans tracked to your project for all of the evaluation calls being made. You want to make sure your LLM-as-a-judge calls don't get traced themselves, so you suppress tracing there. Then, on the template line, there's a bit of wizardry going on to replace the tool definitions placeholder with your tools object. The idea is that, if you look back at your template from earlier, it takes a tool_definitions variable containing a definition of all the possible tools that could be called, and you want to make sure that gets replaced with the JSON dictionary of tools your agent is actually using, so your LLM judge knows all the possible tools that could have been called and can make its judgment appropriately.
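Put together, the judge call might look roughly like this sketch. A couple of hedges: suppress_tracing is imported here from the OpenInference instrumentation package, which is how the course environment exposes it; the judge model string is illustrative; and instead of reproducing the notebook's string-replacement trick on the template, this version supplies the tool definitions as a dataframe column so llm_classify can fill the {tool_definitions} variable itself, which should have the same effect.

```python
import json

from openinference.instrumentation import suppress_tracing  # from the course tracing setup
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify

# Give the judge the full set of tool definitions by adding them as a column, so
# the {tool_definitions} variable in the template gets filled for every row.
# Assumes `tools` (imported from utils) is a JSON-serializable list of definitions.
tool_calls_df["tool_definitions"] = json.dumps(tools)

with suppress_tracing():
    # Keep the judge's own OpenAI calls out of the Evaluating Agent project
    tool_call_eval = llm_classify(
        dataframe=tool_calls_df,
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        rails=["correct", "incorrect"],    # snap responses to consistent labels
        model=OpenAIModel(model="gpt-4"),  # the judge model
        provide_explanation=True,          # ask the judge to justify each label
    )

# Numeric score alongside the label: 1 for correct, 0 for incorrect
tool_call_eval["score"] = (tool_call_eval["label"] == "correct").astype(int)
```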
And finally, one other thing that's happening here: after your LLM judge runs, you also add a score column that is 1 if the judge's label is "correct" and 0 otherwise. That attaches a numeric value to your judge's output, which is used in some of the visualizations. Now you can go ahead and run that, and you'll see an llm_classify progress bar; these calls are made asynchronously thanks to the nest_asyncio library you imported all the way at the beginning of this notebook. You should then see a response that looks like this, with labels as well as explanations, and if you scroll to the right you'll also see scores with values of 1 or 0. So now you've attached labels to your span IDs.

The last thing to do here is upload these back into Phoenix so you can visualize the data there. You can use the log_evaluations method, which takes the name of the eval you want to use as well as the dataframe you just created with all of those labels and context span IDs. Now, if you jump back into Phoenix, you may need to refresh your project, and then you'll see a new value at the top: your tool calling eval. Because you added that score, it can show both a numeric and a percentage value here. If you go into any of your runs and click on the router span, you'll now see that the feedback tab has a new entry, labeled either correct or incorrect, along with an explanation of why that particular label was chosen. You can use these to get more information about what's happening inside your application. You can also filter: if you switch from traces to spans to see all of your spans in Phoenix, you can set a filter on your evals with "tool calling" as the name and then look at just the incorrect ones. Now you'll see the cases where your router was incorrect, and you can click into those and see the feedback explaining why each item was marked incorrect.

Awesome. So now you've evaluated your router using an LLM as a judge. You can add some evaluations for your skills as well. One that makes sense is evaluating the generated code from your generate visualization skill. You could say: I've got a tool here called generate_visualization, so why not export all of the spans for generate_visualization and evaluate the code that was generated there? You'll follow the same pattern as before and start by exporting the spans you want to evaluate. In this case you can simply filter on the name of the span being equal to generate_visualization, and then export the output of that span. If you run this, you'll probably have just one entry, given the examples that have been run. Then, instead of using an LLM as a judge, you can define a code-based evaluator. In this case you can do a very simple evaluation: is the generated code runnable?
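A sketch of that export plus a simple runnability check is below. Assumptions to note: the span name generate_visualization and the idea that the generated code lands in output.value follow the traces described here, and the column name generated_code is just a label chosen for this sketch; check your own span attributes and adjust the select accordingly.

```python
import phoenix as px
from phoenix.trace.dsl import SpanQuery

# Export the generate_visualization spans and the code they produced
query = (
    SpanQuery()
    .where("name == 'generate_visualization'")
    .select(generated_code="output.value")
)
code_gen_df = px.Client().query_spans(query, project_name="evaluating-agent")


def code_is_runnable(code: str) -> bool:
    """Return True if the generated code executes without raising an exception."""
    # Quick cleanup to drop stray quoting or markup wrapped around the code itself
    code = code.strip().strip("`'\"").strip()
    try:
        exec(code)
        return True
    except Exception:
        return False


# Label each exported span and add a numeric score for Phoenix
code_gen_df["label"] = (
    code_gen_df["generated_code"]
    .apply(code_is_runnable)
    .map({True: "runnable", False: "not runnable"})
)
code_gen_df["score"] = (code_gen_df["label"] == "runnable").astype(int)
```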
So the method here takes some generated code, does some quick checks to make sure it doesn't have any extra string values wrapped around it, and then tries to execute that code, returning True or False depending on whether there were exceptions. You can then apply that method to the generated code column of the dataframe you exported above, map the return values of True to "runnable" and False to "not runnable", and again add a score, so that 1 means runnable and 0 means not runnable. If you run that, you'll add in the labels; take a look at your dataframe and you'll now have labels and scores. Finally, you can upload this data back into Phoenix using that same log_evaluations method. Now, if you look in Phoenix, you'll have a new entry at the top, this time for runnable code, and it looks like the generated code wasn't runnable, so in this case it's good that the eval caught that. You could use this at scale to run more checks of your generated code, this time using a code-based evaluator as opposed to an LLM-as-a-judge evaluator.

There are a couple of other tools that might make sense to evaluate. Another good example is the data analysis tool, which generates analysis on top of any data that's been exported. This is another good one to use LLM as a judge for; however, there isn't a template built into Phoenix for clarity of analysis, so you'll want to define your own template in this case. You can define your own template as text, using whatever variables make sense, and then follow the same pattern as before: export the information you care about. One way to export relevant information here is to look at all of your agent spans, which are the top-level spans, and just look at the overall output value, because that's what you can evaluate for whether the result was correctly and clearly communicated. So you can export your top-level agent spans, and again make sure that the column headers you use match whatever is expected by your LLM-as-a-judge prompt. Then you can use the same llm_classify method, with tracing suppressed, this time using your manually defined prompt to run your LLM as a judge. Once that completes, you'll have a new dataframe you can upload back into Phoenix. You can call this one a response clarity eval, and you should see it appear in Phoenix under response clarity; in this case it looks like it scored 100%.

You also had a third and final tool, your SQL code generation and database lookup tool, and you can add a similar evaluation for that, again defining an LLM-as-a-judge prompt, in this case a SQL generation eval prompt. Then you can export the relevant spans: here you can export your LLM spans and then filter down a bit more specifically afterward, by checking whether the question being asked contains the "Generate a SQL query based on a prompt" instruction. Oftentimes the hardest part of setting up evals is filtering down to the right set of spans and coming up with the right criteria.
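For example, a sketch of that export-then-filter step might look like the following. The column names, the project name, and the exact substring are assumptions for this sketch; match the substring to whatever prompt text your SQL generation call actually uses.

```python
import phoenix as px
from phoenix.trace.dsl import SpanQuery

# Export all LLM spans, then narrow down to the SQL-generation calls in pandas
query = (
    SpanQuery()
    .where("span_kind == 'LLM'")
    .select(question="input.value", response="output.value")
)
sql_df = px.Client().query_spans(query, project_name="evaluating-agent")

# Keep only the rows whose prompt contains the SQL-generation instruction
sql_df = sql_df[
    sql_df["question"].str.contains("Generate a SQL query based on a prompt", na=False)
]
```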
So just know that you can filter based on anything inside of the spans, including string manipulation like this or simply checking whether the prompt contains a certain value. Export your data there, once again run your LLM-as-a-judge prompt, and then finally upload that data into Phoenix.

And so now you should have four different evaluators in Phoenix: one for your tool calling and router, one for your SQL code generation, one for your response clarity, and one for your generated visualization code. At this point, you have at least one evaluator set up for each main part of your agent, and you can use these to get some directional indication of how your agent is performing across different types of queries.
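Each of these evals ends with the same upload step, so as a closing reference, here is a minimal sketch of it using the tool-calling eval as the example. The eval_name strings are illustrative; each call just needs a labeled dataframe indexed by the span IDs you exported.

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Log the labeled dataframe back to Phoenix; the span IDs in the index tie each
# label to the span it was computed from. Repeat with the other dataframes and
# names (e.g. "Runnable Code Eval", "Response Clarity Eval", "SQL Gen Eval").
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)
```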