So far, you've learned how to evaluate your agent's skills, its router, and path convergence. You'll now learn how to combine your evaluators into a structured experiment to iterate on and improve your agent. Let's dive in.

Evaluation-driven development is the practice of using evaluations and measurements from your agent to guide where you spend your time improving, iterating on, and developing it. It's made up of a few steps. First, you curate a dataset of test cases, or examples, that you can send through different variants of your agent. Next, you send each of those test cases through those different variants, each time changing things like the model you're using, the prompts you're using, or the agent logic. Then you take the results of all of those experiments and run them through your evaluators. Those evaluators give you a set of scores that you can use to compare the different iterations of your agent on an apples-to-apples basis.

Now, this is presented linearly in the visualization here, but in practice it tends to be more of a cycle, especially as you move your agent into production. LLM apps require iterative improvement as you work on them. Typically, you'll go through this full process, release your agent or get it into other people's hands, and then realize that you want to add different test cases or different evaluators. You can always update and change each of these pieces as you go, and you can create a flywheel where you incorporate information from production back into your development process using evaluation-driven development.

Diving into each of these steps in a little more detail: you start by curating a dataset of test cases. The idea here is to be comprehensive rather than exhaustive. You want a set of examples that are representative of the inputs you expect your agent to receive, and you really only need one or two examples of each type of input you might get. You don't need hundreds and hundreds of examples, especially if many of them are similar. These examples can come from live runs of your agent, or they can be constructed beforehand, either manually or, in some cases, generated using another model. In practice, you often start by constructing the examples yourself and then add to them from live data as you release your agent. Wherever possible, it's good to include expected outputs along with your test cases. You may not always have expected outputs, but if you can include them, they unlock additional types of evaluations. LLM-as-a-judge evals can be used even if you don't have expected outputs, but certain code-based evals do require an expected output to compare against.

With your dataset and test cases in place, you can start to make changes to your agent and track each of those changes.
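To make the dataset-curation step concrete, here is a minimal sketch of what a small set of test cases might look like in Python. The field names (`input`, `expected_output`) and the tool names other than the database lookup tool are illustrative assumptions rather than a required schema; the point is that each case pairs a representative input with a ground-truth output where one exists.

```python
# A minimal, illustrative test-case dataset (field names are assumptions, not a fixed schema).
# Cases with an expected output support code-based comparison against ground truth;
# cases without one can still be scored with LLM-as-a-judge evaluators.
test_cases = [
    {
        "input": "Which stores had the best sales performance in 2021?",
        "expected_output": "database_lookup",     # ground truth: the tool the router should choose
    },
    {
        "input": "Plot monthly sales as a bar chart.",
        "expected_output": "data_visualization",  # hypothetical tool name, for illustration only
    },
    {
        "input": "What trends do you see in this data?",
        "expected_output": None,                  # no ground truth -> judge-based evals only
    },
]
```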
Some changes you'll often test are the prompts used by your agent, the tool definitions that you pass into your router, the router logic itself, the skills or the skill structure, or simply swapping in a new model that you want to try. The practice of sending each of your test cases through one variation of your agent is often referred to as an experiment: you are experimenting with a particular version of your agent, and you use those experiments to record and measure results from each run. Once you have all of your experiment data collected, you can apply the evaluators you built in the previous lessons to the results of those experiments. You can use your code-based evaluators, such as comparison against ground truth and checking whether generated code is runnable, as well as the convergence evaluators that you built. And you can use your LLM-as-a-judge evaluators, like function calling, analysis clarity, and entity correctness.

Now, how can you apply this to your agent? As a quick reminder, your agent is set up with an OpenAI router that uses function calling, plus three different skills that you can evaluate.

First, take an example of how you would set up an experiment around the router. You might have a test case that looks something like this: the input is "Which stores have the best sales performance in 2021?" and the expected output is that the database lookup tool is the tool the router should choose. You might then experiment with different tool descriptions. If you think back to the JSON object that you pass into your LLM router, you might modify the description given to each tool to see whether that improves your router's performance. Once those experiments have run, you can evaluate the results using either a code-based comparison against ground truth, because here you have that expected output, that ground-truth data, or a function-calling LLM-as-a-judge evaluator.

Taking another component of your agent, you have your database lookup tool, or skill, and you might have a test case that looks something like this: the same input from the previous example, but in this case the expected output is the SQL code that you see here. Notice that the SQL code is an intermediate step within your database lookup tool: it first generates SQL, then runs that SQL and gets an output. The thing to keep in mind is that you can evaluate just the SQL-generation part of your database lookup tool. I invite you to pause the video for a second and see if you can think of an experiment and some evaluations that you could run on this database lookup tool. One experiment you could do is testing different SQL-generation prompts; you could also test different models or other pieces as well. Then you can use your code-based comparison against ground truth evaluator, similar to what you saw in previous notebooks, because you have that ground truth to compare against.
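As a rough sketch of what running and scoring one of these experiments could look like in code, the loop below sends each test case through a single router variant, here one with a reworded tool description, and applies a simple code-based comparison against the ground-truth tool choice. It assumes the OpenAI Python SDK v1 interface, and names like `TOOLS_VARIANT_B` and `run_router` are hypothetical; the same pattern applies to the SQL-generation experiment by swapping in a different SQL prompt and comparing the generated SQL against the expected SQL.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One experiment = one agent variant. Here the variant is a tool schema whose
# database lookup description has been reworded (names and wording are illustrative).
TOOLS_VARIANT_B = [
    {
        "type": "function",
        "function": {
            "name": "database_lookup",
            "description": "Run a SQL query over the sales database to answer "
                           "questions about stores, products, and revenue.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
    # ... the variant's other tool definitions would go here
]

def run_router(question: str, tools: list) -> str | None:
    """Ask the router model which tool to call and return the chosen tool name."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the router model is another knob you can vary per experiment
        messages=[{"role": "user", "content": question}],
        tools=tools,
    )
    tool_calls = response.choices[0].message.tool_calls
    return tool_calls[0].function.name if tool_calls else None

# Code-based evaluation against ground truth: did the router pick the expected tool?
routed_cases = [c for c in test_cases if c["expected_output"] is not None]
results = []
for case in routed_cases:
    chosen = run_router(case["input"], TOOLS_VARIANT_B)
    results.append({
        "input": case["input"],
        "chosen_tool": chosen,
        "correct": chosen == case["expected_output"],
    })

accuracy = sum(r["correct"] for r in results) / len(results)
print(f"Router tool-choice accuracy for variant B: {accuracy:.0%}")
```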
Next, you look at the database analysis tool. In this case, you might have a test case like the one you see here, where you have a message from a user and some retrieved data, because the database analysis tool takes in both a user question and some data to analyze. You need both of those pieces in your test case, and in this example you don't actually have an expected output. So once again, I invite you to pause the video and see if you can think of an experiment and an evaluation that you could run in this case. One experiment could be testing different LLMs; you could also test different prompts or other logic changes there too. Then you can use the analysis clarity and entity correctness LLM-as-a-judge evaluations from the previous slides and notebooks to evaluate the results; a minimal sketch of such a judge appears at the end of this lesson.

As you structure all of these pieces together and start to run multiple iterations of your agent with multiple evaluators, building a whole process around this, you can end up with a dashboard, or heads-up display, that looks something like this: each row is one run of your agent, and each column is one of your evaluators. You can then measure the effect of each change you make, not just on one part of the agent, but on every part of the agent holistically. And again, as you move your agent into production, you'll find that you come up with new test cases, new evaluations, and new changes you want to make based on your production monitoring data, and you can bring that back into your testing and development process. This whole experimentation and evaluation-driven development framework enables you not just to create a strong application in development, but also to incorporate production learnings into your development process, creating a flywheel that you can use to build a better and better agent over time.

In this lesson, you learned the purpose of evaluation-driven development, what it is and how it works, and how you can structure experiments around your evals to scale them as you continually improve your agent. In the next notebook, you're going to implement some of these techniques and create the visualization you saw a few slides ago by adding many evaluations to the agent you've been working on so far and structuring them into an experiment.
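As referenced above, here is one way the analysis clarity LLM-as-a-judge evaluation could be sketched for the database analysis tool, where there is no expected output to compare against. The judge prompt wording, the "clear"/"unclear" label set, and the helper name `judge_clarity` are assumptions for illustration, not the exact evaluator template from the notebooks.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge template; the wording and the "clear"/"unclear" label set
# are assumptions, not the exact prompt used in the course notebooks.
CLARITY_JUDGE_TEMPLATE = """You are evaluating the clarity of a data analysis.

[Question]: {question}
[Analysis]: {analysis}

Is the analysis clear, well structured, and easy to follow?
Answer with a single word: "clear" or "unclear"."""

def judge_clarity(question: str, analysis: str) -> str:
    """Label an analysis as 'clear' or 'unclear' using a judge model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the judge model can itself be varied per experiment
        messages=[{
            "role": "user",
            "content": CLARITY_JUDGE_TEMPLATE.format(question=question, analysis=analysis),
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Example: score one run of the database analysis tool; the label becomes one
# column in the results table (the heads-up display) for that run.
label = judge_clarity(
    question="Which stores had the best sales performance in 2021?",
    analysis="(placeholder) The analysis text produced by your database analysis tool goes here.",
)
print(label)  # -> "clear" or "unclear"
```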