Welcome to Evaluating AI Agents, built in partnership with Arize AI. Say you're building an AI coding agent: it might have to carry out a lot of steps to generate good code, such as plan, use tools, reflect, and so on. Using an evaluation-driven development process would make your development much more efficient. In this course, you'll learn how to add observability to your agent-based applications. That way, you will see what the agent is doing every step of the way, so you can evaluate it component by component and efficiently drive improvements at the component level, and then also at the whole-system level. So if you're asking yourself questions like: should you update the prompt at the last step, should you update the logic of the workflow, or should you change the large language model you're using? Having a disciplined, evaluation-driven process will help you a lot in making these decisions in a systematic way, rather than in a "randomly try a lot of things and see what works" kind of way. If you've heard of the idea of error analysis, which is a key concept in machine learning, this course teaches you how to do that in the agentic workflow development process. If you haven't heard of error analysis, that's fine too; this course takes an important set of ideas and shows you how to apply them to develop agentic workflows efficiently. The instructors of this course are John Gilhuly, who is head of developer relations, and Aman Khan, who is director of product, both at Arize AI. It's been fun working with you on this course. Thank you, Andrew. We're excited to teach this course. Thank you.

Say you're building a research agent that searches the web, identifies sources, collects content, summarizes findings, and then maybe iterates if it identifies any weaknesses in its output. When you're building this complex system, you need to evaluate the quality of each step's output. For example, for source selection, you might create a test set that comprises research topics and the corresponding set of expected sources, and then measure the percentage of times that the agent chooses the correct sources. For open-ended tasks like summarization, you can prompt a separate large language model, applying what we call LLM-as-judge, to evaluate the quality of this more open-ended output of a text summary. Apart from testing and improving the quality of your agent's output, you can also evaluate the path taken by the agent to ensure it doesn't get stuck in a loop or repeat steps unnecessarily. And so, in this course, you'll learn how to structure your evaluations to iterate on and improve both the output quality and the path taken by your agent. A minimal code sketch of these two evaluation styles follows below.

You'll do this by creating a code-based agent that operates as a data analyzer. The agent will have access to a set of tools that allow it to connect to a database and perform analysis, a router that identifies which tool to use, and a memory that keeps track of the chat history. You'll collect and evaluate traces of the steps taken by your agent to process a query and visualize the collected data. You'll then learn how to evaluate each tool in your agent's workflow using different types of evaluators. You'll also evaluate whether the router chooses the right tool based on the user's query and whether it extracts the right parameters to execute the tool, and you'll assess the trajectory taken by the agent. Finally, you'll put all of your evaluators into a structured experiment that you can use to iterate on and improve your agent.
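To make the two evaluation styles mentioned above a bit more concrete, here is a minimal Python sketch that is not taken from the course materials: it computes test-set accuracy for a hypothetical source-selection step and grades a summary with an LLM-as-judge prompt. The `select_sources` placeholder, the test topics, and the judge model name are illustrative assumptions, and the judge call assumes an OpenAI-style client with an API key already configured.

```python
# Minimal sketch of (1) test-set accuracy for source selection and
# (2) LLM-as-judge for an open-ended summary. Names here are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# (1) Test-set evaluation: compare the agent's chosen sources against expected ones.
test_set = [
    {"topic": "effects of sleep on memory", "expected": {"pubmed.gov", "nature.com"}},
    {"topic": "history of transformers in NLP", "expected": {"arxiv.org"}},
]

def select_sources(topic: str) -> set[str]:
    """Hypothetical stand-in for the agent's source-selection step."""
    return {"arxiv.org"}

correct = sum(select_sources(case["topic"]) == case["expected"] for case in test_set)
print(f"Source-selection accuracy: {correct / len(test_set):.0%}")

# (2) LLM-as-judge: ask a separate model to grade an open-ended summary.
def judge_summary(topic: str, summary: str) -> str:
    prompt = (
        f"You are grading a research summary on the topic: {topic}\n"
        f"Summary:\n{summary}\n\n"
        "Answer with a single word, 'good' or 'bad', based on accuracy and completeness."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

print(judge_summary("effects of sleep on memory", "Sleep consolidates memories."))
```

In practice, the course builds these kinds of evaluators on top of collected traces rather than ad hoc scripts, but the core ideas of comparing against expected outputs and prompting a judge model are the same.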
While the course focuses on applying evaluation during development, you'll also learn how you can monitor your agent in production. Many people have worked to create this course. I'd like to thank Mikyo King, Xander Song, and Aparna Dhinakaran from Arize AI. And from DeepLearning.AI, Hawraa Salami has also contributed to this course. John and Aman are both experts on the important topic of how to evaluate AI agentic workflows. Let's now go on to the next video, and I hope you enjoy the course and learn a lot from John and Aman.