Instructors: John Gilhuly, Aman Khan
Learn how to add observability to your agent so you can gain insight into its steps and debug it.
Learn how to set up evaluations for the agent's components by preparing test examples, choosing the appropriate evaluator (code-based or LLM-as-a-judge), and identifying the right metrics.
Learn how to structure your evaluations into experiments so you can iterate on and improve both the output quality and the path taken by your agent.
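To give a flavor of what evaluating "the path taken by your agent" can look like in practice, here is a minimal, illustrative sketch of a trajectory check. It is not the course's own code; the tool names and the two scoring functions are assumptions for illustration.

```python
# Illustrative only: a simple code-based trajectory check that compares the
# sequence of tool calls an agent actually made against an expected sequence.
# The tool names below are hypothetical, not the course's agent.

def trajectory_matches(expected: list[str], actual: list[str]) -> bool:
    """Strict check: did the agent call exactly the expected tools, in order?"""
    return actual == expected

def trajectory_precision(expected: list[str], actual: list[str]) -> float:
    """Softer metric: what fraction of the agent's steps were expected steps?"""
    if not actual:
        return 0.0
    expected_set = set(expected)
    return sum(1 for step in actual if step in expected_set) / len(actual)

expected = ["search_papers", "summarize_findings", "write_report"]
actual = ["search_papers", "search_papers", "summarize_findings"]

print(trajectory_matches(expected, actual))               # False: extra search, no report
print(round(trajectory_precision(expected, actual), 2))   # 1.0: no unexpected tools were used
```

Strict matching catches any deviation from the expected path, while a softer metric like the second function tolerates detours; which you choose depends on how much flexibility your agent is allowed in reaching the answer.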
Learn how to systematically assess and improve your AI agent's performance in Evaluating AI Agents, a short course built in partnership with Arize AI and taught by John Gilhuly, Head of Developer Relations, and Aman Khan, Director of Product.
When you're building an AI agent, an important part of the development process is evaluations, or evals. Whether you're building a shopping assistant, a coding agent, or a research assistant, having a structured evaluation process helps you refine its performance systematically, rather than relying on trial and error.
With a systematic approach, you structure your evaluations to assess the performance of each component of the agent as well as its end-to-end performance. For each component, you select the appropriate evaluators, testing examples, and metrics. This process helps you identify areas for improvement so you can iterate on your agent both during development and in production.
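To make the evaluator choice concrete, here is a minimal, illustrative sketch of the two styles mentioned above: a code-based check and an LLM-as-a-judge. It is not the course's code; it assumes the openai Python package with an OPENAI_API_KEY set, and the model name and judge prompt are placeholders.

```python
# Illustrative only: two evaluator styles applied to a single test example.
import json
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

def code_based_eval(output: str) -> bool:
    """Code-based evaluator: a deterministic rule, e.g. 'is the output valid JSON?'"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: correct or incorrect."""

def llm_as_a_judge_eval(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    """LLM-as-a-judge evaluator: ask a model to grade the answer against the question."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name; swap in whatever judge model you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower()

print(code_based_eval('{"total_sales": 1200}'))                  # True
print(llm_as_a_judge_eval("What is 2 + 2?", "2 + 2 equals 4."))  # expected: "correct"
```

Code-based evaluators are cheap and deterministic, which makes them a good fit when correctness can be expressed as a rule; an LLM judge handles more subjective criteria, but its prompt and labels need validation of their own, which the "Improving your LLM-as-a-judge" lesson below addresses.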
In this course, you'll build an AI agent, add observability to visualize and debug its steps, and evaluate its performance component by component. In detail, you'll add tracing to your agent, set up router, skill, and trajectory evaluations, structure those evaluations into experiments, and learn how to improve an LLM-as-a-judge and monitor your agent in production.
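As a taste of what adding observability to an agent's steps can look like, here is a minimal, illustrative sketch that wraps a hypothetical router and skill in OpenTelemetry spans and prints the resulting trace to the console. It is not necessarily the instrumentation used in the course, and it assumes the opentelemetry-sdk package is installed.

```python
# Illustrative only: wrap each agent step in a span so the trace shows
# which route was chosen and what each skill returned.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console; a real setup would send them to a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def run_agent(user_query: str) -> str:
    # One root span per agent run, with child spans for the router and the chosen skill.
    with tracer.start_as_current_span("agent_run") as root:
        root.set_attribute("input.value", user_query)

        with tracer.start_as_current_span("router") as span:
            # Hypothetical routing rule; a real agent would typically call an LLM here.
            route = "search" if "find" in user_query.lower() else "answer"
            span.set_attribute("route.choice", route)

        with tracer.start_as_current_span(f"skill.{route}") as span:
            # Stand-in for a real skill (tool call, retrieval, code execution, ...).
            result = f"[{route}] placeholder result for: {user_query}"
            span.set_attribute("output.value", result)

        root.set_attribute("output.value", result)
        return result

print(run_agent("Find recent papers on evaluating AI agents"))
```

Each span records its inputs and outputs as attributes, so when a run goes wrong you can see which route was chosen and what each step returned instead of re-running the agent blindly.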
By the end of this course, you'll know how to trace AI agents, systematically evaluate them, and improve their performance.
This course is for anyone with basic Python knowledge who wants to learn to evaluate, troubleshoot, and improve AI agents effectively, both during development and in production. Familiarity with prompting an LLM is helpful but not required.
Introduction
Evaluation in the time of LLMs
Decomposing agents
Lab 1: Building your agent
Tracing agents
Lab 2: Tracing your agent
Adding router and skill evaluations
Lab 3: Adding router and skill evaluations
Adding trajectory evaluations
Lab 4: Adding trajectory evaluations
Adding structure to your evaluations
Lab 5: Adding structure to your evaluations
Improving your LLM-as-a-judge
Monitoring agents
Conclusion
Appendix - Resources, Tips and Help
Course access is free for a limited time during the DeepLearning.AI learning platform beta!