AI agents are software applications powered by LLMs. In this first lesson, you'll learn how evaluating LLM systems differs from traditional software testing. We'll then examine what agents are and discuss what you need to consider when evaluating them. Let's dive in.

When you think about evaluation in general, there are typically two layers. On the left-hand side, you have LLM model evaluation. This focuses on how well a large language model performs a specific task. You might have seen benchmarks like MMLU, with questions spanning math, philosophy, medicine, and more, or HumanEval for code generation tasks. Providers often use these benchmarks to showcase how their foundation models stack up. On the right-hand side, you have LLM system or application evaluation. This zeroes in on how well your overall application, of which the LLM is just one part, performs. Here, the data sets used for evaluation are created either manually, automatically, or synthetically, using real-world data. When you incorporate an LLM into a broader system or product, you'll want to understand whether the entire system, including the prompts, tools, memory, and routing components, is delivering the results that you expect.

You may have experience with traditional software testing, where systems are largely deterministic. You can picture that like a train on a track: there's usually a well-defined start and end, and it's often straightforward to check whether each piece, the train or the track, is working correctly. LLM-based systems, on the other hand, can feel more like driving a car through a busy city. The environment is variable and the system is non-deterministic. With software testing, you rely on unit tests to check individual parts of the system and integration tests to make sure they fit together as intended, and the results are usually deterministic. Unlike traditional software testing, when you give the same prompt to an LLM multiple times, you might see slightly different responses, just like how drivers can behave differently in city traffic. You're often dealing with more qualitative or open-ended metrics, like the relevance or coherence of the output, which might not fit neatly into a strict pass/fail test model.

There are a number of common evaluation types for LLM systems. Hallucination: is the LLM accurately using the provided context, or is it making things up? Retrieval relevance: if the system retrieves documents or context, are they actually relevant to the query? Question answering accuracy: does the response match the ground truth or user needs? Toxicity: is the LLM outputting harmful or undesirable language? And overall performance: how well is the system performing at its goal? There are many open source tools and data sets that can help you measure these aspects, as well as help you develop your own evals, and we'll cover a few of these later in the course.

As soon as you move from an LLM-based app to an agent, you're adding an extra layer of complexity. Agents use LLMs for reasoning, but they also take actions on your behalf by choosing among tools, APIs, or other capabilities. You can think of an agent as a software-based system that can take actions on behalf of a user, utilizing reasoning. An agent typically has three main components: reasoning, which is powered by the LLM; routing, which is deciding which tool or skill to use; and action, which is executing the tool call, API call, or code.
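To make those three components concrete, here is a minimal sketch of an agent loop in Python. It is only an illustration under assumptions: `call_llm`, the tool functions, and the JSON response format are hypothetical placeholders, not part of any specific library or of the tooling used in this course.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; a real implementation would hit a model API."""
    raise NotImplementedError

# Hypothetical tools the agent can route between.
TOOLS = {
    "search_flights": lambda origin, destination: f"Flights from {origin} to {destination}",
    "search_hotels": lambda city: f"Hotels in {city}",
}

def run_agent(user_request: str) -> str:
    # 1. Reasoning: ask the LLM to plan, returning a structured decision.
    plan = call_llm(
        f"Request: {user_request}\n"
        f"Choose one tool from {list(TOOLS)} and its arguments. "
        'Respond as JSON: {"tool": ..., "args": {...}}'
    )
    decision = json.loads(plan)

    # 2. Routing: select the tool the LLM decided on.
    tool_fn = TOOLS[decision["tool"]]

    # 3. Action: execute the tool call, then let the LLM compose the reply.
    result = tool_fn(**decision["args"])
    return call_llm(f"Write a friendly, accurate reply using this result: {result}")
```

Each of those numbered steps is something you can evaluate separately, which is exactly what the rest of this lesson walks through.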
You may have already seen examples of agent use cases, like personal assistant agents that help you take notes or transcribe information for you, desktop or browser-based agents that help automate repetitive tasks, agents for data scraping and summarization, and agents that can conduct search and compile research.

Let's use an example use case to illustrate how an agent actually works. Suppose you want an agent to book a trip to San Francisco. There's a lot going on behind the scenes. First, the agent has to figure out which tool or API it should call based on your request. It needs to understand what you're really asking for and which resources will help. Next, it might call a search API to check available flights or hotels, and it could decide to ask you follow-up questions or refine how it constructs the request for that tool. Finally, you want it to return a friendly and accurate response, ideally with the correct trip details.

Now let's think about how you'd evaluate this step by step. Did the agent pick the right tool in the first place? When it formed a search or booking request, did it call the correct function with the right parameters? Is it using your context, for instance the dates, preferences, and location, accurately? How does the final response look? Does it have the right tone, and is it factually correct?

In this system, there's plenty that can go wrong. Maybe the agent returns flights to San Diego instead of San Francisco. That might work for some people, but anyone who wanted San Francisco and ended up with San Diego would be unhappy. This highlights why you need to evaluate not just the LLM's raw output, but also how the agent decides on each action along the way. You might see issues like the agent calling the wrong tool, misusing context, or even adopting a snarky or inappropriate tone. Sometimes users also try to jailbreak the system, which can create even more unexpected outputs. To evaluate each of these factors, you can use human feedback (human in the loop) or an LLM itself as a judge to assess whether the agent's final response is truly fulfilling your requirements. We'll dive deeper into how LLM-as-a-judge works with agent evaluation later in this course.

Finally, remember that even small changes to your prompts or code can have unexpected ripple effects. For instance, adding a simple line like "remember to respond politely" might help you improve on a number of use cases, but can also cause a regression in test cases that you may not expect. That's why you want to maintain a representative set of test cases or data sets that reflect your critical use cases. Each time you adjust your system, you can rerun evals on these data sets to catch regressions and keep building out new agent capabilities over time. This approach is key to developing robust agent evaluations.

Just like you do with traditional software, you want to iterate on an agent's performance. However, you can't rely on purely deterministic checks. Agents themselves are non-deterministic, can take multiple paths, and might regress in one scenario even as they improve in another. To handle this, you want a consistent set of tests covering different use cases that get run every time you make a change. These tests often loop back from production data, like real-world queries and real user interactions; a rough sketch of what such an eval run might look like follows below.
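Here is a minimal, hedged sketch of that kind of eval run for the trip-booking example. It assumes a hypothetical `run_agent` function that returns the tool it chose along with its final response, and a hypothetical `llm_judge` helper for the qualitative check; neither is a real library API. The checks mirror the questions above: tool selection, context usage, and an LLM-as-a-judge score for tone and factuality.

```python
# Hypothetical test cases, ideally drawn from real production queries.
TEST_CASES = [
    {"query": "Book me a trip to San Francisco next weekend",
     "expected_tool": "search_flights",
     "must_mention": "San Francisco"},
    {"query": "Find a hotel near Union Square for two nights",
     "expected_tool": "search_hotels",
     "must_mention": "Union Square"},
]

def run_evals(run_agent, llm_judge):
    """Rerun this whenever prompts, tools, or routing logic change."""
    results = []
    for case in TEST_CASES:
        tool_used, response = run_agent(case["query"])

        # Deterministic check: did routing pick the expected tool?
        tool_ok = tool_used == case["expected_tool"]

        # Context check: is the key detail preserved (San Francisco, not San Diego)?
        context_ok = case["must_mention"].lower() in response.lower()

        # Qualitative check: LLM-as-a-judge scores tone and factual consistency.
        judge_score = llm_judge(
            f"On a 1-5 scale, how polite and factually consistent is this reply "
            f"to the request '{case['query']}'?\nReply: {response}"
        )

        results.append({"query": case["query"], "tool_ok": tool_ok,
                        "context_ok": context_ok, "judge_score": judge_score})
    return results
```

Rerunning a set like this after every change is what lets you catch San Diego-style regressions before your users do.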
Back in development, you can refine your prompts, tools, or overall approach. This loop helps you catch regressions when your agent is deployed in production, and continuously expand your test coverage to improve your system over time.

A few of the tools that we're going to be covering in this course are trace instrumentation, to understand what's happening with your agent underneath the hood; an eval runner, which includes LLM-as-a-judge; data sets, which you can use to rerun experiments; human feedback, which lets you capture human annotations in production; and a prompt playground you can use for iterating on your prompts with your data.

In this lesson, you took a broad look at how LLM model evaluation compares to system evaluation; why LLM-based apps require different testing approaches than traditional software; the extra complexity that agents bring with reasoning, routing, and action; common pitfalls with agents, like wrong tool selection or poor context usage; and the importance of iterative testing using a single set of tools from development to production. You'll get to see all of this in action and dive into practical code soon. In the following lessons, you'll look even more closely at how to evaluate an agent, what data to collect, and how to structure these evaluations to keep your agent on track in the real world.
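As a small preview of what trace instrumentation looks like, here is a hedged sketch that wraps an agent's routing decision and tool call in spans. It uses generic OpenTelemetry purely as a stand-in; the specific tracing tooling used in this course may expose a different API, and the span names and placeholder values here are illustrative only.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Send spans to the console so you can inspect the agent's steps locally.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("agent-demo")

def traced_agent(query: str) -> str:
    with tracer.start_as_current_span("agent.run") as root:
        root.set_attribute("input.query", query)

        with tracer.start_as_current_span("agent.route") as route_span:
            tool = "search_flights"  # placeholder for the real routing decision
            route_span.set_attribute("tool.selected", tool)

        with tracer.start_as_current_span(f"tool.{tool}") as tool_span:
            result = "SFO flight options..."  # placeholder for the real tool call
            tool_span.set_attribute("tool.output", result)

        root.set_attribute("output.response", result)
        return result
```

In general, spans like these are the raw material that eval runners and data sets can be built on, since each one records an individual routing decision or tool call that you can later evaluate, annotate, or replay.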