After you develop your agent and improve it until it's production ready, you need to apply the same techniques of tracing, evals, and experimentation to your agent in production. Things like code changes or model updates can degrade the performance of your agent, so you can use your evals and run experiments to continuously monitor and improve it. Let's get to it.

As a quick recap, you've covered four main steps that take you from an initial agent prototype all the way through to a production-ready system. First, choosing the right architecture: you need to decide on an agent framework that matches your use case. Second, deciding which evals to use: figuring out which metrics really matter for your system, such as accuracy, latency, and convergence. Third, building your evaluation structure, which includes the prompts, tools, and data you'll use to measure performance. Fourth, iterating with your data: you keep refining your agent by analyzing results, adjusting prompts or logic, and testing again. This cycle repeats as you move from development to production and back again, so you can catch issues early and refine based on real-world feedback.

When you actually reach production, you may find that the simple agent designs you used initially need to scale up in complexity. You might end up adopting a multi-agent system where agents can call other specialized agents, or a multimodal system where your agent handles different types of data, like text, images, or audio, or a continuously improving system where your agent learns from user interactions in real time, either through manual updates or automated processes.

So what's different about production? Well, you often discover new failure modes. Users might ask questions that you never saw in development, or reference something that your system doesn't know about yet, like a brand-new product. If your agent now calls out to additional APIs or other agents, there are more chances for errors or unexpected outputs. You might also try A/B testing or different model strategies that introduce surprising regressions you didn't anticipate in a controlled environment.

The encouraging part here is that a lot of the tools you used in development, like instrumentation and feedback loops, are just as valuable in production. You'll continue to collect the metrics you defined for evaluation, and you'll rely on continuous integration and continuous delivery (CI/CD) flows and ongoing experiments to keep tabs on your agent's performance. In other words, you can use those same tracing and annotation methods you had in development, only now they're enriched by real user data. You'll gather feedback from actual interactions, label any problematic outputs, and identify where your agent might be struggling.

Collecting user feedback for evals is crucial in production. You can take real usage data, for example every user query or interaction, and attach human-labeled annotations to highlight issues or successes, as sketched in the logging example below. If your eval metrics disagree with what real users say, that's a sign you might need to recheck your system or your eval. Maybe you're measuring the wrong metric, or maybe there's a deeper logic flaw in the agent's flow.

You also want to keep track of metrics over time to understand efficiency or execution dependencies. For example, if you're using convergence evals, you can see how many steps it takes for the agent to reach a correct answer.
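One minimal way to capture that kind of production data is to log every interaction as a structured record with room for a human label. This is just a sketch: the record fields and the log_interaction helper are hypothetical and aren't tied to any particular tracing library.

```python
import json
import time
import uuid
from pathlib import Path

# Hypothetical trace log: one JSON record per production interaction.
TRACE_LOG = Path("production_traces.jsonl")

def log_interaction(query: str, answer: str, steps: int, latency_s: float,
                    feedback: str | None = None) -> str:
    """Append one interaction record to the trace log.

    `feedback` holds a human label such as 'correct' or 'wrong' once an
    annotator (or a thumbs-up/down widget) has reviewed the interaction.
    """
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "answer": answer,
        "steps": steps,          # agent steps taken before the final answer
        "latency_s": latency_s,  # end-to-end latency the user experienced
        "feedback": feedback,
    }
    with TRACE_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```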
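Building on that same hypothetical record format, a convergence metric like average steps to a correct answer is then just an aggregation over the log:

```python
import json
from pathlib import Path
from statistics import mean

def average_steps_to_answer(trace_path: Path = Path("production_traces.jsonl")) -> float:
    """Average number of agent steps across interactions labeled 'correct'."""
    steps = []
    with trace_path.open() as f:
        for line in f:
            record = json.loads(line)
            if record.get("feedback") == "correct":
                steps.append(record["steps"])
    return mean(steps) if steps else float("nan")

# Track this number over time and compare it before and after each change.
print(f"Average steps to a correct answer: {average_steps_to_answer():.2f}")
```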
If that step count grows after you tweak your prompts, it can indicate a regression. Since you're likely calling external LLM APIs, you should track those calls, too. They can have a significant impact on latency and costs, which can affect your end-user experience. Your choice of model, perhaps a large reasoning model versus a smaller, faster one, may vary based on the complexity of tasks in production. Keeping an eye on those decisions and their downstream effects is part of robust production monitoring.

When you gather human feedback and run your evals on real traffic, you'll gain a clearer picture of how changes, like swapping out a model or updating a prompt, impact the overall system. You can rerun the same experiments you used in development, but now on new data or new failure modes that you've discovered in production. This is why you'll want to maintain consistent datasets and keep augmenting them with production samples. One effective approach is curating golden datasets that capture your most critical use cases and known failure modes. Each time you push a change, like adjusting a prompt or logic, you can recheck these datasets to be sure you haven't broken something you've already solved. Experiments can act like gates for shipping changes, so you can decide whether to roll forward or roll back a change; you'll see a minimal sketch of such a gate below.

Let's imagine you have a self-improving agent that collects feedback automatically whenever a user interacts with your system. You can add those user examples, both successful and failed interactions, to a continuously updated dataset. Then you run CI/CD experiments on this dataset, checking whether new versions of the agent do better or worse on the very latest real-world scenarios. When you refine the agent logic or tweak prompts, you can also incorporate few-shot examples from the newly collected data. This lets your system learn from mistakes and gradually converge towards better performance on exactly the tasks that matter most, in a self-improving and automated manner. There's a sketch of this loop below, too.

What you're doing here is essentially applying evaluation-driven development in production. You're watching for new failure modes, monitoring how well your agent handles them, and automatically feeding production feedback back into your evals.

In this lesson, you've learned what to watch out for, new queries, complicated architectures, and unexpected user behavior, and how to keep your agent on track by continuously measuring and refining its performance. You've also seen how the tools and data you used in development carry over seamlessly to production, and how you might scale those approaches for even more robust agent systems. By using CI/CD pipelines with consistent experiments, you can ensure that every update to your agent maintains or improves on the quality you've already achieved. Now, you have a solid foundation for monitoring your agents in production and continuously refining their performance based on real-world data.
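To make the golden-dataset gate concrete, here's a minimal sketch. The dataset format, the run_agent placeholder, the naive substring scoring, and the 90% threshold are all assumptions for illustration; you'd swap in your real agent entry point and whatever eval metric you actually use.

```python
import json
import sys
from pathlib import Path

# Hypothetical golden dataset: one {"query": ..., "expected": ...} object per line.
GOLDEN_DATASET = Path("golden_dataset.jsonl")
PASS_THRESHOLD = 0.90  # block the release if accuracy drops below this

def run_agent(query: str) -> str:
    """Placeholder for your real agent entry point."""
    raise NotImplementedError

def golden_accuracy() -> float:
    examples = [json.loads(line) for line in GOLDEN_DATASET.open()]
    correct = sum(
        1 for ex in examples
        # Naive substring check; replace with whatever eval metric you actually use.
        if ex["expected"].lower() in run_agent(ex["query"]).lower()
    )
    return correct / len(examples)

if __name__ == "__main__":
    accuracy = golden_accuracy()
    print(f"Golden-dataset accuracy: {accuracy:.1%}")
    # A non-zero exit code lets a CI/CD pipeline treat a regression as a failed gate.
    sys.exit(0 if accuracy >= PASS_THRESHOLD else 1)
```

Wired into a CI/CD pipeline, that exit code is what turns the experiment into a gate: a failing run blocks the release until you fix the change or roll it back.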
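And here's one way the self-improving loop described above could look: failed interactions become new eval cases, while successful ones become candidate few-shot examples. This reuses the hypothetical trace-record format from earlier and assumes an annotator adds a corrected answer to failed records; your own pipeline will differ.

```python
import json
from pathlib import Path

TRACE_LOG = Path("production_traces.jsonl")    # trace records from the earlier sketch
GOLDEN_DATASET = Path("golden_dataset.jsonl")  # eval cases used by the CI gate above

def harvest_production_examples(max_few_shot: int = 5) -> list[dict]:
    """Fold labeled production interactions back into evals and prompts."""
    few_shot = []
    with TRACE_LOG.open() as traces, GOLDEN_DATASET.open("a") as golden:
        for line in traces:
            record = json.loads(line)
            if record.get("feedback") == "wrong":
                # Failures become new eval cases so the same mistake stays caught.
                golden.write(json.dumps({
                    "query": record["query"],
                    # Assumes an annotator fills in a corrected answer for failures.
                    "expected": record.get("corrected_answer", ""),
                }) + "\n")
            elif record.get("feedback") == "correct" and len(few_shot) < max_few_shot:
                # Successes become candidate few-shot examples for the agent's prompt.
                few_shot.append({"query": record["query"], "answer": record["answer"]})
    return few_shot
```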