In this module, I'd like to share practical tips for building agentic AI workflows. I hope these tips will make you much more effective than the typical developer at building these types of systems. I find that when developing an agentic AI system, it's difficult to know in advance where it will work well and where it won't, and thus where you should focus your effort. So a very common piece of advice is to build even a quick-and-dirty system to start, so you can try it out, see where it isn't yet working as well as you'd like, and then focus your efforts on developing it further. In contrast, I find it's often less useful to sit around for too many weeks theorizing and hypothesizing about how to build it. It's usually better to just build something quickly, in a safe, responsible way that doesn't leak data, so you can look at it and use that initial prototype to prioritize further development.

Let's start with an example of what might happen after you've built a prototype. As our first example, I want to use the invoice-processing workflow you've seen previously, where the task is to extract four required fields and save them to a database record. After building such a system, one thing you might do is find a handful of invoices, maybe 10 or 20, run them through, and look at the outputs to see what went well and whether there were any mistakes. Say you look through 20 invoices. You find that invoice 1 is fine; the output looks correct. For invoice 2, maybe it confused the date of the invoice, that is, when the invoice was issued, with the due date, and in this task we want the due date so we can issue payments on time. So I might note down in a document or a spreadsheet that for invoice 2, the dates were mixed up. Maybe invoice 3 was fine, invoice 4 was fine, and so on. But as I go through these examples, I find there are quite a lot where it mixed up the dates. Based on going through a number of examples like this, you might conclude that one common error mode is that the system struggles with dates. In that case, one thing to consider is, of course, figuring out how to improve your system so it extracts due dates better, but also writing an eval to measure the accuracy with which it extracts due dates. In comparison, if you had found that it was extracting the biller address incorrectly, maybe because you have billers with unusual names, or international billers whose names may not even be written in English letters, then you might instead focus on building an eval for the biller address. So one reason building a quick-and-dirty system and looking at the output is so helpful is that it helps you decide what to put the most effort into evaluating.

Now, if you've decided to modify your system to improve the accuracy with which it extracts the invoice's due date, then to track progress it's a good idea to create an evaluation, or eval, to measure the accuracy of date extraction. There are probably multiple ways to go about this, but let me share how I might do it.
To create a test set, or evaluation set, I might find 10 to 20 invoices and manually write down the due date for each. Maybe one invoice has a due date of August 20th, 2025, and I write it down in a standard year-month-day format. Then, to make it easy to evaluate in code later, I would write the prompt so that the LLM always formats the due date in this same year-month-day format. With that, I can write code to extract the one date the LLM has output, which is the due date, because that's the one date we care about. This is a regular expression doing pattern matching: four digits for the year, two for the month, two for the day, and it pulls that out. Then I can write code to test whether the extracted date equals the actual date, that is, the ground-truth annotation I had written down. So with an eval set of, say, 20 or so invoices, I can make changes and check whether the percentage of the time it gets the due date right goes up as I tweak my prompts or other parts of my system.
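To make this concrete, here is a minimal sketch of what that due-date eval might look like in Python. The workflow outputs and ground-truth labels below are made-up placeholders; in practice you would run your 20 or so annotated invoices through your actual workflow and score whatever it returns.

```python
import re

# Made-up (workflow output, hand-labeled due date) pairs; in practice the
# first element would come from running your invoice workflow on each invoice.
eval_set = [
    ("Biller: Acme Corp ... Due date: 2025-08-20", "2025-08-20"),
    ("Biller: Globex Inc ... Due date: 2025-09-01", "2025-09-15"),  # a deliberate mistake
]

# The prompt asks the LLM to always emit the due date as YYYY-MM-DD,
# so a simple regular expression can pull it back out.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_due_date(llm_output: str) -> str | None:
    """Return the single YYYY-MM-DD date found in the LLM's output, if any."""
    match = DATE_PATTERN.search(llm_output)
    return match.group(0) if match else None

num_correct = sum(
    extract_due_date(output) == ground_truth
    for output, ground_truth in eval_set
)
print(f"Due-date accuracy: {num_correct}/{len(eval_set)}")
```

The exact string comparison only works because the prompt pins the output to a single unambiguous date format, which is why it's worth specifying that format in the prompt in the first place.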
Just to summarize what we've seen so far: we build a system, then look at outputs to discover where it may be behaving in an unsatisfactory way, such as getting due dates wrong. Then, to drive improvements to this important output, we put in place a small eval with, say, just 20 examples to help us track progress. This lets me go back to tweak prompts, try different algorithms, and so on, and see whether I can move up this metric of due-date accuracy. This is what improving an agentic AI workflow will often feel like: look at the output, see what's wrong, and if you know how to fix it, just fix it. But if you need a longer process of improvement, put an eval in place and use it to drive further development. One other thing to consider: if after working for a while you think those initial 20 examples aren't good enough, maybe they don't cover all the cases you want, or maybe 20 examples is just too few, you can always add to the eval set over time so it better reflects your own judgment of whether the system's performance is satisfactory.

That's just one example. For the second example, let's look at building a marketing copy assistant that writes captions for Instagram, where, to keep things succinct, let's say our marketing team tells us they want captions that are at most 10 words long. So we would have an image of a product, say a pair of sunglasses we want to market, a user query like "please write a caption to sell these sunglasses," and then an LLM, or large multimodal model, that analyzes the image and the query and generates a description of the sunglasses. There are lots of different ways a marketing copy assistant may go wrong, but let's say you look at the output and find that the generated copy mostly sounds okay, it's just sometimes too long. For the sunglasses input it generates 17 words; the coffee machine is fine; the stylish blue shirt comes out at 14 words; the blender at 11 words. So it looks like in this example the LLM is having a hard time adhering to the length guideline. Again, lots of things could go wrong with a marketing copy assistant, but if you find it's struggling with the length of the output, you might build an eval to track this so you can make improvements and make sure it's getting better at adhering to the length guideline.

To create an eval that measures text length, you might build a set of test cases, maybe a pair of sunglasses, a coffee machine, and so on, with perhaps 10 to 20 examples. Then you would run each of them through your system and write code to measure the word count of the output; this is just a few lines of Python. Lastly, you would compare the length of the generated text to the 10-word target: if the word count is at most 10, increment a counter of correct outputs. One difference between this and the previous invoice-processing example is that there is no per-example ground truth; the target is just 10, the same for every single example. In contrast, for the invoice-processing example we had to write down a custom target label, the correct due date of each invoice, and we tested the outputs against that per-example ground truth. I know I used a very simple workflow for generating these captions, but these types of evals can be applied to much more complex generation workflows as well.
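Here is a minimal sketch of that length eval in Python. The captions below are made-up placeholders standing in for whatever your system actually generates for each test image.

```python
# Made-up captions standing in for your system's outputs on each test case.
captions = {
    "sunglasses": "Chase the sun in timeless style with our new polarized shades.",
    "coffee_machine": "Barista-quality espresso at home.",
    "blender": "Smoothies, soups, and sauces in seconds.",
}

MAX_WORDS = 10  # the same target for every example, so no per-example ground truth

num_correct = 0
for product, caption in captions.items():
    word_count = len(caption.split())  # simple whitespace-based word count
    if word_count <= MAX_WORDS:
        num_correct += 1
    print(f"{product}: {word_count} words")

print(f"Within the length limit: {num_correct}/{len(captions)}")
```

Because every example shares the same 10-word target, there is nothing to annotate per example; the eval is just this code plus a representative set of test inputs.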
Let me touch on one final example, in which we'll revisit the research agents we've been looking at. If you look at the output of the research agent on different input prompts, let's say that when you ask it to write an article on recent breakthroughs in black hole science, you find it missed a high-profile result that got a lot of news coverage. That's an unsatisfactory result. If you ask it to research renting versus buying a home in Seattle, it seems to do a good job. For robotics for harvesting fruit, it didn't mention a leading equipment company. Based on this review, it looks like the agent sometimes misses a really important point that an expert human writer would have captured. So I would create an eval to measure how often it captures the most important points. For example, you might come up with a number of example prompts on black holes, robotic harvesting, and so on, and for each one write, say, three to five gold-standard discussion points. Notice that here we do have a per-example annotation, because the gold-standard talking points, that is, the most important points to cover, are different for each example. With these ground-truth annotations, you might then use an LLM-as-a-judge to count how many of the gold-standard talking points were mentioned. An example judge prompt might say: determine how many of the five gold-standard talking points are present in the provided essay. You give it the original prompt, the essay text, the gold-standard points, and so on, and have it return a JSON object with two keys: a score from zero to five counting how many of the points are covered, and an explanation. This gives you a score for each prompt in your evaluation set. In this example, I'm using an LLM-as-a-judge to count how many of the talking points were mentioned because there are so many different ways to phrase these points that a regular expression or simple pattern-matching code might not work well. That's why you might use an LLM-as-a-judge and treat this as a slightly more subjective evaluation of whether, say, event horizons were adequately discussed. So that's your third example of how you might build evals. As you think about how to build evals for your own application, the evals you build will often have to reflect whatever you see, or worry about, going wrong in that application.
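Here is a rough sketch of what that judge step could look like in code. I'm assuming the OpenAI Python SDK purely as an example model client; any LLM API would work, and the model name, prompt wording, essay, and gold-standard points are all placeholders.

```python
import json
from openai import OpenAI  # example client only; substitute whatever LLM API you use

client = OpenAI()

JUDGE_PROMPT = """Determine how many of the gold-standard talking points below
are present in the provided essay.

Original research prompt:
{original_prompt}

Essay:
{essay}

Gold-standard talking points:
{points}

Return a JSON object with two keys:
  "score": an integer from 0 to {n} counting how many talking points are covered
  "explanation": a brief justification for the score
"""

def judge_talking_points(original_prompt: str, essay: str, gold_points: list[str]) -> dict:
    """Ask an LLM judge how many gold-standard talking points the essay covers."""
    prompt = JUDGE_PROMPT.format(
        original_prompt=original_prompt,
        essay=essay,
        points="\n".join(f"- {p}" for p in gold_points),
        n=len(gold_points),
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
    )
    return json.loads(response.choices[0].message.content)

# Illustrative usage for one eval example; the essay and points are placeholders.
result = judge_talking_points(
    original_prompt="Write an article on recent breakthroughs in black hole science.",
    essay="(the research agent's generated article goes here)",
    gold_points=[
        "Event horizon imaging results",
        "Gravitational-wave detections of black hole mergers",
        "New findings on black hole jets",
    ],
)
print(result["score"], result["explanation"])
```

Running this over each example in your eval set and averaging the scores gives you a single metric to track as you modify the agent.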
It turns out that, broadly, there are two axes of evaluation. One axis is how you evaluate the output: in some cases you evaluate it by writing code, for more objective evals, and sometimes you use an LLM-as-a-judge, for more subjective evals. The other axis is whether or not you have a per-example ground truth. For checking invoice date extraction, we wrote code to check whether we got the actual date, and that had a per-example ground truth because each invoice has a different correct date. In the example where we checked marketing copy length, every example had the same length limit of 10, so there was no per-example ground truth for that problem. In contrast, for counting gold-standard talking points, there was a per-example ground truth, because each article had different important talking points, but we used an LLM-as-a-judge to read the essay and see whether those topics were adequately mentioned, since there are so many different ways to mention them. The last of the four quadrants would be an LLM-as-a-judge with no per-example ground truth. One place we saw that was grading charts with a rubric: when we were visualizing the coffee machine sales and asked the system to create a chart, we graded it against a rubric, such as whether it has clear axis labels and so on. The rubric is the same for every chart, so that's an LLM-as-a-judge without a per-example ground truth. I find this two-by-two grid a useful way to think about the different types of evals you might construct for your application. By the way, these are sometimes also called end-to-end evals, because one end is the input, the user's query or prompt, and the other end is the final output, so all of these evaluate the entire end-to-end system's performance.

To wrap up this video, I'd like to share a few final tips for designing end-to-end evals. First, quick-and-dirty evals are fine to get started. I see quite a lot of teams that are almost paralyzed because they think building evals is a massive multi-week effort, and so they take longer than would be ideal to get started. But just as you iterate on an agentic workflow and make it better over time, you should plan to iterate on your evals as well. So put in place 10, 15, or 20 examples as your first cut at evals, write some code or try prompting an LLM-as-a-judge, just do something to start getting some metrics that can complement the human eye looking at the output, and let a blend of the two drive your decision making. As the evals become more sophisticated over time, you can shift more and more of your trust to the metric-based evals, rather than needing to read over hundreds of outputs every time you tweak a prompt somewhere. As you go through this process, you'll likely find ways to keep improving your evals as well. If you had 20 examples to start, you may run into places where your evals fail to capture your judgment about which system is better: maybe you update the system, look at it, and feel it clearly works much better, but your eval fails to show the new system achieving a higher score. When that happens, it's often an opportunity to collect a larger eval set or change the way you evaluate the output so it corresponds better to your judgment of which system is actually working better. So your evals will get better over time.

Lastly, in terms of using evals to gain inspiration for what to work on next, a lot of agentic workflows are used to automate tasks that humans can do. For such applications, I look for places where the performance is worse than that of an expert human, and that often points me to where to focus my efforts, or to the types of examples on which I should get my agentic workflow working better than it currently does. So I hope that after you've built that quick-and-dirty system, you think about when it would make sense to start putting in some evals to track the potentially problematic aspects of the system, and that this will help you drive improvements. In addition to helping you drive improvements, it turns out there's a way of using evals that helps you home in on which components of your entire agentic system are most worth focusing your attention on. Agentic systems often have many pieces, so which piece will be most productive for you to spend time improving? Being able to answer that well is a really important skill for driving efficient development of agentic workflows. In the next video, I'd like to dive deep into this topic. So let's go on to the next video.