You have learned so much throughout this course, and I bet you're excited to get all these systems into production. But now I want to focus on everything that goes into making sure these systems run in a reliable way. So this video is going to be packed with tips and tricks for actually nailing down that reliability, to make sure your crew runs with the best possible quality, from testing to training to code execution to reasoning agents and so much more. This is where we start talking about reliability and repeatability, and that is everything in production. So let's talk about tactics for getting your agents into production.

By now, if you have followed the lessons in the Jupyter notebooks, you should be super comfortable building agents. You can create them in Jupyter notebooks, through a CLI on your computer, or even on the CrewAI platform with Crew Studio. But the question here is whether you can run them reliably. How consistently do they produce good results out of 100 runs? And more importantly, in the end, it's all about trust. Can you trust these agents? Can you confidently build and deploy these agents into production, knowing that they're going to work? If you can't, well, then you have a problem. But we want to make sure that we fix that.

So how do you build agents that you trust? If you really want to get to a position where you can trust them, and where you can stand by their results, that's where a lot of the things we talked about early on come together at once. The ability to bring some deterministic controls to these agentic systems is what allows you to get reliable, repeatable outcomes, and therefore agents that you can trust as well. Some of these controls you're already familiar with: for example, flows, which we covered a few lessons ago, that you can use to add an entire backbone and structure for your agents; or guardrails, which we also covered a few lessons ago, that let you check specific outputs. Those two we have already discussed in depth.

There are still a few others that we haven't touched on and that are worth flagging. For example, reasoning agents, where you force your agents to do a reasoning step in which they think about the plan they will follow before they actually start on their tasks. Or human-in-the-loop oversight, where you can ask your agents to check back with you before they call the work done, so you can provide extra feedback in case they need to do anything more. You also have testing, where you can compare many different LLMs to decide which is the best LLM for you, in terms of quality, speed, and everything in between. And there's training, where you enforce a specific format on your agents' behavior, training them to behave in a certain way through memories; we're going to talk about that in a second. I also want to flag structured outputs as a way to enforce a specific outcome, and safe code execution. Again, there's a lot in here that we haven't talked about yet, and I want to make sure that we spend some time with it. Then we'll kick things off with reasoning agents, where you add reasoning by basically requiring your agents to do some pre-planning.
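Since structured outputs only get a brief mention above, here's a quick sketch of what they look like in code. This is a minimal example, not the exact code from the course: the Trend model and the analyst agent are hypothetical placeholders, but output_pydantic is the Task attribute CrewAI uses to coerce a task's final answer into a typed schema.

```python
from pydantic import BaseModel
from crewai import Agent, Task

# Hypothetical schema for this sketch: the shape every run should return.
class Trend(BaseModel):
    title: str
    rationale: str

market_analyst = Agent(
    role="Market Analyst",  # hypothetical role for the example
    goal="Identify relevant market trends",
    backstory="Tracks markets and reports concisely.",
)

trend_task = Task(
    description="Identify the single most relevant market trend.",
    expected_output="One trend with a title and a one-line rationale.",
    agent=market_analyst,
    output_pydantic=Trend,  # enforce a structured, validated outcome
)
```

After the crew runs, the validated object should be available on the task output, for example via trend_task.output.pydantic in recent versions.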
To add reasoning, all you have to do is turn on a single flag on your agent. What the agent will do now is, before it takes an action, before it tries to complete any task, actually think through the steps it should take to get that work done. By doing that, it will draft and self-check a plan, and you can limit the number of iterations it spends getting to something satisfactory, before it proceeds to actually performing the work, following the plan it initially drafted. Enabling this is extremely easy: all you have to do is set one flag, called reasoning, to true, and that automatically turns on the reasoning step for your agent. You can also set the max reasoning attempts, which defines how many times your agent can try to create this plan before it actually kicks things off. So this is a very easy way to guarantee that your agents do some extra thinking before they get to work, and it can lead to better outputs depending on your use case.

Now let's talk about human-in-the-loop. We already talked about guardrails in the previous module, but I want to talk about the ultimate guardrail: the ability to have a human, you, verify, check, and validate whether an output is actually good or bad, and give feedback to the agent in the process. The way this works is that as soon as an agent finishes a task, before it actually calls the work done, it will ask you for input on whether the work is good or bad and whether you like what it has done. You can provide extra feedback, and based on that feedback, it will go back and do the work again. As an example of what this looks like, at the end of the execution you're going to get a prompt like this in your terminal, asking you to provide feedback. You can type something out, asking it to do more or less, or you can type nothing, and the agent will assume that everything is good to go. In this example, maybe it's returning a list of 10 things, but you actually want 15 instead. You can literally type it out, "let's do 15 items instead," and that will send the information back to the agent, which will redo its work to fulfill your feedback and ask you again to sanity-check it before it's done. So you can see how human-in-the-loop can be extremely helpful, especially if you want to guarantee a certain level of oversight over this agent. And when you deploy these agents, especially if you're using CrewAI, this human-in-the-loop logic actually becomes an API, and that API allows you to integrate it with external systems you might have. So you won't need people plugged into terminals; they can add this feedback through Slack, or Teams, or any other communication channels you might be using.

Now, we also want to talk about how you train your crew. Training here means that, through an interactive process of giving agents feedback, they create memories of the things they're doing right and the things they're doing wrong, and those memories get reused later across all executions to make sure they follow that standard. This is extremely interesting. In training, just to recap what you saw in the first module of this course, these agents run as normal, but they stop after their initial output to ask you for feedback.
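Before we go deeper into training, here's roughly what the reasoning and human-in-the-loop settings we just covered look like in code. This is a minimal sketch, assuming a recent CrewAI version; the role, goal, and task text are hypothetical placeholders.

```python
from crewai import Agent, Task

analyst = Agent(
    role="Research Analyst",
    goal="Produce a well-structured market summary",
    backstory="An experienced analyst who plans before acting.",
    reasoning=True,            # draft and self-check a plan before working
    max_reasoning_attempts=3,  # cap how many times the plan can be retried
)

summary_task = Task(
    description="List the 10 most relevant market trends.",
    expected_output="A numbered list of trends with one-line rationales.",
    agent=analyst,
    human_input=True,  # pause before calling the work done and ask for feedback
)
```

With human_input set to true, the run pauses in your terminal so you can type feedback like "let's do 15 items instead," or just press Enter to accept the output as is.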
When they do, they will try to use that feedback to improve their answer, and then generate learnings based on it that go into their memory. On each iteration, the system records, for every agent in the crew, the agent's initial output, the actual human feedback you gave it, the improved output, and what the agent did along the way. At the conclusion of training, this feedback is consolidated per agent, and it becomes a source of suggestions: clear, actionable instructions distilled from your feedback and from the difference between the initial and improved outputs, together with a quality score for every single memory generated. Then, during a normal execution, each agent loads its consolidated suggestions as part of its context to make sure it takes them into account. You're going to have a physical file in your file system storing the suggestions, the quality scores, and the final summary for you. To kick off training, you can either run crewai train in your terminal using the CrewAI CLI, or use the train function in Python, which will automatically kick off the training process for you. When you do that, you're going to see the execution happening and the feedback prompts coming up. And at the end, you're going to realize how much better an agent you have after just a few runs based on your feedback.

Now, I also want to make sure you have the right tools to help you choose the right LLM for each agent. In addition to training, you may also want to test your crews and flows to see how they're performing, to get some metrics on whether they're doing well or not, and even to compare a few different LLMs. In testing, your AI agents will run the same tasks, but many times, for a number of executions that you define. You can also specify an LLM that you want to use as a judge. That LLM can be the same one you're using for the actual tasks, or it can be a different LLM that you want to do the judging. At the end of the day, what this creates is a series of quality scores for each one of those tasks. You can trigger this either through the CLI, where all you have to do is run crewai test in your terminal, or from Python using the test function. What happens behind the scenes is that the crew and the agents perform all their tasks, then the output of those tasks and the expected output of those tasks go to a judge LLM, and the judge LLM scores each task on a 0-to-10 scale, telling you how good it is and how well the actual task output maps to the expected output you defined on the task itself. Here you can see it broken down across two different tasks, with the actual score based on the result of each task across three runs, and an average at the end that includes not only how good it was, but also how long it took to run those tasks. This can be extremely powerful, because as you change your agents, as you change your tasks, and as you change the LLMs you're using, you want a way to compare apples to apples, to understand whether you're moving the needle and driving things forward as you make these changes.
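As a reference, here's a minimal sketch of both kickoffs, assuming an existing crew object and a recent CrewAI version; exact parameter names (for example, the judge-model argument) have varied across releases, so treat this as a starting point rather than the exact course code.

```python
# Training: run a few feedback iterations and persist the learned suggestions.
# CLI equivalent (roughly): crewai train -n 3
crew.train(
    n_iterations=3,
    filename="trained_agents_data.pkl",  # where the consolidated memories are stored
    inputs={"topic": "AI agents"},       # hypothetical kickoff inputs
)

# Testing: run the same tasks several times and have a judge LLM
# score each task output on a 0-to-10 scale.
# CLI equivalent (roughly): crewai test -n 3
crew.test(
    n_iterations=3,
    eval_llm="gpt-4o",  # judge model; may be named differently in older versions
    inputs={"topic": "AI agents"},
)
```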
So in this case, you can see that the crew ran three times, that each execution time varied between 40 and 59 seconds, and that you also get an average score from there. Now you can judge for yourself whether that score is good enough for your use case, or whether you need a different strategy to improve. You can choose among the many strategies you have learned so far in this course, including training, reasoning agents, improving your context, or even choosing different processes, to see how that impacts the actual output of your crew.

We have covered a lot of ground in terms of how you can get a better output, but there's one thing we haven't talked about, and that is safe code execution. The reason I want to talk about this now is that code execution can be so powerful for agents. It's the ultimate tool, because it allows them to do anything. They can write code to do anything they want: code to generate charts, code to generate markdown files, and a lot of different things besides, once these agents go into coding mode, and LLMs are especially good at coding. There are a few ways you can enable safe code execution for your agents, and there are many tools out there that also allow your agents to code if you want. The one thing you have to be mindful of, though, is that once these agents get into code, the sky's the limit; there's a lot they can do. So you want to make sure you do things in a way that feels safe, and that you only go into coding if you actually need to. Most of these agents can use custom tools that you have built before, which we have learned about, to get a lot of this work done, so they might not necessarily need to tap into code. But if you do want them to, we have a code interpreter tool that is automatically embedded into your agents. CrewAI offers a few options to set that up, including a safe execution mode. You can enable and disable code execution, so you don't need to worry about your agents writing code unless you allow them to. And if you do it the safe way, the code is going to run in a container using a Docker image. So you have a bunch of options here as well. If you look at the code, you can see that enabling code execution is extremely easy: all you have to do is set the allow code execution attribute to true, and now your agents are able to code if required to do so. You can also set the code execution mode to safe, to make sure the code runs in a containerized manner, isolated from your computer and therefore safer. You'll see a minimal sketch of this setup right after the recap below.

Now, reliable agents aren't just accurate. They need to be predictable, measurable, and recoverable. That was a lot of information we just covered, so I'm glad you stuck with me. Between guardrails, LLM testing, code execution, reasoning, human-in-the-loop, pre- and post-hooks, and flows, you now have many tools to control your agents and ensure desirable outcomes. And reliable agents, again, don't just need to be accurate; they need to be recoverable and measurable as well. So as you develop agents, focus on using these features to consistently achieve good outputs, even when parameters change or unexpected inputs occur.
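Here's that code execution sketch: a minimal example assuming a recent CrewAI version, with a hypothetical role and goal. The two attributes shown, allow_code_execution and code_execution_mode, are the ones that gate coding and route it into a Docker container.

```python
from crewai import Agent

visualizer = Agent(
    role="Data Visualizer",
    goal="Generate charts from the provided data",
    backstory="Writes small Python scripts to produce outputs.",
    allow_code_execution=True,   # opt in: agents won't write and run code unless enabled
    code_execution_mode="safe",  # run generated code inside a Docker container
)
```

Note that safe mode relies on Docker being available on the machine running the crew.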
Now, after you have built your agents and you've gotten to a place where you're confident in how they work, they're running smoothly, and you're ready to move into production, the first thing you'll start thinking about is how you actually track and monitor your crew: how to make sure its performance stays high and latency stays low, how to track hallucinations, and, through the combination of all these things, how to make sure the users of these systems are actually having a good experience. In the next video, we're going to talk all about monitoring and observability for agentic systems. So make sure you follow me there, because we're going to cover a lot about these topics, and they can make a huge difference once you actually bring these systems into production. I will see you right there in a second.