In this lesson, you will learn how to evaluate whether your agent's GPA, that is, its goals, plans, and actions, are all aligned. Let's get started.

Let's quickly recap from the introduction how to measure your agent's GPA, or goal-plan-action alignment. You'll recall that this makes use of four evaluations, which we will get into in this lesson. First, we have plan quality, which sits at the intersection of goal and plan and measures how well the plan achieves the goal. Next, we have plan adherence, which measures how well the actions the agent executed align with the plan it created. Execution efficiency sits at the intersection of goal and action, and measures whether the execution trace is an efficient one, that is, an optimal path for achieving the agent's goal. And finally, logical consistency sits at the intersection of all three of goal, plan, and action; it looks for inconsistencies between plans, between planning steps, between planning steps and action steps, and within action steps as well.

Now let's get into the notebook and see how we can augment the data agent and evaluation steps from lesson four with support for measuring the agent's GPA. As a first step, we load the environment. Next, we set the GPA eval provider model to OpenAI's GPT-4.1. One reason for this choice is that GPT-4.1 supports long contexts, and the execution traces fed as input when computing the GPA evals require that, because the traces can be quite long.

Now that the environment is set up and the model provider is chosen, we will go over each of the four evals that make up an agent's GPA, one by one. For each eval, I will show a simple example to illustrate the kinds of failure modes the agent can fall into and how these evals catch them. Later in the lesson, I will run the evaluations for the full data agent, and we will see how well it does with respect to these evaluations.

Let me begin with the first evaluation: plan quality. To get going, we set a specific user query, in this case "Which sales leads should we prioritize this week, and what specific action items should we take for each?", as well as a plan that an agent could have come up with to achieve this goal. The plan has four steps: step one is to pull all sales leads from the past 12 months from the CRM; step two is to compile, for the largest twenty leads, any notes, call logs, and related tasks from the CRM; step three is to summarize each lead's current stage in the pipeline; and step four is to present the summary and recommendations in a single table.

Next, let's define the plan quality feedback function and give it the name "Plan Quality" (sketched below). We set the eval model provider to use GPT-4.1, and by requesting chain-of-thought reasons we set it up so that, in addition to a score, it also gives a reason for how it arrived at that score. Finally, we provide the full execution trace as input to this plan quality evaluator, or LLM judge, so it has everything it needs to produce its evaluation.

Now let's run the plan quality evaluation with the example goal and plan set up earlier. The output is a score between 0 and 1, where 0 means a particularly poor plan and 1 a perfect plan.
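In the notebook this is done with a TruLens feedback function, and the sketch below shows roughly what that setup could look like. Treat it as a sketch only: the provider class and its `model_engine` argument follow TruLens conventions, but the `plan_quality_with_cot_reasons` method name is an assumption modeled on TruLens's usual `*_with_cot_reasons` naming, and the way the full trace is selected varies by version, so check the course notebook or the TruLens docs for the exact API.

```python
# Sketch of the plan quality feedback function, assuming a TruLens 1.x-style API.
# The method name `plan_quality_with_cot_reasons` is an assumption modeled on
# TruLens's `*_with_cot_reasons` convention; verify it against your installed version.
from trulens.core import Feedback
from trulens.providers.openai import OpenAI as OpenAIProvider

# GPT-4.1 is used because the GPA evals take full execution traces as input,
# which can be long and therefore call for a long-context model.
provider = OpenAIProvider(model_engine="gpt-4.1")

f_plan_quality = Feedback(
    provider.plan_quality_with_cot_reasons,  # assumed method name
    name="Plan Quality",
    higher_is_better=True,
)
# In the notebook this feedback is wired to the full execution trace (the selector
# syntax is version-specific and omitted here); when evaluated, it returns a score
# in [0, 1] together with a chain-of-thought reason for that score.
```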
The reason is the rationale, the chain-of-thought explanation for why the plan quality judge came up with that score. Let's run it now and see what it comes back with.

The judge returns a score of 0.66, about two thirds of the way toward an ideal plan, and the output shows two things: the criteria the judge used to grade plan quality, and below that the supporting evidence, the chain-of-thought reason for why it gave this plan that score with respect to the goal. Under the criteria, the judge notes that the plan is generally well structured, addresses the query with clear steps, and includes justifications for most actions; however, it lacks an explicit rationale for selecting the largest twenty leads, does not detail the method for prioritization or for generating specific action items, and no replanning is present. So the explanation highlights both what the plan does well and where its gaps are. The supporting evidence then spells this out step by step, aligned with the plan: step one establishes the data set clearly; step two prioritizes large leads but does not justify the number (why twenty?) or the metric; step three summarizes pipeline stages, which supports prioritization; and step four provides actionable output. However, the plan omits explicit prioritization criteria and a recommendation methodology, and no replanning occurs. In this way, the plan quality judge showcases both where the plan does well and where it exhibits weaknesses and opportunities for improvement.

I will note that when we run the full data agent, the plan and the execution trace will be under the control of the agent, not created by us, so we will not be able to directly edit the plan to address these kinds of weaknesses. In the next lesson, however, we will see systematic ways of addressing these kinds of planning weaknesses through updated prompts and inline evaluations.

If I were to summarize the weaknesses in this plan, I would point to four areas where it could have done better. First, the selection criteria were vague: "all sales leads from the past 12 months" lacks urgency constraints tied to the goal. Second, the prioritization approach was weak: "the largest twenty" ignores lead score, stage urgency, and upcoming deadlines. Third, actionability was missing: there were no instructions to create specific next actions or assign owners. And finally, the output was not specific: a single table without the required fields tied to the goal.

These are the kinds of issues we have seen, and now we want to construct a better plan that closes these gaps. Here is a proposal in which the user query stays the same, but the steps of the plan are written to close the known weaknesses surfaced by the previous round of evaluation. For example, in step one we use a next action date within 14 days as a criterion for pulling leads with open opportunities from the CRM, and we filter to leads whose deal value
is greater than $10K or whose lead score is high, meaning these deals matter either because the revenue is sufficiently high or because the probability of conversion is high. We then sort by deal stage urgency, for example by close date approaching, risk of the deal going cold, or potential revenue impact, getting more specific in order to close that particular gap. For each of the prioritized leads, the plan now specifies that we will retrieve the latest interaction notes, key decision-maker information, and current blockers; identify overdue or missing action items; and propose specific, high-impact next steps such as scheduling a product demo, sending a proposal revision, or escalating to the sales manager. Finally, we group these recommendations into this week's priority list with owner assignments and deadlines, and present the results in a table with columns such as lead name, value, stage, urgency score, next action, due date, and owner, which is a much more organized and actionable output.

Now let's rerun the evaluation with this better plan and see what we get. This time the evaluation comes back with a score of 1. The explanation essentially says that the plan is well structured and optimal, that it directly addresses the user's query, that all the steps are clear and logically ordered, and so forth. I will leave it to you to read this in greater detail. The key takeaway is that plan quality evaluation can be very helpful in identifying both the strengths of a plan generated by an agent and the areas where it has gaps and could improve, and this information can then be leveraged to help the agent plan better and achieve better outcomes.

Now that we have finished the section on plan quality, let's look at evaluations for plan adherence, the second evaluation metric for our agent's GPA. Plan adherence, to remind you, checks how well the actions executed by the agent adhere to, follow, or conform with its articulated plan. For illustrative purposes, I will consider a simple sequence of actions that an agent could have executed. Here are the six steps: step one pulled all open opportunities from the CRM, but without applying a next-action-date filter; step two applied the deal value filter only, but skipped the lead score filter; step three sorted leads solely by deal value, that is, based only on revenue impact, even though the good plan asked it to also consider factors such as urgency, the potential to close, and the risk of the deal going cold; step four retrieved the latest notes and contact names but skipped blockers; step five listed the CRM's existing next-action field without review or update; and step six output a table with lead name, value, stage, and next action. As you can see, I have intentionally set up this sequence of agent actions with gaps, so that you or I could eyeball it and tell that it is not an ideal sequence of actions given the good plan we just looked at in the previous section.

Next, I will define the plan adherence feedback function and run it on this sequence of agent actions along with the good plan from the previous section. The plan adherence function's structure very much follows that of plan quality.
The GPA eval provider is again set to run plan adherence with chain-of-thought reasons, so that it does not just produce a score but also gives a reason for that score, and the input to this evaluation function, or LLM judge, is the full trace. The plan adherence function looks at the full trace and checks how well the actions of the agent on that trace correspond to, or adhere to, the plans that are also on that trace (I will show a from-scratch sketch of this idea a little further below).

With the function definition in hand, let's run it on our running example. As before, it is set up to output a score as well as a reason. We run it, examine the results, and, surprise, surprise, it comes back with a score of zero. This should not be particularly unexpected, given that I intentionally planted a bunch of adherence issues in the execution trace. Let's check whether those were caught. You can see a summary of the issues under the criteria, and a more detailed description in the supporting evidence section.

Let's look at the supporting evidence first. For step one, the judge caught that the agent did not filter by the next action date as required and pulled all open opportunities instead. This absence of filtering, which appears in step one of the trace, was detected as a failure to adhere to the plan: if you scroll up to the plan, it says not to pull all open opportunities, but only those with a next action due within the next 14 days or no next action assigned. This filtering criterion was missed by the actions, and that is why this step is flagged as a violation. You can go step by step over the other steps yourself, and you will see other missing pieces that the judge was able to catch and highlight. The criteria section, if we return to it, summarizes the higher-level issues: multiple plan steps were omitted, performed out of order, or replaced with unplanned actions; no meaningful attempt was made to explain, justify, or document plan changes or new actions; the plan was largely ignored or disregarded in execution, or steps were not completed as intended; and where replanning was necessary, the revised plan was not followed. This is a pretty significant and consistent deviation of the agent's actions from its plan. It might as well not have come up with a plan, because it ignored almost everything in it, and so it correctly gets a very low grade, a score of zero. It is not going to do well on its GPA.

Now let's look at a better set of agent actions that adheres more closely to the plan and for which we expect the plan adherence judge to give a higher score. If you look through these steps, there are additional details that were not present in the previous sequence of actions. For example, in step one, when the agent pulls the leads with open opportunities, it focuses only on those with a next action date within 14 days or no next action assigned; similarly, it groups its recommendations into this week's priority list. Let's set up the plan along with this sequence of actions and run the LLM-judge feedback function for plan adherence, given the better plan from before and this improved sequence of agent actions.
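Before looking at the result, here is the from-scratch sketch of the idea behind a plan adherence judge that I mentioned, written directly against the OpenAI client rather than the course's TruLens feedback functions. The rubric wording, the helper name `plan_adherence_judge`, and the JSON output format are my own choices for illustration.

```python
# From-scratch illustration of a plan-adherence LLM judge (not the course's
# TruLens implementation). Requires OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

ADHERENCE_RUBRIC = (
    "You grade how well an agent's executed actions adhere to its plan. "
    "Penalize omitted or reordered plan steps, unplanned actions, and deviations "
    "made without explanation or justification. Respond in JSON as "
    '{"score": <float between 0 and 1>, "reason": "<step-by-step justification>"}.'
)

def plan_adherence_judge(plan: str, actions: str, model: str = "gpt-4.1") -> tuple[float, str]:
    """Return an adherence score in [0, 1] and a chain-of-thought style reason."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ADHERENCE_RUBRIC},
            {"role": "user", "content": f"PLAN:\n{plan}\n\nEXECUTED ACTIONS:\n{actions}"},
        ],
        response_format={"type": "json_object"},  # ask for machine-parseable output
    )
    result = json.loads(response.choices[0].message.content)
    return float(result["score"]), result["reason"]

# Hypothetical usage with abridged versions of the plan and action sequence.
plan = (
    "1. Pull leads with open opportunities whose next action is due within 14 days.\n"
    "2. Filter to deal value > $10K or a high lead score.\n"
    "3. Sort by deal stage urgency and potential revenue impact."
)
actions = (
    "1. Pulled all open opportunities (no next-action-date filter).\n"
    "2. Applied the deal value filter only; skipped the lead score filter.\n"
    "3. Sorted leads solely by deal value."
)
score, reason = plan_adherence_judge(plan, actions)
print(score, reason)
```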
Back in the notebook, the judge comes back with a perfect score of 1, and if you look at the details of the criteria and the supporting evidence, you will notice that all planned steps are well accounted for and there are no omissions or reordering. The failure modes flagged by the judge for the previous sequence of actions are all addressed by this one. Remember, we are illustrating the power of the plan quality and plan adherence judges, and of the GPA evals in general, through this process of looking for alignment between goals, plans, and actions. In practice you, the developer, will not be directly editing either the plans or the sequences of actions. You will, however, be updating the agent, either through its prompts or through other techniques that we will discuss in the next lesson, to address these kinds of failure modes.

Great. Now let's look at another sequence of actions, which we will use as an illustrative example to introduce the concept of execution efficiency. Execution efficiency, if you remember, captures how efficiently the sequence of actions in the trace achieves the goal. In particular, it flags redundant or unnecessary steps in the execution trace that introduce inefficiencies and could have been avoided if the agent had planned better and acted in conformance with a better, more optimal plan. Let's go back to the notebook and look at an execution trace to illustrate this evaluation metric. The execution trace is similar in structure to the ones we have seen before; the only difference is that it also summarizes results, such as the number of leads retrieved in step one (96), the 54 leads remaining after the filter applied in step two, and so forth.

Now let's define the execution efficiency feedback function (a sketch of its declaration, together with logical consistency, appears after this passage). It has a structure very similar to plan quality and plan adherence: as before, we ask the judge to output chain-of-thought reasons in addition to the score, and it is given the full trace as input in order to grade it for execution efficiency. Let's run this evaluation function on the action sequence we just saw and see how well it does. It gets a score of 66%, about two thirds of the way to an ideal score. If you look at the criteria and the supporting evidence, the judge points out that there were some redundant retrievals and duplicate work where the agent reapplied the same filter a couple of times, and that step six exported to both Excel and CSV formats when only one was needed. The other steps were logical, but there were a couple of redundant steps that the judge was able to identify. This is the kind of information you will get from the LLM judge for execution efficiency, highlighting traces that do not represent an efficient execution path.

Finally, the fourth evaluation is logical consistency. Logical consistency looks for contradictions during an agent's execution. It can surface logical flaws in the agent's reasoning; it can surface inconsistent planning steps, where an agent makes an assertion at one point and then contradicts itself later without justifying the change; and it can highlight ungrounded assumptions, where the agent makes an assumption based on its parametric knowledge rather than one that is traceable back to retrieved pieces of information.
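For completeness, here is the sketch referenced above of how the remaining two GPA feedback functions could be declared, following the same pattern as plan quality. As before, the `*_with_cot_reasons` method names are assumptions modeled on TruLens's naming convention, and the trace wiring is version-specific.

```python
# Sketch of the remaining two GPA feedback functions, following the same assumed
# TruLens-style API as before; method names are assumptions to verify locally.
from trulens.core import Feedback
from trulens.providers.openai import OpenAI as OpenAIProvider

provider = OpenAIProvider(model_engine="gpt-4.1")

f_execution_efficiency = Feedback(
    provider.execution_efficiency_with_cot_reasons,  # assumed method name
    name="Execution Efficiency",
    higher_is_better=True,
)

f_logical_consistency = Feedback(
    provider.logical_consistency_with_cot_reasons,  # assumed method name
    name="Logical Consistency",
    higher_is_better=True,
)
# Like the others, both take the full execution trace as input and return a
# score in [0, 1] plus a chain-of-thought reason.
```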
Let's look at another sequence of agent actions and see how it does when we run an evaluation for logical consistency on it. The logical consistency feedback function has exactly the same structure as the previous ones: we require chain-of-thought reasons, and its only input is the full execution trace. With the function defined, let's run it on this sequence of actions.

It comes back with a score of 33%, so it has identified some inconsistencies in the execution trace. For example, between step one and step two there are contradictory lead counts: one step says the number of leads is 96, another says it is 113, without any explanation. In step three, the ranking logic based on minimal engagement is not fully justified. Step four references a decision maker who has departed yet still has active next steps, which is not explained. Steps five and six are fine, but you can see a number of inconsistencies that the logical consistency judge has flagged. These are all opportunities for improvement for the agent; as it currently stands, these kinds of inconsistencies can affect the accuracy and correctness of the agent's final response.

With that, we wrap up the section introducing the four evaluations that are helpful for grading an agent's GPA. Now that we have introduced the GPA evals, the four LLM judges for plan quality, plan adherence, execution efficiency, and logical consistency, using some simple running examples, let's go back to running the same set of evaluations over the data agent that we created in lesson four.

Let's set up a TruSession for logging the execution traces and evaluation results into the database; this is a step we have seen before. Next, let's build the graph and the agent; this is again identical to what you saw in the previous lesson. Next, we register the agent with TruLens. This step is also very similar to what you did in lesson four; the only addition is that, alongside the RAG triad of goal completion evals, we now include the four evaluations that measure the agent's GPA (a rough sketch of these setup steps appears at the end of this walkthrough below). And now we are ready to run and record the agent's execution and the associated evaluations on a set of queries.

Let's begin with the first query, where we ask: what are our top three client deals? Chart the deal value for each. This should be a relatively simple query for the agent, one that primarily makes use of its text-to-SQL capabilities. The data agent comes back with its response. Let me quickly run through a couple more example queries, and then we will spend time in the TruLens dashboard to understand the tracing and evaluation results for these queries. The second query asks the agent to identify our pending deals, research whether they may be experiencing regulatory changes, and use the meeting notes for each customer to provide a new value proposition given those regulatory changes. This query, if you recall, is similar to what we saw in the introduction and in the previous lesson. And here is the third query; we run the agent and the evaluations on this one as well. Now that we have the results from all the queries, we can look at them on the TruLens dashboard. We run a short code snippet to set up the dashboard, and the link it produces takes us to the TruLens dashboard.
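Putting those setup steps together, here is the rough sketch promised above of what the session, registration, and dashboard code could look like. `TruSession` and `run_dashboard` are standard TruLens entry points; the `TruApp` wrapper and its arguments, the `compiled_graph` object standing in for the lesson-four agent, and the feedback list are assumptions about the notebook's actual code.

```python
# Sketch of logging, agent registration, and the dashboard, assuming a TruLens
# 1.x-style API. `compiled_graph` is a stand-in for the data agent built in
# lesson four, and the f_* objects are the GPA feedback functions sketched
# earlier (f_plan_adherence declared analogously to the others).
from trulens.core import TruSession
from trulens.apps.app import TruApp          # assumed wrapper class for a custom agent
from trulens.dashboard import run_dashboard

session = TruSession()
session.reset_database()  # start with a clean evaluation database

tru_agent = TruApp(
    compiled_graph,                           # hypothetical: the lesson-four data agent
    app_name="data_agent",
    app_version="lesson_5",
    feedbacks=[
        # the four GPA evals, plus the RAG-triad goal completion evals from lesson four
        f_plan_quality, f_plan_adherence, f_execution_efficiency, f_logical_consistency,
    ],
)

# Everything invoked inside the recording context is traced and evaluated.
with tru_agent:
    compiled_graph.invoke(
        {"messages": [("user", "What are our top three client deals? Chart the deal value for each.")]}
    )

run_dashboard(session)  # serves the TruLens dashboard with the leaderboard and record views
```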
Let's now examine the results of our evaluations using the TruLens leaderboard. Here you can see the results of the evaluations for the data agent that we ran in this lesson. The goal completion evals related to the RAG triad have some room for improvement; context relevance and groundedness, for example, are not that high. If we look at the GPA evals, plan quality is generally good, as is logical consistency, but plan adherence and execution efficiency show room for improvement. On this screen, the numbers for these metrics represent the average across the three queries we ran from the notebook.

We can click on the Examine Records tab to get a record-level view, where we can examine in greater detail the evaluations for each query. Each row in this table represents a query: the input, the output response, and the results of the evaluation on each of these metrics. Let's pick an example query for which the GPA evals do quite well. Scrolling through, it looks like the second query is one of these, so let's select it and examine it in greater detail. This was the question where we asked the data agent to identify our pending deals, research whether they may be experiencing regulatory changes, and use the meeting notes for each customer to provide a new value proposition given those regulatory changes. Down here you can see the output of this process: the final answer identifies five deals, and then for each one it notes, for example, that the first has no specific regulatory changes, that the second is affected by shifts in trade policy, and so on. Scrolling down to the feedback results, we can see that the GPA evals are quite strong for this particular query: plan adherence, plan quality, logical consistency, and execution efficiency are all perfectly graded. Context relevance and groundedness, on the other hand, are not that high, indicating that there is still room for improvement in the retrieval steps this agent went through.

Now let's look at another row where the GPA evals are not as high. The first row falls into this category: while the agent does well on plan quality and logical consistency, it does not adhere well to its plan, and its execution efficiency is also quite poor. I can select that row and look at it in more detail. The query was about identifying the largest land deal, with some additional qualifiers, and you can see what the output looks like. Let's look at the plan adherence evaluation results. The score here is zero, and the explanation provides a detailed summary of why it is so low: multiple planned steps were omitted or not completed as intended, no meaningful attempt was made to explain or justify plan changes beyond stating a capability limitation, and the plan was largely disregarded after step one. It then goes on to show more specific supporting evidence. If you want to manually check the work of the judge, you can click on the planner node and look at the details of the plan the agent came up with; the various steps are laid out there, step one, step two, and so forth. You can then look at the execution trace and check how closely the executed actions followed this plan.
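If you would rather pull these numbers programmatically than read them off the dashboard, the session object also exposes the leaderboard and record-level feedback as DataFrames. The method names below are as I recall them from TruLens 1.x; verify them against your installed version.

```python
# Programmatic alternative to browsing the dashboard (method names per TruLens 1.x;
# verify against your installed version).
leaderboard = session.get_leaderboard()      # per-app averages of each feedback metric
print(leaderboard)

# Record-level view: one row per query, with a column for each feedback score.
records_df, feedback_names = session.get_records_and_feedback()
print(records_df[["input", "output", *feedback_names]].head())
```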
Coming back to the evaluations: this particular trace suggests that there is room for improvement in how this agent adheres to its plan. There is also room for improvement in the goal completion evals, context relevance in particular, as well as groundedness. With that, let me wrap up lesson five. In the next lesson, we will look at systematic techniques to improve the agent's GPA and address the shortcomings we are observing in its execution.