In this lesson, we will explore Agent Q, a framework designed to teach AI agents how to self-correct, drawing on several techniques developed by AI researchers. We'll dive into the key methods for improving agent performance, provide an overview of the Agent Q framework, and examine how it addresses common AI agent challenges. Let's dive in.

Remember from lesson one that existing frameworks have several limitations: first, reliability and trust; second, decision-making errors; third, plan divergence and looping. Recall also the four steps of agent execution, including planning and reasoning. Significant issues in agent execution come from the reasoning part, and that is what we are going to address in this lesson.

Now, let's learn about Agent Q, a state-of-the-art AI algorithm that addresses these issues and improves reasoning in web agents. It combines the following methodologies to teach agents to self-correct: Monte Carlo Tree Search, which we use to explore search spaces; self-critique mechanisms for continuous improvement, through which web agents receive real-time feedback; and Direct Preference Optimization (DPO), a reinforcement learning algorithm that improves the agent based on its collected experience. You will learn how these methods work together to create a powerful framework for autonomous agent reasoning called Agent Q. Our research team has published these findings in a white paper that can be found on arXiv. In the next lab, you'll explore some of these concepts.

Now, as step one in understanding Agent Q, we will learn how Monte Carlo Tree Search, or MCTS, works. MCTS is a search methodology for backtracking and exploring, and it allows decisions to be planned several steps ahead. In MCTS, we start at the root node, which is our current situation. From here, the algorithm follows what it thinks is the best path based on what it knows so far, using an exploitation strategy. The algorithm continues down the tree, selecting what seems most promising, until it reaches a node where no further selection is possible. At this point, the algorithm expands the tree by creating a new possibility and explores this uncharted territory. Once we reach the end node, which is as far as we can go, the algorithm predicts the expected future reward of continuing on this path as a Monte Carlo estimate. Think of this as quickly playing out the scenario to see how good this choice might be. After the future reward estimate, MCTS revisits all the nodes it saw on the way down. This is called backpropagation, and it essentially tells the algorithm: hey, this path led to these results, so update your knowledge accordingly. The magic of MCTS is that this entire process repeats many times. With each iteration, the agent gets smarter, focusing more on the promising paths and less on dead ends, just like how you might get better at chess by thinking a couple of moves ahead and weighing their consequences.
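To make the four phases concrete, here is a minimal sketch of MCTS in Python on a toy action-sequence problem. It is illustrative only, not the Agent Q implementation: the environment, the TARGET sequence, and helper names like legal_actions and rollout_reward are invented for this example.

```python
import math
import random

# Toy problem: discover the action sequence that matches TARGET.
TARGET = (1, 0, 1)

def legal_actions(state):
    return [0, 1] if len(state) < len(TARGET) else []

def is_terminal(state):
    return len(state) == len(TARGET)

def rollout_reward(state):
    # Monte Carlo estimate: play out random moves to the end,
    # then score how many positions match the target sequence.
    while not is_terminal(state):
        state = state + (random.choice(legal_actions(state)),)
    return sum(a == b for a, b in zip(state, TARGET)) / len(TARGET)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> child Node
        self.visits, self.value = 0, 0.0

    def ucb_child(self, c=1.4):
        # Exploitation (average value) plus an exploration bonus (UCB1).
        return max(
            self.children.values(),
            key=lambda n: n.value / (n.visits + 1e-9)
            + c * math.sqrt(math.log(self.visits + 1) / (n.visits + 1e-9)),
        )

def mcts(root, iterations=500):
    for _ in range(iterations):
        node = root
        # 1. Selection: follow the most promising, fully expanded nodes.
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = node.ucb_child()
        # 2. Expansion: try one new action if the node is not terminal.
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(node.state + (a,), parent=node)
            node = node.children[a]
        # 3. Simulation: estimate the future reward with a random rollout.
        reward = rollout_reward(node.state)
        # 4. Backpropagation: push the result back up toward the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The best first action is the most visited child of the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

if __name__ == "__main__":
    print(mcts(Node(())))  # usually prints 1, the first move of TARGET
```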
Now, let's understand step two of Agent Q: self-critique mechanisms and process supervision. This approach uses an LLM critic to give feedback and improve the reasoning capabilities of the actor. When a user asks for something like booking a restaurant reservation, the AI receives the request along with any context or history from previous interactions. The actor part of the AI considers every possible action it can take: it might select a date and time first, or select a specific restaurant, or go to the OpenTable homepage and start from there. This is where the critic comes in. Think of it as the actor's thoughtful advisor. The critic analyzes what the current run is showing, what the user actually asked for, and what information is most relevant right now. Based on this analysis, the critic can determine which actions make the most sense and provide detailed feedback on each option. In our restaurant booking example, instead of randomly trying actions, the critic might suggest a better sequence: first, go to the OpenTable homepage, then search for the restaurant name, and finally select the date and time. This ranking of actions helps the AI make smarter choices, similar to how you might mentally consider different approaches before deciding on the best way to solve a problem.

Before step three of Agent Q, we need to understand Reinforcement Learning from Human Feedback, or RLHF, a method that allows AI assistants to learn to make better decisions by incorporating human judgments about what works best. It's like training a helpful assistant through ongoing feedback and guidance. Now let's see how all of this comes together in a real-world example of booking a reservation at OpenTable. When the agent runs through the Monte Carlo Tree Search and the self-critique process, it creates what we call preference data: comparisons between different outcomes from the same user query that tell us which outcome is better. In this case, outcome one is preferred over outcome two. This preference data is crucial for improving the system and for training what we call a reward model. The reward model labels different actions with rewards, giving a score of plus one to good behaviors and minus one to less useful ones. These rewards help fine-tune the AI agent to make better decisions. The system then generates sample outputs and feeds them back to the reward model for evaluation. This creates a continuous learning loop: the AI tries something, gets feedback, learns from it, and improves. This cycle is the essence of reinforcement learning, just like how we humans learn from experience and feedback over time. If you want to learn more about RLHF, the Deep Learning course is available at the link below.

Now we come to step three of Agent Q: Direct Preference Optimization, or DPO, a faster alternative to reinforcement learning from human feedback. DPO streamlines how the AI learns from preferences by directly updating the model without creating a separate reward model. It refines the AI's decision-making through direct feedback loops, essentially creating a shortcut in the learning process. Instead of building an intermediate reward model, DPO takes the preference data and directly fine-tunes the AI model itself. This powerful approach complements our earlier techniques: Monte Carlo Tree Search explores possibilities, and self-critique evaluates options. You can read more about DPO in the research paper referenced at the bottom of the slide. This approach significantly speeds up learning while maintaining, and even improving, response quality.

To repeat, the Agent Q framework comprises three methods. First, Monte Carlo Tree Search, a method for structured exploration during search. Second, self-critique mechanisms and process supervision, which incorporate feedback for better decision-making as part of the search process. These two methods combine to search, identify failures, and find better options. Third, Direct Preference Optimization, a reinforcement learning algorithm that optimizes based on what the agent has seen before.
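As a rough illustration of the idea behind DPO, here is a small numeric sketch of the standard DPO loss on a single preference pair. It assumes we already have log-probabilities of the preferred and less preferred outcomes under the policy being trained and under a frozen reference model; the function name dpo_loss, the beta value, and the numbers are made up for this sketch, and this is not the Agent Q training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy favors the chosen outcome than the reference
    # does, minus the same quantity for the rejected outcome.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Negative log-sigmoid of the margin: small when the policy already
    # ranks the preferred outcome well above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Preference pair from the OpenTable example: outcome one (successful booking
# path) is preferred over outcome two (wrong restaurant). Numbers are illustrative.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # policy favors the chosen path -> lower loss
print(dpo_loss(-15.0, -12.0, -14.0, -13.0))  # policy favors the rejected path -> higher loss
```

Minimizing this loss pushes the model directly toward the preferred outcomes, which is why no intermediate reward model is needed.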
These three methodologies address the main challenges of web agents identified before. Let's look at an example of booking a reservation on OpenTable. First, the agent incorrectly navigates to the wrong restaurant. Agent Q identifies the failure, navigates back to the homepage, and then corrects course to the right restaurant. Next, it accidentally selects the incorrect date, which it corrects by opening the date selector, going to the seat selection for the correct date, and finally completing the reservation.

We tested Agent Q in real-world scenarios of booking on OpenTable, benchmarking Agent Q against other methods. We find that GPT-4o is only able to reach a success rate of 62.6% in terms of successful OpenTable reservations. Agent Q without any AI feedback reaches an accuracy of 75.2%. Similarly, Agent Q without MCTS reaches an accuracy of 81.7%. Our final agent algorithm, including all three methodologies discussed so far, is able to reach an amazing score of 95.4% in these scenarios.

To recap, we learned about Agent Q and its three main elements: MCTS, self-critique, and DPO. In the next lesson, you're going to explore MCTS and Agent Q in action. Alright! See you there.