In this lesson, you will take a deep dive into Monte Carlo Tree Search, a powerful algorithm for finding optimal solutions in complex decision spaces. We will then explore the MCTS approach in the AgentQ framework to create autonomous browser agents capable of completing web tasks. All right. Let's code.

In this lab, we're going to go over how we can implement MCTS on a grid world. In this example, you can see we have our starting state, shown in green, and an end state shown in red at cell 2,4. These mark our start and end states, and the cells marked with W are walls or obstacles. We want to traverse the grid from the start state to the end state, moving only through the white cells, and the agent can only travel north, south, east, and west. We'll then go over different examples exploring how exploration weights can affect the speed of convergence to the optimal path, the final Q-values, and the exploration pattern across the grid. Once we have implemented MCTS in the grid world example, we will take it to our AgentQ example, where we run MCTS with our browser agent. Let's now code and have fun.

Let's start with the lab. First, we import all the necessary packages. We have already created one grid world example for you, with an MCTSGridWrapper and a DFS routine that adds labels to our grid. Let's start with our MCTS example for the grid world. We first create our initial grid state: here zero represents the open cells we can traverse, one the obstacles, two the starting position, and three the end state, the final state we want to reach. We then construct the grid, add our colors, define our boundaries, and plot the graph.

Let's look at our initial map. In the map we have our starting position, in green at cell 6,7, the end state, in red at cell 2,4, and our obstacles, which are marked with W to classify them as walls. The goal is to get from the starting position to the end state by taking the shortest path possible, and we will use MCTS to find the most optimal, shortest route we can get.

Now we are going to run MCTS on our grid example. One thing we do is compare how different exploration weights affect the MCTS results and exploration: a low weight of 1.0 and a high weight of 3.0, with 5 explorations, and overall around a thousand MCTS iterations. The MCTSGridWrapper is in your utils files, where you can see the implementation. It lets us trigger execution over the grid and collect the optimal routes.

Once MCTS has run, you can use the compare-exploration-weights helper provided in mcts_animation, which is also included for you. This allows you to visualize the MCTS results. Here we can see a heatmap of the Q-values across the grid for the different explorations. For this particular weight you can see the search starts at the initial state and reaches the end state, where the colors are darker. The brighter yellow cells mark parts of the grid that are less optimal, so their Q-values are lower, whereas at the end state the Q-value is higher, signifying that this is the best state to reach. And as you can see in the animation, this cell has the lowest Q-value, which means the agent always avoids that particular state.
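To make the role of the exploration weight more concrete, here is a small self-contained sketch. It is not the lab's MCTSGridWrapper (whose implementation lives in the utils files); the grid layout and the visit counts below are made up for illustration. It shows the grid encoding described above and the UCB1 selection rule that grid MCTS typically relies on: the weight scales the exploration term, so a higher value keeps pulling the search toward less-visited moves.

```python
import math

# Toy grid using the notebook's encoding: 0 = open cell, 1 = wall,
# 2 = start, 3 = goal. (This layout is illustrative, not the lab's grid.)
grid = [
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [2, 1, 0, 0, 3],
    [0, 0, 0, 1, 0],
]

def ucb_score(child_value, child_visits, parent_visits, exploration_weight):
    """UCB1: average value plus an exploration bonus scaled by the weight."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value / child_visits
    explore = exploration_weight * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Two candidate moves from the same node: (total backed-up value, visit count).
# The numbers are made up to show how the weight can flip the selection.
stats = {"north": (16.0, 20), "east": (1.0, 4)}
parent_visits = sum(visits for _, visits in stats.values())

for w in (1.0, 3.0):  # the low and high weights compared in the lab
    best = max(stats, key=lambda a: ucb_score(*stats[a], parent_visits, w))
    print(f"exploration weight {w}: select {best}")
```

With the low weight, the well-visited move with the higher average keeps winning; with the high weight, the rarely visited move is chosen instead, which matches the idea that higher weights produce broader exploration patterns before the search converges.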
Coming back to the heatmap: similarly, for this rollout we avoid going through these paths, which are really not optimal for the agent to follow. The algorithm avoids them because they are much less optimal paths, with the lowest Q-values.

Now, for more exciting results, let's see something cool. Here you can see the best path our algorithm is following, and in the first example it is able to reach the most optimal state at the end, converging to a value close to 0.8. Initially the model explores a much larger path, but then it is able to converge onto the shortest path possible. You can make this much more complex: add more obstacles, or make the grid even bigger, so you can visualize how well MCTS performs with different weights and different rollouts, and how quickly it converges to the shortest path. However, please leave a path for our agent to follow from start to end.

Now that we are finished with the grid world, let's see MCTS in a browser example with AgentQ. In AgentQ, we represent a world model over the browser state. States are represented by the DOM, the URL, and the objective; the actions are clicking, typing, and navigation, which produce state transitions. In the browser MCTS, the process first initializes the world model and its configuration, and then runs the MCTS algorithm to generate Q-values and rewards for DPO, which is Direct Preference Optimization.

Now let's see this in our lab. An AgentQ clone is available for you to use. To visualize the tree, we can use the plotTree function and dfs_browser_nodes to build a proper tree representation of the MCTS search. These are the different states, with the URLs, messages, actions, and critic_responses that were captured during the MCTS process; these are all the nodes, and these are all the Q-values. This part represents our objectives and rewards, marking which states are terminal and which tasks were completed. We run DFS to fully traverse the tree and collect all of this labeled data.

Now let's visualize our tree. We can see a pretty complex, dense tree representation. Let's zoom out to properly visualize the whole tree and the paths it took. As you can see, it started at the initial state, considering how it could accomplish the objective, and assigned it a Q-value. Then it went to the first node, realized it was not the most optimal path, and abandoned it; the second node, same thing. The third node is a more doable path, so it is selected, and you can see that because its Q-value is much higher than the rest of the Q-values. From what we can see, node 4 has the highest Q-value and is also the end state, so this is where it stopped after evaluating all the possible rollouts. Starting from this node, it further created three more paths that it can follow, with node 4, node 5, and node 6, and then it explores all the paths following those nodes. It then backpropagates to node 4, assigns it the highest Q-value, and determines that it is the terminal state where it reaches the goal, and also the shortest route possible to achieve our goal. From these possible paths, it continued to node 4 and did not explore it further, because this is a terminal state: it is already on the course page, which is Building and Evaluating Advanced RAG, and that was one of the possible objectives.
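Before continuing down the other branches, it may help to see roughly what each of these tree nodes carries. The sketch below is an illustrative stand-in, not AgentQ's actual classes or its dfs_browser_nodes helper: the class names, field names, and the collect_nodes function are assumptions. It shows a browser state made of the DOM, URL, and objective, a node holding an action, Q-value, reward, and terminal flag, and a depth-first walk that flattens the tree into the labeled records the visualization consumes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative stand-ins for what AgentQ tracks per MCTS node; the class and
# field names here are assumptions, not AgentQ's actual code.
@dataclass
class BrowserState:
    url: str
    dom: str          # serialized DOM snapshot
    objective: str    # the task the agent is trying to complete

@dataclass
class TreeNode:
    state: BrowserState
    action: Optional[str] = None   # e.g. "CLICK", "TYPE", "NAVIGATE"
    q_value: float = 0.0
    reward: float = 0.0
    is_terminal: bool = False
    children: List["TreeNode"] = field(default_factory=list)

def collect_nodes(node: TreeNode, out: list) -> list:
    """Depth-first walk that flattens the tree into labeled records,
    similar in spirit to what dfs_browser_nodes does before plotting."""
    out.append({
        "url": node.state.url,
        "action": node.action,
        "q_value": node.q_value,
        "reward": node.reward,
        "terminal": node.is_terminal,
    })
    for child in node.children:
        collect_nodes(child, out)
    return out

# Tiny example tree: a root state with one explored, terminal child.
root = TreeNode(BrowserState("https://www.deeplearning.ai", "<html>...</html>",
                             "find the Advanced RAG course page"))
root.children.append(TreeNode(root.state, action="CLICK", q_value=0.8, is_terminal=True))
print(collect_nodes(root, []))
```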
Coming back to the tree: in the other nodes, you can see it instead navigated to community.DeepLearning.AI. On a further path, it navigated to the short courses section of the community, and, even more exciting, this page is about RAG, but it is not a course; it is a community discussion about the course. You can see that once it realizes it is only a discussion, it abandons this path. The same happens with our other possible path, which is again on the community page. It explores further down and finds a course or discussion on RAG, in this case on building multimodal RAG search, and the same happens when it tries to find a knowledge graph for RAG. Again, here it decides it needs to navigate back and backpropagate to its initial state because it is on the wrong page. The same here: once it continues down this line, it sees this is a multimodal search page but not the state it needs to be in, and likewise it sees that this one is a discussion. And here it even encounters an error, so it cannot continue further and marks this as a terminal state. Following this route, the agent sees that node 4 is the shortest route and the best possible state to see the course listing for our RAG course.

In this lesson, you learned how to implement MCTS, saw how MCTS behaves in a browser environment and finds the shortest, most optimal path possible, and saw how it can actually improve our results compared to previous results.
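As a final concrete detail, here is a minimal sketch of the DPO step mentioned earlier. It captures the general idea rather than AgentQ's exact pipeline, and the action strings and Q-values are made up: sibling actions explored from the same state can be paired up by their backpropagated Q-values, with the higher-Q action becoming the chosen example and a lower-Q sibling the rejected one.

```python
from itertools import combinations

# Actions explored from the same parent state, with the Q-values MCTS
# backpropagated into them (values here are invented for illustration).
siblings = [
    {"action": "CLICK link to course page", "q": 0.8},
    {"action": "NAVIGATE to community forum", "q": 0.3},
    {"action": "TYPE query in search box", "q": 0.5},
]

def preference_pairs(children, margin=0.1):
    """Turn sibling Q-values into (chosen, rejected) pairs for DPO-style training."""
    pairs = []
    for a, b in combinations(children, 2):
        hi, lo = (a, b) if a["q"] >= b["q"] else (b, a)
        if hi["q"] - lo["q"] >= margin:  # only keep clear preferences
            pairs.append({"chosen": hi["action"], "rejected": lo["action"]})
    return pairs

for pair in preference_pairs(siblings):
    print(pair)
```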