In this lesson, you'll learn how to prompt Llama 3.2 for image reasoning. You'll work on several exciting use cases, from counting the llamas in an image to analyzing a warning message shown on a car dashboard. All right, let's dive into the code.

Llama 3.2 introduced multimodal 11B and 90B models and lightweight 1B and 3B models. With the multimodal Llama 3.2 11B and 90B models, developers can build visual reasoning and understanding apps that approach the capabilities of closed models but come with the full flexibility of open models. The pre-trained models can be adapted for a variety of image reasoning tasks, and the instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.

Llama 3.2 does not update the Llama 3.1 8B, 70B, and 405B text models; instead, the Llama 3.2 models are built on top of the Llama 3.1 text-only models. All Llama 3.2 models use the same tokenizer as Llama 3.1, and both generations have the same 128K context window. In terms of text capabilities, the Llama 3.2 multimodal 11B and 90B models are the same as Llama 3.1 8B and 70B, respectively. The supported languages for Llama 3.2 are also the same as for Llama 3.1: for text-only tasks you have English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, while for image-plus-text applications only English is supported.

The prompt format of the Llama 3.2 vision instruct models is similar to that of the Llama 3.1 text instruct models, with one addition: an image special token when the input includes an image to reason about. Without the image token, the Llama 3.2 11B and 90B models are treated as text models. The Llama prompting format lesson covers both the low-level prompting format and the messages list format in detail, but because we have many interesting vision use cases to cover in the rest of this lesson, we'll focus on the more user-friendly, high-level messages list when querying Llama 3.2 vision models.

While learning the basics of Llama 3.2 multimodal prompting, you will work on multiple use cases: you will analyze an image of llamas and count them, do plant and dog breed recognition, and analyze a warning message about tire pressure. All right, let's get coding.

Let's start by adding a line to ignore unnecessary warnings. You can then load the environment variables using the load_env function that is defined in utils. Before using Llama 3.2 for image reasoning, let's first compare Llama 3.1 and Llama 3.2 on text-only prompts. For this, two helper functions, llama32 and llama31, are defined in utils for easier prompting with Llama 3.2 and Llama 3.1, respectively. These helper functions receive the prompt in either the raw prompt format or the high-level messages format, pass it to the Llama model, and return the model's response. Feel free to check these two functions in the utils file. In this lab, you will use the high-level messages format to prompt the Llama model.

Here you have the prompt: "Who wrote the book Charlotte's Web?" You can now pass these messages to the llama32 function, get the response, and then print it: the book Charlotte's Web was written by E.B. White. You can do the same using the llama31 function. As you see, the two responses are very similar, or in this case exactly the same. This is because the text models in Llama 3.2 and Llama 3.1 are the same. Let's now reprompt the model based on the response we got.
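Before moving on to the follow-up prompt, here's a minimal sketch of the text-only prompt just described. It's a sketch rather than the lab's exact code: it assumes the llama32 and llama31 helpers from utils accept a high-level messages list and return the model's text response.

```python
import warnings
warnings.filterwarnings("ignore")  # ignore unnecessary warnings, as described above

# Assumed helpers from the course's utils file; exact signatures may differ.
from utils import load_env, llama32, llama31

load_env()  # load environment variables such as API keys

messages = [
    {"role": "user", "content": "Who wrote the book Charlotte's Web?"}
]

# Ask the Llama 3.2 model and print the response
print(llama32(messages))

# The same messages work with the Llama 3.1 helper; because the text models
# behind Llama 3.2 and Llama 3.1 are the same, the answers should match.
print(llama31(messages))
```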
You can start with the same messages, keeping the initial prompt you passed as the user role. Then add the response you got as the assistant role, and finally add your new question, asking for three of the best quotes, as the user role again. You can now pass the new messages to the llama32 function, get the response, and print it, and here are the three best quotes. If you do the same using the llama31 function, which calls the 3.1 model, you'll get a very similar response.

Now, let's see some cool use cases that use multimodal Llama 3.2 to do image reasoning. Let's take a look at the image you will use in your first image prompting use case. For this, you can use the disp_image function that is defined in utils; you can pass it the address of an image on the web or a local image path to display it.

To prompt Llama and ask it to describe an image for you, you will again use the high-level messages format. You have the user role with a content list that includes a text part carrying the prompt "describe the image in one sentence" and an image_url part carrying the image URL. If you have an image as a URL on the web, for example this image, which is the same image you saw before, you can run the cell to create the messages, then pass the messages to the llama32 function and print the result. And here's the response from the model, which correctly describes the three llamas in this picture.

If you're using a local image, you will use the same messages prompt. The only difference is that for the image URL, you need to pass the base64-encoded format of the image. For example, if you have a JPEG image, this is what you will have for the URL. Since all of the remaining use cases use local images, you can define a helper function that takes the image path and returns the base64-encoded format of the image. You can now call this function and pass it the same llama image you displayed in the previous cells, and this base64_image will be used as the image URL in the messages. The last step is to pass the messages to the llama32 function and print the result, and here you see a similar response, this time using a local image.

You can now ask a follow-up question based on the previous response. You will keep the same messages you had, then add the result you received for the initial prompt using the assistant role, and finally add your new question, "how many of them are purple?", using the user role. Now you can pass the messages to the llama32 function and print the result, and as you can see, one of the llamas is purple, which is correct.

Since you are going to work with many image use cases in this lab, you would otherwise need to define the messages prompt multiple times, each with an image URL and a prompt that, for example, asks the model to describe the image. So it is useful to wrap these messages in the llama32pi, or prompt-image, function, which takes the prompt, the image URL, and an optional model size parameter, forms the messages, passes them to the model, gets the result, and returns it (see the sketch below). You can now use this function to prompt Llama to describe an image by passing a URL from the web and getting a response, or you can use a local image via its base64-encoded format and get a similar response.

Now that you have all the helper functions you need, you can work on your first main use case in this lab. Say you have this image of a tree, and you have this question: "What kind of plant is this in my garden?"
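Here is a minimal sketch of the two helpers just described, under the assumption that llama32 accepts a messages list plus an optional model size. The name llama32pi comes from the lesson; the encode_image name, the default model size, and the example URL and file paths are placeholders for illustration.

```python
import base64

from utils import llama32  # assumed helper: takes a messages list and a model size

def encode_image(image_path):
    # Read a local image file and return its contents as a base64 string.
    # The function name is a placeholder; the lesson only describes its behavior.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def llama32pi(prompt, image_url, model_size=90):
    # Wrap the prompt and image URL in the high-level messages format:
    # one user turn whose content holds a text part and an image_url part.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
    result = llama32(messages, model_size)
    return result

# Usage with an image on the web (placeholder URL):
# print(llama32pi("describe the image in one sentence",
#                 "https://example.com/llamas.jpg"))

# Usage with a local image, passed as a base64 data URL (placeholder path):
# base64_image = encode_image("images/tree.jpg")
# print(llama32pi("What kind of plant is this in my garden?",
#                 f"data:image/jpeg;base64,{base64_image}"))
```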
You can convert your image to base64 format, pass the base64 image and your question to the llama32pi function you defined, get the result, and print it. And here's the result: "The plant in your garden is a pomegranate tree", which is correctly recognized by the model.

Now let's see a use case where you prompt Llama to recognize a dog breed. You have this image of a Labrador Retriever; let's see if Llama can recognize and describe this dog breed. Here's the question you have about this image. You can now convert your image to base64, pass the image and the question to the llama32pi function, get the result, and print it. And here is the response: the dog breed depicted in this image is a Labrador Retriever, and, as you asked in your prompt, here are some of its characteristics in a few bullet items. You can try this on another image. You'll convert it to base64, then pass this new image with the same question you had to the llama32pi function, get the result, and print it. As you see, Llama recognized this image as a Labrador Retriever again, even though the image was taken from a different angle.

Another cool use case is prompting Llama with an image of a warning message you have received on a car dashboard. Let's say this is a picture of the warning message. The goal is to ask Llama this question: "What's the problem this is about? What should be good numbers?" To prompt Llama, you first convert the image to base64, then pass the base64 image and the question to the llama32pi function, get the result, and print it (a short usage sketch of this prompt follows at the end of the lesson). And here's the response, which indicates low tire pressure, with some additional information about the pressure on the front and back tires.

All right. In this lesson, you learned how to do multimodal prompting with Llama and worked on four cool image reasoning use cases. If you found these exciting, I have good news for you: in the next lesson, you will work on eight more advanced use cases. You will analyze restaurant receipts spread across multiple images to calculate the total charge, and you will also combine image reasoning with tool calling. See you in the next lesson.
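For reference, here is a short usage sketch of the dashboard-warning prompt from this lesson, reusing the hypothetical encode_image and llama32pi helpers sketched earlier; the image file name is a placeholder.

```python
# Placeholder path for the dashboard warning picture used in the lesson
base64_image = encode_image("images/tire_pressure_warning.jpg")

question = "What's the problem this is about? What should be good numbers?"

# Pass the base64-encoded image as a data URL together with the question
result = llama32pi(question, f"data:image/jpeg;base64,{base64_image}")
print(result)
```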