In this lesson, you will apply what you learned in the previous lessons to more exciting and complex use cases. Let's go.

You will work on eight different use cases in this lab. You will analyze restaurant receipts across multiple images and calculate the total charge. You will prompt Llama with questions about an interior design. You will extract the nutrition facts of drink images in JSON format. You will turn a model diagram into code, convert a plot to an HTML table, analyze a fridge and ask for a food recipe, and grade handwritten math homework. You will also combine image understanding with tool calling.

Let's see this last use case in a little more detail. Say you have this image of the Golden Gate Bridge and want to know the current weather at this place. This is not something Llama will directly know the answer to, but you can pass the image to Llama and ask about the place and its location. You can then use Llama's response to ask about the weather, and if you have enabled tool calling, Llama will return the search call needed to get the weather information. Going from an image to a search tool call is very exciting. Let's code all of this.

As you did in the previous lab, you can start by adding these two lines to ignore unnecessary warnings. Then you will load the environment variables and the API keys.

You will need several helper functions that you have already defined: the llama32 function that you used in the previous lab; the encode_image function you defined, since you are going to work with local images and need to convert them to base64; and finally the llama32pi function, which takes the prompt and the image URL, forms the messages, calls the llama32 function, and returns the result.

Now that you have all the helper functions ready, you can start your first use case in this lab. You have three receipts of restaurant orders. Using the same display_image function you used in the previous lab, you can display these images, which are available as the receipt one, two, and three JPEG files. And here are the three receipts.

The goal is to pass each of these images to Llama and ask, "What is the total charge in the receipt?" You can use the same loop as before: convert each image to base64, pass the base64 image and the question to the llama32pi function, get the result, and print it. Here are the results for the three receipts.

You can now pass these results in a prompt and ask Llama to calculate the total cost. For this, you can start with an empty results string, and instead of printing each result, append it to that string and print the final results string. You can then form a new messages list with the role user and the content "What's the total charge of all the receipts below?", passing the results you collected as part of this prompt. You can now pass the messages to the llama32 function, get the response, and print it. Based on the initial results, the calculation is done and the total charge of all the receipts is displayed here. In this example, you passed each receipt image to Llama, got the total charge of each receipt, and then passed all the results together to Llama.
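To make the steps above concrete, here is a minimal sketch of what encode_image and llama32pi might look like; the message format and the llama32(messages, model_size) signature are assumptions based on the previous lab:

```python
import base64

def encode_image(image_path):
    # Read a local image file and return its contents as a base64 string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def llama32pi(prompt, image_url, model_size=90):
    # Form a single user message containing the prompt and the image,
    # then delegate to the llama32 helper defined in the previous lab
    messages = [
        {"role": "user",
         "content": [
             {"type": "text", "text": prompt},
             {"type": "image_url", "image_url": {"url": image_url}},
         ]}
    ]
    return llama32(messages, model_size)
```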
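And here is a sketch of the receipt loop and the aggregation step just described; the file names are hypothetical:

```python
question = "What is the total charge in the receipt?"

results = ""  # collect each per-receipt answer instead of just printing it
for i in range(1, 4):
    base64_image = encode_image(f"receipt-{i}.jpeg")
    result = llama32pi(question, f"data:image/jpeg;base64,{base64_image}")
    results += result + "\n"
print(results)

# Re-prompt Llama with the three per-receipt answers to get the total
messages = [
    {"role": "user",
     "content": f"What's the total charge of all the receipts below?\n\n{results}"}
]
response = llama32(messages, 90)
print(response)
```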
Now let's see another way of doing this. The goal here is to merge all three images into a single image, pass that image to Llama, and get the response directly based on it. To merge these three images, a merge_images function is defined in utils that takes three images and merges them. Let's also load pyplot from matplotlib, call merge_images with the three images, and finally display the merged image.

The dimensions of the merged image are 5760 by 2560. For Llama 3.2, if an image has a dimension larger than 1120 pixels, you should resize the larger dimension to fit into 1120 pixels while maintaining the aspect ratio. For this, a resize function is defined in utils. You can call the function, pass the merged image, and get the resized image. Your new resized image is 1120 by 497 pixels and is now ready to be passed to Llama.
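The actual implementations live in utils, but a plausible sketch of both helpers, assuming PIL and a side-by-side paste (the function names and the white background here are assumptions), might look like this:

```python
from PIL import Image

def merge_images(path1, path2, path3):
    # Hypothetical sketch: paste the three receipt images side by side
    images = [Image.open(p) for p in (path1, path2, path3)]
    width = sum(img.width for img in images)
    height = max(img.height for img in images)
    merged = Image.new("RGB", (width, height), "white")
    x = 0
    for img in images:
        merged.paste(img, (x, 0))
        x += img.width
    return merged

def resize_image(image, max_dim=1120):
    # Scale the larger dimension down to max_dim, preserving the aspect
    # ratio; e.g. 5760 x 2560 becomes 1120 x 497
    scale = max_dim / max(image.width, image.height)
    if scale >= 1:
        return image  # already small enough
    return image.resize((int(image.width * scale), int(image.height * scale)))
```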
With the resized image ready, you will first convert it to base64, and with the question "What is the total charge of all the receipts below?" you will call the llama32pi function, pass the question and the base64 image, get the result, and print it. Here is the breakdown of all the receipts one by one, and then the total charge of all the receipts. Note that because of the resizing, the overall quality of the images might have been reduced, and the numbers might be slightly different from the ones you would get by analyzing each image individually.

Another use case you will be working on is choosing the right drink. Here is the image you're going to use in this use case: two drinks with different nutrition facts. Here's the question: "I'm on a diet. Which drink should I drink?" You convert the image to base64, pass the base64 image and the question to the llama32pi function, get the result, and print it. And here's the response, with detailed information about the nutrition facts of each drink. Based on the nutrition information, drink one appears to be the better choice for your diet, as drink one has zero calories compared to drink two's 130 calories.

In some use cases, you may want the response in a structured format like JSON. Let's generate a JSON representation of the nutrition facts of the two drinks. The question in this case will be "Generate nutrition facts of the two drinks in JSON format for easy comparison." You'll pass this question and the base64 image, get the result, and print it. And here are the nutrition facts of drink one and drink two in JSON format.

Another multimodal use case you will work on is understanding an architecture diagram and implementing it in code. Here's the diagram you will use in this use case. The question is: "I see this diagram in the Llama 3 paper. Summarize the flow in text and then return a Python script that implements the flow." You will convert the image to base64, pass the base64 image and the question to the llama32pi function, get the result, and print it. Here's the response from the model. Based on the diagram provided, the model identified five stages, which are listed here, along with Python code that implements the flow.

The next use case is extracting the information from a chart and converting it to an HTML table. Here's the image you're going to use in this use case: a chart that compares the speed of different LLMs. Here's our prompt: "Convert the chart to an HTML table." You'll convert the image to base64, pass the image and the question to the llama32pi function, get the result, and print it. And here's the result: an HTML table that lists the different models and their speeds. You can display this HTML table and see the result using HTML from IPython.display. If you put the result in a single-line minified string, you can display the final HTML table and see the result.

Let's see another use case. You have this image of a fridge and this question: "What is in the fridge? What kind of food can be made? Give me two examples based only on the ingredients in the fridge." You can convert the image to base64, pass the question and the base64 image to the llama32pi function, get the result, and print it. And here is the response. The fridge contains a variety of ingredients, including vegetables such as lettuce, cucumber, and tomatoes, and fruits like apples. With these ingredients, two possible dishes are recommended here: a fresh salad with lettuce, cucumber, and tomatoes, and a fruit platter with sliced apples.

Let's now ask a follow-up question about the fridge: "Is there a banana in the fridge? Where?" To prompt the model, you can form the higher-level messages and ask your follow-up question. Here is the initial messages list that was created in the llama32pi function you called earlier, with the initial question and the image URL. You will need to add the result of the previous question with the role assistant, and then add your new question. You can now pass these messages to the llama32 function, get the result, and print it. And here's the response: "There is no banana in the fridge." Which is great.

Since you will have other use cases with follow-up questions, you can wrap this code in a helper function and reuse it. You can call this function llama32_reprompt_image, with all the arguments that are needed to form the messages; copy the messages in, add an optional model size parameter, and finally return the result. Now you can use this function whenever you have a follow-up question and need to re-prompt Llama.
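Based on that description, a minimal sketch of the re-prompt helper might look like the following; the exact function name and message format are assumptions:

```python
def llama32_reprompt_image(question, image_url, result, new_question, model_size=90):
    # Rebuild the original conversation, add the model's earlier answer
    # as the assistant turn, then append the follow-up question
    messages = [
        {"role": "user",
         "content": [
             {"type": "text", "text": question},
             {"type": "image_url", "image_url": {"url": image_url}},
         ]},
        {"role": "assistant", "content": result},
        {"role": "user", "content": new_question},
    ]
    return llama32(messages, model_size)
```

For the fridge example, you would pass the original question, the image URL, the first result, and the follow-up "Is there a banana in the fridge? Where?".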
Another use case you will work on is an interior design assistant. You have this image, and here is a question about it: "Describe the design, style, color, material, and other aspects, and also list all the objects in the photo." You can convert the image to base64, pass the question and the image to the llama32pi function, get the result, and print it. And here is the description and the list of all the items in the picture.

Let's now ask a follow-up question. Here is our new question: "How many balls and vases are there? Which one is closer to the fireplace, the balls or the vases?" To ask the follow-up question, you can use the re-prompt function you defined, passing the initial question, the image URL, the result you got for the initial question, and the new question. Get the new result and print it. Let's display the image here again and compare the response with the image. There are three balls and one vase, and the balls are closer to the fireplace than the vase. It seems that in this case, Llama 3.2 understood depth and distance better than it did counting.

Let's see one more cool use case. You have this image of handwritten elementary math homework, and the goal is to prompt Llama to grade this homework assignment and calculate the final score. Here is a detailed prompt for grading the homework, providing feedback, and calculating the final score. You will convert the image to base64, pass the prompt and the base64 image to the llama32pi function, get the result, and print it. And here is the response from the model, which first calculates the correct answer for each problem, then compares it with the written answer and provides feedback, and finally calculates the total score, which in this case was 100% correct. If you check one of the example problems here, for example the last problem, problem twelve, the total is 79 plus 15, which should be 94. Since the written answer is 94, it is marked as correct. So, as you see, you can use Llama 3.2 for grading handwritten homework.

The last use case in this lab is tool calling with an image. Llama 3.2 vision models don't support combining tool calling with image reasoning, meaning the models only provide a generic answer without tool calling. So you have to first prompt the model to reason about the image, and then prompt it separately to make the tool call. You're going to use an image of the Golden Gate Bridge to get the current temperature and weather conditions there. Llama cannot help you with this question directly, but you can first ask Llama, "Where is the location of the place shown in the picture?" You convert the image to base64, pass the question and the base64 image to the llama32pi function, get the result, and print it. The location of the place shown in the picture is San Francisco, California: the Golden Gate Bridge.

Since Llama 3.2 supports function calling, you can take the location information you got from the image and pass it to Llama again. Here's our new question: "What's the current weather in the location mentioned in the text below?" Let's print and see it: this is our question, and this is the location info. The prompting format for tool calling will be discussed in detail in the tool calling lesson. To enable the built-in tool calling, you'll need the formatted current date, and in your messages, for the role system, you will provide "Environment: ipython" to enable tool calling and then provide the list of built-in tools, Brave Search and Wolfram Alpha. You will see all of this in detail in the tool calling lesson. The knowledge cutoff date is given, and today's date is passed here. Now you can add the weather question with the role user, then pass the messages to the llama32 function and print the result. As you see, the model decided that answering this question requires a search, and the response shows the Brave Search call with the proper query: the current weather in San Francisco, California. You're not going to make the actual call in this lab. In the tool calling lesson, you will make this search tool call, receive the results, and use them to re-prompt Llama and get the current weather related to your prompt.

All right. In this lesson, you worked on eight more advanced image reasoning use cases, including analyzing restaurant receipts with multiple images to calculate the total charge, and combining image reasoning with tool calling. In the next lesson, you will learn Llama's prompting format in detail. As mentioned, for non-multimodal tasks, Llama 3.2 is exactly the same as Llama 3.1, so in the next lessons we are going to use Llama 3.1 models. See you in the next lesson.