This lesson focuses on use cases involving images. You'll learn how to extract information about items or prices from images, and explore practical applications such as recommending furniture for your living room based on a set of available images, and more. Let's have some fun!

In this lesson, you will learn how to use a multimodal model to work with images and text prompts. Before we dive into that, just like in lesson one, we have to run our setup code. First, we need to get our credentials and make sure we can authenticate. Second, we're going to specify the region in which we're going to use the Gemini API; again, us-central1. Then we're going to import the vertexai SDK in order to interact with the Gemini API, and finally we tie all of this together and initialize the SDK using our credentials, region, and project ID. Once you've executed this, we are ready to get started.

In order to use our model, we first need to import from the SDK again: the GenerativeModel class and the Image class, because we're going to use images in our use case. Next, I'm going to load our multimodal model. In this case, we're going to use Gemini 1.0 Pro Vision, version 001. This model is great for working with images.

Okay, so in this first use case you will not only be using different modalities, like text and images, but you will also reason across these modalities, meaning the model needs to extract information from the images in order to answer each of the questions in the text prompts. Let's have a look and get started.

First, we need to load our first image, which is a bowl of fruit. We will load this local file, a JPEG, using load_from_file. The second image is a list of prices of fruits at my local supermarket; this is also a local JPEG. The cool thing is that we'll ask the model some questions about these two images, and it can only answer them based on what it sees in the images. Now let's take these two images and create a list, so we have a list of images: our fruit image and our prices image.

Now let's have a look at these images using our helper function. Again, we're going to use utils: I'm going to import print_multimodal_prompt, which lets us print the images. Okay, you can see our two images here. First, a bowl of fruit with some bananas and some apples. The second image is a screenshot of fruits and their prices per item in dollars.
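Here is a minimal sketch of the setup and image loading described above. The authenticate() and print_multimodal_prompt() helpers are assumed to come from the course's utils.py, and the image filenames are illustrative placeholders.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Image
from utils import authenticate, print_multimodal_prompt  # course helpers (assumed)

# Authenticate and initialize the SDK with our credentials, region, and project ID.
credentials, PROJECT_ID = authenticate()
REGION = "us-central1"
vertexai.init(project=PROJECT_ID, location=REGION, credentials=credentials)

# Gemini 1.0 Pro Vision handles prompts that mix text and images.
multimodal_model = GenerativeModel("gemini-1.0-pro-vision-001")

# Load the two local JPEGs: the fruit bowl and the supermarket price list.
# (Filenames are hypothetical.)
image_fruit = Image.load_from_file("fruit-bowl.jpeg")
image_prices = Image.load_from_file("fruit-prices.jpeg")

# Preview the images before we write the prompt around them.
print_multimodal_prompt([image_fruit, image_prices])
```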
Now it's time to write some prompts. As we discussed in lesson 3, structuring your prompts can improve model performance, especially when the model has to reason across different modalities. So first, let's create an instruction for the model. We're going to call this instruction_1, and it's the first part of the prompt: "I want to make a fruit salad with three bananas, two apples, one kiwi, and one orange. This is an image of my bowl of fruits." So I'm providing an instruction that I want to create a fruit salad, and I'm telling the model that this is an image of my bowl of fruit. We're also going to add a second instruction, telling the model that what follows is a price list of fruits at my local supermarket.

Now let's create the questions that we want the model to answer. First, we're asking the model: "Describe which fruits, and how many of each, I have in my fruit bowl in the image." Second: "Given the fruits in my bowl in the image and the fruit salad recipe, what am I missing?" Third: "Given the fruits I still need to buy, what would be the prices and the total cost for these fruits?" In order to answer these questions, the model needs to look at the images first. Let's run this.

Now let's put everything together in a list again. We're going to call this list contents, and this is where we combine our images with our prompts. Can you see how I'm using an order? We've talked about this: I'm grouping the instructions with the images. Instruction one sets the stage, says what I want to do, and gives information on the ingredients. Instruction two introduces the price list. And last come the questions that we want the model to answer. Feel free to play around with this order to see how it impacts the model; you could, for example, move the questions all the way to the top. But for now, let's keep the original structure.

We've combined all of these modalities in our list, and now we can have a look at everything we're going to send to the model. So let's print our prompt and our images. Here you can see instruction one with the image of our fruit bowl, the second instruction with the image of the price list, and then the questions we want the model to answer.

Now you're ready to send all of this to the model in order to get a response. I'm going to use the utils function for this again, and like I mentioned before, feel free to go into utils.py to see all the logic that helps us call the API. I'm going to run this, and then you're ready to call the model and get a response.

Okay, that was quick. Let's have a look at the response. The model is telling us we have two bananas and two apples in our fruit bowl, which is correct. To make the fruit salad we need three bananas, two apples, one kiwi, and one orange, so we're missing one banana, one kiwi, and one orange. The prices of these fruits are $0.80, $1.25, and $0.99. Let's check the price list: the banana is indeed $0.80, the kiwi is $1.25, and the orange is $0.99. All correct. So the model was able to understand what we're missing, what we still need to buy, and what the total cost will be. Let's see if the model is correct about the cost: $0.80 plus $1.25 plus $0.99 (which is a great deal for an orange) is indeed $3.04 for the extra fruits.
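Putting the whole flow together, the prompt assembly and model call look roughly like this. The prompt strings are quoted from the lesson; calling generate_content() directly is a stand-in for the utils helper used in the video.

```python
# The two instructions, each paired with its image in the contents list below.
instruction_1 = (
    "I want to make a fruit salad with three bananas, two apples, "
    "one kiwi, and one orange. This is an image of my bowl of fruits:"
)
instruction_2 = "This is a price list of fruits at my local supermarket:"

questions = """Answer the following questions:
1. Describe which fruits, and how many of each, I have in my fruit bowl in the image.
2. Given the fruits in my bowl in the image and the fruit salad recipe, what am I missing?
3. Given the fruits I still need to buy, what would be the prices and the total cost for these fruits?"""

# Group each instruction with its image, and put the questions last.
contents = [instruction_1, image_fruit, instruction_2, image_prices, questions]

# Review exactly what will be sent to the model, then get the response.
print_multimodal_prompt(contents)
response = multimodal_model.generate_content(contents)
print(response.text)
```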
Okay. We've now learned how we can use images in our prompts, how we can structure our prompts, and how we can have the model reason across these modalities and answer questions based on what it sees, in this case in images.

In our next use case, we're going to use Gemini as a recommendation system to recommend a new chair for our room, and we're going to use more images as well. I want to buy a new chair for my living room, so I have four images of chairs: chairs one, two, three, and four. Plus, I have an image of my living room. We combine the images of the chairs in a list, and here we have the image of the room. As you can see, we loaded our room image from a local file; we have to do the same for the images of our chairs. Since we have a list, we call load_from_file for each URI in our list.

In order to print, let's first create a list with the room image, and then extend this list with the list of our chairs. We can then print all of the images. Here you can see them: the first image is our living room, which, as you can see, has a very clean, white design. Then we have the four chairs that we want to pick from to put in our living room.

Okay, now it's time to write our prompt. In our first example, each instruction and question was its own variable. In this case, let's use a different approach: when you're just getting started with Gemini, it might be easier to combine your instructions and images from the start. Here we have a list with the images and our instructions; everything is in one variable. Let's go through it. We're telling Gemini that it's an interior designer, and we're saying: consider the following chairs. Then we have the images of our chairs and the image of our room. For each of these chairs, we're asking Gemini to explain whether it would be appropriate for the style of the room. We've now put everything together in this variable, recommendation_contents.

Let's print everything together and see how it flows. There's the instruction, "You are an interior designer," we have the chairs, and we have the image of our room. When looking at our images in the prompt, we can see there's a lot of white space here. Let's fix that, rerun, and check: much cleaner. With these language models, white space can influence the response, which is why you want to look over your prompt and make sure everything is okay. We're now happy with this, and we can send it to Gemini to get a response. We're going to use our helper function to call the API and get our response.

Okay, let's go through the response to see what Gemini thinks of our chairs. For the first chair: this chair would not be appropriate for the style of the room, because the room is modern with clean lines and the chair is made of wood and metal. Let's scroll up. And yes, wood and metal; I think Gemini is right that it wouldn't fit the style of the room. That's pretty cool, right? Let's pick another one, say the fourth: this chair would be appropriate for the style of the room. The chair has a sleek, modern design that would complement the clean lines of the room, and it is upholstered in a soft, neutral fabric that would match the colors of the room. Okay, let's scroll up to our room and the fourth chair. Yes, this chair would look great in this room. What's impressive to me here is that Gemini doesn't only match the chairs based on color, but also on the materials they're made of and the style of the chair and the room. Feel free to have a look at the answers for chairs two and three to see if you agree with Gemini. Awesome, now I have a personal interior designer; I can go buy a chair.
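Here is a sketch of the recommendation flow just described; the chair and room filenames are illustrative assumptions.

```python
# Load the room image and the four chair images. (Paths are hypothetical.)
room_image = Image.load_from_file("living-room.jpeg")
chair_paths = ["chair1.jpeg", "chair2.jpeg", "chair3.jpeg", "chair4.jpeg"]
chair_images = [Image.load_from_file(path) for path in chair_paths]

# Build one list with the room first, extended with the chairs, and preview it.
all_images = [room_image]
all_images.extend(chair_images)
print_multimodal_prompt(all_images)

# This time, instructions and images go into a single variable.
recommendation_contents = [
    "You are an interior designer. Consider the following chairs:",
    "chair 1:", chair_images[0],
    "chair 2:", chair_images[1],
    "chair 3:", chair_images[2],
    "chair 4:", chair_images[3],
    "and this image of my room:", room_image,
    "For each chair, explain whether it would be appropriate for the style of the room.",
]

# Check the prompt for issues like stray white space, then call the model.
print_multimodal_prompt(recommendation_contents)
response = multimodal_model.generate_content(recommendation_contents)
print(response.text)
```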
Okay, that was an interesting look at how we can use a multimodal model for recommendations. Now let's look at another use case, this time using Gemini 1.5. This is a mid-size multimodal model optimized for scaling across a wide range of tasks, and it performs at a similar level to 1.0 Ultra. The model supports a context of up to 1 million tokens, and the maximum number of images per prompt is 3,000. As I mentioned before, this is great when you need to build applications that understand and process much larger chunks of information: things like lengthy documents, code bases, or extended conversations.

This comes in handy because we no longer need to chunk our data first and then send it to the model chunk by chunk. We can now send lengthy things like whole documents to the model and get a response. Okay, let's import our model, the 1.5 version, 001.

In this example, you'll work on a use case where we need to itemize receipts and check whether the expenses comply with a company's policy. We all know that processing receipts is not the most fun thing to do, so let's see if we can automate it a bit. You must reason across multiple modalities, like images and text, and then combine that information to reach a conclusion. We'll throw in a few variations or edge cases, what I like to call curveballs if you're into baseball, to see how the model performs.

Now let's first read the images and write some prompts, just like we've done before. Our images in this use case are receipts. Let's run this. Next, you can load the images of the receipts: we use Image.load_from_file, give it the path to each image, and load them. So now you've loaded the receipts. Next, you have to load the company policy for business travel expenses. In this case, it's a text file, so we open the local travel policy .txt file and read it into our variable, policy. If you want, you can print policy and check the company policy. Let me clear this, and we'll continue.

So now it's time to do some prompting. First, we're going to set an initial instruction for the model, because we want the model to be truthful and to be transparent if it isn't sure about the answer to a question. If, for example, the model doesn't have enough information to answer a question, we want it to say so. Second, we're going to set a role; in this case, we're telling the model it's an HR professional: "You are an HR professional and an expert in travel expenses." Next comes our assignment. Let me paste it in: "You are reviewing travel expenses for a business trip," and we're asking the model a couple of questions. First, itemize everything on the receipts, including the tax. Second, calculate the total sales tax. Third, extract from one of the receipts only the expenses for the meal the employee had: the employee had a meal with colleagues and only ate the KFC bowl, so we want the model to extract the cost of that food item. Fourth, calculate the amount spent by the others. And fifth, check the expenses against the company policy and flag any issues. So there's a whole bunch of questions, and the model can answer them only based on the images we gave it, plus the company's policy. Let's run this.

Now let's bring everything together in receipt_contents. Again, this is a list of all of our images and prompts: the instructions, the role, and the assignment. As I mentioned, we're throwing in a few curveballs, so let's check what they are by printing out all of the content, the prompts and the images. If we look at the receipts, you can see that some of them are not that great; they're quite low quality, and it's already difficult for me to see what's on them, so let's see if the model can. Second, some of these expenses are outside of the company policy. And third, there was that meal shared with a colleague, so let's see if the model can figure out which meal was the employee's and which meals were the colleagues'.
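Below is a hedged sketch of this use case. The model version string, the receipt and policy filenames, and the exact prompt wording are assumptions based on the description above.

```python
# Gemini 1.5 accepts much larger prompts (up to a 1M-token context).
model_1_5 = GenerativeModel("gemini-1.5-pro-001")

# Load the receipt images. (Filenames are hypothetical.)
receipt_paths = ["receipt1.jpeg", "receipt2.jpeg", "receipt3.jpeg"]
receipt_images = [Image.load_from_file(path) for path in receipt_paths]

# Load the company travel policy from a local text file.
with open("travel-policy.txt") as f:
    policy = f.read()

instructions = (
    "Please be truthful. If you do not have enough information "
    "to answer a question, say so explicitly."
)
role = "You are an HR professional and an expert in travel expenses."
assignment = """You are reviewing travel expenses for a business trip. Based only on the receipts and the company policy below:
1. Itemize everything on the receipts, including the tax.
2. Calculate the total sales tax.
3. The employee had a meal with colleagues and only ate the KFC bowl; extract the cost of that item.
4. Calculate the amount spent by the others.
5. Check the expenses against the company policy and flag any issues."""

# Bring the prompts, images, and policy together in one list.
receipt_contents = [instructions, role, assignment,
                    *receipt_images,
                    "Company policy:", policy]

# Inspect the full prompt, then call the model.
print_multimodal_prompt(receipt_contents)
response = model_1_5.generate_content(receipt_contents)
print(response.text)
```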
Okay, I'm going to clear this output for now, and let's call the model to see how it responds. We're going to call the model and print the output.

Okay, let's go through the output. First, we can see that the model has itemized the receipts, already making my life easier, including the receipt for the meal with the colleagues. It calculated the total sales tax, as I asked. It also extracted the cost of the meal the employee had while with colleagues, and here are the costs for the colleagues, so it separated the employee's meal from theirs. And then it checked all of the expenses, so all of the receipts, against the company policy, and we can see there are some issues. Let's do some digging. We can see that the employee had a green smoothie; this is a problem, flagged as non-reimbursable per company policy. So it's company policy that employees are not allowed to expense green smoothies. Also, there's an issue with the daily limit.

Okay, that's it for lesson four. We've seen how we can use a multimodal model across different use cases: how we can reason across modalities, how we can use a multimodal model to make recommendations, and how we can use the model to extract information from different images and then compare it against a policy. In the next lesson, you will analyze videos, and you will learn what the needle-in-a-haystack problem is and how you can use a model to solve it.