In this lesson, you will explore prompting a multimodal model using text and images. You will learn about various model parameters, how to choose them, and how they influence the model's creativity and consistency. All right, let's dive into the code.

In order to run this notebook, we have to execute some setup code. The first thing we need to do is import our utils and authenticate. Authentication means that we're able to call the Gemini API in the cloud from our notebook environment, so we need to set up some credentials and our project ID. The project contains the resources and the Gemini API that we're using in the cloud for this course. We're also going to specify a region, meaning where our code will be executed. In this case, we're using us-central1.

Next, we're going to import the Vertex AI SDK. The Vertex AI SDK is a Python toolkit that helps you interact with Gemini, so we can use Python to call the Gemini API and get a response in our notebook. We need to initialize the Vertex AI SDK, meaning we tell the SDK which project we're using, in which region we want to use the Gemini model, and we give it our credentials. Credentials mean that the Gemini API knows that this is Erwin and that I'm allowed to use the API.

Okay, now it's time to select our model. In order to use the Gemini API, we have to import the SDK. From the SDK, we'll import GenerativeModel. There are multiple generative models; in this case we're using a Gemini model. Then we have to specify our model: we can say model equals the GenerativeModel class we just imported and select our Gemini model. In this case, we're going to be using Gemini 1.0 Pro, version two. Models are updated over time, so they can have different versions. Let's run these two cells.

Next, we have to import some helper functions from our utils. If you want to understand more about what happens in these functions, have a look at the utils.py file. Once we've imported these, we can call our model. Let's start with a very basic prompt and ask: what is a multimodal model? We're using the model that we just specified here. We can run this, and it will then call the API. Here you see the output. Feel free to change the prompt and play with Gemini Pro to see what kind of responses it gives.

You might be wondering what is happening in this helper function, so let's see how we can call the API using the SDK. First we need a prompt. Once we have a prompt, we can use model.generate_content to call the API with our prompt. We're going to use prompt one here in order to get a response. The cool thing is we can use streaming to get the response, so we can set stream to true. Often with large language models you have to wait for the response to be complete before the model returns it to you as a user. If we enable streaming, we don't have to wait for the complete response to be done; we can start processing fragments as soon as they come in. So let's say you want to build a live chatbot and you don't want your users to wait for the complete response to be finished: you can stream in the response while it comes in. Okay, so we're going to run model.generate_content and call the result response one. When you inspect response one, you can see that it is a generator, a streaming object.
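Putting those pieces together, here is a minimal sketch of the setup and the streaming call, assuming the standard Vertex AI SDK; the project ID, region, and exact model version string are placeholders, and in the notebook the credentials come from the course utils:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder values; in the notebook these come from your own project setup
# and the course's authentication helper.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

# Tell the SDK which project and region to use (credentials can also be
# passed explicitly via the credentials argument).
vertexai.init(project=PROJECT_ID, location=REGION)

# The exact model version string may differ in your environment.
model = GenerativeModel("gemini-1.0-pro")

prompt_1 = "What is a multimodal model?"

# stream=True returns a generator: response fragments arrive as soon as
# they are produced, instead of waiting for the full response.
response_1 = model.generate_content(prompt_1, stream=True)
```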
If we want to get the actual responses, we need to iterate over the generator object using a loop. So let's loop over response one and print each response. What you will see is a JSON response. Now you might be wondering: this JSON doesn't look anything like the text response I received earlier. So let's see how we can get the text response. We have to call the API again to get a response: I'm calling generate_content, and then we loop over the object again, but this time we print response.text, so we're taking only the text response from the model out of the JSON object. And as you can see, we have set streaming to true, so the response comes in on the fly, as soon as it's available.

That's all fun, text to text. But Erwin, you talked about multimodality. Okay, you're right; let's see how we can use an image and text to get a text response from Gemini. First, you need to import a few extra classes from the Vertex AI SDK. In addition to GenerativeModel, we're adding two: Image, which lets us deal with images and send them to the Gemini API, and Part, which is used for multi-content messages, for example when we want to combine text and an image and send them to the Gemini API. Next, we can import our multimodal model. In this case, we're using Gemini 1.0 Pro Vision, which we can use for data like images or video.

Now we need a prompt and an image. We have a very cool image of Andrew, and we're going to ask the model to describe what is in this image. So we load the local image, define the prompt, and then combine the image and the prompt so we can send both to the Gemini API. I'm putting the image first because we've seen that putting the image first and the prompt after gives us a better response. I'll talk about prompting best practices for Gemini later in the course as well.

Let's have a look at the image and the prompt. Here you can see we have an image of Andrew holding a hammer and a power drill, and we're asking Gemini to describe what is in this image. Let's call the API and see how it responds. Both the image and the prompt go to Gemini, and Gemini gives us a response saying "the image shows a man holding a hammer and a power drill." And yes, Andrew is smiling.

Let's ask it a different, more fun question: "what are likely professions of this person?" Next, you can build the contents with the image again; if you want, you can print the image and the prompt again, but you can also skip it. Then you run Gemini Pro Vision to get the output using the new prompt but the same image. So this person is likely a handyman, a carpenter, or maybe a construction worker. With this prompt you can see we're not only doing image captioning, but also asking the model to reason about the profession of this person. It actually tells us that the person is holding a hammer and a power drill, and that both of them are commonly used tools in these professions. So the model thinks it's likely a handyman, a carpenter, or a construction worker because the person is holding these tools.

If you remember, in the past what we had to do was first use an image model to do captioning, and then send that output to a large language model to reason about what is in the image or draw some conclusions about it, like the likely profession of this person, based on those captions.
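As a sketch, the image-plus-text call might look like this, assuming the same SDK imports as above; the model version string and the local file name are illustrative placeholders:

```python
from vertexai.generative_models import GenerativeModel, Image

# Multimodal model; the exact version string may differ.
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")

# Hypothetical local file name, used only for illustration.
image = Image.load_from_file("andrew.jpg")
prompt = "Describe what is in this image."

# Image first, prompt second: this ordering tends to give better responses.
contents = [image, prompt]

responses = multimodal_model.generate_content(contents, stream=True)
for response in responses:
    print(response.text, end="")
```

To ask a follow-up question about the same image, you only swap the prompt, for example "What are likely professions of this person?", and call generate_content again with the new contents.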
Here we take the image and the prompt and send both to the Gemini model to reason about the image, in one step. That was an example of how we can use a text prompt and an image and send them to Gemini.

Now, let's do something else. Let's use a video and a prompt and ask the model to describe what is in that video. In order to do this, we first need a video, and I want to do it differently than what we've done until now: we're not going to use a local video, we're going to load a video from a cloud environment. The reason I'm doing this is that images or videos may not always be available locally. If you build an application with Gemini, you might want to store your images or videos in the cloud, so you have to load them from somewhere else. In this example, we're using a cloud bucket; think of it as a storage environment in the cloud where I can store my images or, in this case, my video. We're using a URI, which tells us where our video lives.

So we now have a URI and a URL, and you might be wondering what the difference is. A URI is a unique string, like the one we see here, that acts as a digital address pinpointing any kind of resource, which could be online or even offline. In this case, it's a storage environment within Google Cloud where our video is stored. A URL is a specific type of URI that identifies resources on the Internet. I'm going to use IPython to display the video, so let's import IPython and display the video so that you can at least have a look at it. Let's run these cells, and here you have the video. You can watch the video from the notebook if you want to see what it's about; we're going to ask the model a few questions that you're only able to answer if you watch the full video.

Once we have the video, we need a prompt. This prompt is a bit more complex than what we've seen before, because we're going to ask a few questions that you can only answer if you see the full video. We're asking: what is the main profession of the person? What are the main features of the phone highlighted? And we also ask in which city this was recorded.

Okay, so we have a prompt and we have our video. Now we have to load our video and combine the contents again, so we combine the video and the prompt and send both to the Gemini API. Just like we've done before, we can use generate_content: we send the video and the prompt, and we set streaming to true. Now let's see how the model responds to our questions about the video. Okay, the model tells us the person in the video is a photographer, the features of the phone are Night Sight and Video Boost, and it was recorded in Tokyo. You can watch the video and see if the model is correct. You can have a look at the code and play around with it to see how the model behaves; you might want to change the prompt a bit.
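Here is a rough sketch of what the video call might look like, assuming the same multimodal model as above; the Cloud Storage URI is a placeholder, not the actual bucket path used in the course:

```python
from vertexai.generative_models import Part

# Placeholder URI; in the notebook the video lives in a Cloud Storage bucket.
video_uri = "gs://your-bucket/your-video.mp4"

# Part wraps the video so it can be combined with text in one request.
video = Part.from_uri(video_uri, mime_type="video/mp4")

prompt = """Answer the following questions using the video only:
What is the main profession of the person?
What are the main features of the phone highlighted?
In which city was this recorded?"""

contents = [video, prompt]

responses = multimodal_model.generate_content(contents, stream=True)
for response in responses:
    print(response.text, end="")
```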
Next, you'll learn about parameters and how you can use them to influence the outputs of the Gemini model. Before you try different model parameters, I want to do a quick overview of some of the key ones. You have probably already read or learned about these parameters, so I'll go through them quickly to make sure everyone is on the same page and has these fundamentals.

Okay, imagine your large language model's output as a restaurant menu. Think of the words an LLM can potentially generate as dishes on a giant menu: each word is a dish, and each dish has a different probability of being ordered. Of course, some dishes are more popular than others. In the image, each of the squares represents a dish and its probability. You don't want to see every single option, maybe just the top five or top ten, the ones that are most likely to be amazing. This is where top K can help. Top K works in a similar way: it's a way to find the best or most relevant results from a large set of probabilities, in this case all of our dishes. Instead of looking at everything, top K focuses on the top few choices that are most likely to be what you're looking for. It works like this. First, ranking: each dish gets a score based on its probability. Then, selecting: the algorithm picks the items with the highest scores, up to the number you want. If you want the top five dishes, you set K to five. The result is a short list of the top K options, which saves you time and effort.

Then we have top P. Top P is a way to control the creativity and randomness of language models by selecting from a smaller pool of likely words. Top P is like telling the cook: I want the best dishes, even if they are not the most popular; keep suggesting dishes until we've covered 80% of the menu's overall deliciousness. The P sets a threshold for the cumulative probability of words, and the model keeps adding words to its list of options until it reaches that threshold. How does this work? As we know, each possible dish has a probability. These probabilities are added up, starting with the most likely word or dish, so the dish with the highest probability. Top P sets a threshold, which could be something like 0.80. We keep adding words and their probabilities until we reach the threshold, meaning we hit 0.8 or go over it; that's when we stop.

Then we have temperature. Temperature is like adjusting the restaurant's ambiance. A low temperature means more deterministic output: a lower temperature, something like 0.2, means the restaurant is calm and quiet; the LLM plays it safe, choosing the most likely dishes, and the output is predictable and focused. A high temperature means more randomness: a high temperature like 0.8 means the restaurant is very lively and bustling; the LLM takes risks, exploring less likely dishes, and the output is more creative and more varied.

Gemini has default values for each of these parameters. The default value for temperature for Gemini Pro Vision is 0.4, and the range is between zero and one. The default value for top K is none, and the range is between one and 40. The default value for top P is one, and the range is between zero and one. My advice is to keep the parameters at their defaults when you get started. From there, you can play with setting the parameters to different values to see how the model output changes. We will go through this in the notebook as well. I think it's best to change one parameter at a time so you can keep an overview of how these changes impact the output.
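To make the three parameters concrete, here is a small, purely illustrative Python sketch of how temperature, top K, and top P act on a toy next-word distribution. This is a conceptual simplification, not how Gemini implements sampling internally, and the probabilities are made up:

```python
import random

# Toy next-word distribution: each "dish" (word) with a made-up probability.
probs = {"hammer": 0.40, "drill": 0.25, "saw": 0.15,
         "wrench": 0.10, "panda": 0.06, "laptop": 0.04}

def apply_temperature(probs, temperature):
    # Lower temperature sharpens the distribution (more deterministic),
    # higher temperature flattens it (more random and creative).
    scaled = {w: p ** (1.0 / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}

def top_k_filter(probs, k):
    # Ranking and selecting: keep only the k most likely words.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    # Keep adding the most likely words until their cumulative probability
    # reaches the threshold p, then stop.
    kept, cumulative = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Apply temperature first, then restrict the candidate pool with top K and
# top P, and finally sample one word from what remains.
candidates = top_p_filter(top_k_filter(apply_temperature(probs, 0.8), 5), 0.8)
weights = list(candidates.values())
word = random.choices(list(candidates.keys()), weights=weights)[0]
print(candidates, "->", word)
```

With the temperature close to zero and K set to one, this sketch would almost always return the single most likely word, which mirrors the consistent outputs you will see in the notebook.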
Let's go back to the code. Before we start playing with the model parameters, let's first run an example for an image use case using the same Pro Vision model we used above. We're going to load a local image of Andrew and send it to the Gemini API. Next, we have to create a prompt: we're going to ask the Gemini API to describe what is happening in the image, and for now the model doesn't need to mention names, just describe the image. Let's run this. Then we need to combine our contents, so the image and the prompt.

Now let's print the image and the prompt. As you can see, this is an image of Andrew, actually an image of Andrew holding a stuffed panda. So let's now ask Gemini what it sees in the image. We can again use generate_content to call the API. We give it the contents, so the image and the prompt, and we set stream to true. We then run this, call the API, and print the response text. Once we get the response, the model says a man is sitting in front of a computer screen. That's correct. He is wearing a blue shirt and has a panda stuffed animal in his arms. Also correct. The man is smiling. Check. And there's a chart on the computer behind him that says "grouping customers." Yes, there's a slide in the background that says grouping customers. That is impressive; I didn't even catch that slide. Thank you, Gemini, for telling me that.

In this case, we used the default parameters. In order to set the parameters to custom values, we need to set the generation config. We're going to use our helper function to make it easier to call the API. Next, we set our parameters through the generation config. The generation config lets us set our parameters to specific values other than the default values we just talked about. In this example, we only change two of the parameters: we set temperature to its lowest value, 0.0, and we set top K to one. Let's run these two cells. We can then use our config with our parameters and call the Gemini API. Here again we have our contents and our model; we're just adding our generation config. Let's see what output we're getting from the model.

Okay, nice. But as you remember, when we set the temperature low, the lowest value possible in this case, it means our output should be more consistent. I'm going to run this code again to see if we're getting a similar output. As you can see, we're getting the same output. The model is consistent, as expected, because we set the temperature low and we set top K to one, so we're getting the token with the highest probability.

Okay, let's change the values of the parameters and go in the other direction. Let's now set temperature high, to one, and set top K to its maximum value of 40. Let's run this, call the API with the new parameter values, and print the responses. As you can see, this output is way more wordy than what we got before. We see some overlap: the model talks about the blue shirt and the stuffed panda, but there's some additional information here. For example, it says he's holding the panda in a way that suggests he's very excited about it, and that it looks like the panda is smiling. Well, you could argue the panda is not smiling. As you can see, this answer is more wordy and more creative. You might see a different answer than I do, because we set the temperature high and top K high as well, so we're getting more creativity.

Let's now look at the next parameter, top P. We keep the other parameters as they are, so temperature stays at one as before and top K at 40, but we're going to add top P. Remember, the value range for top P is between zero and one; we're going to set it to a low value, 0.1. We're doing this because we want to see if we can counterbalance some of the creativity, some of the randomness, that we just saw in this example. Okay, let's run this and call the API. As you can see, we're getting a very similar answer to what we got initially.
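For reference, a generation config like the last one we tried might look roughly like this, assuming the image and prompt are already combined into contents as above; the variable names are illustrative:

```python
from vertexai.generative_models import GenerationConfig

# High temperature and high top K for creativity, but a low top P
# to rein the randomness back in.
generation_config = GenerationConfig(
    temperature=1.0,
    top_k=40,
    top_p=0.1,
)

responses = multimodal_model.generate_content(
    contents,
    generation_config=generation_config,
    stream=True,
)
for response in responses:
    print(response.text, end="")
```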
So we can say that by changing top P to a low value, we counterbalanced some of the creativity and some of the wordiness. Finding the perfect settings for your parameters is still a balance between art and science. You might want to test different parameters to see what works best for your use case. As I mentioned before, if you have a use case like classification, you might want to reduce the randomness: lower temperature, lower top K, and lower top P. If you need more creativity, you can increase these parameters and play around with them.

Gemini also has other parameters you can use to control the output you're getting. Let's look at two more. First, let's have a look at max output tokens. In our generation config, we can set max output tokens, which is the maximum number of tokens that can be generated in the response. For Gemini, a token is approximately four characters, and 100 tokens correspond to roughly 60 to 80 words. Specify a lower value for max output tokens, like one, to get a shorter response, or a higher value, like 2000, to get potentially longer responses. The range for max output tokens is between 1 and 2048.

Okay, let's set our max output tokens to ten and see how it influences the output. Let's call the API and see what response we're getting. In this case, as expected, we're getting a very short response: a man is sitting in front of the computer. Depending on your use case, you might want to control the output you're getting; if your output needs to be short and to the point, you might want to lower it. Be careful, though: the output can get cut off in a very weird place, in the middle of a sentence, for example. So also play around with this parameter to see how it influences the output.

Another parameter we can set is called stop sequences. Stop sequences specify a list of strings that tells the model to stop generating text if one of the strings is encountered in the response. If a string appears multiple times in the response, then the response is truncated where it's first encountered. Strings are case-sensitive. A use case where you might want to use stop sequences is, for example, if you build a chatbot for children: there might be some words that you don't want to return to the end user, so you might have a list of words that you want to filter out. This is where you can use a stop sequence.

Let's create a stop sequence with one string; we know that our image has a panda, so let's use "panda" as our stop sequence. We can then call the API with this stop sequence parameter. As you can see, the response we're getting is "a man is sitting in front of a computer screen, he is wearing a blue shirt and has a" and then it stops. You can guess that this is where the word panda should be: the model saw the word "panda" and stopped returning the output. Just for the record, I love pandas. Who doesn't like pandas? This is just an example. As an exercise, you can change the stop sequence string, so you can change the word, or maybe add one additional string, and then rerun this code. That was it for parameters. In the next lesson, we will dive into image use cases with Gemini.
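As a quick recap of these last two parameters before you move on, a config combining them might look like the sketch below, again assuming the same contents and model variables as before; the stop sequence mirrors the panda example, and max output tokens is simply set to the top of its range to show where the value goes:

```python
from vertexai.generative_models import GenerationConfig

# max_output_tokens caps the response length; stop_sequences cuts the
# response off as soon as one of the listed strings appears.
generation_config = GenerationConfig(
    max_output_tokens=2048,
    stop_sequences=["panda"],
)

responses = multimodal_model.generate_content(
    contents,
    generation_config=generation_config,
    stream=True,
)
for response in responses:
    print(response.text, end="")
```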