In this lesson, you will learn how to interact with videos using Gemini models, such as extracting the title or specific content that would typically require watching the entire video. You'll tackle the needle-in-a-haystack problem and learn to solve it using a multimodal model. Let's make magic happen.

In this lesson, we'll talk about how you can use video with the Gemini model to build some cool use cases. First, we need to run our helper functions so that we can use the Gemini API: our authentication code. We'll use the same region again. We need to use the open-source SDK and then initialize the SDK, exactly how we've done it before. Next, we have to import our model again. Just like in the lesson on images, we're going to use Gemini Pro Vision. The Gemini Pro Vision model is able to work with images, video, and text.

Now it's time to dive into our first video use case. In this first use case, you will be a digital marketer. You will work on a video that needs to be posted on a website, so we need a couple of things from this video in order to post it: a title for the video, a description, and some metadata that we can use in the website backend. Let's use the Gemini model for this. Just like we've seen in the first lesson, we need to load our video. Here's a video about Vertex AI and LangChain. Just like in the first lesson, we need a URI and we need a URL. Once you run this code, you can then use IPython again to display the video within the notebook. First, we need to import IPython. Next, we can use IPython to show the video in the notebook. Here we have the video. You can even have a look at the video to see what's in it; this will help you judge whether the model gives a good response or not.

What you will do is send this video together with a prompt to the model. The model will look at what is in the video. Currently, it will not use the audio to understand what's going on. The cool thing here is that we don't need to process the video, meaning we don't need to convert the video into a format that the model can understand, just like with images. Remember, in the past you sometimes needed to convert images into a different format so that the model could read the data and give you a prediction. What is really cool here is that you don't need to do any video pre-processing before you send the video to the Gemini model. For some context: if you have worked with computer vision models, they normally expect the video or the image to be in a certain file format, like MP4 only, or to have a specific height and width. That means that if you try to give them a video they can't handle, you would first convert it into the file format the model expects, such as MP4, and then also resize the video to the height and width the model can handle. But here in this example with Gemini, it can handle pretty much any video size and shape, and also most of the popular file formats, such as MP4, MOV, or MPEG, just to name a few. This makes it easier for us to use.

Next, you will load the video from the URI. In order to do this, we need to import some of the classes that we used before, and we're going to use Part to load the video from the URI. So here we use Part, which is the class that will load from the URI we specified before. This is our URI, and we're going to specify the video file type, video/mp4 in our case. But as mentioned, we can also use other file formats. You can run this.
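As a rough sketch (not the course's exact helper code), the setup and video loading described above might look like this in Python. The project ID, bucket path, and URL are placeholders, and depending on your SDK version the imports may live under vertexai.preview.generative_models instead.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part
from IPython.display import Video

# Initialize the SDK; project and region are placeholders
vertexai.init(project="your-project-id", location="us-central1")

# The multimodal model used in this lesson: it accepts text, images, and video
multimodal_model = GenerativeModel("gemini-pro-vision")

# The same file referenced two ways: a gs:// URI for the model,
# and an https:// URL so IPython can render it in the notebook (placeholders)
video_uri = "gs://your-bucket/vertex-langchain.mp4"
video_url = "https://storage.googleapis.com/your-bucket/vertex-langchain.mp4"

# In a notebook cell, this displays the video
Video(url=video_url, width=600)

# Load the video for the model; no resizing or re-encoding needed
video = Part.from_uri(video_uri, mime_type="video/mp4")
```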
Now that you have imported the video, it's time for you to write the prompt. As we discussed in the previous lesson, structuring our instructions step by step can help improve the output. We can also decompose our prompt into different sections. Let's explain this a bit more. In the next example, we have a couple of tasks for the model to go through. Remember, we need a title for our video, we need a summary, and we need to generate some metadata for our website backend. Also remember, you are a digital marketer working on this video. So what we can do is first specify the role for the model, to make sure the model has more context. Next, you can specify the tasks that the model needs to execute. So we write tasks equals, and make sure it's step by step. Before you write down the tasks, let's provide a bit more context for the model: we're going to add the video to our website, and before we can do this, we need to complete a few tasks. And let's ask the model to provide some structure so that we can take each of the answers from it.

Okay, now it's time for the tasks. First, we need the title of the video. Second, let's ask the model to write a summary. And third, you need metadata for your website backend. Our backend only handles JSON, so let's ask the model to return the metadata in JSON format, with a title, a short description, the language of the video, and the company that created the video. As you can see, we've split our prompt into two pieces: the role and the tasks.

We now have our prompt. Let's put everything together and see what response we get. contents_1 equals, and we can now stitch everything together: our video goes first, just like with the images, then our role, then our tasks. You run contents_1. Next, you also set your configuration and your temperature parameter. Feel free to change this to a higher temperature, or run it as is first, then change it and run it again to see how the response differs. Ask yourself: do you still remember what temperature does? We discussed it in the previous lesson.

You now have contents_1, you have your configuration, and you either updated the temperature or left it as is; that's up to you. Next, let's generate the response. For this, you will use the multimodal model and generate_content, and you pass in your contents and your configuration, and you set stream to False. Feel free to change this to True once and see how streaming differs. Now you can print the response: print responses dot text. You probably noticed that we do this differently than in the previous lesson. This is because you have set stream to False, meaning we don't have multiple objects that we need to iterate over and print one by one; we just have one object to print.

Okay, let's print the response. Good, let's have a look at this. The title is "Build AI-Powered Apps on Vertex with LangChain". This is correct; this is the title of the video. We also get a short description of what the video is about. After that, we get the metadata. Our metadata, as you can see, is in JSON format: one object with key-value pairs for the title, the short description, the language, and the company that created the video.
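To recap this first use case, here's a minimal sketch of how the pieces might fit together, continuing from the video Part loaded earlier. The role and task wording below is illustrative rather than the lesson's exact prompt.

```python
from vertexai.generative_models import GenerationConfig

# Role and tasks as separate prompt pieces (illustrative wording)
role = "You are a digital marketer preparing a video for publication on a website."
tasks = """We want to add this video to our website. Before we can do this,
we need to complete a few tasks. Provide the answers in a structured way:
1. Give a title for the video.
2. Write a short summary of the video.
3. Return metadata for our website backend in JSON format, containing the title,
   a short description, the language of the video, and the company that created it."""

# Video first, then the text pieces
contents_1 = [video, role, tasks]

# Generation configuration; try different temperatures and compare
generation_config = GenerationConfig(temperature=0.2)

responses = multimodal_model.generate_content(
    contents_1,
    generation_config=generation_config,
    stream=False,
)

# With stream=False there is a single response object to print
print(responses.text)
```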
Remember how I said that we can split our prompt into role, tasks, and format? You probably noticed that there's no format variable here. Having worked with these models alongside the Google team on various use cases, we've structured prompts with the role first, the tasks second, and then the response format. So we have role, task, and then format. In our experience, this seems to work well for most of the models we've used. In my experience as a developer, you can make a prompt more reusable if you split those up into separate variables, each containing a piece of your prompt text. This also makes it easier for you to experiment and iterate on your prompts, because you only need to update one of these variables. Another advantage is that if you bring this into production and you need to update something in your production system, say the role, then you only have to update that one variable. Gemini makes it easy to decompose a prompt into its individual components because it takes in a list of multiple text prompts. Since you can give it a list of strings, you don't have to put the role, task, and format into a single prompt.

Let's do that now. Let's add this and experiment with how it influences the output. We're going to ask the model to output the metadata in JSON format. So: please output the metadata in JSON, and I'm going to remove that instruction from our tasks. So I'm going to update our tasks variable here: you update the tasks and you specify the format_json variable. We now also need to add this variable to the contents; format goes after tasks. Also rerun this one. You don't need to rerun your temperature unless you are playing with it. As we discussed before, I wouldn't change too much at once. Once you've updated the prompt, you may want to keep the temperature as is, to first see how the updated prompt influences the output. After that, you can also play around with temperature a bit more. I'll leave it as is for now, but it's up to you if you want to do the same. Rerun responses; we'll keep stream set to False for now. And then I'm going to print the response to see how it's different. You might want to give the model some time to think. You can see it's a JSON response with metadata that is very similar to what we've seen before. So separating the format into its own variable hasn't changed too much for this response; it just makes it easier for you to manage. If you want to switch to a different format, like CSV, you can just change this one variable.
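As a sketch of that refactor, building on the earlier snippet, the format instruction becomes its own variable and is appended to the contents after the tasks (the wording is again illustrative):

```python
# Pull the output-format instruction out of the tasks into its own variable
# (illustrative wording), so it can be changed independently later.
tasks = """We want to add this video to our website. Before we can do this,
we need to complete a few tasks. Provide the answers in a structured way:
1. Give a title for the video.
2. Write a short summary of the video.
3. Provide metadata for our website backend: the title, a short description,
   the language of the video, and the company that created the video."""
format_json = "Please output the metadata in JSON format."

# The format variable goes after the tasks in the contents list
contents_1 = [video, role, tasks, format_json]

responses = multimodal_model.generate_content(
    contents_1,
    generation_config=generation_config,
    stream=False,
)
print(responses.text)
```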
Let's go to the next use case. In the next example, you will do a few things. We'll have a video, and we'll ask the model a few questions about that video. You will write three questions, and the model will use the answer to question one to answer questions two and three. Okay, let's start writing the code for this. First, you need to specify the URI and the URL for the video, just like we've done before. We can display the video and have a look at what it's about. If you watch this video, you can see it's about a regression model. Next, we load the video from our URI; this is video_2. Next, we'll write a prompt. In this example, you will only use one variable for your prompt, which is prompt, and here you will write your prompt. You will use a very basic instruction: you will ask the model to look at the video and answer the following questions.

So the questions are: question one, which concept is explained in the video? Question two, and this is where it gets a bit more tricky, based on the answer to question one, can you explain the basic math of this concept? So we're giving no hints about what this video is about. Question three, and this is going to be even more tricky, can you provide a simple scikit-learn code example explaining the concept? What we've seen with large language models in the past, and with other models, is that sometimes you don't want to ask too many questions at once; sometimes the performance is better if you ask each question one by one. Here, we're going to combine all three questions, plus we're going to send the video.

Next, let's combine all of this into our contents; let's call this contents_2. The video goes first, so we have video_2, and then you put your prompt. Next, you call the model with your video and your prompt. We keep stream set to False and we pass in contents_2. You can print the output like this: responses dot text. Okay, let's have a look at the output and see if the model is correct about the video. The video explains the concept of linear regression; if you watch the video, you can see that yes, it's about linear regression. Secondly, we have the basic math for linear regression. You can see the equation y = mx + b. You can take out your statistics book and check if the model is right. Now, another exciting piece is that here is simple scikit-learn code explaining the concept of linear regression. You can see the code below. Have a look at the code and see if it's correct, but let's do a test and copy-paste it to see if it runs. You can have a look at your code, see what you think, and maybe even give it a try. When I run this code, I get a weird graph with a bit of a bend in it. So the code runs, but is this a straight linear line? I would argue no. Nice try. For you, the code may or may not run; just give it a try. When you use a model to output code, make sure to always check it.

Okay, let's go to the next use case and look at a video where we're going to ask the model a few questions about the video. But now it's time for you to write the prompt. First, I'll give you a video to work with. Please have a look at the video, so you are able to come up with some prompt questions. Next, load the video from the URI, and now it's time to write a prompt. I'm going to use one variable for this example. Feel free to put everything in one single prompt or to create separate variables like we did before; I'll use one for now. I'll give you a bit of a boilerplate prompt: we'll ask the model to answer a bunch of questions about the video, and I would like the model to present the results in a table with a row for each question and its answer; we like structure. And let's make it even more fun and ask for the table in Markdown format, so we can render it in our notebook to check the formatting. Now it's time for you to write a few questions. I'll give you one example, but feel free to come up with other questions based on the video: what is the most searched sport? I'll try another one: what is the most searched scientist? You can run the prompt. Then you put everything together; you call this contents_4. I'll put my video first. Feel free to change the order and see if you get something different from what I had. Now you call the API; you use generate_content again, with contents_4, and I set my stream to True. It's up to you what you do. Next, you will print the response.
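A sketch of this table-style prompt and the streaming call might look like the following. The video variable name, the bucket path, and the example questions are placeholders, so adapt them to whatever you asked.

```python
# Placeholder name and URI for the video used in this exercise
video_trends = Part.from_uri("gs://your-bucket/trends-video.mp4", mime_type="video/mp4")

# One prompt variable: boilerplate plus a few example questions
prompt = """Look at the video and answer the following questions.
Present the results in a table with a row for each question and its answer.
Return the table in Markdown format.

Question 1: What is the most searched sport?
Question 2: What is the most searched scientist?"""

contents_4 = [video_trends, prompt]

# With stream=True the response arrives in chunks, so iterate over them
responses = multimodal_model.generate_content(contents_4, stream=True)
for response in responses:
    print(response.text, end="")
```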
Okay, here's your answer. Let's first check the questions. What is the most searched sport? Based on the video, the model says it's soccer. I can tell this model is not from the Netherlands, because we like to call it football, but I'm okay with this answer for now. Secondly, what is the most searched scientist? It says Nikola Tesla. You can go and check if the model is correct, but you probably have different questions, because you've written your own prompt. Next, let's check if this Markdown gives us a nice table structure. Copy and paste it into the next cell and make sure the cell is set to Markdown: change it from code to Markdown up here, then run it. And yes, look, we get a nice table with a question column and an answer column. To be honest, I don't think the answer here should be Nikola Tesla; I think it should be Einstein. As we know, these models can sometimes hallucinate a bit. The important question is, what can we do to try to mitigate this? The first thing you can try is changing your prompt to see if that improves the output. You can add a sentence such as: if the answer is not found in the video, say "not found in video". You can rerun the prompt, rerun the contents, and generate a new response. So first make some changes to your prompt to see if that improves your output, or play around with the parameters to see if that helps.

Have you ever scrolled through a video looking for a specific moment? Going back and forth through a 15-minute video can be very time-consuming. Let's figure out if a model can help us out here. You may have heard of the concept of finding a needle in a haystack. In the context of Gemini, needle in a haystack refers to its ability to find specific information (the needle) buried within a massive amount of data (the haystack). As you remember, we talked about large context windows and how they help with processing vast amounts of data. This window allows the model to understand the broader context in which the specific information might be hidden, enhancing its ability to find the needle in the haystack of data.

Okay, let's see if we can find a needle in a haystack, using videos. We want to find specific information within one of these videos, and to make it a bit more fun, let's search for something that is not that obvious in the video. Secondly, we'll ask the model to analyze not just one but three videos, totaling more than 15 minutes of video content, meaning that we have to leverage the large context window. Let's import utils and load the model. As mentioned, we're not going to send one video to the model; we're going to send three. Each of these videos is a lesson from the LLMOps course on the DeepLearning.AI platform. Let's have a look at one of these videos to see what's in it. We're going to load the videos from a URI, just like we've done before, and then we're going to use IFrame from IPython to display the video here. Let's only go through one video for now; we'll have the model go through all three. Okay, as you can see, this is a lesson from the LLMOps course. These are three videos, three lessons about different topics. Feel free to scroll through the video to see what's in there. We've now loaded our videos.
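A sketch of loading the three lesson videos might look like this. The bucket paths and URL are placeholders for the LLMOps lesson recordings.

```python
from IPython.display import IFrame

# Placeholder URIs for the three LLMOps lesson videos
video_uri_1 = "gs://your-bucket/llmops-lesson-1.mp4"
video_uri_2 = "gs://your-bucket/llmops-lesson-2.mp4"
video_uri_3 = "gs://your-bucket/llmops-lesson-3.mp4"

# Display just the first video in the notebook to see what it's about
video_url_1 = "https://storage.googleapis.com/your-bucket/llmops-lesson-1.mp4"
IFrame(video_url_1, width=640, height=360)

# Load all three videos for the model
video_1 = Part.from_uri(video_uri_1, mime_type="video/mp4")
video_2 = Part.from_uri(video_uri_2, mime_type="video/mp4")
video_3 = Part.from_uri(video_uri_3, mime_type="video/mp4")
```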
Now it's time to write some prompts, and let's use some of the design principles we discussed in lesson two. First, let's set a role for the model. In this use case, we want the model to be specialized in analyzing videos and finding a needle in the haystack. Next, we can add an instruction before we provide the videos to the model. The instruction says: here are three videos, each a lesson from the LLMOps course from DeepLearning.AI; base your answers only on the videos. By providing this instruction, it is clearer to the model what we expect from it.

Next, let's write our questions. You are going to ask the model two questions. The first question is: create a summary of each video and what is discussed in it, limited to a maximum of 100 words per summary. As discussed in our prompt design tips, setting expectations for the model helps you get a better answer. In this case, we want the summary to be short, so we ask the model to limit the answer to 100 words. The second question is: in which of the three videos does the instructor run and explain this Python code, bigquery_client.query? It's a bit of a curveball, because this code is in the course as far as I know, but it's not in the course exactly like this: in the video there's a parameter inside the call as well. We also want the model to tell us where in the video we can find the code, so we can check the answer.

Let's now put everything together. Just like we've done before, we'll create a list. First, we give our role; this is like setting the stage. Secondly, we set the instruction, just to manage expectations. After the instruction we provide the first video, video_1, then video_2, and then we add video_3. We end with our questions. Okay, so now we've put everything together: we have the role, we are providing an instruction to the model, we have videos one, two, and three, and the questions. Now you can send all of this to the model. We're going to use multimodal_model dot generate_content again, and we set stream to True. Okay, now it's time to send all of this to the model and check the response. The model is getting a lot of information: three text prompts and three videos. These videos are quite lengthy, so it might take some time before you see the response coming in; the model needs to process all of this data.

Okay, here you can see the output. For each of the videos, we have a summary of what is discussed. The first video is indeed about LLMOps concepts. The second video focuses on data operations; also correct. And the third video focuses on automation and orchestration. So the summaries are all correct. But now let's have a look at the needle-in-the-haystack search. We asked the model to look for this code in the videos, and the model is telling us it's in the second video, at timestamp four minutes and 19 seconds. So let's go to that video and scroll to 4:19. As you can see, the code does appear here, but with a parameter inside the call. So even though the code we provided isn't written exactly as it appears in the video, the model was still able to find it across the three videos and give us an exact timestamp of where we can find it. This way, we can easily search videos and find the information that we're looking for. If I had to do this myself, it would take a very long time, and I might have missed it, because it's a lot of video. Okay, so feel free to play around, not only with the prompts; you can also try a different order of the content. For example, you could move the role to the end.
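Before you do, here's a minimal sketch of how the whole call might be assembled, so you can see where the role sits in the contents list. The role, instruction, and question wording below are paraphrased from the lesson rather than quoted verbatim.

```python
# Role and instruction as separate pieces (paraphrased wording)
role = ("You are an expert video analyst, specialized in analyzing videos "
        "and finding specific information in them.")
instruction = ("Here are three videos. Each is a lesson from the LLMOps course "
               "from DeepLearning.AI. Base your answers only on the videos.")

questions = """Question 1: Create a summary of each video and what is discussed
in it. Limit each summary to a maximum of 100 words.
Question 2: In which of the three videos does the instructor run and explain
the Python code bigquery_client.query? Also tell me where in the video I can
find that code."""

# Role and instruction first, then the three videos, and the questions last
contents = [role, instruction, video_1, video_2, video_3, questions]

# Stream the response, since the model has a lot of data to process
responses = multimodal_model.generate_content(contents, stream=True)
for response in responses:
    print(response.text, end="")
```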
Then rerun the cell and all the code below and see how the output changes. You can explore how the order of the content can impact the output of the model. Okay, that was it for lesson five. You learned how you can use videos in your use cases: how you can do video search, how you can find a needle in a haystack, and how you can leverage the large context window. In the next lesson, we will dive into function calling and how it can help you incorporate external data into your use cases.