In this lesson, you will interact with a multimodal RAG system that lets you chat with a video through an interactive Gradio app. You will examine the various components of the system, which you will learn how to build from scratch in subsequent lessons. Let's have some fun!

As humans, we use all of our senses to understand the world around us. For the concept of an apple, we understand the sound made when someone bites into it, its taste, its color and texture, and also the fact that apple pie is made from it. I like to say a dog is a dog is a dog, whether the concept of a dog is expressed in a video or in an image, or whether the word "dog" is mentioned in any of the various human languages. Similarly, truly cognitive AI systems need to be able to connect concepts across all modalities.

So let's say we have a video, for example, this video of astronauts returning from a mission. A multimodal AI system should be able to utilize both the visual and language content in such a video. First, it should be able to use the written text appearing in the video. Second, it should be able to use the purely visual information in the video. Third, it should be able to use the information that is spoken about in the video. And fourth, it should be able to hold multi-turn conversations in which we can ask follow-up questions about earlier queries.

So let's look at what this Gradio app looks like in a notebook. Here we can run user queries like "What is the name of the astronaut?" The system will retrieve a video segment that tries to answer this query: "The mission that we have here on the International Space Station, I am proud to have been part of much of the science." And the large vision-language model will reply in natural language. We can run a follow-up query, for example, "an astronaut's spacewalk." Here again, the system will retrieve an appropriate video segment: "...do another spacewalk, and to now have the chance to have done four more..." And we see that the large vision-language model provides a detailed description of the scene. We can also ask a follow-up question, "What does the astronaut say?" In this case, the model appropriately pulls information from the transcript of what was said and responds in natural language.

So let's take a look at the architecture diagram of the system. This is how we will implement a multimodal RAG system, and in this course we will study the individual components of the system. Specifically, in lesson two we will learn how multimodal embedding models can map image-text pairs into a common multimodal semantic space. In lesson three, we will learn how to preprocess our own video data into a form that can be ingested by a multimodal embedding model. In lesson four, we will learn how to ingest our video data into a multimodal vector database and run search queries against it. In lesson five, we will learn about large vision-language models, or LVLMs, which can take both images and text as input. And finally, in lesson six, we will put all the components together to build our multimodal RAG system.

So let's jump into our lesson one notebook. Welcome to the notebook for lesson one. In the lab exercise, we are going to start the Gradio app. Since we've defined this application in a utility file, we will go ahead and import it. Now let's launch the Gradio app.
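As a rough illustration of what launching such an app involves, here is a minimal sketch. The chat function below is a hypothetical placeholder, not the course's actual gradio_utils code, which wires in the full retrieval and LVLM pipeline:

```python
import gradio as gr

# Minimal sketch, not the course's gradio_utils implementation: a placeholder
# chat function standing in for the real retrieval + LVLM pipeline.
def multimodal_rag_chat(message, history):
    # In the real app, this step would retrieve the best-matching video segment
    # and ask the large vision-language model to answer based on it.
    return f"(placeholder) Retrieved a video segment relevant to: {message}"

demo = gr.ChatInterface(fn=multimodal_rag_chat, title="Chat with a video")
demo.launch()  # opens the interactive app in the notebook or browser
```

In the lesson notebook, the app itself comes from the utility file rather than being defined inline, so launching it is just an import followed by a launch call.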
This should be similar to what you just saw in the slides. In this interface, you can see that we can run some queries. Let's try this query: "What is the name of the astronaut?" To answer it, the system is going to retrieve the video segment that can best answer the query, and here it has retrieved that segment. I'll play it. In this video segment, the name of the astronaut is written on screen, and with this information our multimodal RAG system is able to produce an answer in natural language: one of the astronauts is named Robert Behnken.

Before we run another query, if we don't want the system to remember the previous context, let's clear the history. Now let's try another query: "An astronaut's spacewalk." Again, for this query, the multimodal RAG system will first retrieve the video segment that best answers it. In this case it has retrieved this segment, which I'm going to play: "I didn't think I would do another spacewalk, and to now have the chance to have done four more..." That seems like a relevant retrieval for this query.

The interesting thing is that we can also ask follow-up questions here. For example, we can ask the follow-up question "What does the astronaut say?" Because our system maintains a chat history, it can respond to this question based on what the astronaut actually said in the video. The large vision-language model producing this response is contextualizing the transcript, the question, and the video frame, and it describes how the astronaut's statement reflects their experience and appreciation for the incredible view of space, and so on.

With this, I hope you will play with the application. The code for this Gradio app is defined in our gradio_utils file, and I highly recommend taking a short look at it right now; in subsequent lessons, we will cover every component of the multimodal RAG system in detail. In particular, look at the section where we use the BridgeTower model for embedding the video data; we will learn more about this in our notebook exercise for lesson two. We also see that we are initializing a vector store, which we will learn about in lesson four, and that we are using an LVLM, a large vision-language model, which we will learn about in lesson five. Finally, there are many components here that we are putting together using LangChain to create a multimodal RAG pipeline, which we will cover in lesson six.

I encourage you to come back to this file once we're done with the course, because then you will be able to see how the components come together in an interactive Gradio application that lets you chat with videos; a minimal sketch of that kind of composition follows below. So now let's move on to lesson two.
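As promised, here is a minimal sketch of chaining components with LangChain runnables into a retrieve-then-answer pipeline. The two functions are hypothetical placeholders standing in for the course's real pieces (BridgeTower embeddings, the multimodal vector store, and the LVLM), which live in the gradio_utils file:

```python
from langchain_core.runnables import RunnableLambda

# Minimal sketch with hypothetical placeholders; the course's actual pipeline
# (BridgeTower embeddings, multimodal vector store, LVLM) is in gradio_utils.
def retrieve_segment(query: str) -> dict:
    # Real version: embed the query and search the multimodal vector store
    # for the video segment (frame + transcript) that best matches it.
    return {"query": query, "frame": "frame_001.jpg", "transcript": "..."}

def answer_with_lvlm(inputs: dict) -> str:
    # Real version: send the retrieved frame, transcript, and query to the LVLM.
    return f"(placeholder) Answer to '{inputs['query']}' using {inputs['frame']}"

# Compose the two steps into a single runnable pipeline.
pipeline = RunnableLambda(retrieve_segment) | RunnableLambda(answer_with_lvlm)
print(pipeline.invoke("What is the name of the astronaut?"))
```

The same pipe-style composition is what ties retrieval and generation together behind the chat interface you just used.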