In this lesson, you will implement a complete multimodal RAG system capable of handling complex user queries. By the end of the lesson, you will be able to input a query, retrieve corresponding video segments, and receive a text response along with the retrieved frame, providing a comprehensive answer to the user's question. Let's have some fun.

Let's contrast traditional RAG with multimodal RAG. In a traditional RAG system, we combine retrieval and generation with text-only LLMs: we retrieve relevant context, which is purely textual, and incorporate it into a prompt before generating the response using a large language model that can only accept textual input. In a multimodal RAG flow, we extend the traditional RAG pipeline to handle multiple modalities, including text and images.

In this block diagram, we see the individual components we built before. In lesson two, we studied the embedding model, specifically the BridgeTower model that we use. In lesson three, we learned how to preprocess our video data so that it is ready for ingestion with the embedding model. In lesson four, we learned how to populate the vector database, and in lesson five, we learned about LVLMs, specifically the LLaVA model. In this lesson, we will combine all of these modules to build the full multimodal RAG pipeline.

Notice the lock sign on the enterprise data here. This indicates that the data can be private: it can be your own data that neither the embedding model nor the LVLM was trained on, and it can reside in a private data repository. The same is true for the vector database, which can be hosted in a private setting.

You will also notice two other things in this diagram. The purple arrows depict the data indexing process: we take our enterprise (or private) data, process it, and pass it through the embedding model to populate the vector data store. The green arrow indicates the inference flow: after we have deployed an application into production, a user query is also processed through the embedding model, we execute a similarity search against the vector database, and the closest matching entries are retrieved, contextualized with the initial user query, and fed to the LVLM. During inference, we only execute this flow path.

In the notebook for this lesson, we will also practice making our own custom chains using LangChain. For example, consider the block diagram here, where we have a few modules and we want to connect them in a very specific way. If we have a user query, that query needs to go into the retrieval module, but it also needs to pass through, along with the output of the retrieval module, to the prompt processing module. The output of the prompt processing module then goes to the LVLM inference module. In code, we create a chain with three steps. In the first step, the retrieval module output gets combined with the original user query in a dictionary. That dictionary goes to the second block, which is the prompt processing module, and the output of the prompt processing module goes to the LVLM inference module; a minimal sketch of this pattern appears just below.

Okay, let's move to the notebook of this lesson and implement the multimodal RAG system in code. Welcome to the lab for lesson six, the final lab.
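Before diving into the notebook, here is a minimal, self-contained sketch of the three-step chain pattern just described, built from LangChain's runnable primitives. The stub functions below are placeholders that stand in for the retrieval, prompt processing, and LVLM inference modules constructed later in the notebook; they are illustrative assumptions, not the course's actual implementations.

    from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough

    def stub_retrieve(query: str) -> dict:
        # Placeholder for a similarity search against the multimodal vector store.
        return {"transcript": "The astronauts assemble the space station.",
                "frame_path": "frames/frame_0001.jpg"}

    def stub_prompt_processing(inputs: dict) -> dict:
        # Prepend the retrieved transcript to the original user query.
        retrieved, query = inputs["retrieved_results"], inputs["user_query"]
        prompt = f"The transcript associated with the image is '{retrieved['transcript']}'. {query}"
        return {"prompt": prompt, "image": retrieved["frame_path"]}

    def stub_lvlm_inference(inputs: dict) -> str:
        # Placeholder for the LVLM call; a real module would send prompt + image to LLaVA.
        return f"(LVLM answer conditioned on {inputs['image']})"

    retrieval_module = RunnableLambda(stub_retrieve)
    prompt_processing_module = RunnableLambda(stub_prompt_processing)
    lvlm_inference_module = RunnableLambda(stub_lvlm_inference)

    # Step 1 runs retrieval and passes the original query through side by side;
    # step 2 builds the augmented prompt; step 3 generates the answer.
    mm_rag_chain = (
        RunnableParallel({"retrieved_results": retrieval_module,
                          "user_query": RunnablePassthrough()})
        | prompt_processing_module
        | lvlm_inference_module
    )

    print(mm_rag_chain.invoke("What do the astronauts feel about their work?"))

The notebook follows the same shape, with the stubs replaced by real modules backed by the vector store and the LLaVA model.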
In previous labs, we preprocessed the video data to make it suitable for computing multimodal embeddings with the BridgeTower model, and we ingested our entire video corpus into a multimodal vector store. We also learned about large vision-language models like LLaVA, which take both images and text as input to generate textual responses. In this lesson, we will connect these two components, the multimodal vector store and the large vision-language model, to build a multimodal RAG system that lets us chat with videos.

Let's import the relevant libraries, including the relevant components from LangChain. Let's also initialize the vector store that we constructed in lessons three and four. If you haven't done that, you can change the table name to the pre-processed demo table that we have provided. Now let's initialize the BridgeTower embedding model. Similar to lesson four, we will initialize a retrieval module associated with our vector store. To test that retrieval is working, let's invoke this query: "What do the astronauts feel about their work?" Let's examine all the metadata retrieved for the query. This includes the path to the extracted frame, the transcript of the video segment, and so on.

For the multimodal RAG system, we want the initial query supplied by the user to be augmented with the retrieved results and passed downstream to the LVLM. Right now we will initialize the LVLM inference module and manually provide the augmented query and the retrieved results to the LVLM. After this manual step, we will have confirmed that all modules work independently, and then we will use LangChain to connect them.

So let's initialize the LLaVA inference module. We now augment the original query with the transcript of the retrieved video segment: we have the transcript of the video segment here, and the original query here. We see that the original query was "What do the astronauts feel about their work?" and the transcript is prepended to it. We now create a dictionary with the augmented query and the path to the retrieved frame. We can provide this dictionary as input to the LVLM inference module and observe the output of the model. Here we see the LVLM respond that the astronauts in the image appear to be proud of their work. The response accurately describes that the astronauts feel proud of their work on the International Space Station.

Now that we have individually tested the retrieval module and the LVLM module, and have manually taken the output of the retrieval module, augmented the query with it, and fed it to the LVLM inference module, we are ready to define a prompt processing module and start connecting all the modules with LangChain. We define a prompt processing function that takes as input the user's original query and the retrieved results. It creates a new prompt template in which the original user query is augmented with the retrieved transcript. The function returns the augmented prompt, which includes the original user query and the transcript, along with the path to the retrieved video frame. We wrap this function as a LangChain RunnableLambda to create the prompt processing module. Let's see how we use this prompt processing module: we invoke it with the user query and the retrieved results as inputs. A sketch of what such a module might look like follows below.
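Here is one way such a prompt processing module could be written, assuming the retriever returns LangChain Document objects whose metadata carries the transcript and the extracted frame path. The metadata key names and the exact prompt wording are assumptions for illustration, not necessarily the notebook's actual code.

    from langchain_core.runnables import RunnableLambda

    def prompt_processing(inputs: dict) -> dict:
        # Augment the user's original query with the transcript of the
        # best-matching retrieved video segment.
        query = inputs["user_query"]
        top_result = inputs["retrieved_results"][0]
        # Assumed metadata keys; the notebook may use different names.
        transcript = top_result.metadata["transcript"]
        frame_path = top_result.metadata["extracted_frame_path"]

        augmented_prompt = (
            f"The transcript associated with the image is '{transcript}'. {query}"
        )
        # Return both the augmented prompt and the frame the LVLM should look at.
        return {"prompt": augmented_prompt, "image": frame_path}

    # Wrap the plain function as a LangChain runnable so it can be piped into a chain.
    prompt_processing_module = RunnableLambda(prompt_processing)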
We notice that the output is the user query augmented with the retrieved transcript, along with the path to the retrieved frame. Note that the output of this prompt processing module is now ready to be fed to our LVLM inference module.

Now we will combine all the modules into a chain to create the multimodal RAG system, and apply the chain to a given user query. In the first step of the chain, we invoke the retrieval module with the query. We also want the query to pass through to the next step, which is why we use RunnableParallel in this manner. The output of the first step is therefore a dictionary with fields for the retrieved results and the user query. This dictionary is fed into the prompt processing module, and the output of the prompt processing module is fed into the LVLM inference module.

Now that we've defined a multimodal RAG chain, let's invoke it with the query we have been working with: "What do the astronauts feel about their work?" Here we get the same response we got in the manual attempt before: the astronauts appear proud of their work on the International Space Station. Let's try a different query: "What is the name of one of the astronauts?" Here we get the response that one of the astronauts is named Robert Behnken.

From these two queries, we see that the multimodal RAG chain built with LangChain is working. However, in a multimodal RAG system we not only want to obtain the final answer in natural language, we also want to display the video segment that was retrieved. This lets us understand the video context that the LLaVA model is using to come up with the answer. At this point, I suggest that you pause the video and try modifying the chain so that you can pass the retrieved frame information through to the output. This exercise will get you more comfortable with using LangChain.

In order to have the path of the retrieved frame in the output, we need to update the chain in the following way: specifically, we add a RunnableParallel here, so that the input to the LVLM inference module is passed straight through the third step into the output (a code sketch of this modified chain appears a little further below). Let's try this new chain with our query. We can display the final text answer as before, but now we also see the retrieved frame and the context that the model used to come up with the response that the astronaut's name is Robert Behnken.

Let's try another query: an astronaut's spacewalk. We invoke this query with the chain we've defined and display the retrieved frame and the response generated by our multimodal RAG system. We notice that the multimodal RAG setup has retrieved an appropriate video frame, and the LVLM has appropriately described the scene, where we see an astronaut wearing a white space suit, standing on a spacecraft, and performing a spacewalk.

Let's try another query that is a little more complex, where we ask to describe the image of an astronaut's spacewalk with an amazing view of the Earth from space behind. We notice that in the retrieved frame there is no image of the Earth behind the astronaut. This goes to show that there are still some deficiencies in such multimodal models today. Our team at Intel Labs, as well as many other research teams around the world, is continuously working to improve such models. Now let's change the query slightly and see if the system does better: we change the query to "an astronaut's spacewalk with an amazing view of the Earth from space behind."
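As promised, here is a sketch of the modified chain with a RunnableParallel as its third step. It reuses the placeholder module names from the earlier sketches (retrieval_module, prompt_processing_module, lvlm_inference_module), which stand in for the notebook's actual objects; the output field names are illustrative assumptions.

    from langchain_core.runnables import RunnableParallel, RunnablePassthrough

    mm_rag_chain_with_retrieved_image = (
        RunnableParallel({"retrieved_results": retrieval_module,
                          "user_query": RunnablePassthrough()})
        | prompt_processing_module
        # The third step now returns both the LVLM answer and its own input,
        # so the retrieved frame path survives to the final output.
        | RunnableParallel({"final_text_output": lvlm_inference_module,
                            "input_to_lvlm": RunnablePassthrough()})
    )

    response = mm_rag_chain_with_retrieved_image.invoke(
        "What is the name of one of the astronauts?"
    )
    answer = response["final_text_output"]           # natural-language answer
    frame_path = response["input_to_lvlm"]["image"]  # path to the retrieved frame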
With this reformulated query, we do see that the model has retrieved an appropriate frame from the video, and when the LLaVA model contextualizes the frame and its transcript, it accurately describes what's happening in the scene, saying that the image captures a breathtaking view of the Earth from space, with an astronaut performing a spacewalk. So we see that these systems are still sensitive to the way the query is formulated.

With this, we come to the end of the notebook for lesson six, our final lesson. At this point, I would like you to go back to lesson one, where we showed the Gradio app and, specifically, the Gradio utility file in which the entire multimodal RAG application is defined. You should now be able to understand every component defined in that file, since we studied each component individually in the five lessons after the first one. I also encourage you to take your own videos and experiment with making your own multimodal RAG application that you can use to chat with your own video corpus. I hope you had fun in this course.