Welcome to this short course, Multimodal RAG: Chat with Videos, built in partnership with Intel. In this course, you'll learn how to build a question-answering system for multimodal data, specifically a collection of videos. You'll explore multimodal transformer encoder models that merge vision and text data into a unified semantic space, and you'll learn how to create a system capable of interacting with a collection of videos. Starting with the BridgeTower model, which embeds both text and images, you'll generate embeddings and store them in a vector database. Next, you'll develop a RAG pipeline to fetch relevant multimodal content from this database and feed it to a downstream large vision-language model as input context to generate a response for the user. By the end of the course, you will have built an interactive AI system that allows you to chat with your video corpus.

I'm delighted that the instructor for this course is Vasudev Lal, a principal AI research scientist on the Multimodal Cognitive AI team at Intel Labs. Vasudev and his team carry out a lot of research on multimodal foundation models, especially on how scaling up models causes vision and text representations to align with each other. To further support the research and developer communities, they often open-source their work and present detailed technical papers at major AI conferences.

Thanks, Andrew. In this course, you will learn about multimodal embedding models like the BridgeTower model and how to create joint embeddings for image-caption pairs. These embeddings are able to represent both the vision and language modalities in a common multimodal semantic space. You will learn how to process video data for multimodal applications, including transcription with the Whisper model and generating captions using large vision-language models, and how to develop robust retrieval systems that handle complex queries involving text and images, using tools like vector stores and LangChain for data retrieval and similarity search. You will also learn how to use large vision-language models for tasks such as image captioning, answering questions based on visual and textual cues, and maintaining conversation flow. This course is built using APIs for multimodal models hosted by Prediction Guard, a startup that runs its models on Intel's cloud. You'll learn to implement a complete multimodal retrieval-augmented generation, or RAG, system that accepts complex user queries, retrieves relevant video segments, and provides comprehensive responses grounded in specific video frames.

Vasudev will guide you through these concepts, tools, and methods, including the BridgeTower model, the Whisper model, vector stores, LangChain workflows, and Prediction Guard APIs. The topic of multimodal AI excites me for practical reasons: much of the world's data exists as videos and other multimodal documents, like the videos you will watch in this course or the slides and papers I review daily. It can be useful to have AI assistants that can leverage data in all such modalities. This capability is vital for applications in customer support, education, and entertainment, where multimodal interactions enhance the user experience. The models used in this course are open source and can easily be swapped for other open-source or proprietary models.
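For readers who want a concrete picture of the embed-then-retrieve idea described above, here is a minimal sketch. It assumes the openly released BridgeTower checkpoint on Hugging Face (`BridgeTower/bridgetower-large-itm-mlm-itc`) rather than the Prediction Guard-hosted models used in the course notebooks; the frame paths and captions are placeholders, a plain Python list stands in for the vector database, and the model's `cross_embeds` output is taken as the fused image-text vector.

```python
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

# Hugging Face BridgeTower checkpoint (the course uses a hosted equivalent).
MODEL_ID = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(MODEL_ID)
model = BridgeTowerForContrastiveLearning.from_pretrained(MODEL_ID)

def embed(image: Image.Image, text: str) -> torch.Tensor:
    """Joint BridgeTower embedding for one image-text pair."""
    inputs = processor(image, text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # cross_embeds is assumed here to be the fused image+text vector
    # in the shared multimodal semantic space.
    return out.cross_embeds.squeeze(0)

# Toy in-memory "vector store": one (frame, caption) entry per video segment.
# Paths and captions below are placeholders for illustration only.
segments = [
    ("frames/seg_0001.jpg", "the speaker introduces multimodal embeddings"),
    ("frames/seg_0042.jpg", "a diagram of the RAG pipeline is shown"),
]
store = [(path, text, embed(Image.open(path).convert("RGB"), text))
         for path, text in segments]

# Retrieval: embed the user query (paired with a blank image as a rough
# stand-in, since BridgeTower expects both modalities) and rank stored
# segments by cosine similarity.
query = "where is the RAG pipeline explained?"
blank = Image.new("RGB", (224, 224), color="white")
q_vec = embed(blank, query)
best = max(store, key=lambda s: torch.cosine_similarity(q_vec, s[2], dim=0).item())
print("Most relevant segment:", best[0], "-", best[1])
```

In the course itself, the embeddings are written to a real vector store, retrieval is wired up with LangChain, and the top-ranked frame plus its transcript chunk are passed as context to a large vision-language model through the Prediction Guard APIs to produce the final grounded answer.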
Many people have worked to create this course. I'd like to thank Tiep Le, Gustavo Lujan, and Ryan Metz from Intel; Daniel Whitenack, Jacob Mansdorfer, and Florin Patan from Prediction Guard; as well as Diala Ezzeddine from DeepLearning.AI.

In the first lesson, you will interact with the multimodal RAG system, which will allow you to chat with videos through an interactive Gradio app. That sounds great. Let's go on to the next video and get started.