Extract structured, queryable information from unstructured images, audio, and video using OCR, Automatic Speech Recognition (ASR), and Vision Language Models (VLMs).

Instructor: Gilberto Hernandez

Build a VLM-backed pipeline that reasons across video frames to generate timestamped scene descriptions and track events over time.
Implement a multimodal RAG application on a real-world dataset, taking raw images, audio, and video into a fully queryable interface with grounded, cited answers.
Images, audio, and video make up a growing share of the data companies generate today, but most pipelines are still built for structured data alone. This course teaches you to build AI-powered pipelines that process multimodal data and turn it into LLM-ready text.
You'll start with the foundations: using ASR to extract transcripts from audio and turning images into LLM-ready text descriptions. From there, you'll see how Vision Language Models generate descriptions from video segments, capturing not just what's visible in a single frame, but what unfolds across a scene over time. You'll then apply these skills to implement a multimodal RAG pipeline that searches across slides, audio, and video from meetings to answer questions about their content. By combining all three modalities, you give LLMs the rich context they need to deliver detailed answers across complex, real-world content.
In detail, you'll:
Every technique you'll learn serves the same goal data engineers have always had: take messy, unstructured data and turn it into something you can query, analyze, and build on.
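To make the "something you can query" goal concrete, here is a minimal sketch, not the course's actual implementation. It assumes the ASR and VLM steps have already produced text, and it uses hypothetical file names, a hypothetical `Segment` record, and a plain keyword match in SQLite in place of the embedding-based retrieval a real RAG pipeline would likely use.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Segment:
    source: str     # hypothetical file name, for illustration only
    modality: str   # "audio", "image", or "video"
    start_s: float  # segment start in seconds (0 for still images)
    end_s: float    # segment end in seconds
    text: str       # transcript or model-generated description


def build_index(segments):
    """Load extracted text segments into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments "
        "(source TEXT, modality TEXT, start_s REAL, end_s REAL, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?, ?)",
        [(s.source, s.modality, s.start_s, s.end_s, s.text) for s in segments],
    )
    return conn


def search(conn, keyword):
    """Return (citation, text) pairs whose text contains the keyword."""
    rows = conn.execute(
        "SELECT source, modality, start_s, end_s, text FROM segments "
        "WHERE text LIKE ? ORDER BY source, start_s",
        (f"%{keyword}%",),
    ).fetchall()
    return [
        (f"{source} [{modality}, {start_s:.0f}s-{end_s:.0f}s]", text)
        for source, modality, start_s, end_s, text in rows
    ]


# Made-up sample data standing in for ASR and VLM output.
SEGMENTS = [
    Segment("standup.mp3", "audio", 0, 12,
            "We reviewed the quarterly budget and hiring plan."),
    Segment("slides_03.png", "image", 0, 0,
            "Slide titled 'Budget overview' with a bar chart."),
    Segment("demo.mp4", "video", 30, 45,
            "A presenter walks to the whiteboard and draws a diagram."),
]

conn = build_index(SEGMENTS)
results = search(conn, "budget")
for citation, text in results:
    print(citation, "->", text)
```

Each hit carries its source file, modality, and time range, which is the kind of metadata a downstream LLM would need to give grounded, cited answers.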
This course is for data engineers and ML practitioners who want to extend their pipelines beyond structured data to handle images, audio, and video. Familiarity with Python, SQL queries, and basic data engineering concepts is recommended.