Before we tackle processing video, let's master its building blocks and the techniques used to analyze them. All right, let's go. At its core, video is really just two things: audio and frame data, a soundtrack plus a sequence of images. So if we can learn to process audio and images individually, we're most of the way to processing video, and that's what this lesson is about, building on that foundation. We'll cover ASR for audio and OCR for images, and how to use these technologies in a data engineering context. Then, in the lab, you'll work hands-on with real audio and slide data and make it searchable and analyzable. Let's start with audio. Audio just isn't well suited to search. There are hours of meetings, calls, and recordings sitting in data storage, full of insights but completely unsearchable. If someone said something important last quarter, you might have a really tough time finding it. Audio is also sequential: you can't skim it, you have to listen to it. The solution is to convert speech to text using ASR, or automatic speech recognition. Once you convert the speech to text, all of our text-based data processing methodologies apply. Under the hood, ASR uses deep learning in the form of transformer models to process audio signals, and modern ASR is able to handle speaker accents, background noise, multiple speakers, and more. So how does ASR actually work? It's actually similar to how LLMs work. It starts with a raw audio signal, the waveform, and maps that to spectral features. You can think of a spectrogram as a visual fingerprint of sound: it shows what frequencies are present at what times. From there, a neural network looks at the visual patterns in that spectrogram and converts them into word pieces, the smallest units of language. Then a language model predicts the most likely word sequences from those pieces. The end result is a text transcription along with metadata like timestamps and speaker labels.
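To make the waveform-to-spectrogram step concrete, here's a minimal NumPy sketch, not part of the lab code. The frame size, hop length, and 440 Hz test tone are illustrative choices, not values from the course:

```python
import numpy as np

def spectrogram(signal, frame_size=512, hop=256):
    """Short-time Fourier transform magnitudes: one spectrum per frame."""
    window = np.hanning(frame_size)
    frames = [
        np.abs(np.fft.rfft(signal[start:start + frame_size] * window))
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]
    return np.array(frames)  # shape: (num_frames, frame_size // 2 + 1)

# A 440 Hz test tone sampled at 16 kHz stands in for recorded speech.
sr = 16000
t = np.arange(sr) / sr  # one second of audio
spec = spectrogram(np.sin(2 * np.pi * 440 * t))

# The strongest frequency bin should sit near 440 Hz.
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sr / 512
```

An ASR model's neural network takes this kind of time-frequency picture as input; the "visual fingerprint" the lesson describes is exactly this 2-D array of magnitudes.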
So you can go from raw audio all the way to structured, searchable text. The dataset we're working with is from the AMI Meeting Corpus. The dataset covers a fictional product design meeting where participants design a new remote control and discuss features and budget related to the product. It's a single meeting split into four parts, A, B, C, and D, and you'll see this structure reflected in our lab code. Each meeting part includes audio recordings, presentation slides, and video from multiple camera angles. In total, there are about 100 files across all three modalities, which makes it a great compact dataset for practicing multimodal processing. Here you see a sample slide and a screenshot of one of the videos, and now let's listen in to a few seconds of an audio recording. "Our agenda today is we're going to do a little opening and then, um, I'm going to talk a little bit about the project. Then we'll move into acquaintance, that's just getting to know each other a little bit, including a tool training exercise." Okay, let's dive in. Let's start by connecting to Snowflake and setting up our session. This cell loads our credentials, creates a Snowflake session, and defines a few variables that we'll reuse throughout the lab, like our database, warehouse, and the internal stage where our data files are stored. Next, we'll create a personal schema based on your username. A schema is a logical grouping of data, tables specifically, and if you want to learn more about schemas, check out database theory material and the Snowflake documentation. In this lab, we'll create a personal schema that you can use as your own workspace within the shared database. So as you can see, this is the schema that I'm going to be using as my workspace throughout the lab. You should see similar output when you run this code. All right, now we're ready to start transcribing some audio files. Here, we define and create the audio_transcripts table,
where we'll store the results of our ASR processing. You'll notice the table has five columns with important metadata from the audio files: meeting_id, meeting_part, the path to the audio file, the actual transcript_text, and the duration of the file. We then create that table within the schema that you created earlier. As you can see, the table has been created and it's called audio_transcripts. This table is going to hold our transcribed audio. Now, we loop through each meeting part and run the AI_TRANSCRIBE function on the audio files. This is the ASR step. It converts each audio recording into text and gives us a duration along with the other metadata that we defined in our table earlier. The results get inserted straight into our audio_transcripts table. I want to call out one thing. Throughout this course, we're going to use Python as much as possible, but some features and functions like AI_TRANSCRIBE are only available through SQL. So you'll see me embed SQL queries into the Python code when I need to. As you can see, it's already processed the audio for the first meeting part, ES2008a, and it's going to proceed to the other meeting parts, B, C, and D, so it'll take a few minutes to execute and complete. Let's take a quick look at what we got back. This pulls up a preview of each transcript along with the audio duration, so you can see a preview of each audio file for each meeting part. You'll note in the preview column that we have a preview of the transcription for each of these audio files, for example, "Okay, good morning, everybody," which corresponds to the transcribed audio from our audio files. Now let's look at techniques for processing image data, starting with one of the most classic: Optical Character Recognition, or OCR.
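Since AI_TRANSCRIBE is a SQL function, the embed-SQL-in-Python pattern looks roughly like the sketch below. It only builds a query string; the stage name, file paths, and the keys on AI_TRANSCRIBE's output object are illustrative assumptions rather than the lab's actual code, so check the Snowflake documentation for the exact output schema:

```python
# Hypothetical stage and file layout; the lab's real paths may differ.
STAGE = "@multimodal_stage"

def build_transcribe_query(part: str) -> str:
    """Build an INSERT ... SELECT that runs AI_TRANSCRIBE on one audio file.

    We assume the transcript lives under a 'text' key and the duration under
    'audio_duration'. Calling the function twice is for readability; a
    subquery would avoid re-running it.
    """
    path = f"audio/ES2008{part}.wav"  # hypothetical file naming
    return f"""
        INSERT INTO audio_transcripts
        SELECT
            'ES2008' AS meeting_id,
            '{part}' AS meeting_part,
            '{path}' AS audio_path,
            AI_TRANSCRIBE(TO_FILE('{STAGE}', '{path}')):text::STRING,
            AI_TRANSCRIBE(TO_FILE('{STAGE}', '{path}')):audio_duration::FLOAT
    """

query = build_transcribe_query("a")
# With a live Snowpark session you would run: session.sql(query).collect()
```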
OCR was designed to solve a really common problem: text trapped in images. Think slides, screenshots, scanned documents. Companies have tons of files stored this way, and we often want to access the information inside them for indexing, search, analysis, or to power new applications. The text is there, but because the file format is an image rather than a text document, you need techniques that can process the image to pull that text out. There's no single technique called OCR. It's really a high-level goal, and over the decades it's been accomplished with whatever the most capable techniques of the day were. Traditional OCR focused on identifying individual characters and matching them to templates. It relied on clean, well-formatted input and was failure-prone. Modern OCR uses deep learning to predict text sequences directly from the full image. It can handle complex layouts, tables, mixed content like diagrams, and even handwriting. The end result is that image data can be readily transformed into text data that you can index, search, and use to power new features and applications. Okay, let's turn to the notebook to see OCR in action. Let's create a table to store the text that we extract from the slide images using OCR. Each row is going to capture the meeting info, the slide file path, and the extracted text. You can see our table was successfully created. Here we loop through each meeting part and use the AI_PARSE_DOCUMENT function to run OCR on every slide image. It's going to read the JPEGs from our stage and extract the text content, storing everything in the slides_ocr table. As you can see, it just extracted the text from the slides for meeting ES2008a, and it's going to iterate and do this for all meeting parts. And now, let's preview the OCR results. First, we show a summary table, then print out the full extracted text for a slide so that you can see exactly what was pulled from the images.
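The OCR step has the same embed-SQL-in-Python shape as the transcription step. This sketch builds one AI_PARSE_DOCUMENT call per slide; the stage name, slide paths, and the assumption that the extracted text sits under a 'content' key are all illustrative, so check the Snowflake documentation for the real output shape:

```python
# Hypothetical stage and slide paths; the lab's real layout may differ.
STAGE = "@multimodal_stage"
slides = ["slides/ES2008a_01.jpg", "slides/ES2008a_02.jpg"]

def build_ocr_query(slide_path: str) -> str:
    # AI_PARSE_DOCUMENT returns a JSON-like object; we assume the extracted
    # text is under 'content' here, purely for illustration.
    return f"""
        INSERT INTO slides_ocr
        SELECT
            'ES2008' AS meeting_id,
            '{slide_path}' AS slide_path,
            AI_PARSE_DOCUMENT(TO_FILE('{STAGE}', '{slide_path}')):content::STRING
    """

queries = [build_ocr_query(p) for p in slides]
# With a live Snowpark session: for q in queries: session.sql(q).collect()
```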
And in this output, you can see extracted content from each slide image, including the title and what was actually on each slide. Now that we've converted both audio and image data into text, let's look at some techniques for managing that text data, beginning with chunking. Semantic search works by using embedding models to map text to vector representations. These are numerical coordinates in a high-dimensional space. But embedding models have input limits, typically somewhere between 512 and 8192 tokens, depending on the model. A full meeting transcript easily exceeds that. So we need to break our text into smaller pieces, chunks, before we can embed them. But there's another reason chunking matters beyond just fitting within model limits: it actually improves retrieval precision. If you embed an entire document as one vector, a search query has to match against everything at once. With smaller chunks, you get more targeted results, because each chunk represents a more focused piece of content. So, how big should your chunks be? There is a real tradeoff here. If your chunks are too large, you risk exceeding the model's input limits, and even if they fit, the embedding ends up diluting the relevance: you retrieve too much loosely related content. Now, if your chunks are too small, you lose context, you fragment the meaning, and your search results end up being too narrow to be useful. The sweet spot is a chunk that captures a complete thought, fits within the model's limits, and retrieves precisely what you're looking for. The simplest chunking strategy is fixed-size chunking with overlap. You just split the text every N tokens or characters. But the key trick is overlap: you include some text from the end of the previous chunk at the beginning of the next one. This prevents cutting sentences mid-thought and ensures you don't lose context at the boundaries. For example, you might use 500-token chunks with a 50-token overlap. This is exactly what we'll do in the lab.
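The strategy above can be sketched in a few lines of plain Python, independent of the lab's Snowflake code. Here chunk size and overlap are in characters, matching the 500/50 values used in the lab; the function name is just for illustration:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks, each overlapping the previous one."""
    step = chunk_size - overlap  # advance 450 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reached the end of the text
    return chunks

# A 1000-character text yields chunks starting at 0, 450, and 900.
chunks = chunk_text("x" * 1000, chunk_size=500, overlap=50)
```

Because each chunk starts 450 characters after the previous one, the last 50 characters of one chunk are the first 50 of the next, so a sentence cut at a boundary still appears whole in at least one chunk.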
So let's turn back to the code and implement this. Here, we create a table to store our transcript chunks and set our chunking parameters: 500 characters per chunk with a 50-character overlap, just like we discussed. Now we apply that fixed-size chunking with overlap. We loop through each transcript and slide a 500-character window forward by 450 characters each step, and that's our 50-character overlap. Then we write all the chunks back into the transcript_chunks table. Okay, great. So we've successfully chunked every transcript for every single meeting. And as a quick sanity check, let's preview our chunks and check the statistics. This is a preview of our transcript chunks sample, which contains information like the chunk_id and the chunk_index that correspond to specific chunks of the transcript, along with a chunk preview on the right. At the bottom, we see a high-level overview of our chunk statistics across all of our meeting parts. For example, you can see that meeting part A had 14 chunks and the average chunk length there was 470 characters. Okay, great. Now that our text has been chunked, let's turn to vector embeddings. In lesson one, we covered how embeddings capture semantic meaning as vectors. Now we apply that to the audio transcripts and slide text that we just extracted. The process is pretty straightforward: take a text chunk, run it through the embedding model, and get a vector. We then store that vector alongside the original text and its metadata, including the source file, timestamp, and modality. The key insight is that audio transcript chunks and slide text chunks both get embedded into the same vector space using the same model. That means a single query can search across both modalities at once, with results ranked by relevance regardless of where the content came from. Now let's turn to our notebook to implement vector embeddings. Remember that embeddings capture the semantic meaning of text as numerical vectors.
So we're going to apply them to both the audio transcript chunks and the slide text that we just extracted. We'll store each vector alongside the original text and its metadata, things like source file, timestamp, and which modality it came from. The key insight is that because both audio chunks and slides get embedded into the same vector space, a single search query can find relevant results across both modalities. And that is the power of this approach. So you can see here that our embeddings have been generated for the transcript chunks and the slide OCR text. As a quick sanity check, let's confirm the embeddings were generated by previewing a few rows with their chunk text and corresponding vectors. These are the embeddings that were generated using the arctic-embed model within Snowflake. Because we're using the same model and the same 768-dimensional space for both transcript chunks and slide text, we're able to compare them directly. Now let's test it out with a semantic search. Here, we embed a natural language query, in this case, "budget for the remote control design," and we use cosine similarity to find the most relevant transcript chunks. What we see here is a preview of the five chunks with the highest similarity scores when compared to the natural language query. You can see which meeting parts they came from, and you can see the actual similarity scores, ranging from 0.77 to 0.82, so you'll notice none is a perfect match. At the bottom, we print out a full chunk, and you'll notice that the term "remote control" is mentioned twice, but the word "budget" is missing, and that's okay. That aligns well with the similarity scores we see in the preview, which are close to one, but not one. And this is what embedding and semantic search unlocks. And here's the same thing, but searching across slide text, using the same embedding model and the same approach.
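The ranking step above comes down to cosine similarity between the query vector and each chunk vector. Here's a NumPy sketch using tiny made-up 3-dimensional vectors in place of the real 768-dimensional arctic-embed vectors; the chunk names and numbers are purely illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d stand-ins for real embedding vectors.
chunk_vectors = {
    "chunk_budget":    np.array([0.9, 0.1, 0.0]),
    "chunk_design":    np.array([0.1, 0.9, 0.2]),
    "chunk_smalltalk": np.array([0.0, 0.1, 0.9]),
}
query_vector = np.array([0.8, 0.3, 0.0])  # pretend embedding of the query

# Rank chunks by similarity to the query, highest first.
ranked = sorted(
    ((cosine_similarity(query_vector, vec), cid)
     for cid, vec in chunk_vectors.items()),
    reverse=True,
)
```

As in the lab output, the top results have scores below 1.0: a chunk can point in nearly the same direction as the query without being an exact match.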
This means that one query is able to surface relevant slides just as easily as we did with our transcript chunks. You can see that my query here is "product features about the remote control," and we output a couple of things. The first is a preview of the top-ranking slides that matched our query, so you can see here a path to each slide along with its similarity score. At the very bottom, we print out the content that was actually on one slide. It's an object containing things like the title and the items listed on the slide. For example, you'll notice that the title is "Ranking important aspects of remote control," and item number one says "research urges switch from current functional look-and-feel to fancy look-and-feel remotes." All right, so time for the real magic: cross-modal search. We're going to union audio chunks and slides into a single query and rank everything by similarity, and we're going to get results from both modalities side by side. So again, one search, multiple data types; they're all in the same vector space, which means that we can search across them. You can see that my query was "product features about the remote control," and we're outputting the top 10 results across both audio and slides, ranked by similarity. One thing I'm noticing is that audio consistently ranks at the top in terms of similarity. And that makes sense, because the AMI Corpus dataset has a lot more audio, covering all four meeting parts, than slides, which only showed up during a segment of the meeting. This means that it's far more likely that a search query will match the transcribed audio. You may notice variations like this as you explore multimodal datasets; it's one thing to keep in mind as you build out semantic search on top of them. Great job. Let's recap what we just did. We used ASR to turn raw audio into searchable transcripts, and we used OCR to extract text from slides and documents.
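Cross-modal search is just the same similarity ranking applied to the union of both result sets. A minimal sketch, with hypothetical pre-computed similarity scores standing in for real cosine similarities:

```python
# Hypothetical (modality, item, similarity) results from each per-modality search.
audio_hits = [
    ("audio", "ES2008a chunk 3", 0.82),
    ("audio", "ES2008c chunk 11", 0.79),
]
slide_hits = [
    ("slide", "ES2008a_slide_02.jpg", 0.74),
]

# Union the two modalities and rank everything by similarity, highest first.
combined = sorted(audio_hits + slide_hits, key=lambda hit: hit[2], reverse=True)
top = combined[:10]  # one ranked list spanning both modalities
```

Because both modalities share one vector space, their scores are directly comparable, which is why a plain sort over the union gives a meaningful combined ranking.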
We chunked that text with overlap so we don't lose context at the boundaries. Then we embedded everything into a shared vector space using the same model. This means a single semantic search was able to surface results across both audio and images. And these are exactly the same building blocks that we'll use when we tackle video in the next lesson.