Before we tackle processing video, let's master its building blocks and the techniques used to analyze them. All right, let's go. At its core, video is really just two things: audio and frame data, a soundtrack plus a sequence of images. So if we can learn to process audio and images individually, we're most of the way to processing video, and that's what this lesson is about, building on that foundation. We'll cover ASR for audio and OCR for images, and how to use these technologies in a data engineering context. Then, in the lab, you'll work hands-on with real audio and slide data and make it searchable and analyzable. Let's start with audio. Audio just isn't well suited to search. There are hours of meetings, calls, and recordings sitting in data storage, full of insights but completely unsearchable. If someone said something important last quarter, you might have a really tough time finding it. Audio is also sequential: you can't skim it, you have to listen to it. The solution is to convert speech to text using ASR, or automatic speech recognition. Once you convert the speech to text, all of our text-based data processing methodologies apply. Under the hood, ASR uses deep learning in the form of transformer models to process audio signals, and modern ASR is able to handle speaker accents, background noise, multiple speakers, and more. So how does ASR actually work? It's actually similar to how LLMs work. It starts with a raw audio signal, the waveform, and maps that to spectral features. You can think of a spectrogram as a visual fingerprint of sound: it shows what frequencies are present at what times. From there, a neural network looks at the visual patterns in that spectrogram and converts them into word pieces, the smallest units of language. Then a language model predicts the most likely word sequences from those pieces. The end result is a text transcription along with metadata like timestamps and speaker labels.
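To make the waveform-to-spectrogram step concrete, here's a minimal NumPy sketch, not part of the lab code. The frame size, hop length, and 440 Hz test tone are illustrative choices, not values from the course:

```python
import numpy as np

def spectrogram(signal, frame_size=512, hop=256):
    """Short-time Fourier transform magnitudes: one spectrum per frame."""
    window = np.hanning(frame_size)
    frames = [
        np.abs(np.fft.rfft(signal[start:start + frame_size] * window))
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]
    return np.array(frames)  # shape: (num_frames, frame_size // 2 + 1)

# A 440 Hz test tone sampled at 16 kHz stands in for recorded speech.
sr = 16000
t = np.arange(sr) / sr  # one second of audio
spec = spectrogram(np.sin(2 * np.pi * 440 * t))

# The strongest frequency bin should sit near 440 Hz.
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sr / 512
```

An ASR model's neural network takes this kind of time-frequency picture as input; the "visual fingerprint" the lesson describes is exactly this 2-D array of magnitudes.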
So you can go from raw audio all the way to structured, searchable text. The dataset we're working with is from the AMI Meeting Corpus. The dataset covers a fictional product design meeting where participants design a new remote control and discuss features and budget related to the product. It's a single meeting split into four parts, A, B, C, and D, and you'll see this structure reflected in our lab code. Each meeting part includes audio recordings, presentation slides, and video from multiple camera angles. In total, there are about 100 files across all three modalities, which makes it a great compact dataset for practicing multimodal processing. Here you see a sample slide and a screenshot of one of the videos, and now let's listen in to a few seconds of an audio recording. "Our agenda today is we're going to do a little opening and then, um, I'm going to talk a little bit about the project. Then we'll move into acquaintance, that's just getting to know each other a little bit, including a tool training exercise." Okay, let's dive in. Let's start by connecting to Snowflake and setting up our session. This cell loads our credentials, creates a Snowflake session, and defines a few variables that we'll reuse throughout the lab, like our database, warehouse, and the internal stage where our data files are stored. Next, we'll create a personal schema based on your username. A schema is a logical grouping of data, tables specifically, and if you want to learn more about schemas, check out database theory material and the Snowflake documentation. In this lab, we'll create a personal schema that you can use as your own workspace within the shared database. So as you can see, this is the schema that I'm going to be using as my workspace throughout the lab. You should see similar output when you run this code. All right, now we're ready to start transcribing some audio files. Here, we define and create the audio_transcripts table,
where we'll store the results of our ASR processing. You'll notice the table has five columns with important metadata from the audio files: meeting_id, meeting_part, the path to the audio file, the actual transcript_text, and the duration of the file. We then create that table within the schema that you created earlier. As you can see, the table has been created and it's called audio_transcripts. This table is going to hold our transcribed audio. Now, we loop through each meeting part and run the AI_TRANSCRIBE function on the audio files. This is the ASR step. It converts each audio recording into text and gives us a duration along with the other metadata that we defined in our table earlier. The results get inserted straight into our audio_transcripts table. I want to call out one thing. Throughout this course, we're going to use Python as much as possible, but some features and functions like AI_TRANSCRIBE are only available through SQL. So you'll see me embed SQL queries into the Python code when I need to. As you can see, it's already processed the audio for the first meeting part, ES2008a, and it's going to proceed to the other meeting parts, B, C, and D, so it'll take a few minutes to execute and complete. Let's take a quick look at what we got back. This pulls up a preview of each transcript along with the audio duration, so you can see a preview of each audio file for each meeting part. You'll note in the preview column that we have a preview of the transcription for each of these audio files, for example, "Okay, good morning, everybody," which corresponds to the transcribed audio from our audio files. Now let's look at techniques for processing image data, starting with one of the most classic: Optical Character Recognition, or OCR.
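Since AI_TRANSCRIBE is a SQL function, the embed-SQL-in-Python pattern looks roughly like the sketch below. It only builds a query string; the stage name, file paths, and the keys on AI_TRANSCRIBE's output object are illustrative assumptions rather than the lab's actual code, so check the Snowflake documentation for the exact output schema:

```python
# Hypothetical stage and file layout; the lab's real paths may differ.
STAGE = "@multimodal_stage"

def build_transcribe_query(part: str) -> str:
    """Build an INSERT ... SELECT that runs AI_TRANSCRIBE on one audio file.

    We assume the transcript lives under a 'text' key and the duration under
    'audio_duration'. Calling the function twice is for readability; a
    subquery would avoid re-running it.
    """
    path = f"audio/ES2008{part}.wav"  # hypothetical file naming
    return f"""
        INSERT INTO audio_transcripts
        SELECT
            'ES2008' AS meeting_id,
            '{part}' AS meeting_part,
            '{path}' AS audio_path,
            AI_TRANSCRIBE(TO_FILE('{STAGE}', '{path}')):text::STRING,
            AI_TRANSCRIBE(TO_FILE('{STAGE}', '{path}')):audio_duration::FLOAT
    """

query = build_transcribe_query("a")
# With a live Snowpark session you would run: session.sql(query).collect()
```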
OCR was designed to solve a really common problem: text trapped in images. Think slides, screenshots, scanned documents. Companies have tons of files stored this way, and we often want to access the information inside them for indexing, search, analysis, or to power new applications. The text is there, but because the file format is an image rather than a text document, you need techniques that can process the image to pull that text out. There's no single technique called OCR. It's really a high-level goal, and over the decades it's been accomplished with whatever the most capable techniques of the day were. Traditional OCR focused on identifying individual characters and matching them to templates. It relied on clean, well-formatted input and was failure-prone. Modern OCR uses deep learning to predict text sequences directly from the full image. It can handle complex layouts, tables, mixed content like diagrams, and even handwriting. The end result is that image data can be readily transformed into text data that you can index, search, and use to power new features and applications. Okay, let's turn to the notebook to see OCR in action. Let's create a table to store the text that we extract from the slide images using OCR. Each row is going to capture the meeting info, the slide file path, and the extracted text. You can see our table was successfully created. Here we loop through each meeting part and use the AI_PARSE_DOCUMENT function to run OCR on every slide image. It's going to read the JPEGs from our stage and extract the text content, storing everything in the slides_ocr table. As you can see, it just extracted the text from the slides for meeting ES2008a, and it's going to iterate and do this for all meeting parts. And now, let's preview the OCR results. First, we show a summary table, then print out the full extracted text for a slide so that you can see exactly what was pulled from the images.
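The OCR step has the same embed-SQL-in-Python shape as the transcription step. This sketch builds one AI_PARSE_DOCUMENT call per slide; the stage name, slide paths, and the assumption that the extracted text sits under a 'content' key are all illustrative, so check the Snowflake documentation for the real output shape:

```python
# Hypothetical stage and slide paths; the lab's real layout may differ.
STAGE = "@multimodal_stage"
slides = ["slides/ES2008a_01.jpg", "slides/ES2008a_02.jpg"]

def build_ocr_query(slide_path: str) -> str:
    # AI_PARSE_DOCUMENT returns a JSON-like object; we assume the extracted
    # text is under 'content' here, purely for illustration.
    return f"""
        INSERT INTO slides_ocr
        SELECT
            'ES2008' AS meeting_id,
            '{slide_path}' AS slide_path,
            AI_PARSE_DOCUMENT(TO_FILE('{STAGE}', '{slide_path}')):content::STRING
    """

queries = [build_ocr_query(p) for p in slides]
# With a live Snowpark session: for q in queries: session.sql(q).collect()
```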
And in this output, you can see extracted content from each slide image, including the title and what was actually on each slide. Now that we've converted both audio and image data into text, let's look at some techniques for managing that text data, beginning with chunking. Semantic search works by using embedding models to map text to vector representations. These are numerical coordinates in a high-dimensional space. But embedding models have input limits, typically somewhere between 512 and 8192 tokens, depending on the model. A full meeting transcript easily exceeds that. So we need to break our text into smaller pieces, chunks, before we can embed them. But there's another reason chunking matters beyond just fitting within model limits: it actually improves retrieval precision. If you embed an entire document as one vector, a search query has to match against everything at once. With smaller chunks, you get more targeted results, because each chunk represents a more focused piece of content. So, how big should your chunks be? There is a real tradeoff here. If your chunks are too large, you risk exceeding the model's input limits, and even if they fit, the embedding ends up diluting the relevance: you retrieve too much loosely related content. Now, if your chunks are too small, you lose context, you fragment the meaning, and your search results end up being too narrow to be useful. The sweet spot is a chunk that captures a complete thought, fits within the model's limits, and retrieves precisely what you're looking for. The simplest chunking strategy is fixed-size chunking with overlap. You just split the text every N tokens or characters. But the key trick is overlap: you include some text from the end of the previous chunk at the beginning of the next one. This prevents cutting sentences mid-thought and ensures you don't lose context at the boundaries. For example, you might use 500-token chunks with a 50-token overlap. This is exactly what we'll do in the lab.
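The strategy above can be sketched in a few lines of plain Python, independent of the lab's Snowflake code. Here chunk size and overlap are in characters, matching the 500/50 values used in the lab; the function name is just for illustration:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks, each overlapping the previous one."""
    step = chunk_size - overlap  # advance 450 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reached the end of the text
    return chunks

# A 1000-character text yields chunks starting at 0, 450, and 900.
chunks = chunk_text("x" * 1000, chunk_size=500, overlap=50)
```

Because each chunk starts 450 characters after the previous one, the last 50 characters of one chunk are the first 50 of the next, so a sentence cut at a boundary still appears whole in at least one chunk.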
So let's turn back to the code and implement this. Here, we create a table to store our transcript chunks and set our chunking parameters: 500 characters per chunk with a 50-character overlap, just like we discussed. Now we apply that fixed-size chunking with overlap. We loop through each transcript and slide a 500-character window forward by 450 characters each step, and that's our 50-character overlap. Then we write all the chunks back into the transcript_chunks table. Okay, great. So we've successfully chunked every transcript for every single meeting. And as a quick sanity check, let's preview our chunks and check the statistics. This is a preview of our transcript chunks sample, which contains information like the chunk_id and the chunk_index that correspond to specific chunks of the transcript, along with a chunk preview on the right. At the bottom, we see a high-level overview of our chunk statistics across all of our meeting parts. For example, you can see that meeting part A had 14 chunks and the average chunk length there was 470 characters. Okay, great. Now that our text has been chunked, let's turn to vector embeddings. In lesson one, we covered how embeddings capture semantic meaning as vectors. Now we apply that to the audio transcripts and slide text that we just extracted. The process is pretty straightforward: take a text chunk, run it through the embedding model, and get a vector. We then store that vector alongside the original text and its metadata, including the source file, timestamp, and modality. The key insight is that audio transcript chunks and slide text chunks both get embedded into the same vector space using the same model. That means a single query can search across both modalities at once, with results ranked by relevance regardless of where the content came from. Now let's turn to our notebook to implement vector embeddings. Remember that embeddings capture the semantic meaning of text as numerical vectors.
So we're going to apply them to both the audio transcript chunks and the slide text that we just extracted. We'll store each vector alongside the original text and its metadata, things like source file, timestamp, and which modality it came from. The key insight is that because both audio chunks and slides get embedded into the same vector space, a single search query can find relevant results across both modalities. And that is the power of this approach. So you can see here that our embeddings have been generated for the transcript chunks and the slide OCR text. As a quick sanity check, let's confirm the embeddings were generated by previewing a few rows with their chunk text and corresponding vectors. These are the embeddings that were generated using the arctic-embed model within Snowflake. Because we're using the same model and the same 768-dimensional space for both transcript chunks and slide text, we're able to compare them directly. Now let's test it out with a semantic search. Here, we embed a natural language query, in this case, "budget for the remote control design," and we use cosine similarity to find the most relevant transcript chunks. What we see here is a preview of the five chunks with the highest similarity scores when compared to the natural language query. You can see which meeting parts they came from, and you can see the actual similarity scores, ranging from 0.77 to 0.82, so you'll notice none is a perfect match. At the bottom, we print out a full chunk, and you'll notice that the term "remote control" is mentioned twice, but the word "budget" is missing, and that's okay. That aligns well with the similarity scores we see in the preview, which are close to one, but not one. And this is what embedding and semantic search unlocks. And here's the same thing, but searching across slide text, using the same embedding model and the same approach.
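The ranking step above comes down to cosine similarity between the query vector and each chunk vector. Here's a NumPy sketch using tiny made-up 3-dimensional vectors in place of the real 768-dimensional arctic-embed vectors; the chunk names and numbers are purely illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d stand-ins for real embedding vectors.
chunk_vectors = {
    "chunk_budget":    np.array([0.9, 0.1, 0.0]),
    "chunk_design":    np.array([0.1, 0.9, 0.2]),
    "chunk_smalltalk": np.array([0.0, 0.1, 0.9]),
}
query_vector = np.array([0.8, 0.3, 0.0])  # pretend embedding of the query

# Rank chunks by similarity to the query, highest first.
ranked = sorted(
    ((cosine_similarity(query_vector, vec), cid)
     for cid, vec in chunk_vectors.items()),
    reverse=True,
)
```

As in the lab output, the top results have scores below 1.0: a chunk can point in nearly the same direction as the query without being an exact match.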
This means that one query is able to surface relevant slides just as easily as we did with our transcript chunks. You can see that my query here is "product features about the remote control," and we output a couple of things. The first is a preview of the top-ranking slides that matched our query, so you can see here a path to each slide along with its similarity score. At the very bottom, we print out the content that was actually on one slide. It's an object containing things like the title and the items listed on the slide. For example, you'll notice that the title is "Ranking important aspects of remote control," and item number one says "research urges switch from current functional look-and-feel to fancy look-and-feel remotes." All right, so time for the real magic: cross-modal search. We're going to union audio chunks and slides into a single query and rank everything by similarity, and we're going to get results from both modalities side by side. So again, one search, multiple data types; they're all in the same vector space, which means that we can search across them. You can see that my query was "product features about the remote control," and we're outputting the top 10 results across both audio and slides, ranked by similarity. One thing I'm noticing is that audio consistently ranks at the top in terms of similarity. And that makes sense, because the AMI Corpus dataset has a lot more audio, covering all four meeting parts, than slides, which only showed up during a segment of the meeting. This means that it's far more likely that a search query will match the transcribed audio. You may notice variations like this as you explore multimodal datasets; it's one thing to keep in mind as you build out semantic search on top of them. Great job. Let's recap what we just did. We used ASR to turn raw audio into searchable transcripts, and we used OCR to extract text from slides and documents.
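Cross-modal search is just the same similarity ranking applied to the union of both result sets. A minimal sketch, with hypothetical pre-computed similarity scores standing in for real cosine similarities:

```python
# Hypothetical (modality, item, similarity) results from each per-modality search.
audio_hits = [
    ("audio", "ES2008a chunk 3", 0.82),
    ("audio", "ES2008c chunk 11", 0.79),
]
slide_hits = [
    ("slide", "ES2008a_slide_02.jpg", 0.74),
]

# Union the two modalities and rank everything by similarity, highest first.
combined = sorted(audio_hits + slide_hits, key=lambda hit: hit[2], reverse=True)
top = combined[:10]  # one ranked list spanning both modalities
```

Because both modalities share one vector space, their scores are directly comparable, which is why a plain sort over the union gives a meaningful combined ranking.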
We chunked that text with overlap so we don't lose context at the boundaries. Then we embedded everything into a shared vector space using the same model. This means a single semantic search was able to surface results across both audio and images. And these are exactly the same building blocks that we'll use when we tackle video in the next lesson.