One major shortcoming of ColPali is that it requires significantly more memory than other techniques in order to store all the vectors for each document. As a result, optimization techniques are used to reduce ColPali's memory footprint. Let's see what those approaches look like. The first is Scalar or Binary Quantization, which converts the floating point numbers in each vector to a condensed format, such as an 8-bit integer or even a single bit. The second approach is Row or Column Pooling: patches are grouped together by rows or columns, and the vectors in each group are pooled, usually by averaging, to create a single vector. The final approach is Hierarchical Pooling, which is essentially a more intelligent version of Row/Column Pooling: patches are grouped with other patches that have similar embedding vectors.

You will start by loading the document embeddings. In the previous lesson, you learned how ColPali converts PDF pages to screenshots and generates multi-vector embeddings. Here, you will use a helper function that handles all of these steps internally: PDF conversion, image loading, and embedding generation. Since these details were covered in the previous lesson, you can focus on what matters for this one: optimization techniques. You will use pre-computed embeddings to keep this lesson moving smoothly, but you can also set LOAD_PRECOMPUTED to False to see the entire process. You will load the ColPali processor to demonstrate how ColPali structures document images as patch grids; this will help you understand the spatial pooling techniques we'll explore later on. Another helper function handles all the complexity, either loading precomputed embeddings from Parquet files or generating them fresh by converting PDFs, loading the ColPali model, and processing images in batches. The load_or_compute_image_embeddings function returns a DataFrame with image paths and embeddings already converted to NumPy arrays. Here is what some of the entries look like.

Since you are working with images, you can also display them on the screen. These slides come from different DeepLearning.AI courses, so if you have ever taken any of them, after completing this one you will be ready to ask questions directly to the slides. Each page generates around a thousand token embeddings. That's a lot of vectors to store and search over, but let's see what the individual shapes of the embeddings look like: it's consistently 1031 vectors per image. You can now implement the different optimization techniques one by one and eventually compare them on the same dataset.

Let's start with scalar and binary quantization. When we calculate the amount of memory needed to store the vectors, a common factor is the size of each individual dimension. Standard embeddings use 4-byte float32 values per dimension, but if you could reduce that size without sacrificing search quality, you would have an easy way to save significant RAM. Scalar and binary quantization techniques aim to represent vectors using either one-byte integers or even just one bit per dimension, and they are widely used for regular dense embeddings from bi-encoders. Scalar quantization compresses float32 values, which are 4 bytes, down to int8 values, which are just 1 byte, achieving four times compression. It works by mapping the continuous float range to discrete integer buckets: for each dimension, the quantizer learns the minimum and maximum values across the entire dataset and then linearly maps this range to the 0 to 255 integer space. For example, if dimension 5 ranges from -0.8 to 1.2 across all vectors, a value of 0.2 would map to approximately 128.
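To make the mapping concrete, here is a minimal NumPy sketch of the idea. It only illustrates the per-dimension min/max mapping described above; it is not Qdrant's internal implementation, and the scalar_quantize function name is made up for this example.

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Illustrative scalar quantization: map each dimension's
    [min, max] range linearly onto the 0..255 (uint8) integer space."""
    # Learn per-dimension ranges from the whole dataset (or a sample).
    mins = vectors.min(axis=0)
    maxs = vectors.max(axis=0)
    spans = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    # Linearly map every float value into 0..255 and store it as one byte.
    quantized = np.round((vectors - mins) / spans * 255).astype(np.uint8)
    return quantized, mins, spans  # mins/spans let you approximate the originals later

# The worked example from the text: a dimension ranging from -0.8 to 1.2,
# where the value 0.2 lands at roughly 128.
vectors = np.array([[-0.8], [1.2], [0.2]], dtype=np.float32)
quantized, _, _ = scalar_quantize(vectors)
print(quantized.ravel())  # [  0 255 128]
```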
This makes scalar quantization dataset-aware: it requires analyzing all vectors, or at least a representative sample, to determine the optimal ranges. Fortunately, vector search engines like Qdrant handle this process internally. We simply configure quantization at the collection level, send our original float embeddings, and the engine handles the compression and transformation automatically.

Binary quantization takes compression even further, reducing each float32 dimension, which is 4 bytes, down to just one bit, achieving a dramatic 32 times compression ratio. The process is simple: positive values become one, while negative or zero values become zero. This extreme compression transforms similarity calculations into efficient bitwise operations. Binary quantization works particularly well when embeddings are centered around zero with relatively symmetric distributions, which is common for normalized neural network outputs. However, this aggressive compression does come with trade-offs. Like scalar quantization, Qdrant handles binary quantization internally: we configure it at the collection level and send original float embeddings, letting the engine perform the binary conversion.

A pooling approach can take advantage of ColPali's spatial structure. Remember from the previous lesson that ColPali processes document images as a 32 by 32 grid of patches. Row pooling averages embeddings along each row of this grid, while column pooling averages along each column. This preserves spatial relationships while dramatically reducing the number of vectors to just 32.

Let's extract the image patch embeddings. The processor creates a sequence of 1024 image patch tokens from a 32 by 32 grid, plus a few special tokens for the model's architecture. For the pooling methods, we don't need these additional tokens, so we can use the image mask from the processor to extract only the image-related token positions. By filtering to just the image patch tokens using this mask, we can reshape the flat sequence of 1024 embeddings into a 32 by 32 by 128 grid. Let's create a helper function that does exactly this. Each position in the 32 by 32 grid corresponds to a patch location in the original image, and each patch has a 128-dimensional embedding. Let's test your reshaping function on the first document's embedding. You've just applied the embeddings_grid function to verify that it correctly transforms the masked embeddings into the expected 32 by 32 by 128 structure.

Once the embeddings are in grid form, you can implement row and column pooling as simple NumPy operations. Row pooling averages each of the 32 rows of the grid, resulting in 32 representative vectors that capture horizontal patterns across the document, and similarly, column pooling averages each of the 32 columns, capturing vertical patterns. Both methods reduce the total number of vectors from 1024 to just 32. It's not that easy to tell anything just by looking at these numbers, but to confirm the shape is correct, you will run row_mean_pooling on the created grid, and it is indeed 32 by 128 dimensions. The same goes for column_mean_pooling: after calling the method, we can see that the shape matches the expected size. Nevertheless, please do not spend much time staring at the raw numbers.
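Expressed in code, both the grid reshaping and the pooling are a few lines of NumPy. The sketch below reuses the helper names mentioned in this lesson (embeddings_grid, row_mean_pooling, column_mean_pooling), but the exact signatures, and in particular how the boolean image mask is obtained from the processor and where the patch tokens sit in the sequence, are assumptions.

```python
import numpy as np

def embeddings_grid(embeddings: np.ndarray, image_mask: np.ndarray) -> np.ndarray:
    """Keep only the 1024 image-patch tokens (dropping special tokens)
    and reshape them into a 32 x 32 grid of 128-dimensional vectors."""
    patch_embeddings = embeddings[image_mask]      # (1024, 128)
    return patch_embeddings.reshape(32, 32, 128)   # (rows, cols, dim)

def row_mean_pooling(grid: np.ndarray) -> np.ndarray:
    # Average the 32 patches within each row -> 32 vectors of 128 dims.
    return grid.mean(axis=1)

def column_mean_pooling(grid: np.ndarray) -> np.ndarray:
    # Average the 32 patches within each column -> 32 vectors of 128 dims.
    return grid.mean(axis=0)

# Example: one page embedding of shape (1031, 128) and a boolean mask that
# marks which of the 1031 token positions are image patches (assumed layout).
embeddings = np.random.rand(1031, 128).astype(np.float32)
image_mask = np.zeros(1031, dtype=bool)
image_mask[:1024] = True

grid = embeddings_grid(embeddings, image_mask)
print(grid.shape, row_mean_pooling(grid).shape, column_mean_pooling(grid).shape)
# (32, 32, 128) (32, 128) (32, 128)
```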
Hierarchical token pooling uses clustering to intelligently group similar embeddings. The algorithm performs hierarchical clustering on the token embeddings, then averages each cluster into a single representative vector. Clusters are created from similar patches, so if you have a lot of background patches of the same color, they should be grouped together. The ColPali engine library has an implementation of this technique available, but let me first show you a simple example of how it works.

Assume you run hierarchical token pooling on an image of nine patches. Each patch gets its own embedding, so our multivector representation is a sequence of nine vectors. If the pool factor is set to two, you create four clusters, as 9 divided by 2 is 4.5, which rounds down to four. I'm making this up, but probably both patches containing ears would end up in the same cluster. Similarly, the patches with just fur are quite similar to each other, so it's likely they would also be grouped together. The patch with the eyes is quite unique, so it's likely to form a single-patch cluster, and so on. Note that patches don't have to be contiguous to form a cluster; in fact, the pooling technique receives a sequence of embeddings with no information about spatial relationships, so there isn't even a way to take position into account.

Since the ColPali engine library you use already implements this technique, you will just create a helper function called hierarchical_token_pooling to have a simpler way of using it on the original embeddings. Let's run it on one of the examples in the dataset to see how much memory we're able to save with this method. With a pool factor of two, you've reduced the 1031 vectors down to 515, cutting memory usage nearly in half while preserving the semantic information through clustering.
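The lesson's hierarchical_token_pooling helper wraps the colpali-engine implementation, but the underlying idea can be sketched with SciPy's agglomerative clustering. Treat this as an illustration of the concept rather than the library's exact algorithm; the linkage method and the handling of special tokens are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hierarchical_token_pooling(embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Cluster similar token embeddings and average each cluster
    into a single representative vector (conceptual sketch)."""
    n_tokens = embeddings.shape[0]
    # Target number of clusters, e.g. 1031 tokens with pool_factor=2 -> 515.
    n_clusters = max(n_tokens // pool_factor, 1)
    # Agglomerative (hierarchical) clustering on the raw token embeddings.
    labels = fcluster(
        linkage(embeddings, method="ward"), t=n_clusters, criterion="maxclust"
    )
    # Average all embeddings that fall into the same cluster.
    return np.stack(
        [embeddings[labels == c].mean(axis=0) for c in np.unique(labels)]
    )

page_embedding = np.random.rand(1031, 128).astype(np.float32)
pooled = hierarchical_token_pooling(page_embedding, pool_factor=2)
print(pooled.shape)  # roughly (515, 128); maxclust can occasionally merge a few more
```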
Now comes the crucial experiment: comparing all our optimization strategies in a real vector database. We'll create a single Qdrant collection with multiple named vectors, each representing a different optimization approach. This allows you to directly compare retrieval quality and memory usage across all strategies: original embeddings, scalar and binary quantization, hierarchical pooling at different factors, and spatial pooling. You will answer the key question: how much memory can we save without sacrificing retrieval quality?

First of all, let's make sure the collection does not exist. You will set up multiple named vectors in a single collection. Each vector will use a multivector configuration with MaxSim comparison, but the number of vectors per document will vary depending on the pooling strategy. For the quantization methods, we'll enable them on the corresponding named vectors to compare against the original. Now, let's insert all our embeddings into the collection. Each document will have seven different vector representations, allowing us to compare retrieval performance side by side. The helper method loads precomputed vector data and also limits which slide decks are used to a subset of the entire list. To avoid loading all the vectors into memory, the helper function yields the examples one by one.

You will run the same queries across all the optimization techniques you implemented to compare retrieval quality. The helper function handles all the complexity: it will load precomputed query embeddings or compute them fresh, depending on the LOAD_PRECOMPUTED flag. The queries, the model, and the processor are created internally in the helper function. Now, let's search with each vector configuration and compare the results side by side.

You will retrieve the top three documents for each query and optimization strategy, so let's define the order of the optimizations so we can refer to them easily. You won't apply token pooling to queries, as they are already limited in terms of sequence length, and row and column pooling for queries do not make sense, as there are no spatial relationships in text. Moreover, Qdrant handles quantization internally, so in all cases the call to the Qdrant API uses the query embeddings directly. Thus, we have a helper method that runs the same query against all the named vectors and presents the results as a matrix.

Let's visualize the results with the actual document images side by side. This comparison shows precision metrics with color coding to quickly identify which optimizations maintain retrieval quality. Precision here measures what percentage of the retrieved documents match those returned by the baseline, the original embeddings, which shows how well each optimization preserves retrieval accuracy.

Our first query, coffee mug, returns this set of documents as the top three matches. Scalar quantization, binary quantization, and hierarchical token pooling were all able to return the same set of results. Sometimes the order was slightly different; still, precision does not measure the order of the documents, it focuses only on whether they are relevant or not. However, for column_pooled and row_pooled we are getting a slightly different set of results, and we can clearly say that at least column_pooled didn't do its job well.

Since you only have three queries, you can run them one by one. It's time for the second one. Again, scalar quantization was able to return the same set of documents, but all the other methods struggled with it. In particular, column pooling was not even able to pick one of the documents that were marked as relevant by the original ColPali embeddings. Last but not least, let's check the last query, which is one learning algorithm. Scalar quantization still provides exactly the same set of results as the original ColPali embeddings, and similarly, hierarchical token pooling was able to return the same set of results as the original baseline vectors. Unfortunately, row and column pooling both do not seem to work that well, at least not for our dataset.

However, in practice we don't look at the quality of retrieval for specific cases; we calculate the metrics for all the test examples globally. The average precision at five for all the tested methods looks like this. Column and row pooling didn't perform well, as they were rarely able to select the most relevant document, as you saw in the examples in our notebook. Hierarchical token pooling works significantly better, and the pool factor does not even seem to impact it much, so we could possibly increase it even more. Surprisingly, the simple scalar quantization method was able to consistently return the same documents as the baseline, making it a promising approach that does not require any preprocessing, as it can be configured at the collection level only. Binary quantization wasn't the greatest method out there, but given its potentially huge impact on memory usage and processing speed, it might be considered in some cases.

It's worth mentioning that you can combine multiple memory optimizations of different kinds. Scalar quantization might also be enabled for the embeddings after applying hierarchical token pooling, if you see that the quality is still acceptable.
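As a sketch of how such a combination could be configured with the Qdrant client, the snippet below defines a named multivector with MaxSim comparison and int8 scalar quantization enabled on top, and upserts a hierarchically pooled page embedding into it. The collection name, distance metric, and payload are assumptions for illustration and may differ from the notebook used in this lesson.

```python
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.create_collection(
    collection_name="colpali-optimizations",  # hypothetical collection name
    vectors_config={
        # Hierarchically pooled ColPali embeddings compared with MaxSim,
        # stored with int8 scalar quantization applied on top by Qdrant.
        "hierarchical_pooled": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            quantization_config=models.ScalarQuantization(
                scalar=models.ScalarQuantizationConfig(
                    type=models.ScalarType.INT8, always_ram=True
                )
            ),
        ),
    },
)

# Stand-in for a pooled page embedding (e.g. 515 vectors of 128 dims);
# Qdrant receives plain floats and quantizes them internally.
pooled = np.random.rand(515, 128).astype(np.float32)
client.upsert(
    collection_name="colpali-optimizations",
    points=[
        models.PointStruct(
            id=0,
            vector={"hierarchical_pooled": pooled.tolist()},
            payload={"image_path": "slides/page_0.png"},  # hypothetical payload
        )
    ],
)
```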
Nevertheless, a real benchmark should not focus on three handpicked examples; it should be broader. Still, evaluation is key, and it depends on the datasets you are working with. Experiment with different techniques to see which combinations work best on your data.