In this lesson, you will optimize an existing vector search mechanism to reduce its memory usage. You will explore different types of vector quantization that not only make semantic search more affordable, but can also enhance its efficiency. Let's code and innovate. You might be surprised to see how effective the vectors may still be, even if we compress them using various quantization techniques.

Let's discuss the basics. Vector embeddings are usually represented by floating point numbers. Float32 means that each individual dimension needs 32 bits of memory, which is equivalent to four bytes. For example, a popular OpenAI embedding model produces over 1,500 numbers for each input text. That's around six kilobytes of memory for each individual chunk in your RAG, and there's still some overhead generated by the vector database structures. This number has to be multiplied by the total number of documents you will store. Even at the scale of a million documents, there has to be a decent amount of memory available, and if we go beyond that, a consumer-grade machine won't handle the load. Putting the data on disk was the only solution for those hitting their budget limits in the past. Quantization techniques turned out to be much more powerful, and offered a way to make semantic search more affordable, with little impact on search quality.

Product quantization is the first method you will learn today. Everything starts with the original float-based embeddings. Product quantization divides them into sub-vectors. The number of sub-vectors depends on the selected compression rate. For example, a compression rate of 16 means each sub-vector will have four dimensions, as the compression rate is defined in bytes. Once the vectors are split into pieces, we can start the compression.

In the next step, each group of sub-vectors is used as input to a clustering algorithm, such as k-means. Eventually, each sub-vector gets assigned to the closest centroid. The number of centroids is not configurable, as it is always set to 256: the total number of values we can represent with a single byte. From now on, instead of storing the sub-vectors, we keep the identifiers of the closest centroids. In our case, each sub-vector had four dimensions of four bytes each, so it required 16 bytes overall. After product quantization, we just have to store a single byte, which is the centroid identifier. Sixteen bytes became just a single byte, so the achieved compression rate is also 16.

Product quantization is called "product" because it divides a high-dimensional vector into smaller sub-vectors and quantizes each sub-vector separately using its own codebook. The combination of these quantized sub-vectors forms a product of the individual codebooks, creating a combined code for the entire space. This approach leverages the Cartesian product of the subspaces to efficiently approximate the original high-dimensional data.

There are multiple possible configurations of product quantization, and its impact varies depending on how you set it up. If you are on a budget, you can consider setting the compression rate even up to 64 times, but that rarely works precisely enough. In our case, an aggressive compression of 32 times decreased the search precision to around 0.66, while the search with the original vectors was able to reach around 0.98. That's a huge difference. Product quantization will always result in reduced search precision, but sometimes it may speed things up a little bit.
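To make the mechanics more tangible, here is a minimal sketch of the idea in Python, using NumPy and scikit-learn. It only illustrates the splitting, clustering, and encoding steps, not Qdrant's internal implementation; the number of vectors, the 1,536-dimensional embeddings, and the compression rate of 16 are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumptions: 1,536-dimensional float32 embeddings and a compression rate of 16,
# so each sub-vector covers 4 dimensions (16 bytes of float32 -> 1 byte).
dim, sub_dim, n_centroids = 1536, 4, 256
vectors = np.random.rand(2048, dim).astype(np.float32)

codebooks, codes = [], []
for start in range(0, dim, sub_dim):
    sub_vectors = vectors[:, start:start + sub_dim]
    # One codebook of 256 centroids per sub-space; 256 ids fit into a single byte.
    kmeans = KMeans(n_clusters=n_centroids, n_init=1).fit(sub_vectors)  # single init to keep the sketch fast
    codebooks.append(kmeans.cluster_centers_)
    codes.append(kmeans.predict(sub_vectors).astype(np.uint8))

# Each original vector is now a sequence of centroid identifiers, one byte per sub-vector.
codes = np.stack(codes, axis=1)
print(vectors.nbytes / codes.nbytes)  # ~16x less memory
```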
Inserting your points and building the collection will always be slower, though, as it has to run the k-means clustering.

The degraded quality of quantization may be improved back with the help of rescoring. First of all, the quantized representations encode very similar vectors in exactly the same way, as their sub-vectors get the same centroids assigned. That makes it harder to always get the best results possible, as many points might now be represented with the same coordinates, and the same coordinates also mean the same scores. For that reason, vector databases keep the original vectors on disk. Once we have a set of results returned by quantized search, we can load the original embeddings and compute the similarity using them. That helps to differentiate the distances between the points coming from the same neighborhood. Rescoring should improve the quality of search, and choosing to turn it off should rather be a conscious decision.

Let's see how product quantization works for our dataset. In order to test the impact of quantization on the search quality, you have to import the snapshot of the collection once again. Let's also repeat the process of calculating query embeddings with our selected model. Whenever we speak about measuring quality, we need a point of reference. You already know we cannot improve the embedding model quality by applying any technique to the created vectors. Therefore, our baseline is defined by the outputs of the exact kNN search, similarly to the previous lesson.

You are now ready to evaluate the impact of product quantization on the search quality. You have to modify the configuration of the collection to enable product quantization on its vectors. This change of the configuration will fire the quantization process, and then the whole collection will be rebuilt. There is no downtime, but while the changes are being applied, you might still be using the old structures or exact search. From a user perspective, nothing changes. Setting the always_ram property forces Qdrant to keep the quantized vectors in memory. Again, let's wait for the optimizer to finish its job. We want to measure the impact of product quantization, so let's wait until it's ready.

You will now be able to evaluate the quality of the collection again. The process is no different from what we have already done for the different HNSW configurations. The only thing we add here is an operation that measures how much time we spend on each search operation. Eventually, you will display the average latency, as it is also an important factor when choosing the right parameters for your semantic search. For each quantization technique, you will run the evaluation twice, with and without rescoring; in this initial test, we disable the rescoring mechanism. Obviously, rescoring comes with an additional cost, so it's better to avoid it if possible. Finally, we will store the latency of the search requests and then build a dictionary, similarly to what we did for all the other tests so far.

This is the average response time for product quantization without the rescoring. The product quantization results are also ready. Let's evaluate their quality versus the exact search on the original float-based vectors, again using precision at 25, but you can also experiment with different values. Product quantization reduced the search precision, but we got pretty aggressive and requested a compression ratio of 64 times.
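Roughly, the steps described above could look like the following with the Python qdrant-client and ranx. This is a hedged sketch, not the exact notebook code: the local Qdrant instance, the collection name, the precomputed query embeddings, and the qrels built from the exact kNN search are all assumptions standing in for what was prepared earlier.

```python
import time

from qdrant_client import QdrantClient, models
from ranx import Qrels, Run, evaluate

client = QdrantClient("http://localhost:6333")  # assumption: local Qdrant instance
collection = "documents"                        # assumption: the collection imported from the snapshot

# Switch the collection to product quantization; the optimizer rebuilds it in the background.
client.update_collection(
    collection_name=collection,
    quantization_config=models.ProductQuantization(
        product=models.ProductQuantizationConfig(
            compression=models.CompressionRatio.X64,  # aggressive 64x compression
            always_ram=True,                          # keep the quantized vectors in memory
        )
    ),
)

# Query with rescoring disabled and measure the latency of each request.
run_dict, latencies = {}, []
for query_id, query_vector in query_embeddings.items():  # assumption: precomputed query embeddings
    start = time.perf_counter()
    response = client.query_points(
        collection_name=collection,
        query=query_vector,
        limit=25,
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(rescore=False)
        ),
    )
    latencies.append(time.perf_counter() - start)
    run_dict[query_id] = {str(point.id): point.score for point in response.points}

print("avg latency:", sum(latencies) / len(latencies))
# qrels_dict is assumed to hold the exact kNN baseline results.
print(evaluate(Qrels(qrels_dict), Run(run_dict), "precision@25"))
```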
Let's repeat the process of evaluation, but this time with the rescoring enabled. It will recompute the distances using the original vectors, and that should hopefully improve the quality. There is just a single difference in how we evaluate the quality in this case: we enable the rescoring for all the search queries we send, but all the rest is kept identical. Now let's run the ranx evaluation again. Rescoring improves the quality of search, as different documents with the same quantized representation may have their scores computed properly with full precision.

Let's review the second quantization technique available for vector search. Scalar quantization is another technique you may apply to make semantic search less memory intensive. It is based on the very simple idea of converting floats into integers. When we have the original vectors represented with floats, the individual numbers come from a specific range, which is just a subset of all the values that we may represent with floats. When we then receive some new vectors, the range may change, so we never know the minimum and maximum values. That's why quantization is done separately on subsets of the data, not globally for all the vectors. Do you remember that the HNSW graph is split into segments? That also helps to implement scalar quantization. The database assumes each segment is immutable, so we can measure the range of the numbers and calculate the minimum and maximum values. The whole quantization process is then about converting these numbers into integers, either in the range from 0 to 255 or from -128 to 127. The compression rate of scalar quantization is always four: four bytes are compressed to a single one. The memory required to keep the vectors after applying scalar quantization is lower by up to 75%. That comes with a cost in terms of search precision, but there are also speed benefits. Overall, scalar quantization is usually the first thing to consider if you want to make your search faster and more affordable.

Let's configure scalar quantization on the collection we have and test how it impacts the quality of retrieval. You will now repeat the quality evaluation process, but on the results provided by scalar quantization. It has to be enabled first, but all the other steps are similar. Qdrant has to finish the rebuilding process, so let's give it some time. Our first attempt to evaluate scalar quantization will be performed without the rescoring. Ranx requires us to have a Run object created, so let's do it. You will also evaluate the same metric on the collection with scalar quantization enabled, but this time with rescoring. Let's run the ranx evaluation again. We've just calculated the precision at 25 for both rescoring disabled and enabled on the same collection. We'll compare all the quantization methods at the very end.

Last but not least, binary quantization takes the idea of compressing individual dimensions even further. It converts each floating point number into a boolean, either 0 or 1: 32 bits of information compressed to just a single bit. The rule is as simple as converting each positive value to one, and each negative or zero value to zero. One important thing in binary quantization is oversampling. Since many points will be represented in exactly the same way, you want to retrieve more points than you really need, and then calculate the distance between your query and the original vectors represented as floats. Again, let's see the real-world impact of binary quantization.
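Both scalar and binary quantization are enabled the same way as product quantization: by updating the collection configuration. Here is a hedged sketch with qdrant-client, reusing the client and collection names from the earlier sketch; the oversampling factor is an assumption.

```python
from qdrant_client import models

# Scalar quantization: each float32 dimension becomes an int8, a fixed 4x compression.
client.update_collection(
    collection_name=collection,
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        )
    ),
)

# Binary quantization: each dimension becomes a single bit, up to 32x compression.
# (In practice you apply one config at a time; this one replaces the scalar config.)
client.update_collection(
    collection_name=collection,
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True)
    ),
)

# With binary quantization, oversample and rescore: fetch more candidates than needed
# using the binary representations, then re-rank them with the original float vectors.
response = client.query_points(
    collection_name=collection,
    query=query_vector,  # assumption: one of the precomputed query embeddings
    limit=25,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,
            oversampling=3.0,  # assumption: retrieve 3x more candidates before rescoring
        )
    ),
)
```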
Your last attempt to quantize vectors will focus on binary quantization. Enabling it is as simple as changing the collection configuration again and setting the quantization config to binary. Qdrant will remove the scalar quantized vectors and start the binary quantization on the original ones. Let's give it some time to finish that operation.

At this point, you should already know what we will do next. Let's run our benchmark on the collection with binary quantization enabled and, obviously, do not rescore in the first test. Do you want to see the impact on our metric? Let's calculate it. Again, let's enable rescoring and run all the queries against our collection. Finally, the impact of the rescoring on the binary quantization also has to be determined.

As a last step, you will compare the search precision achieved by all the tested methods. As you may see, the highest results were achieved by scalar quantization, either with or without rescoring enabled. Binary quantization with additional rescoring also did the job well. However, choosing the right quantization method depends on the specific requirements of your project and the models you use.

Quantization is usually the first method to consider when you want to make your semantic search faster without sacrificing quality too much. Binary quantization reduces memory usage by up to 32 times, and the performance benefits might be even higher, as it can increase the speed of retrieval by up to 40 times. Overall, choosing the right quantization technique depends on the specifics of your application. From now on, you should know what the options are, and testing them out shouldn't be a big deal anymore.

In this lesson, you implemented all the major vector quantization methods and measured their impact on the search quality. The ability to test and optimize semantic search will help you build more reliable AI applications.
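For reference, the kind of side-by-side comparison described above could be assembled with ranx's compare helper, roughly like this. The run objects and their names are assumptions standing in for the six runs collected during the lesson.

```python
from ranx import Qrels, compare

# Assumption: each Run below was built from one of the evaluated configurations.
report = compare(
    qrels=Qrels(qrels_dict),
    runs=[
        product_no_rescore_run,
        product_rescore_run,
        scalar_no_rescore_run,
        scalar_rescore_run,
        binary_no_rescore_run,
        binary_rescore_run,
    ],
    metrics=["precision@25"],
)
print(report)
```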