In this lesson you will train several tokenizers. Since they determine how the transformer model processes input text, it is important to give them extra attention. All right. Let's build something.

Tokenizers are trainable components, and they learn the best vocabulary for the given training data. There isn't a single definition of what "the best" means in this case, and various tokenization algorithms exist. Contrary to neural networks, the tokenizer training process is fully deterministic, and it's based on the statistics of the input data. Different models choose different tokenization methods. OpenAI prefers Byte Pair Encoding, while WordPiece is quite commonly used by some other providers, including the open-source Sentence Transformers models that you use. Cohere, for example, selected WordPiece for their English model, but Unigram to create a multilingual one. The size of the vocabulary is a hyperparameter we need to choose upfront, and it's usually at least 30,000 tokens. For multilingual models, it might even be a few times larger, but that's to be expected, as the set of characters to support is much wider, and there are also more sequences to cover. Let's check the most popular tokenization algorithms and see how they split the text.

Byte Pair Encoding is a common choice. BPE starts by splitting the input by whitespace characters. The natural boundaries defined by words are kept, so a single token will never overlap two words. The vocabulary is initialized with all the characters in the training set. New tokens are iteratively created by merging the two tokens that most often appear next to each other. The most common pair is selected and added to the vocabulary, but we do not remove the tokens used to create it. They are kept in the vocabulary, so we can also use them to tokenize other words. Initially, our sentence is split into single letters, but after the first step, two consecutive tokens are merged to form another one. The process continues until we reach the desired vocabulary size. We select the most common pair of tokens from the previous step and add the new token to the vocabulary. The number of steps depends on the number of tokens we want to have. If we set the vocabulary size to 14, this is what the final vocabulary will look like.

Let's repeat that with one of the existing tokenizer implementations from HuggingFace. The HuggingFace Tokenizers library provides implementations of various tokenization algorithms. You will train multiple ones using the same, very simple training dataset. Implementing Byte Pair Encoding is as easy as running a few imports and gluing the components together. A Tokenizer is a general component that requires a selected tokenization model to be passed as an argument. It also allows setting some pre-tokenization that will be run on each input text before we really start tokenizing it. Whitespace pre-tokenization helps to split the text into words. The last thing you will set is the trainer object and the size of the target vocabulary. Practically, you will never go below at least a few thousand, but in this case, 14 should be just fine. You can also experiment with different values. You will now train the tokenizer from a Python iterator by passing the training set with a corresponding trainer instance. Then you can get the vocabulary to see which tokens were learned. The training process is iterative, so we can see the order in which new tokens were added to the vocabulary simply by checking their IDs. This is how our training example would be tokenized.
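For reference, here is a minimal sketch of that pipeline using the HuggingFace Tokenizers library. The training sentence below is a hypothetical stand-in for the lesson's dataset, not necessarily the exact one used in the videos.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical stand-in for the lesson's tiny training set.
training_data = ["walker walked a long walk"]

# The Tokenizer wraps a tokenization model; here it is BPE.
tokenizer = Tokenizer(BPE())
# Whitespace pre-tokenization splits each text into words first.
tokenizer.pre_tokenizer = Whitespace()

# Train to a tiny vocabulary of 14 tokens, as in the lesson.
trainer = BpeTrainer(vocab_size=14)
tokenizer.train_from_iterator(training_data, trainer=trainer)

# Token IDs reflect the order in which tokens were added.
print(tokenizer.get_vocab())
# Tokenization of the training example itself.
print(tokenizer.encode(training_data[0]).tokens)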
The created tokenizer might also be used to encode some new text. Let's say we had a typo and we want to see the output tokens for this text. Obviously, IDs are not that easy to read, so let's access the tokens instead. Another interesting thing is what will happen if we pass some letters that didn't occur in the training phase. The letters S and H have never been seen during this process, so they are omitted in the output, as no corresponding token is available. Byte Pair Encoding is commonly used in large language models, but typically on a byte, not character, level.

WordPiece is a similar algorithm, and it's quite often used in embedding models. The main difference between BPE and WordPiece is that the latter distinguishes between the first letters of words and the middle letters by adding a double-hash prefix to each middle letter. The general idea of WordPiece is to learn words, prefixes, and suffixes separately, as the prefix is thought to carry the meaning, which the inflected suffix does not change much. The words walk, walking, walked, and walks all refer to the same activity, so we expect the model to treat them similarly. Ideally, one token should represent the abstract concept of walking and the next should represent the tense. This is what the initial vocabulary of the WordPiece tokenizer would look like. The training process starts with a vocabulary built from each word's letters, with the middle letters of words prefixed with a double hash. Then the algorithm iteratively merges pairs of tokens, but it selects the tokens to merge based on a score, which is different compared to BPE. We no longer simply select the most frequent pair, but also consider how often these two tokens occur in other contexts: the score divides the frequency of the pair by the product of the frequencies of its two parts. If we calculate the scores for each pair of tokens in the first iteration, it becomes evident that if two tokens always occur next to each other, they will be selected for a merge. The word "long" consists of letters we cannot find anywhere else in the training data. Similarly to BPE, we add new tokens iteratively. On the same example, our tokenization already works differently and prefers different pairs of tokens. Again, the process will stop when we reach the desired vocabulary size. Here are the steps performed when we set the vocabulary size to 27. Since WordPiece adds all the single letters and middle letters as separate tokens, we usually need a bigger vocabulary to capture these regularities.

Let's implement the tokenizer again for the same training set. The HuggingFace Tokenizers library has a slightly different implementation of WordPiece. Instead of calculating the scores, it always selects the most common pair of tokens, similarly to BPE, but differentiates the middle letters by attaching the double-hash prefix to them. Due to that, you will now use another library, implemented specifically for this course. It is based on HuggingFace Tokenizers and provides the real WordPiece implementation, using the score to choose the tokens to merge. This time, the target size of the vocabulary is set to 27. The real-wordpiece library is not 100% compatible with HuggingFace Tokenizers in terms of how we run the training process. This time, you have to call the trainer's train_tokenizer method and pass the training data, as well as the tokenizer instance you want to train. Our vocabulary already looks different than in the case of BPE. Let's check how our training example will now be converted into tokens.
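A sketch of that training call could look as follows. The lesson only states that the trainer exposes a train_tokenizer method taking the training data and the tokenizer instance; the import path, class name, and constructor argument below are assumptions, and the corpus is again a hypothetical stand-in.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Assumed import path and class name for the course's library.
from real_wordpiece.trainer import RealWordPieceTrainer

training_data = ["walker walked a long walk"]  # hypothetical toy corpus

# WordPiece model with whitespace pre-tokenization, as before.
tokenizer = Tokenizer(WordPiece())
tokenizer.pre_tokenizer = Whitespace()

# Unlike the HuggingFace trainers, this trainer receives both the
# training data and the tokenizer instance it should train.
trainer = RealWordPieceTrainer(vocab_size=27)
trainer.train_tokenizer(training_data, tokenizer)

print(tokenizer.get_vocab())
print(tokenizer.encode(training_data[0]).tokens)
```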
The training procedure was able to find a prefix that occurs multiple times, and that sounds like the desired behavior. When it comes to typos, you shouldn't notice any huge changes. If you now pass text with some unknown characters, though, the results will be a bit different. Our tokenizer throws an error when it finds an unknown character. Thus, we need to specify a special fallback unknown token. It has to be known by both the model and the trainer. You will now define the same training procedure using HuggingFace Tokenizers, but with this unknown token used. You can also experiment with the real-wordpiece training procedure and see how the results differ. We'll also increase the target size of the vocabulary by one, as the unknown token will get its own ID as well. Let's run the training process again. The learned vocabulary is different than before, and so is the tokenization of the same examples. If we now run the newly created tokenizer on the same training example, the output should already be a bit different. The HuggingFace variant seems to cover full words with individual tokens a bit better. For typos, we still shouldn't see any major change. However, when we pass some unknown characters, they will be automatically converted into the special token we created. We could obviously select a different value for it, but that's the typical convention.

BPE and WordPiece are not the only algorithms, but they have much in common, especially because they start with a basic set of characters and then iteratively merge them to form more complex tokens. Unigram takes a different approach. Instead of building the tokens from the bottom up, we can also do it the other way around. Unigram starts with a huge vocabulary, quite often created with BPE using a much bigger vocabulary size than expected, and then it removes some of the tokens based on a calculated loss. The initial vocabulary allows a single word to be tokenized in multiple ways. The set of all possible tokenizations is important for calculating the loss in the Unigram training process. At each iteration, the algorithm computes how much the overall loss would increase if a specific token was removed, and looks for the tokens that would increase it the least. Probabilities are defined by the frequencies of the tokens. In our case, there are 63 occurrences of tokens in total, so we can calculate the probability of a particular tokenization as a product of the probabilities (frequencies divided by that total) of all the tokens used. Even in such a simple example, the number of tokenizations to consider is pretty significant. Unigram training will check how removing each individual token would impact the loss, and then select the one that can be removed with the smallest impact on the loss value. Analyzing each step of the Unigram training process would require lots of computations. Practically, we can train the Unigram model with the HuggingFace Tokenizers library without changing too much.

The Unigram model is the last one we will train today. The training pipeline is similar to any other model you have tried so far. We expect a vocabulary of 14 tokens. You might be surprised that the vocabulary is shorter than the expected size of 14, but that's just an implementation detail.
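Both pipelines described above follow the same pattern. Here is a minimal sketch, again on a hypothetical toy corpus, with [UNK] used as the (conventional, but freely chosen) unknown token.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer, WordPieceTrainer

training_data = ["walker walked a long walk"]  # hypothetical toy corpus

# WordPiece with a fallback token for unseen characters. The token has
# to be known by both the model and the trainer, and the vocabulary is
# one entry bigger to make room for it.
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()
wp_trainer = WordPieceTrainer(vocab_size=28, special_tokens=["[UNK]"])
wp_tokenizer.train_from_iterator(training_data, trainer=wp_trainer)
# A word containing unseen letters falls back to the [UNK] token.
print(wp_tokenizer.encode("shark").tokens)

# Unigram uses the same outer pipeline, even though internally it prunes
# a large candidate vocabulary instead of merging characters bottom-up.
uni_tokenizer = Tokenizer(Unigram())
uni_tokenizer.pre_tokenizer = Whitespace()
uni_trainer = UnigramTrainer(vocab_size=14)
uni_tokenizer.train_from_iterator(training_data, trainer=uni_trainer)
# The final vocabulary may end up slightly smaller than requested.
print(uni_tokenizer.get_vocab())
```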
Unigram removes a couple of tokens at each step and only guarantees that the size of the vocabulary won't be bigger than desired. We can see that the common prefixes were found to be important, so the hope is that our tokenization can capture the meaning correctly. Also, there are no double-letter tokens anymore, which was the case with WordPiece and BPE. We only have the base alphabet and the common prefixes of the words. Unigram seems to be a solution to the problem of so-called glitch tokens, as it removes the ones which won't be used in the tokenization of the training data, while the other methods keep them in the vocabulary. We can also try out the tokenization on the same examples as we used before to see the output tokens, and similarly when we make a typo or pass data with some unknown characters. In this case, the unknown letters seem to be supported when we check the tokens, so let's see their identifiers. An unknown token is used for the sequences of unknown characters, but checking the tokens alone does not show it clearly.

If you read some other materials about tokenization, you can easily find SentencePiece mentioned quite often. Let's see how it differs from the other methods we discussed today. SentencePiece is just an implementation of the same tokenization algorithms, with one additional assumption about the text, not a different algorithm itself. SentencePiece does not split the text by whitespace, but treats whitespace characters like any other characters. That enables it to work for languages that do not use them as word separators. Internally, SentencePiece uses Byte Pair Encoding or Unigram to build the vocabulary; a minimal training sketch is shown at the end of this lesson. SentencePiece allows building tokens that span multiple words. The fact that it allows whitespace characters to be parts of the tokens is essential, for example, if you want to build an embedding model for code. Languages like Python, for which indentation matters a lot in terms of meaning, will benefit from this approach. However, there is also English data in which some proper names consist of multiple words, like San Francisco or Real Madrid. Having a single token for both words might be beneficial in some cases.

Checking the tokenization algorithm and its vocabulary is an important step in choosing a model. The practical implications might still be unclear, but we will try to shed some light on them in the next lesson.
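As a closing reference, here is the minimal SentencePiece training sketch mentioned above, using the sentencepiece Python package. The input file name, model prefix, and vocabulary size are placeholders, and the corpus in that file is assumed to be large enough for the requested vocabulary.

```python
import sentencepiece as spm

# Whitespace is treated like any other character (internally encoded as
# the "▁" meta symbol), so no pre-tokenization step is needed.
spm.SentencePieceTrainer.train(
    input="training_data.txt",        # placeholder path to a raw-text corpus
    model_prefix="toy_sentencepiece",
    vocab_size=100,                   # placeholder; depends on the corpus
    model_type="unigram",             # internally Unigram, or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="toy_sentencepiece.model")
print(sp.encode("walker walked a long walk", out_type=str))
```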