In this lesson, you will explore scenarios where vector search alone may fall short. You will work on some of the most common challenges you might encounter, and see how to effectively address them using vector databases. All right. Let's get to coding.

Choosing the best possible embedding model is a crucial step in building a reliable semantic search system. However, no matter which model you choose, there are some relationships that are not well captured by the vectors. Sometimes these problematic relationships may be counterintuitive, so let's review them in this lesson.

We already know that tokenization is an important step of creating the embeddings, especially since it can chop our words into multiple pieces, and this transforms the actual input of the transformer. Subword-level tokens may have a clear meaning that is already captured in the input token embeddings layer. Shorter or rare tokens may, on the other hand, occur in various contexts, which effectively limits the ability of the model to learn proper input representations. This is especially true for prefixes and suffixes that are parts of many different words. None of these tokens has a specific meaning on its own, and the corresponding input token embeddings will rather cover many different contexts. It's certainly surprising, though, that the model's ability to capture the semantics of both words will be rather limited. Also, some papers, such as "Investigating the Effectiveness of BPE: The Power of Shorter Sequences", suggest that the fewer tokens an algorithm needs to cover the test set, the better the results might be.

There are various other problems you may face while implementing semantic search. Let's see what the most common misconceptions are and how to avoid them. There are some myths about the magical capabilities of semantic search. Unfortunately, it falls short if the tokenizer cannot handle the input data properly. Typically, you don't consider the embeddings assigned to each individual token; you just take a single vector per whole text.

In this lesson, you will use the same sentence transformer as before. You will also need its tokenizer to see how texts are split into tokens. Let's access it. You might be surprised to see that some characters which are commonly used nowadays are not that well captured by some of the embedding models. The happy face emoji is not translated into any meaningful token, so we should not expect this particular model to capture the real emotions of the person who wrote the text. Things may change if we just describe the emotion using text. There seems to be a token corresponding to being happy, but only if we use regular letters, not emojis.

How about different emotions? The sad face and the happy face are both unrecognized and converted into the same unknown token. If our documents only differ in that single emoji, then the transformer input will be identical in both cases. However, when we use regular letters, there is a corresponding token found. Again, our model struggles with emojis, and probably with some other sets of characters. If you want to use it on social media data, you'd better check how many unknown tokens there are.

Many businesses use specific domain terminology. For example, Broadcom BCM2712 is a processor used in the Raspberry Pi. If someone asked for this device, you would love to see this processor in the set of results. Unfortunately, its name is just cut into multiple subword-level tokens.
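As a rough illustration of what the notebook does here, the sketch below inspects the tokenizer of a sentence transformer. The concrete model name is an assumption; all-MiniLM-L6-v2 is used because, like the model in this lesson, it produces 384-dimensional vectors and relies on a WordPiece vocabulary.

```python
from sentence_transformers import SentenceTransformer

# A minimal sketch; the exact model used in the lesson may differ.
# all-MiniLM-L6-v2 is assumed here (384-dimensional vectors, WordPiece tokenizer).
model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = model.tokenizer  # the underlying Hugging Face tokenizer

for text in ["I am so happy 😊", "I am so happy", "Broadcom BCM2712"]:
    print(text, "->", tokenizer.tokenize(text))
# Emojis typically end up as the unknown token ([UNK]),
# while rare product names are split into several subword pieces.
```

If emojis or domain terms matter for your data, counting how many unknown tokens show up like this is a quick sanity check.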
You might also expect semantic search to handle typos well. But if you check how typos are tokenized, you will see there are lots of subword-level tokens, and probably not much meaning in their corresponding input token embeddings. If your users just accidentally omit the double "c" in the word "accommodation", then the whole text will be transformed into a completely different set of tokens.

Many people also believe that vector embeddings should capture numbers and produce meaningful results, for example, when two items have similar prices. There are various ways to express a price. However, the overlap of the tokens is the highest between the first and the fourth example, even though the prices are completely different.

When it comes to dates, there are also various ways to describe the same day. If you have multiple different formats in your system, then the tokenization will also be completely different. That still doesn't mean the representations will be far from each other. That's something we need to verify. The first two examples represent the same date, but they are obviously the least similar in terms of how they're converted into tokens.

In practice, we never calculate embeddings of single words, but rather of sentences or paragraphs. The impact of a single token is not that high, but it might be significant in some cases. Let's check the same examples as before, but this time compare the similarity of the output text embeddings. We will display the similarity as a heatmap. Emojis do not have any token embeddings assigned, and that results in different emojis being encoded as the same unknown token. Eventually, it won't make a difference whether someone is happy or sad, as long as they use emojis to describe it.

Let's also check how typos may impact the overall similarity. The similarity of both sentences is not that high, even though any human would see that these sentences mean the same thing. As a next example, you will check the numerical values. The biggest overlap of the tokens also results in the highest similarity. Increasing the price from $55 to $559 apparently does not change the meaning too much. Another example you checked for the tokenization was about the dates. The same format of the date seems to have the highest impact on the similarity. Even though the first two examples describe the same date, their similarity score is the lowest.

The problems you have just seen might not hold true for all the embedding models available out there. That's why this kind of evaluation is important. Otherwise, you don't know which specific types of queries are just not supported. Let's take another model and see how well it solves them. OpenAI embeddings are commonly used, so we'll check them now. The tiktoken package is OpenAI's fast implementation of byte-pair encoding. Let's check the vocabulary size used in the model you will use. The vocabulary is three times bigger than that of the sentence transformer we used so far.

Let's see how well it works for the same cases, and start with emojis. It seems that an emoji is transformed into some tokens, but on the byte level. Obviously, it will be different if we just use words. The first difference you can see is that the OpenAI tokenizer was trained on some additional sets of Unicode characters, so it should handle them better. Let's see if the Raspberry Pi and its processor were also commonly mentioned in the training data. It seems the CPU wasn't mentioned too often. Still, that doesn't mean the representation will be bad. Let's now check how typos impact the tokenization. Different prices will also be encoded differently, so let's check their tokenization too. Last but not least, there are the different date formats.
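To make the tiktoken part concrete, here is a small sketch of how you might inspect that vocabulary and the byte-level tokens. The specific OpenAI embedding model name is an assumption, and the example strings are illustrative only.

```python
import tiktoken

# A minimal sketch, assuming text-embedding-3-small (cl100k_base encoding);
# the lesson's exact model may differ.
encoding = tiktoken.encoding_for_model("text-embedding-3-small")
print(encoding.n_vocab)  # ~100k tokens, roughly 3x the WordPiece vocabulary above

for text in ["I am so happy 😊", "Broadcom BCM2712", "$55.00", "2024-01-01", "1st of January 2024"]:
    token_ids = encoding.encode(text)
    # decode_single_token_bytes shows the raw byte-level pieces behind each token id
    print(text, "->", [encoding.decode_single_token_bytes(t) for t in token_ids])
```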
After checking the tokenization of these problematic cases, let's check whether they also impact the similarity. First of all, we have to connect to the OpenAI services. Now, if we pass the same texts using both emojis and regular words, we can calculate the cosine distance between all of these examples and check what the similarity of all the pairs looks like. Emojis are supported by the model, and the same emotion described with text and with its graphical representation seems to keep the highest similarity between the sentences.

Let's check the typos and their impact. The similarities are definitely higher than for the sentence transformer. Still, if you think using the OpenAI embeddings solves all the problems, you'd better do some more evaluations. Another example was about prices and numbers in general. OpenAI embeddings definitely seem to handle the numbers better than the sentence transformer. That does not necessarily generalize to all numbers, but it is already promising. When we perform the same process for the dates, different formats are still problematic and must be solved differently. It's better to consider some normalization if that's the case for your data.

Semantic search is not all you need. You should also pay attention to the additional constraints typical for your domain. You will now connect to Qdrant and check an example of a real problem you might encounter while working with semantic search. A Qdrant server should already be running in the environment, and connecting to it is as simple as creating a client instance. Let's also check the list of collections we have, to make sure the environment is fresh.

Your data will consist of textual descriptions of different pieces of clothing and their prices. Don't take it too seriously, as it was generated by one of the GPTs. You can now create a collection that will keep the data encoded with the sentence transformer we use. It produces vectors with 384 dimensions, and that's the size you'll set. Cosine distance is the metric you want to use to compare vector similarity. The upsert method of the client helps to put the data into a collection. Each point has an ID, a vector, and some metadata that we call a payload, in Qdrant's terminology.

Our points are ready to be searched over. The query has to be encoded with the same model as we used for the documents. Now imagine you want to find a piece of clothing suitable for winter. This is a use case in which semantic search should shine over lexical search, as none of the query words has ever been used in the product descriptions. Our results seem to be okay, but in some cases we may have additional requirements. Imagine one of your users is looking for something below $40. They would probably write a query like "for cold weather under $40". Unfortunately, the price requirement was not properly handled. Semantic search does not guarantee that it will be captured.

The payload is a solution to this kind of problem. Vector databases generally provide additional mechanisms to apply filters next to the semantic similarity. If your vector database has a concept of metadata indexes, it's always good to use them to keep this additional filtering efficient. A good practice in Qdrant is creating a payload index on the filter fields we use. In our case, an index on the price field makes a lot of sense. From now on, whenever we have strict price requirements, we can constrain the set of results using filters.
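A sketch of how such a filtered search could look with the Python client is shown below. The collection name ("clothes"), the payload field name ("price"), and the server URL are assumptions made for illustration; the actual notebook may use different names.

```python
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# A minimal sketch; collection name, payload field, and URL are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional vectors
client = QdrantClient("http://localhost:6333")

# Create a payload index so filtering on price stays efficient at scale
client.create_payload_index(
    collection_name="clothes",
    field_name="price",
    field_schema=models.PayloadSchemaType.FLOAT,
)

# Combine semantic similarity with a strict price constraint
results = client.query_points(
    collection_name="clothes",
    query=model.encode("something for cold weather under $40").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="price", range=models.Range(lt=40.0))]
    ),
    limit=5,
)
for point in results.points:
    print(point.payload)
```

With the filter in place, the price condition is enforced exactly, while vector similarity still decides the ranking within the allowed range.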
That's the only way to ensure we always provide a relevant set of results given a user query. All the results returned by our system fulfill our criteria. You can also play with your own queries; generally, semantic search should work even for some more sophisticated ones without any overlap with the documents we have. Still, there is a challenge in how to extract these additional filters from the query. That will require changing the UI of the system or applying some NLP techniques to detect them. Eventually, LLMs will also be really good at extracting these constraints into a structured output. Overall, that's another challenge we want to discuss in this course. In the next lesson, you will learn how to ensure your semantic search system works well. Let's discuss quality. See you in the next lesson.