You can skip this lesson if you don't care about the quality of the retrieval in your RAG application. If you stay, you will learn the tools and the most commonly used metrics of information retrieval. You can't improve what you don't measure, so we need to learn how to measure retrieval quality before we discuss optimizations. Let's get to it.

Measuring the quality of a system's outputs is the first thing to do if you want to optimize it. How else could you know that your application works better after making some changes? Search relevance is not a new area, as we've been measuring how well search engines behave for decades. The same methods can be applied to semantic search, as it's just a new approach to information retrieval. Let's review the most common metrics used in that space and see how we can apply them, for example, in retrieval-augmented generation.

Everything starts with building a reference ground truth dataset. It contains a set of queries and the best-matching documents for each of them. Building such a dataset is a real challenge, as it requires lots of human labor. Moreover, the real intent of a particular person is rarely perfectly reflected in the query they write, and different items may be more relevant for different groups of people, even if the query is the same. Still, a smaller yet well-curated ground truth dataset is a sign of maturity in any project, and it helps to track how your changes improve the quality of the results. There are two typical ways of annotating the relevance of a document for a given query: binary, where an item is either relevant or not, and numerical, where we express the relevance of an item as a number; the higher it is, the more relevant a particular document is. Obviously, for each query, the set of relevant results differs. Since the process of creating such a dataset requires lots of effort, we will just use one of the existing datasets which already has the ground truth defined. The Wayfair ANnotation Dataset is the one you are going to use. Let's check what it looks like.

WANDS stands for Wayfair ANnotation Dataset. It is a benchmark for determining the effectiveness of different search methods. It provides multiple CSV files, including the products. Each product has multiple parameters, but for semantic search, the product name and the description both seem to be natural candidates for being encoded by the embedding model. In this lesson, we are going to build two search pipelines: one will search over product names and the other over descriptions. Eventually, you will measure which one works better.

First of all, let's convert the product names into vectors. We are not going to convert all of them, but just a subset of the data, and num_products will define how many you want to select. You will use the same sentence transformer to calculate the embeddings of the product names for the selected number of them. This process takes some time, and doing it for the whole dataset would take even longer, but each product name is now vectorized. As a next step, let's do the same for the descriptions. The process takes even longer for the descriptions, because those texts tend to be longer, and if you remember how embedding models work, they take a sequence of tokens, and the more tokens we have, the longer the processing takes. You could also experiment and combine the product name with its description, but here you will just use and test these two ways.
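As a rough sketch of what this encoding step might look like (assuming the tab-separated product.csv file from WANDS, the all-MiniLM-L6-v2 model, and column names such as product_name and product_description; the lesson's actual model, paths, and variable names may differ):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Hypothetical parameters -- the lesson may use different values.
NUM_PRODUCTS = 1000
MODEL_NAME = "all-MiniLM-L6-v2"

# WANDS ships its files as tab-separated values; adjust the path/separator if needed.
products_df = pd.read_csv("product.csv", sep="\t", nrows=NUM_PRODUCTS)

model = SentenceTransformer(MODEL_NAME)

# Encode product names and descriptions separately, so the two
# search pipelines can be compared later on.
name_embeddings = model.encode(
    products_df["product_name"].fillna("").tolist(),
    show_progress_bar=True,
)
description_embeddings = model.encode(
    products_df["product_description"].fillna("").tolist(),
    show_progress_bar=True,
)
```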
Searching over our dataset requires building a collection in the vector database. You will name it WANDS-products, and it will have two vectors for each point: one for the name and the other for the description. The collection you created is still empty, but you will fill it with data right away. Our dataframe can be converted into dictionaries, and its index will serve as the identifiers of the points, since identifiers can be either integers or UUID-like strings. The upload_collection method will iterate in batches of 64 elements at a time.

Once the upload is finished, our collection has all the products ready to be searched over, and we can check the number of points in it. That doesn't mean all the internal processes have finished yet, though. Vector databases implement approximate nearest neighbor search, and the optimizer has to build some helper data structures to make search efficient. Let's wait until the collection status becomes green, as that means the data has been indexed.
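Continuing the sketch from above (reusing model, products_df, and the two embedding arrays; the local Qdrant URL and the choice of payload fields are assumptions), creating the collection with two named vectors and uploading the data might look roughly like this:

```python
import time
from qdrant_client import QdrantClient, models

# A local Qdrant instance is assumed; ":memory:" also works for quick experiments.
client = QdrantClient("http://localhost:6333")

embedding_size = model.get_sentence_embedding_dimension()

# Two named vectors per point: one for the product name, one for the description.
client.create_collection(
    collection_name="WANDS-products",
    vectors_config={
        "product_name": models.VectorParams(
            size=embedding_size, distance=models.Distance.COSINE
        ),
        "product_description": models.VectorParams(
            size=embedding_size, distance=models.Distance.COSINE
        ),
    },
)

# Upload in batches of 64; the dataframe index doubles as the point id,
# and a couple of payload fields are kept for inspection.
client.upload_collection(
    collection_name="WANDS-products",
    ids=products_df.index.tolist(),
    vectors={
        "product_name": name_embeddings,
        "product_description": description_embeddings,
    },
    payload=products_df[["product_name", "product_description"]]
        .fillna("")
        .to_dict(orient="records"),
    batch_size=64,
)

# The upload returns quickly, but indexing happens in the background --
# wait until the collection status turns green.
while client.get_collection("WANDS-products").status != models.CollectionStatus.GREEN:
    time.sleep(1)
```

Keeping both vectors on the same set of points is a deliberate choice: it lets us compare the two pipelines fairly later, without uploading the data twice.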
Another part of the WANDS dataset is a set of test queries. These queries still do not define the ideal matches; that's what is put in the third file. The ground truth dataset maps queries to the relevant documents, and each entry has a label which is either exact, partial, or irrelevant. It's always easier to define relevancy as a number, so let's convert the texts into scores with a simple mapping. In addition to that, query and document IDs will get a prefix to distinguish them from each other.

We haven't mentioned how to measure quality yet. Information retrieval uses different types of metrics, and we will review them now. Each query should result in a set of responses sorted by relevance. When it comes to vector search, measuring quality comes quite naturally: the distance function, such as cosine distance, is used to sort the documents our system finds to be the most relevant. In the language of information retrieval, we would say they are ranked based on similarity.

There are mainly three kinds of metrics you can calculate. Relevancy-based metrics only care whether a certain document is relevant or not; they are useful if our ground truth dataset is binary, so we don't know how relevant each document is. There are also ranking-related metrics, which take care of the position of the relevant items in the results; intuitively, we want to put the best items at the top of the list and not present them on the third page. The last group focuses not only on the fact of being relevant, but also takes the relevancy score from the ground truth dataset into consideration.

Let's discuss the first group of metrics. A common way to calculate the quality of a search system is precision@K. This metric measures the fraction of relevant items in the top K results returned by our system. If there are fewer than K relevant documents for a query, achieving a 100% score might be impossible, but it's still a good way to compare different search pipelines. Precision@K is calculated per query, and we usually report the average precision@K over all the queries from the ground truth dataset. Similarly, recall@K measures how many of the relevant documents we manage to return in the top K results. Ideally, we would return 100% of them, but that might be hard if there are more than K relevant documents per query. Again, we usually report the average recall@K over all the queries. It's rather common to present precision and recall together.

Another group of metrics incorporates information about the ranking of the documents; in other words, the order of the documents matters. Mean Reciprocal Rank (MRR) is a commonly used metric that belongs to this group. It does not care about the number of relevant items returned, but only considers the position of the first relevant one. You will probably choose to optimize for this metric if you want to present something relevant at the very top of the results; it is a well-known fact that the first positions of Google search results are also the most clicked ones.

Last but not least, score-related metrics are also commonly used. Discounted Cumulative Gain (DCG) is one example. It measures the total relevance of the returned documents, but also addresses the diminishing relevance of items further down the results list: each item's gain is discounted by the logarithm of its position. There are different alternative formulations of how to calculate DCG, but the main idea stays the same: we want to reward putting relevant items at the top of the results. The value of DCG@K is not normalized, so we quite often divide it by IDCG, which stands for Ideal Discounted Cumulative Gain; this is the DCG that a perfect ranking would get. That gives us NDCG, a value normalized to the range of 0 to 1.

We could calculate these metrics on our own, but it doesn't make much sense, as there are already existing libraries. Ranx is one of them, and it's fairly easy to use. There are two main concepts: Qrels, the query relevance judgments, a.k.a. our ground truth, and Run, the outputs as returned by our search systems. Generally, for each query we define a set of documents and their importance. In the case of Qrels, the importance is an integer; the higher, the better the match. For runs, we just use the similarity scores as returned by the vector database.

Let's come back to the WANDS dataset and see how well the different attempts solve the search problem. You should rather avoid implementing the metrics on your own, so let's use ranx and define the Qrels object from the ground truth dataset. You will now convert all the queries into embeddings so you can use them in the evaluation. Ranx requires a Run object that describes the set of results returned for a particular query, where each returned document gets the score assigned by the search mechanism. You will do it twice, first searching over the product name. We'll define a dictionary that will keep the Run object data. For that, we need to iterate over all the queries from the dataset and run a search operation on the collection we created before, using the product name vector and the query embeddings created in the previous cell. Once we have the set of results returned by the vector database, we iterate over the returned points and build the dictionary structure expected by ranx. Let's also display it so we know what it looks like. Now, this dictionary has to be converted into a Run object. You can do the same for the product descriptions; another Run object will encode the results of that attempt.

In the final step, you will compare both search pipelines. Ranx allows you to pass the metrics you want to measure as a list of strings. In our case, we want precision, recall, MRR, DCG, and NDCG, all of them at k equal to ten. All the metrics suggest the product name is the better candidate for semantic search. Don't be surprised if your test returns slightly different values for these metrics, as vector databases may build their internal data structures differently on each run.
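Putting the evaluation steps together, a minimal sketch might look like the following (assuming the tab-separated query.csv and label.csv files from WANDS with columns such as query_id, product_id, label, and query, a recent qdrant-client that provides query_points, and the objects defined in the earlier sketches; the notebook's exact code and label spelling may differ):

```python
import pandas as pd
from ranx import Qrels, Run, compare

# Map the textual labels to numeric relevance scores (assumed label spelling).
label_to_score = {"Exact": 2, "Partial": 1, "Irrelevant": 0}

# label.csv links query ids to product ids; prefixes keep the two id spaces apart.
labels_df = pd.read_csv("label.csv", sep="\t")
qrels_dict = {}
for row in labels_df.itertuples():
    query_id = f"query_{row.query_id}"
    doc_id = f"doc_{row.product_id}"
    qrels_dict.setdefault(query_id, {})[doc_id] = label_to_score[row.label]
qrels = Qrels(qrels_dict)

# Encode all the test queries once, so both runs reuse the same embeddings.
queries_df = pd.read_csv("query.csv", sep="\t")
query_embeddings = model.encode(queries_df["query"].tolist())

def build_run(vector_name: str) -> Run:
    """Search the given named vector for every query and collect the scores."""
    run_dict = {}
    for row, embedding in zip(queries_df.itertuples(), query_embeddings):
        results = client.query_points(
            collection_name="WANDS-products",
            query=embedding.tolist(),
            using=vector_name,
            limit=10,
            with_payload=False,
        )
        run_dict[f"query_{row.query_id}"] = {
            f"doc_{point.id}": point.score for point in results.points
        }
    return Run(run_dict, name=vector_name)

# Compare both pipelines on the same set of metrics at k=10.
report = compare(
    qrels=qrels,
    runs=[build_run("product_name"), build_run("product_description")],
    metrics=["precision@10", "recall@10", "mrr@10", "dcg@10", "ndcg@10"],
)
print(report)
```

The compare call produces a report with one row per run, which makes it easy to see at a glance which vector wins on each metric.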
You might still be wondering how measuring quality can improve your retrieval-augmented generation apps. Measuring the quality of retrieval is an important part of retrieval-augmented generation. The idea of putting relevant information into the prompt improves the ability of large language models to work with our private data, but our prompts won't help if we fill them with non-essential information. If our search mechanism is unable to provide the LLM with anything meaningful, the whole process will fail. That's why it's so important to test it carefully. Moreover, testing the retrieval is much simpler than testing the whole RAG pipeline end to end, as the latter usually involves another LLM as a judge. Information retrieval is nothing new, and there are ways of making sure it returns what it's expected to. If you take your project seriously, you should consider building a reference dataset from the very beginning. It helps to avoid some common pitfalls of building AI applications with semantic search. Think of it as just another set of test cases you run in the CI pipeline to make sure your search results do not get worse over time.

In this lesson, you learned how to measure the quality of retrieval systems. If you want to serve a different number of results, check out how changing the k parameter changes the values of the different metrics. In the next lesson, you will learn how to use these quality measures to optimize the approximate nearest neighbor search in your vector database. See you there!