In this lesson, you will learn how to evaluate your cache using metrics like hit rate, precision, recall, and latency to understand its real impact. All right, let's go.

There are two ways our cache can fail: low quality or poor performance. We measure cache quality in much the same way we measure the performance of machine learning models. In this lab, we use a data set that contains user queries, their corresponding cache hits (the closest match we have), the distance to that cache hit, and a label column that tells us whether the cache hit is actually correct. This last column can be obtained either from human feedback or by labeling with an LLM as a judge.

This slide shows the data set diagrammatically. The cache entries are mapped into the embedding space, and around each entry we draw the distance threshold; the user queries land somewhere in that space. Here, two user queries map to this specific cache entry.

The hit rate measures how many of all the user queries coming into our system fall within the distance threshold; in this case, three out of five. Precision measures how many of the queries that fall within the distance threshold are actually valid, where validity comes from the label column. Recall measures how many of the queries that should have hit actually did hit; here, one query that should have hit did not, because the distance threshold was too strict.

It is also important to see where each example falls in the confusion matrix, which collects the true positives, true negatives, false positives, and false negatives. In our example, we have one true negative that doesn't map to anything, one false positive that maps but is wrong, one false negative that should have mapped but didn't because the threshold was too strict, and two true positives. Those are the values of our confusion matrix, and we prefer them to be concentrated on the main diagonal.

The distance threshold lets us trade off precision against recall: as we lower the threshold, precision increases but recall drops, and as we raise it, the reverse happens, recall increases and precision drops. The measure we use to strike a good balance between the two is the F1 score. We'll use this technique of sweeping through thresholds to find the best threshold for our data, and in the next lesson we'll use the F1 score to find the optimal threshold for our cache.
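To make these definitions concrete, here is a minimal sketch that recomputes the slide's example in plain Python. The distances and the threshold are hypothetical values chosen only to reproduce the diagram's two true positives, one false positive, one false negative, and one true negative.

```python
# Each tuple: (distance of the query to its closest cache entry,
#              True if the label says that cache entry is a valid answer)
examples = [
    (0.12, True),   # within threshold, valid      -> true positive
    (0.18, True),   # within threshold, valid      -> true positive
    (0.22, False),  # within threshold, invalid    -> false positive
    (0.31, True),   # outside threshold, but valid -> false negative (threshold too strict)
    (0.47, False),  # outside threshold, invalid   -> true negative
]
threshold = 0.25  # hypothetical distance threshold

tp = sum(d <= threshold and label for d, label in examples)
fp = sum(d <= threshold and not label for d, label in examples)
fn = sum(d > threshold and label for d, label in examples)
tn = sum(d > threshold and not label for d, label in examples)

hit_rate = (tp + fp) / len(examples)                 # 3 / 5 = 0.60
precision = tp / (tp + fp)                           # 2 / 3 ≈ 0.67
recall = tp / (tp + fn)                              # 2 / 3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.67

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"hit rate={hit_rate:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Sweeping the threshold in a loop and keeping the value with the highest F1 is exactly the optimization we will perform in the next lesson.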
It is also important to measure the speedup the caching system gives us and the LLM tokens it saves. To measure latency, we use a metric called with-cache latency, which depends on three variables: ACL, the average cache latency, which is the average time our cache takes to respond; ALL, the average LLM latency, which is how long the LLM takes to respond; and CHR, the cache hit ratio, which is the fraction of user queries that actually get cache hits. Every query pays the cache lookup, and only the cache misses go on to the LLM, so the with-cache latency comes out to ACL + (1 − CHR) × ALL. Let's look at a specific example where the average cache latency is 11 milliseconds, the average LLM latency is 350 milliseconds (probably an underestimate), and the cache hit ratio is 30%. The with-cache latency works out to 256 milliseconds, which we can compare to the system's latency before caching to see the improvement, in this case about 26%. To estimate the saved LLM tokens, you can use the resource at the corresponding URL: enter your cache hit rate, expected daily queries, and input and output token counts, and it will estimate your annual savings.

And now let's do all the cache performance measurements in code. We start by setting up our environment and loading the data we're already familiar with. Then we introduce the SemanticCacheWrapper, which wraps around our cache and provides helper functions we can use to hydrate the cache, check the cache, and so forth. Using this abstraction, we hydrate our cache with a simple helper function, and we can check it with, for example, the first user query in our FAQ data frame. When we check the cache, we get back an object, nicely formatted here, that shows what our query was and all of its matches; in this case there is a single match. We have already defined a test_queries list containing all the user queries we'll be using, and we also introduce a check_many helper function that, given a list of user queries, returns the matches for each of them. It can be configured with different thresholds or a different number of matches.

Moving on, we can use an abstraction called the CacheEvaluator. I've already executed this cell; as you can see, you provide it a number of cache_results. Because our data set was generated in a particular way, we can label the results automatically: the data_container that provides the data can also provide labels, since it can extract the queries and matches and we already know which queries correspond to which cache entries. Running the report_metrics function prints a nice-looking report with the confusion matrix, which in this case looks quite good, along with all of the metrics for the threshold the cache was instantiated with; here we get a precision of 0.79 and a recall of 0.79.

Another very useful high-level helper is get_metrics. It gives us all of these metrics in dictionary form, and it also gives us something called the confusion_mask. The confusion mask is similar to the confusion matrix, but instead of giving you counts, it gives you a mask over the data set that you can use to extract the examples belonging to each category of the confusion matrix. In this example we're looking at the first nine entries of the true-negatives mask, and we can see that elements 0, 1, and 2 are part of the true-negatives group.

Another useful method of the evaluator is matches_df. Running it gives us a nicely formatted data frame with our queries, their matches, the distance to each match, and the label. We can then use the confusion masks to filter the data frame and pull out the corresponding groups, conceptually like the short sketch below.
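For illustration, here is a minimal pandas sketch of how such a confusion mask can slice a matches data frame into the four groups. The column names and values are hypothetical stand-ins, not the course library's actual schema; the CacheEvaluator exposes this through get_metrics and matches_df rather than through exactly this code.

```python
import pandas as pd

# Hypothetical stand-in for evaluator.matches_df()
matches_df = pd.DataFrame({
    "query":    ["Can I get a refund if I change my mind?", "How do I track my order?"],
    "match":    ["How do I get a refund?", "How do I track my order?"],
    "distance": [0.21, 0.05],
    "label":    [False, True],   # human / LLM judgement: is this match valid?
})
threshold = 0.25  # hypothetical distance threshold

hit = matches_df["distance"] <= threshold
confusion_mask = {
    "true_positives":  hit & matches_df["label"],
    "false_positives": hit & ~matches_df["label"],
    "false_negatives": ~hit & matches_df["label"],
    "true_negatives":  ~hit & ~matches_df["label"],
}

# Slice out one group for inspection, e.g. the false positives:
print(matches_df[confusion_mask["false_positives"]])
```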
Here, for example, we're looking at the false positives, the five examples shown for this particular run. In this specific instance, we see that the query "Can I get a refund if I change my mind?" is mapped to the cache entry "How do I get a refund?" It's obvious why the retrieval model got confused, but we generated our data set with the intent that this query is not specific enough to be matched to that cache entry, so its label is false, and that's why we consider this example a false positive. Or take the last example in the table: "Can I schedule a specific delivery time?" is matched against "Can I change my delivery address?" As you can see, these are quite different questions, even though they sound similar. The model gave them a low enough distance that they count as a match, yet we labeled them as not a valid match, so this example also falls into the false-positives bucket. I strongly encourage you to swap this false_positives mask for the other groups of the confusion matrix, the false negatives, true positives, and true negatives, and explore the results there.

Now let's introduce helper functions for evaluating the latency performance of our cache. We define a simple function that simulates LLM response latency by sleeping for a random period of 200 to 500 milliseconds. Then we run this performance code, which uses an abstraction called PerfEval that lets us compare the performance of two different executions side by side. At the end of the run we obtain a dictionary called metrics, generated by the perf_eval abstraction, with the latency of both the cache and the LLM calls. Looking into the dictionary, we see the average latency of the cache, 2.2 milliseconds, and the average latency of the simulated LLM, 361 milliseconds.

Using perf_eval, we can also plot all of the metrics we obtained. Here we see a breakdown of the different groups of evaluations we ran: 80 cache hits and 80 LLM calls, with their latencies compared visually. This is what roughly 360 milliseconds on average looks like next to a little over 2 milliseconds for the cache. In the summary block we also see the cache speedup, which in this case is 161 times faster.

What we just compared is raw LLM latency versus pure cache latency, which gives a huge speedup, but it's not a fair comparison. What we should actually do is take the raw LLM latency and the cache latency, assume some cache hit rate, in this case 30%, and use the formula from the slide to calculate the with-cache latency we can expect. From that we can compute both the drop in latency and the cached-LLM speedup. For this particular example we see about a 29% drop in latency and roughly a 1.42x speedup of the system; the short sketch below reproduces this calculation.
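Here is a small sketch of that fairer comparison, using the measured averages from the PerfEval run and the assumed 30% cache hit rate.

```python
# Fair comparison: every query pays the cache lookup, and only cache misses
# additionally pay the LLM latency. The latencies are the measured averages
# from the PerfEval run; the 30% hit rate is an assumption.
avg_cache_latency_ms = 2.2    # measured average cache latency
avg_llm_latency_ms = 361.0    # measured average (simulated) LLM latency
cache_hit_ratio = 0.30        # assumed cache hit ratio

with_cache_latency_ms = avg_cache_latency_ms + (1 - cache_hit_ratio) * avg_llm_latency_ms
latency_drop = 1 - with_cache_latency_ms / avg_llm_latency_ms
speedup = avg_llm_latency_ms / with_cache_latency_ms

print(f"with-cache latency ≈ {with_cache_latency_ms:.0f} ms")  # ≈ 255 ms
print(f"latency drop       ≈ {latency_drop:.0%}")              # ≈ 29%
print(f"speedup            ≈ {speedup:.2f}x")                  # ≈ 1.42x
```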
Let's move on to our last section, LLM-as-a-Judge, where we introduce an automatic way to label your query–cache pairs. We start over by hydrating our cache with the FAQ data frame and then do a full retrieval: we use a distance_threshold of 1 so that, for each test query, we retrieve the closest neighbor no matter the distance. What we get is a list of the closest matches for all of the queries.

Since we're going to use an LLM to label the query–cache pairs, we need to load our API key, which has already been loaded here. Then we can use an abstraction called LLMEvaluator, which we construct here using the ChatGPT API. It comes with a pre-configured prompt that helps us compare pairs of queries and cache hits (a minimal sketch of what such a judge call can look like is included at the end of this lesson). This evaluator instance has a predict method, which we use to pass in our pairs, the test queries and the full-retrieval matches, and we can also configure a batch size. The call to predict gives us an object called llm_similarity_results. Let's inspect it. The object has a data frame property, df, which formats the results as a data frame: every pair gets an is_similar label telling us whether the pair is similar, and a reason explaining why the LLM thought so.

Now we can use our evaluator again, but instead of using the data_container to label the data set, we use the automatic labeling we just generated, by passing in the values of the is_similar column as boolean labels. It is important to note that we have to construct the evaluator with a special constructor called from_full_retrieval, which matters when we're working with full-retrieval matches. We can again call evaluator.report_metrics to get a performance report. Previously we had 79% precision and 79% recall; with the automatic labeling we get 75% precision and 90% recall, which is relatively close to what we had before and tells us that this method can produce good enough labels for our evaluation.

Now that you've learned how to evaluate your cache, in the next lesson you're going to learn how to improve it. Let's clean the cache and continue to the next lesson.
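As promised above, here is a minimal, hypothetical sketch of an LLM-as-a-judge call using the OpenAI chat completions API. The prompt, model name, and JSON output format are assumptions made for illustration; the course's LLMEvaluator wraps this kind of call with its own pre-configured prompt and batching.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical judge prompt, not the course's pre-configured one.
JUDGE_PROMPT = (
    "You compare a user query with a cached FAQ question. "
    "Respond in JSON as {\"is_similar\": true or false, \"reason\": \"...\"}. "
    "Set is_similar to true only if answering the cached question would correctly answer the user query."
)

def judge_pair(query: str, cache_hit: str, model: str = "gpt-4o-mini") -> dict:
    """Label a single (query, cache hit) pair as similar or not, with a reason."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"User query: {query}\nCached question: {cache_hit}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example usage (output will vary from run to run):
# judge_pair("Can I get a refund if I change my mind?", "How do I get a refund?")
# -> {"is_similar": false, "reason": "..."}
```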