In this final lesson, you'll integrate your semantic cache into an AI agent so it can reuse past results, skip redundant work, and get faster over time. Let's have some fun.

Before we build our own agent, let's recap. Agents can be expensive to scale because they consume a lot of tokens. What makes agents unique is that during execution they perform multiple steps, which provides many opportunities to reuse intermediate state when appropriate. It's not just about caching the final response to the original user question. That may work in some cases, but often the raw inputs to agentic applications are more complex and require multiple steps to produce a quality answer. For example, agents can cache user profiles or preferences, the outputs of tool calls or reasoning, or even LLM-generated plans. In the end, we expect the cache to be enriched over time, yielding fewer tokens required for processing.

Let's see this in action. At Redis, we built our own DataFrame explorer agent that can answer questions over datasets loaded into pandas DataFrames. The agent is given a data schema and a user question, and these inputs prompt the agent to generate Python code that can analyze the DataFrame. Throughout the process, the agent caches multiple things, including end-to-end user questions and their resulting answers. These are cached with a low TTL (time to live), since the underlying data can change frequently. We also cache the generated code that comes from the LLM; that way, if a similar question is asked, we already know the code that can be executed to produce the response. And we also cache guidance on solving common errors that come up in the LLM-generated code. All of these things together yield a more efficient agent workflow.

In this diagram, we show three different questions being processed, on the left and on the right. On the left, in red, is the first time we've processed each question; on the right is what happens when we ask a very similar question on a second pass. On the second pass, the cache has been enabled, and we can see that we use far fewer tokens. In the end, every successful cycle enriches the cache, and subsequent executions return answers faster using fewer tokens overall.

Now we're going to do this ourselves in code and build our own deep research agent. This agent will be built with LangGraph, which we'll use to construct the workflow that implements the cognitive architecture for agentic RAG. This RAG agent will also implement a quality assurance loop: we'll review the answers that come from the LLM and, using different tools, revise, iterate, and go back to the beginning if needed. Additionally, this agent will granularly cache individual sub-questions across executions. Let's see how this works.

Step one is that we process the query: the user query comes in and we decompose it into three to five sub-questions. You can think of these as smaller individual tasks or questions that we need to go answer. Second, we check the semantic cache for each individual sub-question. This lets us see if there's work that was handled in the past that we can reuse for this run of the agent. Next, we implement a research loop with tools, which lets us research the uncached sub-questions using the knowledge base tool (the cache check and this cached-versus-uncached split are sketched below).
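To make that split concrete, here is a minimal sketch of checking the semantic cache for each sub-question and separating cached answers from questions that still need research. It assumes RedisVL's SemanticCache; the cache name, distance threshold, TTL, and helper function are illustrative, not the lesson notebook's exact code, and the import path can vary by RedisVL version.

```python
from redisvl.extensions.llmcache import SemanticCache  # path may differ by RedisVL version

# Illustrative semantic cache: stores sub-question -> answer pairs with a TTL
# so stale entries expire on their own.
sub_question_cache = SemanticCache(
    name="agent_subquestion_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,   # how close a past sub-question must be to count as a hit
    ttl=3600,                  # low TTL, since the underlying data can change
)

def check_sub_questions(sub_questions: list):
    """Split sub-questions into cached answers and ones that still need research."""
    cached_answers, to_research = {}, []
    for sq in sub_questions:
        hits = sub_question_cache.check(prompt=sq, num_results=1)
        if hits:
            cached_answers[sq] = hits[0]["response"]  # reuse prior work
        else:
            to_research.append(sq)                    # send to the research loop
    return cached_answers, to_research
```

On a cache hit we skip the research loop entirely for that sub-question, which is where the token savings come from.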
Next, we evaluate the quality. The agent uses an LLM judge to assign a score from zero to one, where zero is poor quality and one is excellent quality. If needed, we iterate and go back to the first step to do more research; we can do this up to two times with feedback. At the very end, the agent uses a large language model to combine all of the individual pieces of research into a final response for the user.

As you would expect, this workflow is expensive and takes a long time to run, potentially 20 to 60 seconds, depending on the quality and difficulty of the question. But with caching included, the agent gets more intelligent and learns over time, and that lets us reduce costs across requests while maintaining high response quality. So let's get into it and see how this works in the code.

First we need to set up our environment. We're going to be using OpenAI in this lesson to build our agent, so first we'll set our OpenAI key, which should already be in your learning environment. Next, we'll import, in one go, the whole host of Python libraries and dependencies we'll use to build our agent. You'll notice we pull in several things from the standard library, we also pull in components from LangChain and LangGraph, and lastly we grab the necessary dependencies from Redis and the RedisVL open-source SDK for semantic caching and vectorization.

Before we build our agent, we also need to connect to our Redis database running on localhost:6379. Let's ping to make sure we can talk to it. We're all good.

The agent we are building is a customer support agent, and it's supposed to have access to a knowledge base of content in order to answer questions. You can think of this as content from documentation, manuals, or PDFs. The first step is to load our raw documentation into a knowledge base. The knowledge base allows the agent to query for the relevant information it needs to answer a question from scratch; in other words, it allows the agent to implement RAG. Let's go ahead and load the knowledge base with all of this documentation. Every entry is vectorized using the OpenAI embedding model.

Now that we have a knowledge base in place, the other piece of infrastructure we need is the semantic cache. As in other portions of the course, we hydrate the semantic cache from the FAQ dataset we've been using all along. Excellent. You can see Redis is running and accessible, and we've loaded 8 FAQ entries into the cache.

It's time to build our agent. The agent in this lesson is built using LangGraph. To make this incredibly easy, we've broken out all of the portions of the agent into different helper functions. This method here just initializes the agent with its cache, knowledge base index, and embedding model. LangGraph allows us to create workflows that pass around state, and we build this workflow using a set of helper methods to add nodes, add edges, and direct traffic through the graph. For example, this workflow has nodes that perform query decomposition, check the cache, perform additional research, evaluate the quality of the research, and, at the very end, synthesize the result for the end user. Our workflow entry point begins at the decompose-query step, and we set edges between all of the necessary nodes (a rough sketch of this graph construction follows).
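Here is a minimal sketch of what that LangGraph wiring might look like. The node names mirror the workflow described above, but the state fields, the stubbed node functions, the routing helpers, and the score threshold are illustrative assumptions, not the lesson's exact code.

```python
from typing import TypedDict, List, Dict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    query: str
    sub_questions: List[str]
    cached_answers: Dict[str, str]
    research_notes: Dict[str, str]
    quality_score: float
    iterations: int
    answer: str

# Placeholder node functions: the real ones call the LLM, the semantic cache,
# and the knowledge base; these stubs only show how state flows through the graph.
def decompose_query(state):
    return {"sub_questions": [state["query"]], "iterations": 0}

def check_cache(state):
    return {"cached_answers": {}, "research_notes": {}}

def research(state):
    notes = {sq: f"notes for: {sq}" for sq in state["sub_questions"]}
    return {"research_notes": notes, "iterations": state["iterations"] + 1}

def evaluate_quality(state):
    return {"quality_score": 1.0}

def synthesize(state):
    parts = {**state["cached_answers"], **state["research_notes"]}
    return {"answer": "\n".join(parts.values())}

def route_after_cache(state):
    # If every sub-question was served from the cache, skip straight to synthesis.
    uncached = [sq for sq in state["sub_questions"] if sq not in state["cached_answers"]]
    return "research" if uncached else "synthesize"

def route_after_evaluation(state):
    # Retry research (up to two times with feedback) if the judge scores the work too low.
    return "research" if state["quality_score"] < 0.7 and state["iterations"] < 2 else "synthesize"

workflow = StateGraph(AgentState)
workflow.add_node("decompose_query", decompose_query)
workflow.add_node("check_cache", check_cache)
workflow.add_node("research", research)
workflow.add_node("evaluate_quality", evaluate_quality)
workflow.add_node("synthesize", synthesize)

workflow.add_edge(START, "decompose_query")          # entry point
workflow.add_edge("decompose_query", "check_cache")
workflow.add_conditional_edges("check_cache", route_after_cache,
                               {"research": "research", "synthesize": "synthesize"})
workflow.add_edge("research", "evaluate_quality")
workflow.add_conditional_edges("evaluate_quality", route_after_evaluation,
                               {"research": "research", "synthesize": "synthesize"})
workflow.add_edge("synthesize", END)

agent = workflow.compile()
```

Invoking the compiled graph with agent.invoke({"query": "..."}) then runs the whole loop, with the conditional edges deciding at runtime whether the research step is needed at all.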
Some of these edges are deterministic and some are conditional. Conditional edges connect nodes based on a decision, and that decision is often specific to the workflow or application at hand. Here, for example, we're putting a conditional edge after the cache check: if we need to do additional research on one of the decomposed sub-questions, we pass the entry to the research node; otherwise, if everything was already loaded from the cache, we can go directly to the end and skip to synthesis. Our agent also has a direct edge between research and evaluate-quality, so after every single research step we pass the results to an LLM-as-a-judge to evaluate their quality. Lastly, after quality evaluation, we either go back to do more research or go to the very end and synthesize the results for the end user. And that's where our workflow ends. Let's go ahead and compile this agent.

Another nice property of LangGraph is that we can visualize the agent workflow we just built. As you can see, this is the workflow we discussed in the lesson intro: query decomposition, cache checking, research, quality evaluation, and synthesis, all the way to the very end.

It's time to demo the agent. We're going to test it against three different scenarios. In the first, an end user is evaluating a piece of software; they have questions about different features and statuses that they need answered before they can make the purchase. Let's see how the agent responds. The first step is that the agent decomposed the original question into four sub-questions, as you can see: one, two, three, and four. Of those four sub-questions, three were not in the cache, but one we were able to serve directly from the cache. That means the cache hit rate for this particular request was 25%. With three questions that need to be researched because they were not in the cache, the agent begins its process of iterating and finding an answer. At the very end, once it reaches a good solution, all of our sub-questions are validated and added to the cache for the next run (we'll sketch that write-back step after the scenario results below). This workflow completed in about 20 seconds and used eight large language model calls: two to GPT-4 and six to GPT-4-Mini. You can see that the research portion alone took nearly 10 seconds. Here's the full response that came back to the user, personalized with all of the information they need, beginning with "Thank you for considering our platform." You can see it answers questions about SOC 2 compliance, GDPR compliance, API rate limits, and even Salesforce integration.

Let's move on to the next scenario. Scenario two involves a different user who is at a different stage of the buying process for this software and needs to move forward with implementation planning; they have some additional questions for the support agent. Let's see how the agent responds. In similar fashion, the agent decomposes the question into four sub-questions, and this time three are found in the cache. Only one of the sub-questions missed, and that's okay, because the agent can handle that one question and finish its research. At the very end, you can see we get a cache hit rate of 75%: three of the four questions hit the cache.
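Before looking at how that hit rate pays off, here is a minimal sketch of the write-back step mentioned in scenario one, where validated sub-question answers get stored for future runs. It reuses the illustrative sub_question_cache from earlier; the quality threshold and metadata fields are assumptions, not the lesson's exact schema.

```python
def store_validated_answers(research_notes: dict, quality_score: float):
    """After the judge approves a run, write each researched sub-answer back to the cache."""
    if quality_score < 0.7:            # illustrative threshold: only cache validated work
        return
    for sub_question, answer in research_notes.items():
        sub_question_cache.store(
            prompt=sub_question,
            response=answer,
            metadata={"quality_score": quality_score},  # optional bookkeeping
        )
```

On the next request, a semantically similar sub-question will hit these entries instead of triggering the research loop.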
In turn, that 75% hit rate took our LLM calls from eight in the previous run down to just four: two to GPT-4 and two to GPT-4-Mini. The total latency was also about 13 seconds on this run instead of around 20 to 25 on the last one. That's a big improvement. And here you can see the nicely formatted response to the end user. It calls out things like rate limits, Salesforce integration capabilities, data export options, and payment methods, all things this user requested in order to plan their software implementation.

Let's go to the final scenario. In our third and final scenario, yet another user is doing a comprehensive pre-purchase review. They're at the final stage of procurement, and before they can purchase the Pro plan, they want to complete validation on the different things they have questions about. You'll notice some of these are similar to topics that came up in the other users' requests. Let's see how our agent handles this request. Similar to before, the agent breaks the question into four sub-questions, and just like last time, three of the four hit the cache and one misses. Again, this is not a problem, because the agent can still research that one task. At the very end, you can see the total latency is around 18 seconds, and we made six LLM calls this time: not as good as four, but still better than eight. The cache hit rate for this request was again around 75%.

Now that we've run the agent across different scenarios, you're probably wondering how they compare head-to-head. Well, let's take a look. This function, analyze_agent_results, takes the results of all three scenarios we tested and reports back a nice plot and visualization. In scenario one, we had a cache hit rate of 25%, but in the next two scenarios we hit the cache 75% of the time, and you can see the cumulative cache hit rate over time gets up to 60%. In a system with hundreds, thousands, or even millions of users, we expect this kind of cache hit rate to be very effective at saving costs. Let's look at the LLM calls. In the first scenario we used eight calls to different LLMs for different tasks, but in the last two we used four and six respectively. This is a great savings. And probably the most interesting piece for the user experience is latency: in the first scenario the overall latency was quite large, while in the next two scenarios the end-to-end latency came down to around a third of that. This is a significant performance improvement, and over time we expect this kind of trend to continue.

Now we're going to build an interactive demo that brings all of this together. The demo runs on Gradio and lets us reach out to external web pages at a URL of your choice. We'll pull in all of the documentation and content from those pages and load it into a knowledge base for your agent. The agent implements agentic RAG and allows you to chat with those documents, and over time it builds up its semantic cache, just as we've seen in theory and practice in the lessons leading up to this one. Now we have our deep research agent enabled with semantic caching. Let's pull in the contents from the AT&T International cell phone plan site. Hitting Process URL will crawl the website and load all of that content into our local knowledge base (a rough sketch of that ingestion step follows).
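Here is a minimal sketch of what that URL-ingestion step might look like under the hood, assuming a simple requests-based fetch, a naive fixed-size chunker, an OpenAI embedding model via RedisVL's OpenAITextVectorizer, and a RedisVL SearchIndex. The schema, field names, chunking strategy, and index name are illustrative, not the demo's actual implementation, and the exact RedisVL API surface can vary slightly by version.

```python
import os
import requests
from redisvl.index import SearchIndex
from redisvl.utils.vectorize import OpenAITextVectorizer

vectorizer = OpenAITextVectorizer(
    model="text-embedding-3-small",
    api_config={"api_key": os.environ["OPENAI_API_KEY"]},
)

# Illustrative knowledge-base schema: one text field plus a vector field.
kb_index = SearchIndex.from_dict({
    "index": {"name": "agent_kb", "prefix": "kb"},
    "fields": [
        {"name": "content", "type": "text"},
        {"name": "embedding", "type": "vector",
         "attrs": {"dims": 1536, "algorithm": "flat",
                   "distance_metric": "cosine", "datatype": "float32"}},
    ],
}, redis_url="redis://localhost:6379")
kb_index.create(overwrite=True)

def process_url(url: str, chunk_size: int = 1000):
    """Fetch a page, split it into chunks, embed each chunk, and load it into Redis."""
    text = requests.get(url, timeout=30).text  # naive: the real demo would strip HTML first
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    embeddings = vectorizer.embed_many(chunks, as_buffer=True)
    kb_index.load([
        {"content": chunk, "embedding": emb}
        for chunk, emb in zip(chunks, embeddings)
    ])
```

From there, the research node can embed each uncached sub-question and run a vector query against this index to retrieve the relevant chunks.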
Now we can ask different questions of this data, using the exact same agent we saw before. Let's try one question. The first question for our agent is about an upcoming cruise trip: we want to make sure our cell phone coverage through AT&T will still work while we're on the cruise. Will we be okay? The agent is currently doing its research to find an answer. In the performance log, we can see nothing registered on the cache, and it took about four LLM calls overall to get to an answer.

Let's try another question: when traveling internationally, is there any difference between cruise and land in terms of AT&T coverage? All right, on this one we were actually able to leverage the cache. One sub-question hit the cache, we only used two LLM calls end to end, and you can see we got a nice personalized answer for the user.

Let's try another question: in Spain, will I still be covered by AT&T? Nice. The agent tells us yes, we can still use our AT&T service while in Spain, and that AT&T offers an International Day Pass. Excellent. On this question we also used an entry in the cache, made only two LLM calls, and used around 300 tokens. That's great.

Let's now ask a more complicated question about our trip to Spain. We've got a trip coming up where part of it is on a cruise and the other part is on land; what options are there to keep cell coverage through AT&T? Here we can see the agent responded with a personalized answer about the upcoming trip to Spain, noticing that I'm going to be spending time on land and on a cruise. On this question we were also able to utilize something in the cache, so the LLM usage was more efficient than the first time we ran.

Let's take a look at what ended up in our cache. We can do a cursory cache check with a distance threshold of one, meaning return everything even loosely related to this query (a minimal sketch of this kind of check appears at the end of this section). Here we can see the cached entry was the question about having a cruise coming up and wanting to make sure that cell phone coverage through AT&T will work, along with a response that looks like it was reused multiple times over the course of these agent runs.

Now that you have access to this agent in the demo, try another URL, try another site, ask different questions, and experiment with how the agent performs over time. You have what you need here to build a production-ready AI agent that uses a semantic cache. Along the way, we hope you learned a lot of the fundamentals needed to build and scale this kind of system. We look forward to seeing what you build with this.
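As a reference for those experiments, here is a minimal sketch of the kind of cursory cache inspection shown above, reusing the illustrative sub_question_cache from earlier. The probe query, return fields, and per-call distance_threshold override are assumptions and may differ by RedisVL version.

```python
# Inspect what the agent has accumulated in its semantic cache so far.
# A distance threshold of 1.0 is intentionally loose: return anything
# even remotely similar to the probe query.
hits = sub_question_cache.check(
    prompt="AT&T coverage on a cruise",
    num_results=5,
    distance_threshold=1.0,
    return_fields=["prompt", "response", "vector_distance"],
)

for hit in hits:
    print(f"[distance={hit['vector_distance']}]")
    print(f"  cached question: {hit['prompt']}")
    print(f"  cached answer:   {hit['response'][:120]}...")
```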