This lesson is all about turning raw agent interactions into durable knowledge. You'll build pipelines that extract structured facts from conversations, consolidate episodic memory into semantic memory, and create write-back loops that let your agent update and refine its own memory autonomously. Let's get coding.

In this lesson, we will do an overview of the most common memory operations. We will also talk about context engineering, how to use it to our advantage, and how to efficiently use context window reduction techniques. And finally, we will talk about something called Workflow Memory and how to summarize efficiently to improve our memory operations.

Let's start with the most common memory operations, which we already saw in lesson two. First, we have conversational memory, which is typically stored in a SQL table because it's conversational: we want to store a lot of it as raw text. The rest of the operations, for instance the knowledge base, the workflow and toolbox patterns, entities, and summaries, are usually saved as vectors, because we want to perform similarity search over them in one way or another, as we saw in the last lesson with the toolbox pattern.

So, what is context engineering? Context engineering is the practice of carefully and optimally selecting the information that goes into a large language model's context window. This makes the model's responses more efficient and better tailored to the user's actual input, which in turn produces better responses. In the real world, we have lots of data sources: databases, APIs, MCP servers, and also the internet. These are all potential contents for the context window. But we cannot put all of them into the context window, because it is a limited resource.
So we need to carefully and optimally select the context that will actually go into the LLM's context window. What do we do if we have more information than the context window allows? For this, we will talk about context window reduction, which is the process of shrinking the information that is actually placed in the window. There are two main techniques, Context Summarization and Context Compaction. Each has its own advantages and disadvantages, which we will see now.

Context Summarization is the process of taking the whole context of the large language model and running it through the LLM, focusing on three things. First, we want to retain the highest-signal information, keeping the task-relevant facts and claims and removing the low-signal parts. Second, we want to preserve the meaning and the key relationships in the conversation, so that removing a part of it doesn't lose track of the actual purpose of the user's prompt. And third, we want to remove redundant and useless parts; a useless part is something with low-value detail or something not directly connected to the user's actual query. After performing Context Summarization, you get a smaller input, so the large language model can better focus on the task at hand.

To define it: Context Summarization is the process of compressing the context into a shorter representation that preserves the most task-relevant information from the original. With this technique, we inject the summarized context into a clean context window by wiping the previous conversation with the LLM, so that the model reasons through everything again with the summary as its starting point.
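Context Summarization can be sketched in a few lines. This is a minimal illustration, not the lesson's implementation: `llm_summarize` is a hypothetical injected callable that would wrap the actual LLM call in practice.

```python
def summarize_context(messages, llm_summarize):
    """Compress a message list into a short summary.

    `llm_summarize` is any callable taking a prompt string and returning
    summary text; in the lesson it would wrap a call to the OpenAI client.
    """
    # Rebuild a plain-text transcript from the structured messages.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    prompt = (
        "Summarize the conversation below. Retain task-relevant facts and "
        "key relationships; drop redundant and low-value detail.\n\n"
        + transcript
    )
    return llm_summarize(prompt)
```

The returned summary then replaces the wiped conversation as the model's new starting point.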
But this has one disadvantage: it will always be a lossy technique. What do we mean by that? A lossy technique means that we will always have some data loss; we will always lose a little bit of information in the process.

Then we have Context Compaction, our second technique. With Context Compaction, rather than summarizing the information, we transfer it into the database and use the database as an external extension of the large language model, so that whenever the model needs more information, it can just go into the database and retrieve it. This lets us offload a lot of the context size into the database and let the model decide when it wants to access it. How do we do that? We take the information, store it in the database under an ID, and let the large language model know that this information is available, complementing the identifier with a description of the chunk we just took out of the context. That way, if the model only needs a general idea of what the compacted content was, it can look at the description; if it needs more, it can go into the database and pull the whole context out.

What about Workflow Memory? Workflow Memory is a technique, also aided by a database. Going back to the current-weather example we had before: imagine the user asks, "Get me the current weather." This is something that requires a lot of steps for an LLM to perform. Workflow Memory helps us streamline and preserve the sequence of steps needed to perform this query. For instance, to get the current weather, we might first get the user's location, then access a weather API or application tool, passing the user's latitude and longitude into the application.
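Context Compaction can be sketched with a plain dict standing in for the database; `compact_chunk` and `expand_chunk` are illustrative names, not the lesson's API.

```python
import uuid

def compact_chunk(store: dict, chunk: str, description: str) -> str:
    """Offload a context chunk to external storage; return the short
    pointer (ID plus description) that replaces it in the context window."""
    chunk_id = str(uuid.uuid4())
    store[chunk_id] = chunk
    return f"[compacted:{chunk_id}] {description}"

def expand_chunk(store: dict, chunk_id: str) -> str:
    """Pull the full chunk back out when the model decides it needs it."""
    return store[chunk_id]
```

Unlike summarization, this is lossless: the full text survives in the store, and only the pointer occupies the context window.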
Then we will be able to get the current weather for those coordinates and, finally, return the weather response to the user. Since this is a multi-step process, Workflow Memory helps because we can reuse this set of steps to get the current weather for any user after performing it once. We preserve the structure of the work over time in the database, in a way that tells the large language model exactly what steps to follow to respond to users. And, if you remember what we were talking about before, we can use a relational table to store the steps, a timestamp of when they were executed, and an embedding of all the steps.

Here is an example: you have a workflow name with an optional description, the original user request, and a set of ordered steps, so the large language model can reproduce the workflow over and over without having to figure out on the spot what to do. This will help the model know exactly what to do on each user request without having to figure it out on the fly, and, as a consequence, will make it much more efficient and reduce the amount of context required to process and respond to the user.

Now let's work through the notebook. We're going to cover memory operations, context engineering, and the ways we have to reduce the context efficiently. Just like we have done in the previous lessons, we need to set up access to our database, and as you'll see, we're correctly connected to the database again. We are also going to initialize our OpenAI client so we can use their GPT models, as well as the embedding model from Hugging Face that we used before. We're also going to recall our table names so that we can properly address them in our store manager and our memory managers.
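The workflow record described above might be sketched like this; the field names are illustrative, not the lesson's exact schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class WorkflowRecord:
    """One reusable workflow, as it might sit in a relational table."""
    name: str
    user_request: str
    steps: list                                    # ordered steps to replay
    description: str = ""
    created_at: float = field(default_factory=time.time)
    embedding: list = field(default_factory=list)  # vector over the steps

# The weather example from the lesson, captured once and reusable forever.
weather_workflow = WorkflowRecord(
    name="get_current_weather",
    user_request="Get me the current weather",
    steps=[
        "Get the user's location",
        "Call the weather API with the user's latitude and longitude",
        "Read the current weather for those coordinates",
        "Return the weather response to the user",
    ],
)
```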
As we did before, we're also going to update and recreate our conversational history table and our tool log history table, so we have a reference system for the future when we inspect these logs. You'll see here that they already exist, but if you were running this for the first time, you would see them being created. After creating all our objects, our conversation tables, and our vector stores, we reinitialize our StoreManager and get all the store objects we need to operate; as you'll see, they have all been loaded properly via the StoreManager. The same goes for the instantiation of the MemoryManager and our Toolbox, which lets the large language model select the right tools at the right time; they have been properly initialized.

Very quickly, before operating with these models, we're going to set a maximum number of tokens for the model, which in this case is 256,000. Since the purpose of this chapter is to work with context sizes, we create a function called calculate_context_usage, which estimates the number of tokens present in the context at any time. We need this to assess when to actually perform context summarization or context compaction, the two techniques we described before. Here we're doing a simple estimation by dividing the total number of characters by four. Depending on how each model tokenizes, you'll find that some divide by two and some by four, but we're going to use this estimate of four characters per token. We also set a percentage threshold so that the maximum is never reached. And now we're going to create our first summarization function.
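A minimal version of that estimator might look like this. The 4-characters-per-token heuristic and the 256,000-token limit are from the lesson; the exact function signature is an assumption.

```python
MAX_CONTEXT_TOKENS = 256_000   # the model limit used in the lesson
CHARS_PER_TOKEN = 4            # rough heuristic; varies by tokenizer

def calculate_context_usage(messages) -> dict:
    """Estimate tokens in the context as total characters / 4."""
    chars = sum(len(m["content"]) for m in messages)
    tokens = chars // CHARS_PER_TOKEN
    return {
        "estimated_tokens": tokens,
        "pct_used": 100 * tokens / MAX_CONTEXT_TOKENS,
    }
```

For production you would use a real tokenizer for exact counts, but a character-based estimate is cheap and close enough to decide when to reduce the context.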
Since summarization is one of the two techniques we described in the lesson, we need to discern when summarization is optimal to perform and what instructions to give the large language model to perform it. You can use different prompts, and I invite you to try different ones; in general, different types of prompts work better against different types of problems, but here is an example of the prompt we want to give. Remember that summarization is a lossy technique, so we will always lose a little bit of information in the process, and the prompting technique you use will determine the quality of the output you generate.

In this case, we summarize the conversation so that it can be resumed in the future, losing the least amount of information and preserving the highest-signal content from our context. We want to preserve all the important information and remove what the LLM considers least important. For that, we want to keep concrete details; separate facts from open-ended things like questions, so we don't hallucinate information or include hints generated by the LLM; and keep the summary as concise as possible while still being useful in the future. We also cap the final response at a maximum of 6,000 characters. So this is what we do here: we call the LLM client with the model we want, a maximum number of completion tokens for the response, and this structured OpenAI request, and after that we can retrieve the actual summary.
We also have a safety mechanism here: if summarization fails the first time, a fallback retries with a simpler prompt carrying fewer instructions, so that if the first output is empty, we never completely lose the context from the original user input.

In the opposite direction, we also need a mechanism to expand a summary back into the original conversation. In case we stored a summary identifier through context compaction, we need to be able to undo it, and we create this expand_summary function to perform that. For that, we obtain the summary_text from the memory_manager by giving it the summary_id of whatever we compacted, and then we can get the original context that was previously summarized or compacted.

Now that we have both summarization and compaction, plus a way to expand them back, we create a function that does five steps: read all the unsummarized messages from a thread (a thread meaning a conversation); generate a summary via the LLM, that is, perform context summarization; store this summary in the summary memory; mark the rows used in the summarization with an identifier; and finally return the summary object for future use. First, we read the unsummarized conversation with a SQL query that goes through our whole conversation table via the memory_manager, which, as you'll remember, is stored in a SQL table rather than as vectors, so we can trace back the timestamp, the conversation and chat IDs, and all the content in the conversation. We then use all the rows we get back to build a transcript of the conversation.
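The five steps can be sketched end to end with an in-memory SQLite database standing in for the lesson's tables. Table and column names here are illustrative, and `llm_summarize` is an injected stand-in for the OpenAI call.

```python
import sqlite3

def summarize_thread(conn, thread_id, llm_summarize):
    """Read unsummarized rows, summarize, store, mark rows, return."""
    rows = conn.execute(                                    # step 1
        "SELECT id, role, content FROM conversation "
        "WHERE thread_id = ? AND summary_id IS NULL ORDER BY id",
        (thread_id,),
    ).fetchall()
    if not rows:
        return None
    transcript = "\n".join(f"{role}: {content}" for _id, role, content in rows)
    summary = llm_summarize(transcript)                     # step 2
    summary_id = f"sum-{thread_id}-{rows[-1][0]}"
    conn.execute(
        "INSERT INTO summaries (id, text) VALUES (?, ?)",   # step 3
        (summary_id, summary),
    )
    conn.executemany(
        "UPDATE conversation SET summary_id = ? WHERE id = ?",  # step 4
        [(summary_id, row[0]) for row in rows],
    )
    conn.commit()
    return {"summary_id": summary_id, "summary": summary}   # step 5
```

A second call on the same thread returns None, because every row has already been marked with a summary_id.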
Up until this point, we're just rebuilding the conversation we had with the large language model. Then we summarize its context window from the transcript, with the help of the memory_manager and the OpenAI client. We generate a summary_id for it and mark the rows that were used in summarization, so that if we want to perform summarization again in the future, we know this information has already been summarized; we do some updating on the memory_manager for this. Finally, we mark the messages as summarized and return the summarized result.

And as we discussed in the lesson, we also have this function that compacts the information and puts an ID and a summary of the conversation into the database. This function first goes over the whole context we pass it and, through the llm_client and the memory_manager, obtains the thread_id or summary ID of the conversation. Then it replaces the conversation section, taking the information from the identifier it just got and compacting the context by rebuilding it. After this function executes, we obtain the compacted context, which will help the large language model pull in the actual information when it needs it. And we run that.

To put all the concepts from the lesson into practice, we also register this as a new tool, following the toolbox pattern, so that the large language model can perform it automatically when needed. Whenever the model reaches 80% utilization, it can automatically summarize and store the conversation in the database, effectively performing context compaction. You will see the importance of having the toolbox pattern active.
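Registration can be as simple as mapping a name and description to a callable. This toy `register_tool` is an illustrative stand-in for the lesson's Toolbox, not its real API; the description is what a retrieval step would embed and match against the user's request.

```python
def register_tool(toolbox: dict, name: str, fn, description: str) -> None:
    """Register a callable under a name, with a description that a
    similarity search can match against the user's request."""
    toolbox[name] = {"fn": fn, "description": description}

toolbox = {}
register_tool(
    toolbox,
    "compact_context",
    lambda context: context,  # placeholder for the real compaction function
    "Summarize and offload the conversation when usage passes 80%",
)
```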
We have been creating tens of tools, and it would be very naive to put them all into the context: it would cause context bloat and context confusion, increase your latency, and essentially confuse the whole large language model. The toolbox pattern helps us with that. We will also create monitor_context_window, which warns us of the actual context utilization at any time using the estimations we set up before. Using the function we created earlier in the notebook, we calculate an estimate of the context we're using and add a value that tells us whether we're in the green, whether we need to worry about utilization, or whether we're at a critical level where context summarization or compaction needs to happen.

So now the testing begins. All the functions and tools we have created and made available to the large language model are going to be exercised now: we'll test them by creating a sample conversation about some research we want to do and see the results. In this sample research conversation, we're supposedly working on our PhD thesis and talking with our GPT model about what to do, what to research, and some topics about RAG. We have a thread in the conversational table of our database, accessible through the memory manager, with a total of 32 messages. Now we're going to monitor how much context we're using. For that, we call monitor_context_window with the current_context, which includes the 32 messages, and print the number of tokens we have, the maximum allowed by the GPT model we're using, the percentage currently used, and the status, being okay, warning, or critical, meaning whether we need to perform context summarization or compaction, or whether we're still fine.
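monitor_context_window can be a thin classifier on top of the token estimate. The 80% critical threshold is from the lesson; the 60% warning level here is an assumption for illustration.

```python
def monitor_context_window(messages, max_tokens=256_000) -> dict:
    """Classify current context usage as ok / warning / critical."""
    tokens = sum(len(m["content"]) for m in messages) // 4  # ~4 chars/token
    pct = 100 * tokens / max_tokens
    if pct >= 80:
        status = "critical"   # summarize or compact now
    elif pct >= 60:
        status = "warning"
    else:
        status = "ok"
    return {"tokens": tokens, "pct": round(pct, 2), "status": status}
```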
As you'll see here, our context window monitor says we're using a total of 2,000 tokens, which is less than 1% of the maximum. At this point, we're fine: we don't need to perform context summarization because we still have a lot of space in our context, so the model's attention mechanism is not heavily overloaded. And this is a print of the current conversation with the 32 messages.

Now, even though we haven't reached 80% utilization, we're going to call summarize_conversation, to show that when you are filling up your context window you can also trigger it manually whenever you like. We call summarize_conversation, which calls the OpenAI model, and as you'll see, our conversation is summarized. Apart from the summary, we also get an ID, which can optionally be stored in the database together with a description as a context compaction feature. We also mark all the messages as summarized, so that the next time we summarize, we skip these messages: they already have a summary and a description explaining what they entail. And this is the final summary created from the 32 messages.

Now we'll do exactly the opposite. Instead of compacting the summary and creating a summary ID, we take the ID to query the database and uncompact it by calling the expand_summary tool, which we already registered in the toolbox. As you'll see here, using the summary ID to fetch from the database, we have effectively allowed the large language model to extend its memory with the database: we retrieve the summary context, a description of what was in the conversation, plus the original 32 messages.
Finally, we have a pipeline that verifies that all summarized rows in the database are omitted from future summarization efforts. After summarizing a thread, we need to check in the database that the rows have been marked properly, so we execute some SQL to count how many rows are summarized and how many are not, and print the results. As you'll see here, through our memory manager and our conversational memory, no unsummarized messages were found for the thread, since we already summarized it: they were all marked as summarized. And with this, you have successfully completed lesson 4. Congratulations.
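That verification query can be reproduced with an in-memory SQLite table (schema and column names are illustrative). In SQLite, a boolean expression evaluates to 1 or 0, so SUM over it counts matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conversation (id INTEGER PRIMARY KEY, thread_id TEXT, summary_id TEXT)"
)
# Pretend every row in the thread was already marked during summarization.
conn.executemany(
    "INSERT INTO conversation (thread_id, summary_id) VALUES (?, ?)",
    [("thread-1", "sum-1")] * 3,
)
summarized, unsummarized = conn.execute(
    "SELECT SUM(summary_id IS NOT NULL), SUM(summary_id IS NULL) "
    "FROM conversation WHERE thread_id = ?",
    ("thread-1",),
).fetchone()
print(summarized, unsummarized)  # 3 0
```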