In this lesson, you implement a cost-saving strategy called prompt compression, which is particularly valuable for applications like RAG and agentic systems. You will gain an intuition for what prompt compression is, how to use it, and the operational advantages it brings to an LLM application. Let's get on with it.

Many prompting strategies have emerged over recent years, such as in-context learning, chain-of-thought, and ReAct prompting. Getting appropriate, quality responses from an LLM is an art form, and most of these prompting strategies involve composing extensive text as input to the LLM. LLMs with large context windows are becoming the new norm. It is now common to see LLMs that can take an input of over 100,000 tokens, and in some cases even a million tokens. That is like passing an entire novel into an LLM in a single inference call. Although useful in some cases, utilizing the full context window when accessing LLMs through REST API calls can become very expensive. LLMs with large context windows have their place in real-world applications, but the operational costs of these models can skyrocket. Take, for example, paying $10 per 1 million tokens in an application such as Airbnb, which has several million users per day. You would have a huge operational expense from the volume of interactions alone, not to mention increased response latency, as the model has to process more input to extract the appropriate information to respond to user queries.

As you continue to learn and build AI applications that use LLMs, you will come across the idea of prompt compression, sometimes referred to as token compression. You might think, "I won't have that much volume in the initial stages of my AI application," but building robust AI applications requires thinking ahead about scalability and solving issues that might become bottlenecks. You will implement a prompt compression technique in the code section of this lesson and observe firsthand how easy it is to implement prompt compression alongside existing RAG pipelines.

So, prompt compression is the process of reducing the number of tokens in a prompt. Let's see what this looks like in an example. On the screen, you can observe an original, uncompressed prompt that spans three long sentences. Using the LLMLingua package, which you will use in the coding section of this lesson, we are able to reduce the uncompressed prompt to two sentences that span only two or three rows. This is the power of prompt compression, which you will see firsthand in the coding section. I have included a link to the paper presenting the prompt compression technique on the slides. Feel free to read the research paper after this lesson.

In a few minutes, you will compress an extensive prompt of a few thousand tokens down to a few hundred tokens. Passing input into the prompt compression technique is very straightforward. Imagine having an uncompressed prompt of 50,000 tokens. By passing the uncompressed prompt and specifying a few parameters using the prompt compression library LLMLingua, you can reduce it down to 10,000 tokens. This is a five times reduction, and you can then pass the compressed input straight into the LLM as you would with the uncompressed prompt, and receive output of the same quality as if the model had been given the uncompressed prompt. You are about to see this in code.
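To make the savings concrete, here is a back-of-the-envelope calculation for the 50,000-to-10,000-token example above. It assumes the illustrative $10 per 1 million input tokens mentioned earlier; these are not real prices, just numbers to build intuition.

```python
# Back-of-the-envelope cost estimate for the 5x compression example above,
# assuming $10 per 1M input tokens (illustrative pricing, not a real quote).
price_per_million_tokens = 10.00
uncompressed_tokens = 50_000
compressed_tokens = 10_000

cost_before = uncompressed_tokens / 1_000_000 * price_per_million_tokens  # $0.50 per call
cost_after = compressed_tokens / 1_000_000 * price_per_million_tokens     # $0.10 per call
print(f"Cost per call: ${cost_before:.2f} -> ${cost_after:.2f} "
      f"({uncompressed_tokens / compressed_tokens:.0f}x compression)")
```

Multiplied across millions of calls per day, that per-call difference is what drives the operational savings discussed in this lesson.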
In the coding section of this lesson, you will go through some familiar steps, which include setting up the RAG pipeline, adding the relevant MongoDB stages, and then implementing the compression logic. As usual, you will handle the user query and observe the results. Let's code.

Start by importing the custom utils module, as you've done in previous lessons. You then move on to load your dataset, where you can observe its attributes, similar to previous lessons. The next steps were covered in previous lessons: you model your documents, connect to your database, extract the database and collection objects, delete existing records within the collection, ingest new data, and lastly create your vector search index.

Just like in previous lessons, you handle the user query. You start by creating the search result item model, which specifies the attributes you want returned from the documents in the database operation. In this case, you have the name, address, and other corresponding attributes. Just like in the previous lesson, you add the additional boosting stages: the review average stage, the weighting stage, and the sorting stage. You then collect all of these into a variable called additional stages. All of these are steps you took in previous lessons.

Now we're at the main part of this lesson, where you have a handle user query function similar to the one you've seen in the previous lesson, but here you print out the uncompressed prompt for observation. This is specified in these two new print statements. You use the same user query from the previous lesson, and the same handle user query function, with the difference that the new print statements let you see the uncompressed prompt. From the output, you can observe the time it took for the vector search operation to execute, which is a fraction of a millisecond. You can also observe the uncompressed prompt. Here, you can see that the uncompressed prompt is extensive, and do note that it has been truncated to fit on the screen. You can also view the full content of the prompt by looking at the documents returned from the vector search operation, listed in the table. Pause the video here to take in the sheer size of what is being passed into the LLM. You will also notice that the system has recommended the homely room in a five-star new condo. Remember this listing.

Now, this is the fun part. We're going to look at a technique that allows us to take the extensive prompt you observed before and reduce it down to a few hundred tokens. You start by importing the PromptCompressor constructor from the LLMLingua library. Using the PromptCompressor constructor, you specify a smaller language model that has been fine-tuned for prompt compression to compress the uncompressed prompt. You also specify that you want the latest LLMLingua prompt compression logic by setting the argument use_llmlingua2 to true, and you tell the prompt compression module to use the CPU on the device.
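Here is a minimal sketch of that compressor setup. The model checkpoint name is an assumption (a publicly available LLMLingua-2 model); the lesson itself only states that a smaller fine-tuned model is used, with use_llmlingua2 enabled and the CPU as the device.

```python
# A minimal sketch of the compressor setup, assuming the llmlingua package is installed.
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",  # assumed checkpoint
    use_llmlingua2=True,   # use the latest LLMLingua-2 compression logic
    device_map="cpu",      # run the compression model on the CPU
)
```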
Now that you've set up your prompt compressor, specifically a smaller language model to do the prompt compression, you can move on to define the compressed query prompt function, which takes in the uncompressed prompt as its query. The prompt compressor module requires the input to be structured in a certain way: a component-based structure with the fields demonstration, instruction, and question. Let me go over what this means. Demonstration holds the context used as additional information passed to the LLM along with the user query; this is essentially the documents returned from the database operation. Instruction holds a specific instruction that tells the smaller language model how to compress the prompt. Finally, the question is the user query itself.

Now you can call the compress prompt method on the LLMLingua model you initialized earlier. Let me explain what each argument does. The first argument specifies how to split the context up, specifically on newlines. The second argument takes in the instruction, then the question. Next, you specify the target token count that you want the uncompressed prompt to be compressed down to. Then there is the compression algorithm to use; you'll be using the latest compression algorithm from LLMLingua, specifically LongLLMLingua. Next, you specify the context budget, allowing the budget to overrun by 100 tokens. Then you specify the compression ratio, which indicates how the compression logic assigns tokens between the context (the demonstration) and the overall instruction and question. Finally, you enable the compressor to reorder the context using a sort algorithm. Those are all the arguments for the compress prompt method. The result of the compressed query prompt function is a JSON representation of the compressed prompt, which includes information such as the original token count of the uncompressed prompt and the compressed token count. You will see this in action in a second.

Now that you've specified a method for compressing an uncompressed prompt, you can define handle user query with compression, which takes in the user query and conducts the compression defined earlier. This function is similar to the handle user query function from previous lessons, but the key difference is that there's a new input to the LLM specified as query info. This query info follows the structure of the compression logic, with the demonstration, instruction, and question. Remember, the demonstration is simply the result of the database operation, the instruction tells the compression module how you want the compression to be executed, and the question is simply the user query. This structure is passed into compressed query prompt by calling the function defined earlier with the query info, and the result is assigned to a variable called compressed prompt. To visualize the result, you print out the compressed prompt in a structured manner. Finally, handle user query with compression returns the search results and the compressed prompt itself. The final method you implement in this lesson is handle system response, which passes the query along with the compressed prompt as input into the LLM. For visualization, you print out the query and the system response.
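Here is a minimal sketch of how these pieces might fit together, assuming the llm_lingua compressor from the previous snippet. The target token count, the dynamic compression ratio value, the instruction wording, and the placeholder search results and user query are illustrative assumptions, not the exact values used in the lesson.

```python
# A sketch of the compression step, assuming llm_lingua was initialized as shown earlier.
def compressed_query_prompt(query_info: dict) -> dict:
    demonstration = query_info["demonstration"]  # documents returned by the vector search
    instruction = query_info["instruction"]      # how the compression should be carried out
    question = query_info["question"]            # the original user query

    compressed = llm_lingua.compress_prompt(
        demonstration.split("\n"),               # first argument: the context, split on newlines
        instruction=instruction,
        question=question,
        target_token=500,                        # token count to compress down to (illustrative)
        rank_method="longllmlingua",             # latest LongLLMLingua compression algorithm
        context_budget="+100",                   # allow the context budget to overrun by 100 tokens
        dynamic_context_compression_ratio=0.4,   # split of tokens between context and instruction/question
        reorder_context="sort",                  # reorder the context documents by relevance
    )
    # The result holds the compressed prompt plus the original and compressed token counts.
    return compressed


# Illustrative placeholders standing in for the real pipeline values.
search_results = ["Homely room in a five-star new condo, in a warm, friendly neighborhood near restaurants."]
user_query = "Recommend a place to stay near restaurants in a friendly neighborhood."

query_info = {
    "demonstration": "\n".join(str(doc) for doc in search_results),  # database operation results
    "instruction": "Answer the question using only the listings provided as context.",  # assumed wording
    "question": user_query,
}

compressed_prompt = compressed_query_prompt(query_info)
print(compressed_prompt["compressed_prompt"])
print(compressed_prompt["origin_tokens"], "->", compressed_prompt["compressed_tokens"])
```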
Now you can use the handle user query with compression method defined earlier, passing in the query, the database object, the collection object, the additional stages, and the vector search index you're using for this lesson. Execute the cell. The execution of the cell might take a few minutes, as you are using a smaller language model to compress the prompt. Although there is increased latency at this step, the overall operational cost is reduced.

Here is the result of the prompt compression technique. There is a field, compressed prompt, that holds the compressed prompt, and as you can see, it is much shorter than the prompt we saw earlier. More importantly, the original uncompressed prompt was 4284 tokens long, and the compressed prompt is just 512 tokens. That is roughly an eight times compression ratio. There is also an indication of the cost you're saving when using this prompt compression technique and passing the compressed prompt as input to a GPT-4 model. In this case, for this particular call, you're saving $0.20. If you think about this for a large-scale application such as Airbnb, where several million inference calls are made to APIs, the savings can run into the hundreds of thousands.

The last step of this lesson is to pass the compressed prompt and the user query to the large language model to actually get a system response. Confirm you have a compressed prompt, and then pass the compressed prompt and the query into the handle system response function. Here we can observe the result. The compressed prompt has provided a recommendation that closely meets the user query. Specifically, the listing is in a warm, friendly neighborhood, which was included in the space description of the listing, and it's next to restaurants, which is what was specified in the user query. We saved on the operational costs of the RAG pipeline and obtained a quality output. This output is not identical to the uncompressed prompt's output, but it is of similar quality and meets the requirements specified in the user query. The difference between the result of the uncompressed prompt, which provided this recommendation, and the compressed prompt, which provided this one, is quite minimal. This shows that with a lower token count, you can get outputs of similar quality from a large language model.

This concludes this lesson. In this lesson, you learned how to create a RAG pipeline that performs a vector search, and how to apply prompt compression to get a quality output, just as you would with an uncompressed prompt.