By the end of this lesson, you'll be able to use prompt caching to reduce cost by up to 90% and reduce latency by up to 85% for long prompts. Let's give it a shot.

Let's start by understanding what prompt caching actually does: what are the benefits, and how does it work? Essentially, prompt caching is a feature that helps us optimize API usage by allowing resumption from particular prefixes that our prompts have in common. In short, we can cache anything that is going to remain consistent as a prefix from one API call to the next, drastically reducing processing time and cost for repetitive tasks or prompts that reuse consistent prefix elements.

So let's take a look at some diagrams before diving into code. On the left we have a hypothetical prompt, represented as shapes. This is the prompt we're going to send off, and none of it has been cached to begin with. We send this request off to the API, it's processed, and let's say we decide to cache everything we sent. At this point, all of our prompt prefix, the entire prompt from that first request, is stored in a cache. On a follow-up request, we have a longer prompt: it contains the exact same prefix from the first request, plus a whole bunch of other stuff afterwards. The API no longer has to process the entire prompt, because we cached that prefix from the previous turn. When we send this new request, we get a cache hit and read from the cache, meaning we don't have to reprocess all of those tokens. Depending on how many tokens that is, that can save a lot of time, and also a lot of money if we're reusing it over and over. We can then write more to the cache and keep this process going, incrementally adding to the cache as a conversation grows, or we can just cache one particularly long part of our prompt.

Back in your notebook, there's the same basic setup we've had: we import Anthropic, set up the client, and have our model name string variable. To see the most obvious difference with and without prompt caching, we're going to use a very, very long prompt. You're going to use the entire text of the book Frankenstein by Mary Shelley, which is available inside a file called frankenstein.txt. The first step is to open that file and read the contents into a variable; we'll call it "book_content" in this case. book_content is going to be a very long string, as you can see from the small slice of it printed here.

The next step is to send the entire book off to the model, along with some prompt like "what happens in chapter three?" Any simple prompt will do here, ideally something to do with the book Frankenstein. It's all done in a function called "make_non_cached_api_call", and you'll see there's some timing logic in here: it starts a timer before the request and stops it after, to calculate the time between the request being sent and the response being received. At the end, the function returns the overall response along with the duration, the delta between the start time and the end time. To be clear, this version uses zero caching whatsoever. The highlighted line is where the entire book content, that massive string, is being provided. Notice it's wrapped inside <book> XML tags.
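For reference, here's a rough sketch of what that uncached function can look like. It assumes the "client" and "model" variables from the setup cell; the exact file name, question wording, and max_tokens value are illustrative rather than exactly what the notebook uses.

```python
import time

# Read the full text of the book into one very long string.
with open("frankenstein.txt", "r") as f:
    book_content = f.read()

def make_non_cached_api_call():
    # The entire book plus a short question, with no caching involved at all.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"<book>{book_content}</book>"},
                {"type": "text", "text": "What happens in chapter 3?"},
            ],
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=messages,
    )
    end_time = time.time()

    # Return both the response and how long the round trip took.
    return response, end_time - start_time
```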
The tags aren't required, but they're a nice way to separate things out for the model: here's this massive document, and at the end, here's the question "what happens in chapter three?", separate from the actual book. We're just demarcating the bounds of this massive book.

The next step is to call the function. One line calls the function, the next prints out how long it took, and the last prints out the actual content we get back from the model. Run the cell and wait; this is a very long prompt, so it may take a while. In this particular instance, the response came back in 17.77 seconds, with no caching involved whatsoever. The actual response text is honestly the least important thing happening in this function. It's more important to focus on the time element, as well as the usage. You can see that quite a few input tokens were processed, around 108,000 input tokens, followed by 324 output tokens (just the number of tokens used in the generation), and zero cache creation input tokens and zero cache read input tokens, because we haven't involved caching at all yet.

Now you can move on to the cached version, which actually takes advantage of our explicit caching API. This function is virtually identical to the previous one, except it's called "make_cached_api_call" instead, and there's one very important addition: in the content block where we're providing the massive book content, there is now a "cache_control" key set to a dictionary with type "ephemeral". Any time you want to set a caching point, essentially telling the API "I would like to cache all of the input tokens up to this point so that I can reuse them next time," the API looks for this marker; it has to be there in order for the API to know that we want to write to the cache.

When you run this the first time, it will still take a long time. We won't get any cache hits, because we have yet to write anything to the cache. So execute it; again, it will take a while because nothing has been cached yet, but as part of this request, the input tokens processed up to the cache_control point will be cached. Once it finishes running, you'll see the response, all the content the model actually generated, but for our purposes the usage is the most important piece. Input tokens this time is only counted as 11, output tokens is 324, and cache creation input tokens is 108,427. A large amount of tokens has now been stored in the cache. The next step is to try to read from them.
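For reference, here's roughly what the cached version can look like. It's the same sketch as before, and the only meaningful change is the cache_control key on the book block (the question asks about chapter five here, a small wording difference that comes up again in a moment; names and max_tokens remain illustrative).

```python
def make_cached_api_call():
    # Identical to the uncached version, except for cache_control on the book block.
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<book>{book_content}</book>",
                    # Cache every input token up to and including this block.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "What happens in chapter 5?"},
            ],
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=messages,
    )
    end_time = time.time()

    return response, end_time - start_time
```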
Now, going back to this function, the cache_control marker really acts in a dual-purpose manner. The first time the API encounters it, it performs a cache write up to that point, writing all 108,000 tokens to the cache. On subsequent requests, when the API encounters the same cache_control point, it looks to see whether anything is already cached up to that point, essentially acting as a read point. So it's dual purpose: it acts as a write point and a read point.

That means you can send the exact same request, just run the same function again, and this time it will act as a read. Run the line again, with new variable names, response_2 and duration_2. Once it finishes, take a look at the usage property of response_2. Notice that cache_creation_input_tokens is zero, because no writing to the cache was performed, and cache_read_input_tokens is around 108,000 tokens. Compare that to the previous turn: cache creation input tokens was 108,000 and cache read was zero. Same request shape, same messages, but this time there was no write to the cache, just a massive read from it. And if we look at duration_2, it's 6.2 seconds, compared to 17.7 seconds for our totally uncached version. It's essentially the same prompt; the one small difference is that the first version asked what happens in chapter three and this one asks what happens in chapter five, so it's basically the same length with a different number at the end. 17.7 seconds versus 6.2 seconds, and that's before even considering the cost savings.

So let's talk about cost. The way prompt caching is priced is very straightforward. You pay a bit of a premium to write to the cache: cache write tokens are 25% more expensive than regular base input tokens. The significant upside comes when it's time to read from the cache: those tokens are 90% cheaper than uncached base input tokens. Anything that isn't cached, and any output that's generated, is still priced at the standard rates. As a rough illustration, if a prefix would cost, say, $1.00 to process as ordinary input, writing it to the cache costs about $1.25 once, and every subsequent cache hit on it costs about $0.10. This means it may not make sense to cache every single thing for every single message, but if you have some prompt prefix that will remain the same across a whole bunch of requests, it can be extremely efficient and cost effective to cache that long prefix and reuse it: pay 25% more one time to make the cache write, then pay 90% less for all of those input tokens on every subsequent request that gets a cache hit.

Speaking of cache hits, one important thing to note is that caches do not live forever. Each cache has a five-minute TTL, or time to live. Currently we only support ephemeral caching, and each time you read from a cache, that five-minute timer resets. So it really depends on your use case, how you're using caching, and what type of prompt you're sending, but any time you have a very long prompt you want to save money on, and you're reusing a lot of that prompt, you can use prompt caching.

One common gotcha with prompt caching is multi-turn conversation caching. Imagine you have a very long conversation, maybe a conversation involving the book Frankenstein, where the model is sent the entire book text and there's a whole conversation about it: a long common prefix that should be cached to save on tokens, especially if the conversation has dozens or hundreds of turns and each turn is long. What you can do is cache the conversation as it grows, by setting cache control points; remember, a cache control point acts both as a write point ("write everything up to this point to the cache") and a read point ("attempt to read from the cache"). When working with multi-turn conversations, you can use two cache control points that you continuously move down the conversation, so that you always have a cache control point on the very last user message and another on the second-to-last user message.

Why do you do that, and how does it work? Again, it comes down to the dual-purpose nature. Imagine a very long conversation with hundreds of messages. The cache control point on the second-to-last user message will presumably get a read from the previous turn: that used to be the last user message, a write was performed to the cache there, and now it acts as a read, so there will be a cache hit. The cache control point at the very end, on the last user message, acts as the new write point, telling the API to write everything up to here. Then you continue this pattern. If the conversation grows, which is what the messages list in the notebook shows, it's the same pattern: put a cache control on the second-to-last user message and a cache control on the last user message. This tells the API "write everything up to this point, it's new" and "read everything you can up to this point," because that second-to-last message, "tell me more about Mars," was previously the last message. You keep moving these cache controls down the conversation, onto the last and second-to-last user messages, to continuously read from the most recent cache and write up to the end of the conversation so that you can read from it next time around, over and over. This can confuse people; it's in our documentation, and it's also in this notebook, of course. No requests are being made here; it's just a messages list you can refer back to as an example, like the sketch below.
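Here's a minimal, hypothetical version of that pattern. The conversation content is made up; what matters is where the two cache_control blocks sit.

```python
# Example messages list for a growing conversation (no request is made here).
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Tell me about the solar system."}],
    },
    {"role": "assistant", "content": "The solar system consists of the Sun and..."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Tell me more about Mars.",
                # Second-to-last user message: this was the write point on the
                # previous request, so this turn gets a cache hit (a read) here.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "assistant", "content": "Mars is the fourth planet from the Sun..."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "How long would it take to get there?",
                # Last user message: write everything up to this point to the
                # cache so the next turn can read it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]
```

As the conversation grows, you'd drop the cache_control from "Tell me more about Mars," keep one on what is now the second-to-last user message, and add a new one to the newest user message at the end.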
Finally, to tie things back to computer use, here's a little excerpt from our computer use quickstart demo that we've been looking at throughout the last few lessons. I just want to highlight that it does in fact use cache_control, set to type "ephemeral", to cache a long history of messages with the model that includes a lot of screenshots, which can take up a decent number of tokens. If the model is taking actions over a two-, three-, or four-minute interaction, trying something, taking a screenshot, with ten different screenshots and a whole bunch of tool calls (which we've yet to discuss), caching can significantly cut down on both the time and the cost of the actual computer use. We'll get a chance to look closer at this code later; this is mostly just to show you: "Look, caching is real in the wild as well, not just in our educational notebook."