
Instructor: Richard Chen

Understand how LLM inference works token by token, why it gets expensive at scale, and how the KV cache eliminates redundant computation by storing and reusing intermediate values.
Implement SGLang's RadixAttention to extend caching across users and requests, and measure the real speedups it delivers.
Apply SGLang's caching and parallelism strategies to diffusion models, accelerating image generation using the same principles as text.
Introducing Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSys and RadixArk, and taught by Richard Chen, a Member of Technical Staff at RadixArk.
Running LLMs in production is expensive. Much of that cost comes from redundant computation: every new request forces the model to reprocess the same system prompt and shared context from scratch. SGLang is an open-source inference framework that eliminates that waste by caching computation that's already been done and reusing it across future requests.
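To make the waste concrete, here is a back-of-the-envelope sketch (an illustration for this description, not SGLang code) comparing the attention work done with and without reuse of cached intermediate values:

```python
def cost_without_cache(prompt_len, new_tokens):
    """Attention work when every step reprocesses the whole sequence.

    Each generation step runs attention over all tokens so far,
    which is quadratic in the current sequence length.
    """
    return sum((prompt_len + t) ** 2 for t in range(1, new_tokens + 1))

def cost_with_cache(prompt_len, new_tokens):
    """Attention work when keys/values for past tokens are cached.

    The prompt is processed once; afterwards each new token only
    attends over the cached entries, so each step is linear.
    """
    prefill = prompt_len ** 2
    decode = sum(prompt_len + t for t in range(1, new_tokens + 1))
    return prefill + decode

# A 1,000-token shared prompt followed by 100 generated tokens:
naive = cost_without_cache(1000, 100)
cached = cost_with_cache(1000, 100)
print(f"speedup ~ {naive / cached:.0f}x")
```

The exact numbers are toy accounting, but the shape of the result is real: caching turns per-step quadratic work into linear work, which is why the savings grow with context length.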
In this course, you'll build a clear mental model of how inference works (from input tokens to generated output) and learn why the memory bottleneck exists. From there, you'll implement the KV cache from scratch to store and reuse intermediate attention values within a single request. Then you'll go further with RadixAttention, SGLang's approach to sharing KV cache across requests by identifying common prefixes using a radix tree. Finally, you'll apply these same optimization principles to image generation using diffusion models.
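As a preview of the prefix-sharing idea, here is a toy per-token trie (a deliberately simplified stand-in for SGLang's radix tree, which also compresses token chains into edges and evicts entries under memory pressure) that reports how much of a new request's prompt is already cached:

```python
class Node:
    def __init__(self):
        self.children = {}  # next token id -> Node

class PrefixCache:
    """Toy prefix index: every inserted prefix is assumed to have
    its keys/values cached already. A hypothetical sketch of the
    lookup RadixAttention performs, not SGLang's implementation."""

    def __init__(self):
        self.root = Node()

    def insert(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, Node())

    def match_prefix(self, token_ids):
        """Length of the longest cached prefix of `token_ids`;
        only the remaining suffix needs a fresh forward pass."""
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

# Two requests sharing a 4-token system prompt:
system = [101, 7, 7, 42]
cache = PrefixCache()
cache.insert(system + [5, 6])           # first request populates the cache
hit = cache.match_prefix(system + [9])  # second request reuses the prompt
print(hit)  # -> 4 tokens of KV reused
```

The second request only needs to compute attention for its one unmatched token; everything before the split point is served from the cache.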
In detail, you'll:
Trace how a transformer generates text token by token and where the memory bottleneck comes from.
Implement a KV cache from scratch to store and reuse intermediate attention values within a request.
Extend caching across requests with RadixAttention, using a radix tree to identify shared prefixes.
Apply the same caching and parallelism strategies to diffusion models for faster image generation.
By the end, you'll have hands-on experience with the caching strategies powering today's most efficient AI systems and the tools to implement these optimizations in your own models at scale.
Developers and ML practitioners who want to better understand and optimize LLM inference in production. Familiarity with Python and basic language model concepts is recommended.