DeepLearning.AI
AI is the new electricity and will transform and improve nearly all areas of human lives.
Quick Guide & Tips

💻   Accessing Utils File and Helper Functions

In each notebook on the top menu:

1:   Click on "File"

2:   Then, click on "Open"

You will then see all the notebook files for the lesson, including any helper functions used in the notebook, in the left sidebar.


🔄   Reset User Workspace

If you need to reset your workspace to its original state, follow these quick steps:

1:   Access the Menu: Look for the three-dot menu (⋮) in the top-right corner of the notebook toolbar.

2:   Restore Original Version: Click on "Restore Original Version" from the dropdown menu.

For more detailed instructions, please visit our Reset Workspace Guide.


💻   Downloading Notebooks

In each notebook on the top menu:

1:   Click on "File"

2:   Then, click on "Download as"

3:   Then, click on "Notebook (.ipynb)"


💻   Uploading Your Files

After following the steps in "Accessing Utils File and Helper Functions" ("File" => "Open"), click the "Upload" button to upload your files.


📗   See Your Progress

Once you enroll in this course—or any other short course on the DeepLearning.AI platform—and open it, you can click on 'My Learning' at the top right corner of the desktop view. There, you will be able to see all the short courses you have enrolled in and your progress in each one.

Additionally, your progress in each short course is displayed at the bottom-left corner of the learning page for each course (desktop view).


📱   Features to Use

🎞   Adjust Video Speed: Click on the gear icon (⚙) on the video and then from the Speed option, choose your desired video speed.

🗣   Captions (English and Spanish): Click on the gear icon (⚙) on the video and then from the Captions option, choose to see the captions either in English or Spanish.

🔅   Video Quality: If you do not have access to high-speed internet, click on the gear icon (⚙) on the video and then from Quality, choose the quality that works the best for your Internet speed.

🖥   Picture in Picture (PiP): This feature allows you to continue watching the video when you switch to another browser tab or window. Click on the small rectangle shape on the video to go to PiP mode.

√   Hide and Unhide Lesson Navigation Menu: If you do not have a large screen, you may click on the small hamburger icon beside the title of the course to hide the left-side navigation menu. You can then unhide it by clicking on the same icon again.


🧑   Efficient Learning Tips

The following tips can help you have an efficient learning experience with this short course and other courses.

🧑   Create a Dedicated Study Space: Establish a quiet, organized workspace free from distractions. A dedicated learning environment can significantly improve concentration and overall learning efficiency.

📅   Develop a Consistent Learning Schedule: Consistency is key to learning. Set out specific times in your day for study and make it a routine. Consistent study times help build a habit and improve information retention.

Tip: Set a recurring event and reminder in your calendar, with clear action items, to get regular notifications about your study plans and goals.

☕   Take Regular Breaks: Include short breaks in your study sessions. The Pomodoro Technique, which involves studying for 25 minutes followed by a 5-minute break, can be particularly effective.

💬   Engage with the Community: Participate in forums, discussions, and group activities. Engaging with peers can provide additional insights, create a sense of community, and make learning more enjoyable.

✍   Practice Active Learning: Don't just read the material, run the notebooks, or watch the videos. Engage actively by taking notes, summarizing what you learn, teaching the concept to someone else, or applying the knowledge in your own practical projects.


📚   Enroll in Other Short Courses

Keep learning by enrolling in other short courses. We add new short courses regularly. Visit the DeepLearning.AI Short Courses page to see our latest courses and begin learning new topics. 👇

👉👉 🔗 DeepLearning.AI – All Short Courses


🙂   Let Us Know What You Think

Your feedback helps us know what you liked and didn't like about the course. We read all your feedback and use it to improve this course and future courses. Please submit your feedback by clicking on the "Course Feedback" option at the bottom of the lessons list menu (desktop view).

Also, you are more than welcome to join our community 👉👉 🔗 DeepLearning.AI Forum


DeepLearning.AI
/
Fast & Efficient LLM Inference with vLLM

Course Syllabus

Welcome to this course on Fast & Efficient LLM Inference, built in partnership with Red Hat. Open source LLMs can be so large that deploying them efficiently for a large number of users is challenging, especially if you need low latency and reasonable cost. In this course, you'll learn to take an open source LLM and serve it efficiently using vLLM, a widely adopted open source serving system. You'll learn the key ideas behind it, like PagedAttention, which lets your model serve many requests at once without wasting GPU memory. You'll also learn how to compress the model and evaluate how well it can handle real-world traffic. I'm delighted to introduce your instructor, Cedric Clyburn, who is a senior developer advocate at Red Hat.

Thanks, Andrew. I'm really excited to work with you on this course.

When an LLM answers a prompt, it generates text one token at a time, taking into account all previous tokens to decide what comes next. These computations rely on two things in GPU memory: the model's weights and the KV cache, that is, the keys and values representing the context from the previous tokens. These two behave differently. Weights are loaded once and stay fixed regardless of how many requests you serve. The KV cache is dynamic: every request has its own, and it grows with every token generated.

In this course, we'll be looking at a 70-billion-parameter model, the weights of which take about 140 GB, thus requiring at least two 80 GB GPUs to load. In practice, you use even more GPUs to leave room for the KV cache, and those memory requirements add up: about 2.5 GB for an 8k-token request and 10 GB for a 32k-token long-context request. Multiply that by many concurrent users, and managing this memory becomes critical.

So, to serve LLMs efficiently, you can apply quantization to shrink the weights by storing them at lower precision, so they take up less space and also move faster through memory.
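The memory figures quoted above can be reproduced with a quick back-of-the-envelope calculation. The sketch below is only an illustration: the transcript does not say which 70B model or precision it has in mind, so the 16-bit weights and the model shape used for the KV cache (80 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions chosen because they are typical of Llama-style 70B models. The helper names are hypothetical, and the quantized sizes ignore the small overhead of scales and zero points.

```python
import math

GB = 10**9    # decimal gigabyte, used for the weight figures
GiB = 2**30   # binary gibibyte, used for the KV cache figures


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Memory needed to hold the weights at a given precision."""
    return num_params * bits_per_param // 8


def kv_cache_bytes(num_tokens: int, num_layers: int = 80, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache for one request: one key and one value vector per token,
    per layer, per KV head. The default shape is an ASSUMPTION, not a
    value taken from the course."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


def min_gpus(total_bytes: int, gpu_bytes: int = 80 * GB) -> int:
    """Smallest number of GPUs whose combined memory can hold total_bytes."""
    return math.ceil(total_bytes / gpu_bytes)


PARAMS = 70_000_000_000

fp16 = weight_bytes(PARAMS, 16)
print(f"16-bit weights: {fp16 / GB:.0f} GB, at least {min_gpus(fp16)} x 80 GB GPUs")

# Quantization stores each weight in fewer bits, so the weights shrink in proportion.
for bits in (8, 4):
    print(f"{bits}-bit weights: {weight_bytes(PARAMS, bits) / GB:.0f} GB")

# The KV cache grows linearly with the number of tokens in a request.
for tokens in (8 * 1024, 32 * 1024):
    print(f"{tokens} tokens of context: {kv_cache_bytes(tokens) / GiB:.1f} GiB of KV cache")
```

Under these assumptions the weights come to 140 GB at 16 bits, and a single request needs 2.5 GiB of KV cache at 8k tokens and 10 GiB at 32k tokens, which lines up with the numbers mentioned in the video. A different model shape or cache precision would change the KV cache figures.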
The KV cache is more complex to manage because it grows dynamically and you don't know its final size ahead of time. Previous methods reserved one big block per request, sized for the maximum context, and that left 60 to 80 percent of the memory unused. But with PagedAttention, you can fit a lot more requests onto your GPUs by splitting the KV cache into small, fixed-size blocks that can sit anywhere in memory.

And here's how you'll put all this into practice. You'll start with the fundamentals: why efficient deployment matters, what happens during inference, and the core ideas behind LLM optimization. And then you'll go hands-on. You'll use LLM Compressor to quantize an open source Qwen model and measure the accuracy tradeoff. You'll then serve it with vLLM and see PagedAttention in action, along with other vLLM techniques such as continuous batching and prefix caching. Finally, you'll benchmark your deployment with GuideLLM, measuring latency and throughput metrics, and evaluate model quality with LM-Eval. By the end, you'll have run the full optimize, deploy, and benchmark workflow on a real model and understand the tradeoffs between accuracy, speed, and cost well enough to apply this workflow in production.

Many people have worked to create this course. I'd like to thank, from Red Hat, Saša Zelenović, Michael Goin, and Sawyer Bowerman. From DeepLearning.AI, Hawraa Salami also contributed to this course. Please join Cedric in the next video, where he'll walk you through what makes efficient LLM inference critical and the challenges of serving open source models in production.
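As a rough illustration of the PagedAttention idea from the introduction above (splitting the KV cache into small, fixed-size blocks that can sit anywhere in memory), here is a toy allocator that tracks only the bookkeeping. It is a sketch under stated assumptions, not vLLM's implementation: the class name, the block size of 16 tokens, the pool of 64 blocks, the 512-token maximum context, and the request lengths are all hypothetical values picked for the example, and a real system also needs attention kernels that read keys and values through the block table.

```python
class PagedKVAllocator:
    """Toy bookkeeping for a block-based KV cache (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Reserve room for one more token, grabbing a new block only when needed."""
        n = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())  # any free block will do
        self.num_tokens[request_id] = n + 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def reserved_tokens(self) -> int:
        """Token slots currently reserved across all requests."""
        return sum(len(t) for t in self.block_tables.values()) * self.block_size


MAX_CONTEXT = 512                          # hypothetical per-request maximum
lengths = {"a": 100, "b": 37, "c": 250}    # hypothetical request lengths in tokens

alloc = PagedKVAllocator(num_blocks=64, block_size=16)
for request_id, n in lengths.items():
    for _ in range(n):
        alloc.append_token(request_id)

used = sum(lengths.values())
paged_reserved = alloc.reserved_tokens()
contiguous_reserved = len(lengths) * MAX_CONTEXT  # one max-size slab per request

print(f"contiguous: {1 - used / contiguous_reserved:.0%} of reserved slots unused")
print(f"paged:      {1 - used / paged_reserved:.0%} of reserved slots unused")
```

With paging, the only unused slots are in the last, partially filled block of each request, so the waste is bounded by one block per request rather than by the gap between the actual length and the maximum context. The exact percentages printed here depend entirely on the made-up request lengths.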
Fast & Efficient LLM Inference with vLLM
  • Introduction (Video, 3m)
  • Why Efficient LLM Deployment Matters (Video, 6m)
  • Inference & Memory Fundamentals (Video, 14m)
  • LLM Optimization Fundamentals (Video, 14m)
  • Optimizing a Model with LLM Compressor (Video with Code Example, 11m)
  • Serving LLMs Efficiently with vLLM – Part I (Video, 10m)
  • Serving LLMs Efficiently with vLLM – Part II (Video with Code Example, 7m)
  • Measuring What Matters: Benchmarking and Evaluation (Video with Code Example, 15m)
  • Conclusion: Putting it All Together (Video, 4m)
  • Quiz (Graded Quiz, 10m)