
💻   Accessing Utils File and Helper Functions

In each notebook on the top menu:

1:   Click on "File"

2:   Then, click on "Open"

You will then see, in the left sidebar, all the notebook files for the lesson, including any helper functions used in the notebook.


💻   Downloading Notebooks

In each notebook on the top menu:

1:   Click on "File"

2:   Then, click on "Download as"

3:   Then, click on "Notebook (.ipynb)"


💻   Uploading Your Files

After following the steps shown in the previous section ("File" => "Open"), click on the "Upload" button to upload your files.


📗   See Your Progress

Once you enroll in this course—or any other short course on the DeepLearning.AI platform—and open it, you can click on 'My Learning' at the top right corner of the desktop view. There, you will be able to see all the short courses you have enrolled in and your progress in each one.

Additionally, your progress in each short course is displayed at the bottom-left corner of the learning page for each course (desktop view).


📱   Features to Use

🎞   Adjust Video Speed: Click on the gear icon (⚙) on the video and then from the Speed option, choose your desired video speed.

🗣   Captions (English and Spanish): Click on the gear icon (⚙) on the video and then from the Captions option, choose to see the captions either in English or Spanish.

🔅   Video Quality: If you do not have access to high-speed internet, click on the gear icon (⚙) on the video and then from Quality, choose the quality that works best for your internet speed.

🖥   Picture in Picture (PiP): This feature allows you to continue watching the video when you switch to another browser tab or window. Click on the small rectangle shape on the video to go to PiP mode.

√   Hide and Unhide Lesson Navigation Menu: If you do not have a large screen, you may click on the small hamburger icon beside the title of the course to hide the left-side navigation menu. You can then unhide it by clicking on the same icon again.


🧑   Efficient Learning Tips

The following tips can help you have an efficient learning experience with this short course and other courses.

🧑   Create a Dedicated Study Space: Establish a quiet, organized workspace free from distractions. A dedicated learning environment can significantly improve concentration and overall learning efficiency.

📅   Develop a Consistent Learning Schedule: Consistency is key to learning. Set aside specific times in your day for study and make it a routine. Consistent study times help build a habit and improve information retention.

Tip: Set a recurring event and reminder in your calendar, with clear action items, to get regular notifications about your study plans and goals.

☕   Take Regular Breaks: Include short breaks in your study sessions. The Pomodoro Technique, which involves studying for 25 minutes followed by a 5-minute break, can be particularly effective.

💬   Engage with the Community: Participate in forums, discussions, and group activities. Engaging with peers can provide additional insights, create a sense of community, and make learning more enjoyable.

✍   Practice Active Learning: Don't just read the material, run the notebooks, or watch the videos. Engage actively by taking notes, summarizing what you learn, teaching the concept to someone else, or applying the knowledge in your own projects.


📚   Enroll in Other Short Courses

Keep learning by enrolling in other short courses. We add new short courses regularly. Visit the DeepLearning.AI Short Courses page to see our latest courses and begin learning new topics. 👇

👉👉 🔗 DeepLearning.AI – All Short Courses


🙂   Let Us Know What You Think

Your feedback helps us know what you liked and didn't like about the course. We read all your feedback and use it to improve this course and future courses. Please submit your feedback by clicking on the "Course Feedback" option at the bottom of the lessons list menu (desktop view).

Also, you are more than welcome to join our community 👉👉 🔗 DeepLearning.AI Forum


Welcome to this short course, "Quantization in Depth," built in partnership with Hugging Face. In this course, you'll dive deep into the core technical building blocks of quantization, a key part of the AI software stack for compressing large language models and other models. You'll implement from scratch the most common variants of linear quantization, called asymmetric and symmetric modes, which differ in whether the compression algorithm maps zero in the original representation to zero in the compressed representation, or whether it is allowed to shift the location of that zero. You'll also implement different granularities of quantization, such as per-tensor, per-channel, and per-group quantization, using PyTorch; these let you decide how big a chunk of your model you quantize at one time. You'll end up building a quantizer that can quantize any model in eight-bit precision using per-channel linear quantization. If some of the terms I use don't make sense yet, don't worry about it. These are all key technical concepts in quantization that you'll learn about in this course. And in addition to understanding all these quantization options, you'll also hone your intuition about when to apply which technique.

I'm delighted to introduce our instructors for this course. Younes Belkada, a machine learning engineer at Hugging Face, is part of the open-source team, where he works at the intersection of many open-source tools developed by Hugging Face, such as Transformers, PEFT, and TRL. Marc Sun is also a machine learning engineer at Hugging Face and a member of the open-source team, where he contributes to libraries such as Transformers and Accelerate. Marc and Younes are also deeply involved in quantization in order to make large models accessible to the community.

Thanks, Andrew. We are excited to work with you and your team on this. In this course, you will try your hand at implementing from scratch different variants of linear quantization: symmetric and asymmetric modes. You will also implement different quantization granularities, such as per-tensor, per-channel, and per-group quantization, in pure PyTorch, with each of these algorithms having its own advantages and drawbacks. After that, you'll build your own quantizer to quantize any model in eight-bit precision, using the per-channel quantization scheme you saw just before. You'll be able to apply this method to any model regardless of its modality, meaning you can apply it to a text, vision, audio, or even a multimodal model. Once you are happy with the quantizer, you will try your hand at addressing common challenges in quantization. At the time we speak, the most common way of storing low-precision weights, such as four-bit or two-bit, seems to be weight packing. With weight packing, you can pack 2-bit or 4-bit tensors together into a larger eight-bit tensor without allocating any extra memory. We will see together why this is important, and you will implement packing and unpacking algorithms from scratch. Finally, we will learn together about other challenges that come up when quantizing large models such as LLMs. We will review current state-of-the-art approaches for quantizing LLMs with no performance degradation and go through how to do that within the Hugging Face ecosystem.

Quantization is a really important part of the practical use of large models today, so having in-depth knowledge of it will help you build, deploy, and use models more effectively.
Many people have worked to create this course. On the Hugging Face side, I'd like to thank the entire Hugging Face team for reviewing the course content, as well as the Hugging Face community for their contributions to open-source models and quantization methods. From DeepLearning.AI, Eddy Shyu also contributed to this course. Quantization is a fairly technical topic. After this course, I hope you understand it deeply enough that you can say to others, "I now get it. I'm not worried about model compression." In other words, you can say: "I'm not sweating the small stuff." Let's go on to the next video and get started.
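As a preview of the techniques mentioned in this introduction, below is a minimal sketch in plain PyTorch, not the course's own helper code, of asymmetric per-tensor linear quantization to eight bits, followed by packing and unpacking 2-bit codes into a single 8-bit tensor. The function names and the int8/uint8 dtype choices are illustrative assumptions rather than the course's exact API.

```python
# Minimal sketch (illustrative only, not the course's helper code):
# asymmetric per-tensor linear quantization to int8, plus 2-bit packing into uint8.
import torch

def asymmetric_quantize(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    # Map the observed float range [x.min(), x.max()] onto the integer range [qmin, qmax].
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(torch.clamp(torch.round(qmin - x.min() / scale), qmin, qmax))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale.item(), zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    # Recover an approximation of the original float tensor.
    return scale * (q.to(torch.float32) - zero_point)

def pack_2bit(codes: torch.Tensor) -> torch.Tensor:
    # codes: flat uint8 tensor of 2-bit values (0..3); length must be divisible by 4.
    c = codes.to(torch.uint8).reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    # Inverse of pack_2bit: extract four 2-bit values from each byte.
    return torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1).reshape(-1)

x = torch.randn(4, 8)
q, scale, zp = asymmetric_quantize(x)
x_hat = dequantize(q, scale, zp)
print("max abs quantization error:", (x - x_hat).abs().max().item())

codes = torch.randint(0, 4, (8,), dtype=torch.uint8)   # eight 2-bit codes
packed = pack_2bit(codes)                               # two bytes instead of eight
assert torch.equal(unpack_2bit(packed), codes)
```

Per-channel and per-group quantization follow the same recipe, except that a separate scale and zero point is computed for each output channel (or each fixed-size group of weights) instead of one pair for the whole tensor.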
Week 1: Quantization in Depth
  • Introduction (Video, 4 mins)
  • Overview (Video, 3 mins)
  • Quantize and De-quantize a Tensor (Video with Code Example, 11 mins)
  • Get the Scale and Zero Point (Video with Code Example, 12 mins)
  • Symmetric vs Asymmetric Mode (Video with Code Example, 7 mins)
  • Finer Granularity for more Precision (Video with Code Example, 2 mins)
  • Per Channel Quantization (Video with Code Example, 11 mins)
  • Per Group Quantization (Video with Code Example, 7 mins)
  • Quantizing Weights & Activations for Inference (Video with Code Example, 3 mins)
  • Custom Build an 8-Bit Quantizer (Video with Code Example, 13 mins)
  • Replace PyTorch layers with Quantized Layers (Video with Code Example, 5 mins)
  • Quantize any Open Source PyTorch Model (Video with Code Example, 8 mins)
  • Load your Quantized Weights from HuggingFace Hub (Video with Code Example, 7 mins)
  • Weights Packing (Video, 5 mins)
  • Packing 2-bit Weights (Video with Code Example, 8 mins)
  • Unpacking 2-Bit Weights (Video with Code Example, 8 mins)
  • Beyond Linear Quantization (Video, 7 mins)
  • Conclusion (Video, 1 min)
  • Course Feedback
  • Community