Welcome to this short course, Quantization Fundamentals with Hugging Face 🤗, built in partnership with Hugging Face 🤗. Large generative AI models, like large language models, can be so huge that they're hard to run on consumer-grade hardware. Quantization has emerged as a key tool for making this possible. In this course, you'll learn about a variety of flavors of quantization and the different options and data types, like whether you should use int8, float16, or something called bfloat16, which stands for brain float16 🧠, to compress your models. You'll also learn the technical theory and the algorithmic details of how to compress and store a 32-bit floating-point number, maybe from a model that you want to deploy, using, say, an eight-bit integer.

I'm delighted to introduce our instructors for this course. Younes Belkada is a Machine Learning Engineer at Hugging Face 🤗. Younes works on the open-source team at the intersection of many open-source tools developed by Hugging Face 🤗, such as Transformers, PEFT, and TRL. Marc Sun is also a Machine Learning Engineer at Hugging Face 🤗. Marc is part of the open-source team, where he contributes to libraries such as Transformers and Accelerate. Marc and Younes are also deeply involved in quantization work that makes large models accessible to the AI community.

Thanks, Andrew. We are excited to work with you and your team on this. In this course, you will first learn basic concepts around integer and floating-point representation, and how to load AI models using different data types with PyTorch and the Hugging Face Transformers library. You will also understand the pros and cons of each data type, so you can make the right decision for your use case.

You will then dive deep into linear quantization and understand how it works in practice, in simple terms. This quantization scheme is used in most state-of-the-art quantization methods. After reviewing how linear quantization works, you'll apply it directly to a small text generation model using the Quanto library from Hugging Face. Quanto makes linear quantization easy to use for any PyTorch model. We will first load the model using the Transformers library and then use Quanto to quantize it; the short code sketches at the end of this introduction preview each of these steps.

In summary, in this course you'll see in detail the fundamental theory behind quantization, as well as the practical aspects of how to use it. I hope you'll learn these techniques and combine these building blocks yourself to create some unique applications.

Many people have worked to create this course. On the Hugging Face side, I'd like to thank the entire Hugging Face team for their review of the course content, as well as the Hugging Face community for their contributions to open-source models ✨. From DeepLearning.AI, Eddy Shyu has also contributed to this course.

This is a short course that covers a lot, so I'm excited about what you'll be able to learn... in a compressed way. I hope you enjoy the course!
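
As a taste of the data-type material, here is a minimal sketch of loading the same Transformers checkpoint in two different data types. The checkpoint name is illustrative; any causal language model on the Hugging Face Hub would work the same way:

```python
import torch
from transformers import AutoModelForCausalLM

# The checkpoint name is illustrative; any Hub causal LM works similarly.
model_fp32 = AutoModelForCausalLM.from_pretrained("gpt2")  # default: float32
model_bf16 = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

# Roughly 4 bytes per parameter vs. 2 bytes per parameter.
print(f"float32:  {model_fp32.get_memory_footprint() / 1e6:.0f} MB")
print(f"bfloat16: {model_bf16.get_memory_footprint() / 1e6:.0f} MB")
```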
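Next, a self-contained sketch of the linear quantization scheme described above, using the textbook asymmetric formula q = round(x / scale) + zero_point. This is an illustration of the idea, not the exact implementation inside Quanto:

```python
import torch

def linear_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric linear quantization of a float tensor to signed int8.

    Assumes x is not constant (scale would be zero otherwise).
    """
    # Integer range for a signed num_bits type, e.g. [-128, 127] for int8.
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The scale maps the float range onto the integer range.
    scale = (x.max() - x.min()).item() / (qmax - qmin)
    # The zero point is the integer that represents float 0.0.
    zero_point = round(qmin - x.min().item() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def linear_dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    # Invert the mapping: x_hat = scale * (q - zero_point).
    return scale * (q.to(torch.float32) - zero_point)

x = torch.randn(5)
q, scale, zp = linear_quantize(x)
print(x)
print(linear_dequantize(q, scale, zp))  # close to x, but not exact: quantization is lossy
```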
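Finally, a sketch of the Quanto workflow mentioned above: load a model with Transformers, then quantize it. The model name is illustrative, and the exact import path and arguments may differ between Quanto versions (newer releases ship as optimum.quanto):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quanto import freeze, quantize  # newer releases: from optimum.quanto import ...

checkpoint = "EleutherAI/pythia-410m"  # illustrative small text generation model
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Swap linear layers for quantized equivalents (int8 weights, activations
# left in floating point), then freeze to materialize the quantized weights.
quantize(model, weights=torch.int8, activations=None)
freeze(model)

inputs = tokenizer("Quantization makes models", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```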