In this lesson, you will learn about the new models, how they were trained, their features, and how they fit into the Llama family. Let's take a look.

Here is a quick summary of the Llama family of models at this point. Llama 2, with 7, 13 and 70 billion parameter models, was introduced in July 2023. We then released Llama 3.0 in April and followed quickly with the 3.1 release in July of 2024, with updated 8B and 70B models and, most importantly, a 405 billion parameter foundation-class model. These models support eight languages, tool calling and a 128K context window. Now we have just released the 3.2 models. The 3.1 8B and 70B models were enhanced with vision capabilities, creating the 3.2 11B and 90B multimodal models. And we have released two lightweight models, a 1B and a 3B model, which will help support on-device AI. Also part of the 3.2 release is Llama Guard 3 Vision, which flags problematic images as well as text.

The 3.1 models, which are the basis for the 3.2 models, have base and instruct versions. The instruct versions have been tuned for instruction following and tool use. The 3.1 models are multilingual but not multimodal. Llama 3.2 is built on top of Llama 3.1.

Let's see some of the features that are new in both the 3.1 and 3.2 models. Starting with Llama 3.0, you have a new tokenizer with a vocabulary of 128K tokens, compared to 32K tokens in Llama 2 (you'll see a quick way to check this below). Also, you have a larger context window of 128K tokens in the 3.1 and 3.2 models, compared to 8K in Llama 3.0. And while 3.0 supported only English, 3.1 and 3.2 support eight different languages. Also, in 3.0 there was no support for tool calling, whereas in 3.1 and 3.2 you have native support for it. And finally, we also released Llama Stack, a set of APIs for customizing Llama models and building Llama-based agentic applications.

So Llama 3.2 is built on top of 3.1, but what is new in 3.2 that you didn't have in 3.1? Basically, there are two main additions in 3.2. First is the introduction of multimodal input in the 11B and 90B models. You can now use Llama in multimodal use cases like image understanding of objects, scenes and drawings, OCR, captioning and question answering, visual reasoning on equations, charts, documents, and more. The second main addition in the 3.2 family of models is the introduction of smaller sizes: 1B and 3B text-only models. Now, with a small language model, you can use Llama for on-device summarization, writing and translation, and question answering in multiple languages.

Let me briefly describe how vision is incorporated into the Llama 3.2 models. To support vision, we took a compositional approach, which enabled us to parallelize the development of the vision and language model capabilities. We started with two components: a pre-trained image encoder and a pre-trained text model. We combined them by introducing and training a set of cross-attention layers between the two models. During inference, the image and the text are provided to their respective models. Image information is conveyed to the language model by cross-attention, and the language model produces the text response (a minimal sketch of this wiring appears below). Speech can later be added to this model by using a speech encoder to provide input tokens to the text model.

Let's see the Llama 3.2 family in this table. The Llama 3.2 models introduce multimodal capability in the 11B and 90B models. The instruct versions of the models support tool calling, even the lightweight 1B and 3B models.
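As a quick way to see the larger vocabulary for yourself, here is a minimal sketch using the Hugging Face transformers library. It assumes you have transformers installed and have been granted access to the gated meta-llama checkpoints on the Hugging Face Hub; the exact counts printed may include special tokens.

```python
# Minimal sketch: compare tokenizer vocabulary sizes across Llama generations.
# Assumes `pip install transformers` and an authenticated Hugging Face account
# with access to the gated meta-llama repositories.
from transformers import AutoTokenizer

llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

print(len(llama3_tok))  # roughly 128K entries in the Llama 3.x vocabulary
print(len(llama2_tok))  # roughly 32K entries in the Llama 2 vocabulary
```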
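And here is a minimal, illustrative sketch of the compositional cross-attention idea described above. This is not Meta's actual implementation: the class name, dimensions, and number of heads are made-up placeholders, and real image features would come from a pre-trained image encoder rather than random tensors. It only shows how a cross-attention layer lets text hidden states attend to image features.

```python
# Illustrative sketch (not the real Llama 3.2 code) of bridging a pre-trained
# image encoder and a pre-trained text model with a cross-attention layer.
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Lets text hidden states attend to image-encoder features."""
    def __init__(self, text_dim=4096, image_dim=1280, num_heads=8):
        super().__init__()
        # Project image features into the text model's hidden size.
        self.image_proj = nn.Linear(image_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, image_features):
        # text_hidden:    (batch, text_len, text_dim)     from the language model
        # image_features: (batch, num_patches, image_dim) from the image encoder
        img = self.image_proj(image_features)
        # Queries come from the text side, keys/values from the image side:
        # this is how image information is conveyed to the language model.
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        return self.norm(text_hidden + attended)  # residual connection

# Toy usage with random tensors standing in for real encoder outputs.
bridge = CrossAttentionBridge()
text_hidden = torch.randn(1, 16, 4096)      # 16 text tokens
image_features = torch.randn(1, 256, 1280)  # 256 image patches
fused = bridge(text_hidden, image_features)
print(fused.shape)  # torch.Size([1, 16, 4096])
```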
Here are the results of multimodal benchmarks on the vision models. Our experimental evaluations suggest that the open Llama models perform on par with leading models across a variety of tasks. And here are more benchmark results on mathematical reasoning, chart and diagram understanding, and general visual question answering.

Here you see a comparison of Llama models with models of comparable size. You can see that the open Llama models have state-of-the-art performance at all sizes on key industry benchmarks. Importantly, Llama 3.1 405B is a foundation-class model with industry-leading performance. The key message here is that using open models in your application does not require you to settle for less than state-of-the-art performance.

You can run Llama 3.2 models in many ways. You can run them in the cloud on AWS, Databricks, Together, Groq and many more; on-premise via TorchServe, vLLM or TGI; or locally on Mac, Windows and Linux via Ollama, LM Studio or llama.cpp. And because of the availability of the small-size models, you can run Llama on-device: on iOS, Android, Raspberry Pi and Nvidia Jetson via ExecuTorch, llama.cpp or MLC (a short example of running a lightweight model locally via Ollama follows at the end of this lesson).

This was a brief summary of the Llama family of models. Open source is very important to Meta and the Llama family. We strongly believe that openness drives innovation and is the right path forward. Let's go on to the next lesson, where you will work on four exciting and cool image reasoning multimodal use cases. All right. See you in a few seconds.
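And here is the promised sketch of the local option, using the Ollama Python client. It assumes you have installed Ollama, pulled a Llama 3.2 model (for example with `ollama pull llama3.2`), and installed the `ollama` Python package; the prompt is just an illustration.

```python
# Minimal sketch: chat with a lightweight Llama 3.2 model running locally via Ollama.
# Assumes the Ollama server is running and `ollama pull llama3.2` has been done.
import ollama

response = ollama.chat(
    model="llama3.2",  # the default llama3.2 tag points to a small instruct model
    messages=[
        {"role": "user", "content": "Summarize why small on-device models are useful."}
    ],
)
print(response["message"]["content"])
```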