In this lesson, you will learn how Llama models evolved from Llama 2 to the recently launched Llama 4 models. We'll also have a member of Meta's AI research team explain the Llama 4 architecture. All right, let's go.

Llama started as a fast-moving research project at FAIR, originally focused on formal mathematics, but the team quickly saw the potential of smaller, well-trained models, and that led to the release of the Llama models, which have since driven significant innovation across research and industry. Each new generation of Llama models delivers strong performance across industry benchmarks and introduces powerful new capabilities.

With Llama 3.2, Meta added multimodal models for image reasoning and lightweight models that can run on edge devices. These include 11B and 90B vision models and smaller 1B and 3B text-only models. Llama 3.3 brought even more efficiency, with the 70B Instruct model matching the quality of the much larger 405B model from Llama 3.1.

Now, with Llama 4, we enter a new phase. It includes two powerful mixture-of-experts models: Llama 4 Scout, with 17B active parameters and 16 experts, optimized for speed and cost, and Llama 4 Maverick, with 17B active parameters and 128 experts, offering top-tier performance. Llama 4 models are designed with native multimodality through an early fusion design, meaning they accept text and visual inputs in a single unified model from the start. In practice, this means that textual tokens and vision-derived tokens are combined at the input level and processed jointly by the same transformer backbone, rather than through separate encoder-decoder pathways. Early fusion is a major step forward, since it enables us to jointly pre-train the model with large amounts of unlabeled text, image, and video data. We also improved the vision encoder in Llama 4.

This is a one-page comparison of Llama 3 and Llama 4 models. The main improvements in Llama 4 include the 1 million token context length in Llama 4 Maverick and 10 million in Llama 4 Scout, compared to 128K in Llama 3.1 and later Llama 3 models. The officially supported languages have also been extended from eight in Llama 3 to 12 in Llama 4. Here you see the Llama 4 model card showing the 12 officially supported languages, the 17 billion active parameters, the number of experts, the total parameters across active and inactive experts, and the maximum context length.

Next, Kshitiz is going to go deeper into the Llama 4 architecture.

Thanks, Amit. Llama 4 models are mixture of experts, or MoE, models. MoE is a common architecture used in large language models. As we increase the number of parameters, or the capacity of a model, we usually see better performance, because the model can perform more complex transformations on tokens. In dense models, however, more parameters also mean higher costs during both training and inference. MoE models address this by using conditional computation to improve model quality while keeping computational costs manageable. They activate only a small portion of the total parameters for each token, which reduces the amount of computation needed per token while still retaining model capacity.

A key part of any MoE model is the gating network, also known as the router. The router decides which experts get activated for a given token, and the choice of routing mechanism plays a central role in the model's performance. A transformer has two main layer types: attention and the feed-forward network, also called the FFN. Computation cost is usually dominated by the FFN layer; for example, in a standard transformer block with hidden size d and an FFN inner size of 4d, the FFN holds roughly 8d² parameters versus about 4d² for the attention projections.
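Before going further, here is a minimal sketch in PyTorch of the MoE layer structure described next: a router that scores each token against each routed expert, top-1 routing scaled by the affinity score, and a single shared expert whose output is added in. The softmax gate, SiLU feed-forward blocks, and dimensions here are simplifying assumptions for illustration, not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class MoEFFNSketch(nn.Module):
    """Illustrative MoE FFN layer: one shared expert plus top-1 routed experts."""

    def __init__(self, d_model: int, d_ff: int, num_routed_experts: int):
        super().__init__()
        # The router (gating network): one affinity score per routed expert.
        self.router = nn.Linear(d_model, num_routed_experts, bias=False)

        def make_ffn() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )

        self.shared_expert = make_ffn()  # every token passes through this one
        self.routed_experts = nn.ModuleList(
            make_ffn() for _ in range(num_routed_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), one row per token.
        scores = self.router(x).softmax(dim=-1)  # affinity per (token, expert)
        top_score, top_idx = scores.max(dim=-1)  # top-1 routing decision
        routed_out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed_experts):
            mask = top_idx == e  # tokens assigned to routed expert e
            if mask.any():
                # Scale the expert's output by the token's affinity score.
                routed_out[mask] = top_score[mask].unsqueeze(-1) * expert(x[mask])
        # Shared and routed outputs are added to form the layer output.
        return self.shared_expert(x) + routed_out

# Four tokens ("The quick brown fox") and two routed experts,
# matching the walkthrough that follows.
tokens = torch.randn(4, 64)
layer = MoEFFNSketch(d_model=64, d_ff=256, num_routed_experts=2)
print(layer(tokens).shape)  # torch.Size([4, 64])
```

Note that only the selected routed expert runs on each token, which is how MoE keeps per-token compute close to that of a much smaller dense model while retaining the capacity of all the experts.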
As is common, the Llama 4 MoE model architecture uses conditional computation only in the FFN layers; the attention layers remain the same as in a dense model. The MoE layer in Llama 4 has a set of routed experts and a single shared expert. All tokens go through the shared expert, and each token also goes through exactly one of the routed experts. Scout has 16 routed experts, while Maverick has 128. In Scout, all of the FFN layers are MoE layers, while Maverick alternates between dense and MoE layers to reduce the total number of parameters in the model.

Let's take an example of how a sequence of tokens would be processed by the MoE layer in Llama 4. Assume we have one shared expert and two routed experts, and say the sequence has four tokens: "The quick brown fox." In the first step, the tokens go to the router, which decides which token will be allocated to which routed expert. We compute a router affinity score for each token and routed expert combination. A token is sent to the routed expert with the highest score, and the token's activations are also multiplied by that router affinity score. In this example, the tokens "The", "brown", and "fox" go to routed expert 1, while the token "quick" goes to routed expert 0. All tokens also go to the shared expert. At the end, the outputs of the shared and routed experts are added together, producing the final output of the MoE layer.

This was a brief overview of the Llama 4 architecture. If you are interested in learning more, you can check out our blog. Back to you now, Amit.

Thanks, Kshitiz. The Llama API provides a simple and fast way to use and build with Llama models. To make integration easier, lightweight SDKs are available in both Python and TypeScript, which allow developers to quickly connect the API to their applications; you'll see a short example at the end of this lesson. Using the Llama API, you can build with the latest Llama models, including Llama 4 Maverick, Llama 4 Scout, the Llama 3.3 8B and 70B models, and more.

We have also introduced two new Llama tools. The first is a prompt optimization tool that automatically optimizes prompts for Llama models. It transforms prompts that work well with other LLMs into prompts that are optimized for Llama models, improving performance and reliability. You'll see how this works in lesson six. Another tool you will use in this course is Meta's Synthetic Data Kit. You'll learn how to create your own high-quality data using this tool. This is especially useful when you're fine-tuning or testing your model but don't have the perfect dataset on hand. You'll be able to generate Q&A pairs, reasoning steps, and data in other formats later in the course.
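Before we wrap up, here is the promised sketch of a chat request using the Llama API Python SDK. The package name, model identifier, and response fields shown are assumptions based on the public Llama API preview; check the official Llama API documentation for the current ones.

```python
# Minimal Llama API chat request (assumed SDK surface; verify against the docs).
import os
from llama_api_client import LlamaAPIClient  # pip install llama-api-client

# Assumes your key is stored in the LLAMA_API_KEY environment variable.
client = LlamaAPIClient(api_key=os.environ["LLAMA_API_KEY"])

response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",  # assumed model identifier
    messages=[
        {"role": "user", "content": "In one sentence, what is a mixture-of-experts model?"}
    ],
)
print(response.completion_message.content.text)
```

In the next lesson, you will start building with the Llama API. See you there!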