Great. Now that you know how a transformer LLM works, let's look at a couple of the more recent ideas that are part of the latest models. This is a simplified view of a transformer, specifically a transformer decoder, and this is the architecture of the original one from 2017. You might have seen this figure before. It is basically the same thing; we just removed a couple of components. I'll show you in a bit how this relates to the visuals we've seen previously in the lesson, but we just flipped it upside down, because when we're writing a blog or a paper, people tend to view things from top to bottom. But basically, if you look here, you have the words in the input prompt, chunked or broken down into tokens. Those tokens become vectors, and then positional encoding is the method of applying positional information. Otherwise, the input would be presented to the model as a bag of words with no order, and the order of words in a sequence matters a lot. So positional encoding is a method, and there are multiple methods, of adding that position information to the representation of the vectors. That goes into a transformer block, and here we're just generally showing what a transformer block in a decoder network looks like.

This figure on the right is the one from the original transformer paper, so it's the same thing. One thing we trimmed down, just to make it a little simpler and more relevant to today's transformers, is that the original transformer was an encoder-decoder model. That's what you see here: this is the encoder component, and this is the decoder component. Most LLMs now in existence are decoder models that do not have this encoder component. Then here you have the positional encoding, and here you have the language modeling head. We can put these visuals here so you can see how it breaks down into an encoder block and a decoder block. A lot of our focus here has been on these decoder blocks, which explain how the vast majority of LLMs that you interact with are actually built and how they operate. There are still some encoder-decoder models out there, but the predominant majority of text-generation LLMs are decoder models. Encoder models are also in use, in models like BERT or BERT-like models that do text embeddings or re-ranking, or a lot of these efficient ways of doing NLP tasks that are not necessarily text generation.

If we compare just the transformer block from the original transformer, we can juxtapose it with a modern one. These are the 2024-era transformer blocks, and they look very close. Let's point out some of the things that have changed. One is that there's no longer a positional encoding at the beginning of the processing of the model; we'll talk a little bit about where that happens now. You can see that we use rotary embeddings, and we'll touch on them a little later in this lesson, but positional information is now added at the self-attention level. The layer normalization has moved to before the self-attention and feedforward neural network layers, as you see here; some experimental results showed that models do better with this kind of setup. And you can see that these models also have grouped-query attention as the self-attention, so we talked about the evolution of self-attention mechanisms.
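To make that comparison a bit more concrete, here is a minimal sketch in PyTorch of how a pre-norm decoder block can be organized. The sizes are made up, and for simplicity it uses PyTorch's standard multi-head attention in place of grouped-query attention with rotary embeddings, so treat it as an illustration of the pre-norm and residual structure rather than any specific model.

```python
# Minimal sketch of a modern pre-norm decoder block, for intuition only.
# Sizes are illustrative; real 2024-era models also replace MultiheadAttention
# with grouped-query attention and apply rotary embeddings inside it.
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Pre-norm: normalization comes *before* each sub-layer,
        # unlike the post-norm arrangement in the original 2017 transformer.
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, causal_mask=None):
        # Residual connections: each sub-layer's output is added back to its input.
        h = self.attn_norm(x)
        attn_out, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ffn_norm(x))
        return x


x = torch.randn(1, 10, 512)        # (batch, sequence length, model dimension)
print(DecoderBlock()(x).shape)     # torch.Size([1, 10, 512])
```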
One important thing to note, present in both the original transformer and the current ones, is the residual connections. They aren't shown here as prominently, but these residual connections go around the layers to carry the information from the beginning of the layer back to the representation and add them together again.

We'll talk a little bit about rotary embeddings, but to do that, let's first talk a little bit about training. You know that LLMs are trained in multiple steps. The first step is the base training, which is the next-token prediction step. That's what's called language modeling, and that's why we call them language models. When you visualize it in your head, you might think of training in batches kind of like this, where each row in the batch is a document. But say you're training a model with a 16K-token context window: a short document, or even lots of short documents, will not fill that space. So what do you do? You would add padding to the rest of that context. If you're doing it naively, this is one of the first ways you might think about doing training: you would have documents like this, where the majority of the context is just padding that is not really used, and that is an inefficient way to pack the training data. In reality, a more efficient way of packing these documents for training looks like this: you take multiple short documents and pack them all into, let's say, one row of your batch, and you do that with the second row as well. You end up with less padding, so you're using much more of that compute, since the GPU is going to be doing that crunching regardless of whether it's a document or padding. This is a high-level, let's say, side piece of information about how training is done.

We mention it because it also has an impact on the architecture, specifically on positional encoding. If you're doing it the first way, you can say: okay, token number one, I'll assign it a positional vector that always denotes that this is position number one, and then you do the same for positions two, three, and four. There are multiple of these static positional encoding methods. They can be either learned or algorithmic, using some combination of sine and cosine, so that the model learns over time that this kind of information in a vector means the token is at position one, two, three, or four in that context length. That's what's called static positional encoding. There are other methods that are more dynamic, in that they denote that this token is, for example, three tokens before this one. But if you're doing the packing on the bottom here, you need something else. When the model is being trained on document number two, first, the self-attention mechanism should not be able to look at document one; but that's not something positional encoding handles. The positional encoding property that is needed here is a way for the positional encoding mechanism to say: okay, this is the first token of this document. So when counting which position we are at in the context length, you need a way for the model to make sense that this is token number one in document number two, without counting everything that comes before it.
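To make the packing idea concrete, here is a small, hypothetical sketch of packing several short documents into one row of a batch, with position IDs that restart at zero for each document and an attention mask that keeps tokens from looking across document boundaries. The function name and token IDs are made up for illustration, not taken from any particular training framework.

```python
# Hypothetical sketch of packing several short documents into one training row.
# Position IDs restart at 0 for each document, and the attention mask is
# block-diagonal so tokens cannot attend across document boundaries.
import torch


def pack_documents(docs, seq_len, pad_id=0):
    tokens, position_ids, doc_ids = [], [], []
    for doc_idx, doc in enumerate(docs):
        tokens.extend(doc)
        position_ids.extend(range(len(doc)))   # restart counting per document
        doc_ids.extend([doc_idx] * len(doc))
    # Pad whatever space is left (far less than padding each document separately).
    n_pad = seq_len - len(tokens)
    tokens += [pad_id] * n_pad
    position_ids += [0] * n_pad
    doc_ids += [-1] * n_pad

    doc_ids = torch.tensor(doc_ids)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = (doc_ids[:, None] == doc_ids[None, :]) & (doc_ids[:, None] >= 0)
    attention_mask = causal & same_doc          # causal *and* within one document
    return torch.tensor(tokens), torch.tensor(position_ids), attention_mask


# Two made-up "documents" of token IDs packed into a 12-token row:
tokens, pos, mask = pack_documents([[5, 8, 2], [7, 7, 9, 3, 1]], seq_len=12)
print(pos)   # tensor([0, 1, 2, 0, 1, 2, 3, 4, 0, 0, 0, 0])
```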
So this is a little bit of the intuition for why a lot of the more recent models use positional encoding methods like rotary embeddings. We won't discuss rotary embeddings in detail here, but we can say that that positional information is added at the self-attention layer of each transformer block. Rotary positional embeddings, or RoPE, add that positional information in this step of self-attention, just before the relevance scoring step, the first of the two steps that self-attention does. RoPE has a formulation that adds that information to the query and key vectors. So these here have positional information denoting that, okay, this vector comes before this vector, which comes before this vector. That information is in the set here on the right, but it's not present in the set here on the left, and it is added using RoPE and positional encoding methods like it.

One more recent development that you might have come across is the idea of mixture of experts. This is a concept that uses multiple submodels to improve the quality of LLMs. That's not to say that all LLMs are becoming mixture-of-experts models; you can think of this as a variant of transformer language models rather than a replacement for the dense models we've covered in this course so far. The idea of mixture of experts is that at each layer you have multiple sub-neural networks, each one called an expert, and a router in each of these layers that decides which expert should process this token or this vector. Maarten has an incredible guide to it that you can look at for a more detailed explanation, but we'll also look at it in the next lesson in more detail.

For the intuition of experts, it's important not to think of each expert as one monolithic component; each layer has its own set of experts. Another important intuition is that these experts are not specialized in specific domains, like a psychology expert or a biology expert. Rather, these experts might tend to focus on, let's say, specific kinds of tokens, and on how to process them best, like punctuation or verbs or otherwise. In the flow of using a mixture-of-experts model, you're not assigned to one expert: if you're assigned in layer one to expert one, layer two might use expert three or four. This routing happens at each layer, and each layer routes to the proper expert at that layer. There are also methods that, in each layer, route to two different experts and merge the information from both back together. So there are a couple of different methods of using these, but this is the high-level intuition.

And this is where they sit: they are part of the feedforward neural network, where you would have multiple of these networks, and each of them is called an expert. In addition to having these experts, the mixture-of-experts layer also has the router, which is basically a classifier that decides, for this type of token, which expert is most suited to it. You can think of this as a classification score, where here the router has deemed that, to process this token, this expert will do the best job, for example. That processing is then how the feedforward neural network is applied to this token in this processing step in this layer. This was a high-level look at mixture-of-experts models; the sketch below makes the routing a little more concrete, and the next lesson will dive a little deeper into how they work.
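As a rough illustration of where the router and experts sit, here is a minimal, hypothetical sketch of a mixture-of-experts feedforward layer with a top-1 router in PyTorch. Real MoE layers vary in the number of experts, how many experts each token is routed to, and how their outputs are merged, so treat this only as intuition for the structure.

```python
# Minimal sketch of a mixture-of-experts feedforward layer with a top-1 router.
# Real MoE layers differ (number of experts, top-k routing, load balancing);
# this only shows where the router and the experts sit.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        # Each "expert" is just an ordinary feedforward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router is a small classifier over experts, applied per token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        scores = self.router(x).softmax(dim=-1)      # routing probabilities per token
        gate, best = scores.max(dim=-1)              # top-1 expert and its probability
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            chosen = best == i                       # tokens routed to expert i
            if chosen.any():
                # Scale the expert's output by the router's confidence (top-1 gating).
                out[chosen] = gate[chosen].unsqueeze(-1) * expert(x[chosen])
        return out


x = torch.randn(2, 6, 512)
print(MoEFeedForward()(x).shape)   # torch.Size([2, 6, 512])
```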