In the architectural overview of the transformer, you learned that after tokenization, the token embeddings are passed through a stack of transformer blocks. Let's get into the details of those blocks with a basic example. Say we have the prompt "The Shawshank". To understand how transformer blocks operate, think about the two tracks that flow through the stack of transformer blocks. In the beginning, our tokenizer has broken the prompt down into two tokens: "The" is its own token, and "Shawshank" is the second token. Since we have an associated embedding vector for each of these, we simply replace the tokens with those vectors, and that is what we start calculating on. We have now turned language into numbers, and we can apply a lot of interesting calculations to them to predict what the next word is.

Those vectors flow to the first transformer block, which generates a vector of the same size as its output. But something has happened in the middle: there is some processing inside the transformer block, in its components, that we'll address here. Before we get into that, this general flow through the model is useful to understand. The same thing happens with the second transformer block, which operates on the outputs of the first transformer block, in parallel across the various tracks. This continues down the list of transformer blocks all the way to the end. At the final layer, the vector for the final token in the prompt is presented to the language modeling head, which then generates the predicted next token. So this is the flow: everything moves in one direction, from the tokenizer down through the transformer blocks one by one in sequence, up until the language modeling head.

The transformer block itself is made up of two major components: the self-attention layer and the feed-forward neural network layer. Let's start with a high-level intuition of the feed-forward neural network. If the transformer block only had this component and didn't have attention, it would still be able to generate this completion: the token most probable to come after "The Shawshank" would be "Redemption". This is in reference to the famous film, because if you look at the training data sets, whether from the internet or Wikipedia, "Redemption" is the word that most often statistically appears after "Shawshank". This is one thing the feed-forward neural network is able to do: you can think of it as a store of information and statistics about, let's say, the next word that comes after the input token. That's a high-level intuition. Like a lot of neural networks, it does things that are a little more complicated, but this is a reasonable first approximation of how to think about the feed-forward neural network, and it contrasts with self-attention, which we'll look at after we finish talking about the feed-forward neural network. We have looked at neural networks earlier in this course. They generally look like this: a layer that expands into another layer, which then shrinks back down into a third, or output, layer. That's exactly what happens in the feed-forward neural network.
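To make that flow concrete, here is a minimal sketch in PyTorch of a stack of transformer blocks feeding a language modeling head, with a feed-forward network that expands and then shrinks back down. The dimensions, the number of blocks, and the layer shapes are illustrative assumptions rather than the configuration of any particular model, and the causal masking a real decoder uses is omitted for brevity.

```python
# Minimal sketch of the flow described above: embeddings pass through a stack of
# transformer blocks (self-attention + feed-forward), and the final vector goes
# to the language modeling head. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

d_model, d_ff, vocab_size, n_blocks = 512, 2048, 50_000, 6

class FeedForward(nn.Module):
    """Expand to a wider hidden layer, apply a nonlinearity, shrink back down."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # shrink back to the model dimension
        )
    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ff = FeedForward()
    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention across the tracks
        x = x + attn_out                   # residual connection
        return x + self.ff(x)              # feed-forward applied to each position

embeddings = torch.randn(1, 2, d_model)   # two tracks: "The" and "Shawshank"
blocks = nn.ModuleList([TransformerBlock() for _ in range(n_blocks)])
x = embeddings
for block in blocks:                      # one block after another, in sequence
    x = block(x)

lm_head = nn.Linear(d_model, vocab_size)
logits = lm_head(x[:, -1, :])             # score every token in the vocabulary
next_token_id = logits.argmax(dim=-1)     # e.g. the id for "Redemption"
```

Each block takes in vectors of size d_model and outputs vectors of the same size, which is why the blocks can simply be stacked one after another.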
The connections of these dense layers are presumably where all of the information that models know is stored, modeled, and interpolated between, enabling models to do the incredible things they do: generate code, encode information about the world, and speak to you in fluent and coherent language of your choosing.

Self-attention, on the other hand, allows the model to attend to previous tokens and incorporate that context into its understanding of the token it's currently looking at. Let's look at an example. Say we have a prompt that is "the dog chased the llama because it". As the model processes the word "it", it needs to bake in some information about what "it" refers to here. Is it the dog, or is it the llama? It's important for the model to have some sense of what that "it" refers to, and that's a bit of what self-attention does. If the previous context points towards this being the llama, for example, self-attention allows the model to bake some of the representation of the llama tokens into "it". So while processing the token, the model has some "understanding" that it refers to the llama. This is an NLP task called coreference resolution. If these are the only words presented to you, it might be difficult to really ascertain whether it is the dog or the llama, but in this example let's assume that previous tokens in the prompt indicate that it is the llama.

To understand attention at a high level, let's formulate it like this. We have other positions in the sequence, previous tokens that we've processed in the past, and we have the current position that we are processing, with its vector representation. This is the embedding if we're in transformer block number one; if we're in block number two or three or four, it's the output of the previous block. Towards the end, we want a vector representation that bakes in the relevant information from the other positions. So you're looking at, let's say, five tracks, and we're basically pulling in the right information that is useful to represent this token at this step of the model's processing. This is a high-level formulation of self-attention; we'll get into more of the specifics in the next lesson. But what self-attention does is two things. The first is relevance scoring: it assigns a score to how relevant each of the input tokens is to the token we're currently processing. The second step, after scoring that relevance, is combining the relevant information into the representation. So at a high level, this is what self-attention boils down to: relevance scoring, and then combining the information from the relevant tokens into the current vector that we're processing and representing. In the next lesson, we'll look at more details of how that is done.
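Here is a minimal sketch of those two steps, relevance scoring and combining, in plain NumPy. The word-per-token split of the prompt, the toy embedding size, and the random query/key/value projection matrices are illustrative assumptions, not the parameters of any real model; the next lesson covers how this is actually done.

```python
# Sketch of the two self-attention steps: (1) score how relevant each token is
# to the token we're processing, (2) combine information weighted by the scores.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                                     # toy embedding size (assumption)
np.random.seed(0)
tokens = np.random.randn(7, d)            # "the dog chased the llama because it"
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

current = tokens[-1]                      # the token "it" we are processing now
query = current @ W_q                     # what "it" is looking for
keys = tokens @ W_k                       # what each position offers
values = tokens @ W_v                     # the information each position carries

# Step 1: relevance scoring - how relevant is each token to "it"?
scores = softmax(query @ keys.T / np.sqrt(d))

# Step 2: combining - pull in information in proportion to those scores.
output = scores @ values                  # new vector representing "it" in context

print(scores.round(2))                    # one relevance score per track
print(output.shape)                       # same size as the input vector
```

If the scores put most of the weight on the "llama" position, the output vector for "it" ends up carrying much of the llama's representation, which is the baking-in of context described above.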