You've learned so far about all the different components of the transformer. Now it's time to explore the architecture of one example model using the HuggingFace Transformers library. Let's code.

To reinforce everything we've learned so far, let's take a quick look at how HuggingFace Transformers lets us inspect a transformer and its tokenizer, and how the flow of information we've looked at actually happens: from the tokenizer, through the stack of transformer decoders, to the language modeling head. Let's look at how that works in code.

Let's start with a couple of logistics: we suppress a few warnings that we would otherwise see, and then we load a language model. This is the Phi-3-mini model, and here we're downloading it together with its tokenizer. Since we're running it on a CPU, it shows a couple of warnings, but we can ignore them.

Let's now define a HuggingFace pipeline. We pass in the model we've downloaded and its associated tokenizer, so you can see this is the Phi-3 model and this is the Phi-3 tokenizer. We pass them both to the pipeline, which is just a convenient abstraction that makes it easier to generate text with the LLM once we've loaded the model and tokenizer. Here we're saying that whenever we give it a prompt, we want it to generate 50 tokens in response. Setting the do_sample parameter to False means we're doing greedy decoding, as we've looked at: at each step the model scores the possible output tokens and chooses the one with the highest probability. This is almost exactly like setting the temperature to zero.

Now that we've done that, we can declare our prompt and pass it to the model. The prompt we're giving it is: "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened." The model is processing it right now, so let's skip ahead to when it's done generating and talk about what it did in the process.

And here we see the output. These are the 50 tokens that the model generated: an email to Sarah with a subject line, then the start of the body of the message, and then it stopped mid-sentence at the 50th and last token it was asked to generate. You can take the chance to change the prompt to whatever you'd like and see how the model responds. It might take around two minutes to generate the output because it's running on a CPU in this example, which is why in industry a lot of these models actually run on highly optimized GPUs, and why the efficiency methods we've discussed are important for speeding up generation. That's the reason there is so much focus on efficiency.

Now, a lot of people have never used language models beyond a chat interface or a playground. If you're one of those people and this is your first time generating with a language model in code, congratulations: you now understand these models a little bit better. And if you've seen this in the past, stick around and I'll show you a couple of the cool things HuggingFace enables you to do in terms of understanding the hierarchy of the model and how it fits together.
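If you want to follow along, a minimal sketch of the setup we just walked through could look something like this. The checkpoint name (microsoft/Phi-3-mini-4k-instruct) and the exact pipeline arguments are my assumptions, so the notebook you're using may differ slightly, and older transformers versions may also need trust_remote_code=True when loading the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumed checkpoint name; the lesson's notebook may pin a specific revision.
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Download the model and its tokenizer (running on CPU here, so this is slow).
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap both in a text-generation pipeline: 50 new tokens, greedy decoding.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    do_sample=False,          # greedy decoding: always pick the highest-probability token
    return_full_text=False,   # return only the newly generated tokens, not the prompt
)

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt)
print(output[0]["generated_text"])
```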
So one thing to point out with HuggingFace is that, now that we've loaded the model, we can simply print it, and that will show you the structure, the architecture, of the model we have here. You can see that the output is indented, which shows you the hierarchy. At the top you have the model: this is Phi-3 for causal language modeling, a decoder model. It is causal, or auto-regressive, which means the attention step only attends to the previous tokens.

Inside this model we have two major components. First is the model itself, which is where all of the layers, the transformer blocks, sit. You have the token embeddings matrix: the model has about 32,000 tokens in its vocabulary, and the model dimension is 3,072. There are 32 decoder transformer layers, and you can see the exact components of each one: the self-attention projections, the rotary embeddings, and the MLP, the multi-layer perceptron, which is what we've been calling the feedforward neural network. You can see that it projects up to a higher dimension, about 16,000 in this case, and then back down to the model dimension of 3,072. You have your activation function and your layer norms. Then towards the end of the model you see the language modeling head, which takes in the final vector of 3,072 values, the model dimension, and outputs a score for each of the tokens in the model's vocabulary.

HuggingFace also makes it possible to browse this hierarchy directly. We can go into the model itself and address each of these layers: we can access the embeddings layer, or matrix, and we can say model.model.layers and index the first transformer block. This is a way for you to access each of these layers and matrices, see the input and output sizes for each of them, and also get at the actual weight matrices themselves if you want to.

Let's go back to generation. Say we have a simple prompt like "The capital of France is". Now we'll process it in a different way. The pipeline abstracted a few things from us, so instead let's do it in a way where we get to see a little more of the mechanics of how the processing actually happens. We have our prompt, we send it to the tokenizer, and we ask it to return the input IDs. That's the result of tokenization: the tokenizer says this one string of text is now broken into these five tokens, token number 450, number 7483, number 310, and so on. This is how the information is represented.

Now we can send that to the model, and notice that we're sending it to the model component of the model. It will not go through the language modeling head; it just flows through the stack of transformer blocks. The output is a tensor of dimensions one by five by 3,072. One is the batch dimension: we passed the model only one text, so we processed only one, while in training you'd have a lot more of these. Five is the number of tokens in this sequence, and 3,072 is the dimension of the output vectors.
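As a rough sketch, these inspection steps could look like the following in code, reusing the model and tokenizer objects from the previous sketch. The exact token IDs and printed shapes depend on the checkpoint and tokenizer settings, so treat the comments as illustrative.

```python
# Print the whole architecture: token embeddings, 32 decoder blocks, and the lm_head.
print(model)

# Browse the hierarchy directly, e.g. the embedding matrix and the first transformer block.
print(model.model.embed_tokens)
print(model.model.layers[0])

# Tokenize a simple prompt and run it through the transformer blocks only
# (model.model), skipping the language modeling head.
prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids)              # e.g. tensor([[ 450, 7483,  310, ...]])

model_output = model.model(input_ids)
print(model_output[0].shape)  # torch.Size([1, 5, 3072]): batch, tokens, model dimension
```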
Now, to picture this, think about the visual we've looked at in the past: each of these output vectors is one row of a matrix, so in that visual the matrix had, say, two rows, one per token, while here we have five tokens. And 3,072 is the number of dimensions of each of those vectors. That is the output here: the vectors before the language modeling head.

We can pass that output to the language modeling head independently and see what its output looks like. The shape of this output is a tensor of one by five by 32,000, and when you see 32,000 you now know that this is the vocabulary size. So these are the scores for each token in the vocabulary. But to get the actual output for the prompt we've sent to the model, this is what we do: we take the vector for the last token in the sequence, and then take the highest-scoring entry, which gives us a token ID, which is just a number. Let's print that: the first output token the model will generate is token number 3681. Decoding that back into human language gives us "Paris".

I find that fascinating for two reasons. One is that we now have this piece of software that you can download to your computer or your phone, and it's able to tell you information about the world, information that can be very complex. But there's another thing that you can only see here, which is that the model really never saw the text. The model only sees these lists of numbers, and it only outputs lists of numbers. Everything the model touches is "this is token number 4" or "token number 4000"; it never sees the actual letters that we think of. It's fascinating that language models operate this way, never really seeing human language the way we see it, only these lists of token indices.

So this has been a quick look at the code of a HuggingFace language model; there's a small sketch of that final step below if you'd like to try it yourself. In the next lesson, we'll look at other recent improvements to large language models.
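Here is what that last step might look like in code, continuing from the earlier sketches (the variable names model_output and tokenizer are assumed from those sketches; the exact vocabulary size you see printed depends on the checkpoint).

```python
# Apply the language modeling head ourselves, continuing from the
# model_output tensor produced by model.model above.
lm_head_output = model.lm_head(model_output[0])
print(lm_head_output.shape)        # torch.Size([1, 5, vocab_size]), roughly 32k for Phi-3-mini

# Take the scores for the last token in the sequence and pick the
# highest-scoring vocabulary entry: that is the next token the model predicts.
token_id = lm_head_output[0, -1].argmax(-1)
print(token_id)                    # e.g. tensor(3681)

# Decode the token ID back into text.
print(tokenizer.decode(token_id))  # "Paris"
```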