Llama 3.1 and 3.2 models use a new tokenizer and an expanded vocabulary. Let's learn more about the tokenizer. Large language models don't process the strings you send them. Those strings are broken into smaller chunks, which are then converted to integers called tokens by a tokenizer. These chunks are the smallest meaningful units LLMs can work with: they can be words, subwords, punctuation, or numbers. For example, the sentence "GenAI is amazing!" is broken into five tokens: "Gen", "AI", "is", "amazing", and the exclamation mark, each represented by an integer ID. If you would like to know more about how tokenization works, here is a link to another course that goes into more detail.

The Llama 3 family uses the open-source tiktoken tokenizer and has a vocabulary of 128K tokens. That's four times more than Llama 2, and for reference, most tokenizers use a vocabulary of 30,000 to 50,000 tokens, so this is big. Compared to the Llama 2 tokenizer, the new tokenizer improves compression on a sample of English data from roughly 3.2 to 4 characters per token, so large prompts require fewer tokens, which speeds up inference. Of the 128K tokens, 100K come from the tiktoken tokenizer, and the additional 28K are there to better support non-English languages.

Let's try this out. Let's start by initializing the tokenizer. There's a lot of code in this cell, but the key part is down here, where we initialize our tokenizer. I'll explain more of it in a little bit; let's just run it for now. Now let's try encoding the string "hello". What just happened is that the string "hello" was encoded to the integer 15339. Now let's try the reverse, the decoder, and we get back the string "hello". Let's try "hello, Andrew". The encoding for "hello" stayed the same, and we have a new token for "Andrew". Let's try a different version of this with a lowercase "a". Here we can see it didn't encode "andrew" as a single token, but instead as two tokens. Let's go take a look at what these are.

Back when we initialized our tokenizer, we used a file called tokenizer.model. It contains 128,000 encodings, and we have created a readable version of it. You can open it up by doing File, Open; it's the tokens notebook. Each line in this file has the string that's being encoded and the integer that represents it. Let's look for the token "Andrew", which was encoded as 13929, and here we see it. Now, when we had lowercase "andrew", we saw two different encodings. The first was 323, and you can see that's a-n-d, "and". The second was 4361, and as expected, that's r-e-w, "rew". And that's how "andrew" was encoded.

Now, coming back to our notebook, let's finish describing the initialization. We can see where the 128K entries came from: they are loaded into this dictionary, and then special tokens are added. These are the tokens we saw previously in the prompts: the beginning of text, end of text, and so on. We also set some reserved tokens. Those become special tokens and are included, along with some default encodings, when we create the tokenizer.

You won't often have to use the encoder yourself. The encoding is done at the API endpoint: you send strings to the endpoint, and it does the conversion. But one application you may have use for is finding the length of the prompt you are going to send to the model.
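Here's a minimal sketch of the encode, decode, and length steps we just walked through. It is not the notebook's exact code: the notebook builds a Llama 3 tokenizer from the tokenizer.model file, while this sketch uses the open-source tiktoken library's cl100k_base encoding as a stand-in, so some token IDs may differ from what you see in the lesson.

```python
# Minimal sketch (not the course notebook's exact code): the notebook builds a
# Llama 3 tokenizer from tokenizer.model; here we use tiktoken's cl100k_base
# encoding as a stand-in, which covers the 100K base entries of that vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("hello")
print(ids)                           # [15339], matching the lesson
print(enc.decode(ids))               # hello

print(enc.encode("hello, Andrew"))   # "Andrew" gets its own token
print(enc.encode("hello, andrew"))   # lowercase "andrew" splits into "and" + "rew"

# One practical use: measuring the length of a prompt in tokens before sending it.
# (The lesson's count of 18 includes the chat template's special tokens, so the
# bare question here comes out shorter.)
prompt = "Who wrote the book Charlotte's Web?"
print(len(enc.encode(prompt)), "tokens")
```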
Models often charge per token rather than per character, and you can use the tokenizer to tell you how many tokens are in your prompt. For example, "Hello World" has two tokens. We can check the length of the prompt from the last lesson: "Who wrote the book Charlotte's Web?" has 18 tokens. We can take a look at those 18 tokens this way. Here we can see the beginning-of-text string we used previously, and it's represented by a single token. Similarly, all of the special tokens, the start header and end header, are single tokens. Then comes the user prompt, "Who wrote the book Charlotte's Web?", and so on. There's a more convenient way to look at these: here the tokens are separated by colors.

Okay, now it's time to try some prompts of your own. I know you have been anxious to try "supercalifragilisticexpialidocious", or at least I have been. Give it a try and change the prompt. Feel free to pause the video and try your own.

Let's look at another example. Let's ask the model, "How many r's are there in strawberry?" It says there are two r's in the word strawberry, when clearly there are three. We can take a look at the tokenization to see why this might be. Here's the token for strawberry: it's a single token, so the model doesn't actually see the individual letters of strawberry. That may help explain why it's unable to figure out how many r's are in strawberry; it's really just looking at a single token. If we rewrite the prompt and put spaces between all the letters, let's see how it does. There. Now it understands how many r's there are in the word strawberry: three. Here we can see it has encoded each letter individually as a separate token.

There are some additional exercises and explanations at the end of the notebook if you would like to explore more details about the tokenizer. This was just a brief introduction to the tokenizer for the Llama 3 models. The ideas I hope you will take away from this lesson are: the models now use the tiktoken tokenizer; the vocabulary is expanded to 128,000 entries, which improves encoding efficiency and adds support for more languages; and remember, models don't see the strings you send them, but instead see the tokens, and this may sometimes help explain how they respond. I'll see you in the next lesson.
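One last note before you go: if you'd like to reproduce the strawberry exercise outside the notebook, here is a small sketch. It again uses tiktoken's cl100k_base encoding as a stand-in for the Llama 3 tokenizer (my assumption, so the exact splits may differ slightly from the notebook's), but the idea is the same: compare how the whole word and the spaced-out letters are tokenized.

```python
# Sketch of the "strawberry" exercise, using tiktoken's cl100k_base encoding
# as a stand-in for the Llama 3 tokenizer (exact splits may differ).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
spaced = " ".join(word)              # "s t r a w b e r r y"

for text in (word, spaced):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # which characters each token covers
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")

# When the whole word is a single token (or just a few), the model never sees
# its individual letters; spacing the letters out gives it one token per
# letter, which makes letter-counting questions much easier to answer.
```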