The way a model consumes text is through tokens: small pieces of text that are encoded very efficiently. What's important to know for post-training is that tokenizers, the modules that convert text into these compact numeric encodings, come paired with the model they were trained for, and as a result you sometimes need to decide whether to freeze them or retrain them to get the right result.

So you have your text. How do you actually encode it into numbers efficiently, so the model can read it, process it, and then output text? You could encode text the way people do: with words, like in a dictionary, or with characters, like in an alphabet. But are there more efficient ways to encode text? Yes, there are, and they're called tokens. You could definitely use word-level or character-level tokens, but there are really interesting algorithms like BPE, byte-pair encoding, that compress your training text into more efficient substrings, like "ing", for example. Think of "swimming" or "going": the tokenizer will reuse that "ing" string again and again. The full set of tokens a model knows is called its vocabulary, and GPT-3's vocabulary has about 50,000 byte-pair encoded tokens. A vocabulary of 50,000 sounds enormous, but it makes encoding text into these small pieces very efficient.

Tokenization also introduces some fun problems. One of the most popular has been: count the number of R's in "strawberry". That's a surprisingly difficult problem for language models because they operate on tokens. The tokens might be "straw" and "berry", which are just two different IDs, and the model can't really see inside a token, so it might count only two R's. This has been a long-running joke in AI. You can alleviate it by making sure the text is tokenized in a way that exposes the individual R's, for example with character-level tokens. And you can see in this graph that when you tokenize with BPE versus tokenizing with just characters, the number of tokens per sentence becomes much lower: it's the same dataset, and the distribution shifts all the way to the left. That shows just how efficient BPE is at compressing text.

So how do tokens really fit in? You have your text, you turn it into tokens, the LLM processes those tokens and produces the next token, and that token can be decoded back into text. That next token also gets appended to the whole sequence of tokens going into the model so it can predict the token after that, in a loop. The thing that encodes text into tokens and decodes tokens back into text is called a tokenizer.

Looking at a tokenizer a bit more: take the text "what words are indivisible". You might see this whole sequence subdivided into different pieces. Each of those pieces is called a token, and each token gets mapped onto an ID in the vocabulary, which is just a simple lookup table. Essentially, each token becomes a number, and those numbers are fed into the model so it can operate on them using math that you'll take a look at in a sec. In code, the tokenizer looks something like this: you call it on the text, and it gives you back those IDs.
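To make that concrete, here's a minimal sketch using the Transformers library. The "gpt2" checkpoint and the example sentence are just placeholders; any tokenizer on the Hugging Face Hub works the same way.

```python
# Sketch: turning text into token IDs with a Hugging Face tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

text = "Swimming and going share an ending."
ids = tokenizer.encode(text)                    # text -> list of integer token IDs
tokens = tokenizer.convert_ids_to_tokens(ids)   # the substrings behind those IDs

print(ids)     # the numbers the model actually sees
print(tokens)  # the subword pieces; exact splits depend on the learned BPE merges
```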
Decoding is the same thing in the opposite direction: you give the tokenizer those IDs, call tokenizer.decode, and it turns them back into text for you.

Next are embeddings. How do the tokens actually go into the model? Embeddings are basically semantic representations of every single token. Your token IDs get mapped onto the first layer of the model, the embedding layer, and the embedding matrix has one row per token in the vocabulary. So you can think of the token IDs as indexing into this embedding matrix, and the matrix is trained with the model so that each token enters the model with a semantic representation.

At the other end, the model produces probabilities: one probability for every token in the vocabulary. From those probabilities it has to pick a token ID. The simplest way is greedy selection, which just picks the token ID with the highest probability. This is the default in the model.generate function in HuggingFace, so you don't need to specify anything: it greedily picks the highest-probability token, which in this example is token ID 0 with probability 0.5.

You can also sample from this distribution over the vocabulary. You could sample ID 0, but through sampling you might also get ID 2. What's really interesting is that you can modulate how random or how deterministic the sampling is using the temperature parameter. Set the temperature to zero and you get no variation at all, which makes sampling the same as greedy selection. But you can also turn it up. Here's a visual of what temperature does: on the left, at temperature 0.25, it's mostly selecting vanilla, with a little bit of strawberry if you look closely; the low temperature is essentially constraining what can be sampled. At temperature 2, which is much higher, the distribution across the different flavors is much more even, so the model can sample across the whole distribution and get a little more randomness in its next token.

Finally, another method for getting the next token ID from these probabilities over your vocabulary is beam search. Beam search tracks multiple candidate sequences as it samples them. The branches diverge as different tokens get sampled, but it keeps only the top beams, the top candidates. With two beams, you might sample two different token IDs and keep the two best sequences in memory. With three beams, you might ultimately end up with something like this: when a user asks it to write a poem, the model might keep "The sun is setting soon", "The sun rose above the hills", and "Three roads diverged" as its three top candidates.

So now we have an idea of how we go from the probabilities that come out of the model to the next token ID, and how we keep looping that to get token after token.
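Here's a rough sketch of those three decoding strategies side by side, using the generate function from Transformers. The model name, prompt, and generation lengths are just placeholder choices.

```python
# Sketch: greedy decoding, temperature sampling, and beam search with generate().
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Write a poem:", return_tensors="pt")

# Greedy: always take the highest-probability token (generate's default behavior).
greedy = model.generate(**inputs, max_new_tokens=20)

# Sampling: draw from the probability distribution; temperature reshapes it.
# Lower temperature pushes toward greedy, higher temperature adds randomness.
sampled = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=1.5)

# Beam search: keep the top `num_beams` candidate sequences at every step.
beams = model.generate(**inputs, max_new_tokens=20, num_beams=3, num_return_sequences=3)

for seq in beams:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```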
All of this can be processed in a batch for efficiency on GPUs, so we can parallelize the work and get outputs much faster, both within one text and across many texts. When you batch multiple sequences together, you might see something like this: multiple sets of token IDs going through the embedding lookup. But realistically, the sequences will be different lengths. So what's typically done is adding padding tokens, which are basically null tokens, to make the sequences the same length. That matters because a GPU wants everything in a batch to be the same size so it can run the matrix multiplications and other operations much more efficiently. And here's an example of taking a tokenizer and actually looking at how it pads things: it's prepending padding tokens, with ID zero, to the shorter sequences so that they all end up the same length.

So you've seen how tokenizers work. Tokenizers are tied to a model, or rather, models are tied to their tokenizers. The typical way to instantiate one in HuggingFace is the AutoTokenizer class, which takes a model name and maps it to the tokenizer that model was trained with. As you can probably tell, because these tokenizers come with different vocabularies, it really matters which tokenizer you use with a particular model and what the model was trained with. This becomes really important for post-training.

Just to compare a few tokenizers you can use through the Transformers library: here is the tokenizer for bert-base-cased. You can see that the tokens it produces have this hash symbol; it adds that when it chunks a word into pieces, to mark that the pieces belong to the same word. Using it through HuggingFace is pretty manageable. Here is another tokenizer, for a model called T5-small, and this one marks spaces with an underscore-like character and includes the spaces in the tokens; it also prefixes the whole sequence with that underscore as a kind of sequence marker. And then this tokenizer from a DeepSeek model might look really weird: it uses this strange "Ġ" character for spaces, and it doesn't merge runs of spaces into a single space, but it does group them into single tokens. The main takeaway here is not to memorize these different tokenizers and their outputs, but to see that there's a lot of variation in how tokenizers choose the substrings that represent a sequence, and not to be alarmed if you see something like that weird "Ġ" character.
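Here's a small sketch of both ideas: the same sentence run through a few different tokenizers, and a batch padded to a common length. The checkpoints are just illustrative, and the exact token strings will vary from model to model.

```python
# Sketch: comparing tokenizers and padding a batch with Hugging Face Transformers.
from transformers import AutoTokenizer

text = "What words are indivisible?"

# Different tokenizers split the same text differently and use different markers
# (##, an underscore-like space prefix, the Ġ space character, ...).
for name in ["bert-base-cased", "t5-small", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(text))

# Padding: shorter sequences get pad tokens so the whole batch becomes one
# rectangular tensor the GPU can push through its matrix multiplications.
tok = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tok(["short text", "a somewhat longer piece of text"],
            padding=True, return_tensors="pt")
print(batch["input_ids"])       # padded to equal length
print(batch["attention_mask"])  # zeros mark the padding positions
```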
So how does all of this play into post-training? When you're making really small changes, you can often get away with freezing the embeddings; you don't need to change much. When you're keeping the same vocabulary, for example during your RL phase, you can freeze the tokenizer as well; there's no need to keep training it because the vocabulary doesn't change. However, when you're making larger changes, you'll want to train your embeddings, because those semantic representations might shift. Say you're training a legal LLM and it's learning all this legal vocabulary and jargon, and maybe some new acronyms. To learn that, you really need to continue training the embeddings so they can capture that semantic meaning effectively and efficiently. When you're adding new terms or special tags, you'll want to train your tokenizer too, because your vocabulary will change; if you keep the tokenizer frozen instead of retraining it, those tags might not be represented very efficiently. And when you change your vocabulary size, you'll also need to resize your model's embedding layer, because its size depends on the vocabulary size. Typically what people do is a warm-up: train the new embeddings first, then unfreeze and train everything.
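Here's a rough sketch of what that add-tokens-and-resize step can look like in Transformers; the checkpoint name and the special tags are just hypothetical placeholders.

```python
# Sketch: growing the vocabulary during post-training by adding special tokens
# to the tokenizer and resizing the model's embedding layer to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain tags, e.g. markers you might add for a legal fine-tune.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<contract>", "</contract>"]}
)

# The embedding matrix has one row per vocabulary entry, so it has to grow too.
model.resize_token_embeddings(len(tokenizer))

# The new embedding rows start out freshly initialized; they (and usually the
# existing embeddings) still need further training to carry useful meaning.
```

Now that you've learned how tokens are created, it's time to get into some of the fun stuff: fine-tuning math.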