In this lesson you will train several tokenizers. Since they determine how the transformer model processes input text, it is important to give them extra attention. All right. Let's build something.

Tokenizers are trainable components, and they learn the best vocabulary for the given training data. There isn't a single definition of what "the best" means in this case, and various tokenization algorithms exist. Contrary to neural networks, the tokenizer training process is fully deterministic, and it's based on the statistics of the input data. Different models choose different tokenization methods. OpenAI prefers Byte Pair Encoding, while WordPiece is quite commonly used by some other providers, including the open-source Sentence Transformers models that you use. Cohere, for example, selected WordPiece for their English model, but Unigram to create a multilingual one. The size of the vocabulary is a hyperparameter we need to choose upfront, and it's usually at least 30,000 tokens. For multilingual models, it might even be a few times larger, but that's to be expected, as the set of characters to support is much wider, and there are also more sequences to cover. Let's check the most popular tokenization algorithms and see how they split the text.

Byte Pair Encoding is a common choice. BPE starts by splitting the input by whitespace characters. The natural boundaries defined by words are kept, so a single token will never overlap two words. The vocabulary is initialized with all the characters in the training set. New tokens are iteratively created by merging the two tokens that most often appear next to each other. The most common pair is selected and added to the vocabulary, but we do not remove the tokens used to create it. They are kept in the vocabulary, so we can also use them to tokenize other words. Initially, our sentence is split into single letters, but after the first step, two consecutive tokens are merged to form another one. The process continues until we reach the desired vocabulary size. We select the most common pair of tokens from the previous step and add the new token to the vocabulary. The number of steps depends on the number of tokens we want to have. If we set the vocabulary size to 14, this is what the final vocabulary will look like.

Let's repeat that with one of the existing tokenizer implementations from HuggingFace. The HuggingFace Tokenizers library provides implementations of various tokenization algorithms. You will train multiple ones using the same, very simple training dataset. Implementing Byte Pair Encoding is as easy as running a few imports and gluing the components together. A Tokenizer is a general component that requires a selected tokenization model to be passed as an argument. It also allows setting some pre-tokenization that will be run on each input text before we really start tokenizing it. Whitespace pre-tokenization helps to split the text into words. The last thing you will set is the trainer object and the size of the target vocabulary. Practically, you will never go below at least a few thousand, but in this case, 14 should be just fine. You can also experiment with different values. You will now train the tokenizer from a Python iterator by passing the training set with a corresponding trainer instance. Then you can get the vocabulary to see which tokens were learned. The training process is iterative, so we can see the order in which new tokens were added to the vocabulary simply by checking their IDs. This is how our training example would be tokenized.
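For reference, here is a minimal sketch of that pipeline using the HuggingFace Tokenizers library. The training sentence below is a hypothetical stand-in for the lesson's dataset, not necessarily the exact one used in the videos.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical stand-in for the lesson's tiny training set.
training_data = ["walker walked a long walk"]

# The Tokenizer wraps a tokenization model; here it is BPE.
tokenizer = Tokenizer(BPE())
# Whitespace pre-tokenization splits each text into words first.
tokenizer.pre_tokenizer = Whitespace()

# Train to a tiny vocabulary of 14 tokens, as in the lesson.
trainer = BpeTrainer(vocab_size=14)
tokenizer.train_from_iterator(training_data, trainer=trainer)

# Token IDs reflect the order in which tokens were added.
print(tokenizer.get_vocab())
# Tokenization of the training example itself.
print(tokenizer.encode(training_data[0]).tokens)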
The created tokenizer might also be used to encode some new text. Let's say we had a typo and we want to see the output tokens for this text. Obviously, IDs are not that easy to read, so let's access the tokens instead. Another interesting thing is what will happen if we pass some letters that didn't occur in the training phase. The letters S and H have never been seen during this process, so they are omitted in the output, as no corresponding token is available. Byte Pair Encoding is commonly used in large language models, but typically on a byte, not character, level.

WordPiece is a similar algorithm, and it's quite often used in embedding models. The main difference between BPE and WordPiece is that the latter distinguishes between the first letters of words and the middle letters by adding a double-hash prefix to each middle letter. The general idea of WordPiece is to learn words, prefixes, and suffixes separately, as the prefix is thought to carry the meaning, which the inflected suffix does not change much. The words walk, walking, walked, and walks all refer to the same activity, so we expect the model to treat them similarly. Ideally, one token should represent the abstract concept of walking and the next should represent the tense. This is what the initial vocabulary of the WordPiece tokenizer would look like. The training process starts with a vocabulary built from each word's letters, with the middle letters of words prefixed with a double hash. Then the algorithm iteratively merges pairs of tokens, but it selects the tokens to merge based on a score, which is different compared to BPE. We no longer simply select the most frequent pair, but also consider how often these two tokens occur in other contexts: the score divides the frequency of the pair by the product of the frequencies of its two parts. If we calculate the scores for each pair of tokens in the first iteration, it becomes evident that if two tokens always occur next to each other, they will be selected for a merge. The word "long" consists of letters we cannot find anywhere else in the training data. Similarly to BPE, we add new tokens iteratively. On the same example, our tokenization already works differently and prefers different pairs of tokens. Again, the process will stop when we reach the desired vocabulary size. Here are the steps performed when we set the vocabulary size to 27. Since WordPiece adds all the single letters and middle letters as separate tokens, we usually need a bigger vocabulary to capture these regularities.

Let's implement the tokenizer again for the same training set. The HuggingFace Tokenizers library has a slightly different implementation of WordPiece. Instead of calculating the scores, it always selects the most common pair of tokens, similarly to BPE, but differentiates the middle letters by attaching the double-hash prefix to them. Due to that, you will now use another library, implemented specifically for this course. It is based on HuggingFace Tokenizers and provides the real WordPiece implementation, using the score to choose the tokens to merge. This time, the target size of the vocabulary is set to 27. The real-wordpiece library is not 100% compatible with HuggingFace Tokenizers in terms of how we run the training process. This time, you have to call the trainer's train_tokenizer method and pass the training data, as well as the tokenizer instance you want to train. Our vocabulary already looks different than in the case of BPE. Let's check how our training example will now be converted into tokens.
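A sketch of that training call could look as follows. The lesson only states that the trainer exposes a train_tokenizer method taking the training data and the tokenizer instance; the import path, class name, and constructor argument below are assumptions, and the corpus is again a hypothetical stand-in.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Assumed import path and class name for the course's library.
from real_wordpiece.trainer import RealWordPieceTrainer

training_data = ["walker walked a long walk"]  # hypothetical toy corpus

# WordPiece model with whitespace pre-tokenization, as before.
tokenizer = Tokenizer(WordPiece())
tokenizer.pre_tokenizer = Whitespace()

# Unlike the HuggingFace trainers, this trainer receives both the
# training data and the tokenizer instance it should train.
trainer = RealWordPieceTrainer(vocab_size=27)
trainer.train_tokenizer(training_data, tokenizer)

print(tokenizer.get_vocab())
print(tokenizer.encode(training_data[0]).tokens)
```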
The training procedure was able to find a prefix that occurs multiple times, and that sounds like the desired behavior. When it comes to typos, you shouldn't notice any huge changes. If you now pass text with some unknown characters, though, the results will be a bit different. Our tokenizer throws an error when it finds an unknown character. Thus, we need to specify a special fallback unknown token. It has to be known by both the model and the trainer. You will now define the same training procedure using HuggingFace Tokenizers, but with this unknown token used. You can also experiment with the real-wordpiece training procedure and see how the results differ. We'll also increase the target size of the vocabulary by one, as the unknown token will get its own ID as well. Let's run the training process again. The learned vocabulary is different than before, and so is the tokenization of the same examples. If we now run the newly created tokenizer on the same training example, the output should already be a bit different. The HuggingFace variant seems to cover full words with individual tokens a bit better. For typos, we still shouldn't see any major change. However, when we pass some unknown characters, they will be automatically converted into the special token we created. We could obviously select a different value for it, but that's the typical convention.

BPE and WordPiece are not the only algorithms, but they have much in common, especially because they start with a basic set of characters and then iteratively merge them to form more complex tokens. Unigram takes a different approach. Instead of building the tokens from the bottom up, we can also do it the other way around. Unigram starts with a huge vocabulary, quite often created with BPE using a much bigger vocabulary size than expected, and then it removes some of the tokens based on a calculated loss. The initial vocabulary allows a single word to be tokenized in multiple ways. The set of all possible tokenizations is important for calculating the loss in the Unigram training process. At each iteration, the algorithm computes how much the overall loss would increase if a specific token was removed, and looks for the tokens that would increase it the least. Probabilities are defined by the frequencies of the tokens. In our case, there are 63 occurrences of tokens in total, so we can calculate the probability of a particular tokenization as a product of the probabilities (frequencies divided by that total) of all the tokens used. Even in such a simple example, the number of tokenizations to consider is pretty significant. Unigram training will check how removing each individual token would impact the loss, and then select the one that can be removed with the smallest impact on the loss value. Analyzing each step of the Unigram training process would require lots of computations. Practically, we can train the Unigram model with the HuggingFace Tokenizers library without changing too much.

The Unigram model is the last one we will train today. The training pipeline is similar to any other model you have tried so far. We expect a vocabulary of 14 tokens. You might be surprised that the vocabulary is shorter than the expected size of 14, but that's just an implementation detail.
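Both pipelines described above follow the same pattern. Here is a minimal sketch, again on a hypothetical toy corpus, with [UNK] used as the (conventional, but freely chosen) unknown token.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram, WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer, WordPieceTrainer

training_data = ["walker walked a long walk"]  # hypothetical toy corpus

# WordPiece with a fallback token for unseen characters. The token has
# to be known by both the model and the trainer, and the vocabulary is
# one entry bigger to make room for it.
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()
wp_trainer = WordPieceTrainer(vocab_size=28, special_tokens=["[UNK]"])
wp_tokenizer.train_from_iterator(training_data, trainer=wp_trainer)
# A word containing unseen letters falls back to the [UNK] token.
print(wp_tokenizer.encode("shark").tokens)

# Unigram uses the same outer pipeline, even though internally it prunes
# a large candidate vocabulary instead of merging characters bottom-up.
uni_tokenizer = Tokenizer(Unigram())
uni_tokenizer.pre_tokenizer = Whitespace()
uni_trainer = UnigramTrainer(vocab_size=14)
uni_tokenizer.train_from_iterator(training_data, trainer=uni_trainer)
# The final vocabulary may end up slightly smaller than requested.
print(uni_tokenizer.get_vocab())
```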
Unigram removes a couple of tokens at each step and only guarantees that the size of the vocabulary won't be bigger than desired. We can see that the common prefixes were found to be important, so the hope is that our tokenization can capture the meaning correctly. Also, there are no double-letter tokens anymore, which was the case with WordPiece and BPE. We only have the base alphabet and the common prefixes of the words. Unigram seems to be a solution to the problem of so-called glitch tokens, as it removes the ones which won't be used in the tokenization of the training data, while the other methods keep them in the vocabulary. We can also try out the tokenization on the same examples as we used before to see the output tokens, and similarly when we make a typo or pass data with some unknown characters. In this case, the unknown letters seem to be supported when we check the tokens, so let's see their identifiers. An unknown token is used for the sequences of unknown characters, but checking the tokens alone does not show it clearly.

If you read some other materials about tokenization, you can easily find SentencePiece mentioned quite often. Let's see how it differs from the other methods we discussed today. SentencePiece is just an implementation of the same tokenization algorithms, with one additional assumption about the text, not a different algorithm itself. SentencePiece does not split the text by whitespace, but treats whitespace characters like any other characters. That enables it to work for languages that do not use them as word separators. Internally, SentencePiece uses Byte Pair Encoding or Unigram to build the vocabulary; a minimal training sketch is shown at the end of this lesson. SentencePiece allows building tokens that span multiple words. The fact that it allows whitespace characters to be parts of the tokens is essential, for example, if you want to build an embedding model for code. Languages like Python, for which indentation matters a lot in terms of meaning, will benefit from this approach. However, there is also English data in which some proper names consist of multiple words, like San Francisco or Real Madrid. Having a single token for both words might be beneficial in some cases.

Checking the tokenization algorithm and its vocabulary is an important step in choosing a model. The practical implications might still be unclear, but we will try to shed some light on them in the next lesson.
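As a closing reference, here is the minimal SentencePiece training sketch mentioned above, using the sentencepiece Python package. The input file name, model prefix, and vocabulary size are placeholders, and the corpus in that file is assumed to be large enough for the requested vocabulary.

```python
import sentencepiece as spm

# Whitespace is treated like any other character (internally encoded as
# the "▁" meta symbol), so no pre-tokenization step is needed.
spm.SentencePieceTrainer.train(
    input="training_data.txt",        # placeholder path to a raw-text corpus
    model_prefix="toy_sentencepiece",
    vocab_size=100,                   # placeholder; depends on the corpus
    model_type="unigram",             # internally Unigram, or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="toy_sentencepiece.model")
print(sp.encode("walker walked a long walk", out_type=str))
```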