Now that you have cleaned your dataset, you need to prepare it for training. In this lesson, you'll learn how to package your training data so that it can be used in HuggingFace. Let's dive in. As Lucy just mentioned, there is a bit more manipulation of the data that you have to do before you can use it for your training run. The two main steps are tokenizing the data and then packing it. LLMs don't actually work directly with text; their internal calculations require numbers. Tokenization is the step that transforms your text data into numbers. The exact details of how text is mapped to tokens depend on the vocabulary and the tokenization algorithm of your model. Each model has a specific tokenizer, and it is important to choose the right one or your model won't work. Packing structures the data into continuous sequences of tokens, all at the maximum length that the model supports. This reshaping makes training much more efficient. One important step in packing is the addition of special tokens to indicate the start and end of the sequences. This is easier to see by looking at real examples, so let's head to the notebook.

All right, let's start with tokenizing. Recall that we created this parquet file in the previous lesson. In this step we're going to use the HuggingFace datasets library so that we can use all the goodies that come with it. But note that we won't be using the whole file. We will shard the entire dataset into ten pieces and deal with only one of those shards to reduce execution time. So you can see that before we had 40,000 rows, but now we only have 4,000 rows. Now let's load a tokenizer. You can choose any tokenizer from an existing model hosted on HuggingFace or create your own. Many times you will see models in the same family use the same tokenizer. In this case we will be using the TinySolar tokenizer, which is in the same family as Solar. Note that we disable the use_fast flag in this case.
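The sharding step above can be sketched as follows. In the notebook this is a single call to the HuggingFace `datasets` API, roughly `dataset.shard(num_shards=10, index=0)`; here we mimic the default (strided, non-contiguous) split on a plain Python list so the logic is visible without loading the real parquet file:

```python
# Sketch of sharding: split N rows into num_shards pieces and keep one.
# Stand-in for datasets' Dataset.shard(num_shards=10, index=0); with
# contiguous=False (the historical default), row i goes to shard i % num_shards.
rows = list(range(40_000))   # toy stand-in for the 40,000-row dataset
num_shards = 10

def shard(rows, num_shards, index):
    return rows[index::num_shards]

piece = shard(rows, num_shards, index=0)
print(len(piece))  # 4000
```

Working on one shard keeps the lesson fast; for a real training run you would of course process every shard (or skip sharding entirely).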
If use_fast is set to true, the auto tokenizer uses a tokenizer implemented in Rust, which operates in parallel and is much faster. However, in this case, long text samples sometimes tend to hang, so we will set use_fast to false and instead use the map function of the datasets library for parallel processing. Are you curious about the outputs of the tokenizer? Let's try it out before processing in batches. As discussed in the slides, you can see that the entire sentence is tokenized into multiple tokens, where a special token indicates where the original white spaces were located. Note that the output is still text; next we're going to convert these tokens into numbers. Now we will create a helper function in order to take advantage of the map method of the HuggingFace datasets library. Our function will tokenize text, convert tokens to IDs, and add BOS and EOS tokens. We add BOS and EOS tokens because we want our model to know which spans of tokens form a coherent sequence. We then calculate the number of tokens in each example. Do you want to take a look at the results? We're currently mapping the tokenization function over our dataset. Do you notice the difference? Now we have input_ids and num_tokens within our dataset. Let's take a look at the first example. Here you can see that we have the text, the input IDs that go along with the text, and the total number of tokens. All seems good. If you want to check out other samples, feel free to change the index. Now we are going to calculate the total number of tokens in our dataset. When training LLMs, we are often interested in the total number of tokens, and we can easily check this with numpy. So this small dataset, which started out with approximately 4,000 text samples, actually contains about 5 million tokens. You can see how a dataset built from most of the internet or entire libraries of books can end up with billions or even trillions of tokens.
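The helper function described above can be sketched like this. To keep it runnable offline, a toy whitespace tokenizer stands in for the real TinySolar tokenizer, and a plain list stands in for the HuggingFace dataset; the structure (add BOS and EOS, store input_ids and num_tokens, then sum the counts with numpy) is what the notebook does via `dataset.map(tokenization)`:

```python
import numpy as np

# Toy stand-ins: in the notebook these come from the real tokenizer
# (tokenizer.tokenize / tokenizer.convert_tokens_to_ids and its bos/eos ids).
BOS, EOS = 1, 2
vocab = {}  # toy vocabulary built on the fly

def toy_ids(text):
    return [vocab.setdefault(tok, len(vocab) + 3) for tok in text.split()]

def tokenization(example):
    # Tokenize, convert to ids, and wrap with BOS/EOS special tokens.
    example["input_ids"] = [BOS] + toy_ids(example["text"]) + [EOS]
    example["num_tokens"] = len(example["input_ids"])
    return example

dataset = [{"text": "hello world"},
           {"text": "packing makes training efficient"}]
# The notebook applies this in parallel with dataset.map(tokenization);
# here we just loop over the toy list.
dataset = [tokenization(ex) for ex in dataset]

total_tokens = int(np.sum([ex["num_tokens"] for ex in dataset]))
print(total_tokens)  # 2 + 4 words, plus BOS/EOS per example → 10
```

The same `np.sum` over the `num_tokens` column is what yields the roughly 5 million token count on the real shard.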
Now we're on the last step of preparing our dataset: let's pack it. Our dataset currently contains rows of variable lengths, but when training a large language model, we want to transform these variable lengths into sequences of equal size. To do this, we will perform the following. First, we will concatenate all input IDs into one large list; we also call this "serialization". Second, we reshape the large list by partitioning it into smaller lists of the maximum sequence length. So, let me show you how this actually works. This is concatenating all of the input IDs for each example into a single list. You can see that the number of input IDs equals the number of tokens that we calculated above. Now we will choose a maximum sequence length. Here we will set it to 32, but this can be set to a longer length if your device has enough memory. A longer maximum sequence length will enable better performance on long text; note that recent models such as Solar and Llama 2 use 4096. Now, let's calculate the largest total number of input IDs that leaves a remainder of zero when divided by the maximum sequence length. You can see that the number of input IDs shrank a bit. Next, let's discard the remainder input IDs from the end of the dataset so that it can be exactly partitioned by the maximum sequence length. If you check the shape of the input IDs, it is one-dimensional, and you can tell that by the trailing comma in the shape tuple. Let's reshape the input IDs to (-1, max_seq_length). We can also make sure that the type is int32 here. Using -1 lets the reshape function automatically determine the number of rows. If you check the type of the reshaped input IDs, the object is currently a numpy array, but you need to convert it to a HuggingFace dataset so that it can be used in training. So let's do that: what we're doing here is transforming our input IDs into a list and then into a dictionary.
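The packing steps above can be sketched end to end in numpy. The ids and the max sequence length here are tiny illustrative values (the lesson uses 32, and real models 4096); the operations (concatenate, trim the remainder, reshape to `(-1, max_seq_length)` as int32) are the same:

```python
import numpy as np

# Toy per-example input_ids (1 = BOS, 2 = EOS); 13 ids in total.
examples = [[1, 5, 9, 2], [1, 7, 7, 7, 2], [1, 3, 8, 2]]

# Step 1: "serialization" — concatenate everything into one flat array.
input_ids = np.concatenate(examples)
print(input_ids.shape)  # (13,) — note the trailing comma: one-dimensional

# Step 2: keep the largest multiple of max_seq_length and drop the remainder.
max_seq_length = 6  # 32 in the lesson; small here so the result is readable
total = (len(input_ids) // max_seq_length) * max_seq_length
input_ids = input_ids[:total]  # discards 1 trailing id in this toy case

# Step 3: reshape; -1 lets numpy infer the number of rows.
packed = input_ids.reshape(-1, max_seq_length).astype(np.int32)
print(packed.shape)  # (2, 6)

# The notebook then wraps this in a HuggingFace dataset, roughly:
#   Dataset.from_dict({"input_ids": packed.tolist()})
```

The final comment shows the list-then-dictionary conversion the narration mentions; it is left as a comment so the sketch runs without the `datasets` library installed.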
Finally, let's save the data to the local file system to use again later. Great. We now have our cleaned data tokenized and packed into the right shape for training. The next step is to configure our model. Let's move on to the next lesson to see how to do that.
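For the save step, the notebook goes through the HuggingFace `datasets` API (for example `to_parquet` or `save_to_disk` on the packed dataset). As a dependency-free sketch of the same round trip, we can persist the packed int32 array with numpy and reload it:

```python
import numpy as np

# Stand-in for saving the packed dataset; the real lesson writes a parquet
# file via the datasets library, e.g. packed_ds.to_parquet("packed.parquet").
packed = np.arange(64, dtype=np.int32).reshape(-1, 32)  # toy packed array
np.save("packed_input_ids.npy", packed)

restored = np.load("packed_input_ids.npy")
print(restored.shape)  # (2, 32)
```

Whatever format you choose, the point is the same: the packed, fixed-length rows are written out once so the training run in the next lesson can load them directly.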