In this lesson, you'll build the SFT pipeline on a small-scale training dataset. All right, let's dive into the code. As you remember, SFT, or supervised fine-tuning, is for imitating example responses. We usually start from a language model, typically a base model, where the assistant only tries to predict the next most probable tokens given the user query. We then curate some chat data or instruction-following data where the assistant responds to user queries in a more natural fashion: when the user asks "How are you?", the ideal response would be "I'm doing great." We use this labeled data to run supervised fine-tuning on top of the base model and get a fine-tuned language model that can chat with you more fluently. In the lab, we'll start from a base language model, prepare labeled data for chat and instruction following, and conduct SFT to get a fine-tuned model that can chat with a user.

Okay, let's see all of this in code. We'll start by importing the relevant libraries. We first import torch, which is essential for training with PyTorch, and pandas for displaying some of the tables we'll use for the dataset. We use the HuggingFace datasets library to load the relevant datasets, along with its Dataset class for defining them. From transformers, also from HuggingFace, we need TrainingArguments, AutoTokenizer, and AutoModelForCausalLM. And lastly, we'll be using HuggingFace TRL throughout this coding lesson, where we need SFTTrainer, the data collator, and SFTConfig for setting up the SFT training process.
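For reference, here's roughly what that import cell might look like; I'm writing out the standard names from these libraries, so treat this as a sketch rather than the notebook's exact cell, and I've left out the data collator import since I can't confirm which one the notebook uses:

```python
import torch                      # core PyTorch, used for training and inference
import pandas as pd               # for displaying datasets as tables

# HuggingFace datasets: loading data and the Dataset class
from datasets import load_dataset, Dataset

# HuggingFace transformers: training arguments, tokenizer, and causal LM loader
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM

# HuggingFace TRL: the SFT trainer and its configuration
from trl import SFTTrainer, SFTConfig
```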
After importing the libraries, let's set up some helper functions that will be used throughout the coding lessons. The first function we're going to write is an auxiliary function for generating responses. It takes the model itself, the tokenizer, the user message, optionally a system message if there is one, and the maximum number of new tokens allowed during generation. We start from an empty list of messages. If there is a system message, we append a dictionary with the role "system" and the content set to the provided system message string, and then we append the user's own message in a similar way to complete the final messages. With these messages, we use the tokenizer's apply_chat_template function to convert them into the format the language model was trained on. For Qwen3 specifically, enable_thinking needs to be set to False so the model doesn't enter its thinking mode. Once we have the prompt as text, we call the tokenizer to convert it into tokens that the language model can recognize, and we send those tokens to the same device as the model, in case the model is located on a GPU.

With the tokens as input, we use HuggingFace's model.generate to produce the corresponding outputs, and we pass max_new_tokens from the function argument so the caller can control how many new tokens are generated. Besides model.generate, I also recommend trying vLLM, SGLang, or TensorRT-LLM, which are inference libraries that can be faster and more efficient than HuggingFace's own model.generate. After we get the outputs, we extract the generated IDs, which are still tokens, and call tokenizer.decode to convert them into a text-based response, which the function returns. That concludes the first helper function for generating responses.
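A minimal sketch of how this helper could be written, assuming the imports above; the default of 300 new tokens and the exact decoding details are my assumptions:

```python
def generate_responses(model, tokenizer, user_message,
                       system_message=None, max_new_tokens=300):
    # Build the chat messages list
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})

    # Convert the messages into the prompt format the model was trained on;
    # enable_thinking=False keeps Qwen3 out of its thinking mode
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    # Tokenize and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate, then keep only the newly generated tokens
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids = outputs[0][inputs["input_ids"].shape[1]:]

    # Decode the tokens back into a text response
    return tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
```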
Next, we implement another function, test_model_with_questions, which takes the model, the tokenizer, a list of questions, optionally a system message, and a title for printing. We first print the title, then call generate_responses on each question using the previous function, and print the model input and model output for each question-and-response pair.

After this, we have a helper function for loading the model and the tokenizer. It takes the model name on HuggingFace and a flag for whether you want to use a GPU. We call AutoTokenizer to load the corresponding tokenizer from HuggingFace, and AutoModelForCausalLM to actually load the model itself. If we're using a GPU, we send the model to CUDA, assuming an NVIDIA GPU, so the model lives on the GPU. One more thing to pay attention to: since we rely on a chat template in the generate_responses function above, if the tokenizer has no such template, we create one ourselves. The chat template is written in Jinja format, where we iterate over the provided messages: if the role of a message is system, we write "System:" followed by the content; if the role is user, we write "User:" followed by the content; and if the role is assistant, we write "Assistant:" followed by the content. After this, there's a small piece of tokenizer config: if no pad token exists, we default it to the end-of-sequence token. Finally, we return the loaded model and tokenizer.

The last helper we'll need is display_dataset, which takes the dataset and displays it in a Jupyter-notebook-friendly fashion: we go through the dataset's examples, pull out the user message and the assistant message, and append them as a row to a list. Then we turn those rows into a table and display it with pandas.

All right, that's everything we need for the helper functions. Next, let's load the base model and test it on a few simple questions. There are two things to set up first. The first is USE_GPU, which we set to False; on the DeepLearning.AI platform we currently only have access to CPU, so I'm turning the GPU off, but once you try this on your own GPU machine, feel free to set USE_GPU to True. I also set a few questions for testing the base model: give me a one-sentence introduction of a language model, calculate one plus one minus one, and what's the difference between a thread and a process? Next, we load the model and tokenizer from a small Qwen3 model, Qwen3-0.6B-Base, and test those questions on it. Note that this is a base model and we haven't done any SFT on top of it. This might take some time, so we'll speed it up in the post-edits.
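To make this concrete, the evaluation step might look roughly like the following, assuming the helper functions above; the exact question wording and the Qwen/Qwen3-0.6B-Base repo id are my assumptions about what the notebook uses:

```python
USE_GPU = False  # flip to True once you run this on your own GPU machine

questions = [
    "Give me a 1-sentence introduction of a language model.",
    "Calculate 1 + 1 - 1.",
    "What's the difference between a thread and a process?",
]

# Load the base (not yet fine-tuned) Qwen3 model and run the test questions
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen3-0.6B-Base", USE_GPU)
test_model_with_questions(model, tokenizer, questions,
                          title="Base Model (Before SFT) Output")

# Free memory before loading the SFT'd checkpoint for comparison
del model, tokenizer
```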
You'll see that the base model, before any SFT, outputs fairly random tokens for any given instruction. This is first because the chat template we use was never seen during pre-training, and second because a pre-trained model is simply not great at answering questions from a user. Now, let's take a look at another checkpoint that has been trained through supervised fine-tuning; we'll detail the training process later. I load this different checkpoint, which we trained with SFT on top of the base model, and run the same questions. This might also be slow, so we'll speed it up in the post-edits. You can see that after supervised fine-tuning on the base model, the output is much more natural: the model responds to each request, giving a one-sentence introduction of a language model, calculating the arithmetic, and explaining the difference between a thread and a process. I trained this Qwen3 model with SFT precisely so we can compare the model's performance before and after SFT.

Next, I'll show you how we conduct the entire SFT process. However, due to resource limitations, we won't be performing SFT on the Qwen3-0.6B model itself; instead, we'll do SFT on a much smaller model and a much smaller dataset. Feel free to use the entire dataset on the same model to reproduce my SFT result. Now let's try doing SFT on a small model. We first set the model name to HuggingFaceTB/SmolLM2-135M, a 135-million-parameter model that's smaller than Qwen3-0.6B, and we load its model and tokenizer. When you train your own model on a GPU, feel free to change the model name to Qwen3.

We also prepare a training dataset with prompt-response pairs that we created beforehand, and here's a short sample of the example user prompts and assistant responses. The instructions span questions, commands, even translation requests, and so on, so this is a fairly diverse supervised fine-tuning dataset. Since we're not using a GPU in this environment, we just train on the first 100 samples for illustration purposes. When you use a GPU, feel free to train on the entire dataset to get back the Qwen3 performance.

The last thing we need to configure is the SFT trainer configuration, where we set the important hyperparameters for SFT to work well. Here are a few key parameters we usually set during the SFT procedure. The first one is the learning rate, and you usually need to play with it quite a bit to figure out what works best for your own dataset and model. Then there's the number of training epochs; here we set it to one to speed up the whole process, but if you want to go over the dataset multiple times, you can set it to two or higher. The next two, per_device_train_batch_size and gradient_accumulation_steps, are the two factors that determine your effective total batch size. The per-device train batch size is the batch size for each device, or GPU: if you have eight GPUs and set the per-device train batch size to two, your effective batch size without gradient accumulation would be two times eight, which is 16. Gradient accumulation steps is the number of steps accumulated before performing a gradient update, so it also gets multiplied with the per-device train batch size and the number of GPUs to determine the total effective batch size. In our case, because we only have one CPU and the per-device train batch size is one, with gradient accumulation steps of eight the final effective batch size is one times one times eight, which is eight. If you set the per-device train batch size larger, you usually need more memory on each GPU; that's why we sometimes use gradient accumulation steps, which effectively increase the batch size without increasing memory usage.

Next, there's one additional option, gradient checkpointing, which, when enabled, can help reduce GPU memory by not storing some of the activations and recomputing them during the backward pass. Here we set it to False, but if you run out of memory, switching it to True might be one of the first things to try. Finally, logging_steps is the frequency of logging the training progress, and we'll see later how it affects the training output. After setting up all the hyperparameters, we're ready to kick off training with SFTTrainer, where we pass in the model, the SFT config as the arguments, the training dataset we prepared before, and the tokenizer as the processing class. Then we can kick off the training. Let's now run the SFT trainer and begin training.
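Putting the configuration and the trainer together, a minimal sketch could look like this; the specific hyperparameter values and the train_dataset variable name are illustrative assumptions rather than the course's exact settings:

```python
sft_config = SFTConfig(
    learning_rate=8e-5,              # tune this for your own dataset and model
    num_train_epochs=1,              # a single pass over the data to keep things fast
    per_device_train_batch_size=1,   # batch size per device (CPU or GPU)
    gradient_accumulation_steps=8,   # effective batch size = 1 device * 1 * 8 = 8
    gradient_checkpointing=False,    # set True to trade compute for lower memory
    logging_steps=2,                 # how often training metrics are logged
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,     # the prompt-response dataset prepared above
    processing_class=tokenizer,
)

trainer.train()                      # kick off supervised fine-tuning
```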
You'll see a progress bar showing the training progress. We're training for one epoch, and since we're only training on 100 samples with an effective batch size of eight, the total number of gradient-descent steps is 13. It takes on the order of minutes to train this small model on 100 samples. Now the SFT training is complete. Since it's a smaller model trained on only 100 samples, we shouldn't expect extremely strong performance.

Let's test the results of this small-scale SFT run. We pass the SFT-trained model into our test function and see how it performs on the questions we prepared. You'll see that the model gives reasonable responses to these inputs, though sometimes it can be repetitive and sometimes it doesn't give the right answer. This is mostly because, first, the model is small, and second, the dataset we trained on is only 100 samples, which may not be enough to move the model into good shape. We did this because of the limited resources on the platform, and we encourage you to train the Qwen3-0.6B model on the full dataset on your own GPU to reproduce the results illustrated earlier.

In this lesson, we turned a base model into an instruct model that can chat with a user, starting from the Qwen3-0.6B base model, and we also walked through the whole SFT procedure with the smaller HuggingFace SmolLM2 model. In the next lesson, we'll go over some basics of DPO. This concludes lesson three on SFT in practice.