This week's lab lets you try out fine-tuning using PEFT with LoRA for yourself by improving the summarization ability of the FLAN-T5 model. My colleague Chris is going to walk you through this week's notebook, so I'll pass you over to him.

Hey, thanks Shelby. Now let's take a look at Lab 2. In Lab 2, you will get hands-on with full fine-tuning and parameter-efficient fine-tuning, also called PEFT, with prompt instructions. You will tune the FLAN-T5 model further with your own specific prompts for your specific summarization task. So let's jump right into the notebook.

In Lab 2, we are going to actually fine-tune the model. In Lab 1, we were doing zero-shot inference and in-context learning. Now we are actually going to modify the weights of our language model, specific to our summarization task and specific to our dataset.

Real quick, let's do these pip installs, the same as in Lab 1, where we're going to use PyTorch. There's also a library called Evaluate, which is what we're going to use to calculate the ROUGE score. You learned about ROUGE in the lessons as a way to measure how well a summary encapsulates what was in the original conversation or the original text. And PEFT, which you heard about in the lessons, is what we will use to do the parameter-efficient fine-tuning.

Now, I'm going to do some imports from those pip installs. Once again, we have AutoModelForSeq2SeqLM, which is what gives us access to FLAN-T5 through the Transformers Python library, along with the tokenizer. We used GenerationConfig in the previous lab. Now we're going to see two new classes, TrainingArguments and Trainer, also from Transformers. These simplify our code when we're training or fine-tuning our language model. We also import PyTorch and Evaluate, and we will use Pandas and NumPy later on.

Let's load the dataset just like we did in the first lab, and let's load the model and the tokenizer just like we did in the first lab. Note that this is called the original model, which will be useful later when we compare all the different fine-tuning strategies to the original model that is not fine-tuned.

Here is a convenience function that prints out all of the parameters in the model, and specifically the trainable parameters. This will become useful when we introduce the PEFT version of the model, which does not train all of the parameters. Here we see approximately 250 million parameters being trained when we do full fine-tuning, which is the first part of this lab. The second part of the lab is where we do parameter-efficient fine-tuning, specifically with LoRA, where we will train only a very small number of parameters. So keep that in mind. It's a bit of messy code, but it's pretty useful for the comparison.

Okay, and just like we did in the first lab, we're going to show a sample input, show the human baseline, and do the zero-shot inference. This is not one-shot or few-shot; we're past that now, that was Lab 1. Here, we are trying to get to the point where one simple call into our model can give us a decent summary without having to pass in the one-shot and few-shot examples. That's the goal. And the first way we're going to do that is full fine-tuning.
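For reference, here is a minimal sketch of what those setup cells might look like. The dataset and checkpoint names (knkarthick/dialogsum, google/flan-t5-base), the dtype, and the helper function name are assumptions for illustration, not verbatim from the lab.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig,
                          TrainingArguments, Trainer)
import torch
import evaluate
import pandas as pd
import numpy as np

# Load the dialogue-summarization dataset and the base FLAN-T5 model (assumed names)
dataset = load_dataset("knkarthick/dialogsum")

model_name = "google/flan-t5-base"
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def print_number_of_trainable_model_parameters(model):
    # Count parameters that require gradients vs. all parameters
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (f"trainable model parameters: {trainable_model_params}\n"
            f"all model parameters: {all_model_params}\n"
            f"percentage of trainable model parameters: "
            f"{100 * trainable_model_params / all_model_params:.2f}%")

print(print_number_of_trainable_model_parameters(original_model))
```

For full fine-tuning, this prints roughly 250 million trainable parameters at 100%; the same helper will show a much smaller fraction once we wrap the model with PEFT later on.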
Here is a convenience function that tokenizes our dataset and wraps each example in a prompt. As we saw in the first lab, we have a prompt that says, summarize the following conversation, then we give it the dialogue, and then we end the prompt with "Summary:". This function lets us map over all of the elements of our dataset and convert them into prompts with instructions. And that's what we're doing here: full fine-tuning with instruction prompts.

Okay. Here we're going to take a subsample, just to keep the resource requirements low for this particular lab and speed things up a little bit. Let's take a look at the size. We have about 125 training examples, we're going to use five for validation, and we're going to use 15 as the holdout test set later on when we compare. So we're going to fine-tune with the training set, validate with the validation set, and when all of that is said and done, we'll use the 15 test examples to compare the different strategies for fine-tuning with instructions.

Here we see TrainingArguments, with some defaults for the learning rate and some pretty low values for the max steps and the number of epochs. That's because we want to minimize the amount of compute needed for this lab. If you have more time, you can certainly bump these values up, maybe to five epochs or a max steps of 100. In a bit, I'll show you how we work around that: we have trained, offline, a much better model with much higher max steps and training epochs, and we will pull that in and continue from there. But this is what the code looks like. Here's the training dataset, here's the evaluation (validation) dataset, and here's where we call train. So let me just do Shift-Enter.

And here's that model that we trained outside of this lab that is a little bit better, so we'll actually start with that. I do want you to keep an eye on the size of this model. This is a fully fine-tuned instruction model, and you'll see it's close to one gigabyte: 945 megabytes. That will come in handy later when we compare it to PEFT, which is on the order of 10 megabytes. We pulled that model down into a directory here called flan-dialogue-summary-checkpoint. Now we're going to load that instruction model, and this becomes the new model that we'll use for comparison in a bit.

Now that we've loaded what we're calling the instruct model, let's try an example from our test dataset and qualitatively see, with the human eye, how this looks. The baseline summary is: Person 1 teaches Person 2 how to upgrade Person 2's system. Then we have the original model, without any instruction fine-tuning, just zero-shot. And the instruction fine-tuned model that we just finished training gives: Person 1 suggests Person 2 should upgrade their system, hardware, and CD-ROM. Person 2 thinks it's a great idea.

So that's qualitative; that's just looking at it. And we only looked at one example. This is why we have quantitative techniques to do this comparison, to do the evaluation. Specifically, let's load ROUGE, and we're going to take maybe the first 10 examples here and compare them. Okay, so let's take the first 10 from our test dataset.
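A sketch of that prompt-wrapping function and the Trainer setup might look like the following. The column names ("dialogue", "summary", "id", "topic"), the exact prompt wording, the output directory, and the hyperparameter values are assumptions chosen to mirror the description above; the deliberately tiny max_steps is just to keep the lab cheap.

```python
def tokenize_function(example):
    # Wrap each dialogue in an instruction prompt, then tokenize prompt and summary
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example["input_ids"] = tokenizer(prompt, padding="max_length",
                                     truncation=True, return_tensors="pt").input_ids
    example["labels"] = tokenizer(example["summary"], padding="max_length",
                                  truncation=True, return_tensors="pt").input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["id", "topic", "dialogue", "summary"])

training_args = TrainingArguments(
    output_dir="./dialogue-summary-training",
    learning_rate=1e-5,
    num_train_epochs=1,
    max_steps=1,          # kept tiny on purpose; raise for real training
    weight_decay=0.01,
    logging_steps=1,
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)
trainer.train()

# Later: load the fully fine-tuned checkpoint that was trained offline (~1 GB on disk)
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    "./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)
```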
We will run these conversations through both the original FLAN-T5 model and the instruction fine-tuned model that we trained up above. Here, of course, we're going to wrap each one in a prompt, similar to what we used during training. Then let's see how it did; this is qualitatively taking a look at them side by side.

Okay, let's compare the ROUGE metrics for both the original FLAN-T5 and the instruction fine-tuned model that we tuned above. Here, we see that the instruction fine-tuned model scores much higher on the ROUGE evaluation metrics than the original FLAN-T5 model. This shows that with a little bit of fine-tuning, using our dataset and a specific prompt, we were able to improve on the ROUGE metrics.

One other thing we did offline was to run this evaluation with a much larger test set. It wasn't just the 10 or 15 examples; it was the full dataset. That's what this CSV file is, which came along in the data directory with this lab. Here we see that with a much larger dataset the scores tell a similar story: close to double in some cases, not quite, but a pretty significant improvement over the original FLAN-T5. And here we see the percentage improvements specifically. If we do the calculation, ROUGE-1 is about 18% higher, ROUGE-2 about 10%, ROUGE-L about 13%, and ROUGE-Lsum about 13.7%.

All right, now let's get into parameter-efficient fine-tuning. This is one of my favorite topics. It makes such a big difference, especially when you're constrained by how much compute you have. You can lower the footprint, memory, disk, GPU, CPU, all of the resources, just by introducing PEFT into your fine-tuning process.

In the lessons, you learned about LoRA and about the rank. Here, we're going to choose a rank of 32, which is actually relatively high, but we're just starting with that. And the task type is Seq2SeqLM; this is FLAN-T5. With just a few extra lines of code to configure our LoRA fine-tuning, we see we're only going to train 1.4% of the model parameters. In a lot of cases, that means you can fine-tune very, very large models on a single GPU.

And here are some more of those training arguments. This is really back to the original Hugging Face Trainer and TrainingArguments, except instead of using the regular model, we are using the PEFT model. This is a convenience function offered by the PEFT library: we give it the original model, which is FLAN-T5, we give it the LoRA configuration that we defined above with rank 32, and we say, get me a PEFT version of that model. That's what comes out with 1.4% trainable parameters. Then we set up the training arguments, again with a small number of steps and a small number of epochs.

We do have a version that was trained offline that's a little bit better than the one trained in this lab, so let's use that. Here's that other model that was stored, and we see it is only about 14 megabytes. These are called the PEFT adapters, or LoRA adapters, and they get merged or combined with the original LLM. When you go to actually serve this model, which we will in a bit, you take the original LLM and then merge in the LoRA PEFT adapter. But the adapters are much smaller, and you can reuse the same base LLM and swap in different PEFT adapters when needed.
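Here is a sketch of what the ROUGE comparison and the LoRA setup might look like. The lists original_model_summaries, instruct_model_summaries, and human_baseline_summaries are assumed to have been built from the generated and reference summaries above, and the target modules, dropout, learning rate, and output paths are illustrative assumptions rather than the lab's exact values.

```python
from peft import LoraConfig, get_peft_model, TaskType

# Quantitative comparison: ROUGE for original vs. instruction fine-tuned summaries
rouge = evaluate.load("rouge")
original_model_results = rouge.compute(predictions=original_model_summaries,
                                       references=human_baseline_summaries,
                                       use_stemmer=True)
instruct_model_results = rouge.compute(predictions=instruct_model_summaries,
                                       references=human_baseline_summaries,
                                       use_stemmer=True)
print(original_model_results)
print(instruct_model_results)

# LoRA configuration: rank 32, applied to a Seq2Seq LM (FLAN-T5)
lora_config = LoraConfig(
    r=32,                        # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q", "v"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

# Wrap the frozen base model with trainable LoRA adapters (~1.4% of parameters)
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

peft_training_args = TrainingArguments(
    output_dir="./peft-dialogue-summary-training",
    learning_rate=1e-3,          # LoRA typically tolerates a higher learning rate
    num_train_epochs=1,
    max_steps=1,
    logging_steps=1,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
peft_trainer.train()

# Saving the PEFT model writes only the adapter weights: tens of MB, not ~1 GB
peft_trainer.model.save_pretrained("./peft-dialogue-summary-checkpoint")
tokenizer.save_pretrained("./peft-dialogue-summary-checkpoint")
```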
Okay, now that we have the PEFT adapter, we're going to merge it with the original LLM, which is FLAN-T5, and use that to actually perform summarization. One thing to call out that's not entirely obvious is that when we do this, we can set the is_trainable flag to False. By setting is_trainable to False, we are telling PyTorch that we're not interested in training this model; all we're interested in is the forward pass to get the summaries. This is significant because we can tell PyTorch not to load any of the update portions of these operators and to minimize the footprint needed to just perform inference with this model. It's a pretty neat flag, and it had only recently been introduced into PEFT at the time of this lab. I wanted to show it here because this is a pattern you want to look for when you're doing your own modeling: when you know you're ready to deploy a model for inference, there are usually ways to hint to the framework, such as PyTorch, that you're not going to be training, and this can further reduce the resources needed to make predictions. Just to emphasize it, I print out the number of trainable parameters: 0%, because at this point we are only planning to do inference. Let's move on to that.

Here, we're going to build some sample prompts from our test dataset. We'll pick one essentially at random, index 200. We see the instruction model got it mostly right, I think, and the PEFT model starts to pick up a little more nuance. But really, we'll see how they compare quantitatively when we run the ROUGE metrics. So here, we're going to compare the human baseline to the original FLAN-T5, to the instruction full fine-tuned model, and then to the PEFT fine-tuned model. For the most part, just glancing at them, these look pretty similar. But let's take a look at the ROUGE metrics and see what's going on.

Here, we see the instruction fine-tuned model was a pretty drastic improvement over the original FLAN-T5. We see that the PEFT model does suffer a little bit of degradation compared to the full fine-tuned model. It's pretty close in some cases, so it's not too bad, and we used much, much less resources during fine-tuning than we would have with full instruction fine-tuning. You can imagine, this is only a few thousand samples, but at scale PEFT can really save you tons of compute resources and time.

Looking at the larger dataset (up above, I was only looking at maybe 10 or 15 examples), we see that PEFT loses about 1 to maybe 1.7% across all four of these ROUGE metrics relative to full fine-tuning. And that's not bad relative to the savings that you get when you use PEFT.
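A minimal sketch of attaching the adapter for inference-only use is shown below. The checkpoint paths, the prompt wording, and the generation settings are assumptions for illustration; the key point is is_trainable=False, which keeps the adapter weights frozen so no gradient state is needed.

```python
from peft import PeftModel

# Reload the frozen base model, then attach the LoRA adapter on top of it
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base", torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    "./peft-dialogue-summary-checkpoint",
    torch_dtype=torch.bfloat16,
    is_trainable=False,          # inference only: adapter parameters stay frozen
)

print(print_number_of_trainable_model_parameters(peft_model))  # 0% trainable

# Generate a summary for one test dialogue with the adapted model
index = 200
dialogue = dataset["test"][index]["dialogue"]
prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = peft_model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=200, num_beams=1),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```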