One of the most popular ways to fine-tune models is through parameter-efficient fine-tuning techniques. Let's take a look. Fine-tuning your whole model is pretty expensive. You're updating all the weights, which means you need extra GPU memory just to hold the gradients. That roughly doubles the memory. And then your optimizer has states to store as well, so you're looking at two to three times the memory of the model itself just to fine-tune. You also need more compute to calculate those gradients, and that translates to a lot more money overall to run all of that. So if you need multiple fine-tuned models for different tasks like you see here, you may need to run them on different GPUs. They're all very big, and they may not fit on a single GPU; it might be three separate GPUs here. Each model is also not very portable. It's hard to move between different servers.

So is there a better way? Well, actually, yes. And it comes from a really interesting finding. It turns out that the change in the LLM weights during fine-tuning, that delta W you see there, has a lot of redundancy. This doesn't mean the LLM weights themselves have a lot of redundancy. It's just the weight updates, the matrices that update the weights based on the gradients and the optimizer. A lot of the updates from new fine-tuning data that get applied to the weights are basically uninformative and don't have to happen. There's more noise than signal. To put it another way, you can represent the LLM changes during fine-tuning pretty accurately with far fewer parameters in that weight update matrix. It's as if you kept the signal and threw out the noise.

To see this empirically, you can take that weight update matrix and do singular value decomposition, or SVD, on it. You'll see that most of the information can be represented by the first few singular values, and that means the remaining directions contribute very little and can often be approximated away. In linear algebra terms, the weight update matrix can be approximated as, quote, low rank, a much smaller matrix. And that's a huge savings on parameters. What that means is that the updates can be smaller and the base model can just remain frozen. When trained, these smaller update weights are often called adapters. More intuitively, this means that updates to the weights during fine-tuning can be broader strokes rather than fine-grained. Maybe it should be called broad-tuning rather than fine-tuning.

Okay, so here's the analogy. On the left, you have multiple individual fine-tuned models. On the right, you get multiple adapters that share the same base model. Again, huge savings on compute and storage. Adapters are lightweight enough to fit multiple on the same GPU. In terms of inference latency, if you're running them all on the same GPU, you can swap adapters instead of full models.

Okay, so since the savings are so good, how does getting to that low rank matrix actually work? It's just some basic matrix math called rank decomposition. It's essentially lossy compression, a way to denoise the matrix and only keep the parts that are high signal.
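To make that SVD finding concrete, here's a minimal NumPy sketch. The numbers are made up for illustration: a synthetic 1000-by-1000 "weight update" with a rank-4 signal plus a little noise, not a real LLM's delta W.

```python
import numpy as np

# Build a synthetic "weight update" that is mostly low-rank signal plus noise.
# The 1000x1000 size and the rank-4 signal are illustrative choices only.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 1000))
delta_w = signal + 0.01 * rng.normal(size=(1000, 1000))

# Singular value decomposition of the update matrix.
u, s, vt = np.linalg.svd(delta_w, full_matrices=False)

# Most of the "energy" sits in the first few singular values.
energy = np.cumsum(s**2) / np.sum(s**2)
print("fraction of energy in the top 4 singular values:", energy[3])

# Keep only the top-r directions: a rank-r approximation of delta_w.
r = 4
low_rank = u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]
rel_error = np.linalg.norm(delta_w - low_rank) / np.linalg.norm(delta_w)
print("relative error of the rank-4 approximation:", rel_error)
```

On a matrix like this, the top few singular values carry nearly all the energy, which is exactly the redundancy argument above: you can throw away most of the directions and still reproduce the update almost perfectly.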
A large matrix, for example, could be 1000 by 1000 with a million parameters total, but it can be decomposed into two smaller matrices of rank two, which reduces it to only 4000 parameters. And that's insane, because that's 250x fewer parameters to train and store in memory. You can generalize that two to just R, the rank. R is a hyperparameter you use to set the max rank of the decomposed matrices. Ideally, the rank is much less than your original matrix dimensions; if it's the same, you're back to full fine-tuning. R equals four is called a good starting point in the original LoRA paper, but realistically it changes per task. What's really crazy is that a rank of one will also work, and even crazier, for reinforcement learning that rank of one has often been shown to work well. You also don't need the same rank for every single LoRA you put in, but that's research for another time. As with all hyperparameters, you find the right R empirically, based on how many LoRAs you're using, where you're putting them, and, most importantly, your data size. Smaller data sets and smaller changes can probably get away with smaller R's, and larger updates probably need larger R's. Your other hyperparameters will change too; for example, a 10x higher learning rate than full fine-tuning is often recommended for LoRAs. So needless to say, the benefits are pretty clear, and the impact on accuracy is pretty minor when you consider that the way you get to a better model is through iteration and smaller, faster updates, and this gets you a faster time to accuracy anyway. You can always dial up the R later if your task is big and requires a major update.

So now that you know what low rank decomposition is, here's how it's actually implemented. Zooming into one weight matrix here: the matrix gets an input x, transforms it, and the output is a hidden state, which is then processed by subsequent layers to calculate the loss and backprop. So this is just regular full fine-tuning. At some point, the weights of this matrix get updated. In full fine-tuning, all weights are changed, so that delta W is the size of all the weights. Now, to understand what's going on in LoRA, you can pull that update out separately and visualize that the full delta W can be approximated by those two LoRA matrices. These are multiplied together to form a delta W matrix, which takes the input x and outputs that hidden state just the same. The rest is the same, but what's really interesting is that your main weights are all frozen, and when backprop happens, you only do it through that delta W, through your LoRA adapters, and that saves you so much here. There's a small code sketch of this just below.

Okay, so maybe you get how LoRA adapters work now, but where does it actually happen in the model? If you look at a standard LLM architecture, you can see decoder blocks, and inside them feed-forward layers and self-attention layers. The original LoRA paper looks at adding adapters to the self-attention block. More recent work has also shown that you can apply LoRA to all layers, not just attention, so you can experiment with what works best for you. Specifically, the LoRA paper looked at adding LoRA to the query and value matrices visualized here, while the remaining weights are frozen. Okay, so you know where LoRA goes.
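Here's that small sketch of the forward pass in PyTorch. It's not the original paper's code or any particular library's implementation; the class name, the initialization, and the alpha-over-r scaling used here are just common choices, assumed for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base weight plus a trainable low-rank update B @ A."""

    def __init__(self, in_features, out_features, r=4, alpha=1.0):
        super().__init__()
        # The base weight stays frozen; only A and B receive gradients.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)

        # Low-rank factors: delta_W = B @ A has rank at most r.
        # B starts at zero so the adapter initially changes nothing.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Same hidden state shape as a plain linear layer:
        # h = W x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Back-of-envelope parameter count for a 1000x1000 weight with r=2:
# full update = 1,000,000 params, LoRA update = 2 * (1000 * 2) = 4,000 params.
layer = LoRALinear(1000, 1000, r=2, alpha=2.0)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable parameters:", trainable)  # 4000
```

Because the base weight has requires_grad turned off, backprop only produces gradients (and optimizer state) for the two small LoRA matrices, which is where the memory savings come from.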
If you're curious about digging deeper into LoRA, you'll often see the LoRA diagram drawn with these trapezoids, like in the original paper. That's meant to signal that the matrices shrink down to a smaller rank, though technically they'd be more accurately depicted as the matrix rectangles you saw previously. But anyway, using this diagram, you can imagine having a LoRA adapter for a custom code generation task, or different adapters for custom unit tests, and it's shown here for all the fine-tuning tasks. Keep in mind that you can also use LoRA to update a model's weights in RL. All right, one more important LoRA hyperparameter is alpha. Alpha scales how much the LoRA update matters relative to the original weights, and empirically you need to increase it as the rank increases. The default is 1. Again, it's something to tune.

All right, so you've heard all this talk about saving GPU memory; it's time to compare. The top is regular full fine-tuning, the bottom is LoRA. First, you have to fit the whole base model into GPU memory no matter what, so that part is the same. Second, you add space for your LoRA adapters. They're usually way smaller than what's shown here; the size is exaggerated just to visualize it. Then you need memory for the gradients. For the full model, that's gradients for all of it. For LoRA, it can be tiny, way tinier than what's shown here, even 0.1 percent; what's drawn here is a pretty fat LoRA at 25 percent. The next piece is the optimizer state, which depends on the optimizer but scales with the gradient memory. Finally, the forward pass needs a little extra memory for LoRA, especially if you expect to hot-swap adapters. LoRAs can also be fused back into the base model, giving the same computational efficiency as a regular forward pass. So, clearly, LoRAs let you save a lot here, and this depiction is actually generous to full fine-tuning; LoRAs can save you significantly more than this.

So, now that you understand LoRAs, go build with them. There are a ton of open-source frameworks to help you. The pros are that you can get started very quickly, sometimes even locally, and LoRAs are typically the way to go for getting started with fine-tuning anyway. The cons are the hyperparameter tuning; there are fewer good defaults out there than for regular full fine-tuning, although that's changing. And local training is typically only practical on smaller models, of course. Here's what it looks like in Hugging Face Transformers with their PEFT, or Parameter-Efficient Fine-Tuning, library. You can see the rank R and alpha hyperparameters here, and you can see the query and value matrices as the places where the LoRAs get attached; you can add your own. All you need to do is take any model and wrap it with a LoRA config (there's a minimal sketch of that setup at the end of this section). LoRAs are part of a much broader set of Parameter-Efficient Fine-Tuning, or PEFT, techniques that make fine-tuning, or just generally updating LLMs, much more efficient, both during and after training. More often you'll see it used for fine-tuning, but it's also getting used in RL, which you'll learn about next. Switching gears to reinforcement learning, you'll learn how rewards inform LLM updates instead of the target outputs used in fine-tuning.
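And here's that minimal sketch of the PEFT setup mentioned above, assuming a causal language model from the Hugging Face Hub. The model name is just a placeholder, and target module names like "q_proj" and "v_proj" are the query and value projections in many decoder architectures but vary by model, so check your model's layer names before copying this.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model name; any causal LM from the Hub works similarly.
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# r and lora_alpha are the rank and alpha hyperparameters discussed above.
# target_modules names the weights to attach LoRA to; "q_proj"/"v_proj"
# are the query and value projections in many decoder models, but the
# exact names depend on the architecture you load.
config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the LoRA adapters are trainable.
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```

From there you train as usual with your favorite trainer; only the adapter weights get gradients and optimizer state, which is exactly the memory story from the comparison above.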