So, next up, the goal is to get the model to actually minimize this term. This is the loss representing how far off the model's predictions were overall, and the smaller the loss, the better. With that loss term, you can now calculate how much, and in what direction, to update every single weight in the model to reduce that loss. You want to put more probability mass on outputting Sacramento and less on SF, and that's what backprop does. It works backwards through the model, from the very last layer, the one closest to the final loss calculation, to the very first one, and for each weight it calculates the direction and the magnitude of that weight's influence on the loss.

Going through this a bit more: backprop computes the gradients of the loss with respect to every single weight in the model. The gradients tell you how to change the weights to make the LLM more likely to output the correct answer, in this case Sacramento, and less likely to output everything else: less likely on SF, less likely on Boston and LA. A gradient is just the derivative of the loss with respect to a weight, and the calculation is applied layer by layer, backwards through the model, which is why it's called backpropagation.

To think of it more intuitively, imagine you're standing on a mountain. Your loss is your altitude, how high up you are; your weights are your x-y position, where you're standing; and the gradient is the slope, the steepness and the direction uphill. Gradient descent means stepping the opposite way, downhill, toward lower loss.

For cross-entropy loss, which is just the negative log likelihood of the target token, the derivative with respect to each output logit is simple: the predicted probability of that token minus the target probability. If you actually do the calculation, you can look at this column of gradients and see that Sacramento, the correct token, has a negative gradient, while everything else has a positive gradient.

Okay, so now you know how much and in what direction to change the weights; let's actually update the weights themselves. It's not as simple as just using the gradients directly. Take the output layer: your input, transformed by the model, turns into a last hidden state. This is a toy example, so the hidden state is usually much larger, but it essentially represents a semantic compression of your input, and it gets projected into a distribution of output token predictions over the vocabulary. Every token has a row of weights connecting it to the hidden state, and the gradient of the loss with respect to each of those weights is the output gradient times the hidden state. The hidden state scales how much to update each weight row, so it tells us how to adjust the weights to improve the predictions.

Then you can update your weights with this information. Gradients are ultimately local signals: they say push this weight up a bit, push that weight down a bit. But if you updated by the full gradient from one example, the model would probably overfit to that example and potentially forget others, and raw gradients are often too small or too large, which can lead to pretty unstable learning. That's where optimizers come in.
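To make that gradient calculation concrete before moving on to optimizers, here is a minimal PyTorch sketch of the cross-entropy gradient for the toy four-token vocabulary. The logit values are made up for illustration; the point is that the gradient with respect to each logit works out to the predicted probability minus the target probability, negative for Sacramento and positive for everything else.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary from the example; the correct next token is "Sacramento".
vocab = ["Sacramento", "SF", "Boston", "LA"]
target_index = 0

# Made-up logits for this position (illustrative numbers only).
logits = torch.tensor([1.2, 2.0, 0.3, 0.5], requires_grad=True)

# Cross-entropy loss = negative log likelihood of the target token.
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_index]))
loss.backward()

# The gradient with respect to each logit is softmax(logits) - one_hot(target):
# negative for "Sacramento", positive for every other token.
probs = F.softmax(logits.detach(), dim=-1)
one_hot = F.one_hot(torch.tensor(target_index), num_classes=len(vocab)).float()
print(logits.grad)      # from autograd
print(probs - one_hot)  # same values, computed by hand
```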
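And here is a similarly small sketch of the weight-gradient step for the output projection: each vocabulary token has a row of weights, and the gradient for that row is the token's output gradient scaled by the hidden state, i.e. an outer product. The hidden state and weight values are invented toy numbers.

```python
import torch

# Toy output projection: a 3-dim hidden state and the same 4-token vocabulary.
hidden = torch.tensor([0.5, -1.0, 2.0])        # last hidden state h
W = torch.randn(4, 3, requires_grad=True)      # one weight row per vocab token

logits = W @ hidden                            # project hidden state to vocab logits
probs = torch.softmax(logits, dim=-1)

target = torch.zeros(4)
target[0] = 1.0                                # one-hot target on "Sacramento"

loss = -torch.log(probs[0])                    # cross-entropy with that target
loss.backward()                                # autograd fills in W.grad

# The same gradient by hand: (output gradient) x (hidden state), an outer product.
grad_logits = probs.detach() - target          # dLoss/dlogits, as in the sketch above
grad_W_manual = grad_logits.unsqueeze(1) * hidden.unsqueeze(0)   # shape (4, 3)

print(torch.allclose(W.grad, grad_W_manual, atol=1e-6))   # True
```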
Optimizers basically help you decide how to use these gradients to update the weights. One of the simplest is stochastic gradient descent, SGD: you update each weight by subtracting its gradient scaled by a learning rate, called eta, and here eta is just set to 0.1, so you can see that scaling applied to the Sacramento row. In practice, LLMs use more advanced optimizers like Adam or AdamW, which you'll see some tricks for later. In an LLM with billions of parameters, this update happens in parallel for all weights, using gradients averaged across your batches of data.

Now that you've updated your weights, one really cool thing is that you can look at the change from just a single update. The graph here shows the distribution of all the weights in the model, blue being the old ones and orange being the new ones, and the entire distribution shifts slightly after just one update step. The orange weights should get the LLM closer to Sacramento, hopefully.

These days the steps are all wrapped up in a Hugging Face trainer class, so you can just specify the model, your training arguments, and your datasets. The class is called SFTTrainer; SFT stands for supervised fine-tuning, which again is the type of fine-tuning we're doing here. In your hyperparameters for fine-tuning, it's important to set completion_only_loss to true, which means you only calculate loss on the outputs, typically called completions, and not on your inputs. Then you just call trainer.train to kick off the training loop we just walked through.

It's worth taking a quick look inside the trainer, dropping down a level into PyTorch, to understand what's going on in there. Inside that .train call, for every training epoch, meaning a full pass over your whole dataset, and for each small batch, a set of input/target pairs, the model predicts some outputs, and the loss is calculated on those outputs versus the target outputs with the cross-entropy loss from before. The loss is then used in backprop to calculate gradients for all of the weights. Finally, the weights are updated using the gradients, with the optimizer deciding what direction to step in and how large a step to take. That cycle repeats for the next batch, and you walk through all the data num_epochs times.

So that's it: you've gone through the nitty-gritty math of how fine-tuning works. Next up is a deep dive into those hyperparameters and how to tune them so your model can actually learn effectively. Before that, the sketches below make the SGD update, the trainer setup, and the inner training loop a bit more concrete.
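First, the SGD rule described above, new weight = old weight minus eta times the gradient, written out in PyTorch. The weights and gradients here are random stand-ins, and torch.optim.SGD applies the same rule; LLMs typically swap in torch.optim.AdamW.

```python
import torch

# The SGD update rule: new_weight = old_weight - eta * gradient, with eta = 0.1.
eta = 0.1
W = torch.randn(4, 3)          # toy weights (one row per vocab token)
grad_W = torch.randn(4, 3)     # stand-in for gradients produced by backprop

W_updated = W - eta * grad_W   # the update, written out by hand

# In practice you let torch.optim apply the same rule.
param = torch.nn.Parameter(W.clone())
optimizer = torch.optim.SGD([param], lr=eta)
param.grad = grad_W
optimizer.step()

print(torch.allclose(param.detach(), W_updated))   # True: same result
```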
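Next, the trainer setup might look roughly like this. The model name, dataset name, and output directory are placeholders, and the completion_only_loss flag assumes a recent version of TRL where SFTConfig exposes it.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset of prompt/completion pairs.
train_dataset = load_dataset("your-org/your-sft-dataset", split="train")

args = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    completion_only_loss=True,   # compute loss only on the completions, not the inputs
)

trainer = SFTTrainer(
    model="your-org/your-base-model",   # a model ID or an already-loaded model
    args=args,
    train_dataset=train_dataset,
)

trainer.train()   # kicks off the training loop walked through above
```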
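Finally, a schematic version of the loop inside trainer.train(), with a tiny toy model and random token batches standing in for a real LLM and dataset so it runs on its own; the sizes and learning rate are made up.

```python
import torch
import torch.nn.functional as F

# Tiny stand-in "model" and fake data; a real run would use the LLM and a tokenized dataset.
vocab_size, hidden_size, num_epochs = 100, 32, 2
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden_size),
    torch.nn.Linear(hidden_size, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

dataloader = [  # fake batches of (input tokens, target tokens)
    (torch.randint(0, vocab_size, (4, 8)), torch.randint(0, vocab_size, (4, 8)))
    for _ in range(10)
]

for epoch in range(num_epochs):          # a full pass over the dataset per epoch
    for inputs, targets in dataloader:   # one small batch of input/target pairs
        logits = model(inputs)           # the model predicts a distribution per position
        loss = F.cross_entropy(          # cross-entropy of predictions vs. targets
            logits.reshape(-1, vocab_size), targets.reshape(-1)
        )
        loss.backward()                  # backprop: gradients for every weight
        optimizer.step()                 # optimizer decides direction and size of each step
        optimizer.zero_grad()            # clear gradients before the next batch
```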