In this lesson, you'll learn about fine-tuning: how to make it efficient with parameter-efficient fine-tuning (PEFT) techniques, and how to remove hallucinations with memory tuning. You'll explore these variants of fine-tuning and the applications each one is the right tool for. Finally, you'll debunk a few myths about fine-tuning's cost and ease of use, a debunking that is much more true this year than last.
So, first of all, why should you fine-tune at all? Well, you can fit in more data than actually fits into the prompt. Instead of stuffing the prompt with all the information about the SQL schema and having the base model produce the SQL query, you're putting that table schema information into the weights of the fine-tuned model itself. That's why the model is able to essentially retrieve it from its own weights as it does those computations and come up with the right result. You can also learn from data rather than just getting access to it. What learning from data means is that the model can produce results that follow a certain type of user experience, and it learns the information as well. This gives you deeper control of the LLM so you can get it to do what you want. And ultimately there is no accuracy ceiling. It takes effort to get to higher accuracy, but there is essentially no ceiling. You can continue to improve the model over time.
There are multiple types of fine-tuning, and in this lesson you'll learn about two of them. One is a very popular form called instruction fine-tuning. That's getting a pre-trained LLM (we'll go over what pre-training is in a second) to follow instructions, so that it can take a question and respond to it. That's what Meta has done to Llama 3 to turn it into Llama 3 Instruct.
Memory tuning is another type of fine-tuning, and that's getting an LLM, perhaps an already fine-tuned LLM, to not hallucinate. So Meta's Llama 3 Instruct is very general purpose, and memory tuning is getting it to not hallucinate on certain facts. Instruction fine-tuning, by comparison, is the same process as getting ChatGPT from the original GPT-3 model. These are the two types of fine-tuning you can see here; multiple other forms of fine-tuning are also possible.
To step back, I mentioned pre-training. So what exactly is pre-training a model, and why is it not enough? Pre-training is a process of reading the entire internet one token at a time, kind of like autocompleting the internet. The model is trained to predict the next word. That makes this kind of training very easy to set up, because all the model has to do is autocomplete, and we already have the right responses: the next word, the next token. The pre-training objective is for the model to reduce what's known as average error over all the examples, or generalization error, so it's about average correctness over all examples. This creates very powerful foundation models that are able to do many, many tasks. They have very general capabilities and learn a lot.
But there are gaps in how useful they are. One is that they can't follow instructions. When you ask, "What's the capital of France?" the model might respond, "What's the capital of Spain?" because that's what it has seen, in surveys for example, and it thinks it's in that setting. It doesn't know how to respond to your question. So it doesn't have the chat in ChatGPT, so to speak. The LLM will also hallucinate on facts, even ones it has seen many times over the internet. And it doesn't necessarily know about your proprietary data, or know whether to follow one thing that's popular on the internet or another.
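The next-token objective described above can be sketched in a few lines. This is a toy illustration, not how an LLM is actually trained: it assumes a count-based bigram "model" and a two-sentence corpus, both made up for the example. It shows how the average error over all examples can be low while the model is still a coin flip on the one token that carries the fact.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for one context into a probability distribution.
    Only valid for contexts that appeared in the training tokens."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_loss(counts, tokens):
    """Average negative log-likelihood of the true next token over all positions."""
    losses = [
        -math.log(next_token_probs(counts, prev)[nxt])
        for prev, nxt in zip(tokens, tokens[1:])
    ]
    return sum(losses) / len(losses)

# Hypothetical toy corpus standing in for "the internet".
corpus = "what is the capital of france ? what is the capital of spain ?".split()
model = train_bigram(corpus)

avg = average_loss(model, corpus)          # low on average ...
after_of = next_token_probs(model, "of")   # ... but a coin flip on the fact
print(f"average loss: {avg:.4f}")
print(f"next token after 'of': {after_of}")
```

Most positions in the corpus are perfectly predictable, so the average loss is small, yet "france" and "spain" are equally likely after "of". That is the gap between average error and getting one specific fact right.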
So, in sum, the pre-trained model is pretty good at everything, but perfect at nothing. And it turns out, despite our flaws, you and I are, in fact, perfect at some things. I remember my name, my birthday, and so on. So we are in fact perfect at certain facts. And ultimately these hallucinations, as you've seen already, are a result of treating the nearly right answer, the slightly right answer, as the same as right, when in the case of facts, that is wrong.
So fine-tuning is helpful for every new foundation model. Meta might produce these Llama models. Llama 3 is where we're at. Eventually they'll have Llama 4, 5, 6, and so on. After a pre-training step, all of them will have to be instruction fine-tuned into an instruct model, and then, for your particular application, memory tuned so that they understand your facts, whether that's financial data, customer data, operational data, legal data, or SQL data.
So what exactly is memory tuning? It's teaching the model to have perfect recall on facts. To dive into this a little more, it's actually reducing the error to zero on these facts. As a result, the model is near perfect on these facts and still pretty good at everything else. So it's no longer perfect at nothing; it's near perfect on these things.
Before memory tuning, when the model was asked, "What year did Dave Aguilar climb the Golden Gate Bridge?" and it began answering, "He climbed in", it had to produce a date, where the right date is 1981. The model had to sample from a distribution of possible dates, and it could sample 1981, the correct answer, or 1970. Now, this is already a huge improvement over an untrained model, which might sample the word "cat" with the same probability as 1981. So that's already much, much better.
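The sampling picture above can be made concrete with a small sketch. All the probabilities here are hypothetical numbers chosen for illustration, not measurements from any model. The loss on a fact is the negative log of the probability given to the correct token, and it reaches zero only when the model puts all of its probability on 1981.

```python
import math

def loss_for_fact(probs, correct):
    """Negative log-likelihood of the correct token under a next-token distribution."""
    return -math.log(probs[correct])

def chance_of_wrong_answer(probs, correct):
    """Probability that sampling from the distribution returns anything but the fact."""
    return 1.0 - probs[correct]

# Hypothetical next-token distributions after "He climbed in".
untrained = {"1981": 0.25, "1970": 0.25, "cat": 0.25, "the": 0.25}
fine_tuned = {"1981": 0.60, "1970": 0.35, "cat": 0.025, "the": 0.025}
memory_tuned = {"1981": 1.0, "1970": 0.0, "cat": 0.0, "the": 0.0}

for name, probs in [
    ("untrained", untrained),
    ("pre-trained + instruction fine-tuned", fine_tuned),
    ("memory tuned", memory_tuned),
]:
    loss = loss_for_fact(probs, "1981")
    wrong = chance_of_wrong_answer(probs, "1981")
    print(f"{name}: loss={loss:.3f}, chance of a wrong year={wrong:.0%}")
```

In this toy setup the fine-tuned distribution is far more reasonable than the untrained one, but it still returns a wrong year a large fraction of the time. Only the distribution with zero loss on the fact can never sample 1970.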
So the loss has been brought down a lot, so that the probability distribution we're sampling from is much more reasonable for what could be true. That is what a pre-trained and instruction fine-tuned model gives you. The problem is that slightly right is still not necessarily right, and slightly right isn't good enough for facts. So what memory tuning does, for these specific facts, is bring the loss to zero so that the model absolutely has to produce 1981. There is no alternative; it can only commit to that as the right answer.
Back to instruction fine-tuning. What is it useful for? What are its applications? We mentioned chat: it's putting the chat in ChatGPT. It can also be used for function calling, basically changing the behavior of the model, the UX of the model, so that it produces, essentially, API endpoints and results. More broadly, it can take any prompt-and-response format and really change the interaction. It doesn't have to be a chat situation. Chat is just what we're most familiar with and how we as human beings can very easily interface with the model. And for SQL, it's producing SQL.
One of the myths out there, which I think most people now understand is not the complete picture, is that prompting and RAG will solve all my problems: basically, whatever I can put into the context window can solve everything. For following instructions, let's say you didn't instruction fine-tune the model. If you use prompting to get the model to follow instructions, you can actually get it to be a little bit better. Before ChatGPT came out, you could put some pretty complex prompts into GPT-3 and try to get it to respond in a chat-like way by giving it many, many examples.
But that's still not enough to shift the model out of essentially autocompleting the internet; it just wasn't enough to get it to follow those instructions consistently. The same goes for fact recall. Using RAG is often a way to do a little bit better. It does shift the probabilities from the prompt as the model produces the next token, so that whatever was retrieved and added to the prompt encourages, for example, a closer date or a more relevant date in similarity space. But it's still dealing with similarities only. And similarities don't fully solve the problem of picking the correct answer when a merely similar answer is not correct at all. So it's still sampling from a distribution of similar but wrong facts.
The next myth is that fine-tuning is still too expensive, and I want to go over a few techniques that folks are employing today. The reality is that it is, in some cases, actually cheaper than running very large prompts in RAG, since filling up the context window is quite expensive at inference time. Techniques like parameter-efficient fine-tuning have been able to reduce the cost by around 10,000 times, a dramatic efficiency gain at the same level of accuracy. Then there's a technique called MoME, or mixture of memory experts. It's related to memory tuning, and it turns any LLM into a million-way mixture of expert adapters. Those adapters use PEFT, parameter-efficient fine-tuning, and that reduces the time by 240 times. So there are many ways to dramatically reduce the time for fine-tuning, and I think the future might look interesting, with fine-tuning taking about as long as building a RAG index. But these efficiency gains are fairly difficult to get. You have to implement the techniques correctly to see the actual gains. Otherwise, the myth is true: it is extremely expensive.
I mentioned two things there: parameter-efficient fine-tuning and the mixture of memory experts. Let's go over visually what is really going on. In PEFT, parameter-efficient fine-tuning, one very popular technique is called LoRA, which stands for Low-Rank Adaptation. What's happening, as you can see here, is that there's a weight matrix. Let's say this is a feedforward layer in the transformer model; you can think of it as just some weights of the model. Instead of fine-tuning it directly and changing those weights, let's tune this set of LoRA weights that are external to the model and smaller, so that it's much more computationally efficient. We only have to do what's known as backpropagation, basically learning, on the LoRA weights, and we don't touch the main weights at all. Then, when we're done fine-tuning the model and we're running inference, we can fuse those LoRA weights back into the model, so that it runs just as efficiently, with the same latency and speed, as the original model. That fusing process, I'll just say, uses math: low-rank adaptation is basically finding low-rank matrices. So that's what that process looks like.
In the MoME model, it's a very similar technique. The LoRA adapters are still there. But in addition to those LoRA adapters, you have an array of what are known as memory expert weights that you're also tuning. At every stage where you have those adapters, you sample a subset of the memory experts, the ones that contain the understanding of facts learned from your data, and you fuse those into the adapters themselves.
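The LoRA idea described above, training small external weights and then fusing them back in, can be sketched as follows. This is a minimal illustration in plain Python lists so it has no dependencies. The matrix sizes, the rank, and the random values are all assumptions for the example, and it leaves out details of real LoRA implementations such as the scaling factor and zero-initializing one of the two factors.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

rng = random.Random(0)
d_out, d_in, rank = 8, 6, 2  # tiny, hypothetical sizes

W = random_matrix(d_out, d_in, rng)  # frozen base weights, never updated
B = random_matrix(d_out, rank, rng)  # LoRA factor (would be learned by backpropagation)
A = random_matrix(rank, d_in, rng)   # LoRA factor (would be learned by backpropagation)
x = random_matrix(d_in, 1, rng)      # one input, as a column vector

# While fine-tuning: base path plus the low-rank adapter path.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# After fine-tuning: fuse the adapter into the base weights once.
W_fused = add(W, matmul(B, A))
y_fused = matmul(W_fused, x)  # a single matmul, same cost as the original layer

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_adapter, y_fused))
print(f"largest difference between the two paths: {max_diff:.2e}")

# Parameter count for a larger, still hypothetical, 4096 x 4096 layer at rank 8.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(f"full: {full:,} trainable weights, LoRA: {lora:,} ({full // lora}x fewer)")
```

The fused matrix has the same shape as the original weight matrix, which is why inference latency is unchanged after fusing. The parameter ratio printed here is for this one layer shape and rank only; it is not the cost reduction figure quoted in the lesson.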
This makes it so that you can grow your model via the memory experts and get essentially the intelligence of a huge model with the cost and latency of a smaller model. A common way of putting this is calling it a sparsely activated model.
The last myth is that fine-tuning is too hard for people to roll out. While fully managed fine-tuning does exist, I will say rolling your own fine-tuning is hard, for many different reasons. One is that it isn't efficient. So, leaning into this myth a bit, it can take a lot of compute to get the same accuracy if you're not implementing it efficiently, for example if you can't parallelize efficiently across multiple GPUs. I see a lot of lost, idle compute where people are not using GPUs to their full capacity while fine-tuning: certainly not full capacity, and often not even half or a quarter. It often crashes on real use cases. You can't continuously fine-tune because of errors running across multiple GPUs, and the integration between fine-tuning and inference is not necessarily seamless.
Another pain is that the LLM doesn't seem to improve. It can be hard to tune for a use case on a specific model and a specific data set. Sometimes people say, "Well, this LoRA thing can't get the same accuracy." And I say, well, it actually can if you get the right hyperparameters, and they are different hyperparameters from regular fine-tuning. Another is that it's not necessarily easy to use, and it's hard to scale. The issues are not always related to AI or data; often they're GPU and memory issues. And finally, integrating fine-tuning with inference can be buggy as well. These systems are often split up, with pre-training versus inference.
That makes it very difficult to transfer model weights effectively across those different formats. And finally, folks often use the wrong tool for the job. Instruction fine-tuning doesn't necessarily solve hallucinations. It's still optimizing for average error over all of its examples, just like pre-training, except the set of examples is smaller. It's not bringing the loss to zero.
So what does fully managed fine-tuning in your environment look like? It can be a one-line call, or a few lines, to run parameter-efficient fine-tuning and memory tuning. That can happen on platforms like Lamini, and there are a few others where you can go try it and see it running. So here it's just instantiating the arguments, the weights, the data set, and the model itself, and then calling llm.train, and you'll get to run this in the next lab.
So, fine-tuning for your app: it's important to focus on specifics. Data is probably the most important thing, and you're going to go over that in the next lesson. It turns out you actually have more data than you think. Having this mindset will enable you to really win, because you'll build the right pipelines to have valuable data and to transform your current data into valuable data. As for evaluation, which you just went over, you don't necessarily need a fancy tool to get started, but tools can sometimes help with tracking your work and understanding how evaluation connects deeply to your use case. Evaluation is often pretty custom, as you saw with the SQL app so far. So it's important to get this right and to continually iterate on it. It's okay if you don't get it right the first time. And with that, let's move on to the next lesson, where you'll generate the data for fine-tuning and then fine-tune.