In this lesson, you'll go through one of the most important steps: generating data and fine-tuning. Without much ado, let's get building.

The first thing is, you usually have more data than you think, and it's important to first take stock of the data you do have. Figure out whether you have all the facts needed to teach the model about this new task of yours. If you gave that data to a person and they had infinite time to go through all of it, could they actually figure out the right answer? Usually, you do have enough data. The problem is that it's usually not in the right format. Historically, putting it in the right format was a laborious manual process, but today you don't have to do much manual data labeling or cleanup to reformat it. You have LLMs, and they're here to the rescue as long as you can specify what the right format looks like. So in this diagram down here, you have tons of data and maybe only a little bit of clean data, but you can run a batch job with an LLM, or a pipeline of LLM calls, to generate data and get all of it clean and ready to be put into the model.

Identifying the right format is important, so let's go over a few examples. It's highly application-specific. For a SQL application, for example, you might expect a prompt like "Given the schema and question, write a SQL query to answer: who is the highest paid player in the NBA?" and the expected response is a SQL statement. That's what we've been exploring in this course. An alternative is just asking "Who is the highest paid NBA player?" and expecting the model, or the agent, or a pipeline of models, to respond in natural language with, for example, the result of that SQL query. So that would be a different expected response for a similar task. Another example, for a different task, is a medical application: you might give the symptoms of something and have to produce the ICD code for it. So that's another example of an expected prompt and expected response. This can be very application-specific and highly varied, but understanding what kind of prompts you expect the model to take in and what kind of responses you expect it to output is very important in building out your application.

Okay, so specifically for SQL, and more broadly when data seems sparse: again, take stock of what you have, then get creative and work backwards to generate data. In text-to-SQL, you actually only need the schema to make this happen, and I think that's really surprising to people when they first hear it. Here's why. The LLM already knows how to write SQL in general. It might not know about your schema, or the complexities of your schema in particular, but it knows how to write general SQL and can operate on the basic pieces of that schema. And next, the schema actually contains all the facts you need the LLM to know. This screenshot down here of the schema that you put into the prompt has all the facts it needs, just as if you were to hand it to a person, like a data analyst, to operate on. You're not expecting the LLM to learn facts from thin air, or to just happen to know them. That would be a much more daunting task.
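To make the prompt/response format concrete, here is a minimal sketch of what one text-to-SQL training pair might look like. The nba_roster table name, its columns, and the JSON field names are illustrative assumptions for this sketch, not the lab's exact files.

```python
# Illustrative only: one prompt/response pair in the text-to-SQL format described above.
# The nba_roster table, its columns, and the field names are assumptions for this sketch.
example_pair = {
    "question": (
        "Given the nba_roster table (Team, NAME, POS, AGE, HT, WT, COLLEGE, SALARY), "
        "write a SQL query to answer: who is the highest paid player in the NBA?"
    ),
    "sql": (
        "SELECT NAME, SALARY FROM nba_roster "
        "ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',', '') AS INTEGER) DESC "
        "LIMIT 1;"
    ),
}
```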
But you actually do know where all the facts about your schema and your database are. Okay, so what does working backwards look like? You can go from the schema to generating new queries. In this workflow, in this prompt template, you'll see that you first tell the LLM it's an analyst, then put in the schema information and give an example, so an example of what a question looks like and what a query looks like, and then ask it to write a query that's similar to but different from those above and to format the response as this JSON object. So: write me some queries.

Okay, so now that you've generated SQL queries, you can take those and generate user questions in turn. Let's take a look at what that prompt looks like. You again tell the LLM it's an NBA analyst, you give it the schema (you're probably an expert at this already), you give it an example of a question and a query, then you put your generated SQL query in and ask for a question that this query could be used to answer. It would output a question like "How many players on the Chicago Bulls are 25 or younger?"

All right, so what are some practical tips you can follow? One, as you just saw, is adding query and user-question examples, which can be really helpful. This is called few-shot or in-context learning, and adding those examples helps the LLM understand what right looks like. One pro tip is to include corrected hallucinated examples: hallucinations you had previously seen, but now corrected. That helps the LLM understand what it needs to learn and create examples similar to what it previously hallucinated on. Generating variations is incredibly important. You want to reach as much breadth as possible without writing those examples yourself, so getting the LLM to assume different personas, not just an NBA analyst, to generate different questions and different queries can be really helpful as well. It depends on who your end user is. Maybe it's not just NBA analysts; maybe some people are more senior and want an executive-level understanding of what's going on in this database. That's how you can generate additional variations.

Next is filtering generations. As you generate more and more data, you will inevitably get some things that are incorrect; that's the whole point of trying to improve these models. So being able to rely on automatic filters to scalably downsample your generated examples into a higher-quality dataset is incredibly important. What's special here is that you can use very similar methods to the ones you implemented during evaluation, both the LLM-based ones and the more deterministic ones. You can check whether something is valid SQL, for example. And finally, it's worth really digging through what worked and what didn't, and using your head to classify what's working, what isn't, and what the patterns are in each. This is called error analysis in AI and machine learning, and it's an important skill that helps you figure out what to generate next and what to filter next. Usually, in this specific case, the more complex queries are harder, and that's a good observation.
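As a rough sketch of that working-backwards prompt, here is one way the query-generation template might be assembled in Python. The persona wording, example formatting, and JSON key are assumptions and will differ from the lab's exact template.

```python
# Sketch of the "work backwards from the schema" prompt: schema + one gold example in,
# a similar-but-different SQL query out. Wording and the JSON key are illustrative.
def make_query_generation_prompt(schema: str, example_question: str, example_sql: str) -> str:
    return (
        "You are an NBA analyst with years of experience writing SQL queries.\n\n"
        f"Consider the following table schema:\n{schema}\n\n"
        "Here is an example question and query:\n"
        f"Question: {example_question}\n"
        f"Query: {example_sql}\n\n"
        "Now write a query that is similar to but different from the one above. "
        'Format the response as a JSON object with a single key, "sql_query".'
    )
```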
And this helps you adjust the prompts and get more specific in trying to generate those queries. So now that you have all the data, what does the fine-tuning process actually look like? Here are some minimum requirements. You only need one data point to do memory tuning. Technically, you also only need one data point to instruction fine-tune; it just won't make the model capable of continuing to generalize very well after that. For instruction fine-tuning, I usually suggest around a thousand data points. Fine-tuning really needs these pairs of prompts and responses, essentially the format you were specifying before. Memory tuning in particular needs those prompts and responses to include the facts you want the LLM to actually learn, especially in the responses.

And what does ease of use look like for fine-tuning? Well, you could use a library like Lamini, which manages fine-tuning for you, so it's just a one-line API or Python call. You can also roll your own, with hyperparameter tuning and calling the model forwards and backwards yourself. Here's an example of running fine-tuning: you instantiate the model, you get the dataset, and then you call llm.train on that dataset with your arguments.

Okay, so fine-tuning doesn't come totally for free, so it's worth understanding the time and compute requirements. Instruction fine-tuning, specifically with LoRA, gets you to an accuracy of about 50%, but it's really fast: it only takes a couple of minutes, 2.5 minutes, and 19.2 petaflops of compute. This is all benchmarked on one NVIDIA A100 running at a solid roughly 40% MFU, or the equivalent AMD GPU, the MI250. Memory tuning is quite a bit more intensive when you use the unoptimized version. This is still using LoRA, just unoptimized, and the accuracy on a thousand facts can go up much higher, because the whole point is to reduce hallucinations. The reason it takes so much longer and so much more compute (you can see 1.9 exaflops in that second row, and four hours) is that you're bringing the loss to zero on the specific facts. With Lamini Memory Tuning, we've implemented the optimizations you saw previously, and the time needed is only about a minute, at 24 petaflops.

Okay, what should you expect? That all sounds fine and good, hugs and rainbows, but it's not just boom, you fine-tune once and your life is good. It realistically takes many iterations, and that's why iteration as a concept is so important. It can take 10 to 30 iterations, I think about 20 for what you're about to do, over different data pipelines to get to that 90-95% accuracy. Fine-tuning multiple variants in parallel can speed up your experimentation; iterations don't necessarily have to be sequential. If you're experimenting with, hey, if I change my data pipeline in this way, will it influence the model to improve in this direction versus that one, those are orthogonal paths you can explore in parallel during fine-tuning. And the exciting part is that you can start talking about nines: you can start eking out accuracy from 95% to 99.9%. That makes it very exciting, because there are many new applications that get unlocked at that new frontier of accuracy.
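Here is a minimal sketch of that one-line training call with the Lamini client. The keyword names (data_or_dataset_id, finetune_args, is_public) and the finetune argument shown are assumptions based on what this lab describes; check them against the client version you actually have installed.

```python
# Sketch of the "instantiate, load data, call train" flow with the Lamini client.
# Keyword names and finetune_args keys are assumptions; verify against your installed client.
from lamini import Lamini

llm = Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")

dataset = [
    {
        "input": "How many players are on the Chicago Bulls?",                    # prompt
        "output": "SELECT COUNT(*) FROM nba_roster WHERE Team='Chicago Bulls';",  # response with the facts to learn
    },
    # ... on the order of 1,000 pairs for instruction fine-tuning
]

llm.train(
    data_or_dataset_id=dataset,
    finetune_args={"max_steps": 300},  # hypothetical value; the lab uses its defaults
    is_public=True,
)
```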
Note that every nine you add takes far more effort than the previous one, so it's a game of effort to get to that level of accuracy. But that usually goes hand in hand with how critical it is to be accurate for that use case.

Now, how can you become a wizard at this? Well, if you have the facts somewhere, you have enough data. You should be able to find a way to use LLMs to take care of the rest. It does take some creativity sometimes to think, how can I use LLMs in these ways, piping them together, to get the data I want from the data I already have. From there, your data pipelines can get pretty intricate, and you might be calling the model many, many times. But because it's a one-time batch process to transform your data, it can be worth it. What might make it feel even more worth it, if you're running a startup or an enterprise, is that this is your AI moat: it can help you build up something quite unique. Another skill for being a wizard is spotting LLM mistakes, doing that error analysis on hallucinations to teach your LLM what correct actually looks like. If you don't know what that frontier looks like, it's very hard to tell the LLM what the right objective is. Again, the LLM is super good at optimizing, but it doesn't know what to optimize for. If you can tell it what to optimize for, and it's within reach, then it can get there. One tip that's not necessarily on this slide: for certain applications, you're sitting next to the person who evaluates the model and can tell you what is a mistake and what isn't, and sometimes you're not. The easiest applications, and as a result the ones that get built out the fastest, are those where the people who can tell when the LLM is making a mistake sit very close to the people developing it. That's why I see a lot more code-related applications: the developer is the evaluator. So it's important to think about how you organize your team so you can develop these applications efficiently.

Okay, so first things first: we've got to load those environment variables so that we can use the GPUs. I'm importing Lamini here, and I'm also going to import some more arguments as standards so that we can run Llama 3. Next up, we want to generate close variations of your reference data, but not exactly those, of course. So first let's create a system prompt for one of these examples. Let's just go through one example, and then we can systematize this over all our data. The system prompt says you're an NBA analyst; let's actually see what this is. This is adding that schema in, and then "consider the following questions and queries." This is just the schema. Let's add an example, so an example question and query: we're adding a question and its SQL into that system prompt. Then let's ask it to write two queries that are similar to but different from this. All right, let's see what the full system prompt is: you're an NBA analyst, for one; here it has this example; and then we ask it to write similar queries. That goes in the user prompt, so let's look at that real quick: "Write two queries that are similar but different from those above. Format the queries as this." Okay, great. So that's how you're trying to get variations.
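As a sketch of that setup step: load the environment variables (the LAMINI_API_KEY name here is an assumption) and wrap the system and user text in the Llama 3 Instruct chat template before sending it to the model. The lab's helper may differ in the details, but the template tokens are the standard Llama 3 Instruct ones.

```python
# Sketch of the setup and prompt formatting described above. The .env variable name and the
# helper are assumptions; the chat-template tokens are the standard Llama 3 Instruct ones.
import os
from dotenv import load_dotenv
import lamini

load_dotenv()                                  # pull the API key from a local .env file
lamini.api_key = os.getenv("LAMINI_API_KEY")   # assumed variable name

def make_llama_3_prompt(user: str, system: str = "") -> str:
    """Wrap system + user text in the Llama 3 Instruct chat template."""
    system_block = (
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>" if system else ""
    )
    return (
        f"<|begin_of_text|>{system_block}"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```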
You know, here we're doing the programmatic chain-of-thought thing: I'm going to ask it to first write an explanation of why it did what it did, and then just print that out so you can see it. And make sure each query is complete and ends with a semicolon; this helps. It's a prompt engineering thing, so if you didn't add that before, add it in to help the model produce valid SQL queries. Then let's make our prompts and run the model. Okay, great. So here's an explanation of what's going on, and then a couple of queries related to it. Great.

Next, let's check whether those queries are valid; I think we've run this before. Here is the check_query_valid function, and let's check whether both of those generated queries are valid or not. The first one is; let's see if the second one is. Great, they both are. This will be an important filter, as you can imagine. Next, let's wrap this all in a class called ModelStage. This looks like a lot, but you've already gone through the hardest parts of it. The class is ModelStage: this is again just formatting the prompt to produce an explanation and the two SQL queries, here it checks whether those queries are valid, here's the prompt we went through together, and that's just calling the check-SQL-query function.

All right, so you have some generated SQL queries. How about we do the same thing and generate user questions from those generated SQL queries, just like you learned about before. First, the system prompt side is very similar to before: you're an NBA analyst, there's the schema, there's an example. Then let's grab one of those generated queries from above and wrap it into the user part of the prompt: "Now consider the following query," you put in your generated SQL query, and then "Now write a question that this query could be used to answer." I'm going to ask for that programmatic chain of thought again, so ask for an explanation and then the question to get a better response; again, that's more of a prompt-engineering thing. And then return the results. Boom. Okay, cool. Great. Now let's wrap this all together in a class called QuestionStage. Again, I know these sometimes look intimidating, but you actually went through all of this, and the meat of it is over here. We'll be using these classes later on so that we can systematically go through the entire dataset.

So now we want to create this query generation pipeline. This is the code for it: it goes through the model stage and then the question stage. That's the pipeline. I'm going to run it in an asynchronous way over all of your gold queries, essentially, so those can serve as examples to model after as you generate similar questions. Next, I'm going to go through how to gain some variation and also how to control how large a dataset you generate. This code here loads your gold-standard dataset, your evaluation set, and goes through those examples, and there's an argument for the number to generate. The way this works is that for a given number to generate, it uses a certain sample size and samples a certain number of times. So maybe that's ten: you sample ten times, and each time you sample three random examples from your gold set. The model then sees three different examples each time, and that enables it to create variation.
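Here is a minimal sketch of those two pieces: a structured call that asks for an explanation first (the programmatic chain of thought) followed by two SQL queries, and a deterministic check_query_valid that simply tries to execute each query against the SQLite database. The prompt text, output_type keys, and database filename are assumptions; the lab's versions will differ.

```python
# Sketch of structured generation plus a deterministic SQL validity filter.
# The prompt wording, output_type keys, and nba_roster.db path are assumptions.
import sqlite3
import lamini

llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")

prompt = (
    "You are an NBA analyst. Schema: nba_roster(Team, NAME, POS, AGE, HT, WT, COLLEGE, SALARY).\n"
    "Write two queries that are similar but different from your earlier examples. "
    "First write an explanation of your reasoning, then the queries. "
    "Make sure each query is complete and ends with a semicolon."
)

result = llm.generate(
    prompt,
    output_type={"explanation": "str", "sql_query_1": "str", "sql_query_2": "str"},
)

def check_query_valid(sql: str, db_path: str = "nba_roster.db") -> bool:
    """Return True if the query executes against the database without an error."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

for query in (result["sql_query_1"], result["sql_query_2"]):
    print(check_query_valid(query), query)
```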
So you can imagine, as an example of this, in one sample it might get the question and query for "What's the median age of people in the NBA?", another might be "What's the average salary?", and another might be "How many people in the NBA are under the age of 25?" Those might be the three random ones in one sample, and in another sample it would grab three different ones at random, and that basically nudges the model to generate different queries. It's a simple way of getting it to generate a much larger dataset, which, as you'll see shortly, is very helpful, because it turns out generating these queries, and generating data in general, is much harder than filtering it.

Okay, next you definitely need to save your results. Here is a function for saving those results. All it's doing is grabbing the question and query in each sample and saving them as pairs, so you can use them later for fine-tuning. Great. So now let's actually run through this data generation pipeline. Fantastic. 27. Now let's take a look at what these results look like, so we can cat the generated_queries.jsonl file, which is specified in the args file. Would you look at that. If you inspect closely, they're obviously not perfect: a lot of these are invalid, and there might be duplicates in here, but it is a decent amount of data. So instead of making it perfect, let's just go through a first iteration of fine-tuning on it, specifically with memory tuning.

I'm going to make a few more imports here and new args to grab those queries, and then a familiar function for creating a question. This just takes a question and turns it into a prompt that we can then use to fine-tune the model. Next, grab those arguments and instantiate the model. Then get the dataset and make a question from each item in that dataset, so everything that is just a question string gets turned into this larger prompt with the schema. Grab the fine-tuning arguments, the default ones, and then you get to do the one line for training, where you pass the dataset and the fine-tuning arguments into llm.train. I'm also going to set is_public to true, so we can share this together. Okay, great. As you can see, it uploaded the data, and there's a dataset ID that you can refer back to. You can refer back to it like this, calling llm.train with the dataset ID set to that value, so you can reuse your dataset. Then your fine-tuning job is submitted, and you can check the status at that URL, which is public.

Okay, so what's going on here? The fine-tuning job is actually queued when you run this, and once it begins (in this case, this is the unoptimized memory tuning case) it takes about 30 minutes. You can continue in this notebook by using the four pre-prepared models provided, so you can see how the performance improves over time as the dataset improves. So let's go do that. Instantiating a pre-fine-tuned model, we're going to use this ID here, with model_name equal to that, and let's see what happens. So: who is the highest paid NBA player? Let's do the familiar thing, ask "Who is the highest paid NBA player?" and get an answer from the model. Okay, so now it has output this SQL query. Notice that we didn't actually have to use structured output, because the model has been tuned to produce exactly this format. That makes it easier to use in some ways, and it seems much better than before.
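Here is a minimal sketch of that sampling-for-variation idea and the save step, assuming the gold set and the generated pairs live in JSONL files with question/sql fields; in the lab this logic is wrapped inside the pipeline classes you just saw.

```python
# Sketch of sampling a few random gold exemplars per round and saving generated pairs.
# File names and field names are assumptions about the lab's JSONL layout.
import json
import random

def sample_gold_examples(gold_path: str = "gold-test-set.jsonl", sample_size: int = 3):
    """Pick a few random gold (question, query) examples to show the model this round."""
    with open(gold_path) as f:
        gold = [json.loads(line) for line in f]
    return random.sample(gold, sample_size)

def save_generation_results(pairs, out_path: str = "generated_queries.jsonl") -> None:
    """Append generated {"question": ..., "sql": ...} pairs for later fine-tuning."""
    with open(out_path, "a") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")

# e.g. 10 rounds x 3 random exemplars: each round the model sees a different set of
# examples, which nudges it toward generating different queries.
for _ in range(10):
    exemplars = sample_gold_examples()
    # ...build the prompt from `exemplars`, call the model, collect generated pairs,
    # then save_generation_results(generated_pairs)...
```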
Let's take a look at whether it's actually correct by running it against the database. Cool: Steph Curry, with a salary of nearly 52 million. Okay, so let's see how the results have improved quantitatively, not just in this one-off situation, by running against your evaluation set. I'm just pasting things in from lesson three's lab, which you already went over, so that we have evaluation here: this runs the evaluation and saves the results to a file, and then you can run your evaluation using that model name and save those results. Great. As you can see, far fewer errors are being thrown here, because there's far more valid SQL than before. And while it's still not the highest number of correct SQL queries, it's far higher than the original thirty-ish percent from the base Llama 3. So, big deal. But now let's take a look at the errors that are still in here, because we want to improve this even further, and see what that iteration process looks like.

Okay, so let's look at the errors. I've saved the errors in this JSONL file; these are all the things that are wrong, basically not correct based on our evaluation criteria. As you look through this: "What's the median weight?", 25th, 75th, 99th percentile. That seems like something pretty hard for the model, this idea of percentile in general, so we might inspect a little bit there. Okay, that's a pattern. So what's incorrect? Let's take a look. Exactly this query here is incorrect, so if we look at our generated queries, we can maybe triangulate why. I'm just catting the generated queries file here and grepping for "75th percentile" so I can find it. Okay, here it is. This is clearly wrong, because it's selecting height and average weight; it's not even related to salary. That's pretty off, but that's okay. So maybe one approach is to expand the dataset even further, and also to filter out things that are blatantly incorrect.

Let's try to expand it further and see whether the model, or the pipeline of models, can generate more data, and maybe sometimes data that answers that question correctly, or something close to it. One common approach is to just generate a lot more data, so that's increasing that sampling parameter you saw before. Let's take a look at what that does. It still generates that question there, but it does get some variation on other related items, though there's still an issue with how percentile is being calculated in general.

The next step is filtering. Obviously, that needs to be corrected in some way, and one way is to go through an automated process of filtering your dataset. So here are a lot of different filters that we've already written. One is that it's not valid SQL; you're familiar with that one, it's running the query against the engine to see whether it actually executes. Another is having "None" in the SQL or null in the question: sometimes the model gets around the validity check by producing something that executes validly but returns null, or an empty data frame, so there isn't actually anything there. Okay, and one error that we did see was the model trying to output something like an average height or average salary.
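Here is a sketch of those deterministic checks: run a generated query against the SQLite database with pandas and flag anything that errors out, contains null-ish output, or comes back as an empty data frame. The database filename and the exact filter conditions are assumptions.

```python
# Sketch of running a generated query against the database and applying basic filters.
# The nba_roster.db path and the specific conditions are assumptions.
import sqlite3
import pandas as pd

def run_query(sql: str, db_path: str = "nba_roster.db"):
    """Execute the query and return a DataFrame, or None if it is invalid SQL."""
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query(sql, conn)
    except Exception:
        return None
    finally:
        conn.close()

def passes_basic_filters(question: str, sql: str) -> bool:
    if "None" in sql or not question.strip():
        return False                      # null-ish generations
    df = run_query(sql)
    if df is None or df.empty:
        return False                      # invalid SQL, or it executes but returns nothing
    return True
```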
And that doesn't make any sense for this dataset, because the height and salary columns need to be cast first, so we're just catching that pattern in the query string and filtering out anything that outputs that way. Then we put all of those filtering conditions together, run them over the generated queries, and write the output to generated_queries_large_filtered, so we can take a look at the large filtered set. As you can see, a lot of the examples that were not valid got filtered out, but there are obviously still some issues there.

Finally, after this, you can clean up your dataset, whether that's manually inspecting things or editing things that still can't be caught by your filters. If you take a look here, the percentile calculations have been adjusted so that they're actually correct in your fine-tuning data. One interesting thing about this step is what you find among the errors you classify: for example, the model was constantly outputting ORDER BY ... DESC, because it had basically overfit to that idea of ordering descending. So including an example where you actually order ascending is helpful for giving the model a sense of breadth, even if it's just one or two examples.

Okay, moving on to another class of errors. Let's go back to what we were catting from the SQL errors before: "What is the median weight in the NBA?" You can see that this was incorrect, and let's see exactly why. Okay, it's actually taking the average instead of figuring out the median. Let's also see whether it's even valid SQL on top of that: we can run that query, and it does not execute, so it's invalid SQL as well. If we go into the generated queries, we might be able to see exactly why. When we grab the generated queries and grep for that median weight: okay, clearly there's something wrong, it's selecting college and a count from the roster, so that's definitely the culprit. It's outputting the wrong type of query. With more cleaning, you can add more examples of median, for example. If we go through the larger dataset that's gone through filtering, enlarging, and cleaning, we can see that there's a correct median calculation in there for this exact query, and when you tune the model on this dataset, you'll find that it's able to handle the median calculation well. We can actually see median show up far more in this dataset, so far more examples of median are being shared with the model.

Okay, let's grab the model that was tuned on that dataset; that's this model ID. We can ask that exact "median weight in the NBA" question, and we can also ask a different question, say about the median age of the Chicago Bulls. This is a question that doesn't really show up in the dataset, and the model can answer it. Great, it actually does compute the median instead of the average there. Let's see whether that executes. Over time, you'll be able to create harder and harder evaluation sets as you push the performance up here. As an example, you can take a look at one that we've prepared for you, gold test set v2, which is expanded to cover way more breadth and other difficult examples, to make the problem even harder now that your model has already improved a bit.
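Here is a minimal sketch of that automated filtering pass: read the large generated file, drop anything that trips a filter (invalid SQL, null-ish output, or averaging the un-cast text columns like HT and SALARY), and write the survivors to a _filtered file. The file names, field names, and suspicious-pattern check are assumptions; the validity callable can be the check_query_valid sketch from earlier.

```python
# Sketch of the filtering pass over the generated queries. File/field names are assumptions;
# pass in a validity callable such as the check_query_valid sketch shown earlier.
import json

def is_suspicious(sql: str) -> bool:
    """Flag averages over un-cast text columns, which need a CAST to make sense."""
    upper = sql.upper()
    return ("AVG(HT)" in upper or "AVG(SALARY)" in upper) and "CAST" not in upper

def filter_generated_queries(
    is_valid_sql,
    in_path: str = "generated_queries_large.jsonl",
    out_path: str = "generated_queries_large_filtered.jsonl",
) -> int:
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            pair = json.loads(line)
            sql, question = pair.get("sql", ""), pair.get("question", "")
            if not is_valid_sql(sql):                  # drop invalid SQL
                continue
            if "None" in sql or not question.strip():  # drop null-ish generations
                continue
            if is_suspicious(sql):                     # drop AVG over un-cast columns
                continue
            fout.write(json.dumps(pair) + "\n")
            kept += 1
    return kept

# e.g. kept = filter_generated_queries(check_query_valid)
```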
And so now let's see what the final model looks like on this harder evaluation set. This argument uses the larger set of queries generated from that gold test set v2, still the same model, and loads these up for fine-tuning. Same deal. Awesome. Then we can run our eval pipeline one last time, again pasting this in from our lab three notebook (that's just from lab three), and run that final model by specifying that model ID for you and running the results over it. Great. As you can see, the percentage of valid SQL has dramatically increased, as has the number of correct SQL queries. So we've iterated quite a bit, both on the evaluation dataset and on the training data. I think these are some of the hardest and most custom parts to actually get right, but it's what moves the needle on your model, and figuring out and diagnosing those errors is probably the most important thing for understanding where the model should actually move next.

And with that, let's look at an overview of what we learned. If we step back, the life of an LLM app, this whole process of lifting accuracy for your LLM app, is really all about iteration. This is very different from other things you've built, because it's not all at once; it's really iterative. You're making micro-adjustments as you improve your model before you ship it, and it's really about what evaluation threshold you set to stop that iteration, because there is no stopping point unless you set one. I think that's one of the most difficult things for people: they reach a certain level and then say, okay, well, it's still not good enough, and they keep raising the bar, so there's no point where they actually ship. Figuring out where your stopping point is, where you're ready to ship, is pretty key and important for getting your app out there. And that's it. Thank you.