In this first lesson, you'll look at the prompt format of Llama 3 as well as some built-in capabilities of the model. I should also mention that the set of notebooks you'll implement in this course is part of Meta's Llama Recipes. This lesson is also an overview of the steps you'll take to get your application ready for production. Let's dive right in.

To lift accuracy in LLMs, we're going to step back and think about LLM apps on their own. They really are their own animal (ideally llamas, these cute llamas, because of LLMs), their own beast, because they are highly probabilistic: they're sampling from a probability distribution. That's why what they produce can be a little unpredictable, versus a deterministic system where you're guaranteed a fixed output. As a result, it's important to be very iterative in improving these models. You're not going to get it right the first time by putting all the best practices together, the way you would for a deterministic system. So deeply understanding and internalizing this iterative process is extremely important if you want to lift accuracy to something like 95%. You'll actually get to do that in this course, which is really exciting, on your own, for a specific application, text-to-SQL, and it's one that generalizes to your other LLM apps. Now I'm going to hand it over to Amit, who will give you an overview of Llama 3.

Llama models, built by Meta's research teams, are large language models based on the transformer architecture. There are currently two Llama 3 models: a small 8 billion parameter model and a large 70 billion parameter model. In general, the larger a model is, the more capacity it has to learn from its training data. However, larger models are also more computationally expensive to train and deploy than smaller models. Each of these models can be used for different application scenarios and purposes. The instruction-tuned models are created by taking the base models and running them through additional training called instruction tuning. This enables instruction-tuned models to better follow human-language instructions such as "summarize this" or "tell me a joke." These instruction-tuned models are called Llama Instruct models. Depending on your use case, you can take any of these models and further fine-tune them for your application needs. In this course, you will be using the Llama 3 8 billion parameter Instruct model. The Llama 3 family of models has industry-leading performance, and, importantly, the family is open source, so you can modify these models for your own applications. The Llama 3 8 billion parameter model is used in many applications. It's small enough that it can run on your laptop, and it can have low latency, making it useful for many tasks. In this course, you will fine-tune this model to improve its performance on generating database queries.

All right, let's start by importing our packages. We'll import the Lamini package. Then let's instantiate an LLM; we use the Llama 3 8 billion parameter Instruct model. Now let's create a prompt. Here's the beginning-of-text tag, then the system header, and then the system message: "You are a helpful assistant." That's followed by the end-of-turn tag, then the beginning of a new header, the user header: "Please write a birthday card for my good friend Andrew." End of turn. And then a header for the assistant, which lets the LLM know it's time to reply. Now let's call the LLM with that prompt.
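Here's a minimal sketch of what that notebook cell might look like. The special tags are the standard Llama 3 chat format; the Lamini client usage (the Lamini class, its model_name argument, and the generate method) and the exact model name string are assumptions based on the narration, so check them against your installed lamini package.

```python
from lamini import Lamini  # assumes your LAMINI_API_KEY is configured

# Instantiate the Llama 3 8B Instruct model (model name string assumed).
llm = Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")

# The raw Llama 3 chat format: special tags delimit the system and user
# turns, and the trailing assistant header tells the model it's time to reply.
prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nPlease write a birthday card for my good friend Andrew.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

print(llm.generate(prompt))
```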
"Here's a very nice birthday card message for my friend Andrew. Happy birthday to an amazing friend like you, Andrew." And so on.

We have all been writing long prompts for LLMs lately, and those are long strings in Python. Python's PEP 8, one of the Python Enhancement Proposals, gives you a style guide for Python, and it suggests that long lines can be broken over multiple lines by wrapping expressions in parentheses. Let's see how that's done. Here, the strings are wrapped in parentheses, and all of the individual strings are concatenated together. This makes it a little easier to read, but a big advantage is that you can add comments to your strings. Let's try this out. You can see this looks similar to our last one; let's check to make sure that's true. Yes, it matches. Even though that looks nicer, we don't want to be writing those prompts out by hand throughout the course, so let's create a subroutine to do this for us, and we'll use it through the rest of the course. This will be make_llama_3_prompt, and it will take in a user prompt and, if you'd like, a system prompt. Let's build the system prompt first. Set it to an empty string. If the system input is not blank, then we add the system header, the system message you're passing in, and the end-of-turn tag. Then we can write the rest of the prompt: the beginning-of-text tag followed by the system prompt we just created, the user header followed by the user message, the end-of-turn tag, and then the assistant tag. And then we return the prompt. Let's try this out. That looks like the same prompt as before, but let's check it. Sure enough, it's the same. Let's try this on a new example: "Tell me a joke about birthday cakes." Notice that this one does not have a system prompt, just a user prompt. "Why was the birthday cake in a bad mood? Because it was feeling crumby." Now is a good time to pause and try a few prompts on your own.

The Llama family was trained on a huge amount of data, and as a result the model is able to generate SQL. We can use this capability to ask the model questions about how to generate SQL. Let's try it out. We can ask: given an arbitrary table named sql_table, write a query to return how many rows are in the table. Let's print out the result of calling the LLM with that prompt. The model returns a description of what it's doing, the query itself, and a little explanation of why it chose that response. We can ask more complicated questions. Let's try a few: given an arbitrary table named sql_table, help me calculate the average height where age is above 20. Great, here we can see it's now selecting entries where the age is greater than 20. We are not going to be executing these in this lab; we are just exploring what the model is able to generate. In the following labs, you will be generating these commands and then executing them against a database. Let's try an even more complicated example: given that table, can you calculate the 95th percentile height where the age is above 20? Perfect, there's the SQL command to generate that. What if you are using SQLite? Let's try that same query, but this time we'll add "Use SQLite." It returns the queries along with some hints about how to use them. This is a good time to pause the video and try a few queries of your own.

Thank you, Amit. Next, we're going to go over errors we might see in the model. For example, one common one is hallucinations.
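Here's a sketch of that helper, plus a couple of the calls from this section. The function name make_llama_3_prompt follows the narration; the exact formatting is my reading of the Llama 3 chat template.

```python
def make_llama_3_prompt(user, system=""):
    # Only emit a system turn when a system message was passed in.
    system_prompt = ""
    if system != "":
        system_prompt = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    return (
        f"<|begin_of_text|>{system_prompt}"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        # The assistant header signals that it's the model's turn to reply.
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

# No system prompt here, just a user prompt (reuses the llm client from above).
print(llm.generate(make_llama_3_prompt("Tell me a joke about birthday cakes")))

# Llama 3 saw plenty of SQL in pretraining, so asking for SQL works the same way.
question = (
    "Given an arbitrary table named sql_table, "
    "can you calculate the 95th percentile height where the age is above 20? "
    "Use SQLite."
)
print(llm.generate(make_llama_3_prompt(question)))
```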
Hallucinations are incorrect or made-up information that the model has generated. They arise from the LLM treating something that's slightly right as if it were right. In some cases, that's totally okay: if you say "hi" and the model interprets it as "hello," or says one or the other back to you, that's fine. But for, let's say, your birthday, or a specific fact like a revenue number in an earnings report, slightly right is not the same as right. This can be especially detrimental not just for critical facts, but whenever you need precision, such as when you're connecting to a downstream system, hitting an API, or referencing specific IDs. As an example, Llama 3 might produce a beautiful SQL statement while you're hitting a SQLite database. The hallucination here is that while this statement might execute in a MySQL database, it won't actually run in SQLite, because percentile_cont is not available in SQLite. So this will fail; it's an invalid query.

Okay, so how would you tackle this? The most common approach is prompt engineering. It's the easiest, lightest lift you can do, but for a case like this, we've found prompt engineering often gets you to something like 20 to 30% accuracy. Adding some self-reflection can sometimes help, lifting the model up to maybe 40%. Retrieval-augmented generation, or RAG, can get it up to 50%. And instruction fine-tuning gives mixed results, landing somewhere under RAG or a little over it. It can be pretty difficult to get the model to commit to sampling, to producing, the right results.

Why is that? Let's look at what's technically going on in this model. You might ask, "What year did Dave Aguilar climb the Golden Gate Bridge?" and you might RAG up a Wikipedia article about the Golden Gate Bridge so the model is prepared to answer the question. It starts, "He climbed it in," and then it's going to say a date. This often works: it might say 1981, which is the correct answer. But it sometimes fails. What's going on? Well, look at the probability distribution it's sampling from. It's not going to sample "cat" out on the edge; that's not very likely. But among the dates, among things that are similar, it is going to consider sampling. And that's why it fails: it's considering things that are close. As a result, it's not going to say he climbed it in "cat," but it will say he climbed it in 1970, which is a hallucination, and in some ways more detrimental than even saying "cat," because you don't know that it doesn't know.

So the question is: given all those techniques that don't fully work at reducing these hallucinations, how can fine-tuning actually help? One way it can help is that there is a method of fine-tuning where you can embed facts into the model. But instruction fine-tuning, which is the most common form of fine-tuning, isn't actually the right tool for removing those hallucinations, and it can be kind of costly as well.
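To make that invalid-query failure concrete, here's a small, self-contained repro you can run with Python's built-in sqlite3 module. The table and data are made up, and the exact error message depends on your SQLite build (some newer builds ship a percentile extension):

```python
import sqlite3

# A toy table standing in for "an arbitrary table named sql_table".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sql_table (age INTEGER, height REAL)")
con.executemany(
    "INSERT INTO sql_table VALUES (?, ?)",
    [(25, 170.0), (30, 180.0), (22, 165.0), (19, 175.0)],
)

# The kind of query the model likes to produce. Valid in databases such as
# PostgreSQL, but it fails on a typical SQLite build.
hallucinated = (
    "SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY height) "
    "FROM sql_table WHERE age > 20"
)
try:
    con.execute(hallucinated)
except sqlite3.OperationalError as e:
    print("Invalid query:", e)

# One SQLite-compatible workaround: sort the rows and step to the
# 95th-percentile position explicitly.
(n,) = con.execute("SELECT COUNT(*) FROM sql_table WHERE age > 20").fetchone()
row = con.execute(
    "SELECT height FROM sql_table WHERE age > 20 ORDER BY height LIMIT 1 OFFSET ?",
    (min(int(n * 0.95), n - 1),),
).fetchone()
print("95th percentile height:", row[0])
```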
There's a technique called Memory Tuning, invented by Lamini, which lets the model actually recall facts precisely by embedding them directly into the weights of the model, injecting a little bit of determinism into this very probabilistic process. One important question you might ask is: okay, are you just going to overfit the model on everything? We're not, because otherwise you'd lose the magic of why we're using LLMs in the first place. The goal is to be extremely precise on those facts without compromising the generalization and instruction-following that we absolutely love. Memory Tuning is able to lift us up to 95% accuracy, and the reason is that when we ask this question again, the model is sampling from a distribution that looks more like a spike, with 1981 essentially being the only possible response. The way this happens, which we'll go over in greater detail, is to bring what we call the loss to zero on these facts.

Okay, so now moving on to an actual example: a SQL agent. LLMs are quite incredible; they can generate SQL, often pretty complex SQL, as you already saw from Amit. But they don't reliably generate SQL that matches your schema, your complex schema, and those are those hallucinations. So, in this course, you'll fine-tune Llama 3 8B to generate SQL specifically for your schema. The process looks like this: a user asks a question, the model generates a SQL query, the query executes against a database, and the user sees the response.

Stepping back, why is this application even useful? I've actually talked to a lot of amazing folks about this and asked that same question, and here are some of the top answers. Number one is a better user experience for business users and analysts who are trying to look around data: they get faster answers directly, without necessarily going through another team to run those SQL queries. There's also operational efficiency: less time the data team needs to spend answering simple questions for their business users. The data team is often a bottleneck within an organization and feels overloaded while also having to juggle other tasks. And finally, reliability: if you can tune the model to high accuracy, this approach can be more reliable across the board than business users writing their own queries without necessarily understanding all the data models.

So, let's dive specifically into hallucinations in the SQL agent. There are a couple of types I would categorize them into. First, you'll see invalid SQL: the model generates SQL that's completely invalid, so it doesn't run, exactly like the one you saw before, where the percentile_cont query just doesn't run in SQLite. Other forms are missed column names, missed IDs, missed formats, or the functions being wrong. Second, there's malformed SQL: the model generates valid SQL, but it is semantically completely wrong, so it's not answering your question, or it even outputs something null, all while steering clear of invalid SQL. Here's an example of that, which you'll explore in greater detail. Essentially, this query is trying to get the highest-paid person in a database of the NBA, and specifically it's pulling salary information from that database. The salary information is stored as text with a dollar sign, which is all fine and good for a person to look at. But in SQL, if you're actually going to sort by salary, you can't sort that string as-is, because it includes the dollar sign; the string doesn't sort like an integer. So there's something wrong here when we cast salary as a real (it's essentially still text) and try to do an ORDER BY operation on it.
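Here's a tiny, self-contained repro of that malformed-but-valid failure mode. The nba_roster table, its columns, and the rows are hypothetical stand-ins for the dataset used later in the course:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nba_roster (NAME TEXT, SALARY TEXT)")
con.executemany(
    "INSERT INTO nba_roster VALUES (?, ?)",
    [("Player A", "$9,000,000"), ("Player B", "$42,000,000"), ("Player C", "$180,000")],
)

# Malformed but valid: CAST('$42,000,000' AS REAL) stops at the '$' and
# yields 0.0 for every row, so the "highest paid" answer is arbitrary.
bad = "SELECT NAME FROM nba_roster ORDER BY CAST(SALARY AS REAL) DESC LIMIT 1"
print(con.execute(bad).fetchone())

# Correct: strip the '$' and commas so the cast produces a real number.
good = (
    "SELECT NAME FROM nba_roster "
    "ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',', '') AS INTEGER) DESC "
    "LIMIT 1"
)
print(con.execute(good).fetchone())  # ('Player B',)
```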
So how do we address this, and how do we quickly improve accuracy? The approach we're going to take is embedding facts about the database into Llama 3 very efficiently, so that when you ask who's the highest-paid member of the NBA, the question goes through a fine-tuned model that actually has the SQL schema information inside its weights, and it is then able to produce the correct SQL query, as you can see here: casting salary as an integer and removing that dollar sign so the ORDER BY statement works correctly.

Now, you've heard it before, and you'll hear it again, especially from me: iteration is the most important thing to deeply understand and internalize. That means iterating many times on all of these steps, including evaluation, data generation, and fine-tuning. Starting with your original evaluation set, which you'll explore building, you tune your Llama 3 model to produce better SQL; you have another set of LLMs evaluate that LLM and diagnose hallucinations; you generate more data to tune the model better, again with one or more LLM calls; and you appropriately expand or improve your evaluation set so it's a better, harder gold standard. Those three steps are pretty critical to getting this right, and all of them call the model, so you can do all of this very scalably. The next lesson is about creating your own SQL agent. See you there.