As you have seen, LLMs can leak training data, so you need to take great care when using them with private data. In this lesson, you will learn about the various risks and, in particular, see how training data can be extracted from even popular LLMs. You will also learn how federated fine-tuning of LLMs can be a valuable tool for protecting training data and limiting the opportunity for data to be leaked. Let's dive in.

There is already a wide variety of known vulnerabilities for LLMs, and they are wide-ranging: for example, the need to protect the weights of a model, the importance of being able to restrict how a model is used, issues with attempts to anonymize users, issues with people trying to insert backdoors into training algorithms, which can then become a problem when you use server-side code, and attacks that relate to data transfer within these systems. All of these factors become far more acute and severe if you are trying to use private data, which is inherently much more sensitive and carries far more risk should any of these issues occur. Given that there are so many vulnerabilities, what we have done in this course is focus on what is perhaps the most important concern that users of private data are likely to have: the risk of training data being extracted. But it is important to understand that there is a whole landscape of vulnerabilities that one needs to consider.

Within the specific domain of attempts to extract training data from LLMs, there has been a lot of interest, and a wide variety of approaches has been invented. On this slide, we are showing just a tiny fraction of the academic papers that have come out, each describing a unique recipe for extracting training data from an LLM. And obviously, since in this course we are examining the use of private, sensitive data, attacks of this kind are even more alarming. So in this lesson we are going to dive into the basics of how most of these extraction approaches work. The aim is for you to understand how they function and, in doing so, to better appreciate the risks that they pose.

Now let's look at the general framework of how these data-extraction approaches behave. You can break these attacks down into two stages. In the first stage, the LLM has to be made to produce responses that are potentially examples of actual training data. This process itself takes a wide variety of forms, whereby people have invented ways of prompting the model so that it is more likely to generate actual examples of training data. For example, people have come up with clever prompts such as providing a start token, an empty field, or perhaps a string containing keywords that might lead the LLM to produce sensitive pieces of data. In the code that you will use shortly, we take a very basic approach of just prompting with an empty string, and you will see that even that is successful. Generally speaking, stage one revolves around different strategies for prompting the model to generate a long list of candidate examples that hopefully, if you are the person trying to extract data, contain training data. A minimal sketch of this kind of candidate generation follows below.
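To make stage one concrete, here is a minimal sketch of this kind of candidate generation. It is not the notebook's own eval utility: it assumes a Hugging Face causal LM, and the small model name used here is simply an illustrative stand-in.

```python
# Minimal sketch of "stage one": prompting a model with very simple prompts
# (an empty string, a short prefix) to collect candidate sequences that may
# or may not be training data. Not the notebook's eval() utility; the model
# name is an assumption used purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["", "Email address: ", "Peter W"]  # empty string plus simple prefixes
candidates = []
for prompt in prompts:
    # For the empty-string prompt, start generation from the BOS token so the
    # model has at least one token to condition on.
    text = prompt if prompt else tokenizer.bos_token
    inputs = tokenizer(text, return_tensors="pt")
    # Sampling (rather than greedy decoding) gives a more diverse pool of
    # candidates to pass on to the stage-two membership test.
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=40)
    candidates.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

for candidate in candidates:
    print(repr(candidate[:120]))
```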
Once this has been done, stage two in this family of approaches performs what can be thought of as a membership test, and this is the step we are going to look into more closely. It has a lot of complexity to it, especially relative to stage one. This step takes each individual candidate and performs a process by which you can decide whether that candidate is in fact part of the model's training set or not. If you are unable to perform this step accurately, these training-data extraction attacks are not actually viable: you can prompt the model to generate a lot of responses, but to make real progress you need to be able to reliably determine whether a given example really is in the training set or not.

Let's dive into this second step, the membership test, over a series of three slides. Each slide describes a key idea that will allow you to understand how it works. The first key idea to appreciate is the use of a metric. The metric we will use in the code that follows, and a very popular metric in its own right, is perplexity. This is a measure of surprise: for an entire sequence produced by the LLM, you can calculate its perplexity, and depending on the perplexity value you can begin to judge how surprised the model is that this sequence was produced. Perplexity is calculated from the probabilities the model assigns to each individual token in the sequence, normalized over the length of the sequence. Sequences that are not surprising to the model tend to be training examples. The intuition behind this is that training examples you are able to reproduce from an LLM tend to be examples that the model has memorized, and a memorized example, when produced, is not going to be surprising to the model. So by calculating perplexity, we are able to quantify that level of surprise and obtain an indication of how likely it is that the sequence, the example, really is training data. That is key idea number one.

Not everyone is aware of this, but it is actually possible to interrogate an LLM and have it give you the probability of an individual token being produced. It is necessary to explain this because it is a building block of how we calculate perplexity. You will have noticed that I said perplexity is based on the normalized probabilities of all the individual tokens within a sequence, so we now need to consider how you obtain the probability of an individual token. The way this is done is that, for a given sequence, you prompt the model with an earlier fragment of that sequence and mask the later tokens; you can then determine the probability that the next token in the sequence is the one that was actually produced. This is what is being illustrated on the slide. We have a fragment, say "Mary lives at 172 Tenison Street." We prompt the model with the earlier part, "Mary lives at 172," and mask the rest of the sequence. We then examine the model's output distribution to determine the probability that the word that comes next is "Tenison," and then that the one after it is "Street." This allows us to determine the probability of each subsequent token. You can see this illustrated at the lower part of the slide with quite an attractive animation of the sentence "I love learning about Federated learning by visiting Flower.AI." In this animation, the dark yellow marks the earlier part of the sequence, and the light yellow represents the masked token whose probability we are calculating by interrogating the model. By performing these probability calculations for the individual tokens within a sequence, we can then calculate the perplexity that we use. A sketch of how these per-token probabilities and the resulting perplexity might be computed follows below.
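Here is a minimal sketch of how these per-token probabilities, and the perplexity built from them, might be computed. Again, this is not the course's calculate perplexity utility; it assumes a Hugging Face causal LM, and the model name is an illustrative stand-in.

```python
# Minimal sketch of key ideas one and two: read off the probability the model
# assigns to each token of a candidate sequence, then turn those probabilities
# into a perplexity score (the length-normalized "surprise").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # assumption: any causal LM can be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

candidate = "I love learning about Federated learning by visiting Flower.AI"
input_ids = tokenizer(candidate, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# The logits at position t are the model's prediction for token t+1, so we
# align each prediction with the token that actually follows it.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
target_ids = input_ids[:, 1:]
token_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

# Perplexity is the exponential of the average negative log-likelihood over
# the sequence: low perplexity means the model was not surprised.
perplexity = torch.exp(-token_log_probs.mean()).item()
print(f"perplexity: {perplexity:.2f}")
```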
The last step in this membership test is to take the perplexity, which we have now seen how to calculate, and examine its value to determine whether it is low enough to indicate that the model is not surprised. Depending on the level of perplexity, and on the level at which we judge that surprise is or is not occurring, we can then mark an example as really being an actual training example or not. As you would probably imagine, there is a wide variety of approaches for determining this kind of threshold: the level of perplexity at which we are willing to say that a candidate really is an example from the training dataset.

Now that you understand all the core concepts necessary for performing one of these training-data extraction methods, we are going to jump into the code and see it happen in front of our own eyes. In this fourth lesson, you will learn how LLMs can leak training data and observe that LLMs fine-tuned with federated LLM fine-tuning are significantly more resistant to this particular vulnerability. As we have done in previous notebooks, we will switch between the 70-million-parameter LLM and the larger 7-billion-parameter model when necessary for some steps, but all the code that you see can be applied to either one. Furthermore, we will be using the same models and the same medical-dataset scenario as before.

As a first step, as always, let's import the packages and utility functions that you are going to need as you work through the code in this notebook. The majority of these imports are utilities that have been written for you to assist and simplify the task of performing these tests and assessing how hard it is to extract training data from different versions of fine-tuned LLMs.

In this next section, you will see, step by step, the simple approach we described earlier in the slides that can be used to extract training data from an LLM. So let's begin. You will first work with the LLM that was centrally fine-tuned on Med Alpaca back in lesson two. This is the model from which you will attempt to extract training data first, so let's load it. Recall that this model is the Mistral 7B LLM. The first step in extracting training data is to generate possible candidate examples. This is done by prompting the model. As mentioned, there exists a wide variety of clever approaches for prompting the model in such a way that it is likely to produce a response that includes parts of the training data, so let's try some of the simpler varieties. The eval function runs the prompt against the Mistral 7-billion-parameter LLM using the fireworks.ai service. And as we see in the code, even prompting with an empty prompt produces an interesting response that is potentially training data. It appears that this response may indeed have come from a website or other material that describes the benefits of a specific VPN service, so it is quite conceivable that this is, in fact, an example of training data. You can try different prompts, so let's try a few.
One that you may want to try is simply prompting with a prefix such as "Email address:" and then seeing whether the model will regurgitate an email address for us. As another example, you can prompt with a fragment of a name, "Peter W". If you do this, you can see the response gives you text that may in fact have come from a biography on a website that is potentially training data used for the model. What we are seeing here are the different types of simple prompts that you can use to generate a series of candidate examples that may or may not be training data; the next step, obviously, is to decide whether you can determine if these really are training data samples or not.

If we look at this final example, where we used the fragment of a name, "Peter W", we seem to have a response that is a fragment of a bio. It may be tempting to think that even if this is training data, it is harmless. But remember, we are looking toward cases where we want to train on private data. With this Mistral model, if this does happen to be training data, then had Mistral been trained not on public web data but on sensitive private data, for example HR records, it would be entirely possible for this bio to be coming directly from somebody's actual HR record, such as the employee file of a Peter W. at a particular company.

Now that we have seen how you can generate a number of candidate examples that are potentially training data, the next step is to calculate perplexity for some of these candidates. Remember, we do not know whether any of these responses really are training data or not. What perplexity does for us is measure how well the LLM can predict those sequences of words. A low perplexity indicates the model was not surprised by the sequence produced, and this can be a signal that the sequence is a memorized fragment of actual training data. On the other hand, a high perplexity indicates the model was surprised by the sequence, and so the sequence is less likely to have been observed by the model before in its training data. You can use the same eval function as before, but with the second, optional argument set to true, so that we take note of the likelihoods while prompting the model. You can then use the calculate perplexity function to compute the perplexity from those collected likelihoods, as you see in the code. The perplexity value here is very low, but do keep in mind that perplexity and other similar metrics of this kind are only indications that the example is training data.

Let's now take a closer look at these perplexity values. As mentioned, when perplexity is low, the data is more likely to be actual training data. But how low is low? What is a good threshold? Let's take a look at two examples. Here is an example of a response from the Mistral 7-billion-parameter model that is highly likely to be actual training data. This is because we can find this text fragment on an actual website that was likely crawled by Mistral, or at least included in the dataset used by Mistral, and so made its way into the training data of this model. Now, let's look at a different example and calculate the perplexity of a sentence that comes from a news article in The Guardian, a major newspaper in the UK. In fact, this article was published on May the 9th, 2024.
This is far too recent to have possibly been incorporated in the training set of the Mistral 7B model. So what you are looking at here is the perplexity value of text that, with very high probability, was never in the training set of the Mistral model. From this you can start to get a calibration for this particular model of what level of perplexity text needs to have for it to be completely unlikely to be training data. And if we look at the earlier example, a piece of text that is highly likely to be part of the training dataset, we can start to see what perplexity values tend to look like when examples really are training data. Note that the perplexity for the example that is not training data is six times higher. Something to note is that in both cases we have provided you with a URL that goes to the Wayback Machine, so that even though some of these links, such as the news article, are not likely to be persistent, you will be able to visit them no matter when you happen to be watching this course. In the first example you can see the actual website, where it appears very likely that the data was crawled, became part of the Mistral model's training data, and was then extracted by us out of the Mistral model. In the second example, we took a text fragment from the internet that we knew was almost impossible to have appeared inside Mistral's training data and then verified the perplexity value of this particular fragment.

Given what you have now learned and seen, you can apply this method to the fine-tuning on the representative private data, the Med Alpaca dataset, and the fine-tuned models that we have built and trained up until this point. You begin by loading the LLM that was fine-tuned centrally; you did this back in lesson two. Then you can try to answer the question: can we extract a question from the Med Alpaca dataset from this fine-tuned model? You can see here that you compute the perplexity again, and then normalize it using the same LLM before it was fine-tuned. What you will observe is that this normalized perplexity is very low, which indicates the data likely came from the training dataset. In this particular case we know this is true, because the model was actually trained on the training set from which we acquired the prompt text whose perplexity we calculated. So this is an example of prompt text that can be extracted from a centrally fine-tuned LLM.

One of the core things we are examining in this course is how we can use federated learning for the fine-tuning of LLMs to make it much more difficult to extract training data. So let's look at this central question by comparing the fine-tuned models from lesson two and lesson three, i.e. the one that was fine-tuned centrally and the one that was fine-tuned with a federated method, and start to see from which of these it is more difficult to extract Med Alpaca data, which of course has been our representative of private data throughout this course. The techniques that you have explored in this notebook up to this point have been packaged together into a single function in the code, the MIA test. This test takes as input a series of possible training examples, the candidates, and helps us decide, based on perplexity, whether they are part of the training data of the model or not. A minimal sketch of what such a membership test might look like follows below.
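Here is a minimal sketch of what such a perplexity-based membership test might look like. It is not the notebook's actual MIA test implementation: the model names and the threshold value are illustrative assumptions, while the normalization by the base (not fine-tuned) model follows the idea described above.

```python
# Sketch of a perplexity-based membership test: perplexity under the
# fine-tuned model is normalized by perplexity under the base model, and a
# candidate is flagged as a likely training-set member when that ratio falls
# below a threshold. The 0.8 threshold is an arbitrary illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, text: str) -> float:
    """Length-normalized perplexity of `text` under `model`."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return torch.exp(-token_log_probs.mean()).item()


def membership_test(candidate: str, finetuned, base, tokenizer, threshold: float = 0.8) -> bool:
    """Return True if `candidate` looks like training data of `finetuned`."""
    ratio = perplexity(finetuned, tokenizer, candidate) / perplexity(base, tokenizer, candidate)
    return ratio < threshold  # low normalized perplexity -> likely memorized


# Usage sketch (model names are assumptions, not the course checkpoints).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
finetuned = base  # stand-in; in the notebook this would be the fine-tuned LLM
print(membership_test("What is the function of the mitochondria?", finetuned, base, tokenizer))
```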
Initially, you will run this against the centrally fine-tuned model, and next you will run it against the federated version. From the results, you can see that of the three examples we used here, each coming from the Med Alpaca data, two were recognized as being training data of the centralized model. We of course know that they are part of the training set of this model, because we fine-tuned it ourselves on this dataset. And so we see that the function returns membership true to indicate that the example is believed to be part of the training set. In summary, of the three examples we attempted to extract from the centrally fine-tuned model, we were able to do so for two.

On the other hand, let's look at the federated model. We can see from the results that every candidate we attempted to extract is reported as false: this perplexity-based technique is uncertain whether these particular examples were part of the training set or not. So with the federated model, this approach cannot recognize these examples as being training data, and a person attempting to extract training data would have no way of knowing which responses of the model, of which there are likely to be many, are memorized training data and which are just typical LLM responses to a prompt; there is no way to differentiate them. So this small experiment, where two out of three attempts were successful for the centralized model and zero out of three for the federated one, indicates there has potentially been some significant improvement in the privacy provided. But a natural question to ask next is, "Will this hold over a larger experiment?" If so, then we can start to believe that this statement holds more generally.

To answer this question, you can perform the previous experiment at a much larger scale, involving not just three examples but a large fraction of the Med Alpaca dataset that we have been working with throughout this course. The standard way to analyze the results of this new experiment is with an ROC curve, which will tell us how well true positives and false positives can be detected for both the centralized model and the federated model when we apply this extraction approach at larger scale, to many more examples. Let's begin with the small 70-million-parameter model and perform this analysis. You will start by setting up a configuration. A few things of note in this configuration are how the two datasets are set up: an ROC curve requires positive and negative datasets to be specified, and this is what is being done in this configuration file. You are using the Med Alpaca medical flashcards in one case and the BigBio public Q&A dataset in the other to satisfy these requirements for positive and negative datasets. This configuration uses ten examples, with each sample indexed zero through nine. You can easily adjust the configuration to use larger amounts of data for training and validation; in this configuration, the training indices are set between 0 and 10 and the validation indices between 15 and 25. You can also replace this model with any one that you like and use other datasets simply by changing this config. Performing this whole analysis means waiting quite a while, and we want you to be able to see the results of this analysis over a large number of examples, so we have done it offline for you.
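Before loading those saved results, here is a minimal sketch of how such an ROC comparison could be computed once membership scores have been collected. It is not the notebook's plot MIA results function; the scores here are randomly generated placeholders purely to illustrate the mechanics, with one score per example for the positive dataset (true training examples) and the negative dataset (examples known not to be in training).

```python
# Sketch of an ROC analysis for a membership test. Assumes each example has a
# membership score where higher means "more likely a member" (for instance, a
# negated normalized perplexity). The scores below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores_members = rng.normal(loc=1.0, scale=1.0, size=100)      # positive dataset
scores_non_members = rng.normal(loc=0.0, scale=1.0, size=100)  # negative dataset

labels = np.concatenate([np.ones(100), np.zeros(100)])
scores = np.concatenate([scores_members, scores_non_members])

fpr, tpr, _ = roc_curve(labels, scores)
plt.plot(fpr, tpr, label=f"extraction attack (AUC={auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--", label="random chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```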
We completed that earlier and saved the output, as we have done in a number of the earlier lessons in this course. Let's load this output. You will use a function called plot MIA results. This takes the saved output of the analysis just mentioned and presents the information as an ROC curve that compares the federated and centralized models relative to a third baseline of simply using a random choice to decide whether or not a candidate example is part of the training dataset. The big, important message that immediately jumps out from this figure is that training-data extraction is systematically easier for the centrally fine-tuned model, shown with the red line, than for the federated fine-tuned model, indicated in blue. In this figure, black is random chance.

Now, the ROC curve measures how well the extraction process works. The area under the curve indicates how successfully, and with how few errors, the process is able to correctly identify training examples; the more accurate the process, the more the curve shifts toward the top left. You can see this happening for the red curve that corresponds to the centrally fine-tuned model. In this ROC curve, the y-axis is the true positive rate and the x-axis is the false positive rate. Let's consider a couple of data points shown in this figure. The figure shows that roughly 83% of the candidate examples from the portion of the Med Alpaca dataset we used were correctly classified under the centralized setting. This is roughly 30% higher than the success rate under the federated fine-tuned model, which is only successful at a 63% rate. Please remember, though, that the shape of these curves and the percentages we get are all subject to tuning. You can raise or lower these curves by altering parameters such as the key variables of differential privacy and the other options for the components of the federated LLM fine-tuning recipe described in the prior lesson. By changing those values, you are able to increase or decrease the amount of privacy protection that you receive, and the reason you might want to do that is that there tends to be a corresponding impact on the fidelity of the resulting model during fine-tuning as you increase the level of privacy protection.

For the last part of this notebook, let's take a look in a bit more detail at how the perplexity approach you have been using works on the Mistral 7-billion-parameter model. This is interesting because Mistral is a widely used, very large-scale language model that we know was trained on public data, and so we can demonstrate just how easy it is to extract this training data, which serves as something of a proxy for trying to extract private data. In this example, it is assumed that you do not actually have access to any of the training data; you simply want to identify sensitive data, or in this case public data, that was used to train the model. Because you do not have any of the training data, a wide variety of training-example candidates must be generated: a large number of candidates needs to be produced and their perplexity tested. And so, as has happened a few times in this particular notebook, to avoid you waiting, this process has again been performed offline.
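To give a sense of what that offline process involves, here is a minimal sketch of it: generate a pool of candidates from an empty (BOS-only) prompt, score each with perplexity, and inspect the lowest-perplexity generations. It is illustrative only; the small model is a stand-in for Mistral 7B, and a real run would generate far more candidates.

```python
# Sketch of large-scale candidate generation followed by perplexity ranking.
# The model name is an assumption; in practice you would generate thousands
# of samples rather than the handful shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # assumption, stand-in for Mistral 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return torch.exp(-token_log_probs.mean()).item()


bos = tokenizer(tokenizer.bos_token, return_tensors="pt")
candidates = []
for _ in range(20):  # a real run would use thousands of samples
    out = model.generate(**bos, max_new_tokens=64, do_sample=True, top_k=40)
    candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Rank candidates: the lowest-perplexity generations are the ones most likely
# to be memorized training data and therefore worth inspecting by hand.
scored = [(perplexity(c), c) for c in candidates if len(c.split()) > 3]
for ppl, text in sorted(scored)[:5]:
    print(f"{ppl:8.1f}  {text[:100]!r}")
```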
We have generated quite a significant number of candidates from the Mistral model, and the output of that generation has been saved for us to examine. An extraction variable is used to store the outputs of this extraction, and the MIA_config variable configures and specifies its behavior, the model to examine, and so forth. What is interesting here, again, is that this is not a toy model, and yet the procedure we have described is seen to work well. The eval and show functions are used to pull out some of these examples. The first example shows the model is able to produce, or regurgitate, even entire paragraphs of an existing website. In the second example, we can see email addresses being found in the model's output, and these correspond to training data. This is further reinforcement that the training-data extraction method we used is able to succeed even on a model of this scale, which is clearly no toy, and it is clearly not the type of behavior you want when you are working with private data. The gains in protection that federated fine-tuning provides are, as we have observed, both very much needed and not easily obtained through the alternative methods that were used to train, in this case, Mistral 7B or, in the earlier case, to fine-tune using the centralized approach with the Med Alpaca data.

You have now reached the end of lesson four. Let's review some of the key things that you have learned during this lesson. The first is that we have further reinforced, and you have yourself demonstrated through the code, that LLMs can indeed leak training data. We discussed the fact that there is a whole range of vulnerabilities, but the focus of this course drills down into this potential for LLMs to leak training data, and you have now seen how easily it can actually be done. We described in quite some detail how one of the most popular attacks of this kind, one of the most popular approaches to extracting training data, can be performed, and how this approach is based on a perplexity measurement computed for the different candidate examples that potentially might be training data. Something important to note is that in this lesson we used the Mistral 7-billion-parameter model and showed examples of extracting training data from it, but we are not intending to single out Mistral specifically: these techniques would indeed work on many other open-source LLMs as well. Mistral here was simply standing in for other LLMs.

An important takeaway from this lesson is that LLMs fine-tuned using a federated approach are much more resistant to attempts to extract their training data. We have seen this in many of the experiments you performed, where, compared against centrally fine-tuned or more typical LLMs, they were able to resist those extraction attacks much more strongly. One final point we want to leave you with in lesson four is that there exists a trade-off between privacy protection and the quality of the model's responses. What you observed in lesson four, and then saw for yourself with the code, is how difficult it is to extract training data from different models. In each of these cases, those models were trained with certain parameterizations of the various mechanisms involved.
In the case of the federated learning one, techniques such as differential privacy, the style of the federated learning, and the basic setup of the learning algorithm itself all have a variety of hyperparameters associated with them. What you can do is adjust those hyperparameters to increase or reduce the amount of privacy protection the model enjoys. The reason you might need to tune this is that in these setups there is a natural trade-off: as privacy protection increases, it becomes more difficult to extract training data, but the quality of the model's responses, and its ability to absorb information from the fine-tuning data, degrades. This is a trade-off that you are able to weigh and decide for yourself, depending on the sensitivity of the data, the particular application, and the importance of the quality of the responses. A purely hypothetical sketch of the kinds of configuration knobs involved follows below.
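To make these knobs a little more concrete, here is a purely hypothetical configuration sketch. The key names and values are assumptions for illustration, not the course's configuration format or Flower's API; they simply show the sort of hyperparameters, such as differential-privacy noise and clipping, the federation setup, and the local training schedule, that you would tune when making this trade-off.

```python
# Hypothetical example of the kinds of hyperparameters that control the
# privacy/utility trade-off in a federated LLM fine-tuning setup. These keys
# and values are illustrative assumptions only.
finetune_config = {
    "federation": {
        "num_rounds": 20,          # more rounds can recover quality lost to noise
        "clients_per_round": 10,   # how many clients participate each round
    },
    "differential_privacy": {
        "noise_multiplier": 1.0,   # higher -> more privacy, lower response quality
        "clipping_norm": 0.5,      # bound on each client's update before noising
    },
    "training": {
        "peft": "lora",            # parameter-efficient fine-tuning method
        "learning_rate": 5e-5,
        "local_epochs": 1,
    },
}

# Raising noise_multiplier (or lowering clipping_norm) makes training-data
# extraction harder but typically reduces how well the fine-tuned model
# absorbs the private data; lowering it does the opposite.
print(finetune_config["differential_privacy"])
```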