to unlocking large amounts of currently inaccessible training data. You'll see how important training data is for training good models, but also how limiting traditional training approaches are. You'll see examples of how federated learning is used to train models on data that is distributed across different organizations, or even hundreds of millions of user devices. Let's go. Let's start with a very recent example: Llama 3. Only nine months after Llama 2, Llama 3 was announced with a big jump in performance over Llama 2. One of the most impressive details of this launch was that the smallest version of Llama 3, Llama 3 8B, outperformed the largest version of Llama 2, Llama 2 70B, in a major way. How is that possible? One of the most notable changes between Llama 2 and Llama 3 is that Llama 3 was trained on substantially more data. To quote from the announcement blog post: "Our training dataset is seven times larger than that used for Llama 2, and it includes four times more code." Llama 2 was trained on roughly 2 trillion tokens, and Llama 3 increased that to 15 trillion tokens. This demonstrates the importance of high-volume, high-quality training data. At the same time, there's a discussion about whether LLMs are running out of training data. The amount of available data in the world is difficult to estimate. The Educating Silicon blog has a fairly recent estimate of the amount of LLM training data that exists in the world. Their estimate is, and I'm quoting: "At 15 trillion tokens, current LLM training sets seem close to using all available high-quality English text." We might be able to 4x that amount of data, but they estimate that to be the upper limit on publicly available training data. Of course, for the sake of completeness, we had to ask an LLM. As it turns out, even LLMs think that LLMs are running out of training data, and that this problem is going to get worse over time. One important aspect that is less often discussed is public data versus private data. Compared to the 15 trillion tokens in FineWeb and 18 trillion tokens in non-English data, there are an estimated 650 trillion tokens in privately stored instant messages alone, and even 1,200 trillion tokens in all stored emails. Now, this is not to suggest that this data should be included in training. It is just a data point to compare the amount of public data to the amount of sensitive private data in the world. But isn't this interesting? We know how important data is for training a good model. It seems like we are running out of training data, but at the same time there are huge amounts of data that are not being used. We will go into federated LLM fine-tuning on private data in Course 2. In this course, we set the stage by introducing federated learning. Let's dive deeper into the topic of sensitive data. Data is naturally distributed. It's distributed across organizations and user devices.
In healthcare, for example, data is distributed across different hospitals. In government, it's distributed across different governmental agencies. In finance, it's distributed across different regulatory regions. And in manufacturing, it's distributed across different factories. When looking at user devices, we have sensitive data on phones and laptops, but also on other types of smart devices like cars, or even robot vacuum cleaners at home. Traditional training assumes centralized data. It only operates on a single one of these datasets. All the other datasets are ignored. And the result of this is that very valuable data does not get used for training. Actually, most of the world's data is not readily available for model training. The common way to work around this is to try to collect more data in one place, to increase the size of a single one of these datasets. But in way too many cases, collecting data simply does not work. Data needs to move, but that's often not possible for a number of reasons. Data might be sensitive. The volume of data might be too high. User privacy might prevent us from being able to collect the data. Regulations might force data to stay in a certain region. And sometimes it's just not practical. You might wonder: how big of a problem is this, actually? What happens if I have data, but it's not evenly distributed? To understand that, you're going to build three datasets. Each dataset will be different. They're all going to be based on MNIST, but they will have different digits missing. For example, dataset one will have no examples of digits one, three, and seven. Dataset two will have no examples of digits two, five, and eight. And dataset three will have no examples of digits four, six, and nine. Using these datasets, you will train three simple models, one on each dataset. All three models have the same architecture. You will then evaluate these models to see what the impact of missing data is on the final trained model. Let's jump into the lab. Let's start by importing a few utility functions. One of the things we imported from utils is MNIST. You start by downloading the MNIST dataset using the torchvision.datasets.MNIST function. The dataset gets downloaded to the specified MNIST_data directory, and by setting train=True, you get the full MNIST training set. Lastly, the transform argument applies transformations that normalize the data. You split the data to simulate data distributed across three partitions. This could represent, for example, the data of three different organizations, or personal data on three different user devices. To do that, you get the total length of the training set. The dataset is then split into three parts, part one, part two, and part three, using the random_split function from torch.utils.data. Each part is approximately of equal size. You calculate that size by dividing the total length of the dataset by three. The splits are stored in separate variables for further processing. Note that we manually set a random seed to be able to use the exact same three parts in lesson two, where we are going to train one federated model across all three parts.
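As a reference, here is a minimal sketch of this setup. The directory name, seed value, and normalization statistics are assumptions based on the narration, not the lab's exact code.

```python
# A minimal sketch of the download-and-split step described above.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # common MNIST statistics
])

# Download the full MNIST training set (60,000 images).
trainset = datasets.MNIST(
    "./MNIST_data/", download=True, train=True, transform=transform
)

# Fix the seed so the exact same three parts can be reused in lesson two.
torch.manual_seed(42)

total_length = len(trainset)
split_size = total_length // 3
part1, part2, part3 = random_split(
    trainset, [split_size, split_size, split_size]
)
```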
In the real world, we have different datasets. For example, one hospital might have more radiology images of fractured ribs, whereas another hospital might have more radiology images of fractured fingers. Another example: one user who has a dog will probably have more images of dogs saved on their phone, while another user who is a big fan of cars will probably have more images of cars. To simulate this, we change the data distribution in each of the three datasets by excluding specific digits. For part one, we exclude digits one, three, and seven. Part two excludes digits two, five, and eight. And part three excludes digits four, six, and nine. This creates different data distributions in each of the three parts. The plot_distribution function outputs a visualization of the distribution of digits in each part. This helps us to better understand the data we are working with. As we can see, in part one we only have digits zero, two, four, five, six, eight, and nine represented, but digits one, three, and seven are missing. In part two, we only have digits zero, one, three, four, six, seven, and nine, but digits two, five, and eight are missing. It's similar in part three. Now you train three individual models, one on each of the three datasets. For that, you use a train_model function imported from utils. From utils, we also imported the definition of a simple model, which is a neural network implemented in PyTorch with just two fully connected layers. Note that the choice of model does not really matter here. You could use many kinds of different classification models that work with the MNIST dataset. We can see that the loss is gradually going down. We train each model for ten epochs, and the loss going down indicates that the model is learning something. The training of model one is complete. Now the training of model two starts. It does the same thing for another ten epochs. Now the training of model three is running. This will take a minute to execute. And now the training of all three models is complete.
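Continuing the sketch, this is roughly what the exclusion and training steps might look like. The names SimpleModel, exclude_digits, and train_model mirror the narration, but they are assumptions; the lab imports its own versions from utils.

```python
# A rough sketch of the exclusion and training steps above, continuing
# from the split sketch. All helper names here are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

class SimpleModel(nn.Module):
    """A small classifier with just two fully connected layers."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten the 28x28 images
        return self.fc2(torch.relu(self.fc1(x)))

def exclude_digits(dataset, excluded):
    """Return a subset of `dataset` without the excluded digit labels."""
    indices = [i for i in range(len(dataset))
               if dataset[i][1] not in excluded]
    return Subset(dataset, indices)

part1 = exclude_digits(part1, excluded=[1, 3, 7])
part2 = exclude_digits(part2, excluded=[2, 5, 8])
part3 = exclude_digits(part3, excluded=[4, 6, 9])

def train_model(model, dataset, epochs=10, lr=0.01):
    """Plain supervised training; prints the loss once per epoch."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch + 1}: loss {loss.item():.4f}")

model1, model2, model3 = SimpleModel(), SimpleModel(), SimpleModel()
train_model(model1, part1)
train_model(model2, part2)
train_model(model3, part3)
```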
Let's see how the three trained models perform. To test the models, you load the MNIST test dataset using the same MNIST function you used before. To indicate that we want to load the test set, not the training set, we must set train=False. The dataset is downloaded to the specified directory, again MNIST_data, and the same transformation that was used to normalize the training dataset is also applied to the test dataset. Remember that you excluded certain digits from each of the training datasets before. To evaluate how the individual models perform on data examples that were not represented in their training set, you again create three subsets. This time, you only include the specific digits that each of the individual models has not seen during training: test set 137 only includes digits one, three, and seven; test set 258 only includes digits two, five, and eight; and test set 469 only includes digits four, six, and nine. Similar to train_model, we also imported a function called evaluate_model from utils. evaluate_model takes a model instance and a dataset as input, and returns the accuracy of the model on that dataset. The evaluate_model function is called to evaluate each model, model one, model two, and model three, on two datasets: first, the entire MNIST test dataset, and second, the custom dataset that only includes the specific digits missing during training. Last, we print the test accuracies for each model on the entire MNIST test dataset and on the specific subsets. We can see that each model gets an accuracy of roughly 65 to 70%. This is roughly what we would expect, given that three digits were missing from each of the training datasets. It means that, overall, the model has learned something: the accuracy is better than random chance, which would be 10%. But what you can also see is that on the specific subset, the accuracy is 0%. One additional step you can take to understand this better is to look at the confusion matrix. A confusion matrix provides insights into the performance of each model by showing the number of correct and incorrect classifications for each class. You use the compute_confusion_matrix function to compute the confusion matrix for each of the trained models: model one, model two, and model three. You do this on the full MNIST test set to see how each model performs on each of the ten digits, not just on the excluded ones. So what can you see here? On the vertical axis, you see the true label, and on the horizontal axis, you see the predicted label. So, for example, where the true label is zero, the model in 967 cases rightfully predicted label zero. But what you can also see is that because labels one, three, and seven were absent from the original training data, the model actually learned to never predict these labels, even if the true label is one of them. What we can also see is that for those labels that were absent from the original training dataset, the model learns to predict something else, some other label that is close to the original label. For example, label one was absent from the training data. The model learned to never predict label one, but in many cases it predicts either label two or label eight. So this shows the significance of training data, and of missing training data: if training data is missing, the model actually learns to predict the wrong thing.
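Putting the evaluation steps above into code, a rough sketch might look like this; include_digits and evaluate_model are assumptions modeled on the narration, reusing the transform and helpers from the earlier sketches.

```python
# A sketch of the test-set loading and accuracy evaluation above.
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets

testset = datasets.MNIST(
    "./MNIST_data/", download=True, train=False, transform=transform
)

def include_digits(dataset, included):
    """Keep only the examples whose label is in `included`."""
    indices = [i for i in range(len(dataset)) if dataset[i][1] in included]
    return Subset(dataset, indices)

testset_137 = include_digits(testset, included=[1, 3, 7])
testset_258 = include_digits(testset, included=[2, 5, 8])
testset_469 = include_digits(testset, included=[4, 6, 9])

def evaluate_model(model, dataset):
    """Return the model's accuracy on `dataset`."""
    loader = DataLoader(dataset, batch_size=256)
    correct = 0
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    return correct / len(dataset)

print("Model 1, full test set:", evaluate_model(model1, testset))
print("Model 1, digits 1/3/7: ", evaluate_model(model1, testset_137))
```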
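And for the confusion matrices, a hedged sketch using scikit-learn; the lab's compute_confusion_matrix helper presumably works along these lines.

```python
# A sketch of the confusion-matrix step, continuing from the sketches above.
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import confusion_matrix

def compute_confusion_matrix(model, dataset):
    """Rows are true labels, columns are predicted labels."""
    loader = DataLoader(dataset, batch_size=256)
    true_labels, predictions = [], []
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            true_labels.append(labels)
            predictions.append(model(images).argmax(dim=1))
    return confusion_matrix(
        torch.cat(true_labels).numpy(), torch.cat(predictions).numpy()
    )

# For model1, the columns for digits 1, 3, and 7 stay at zero: the model
# never predicts labels it has never seen during training.
print(compute_confusion_matrix(model1, testset))
```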
So now you've seen the problem with centralized training, and how it can break down when your dataset misses certain data points. Now, what does federated learning enable you to do, and how does it help in this situation? In an ideal world, we could train models across all of the available datasets. Everyone would retain control over their own data. Organizations could keep their data private. Users could keep their data private. But collaboration on training would be possible, to train models in critical areas like healthcare. Federated learning is a major component to enable such a future. The key idea is to move the model training to the data, and leave the data where it is. Data can remain in organizational silos or on user devices. Organizations and users retain full control over their data. The model training happens wherever the data sits: on the GPU cluster of a company, in the cloud account where an organization keeps their data, or even on a user device. Federated learning orchestrates the training process across those different datasets and devices. This enables federated learning to access more data and more compute: sensitive and distributed data in organizational silos, and data on user devices. In the next lesson, you'll learn exactly how this works. But first, let's look at a few real-world examples of federated learning in industry. The first example where this is deployed is finance. In finance, data is heavily regulated. U.S. customer transactions need to be stored in the US. EU customer transactions need to be stored in Europe. These customer transactions are valuable for training, for example, anti-money laundering models that help to detect or prevent financial crime. With federated learning, it's possible to keep the data stored in different regions around the world, but still enable training a model across those different distributed datasets. Our second example is on the other extreme of federated learning: the Google Gboard on Android. Instead of two datasets, as in the previous example, this system is deployed across hundreds of millions of user devices. When you use the Google keyboard to type a sentence, the keyboard tries to predict the next word you're about to type. It also tries to complete the sentence you're about to write. This is called Smart Compose. These features are powered by language models. The user data that goes into training is sensitive enough that it cannot be collected. Google was actually the first to propose and pioneer federated learning, to enable these models to be trained on user devices without having to collect such data. The system has evolved in many ways since then, but it's still an impressive example of large-scale federated learning on user devices. In the previous two examples, we saw cases where a single organization has distributed data: distributed across different regions in finance, or distributed across hundreds of millions of mobile devices. What's special about the third example is that there are multiple organizations collaborating with each other. In healthcare, data is often distributed across many hospitals. The Flower framework that you'll use in the next lesson was used by the National Health Service of the United Kingdom on the data of 130,000 patients. In collaboration with Oxford University, an early Covid screener based on blood tests and vital signs was trained. This is an exciting project because it allows different hospitals to collaboratively train models. This approach is a key enabler to rolling out AI in healthcare, where individual organizations almost never have enough data to train modern, data-hungry model architectures.
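Before we review the lesson: to make the idea of moving the training to the data tangible, here is a toy sketch of federated averaging (FedAvg), reusing SimpleModel, train_model, and the three parts from the sketches above. This is illustrative plain PyTorch, not the Flower API you'll meet in the next lesson.

```python
# A toy sketch of the FedAvg idea: each round, the model travels to the
# data, each data holder trains locally, and only model weights travel
# back to be averaged. Illustrative only.
import copy
import torch

def federated_round(global_model, local_datasets):
    """One round: train a copy locally on each dataset, then average weights."""
    local_states = []
    for dataset in local_datasets:
        local_model = copy.deepcopy(global_model)    # training moves to the data
        train_model(local_model, dataset, epochs=1)  # data never leaves its silo
        local_states.append(local_model.state_dict())
    # Plain average of all weights (weighting by dataset size is the
    # usual refinement).
    avg_state = {
        key: torch.stack([state[key] for state in local_states]).mean(dim=0)
        for key in local_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model

global_model = SimpleModel()
for _ in range(5):  # five federated rounds
    global_model = federated_round(global_model, [part1, part2, part3])
```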
Let's review lesson one. We've seen that data volume and diversity are critical for training good models. At the same time, we're in this interesting situation where we seem to be running out of training data, but we also have large amounts of unused data. So why is that? Data is often distributed. Traditional training approaches assume centralized data, and it's often difficult or next to impossible to centralize data. You've seen that federated learning operates on distributed data. Federated learning is deployed across many different industries, and it runs across distributed devices or distributed organizational silos.