I'm frequently amazed by what LLMs can do, but they could be even better if they could be safely trained on various kinds of private data: for example, personal medical records, or sensitive information at your company. In this lesson, you'll learn about the current limits of the data used to train existing LLMs, and the potential for federated LLM fine-tuning to help change this. Let's dive in.

LLMs are incredibly impressive. But what's interesting is that today, most of the world's data is not found in any LLM. Once you start listing all of the data out there that's not in an LLM, you start to feel like you're never going to stop. There's private data: the data on your phone, the data in your emails. There's all kinds of sensitive data, some of it private, some of it not. Just think of all the data in images, in your doorbell camera, the sounds your smart speaker might hear. None of these things are in an LLM today. There's data that's highly regulated: the data held in banks and financial institutions, all the data in medical institutions and hospitals. None of that data is typically in an LLM. And then there's the data on thousands of computers in enterprises, data isolated in IoT devices, robots, and manufacturing machines in factories all around the world. All of those computers hold valuable information, and again, it's not in an LLM.

This is remarkable if you start to think about it, because we think of LLMs as these huge billion-parameter models that contain much of the world's information. But when we break it down, we realize these LLMs are missing out on so much of the information in the world. Needless to say, if we could start to tap into all of this data that we know is not in LLMs today, we'd expect to see remarkable advances in what we're able to do.

So given all of that, what on earth are these LLMs trained on? Well, they're trained on data that's public: data that's on the internet, data that's on the web. If you start to enumerate what that covers, you'll see text on websites, videos on YouTube, articles in the popular press, blogs, magazines, parts of social media, message boards. This is the kind of data embedded inside the LLMs we're aware of, and as a result, there are tangible consequences for the types of responses they can give us.

I love LLMs. You can ask them questions such as "Plan me a fun Saturday in New York for a tourist," and they'll give you great responses. It's entirely reasonable that, in response to a question of that kind, one would tell you to start with breakfast at the Clinton Street Bakery, take a run on the High Line, and then catch a Knicks game. That sounds like a great Saturday to me. But what happens if you start to ask an LLM specialist, domain-specific questions? Let's do that. Let's ask it a question to do with medicine. You might ask something reasonable like, "I have blurry vision and I'm diabetic. What should I do?" The type of response you might get out of a conventional LLM is something along the lines of, "Perhaps eat some carrots. They have been known to improve your eyesight." And you can probably guess where that type of information comes from.
There's an urban legend about how carrots improve eyesight, but the point is that that's a very poor answer to a very important question. And that's the type of answer you'll get when you ask a general LLM medical questions, or questions in many other domains that are a poor match for the data it was trained on.

Because of this, there has been an emergence of domain-specific LLMs of all kinds. These are LLMs that excel at specific tasks. Some of them are trained on data that's not private, but on focused datasets that contain a higher concentration of the information they need, information you don't commonly encounter. For example, a ChefGPT-type LLM might be trained on a lot of cooking information: many recipes and other material it needs to respond to a person who wants to prepare a meal and asks questions of that nature. There are LLMs that help you learn languages or other subject areas. Then there are LLMs that have been trained on domain-specific proprietary data. Many are in the medical domain: BioGPT, Med-Gemini, Glass AI. These are examples of LLMs that have been trained on medical data and have significantly enhanced capabilities to respond to queries about health and medicine. One of my favorite examples, and one of the early forms of these domain-specific LLMs, is BloombergGPT. This is a 50-billion-parameter LLM that was trained extensively on the internal archives of Bloomberg: all those financial documents formed a massive reservoir that was used to train BloombergGPT and give it enhanced capabilities in the domain of finance.

So the point is that people already know that LLMs, when given specific forms of data, can start to excel in those domains. But we just don't do this generally. Why is that? Well, there are a lot of reasons, but one of the biggest issues is that LLMs are known to leak parts of their training data. So whenever you're dealing with private data, regulated data, distributed data, any type of data that's not public, you have to think very carefully before putting it in an LLM. What you see on this slide is an article that appeared in The New York Times in December 2023. It describes how even the author's own email address could be extracted from ChatGPT, because that email address was in its training data. Because of this, as a developer, as somebody interested in building LLMs, you need to be incredibly careful about how you treat and use sensitive data of this kind. And what that means is that the conventional methods you probably already know, such as fine-tuning an LLM in the normal, centralized way you've seen before, are not a good fit for private data.
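To see why, let's make that conventional approach concrete. Here's a minimal sketch of centralized fine-tuning, assuming the Hugging Face transformers and datasets libraries; the base model choice and the file all_patient_records.json are hypothetical stand-ins for illustration only. The key structural fact is right there in the code: every record has to be gathered onto one machine before training can even begin.

```python
# A minimal sketch of conventional *centralized* fine-tuning, assuming the
# Hugging Face transformers and datasets libraries. The dataset file is a
# hypothetical stand-in; the point is that all records end up in one place.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # small stand-in base model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical file: every party's sensitive records must first be copied
# into this single central dataset, which is exactly what private data forbids.
records = load_dataset("json", data_files="all_patient_records.json")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

records = records.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=records,
    # For causal-LM fine-tuning, this collator also builds the labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # all of the sensitive data now sits on this one machine
```

Everything in this pattern is standard practice; the problem is the single step that loads all the records into one central place.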
And so the focus of this course, and what makes it quite unique, is that we're going to dive into the methods you'll need in order to build LLMs on private data. When we say private data, it's an umbrella term: not only does it include the private data you'd first think of, again, things on your phone or in your personal emails, but also all the other types of data that routinely stay out of LLMs. As I mentioned before, that's regulated data, health data, data that's isolated in IoT devices and factories, all the different kinds of data that never make it into an LLM. We want to look at how you can use that data, because we'll be able to improve the quality of LLMs as a consequence.

So how are we going to achieve this magic? Well, we're going to focus on an alternative to conventional fine-tuning of LLMs. This alternative is called federated LLM fine-tuning, and this animation conveys the core idea in a nutshell. What you see here is a quite representative illustration, a medical example. There are three hospitals, and each of them holds highly private data: the medical records of patients, things of that kind. What we want to do is embed that information inside an LLM. Federated LLM fine-tuning allows us to do that without copying the data to a central location and then performing the fine-tuning there. Instead, you let the data stay where it resides, in these hospitals, perform the training in isolation on each of these separate partitions of the data, and then transmit only what has been learned, only the updated model weights, to a third location, a server, where the information, not the data, is aggregated to update the model. So note that in this example, the data never has to leave its location. That is a fundamental building block that we can combine with other methods and techniques to provide privacy for the training data, and with other important characteristics that allow us to access all kinds of private data. And as we will learn in subsequent lessons, this federated LLM fine-tuning approach will let us leverage private data and overcome many of the other barriers that come with it, such as data volume, regulations, and limitations of the hardware where the data is hosted.
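To make that loop concrete, here's a minimal, framework-free sketch of the idea, with three simulated "hospital" clients. Everything in it is an illustrative stand-in: the small weight vector plays the role of the LLM parameters (in practice, often just lightweight adapter weights), and local_finetune is a hypothetical toy in place of real local training. What matters is the shape of the protocol: local training on data that never moves, followed by weight-only aggregation on a server.

```python
# A toy sketch of federated fine-tuning: three clients, one server,
# and only model weights (never data) crossing the network.
import numpy as np

rng = np.random.default_rng(0)

# Global model weights: a tiny stand-in for LLM (or adapter) parameters.
global_weights = np.zeros(8)

# Each client's private data stays on-site; here we fake it with arrays.
client_datasets = [rng.normal(loc=i + 1.0, scale=0.5, size=(100, 8))
                   for i in range(3)]

def local_finetune(weights, data, lr=0.1, steps=10):
    """Hypothetical local training: a few gradient steps on one client's
    private data (a toy least-squares objective pulling the weights toward
    the local data mean). Only the updated weights leave the client."""
    w = weights.copy()
    target = data.mean(axis=0)
    for _ in range(steps):
        w -= lr * (w - target)  # gradient of 0.5 * ||w - target||^2
    return w

for round_num in range(5):
    # 1. The server sends the current global weights to every client.
    # 2. Each client fine-tunes locally; its data never leaves the site.
    client_weights = [local_finetune(global_weights, d)
                      for d in client_datasets]
    # 3. The server aggregates only the weights (a plain mean here; in
    #    general, a mean weighted by each client's dataset size).
    global_weights = np.mean(client_weights, axis=0)
    print(f"round {round_num}: global weights = {global_weights.round(2)}")
```

In a real deployment, the aggregation rule, what exactly gets transmitted, and the extra privacy protections layered on top all matter a great deal, but this little loop is the fundamental building block.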
Now, this is a lot to take in. We're talking about LLM fine-tuning, we're talking about federated learning; there are a lot of topics here. So what I want to recommend is that, if you haven't done so already, you go look at the other short course that focuses on the fundamentals and gives an introduction to the area of federated learning. If you take that course first, a lot of the information we'll be describing will make sense much faster. But don't worry, this course is still self-contained, so if you want to go ahead, please do. I just highly recommend starting with the introduction to federated learning first.

That brings us to the end of lesson one. Let's review some of the main messages that we'll be using as background as we progress through this course. There are three main things I want you to think about. Number one: perhaps surprisingly, most of the world's data has not yet found its way into LLMs. What's missing? Vast categories of data, such as private data, regulated data, and distributed data, in important domains such as medicine, finance, enterprise, and more. Number two: perhaps the most important reason that data of this kind is not readily included in LLMs is that LLMs are well known to easily leak training data. You must keep that at the forefront of your mind whenever you're building LLMs. It means that conventional ways of including private data, like centralized fine-tuning, cannot normally be used in these circumstances. That brings me to number three: the answer you will learn to use in this course is federated LLM fine-tuning. It's a way for you to leverage private data and protect the privacy of that information as you embed it inside LLMs, and it also lets you address many of the other barriers that come with using data of this kind. By using more data, and by carefully incorporating new types of private data sources, we can make smarter LLMs that, in particular, are able to answer domain-specific questions in areas like medicine, and improve society overall.