In this lesson, you will learn how to use Together AI's API to ensure that the content generated in AI games adheres to safety and compliance policies. We will begin by utilizing the default content moderation policies of Llama Guard 2 8B. Afterward, you will have the opportunity to create and implement your own custom policies. All right, let's go.

So first let's talk about safety and security. Safety, in general, means ensuring a positive customer experience by preventing any exposure to PII (personally identifiable information) or to any toxic, harmful, or inappropriate material that a user may not want to be exposed to. Security, on the other hand, is more about defending against threats and attacks related to large language models, such as prompt leakage or data poisoning. Prompt leakage is when a model exposes information that it shouldn't, and data poisoning is when a model's data is contaminated with inappropriate material. In this lesson, we're going to focus on the safety aspect.

So why safety? Safety ensures that users feel safe while playing and exploring. This is especially important because any user playing a game wants to feel control and comfort, and they want to know that they can define what level of maturity the game can have. This matters even more for GenAI, because the content is created by the AI rather than by the developer or you, so you have a little less control over what is going to be generated. But with these guardrails, you can define a level of control that makes users, and yourself as the creator of the game, feel safe.

So let's talk a little bit about the context dependence of safety. Safety can be not only user-dependent, as we just discussed, where a user can control what level of maturity they want in their game; it can also be context-dependent. A good example is a chatbot for a doctor, which might need to give only general medical advice but shouldn't be allowed to prescribe any drugs. On the other hand, a chatbot for a financial firm shouldn't really be giving medical advice at all. An example that's more relevant to us: when you're playing a game, it is pretty normal to have bandits demand that you give them all your money, and that might be perfectly appropriate. But if you're using a support chatbot and the support chatbot demands that you give it all of your money, that wouldn't be appropriate. So how do you make these dynamic guardrails?

Before I answer that question, I want to walk you through two different types of safety: user input safety and model safety. User input safety concerns the prompt that the user provides, and we create guardrails around that prompt. Model safety comes into play once that prompt is passed to the LLM and the LLM provides a response: we create guardrails around what the LLM is allowed to respond with.

One of the ways in which you can use Llama Guard is out of the box. Meta, the creator of Llama Guard, has already defined policies that you can use, and you can use Llama Guard in two ways in the Together API. One of them is as a standalone classifier, so you can use Llama Guard 2 8B as your main model, and it will have these guardrails built in. The second way that you can use it in the Together API is as a filter to safeguard responses from any of the 100-plus models available.
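To make the first approach concrete, here is a minimal sketch of calling Llama Guard 2 8B as a standalone classifier through the Together Python client. The example input and the exact model identifier string are assumptions for illustration; check Together's model list for the name available to you.

```python
# A minimal sketch (not the course's exact notebook code) of using
# Llama Guard 2 8B as a standalone classifier via the Together API.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

user_input = "Hand over all of your gold or face my blade!"  # example player input

response = client.chat.completions.create(
    model="meta-llama/LlamaGuard-2-8b",  # assumed identifier; verify against Together's model list
    messages=[{"role": "user", "content": user_input}],
)

# Llama Guard answers with "safe", or "unsafe" followed by the violated
# category codes on the next line.
print(response.choices[0].message.content)
```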
So you could use a Llama 3 8B model as your main model, but also use Llama Guard to filter out bad responses. One of the great features of Llama Guard is that it lets you define custom categories that specify what is safe depending on the context. For example, in AI Dungeon, users are allowed to choose between the following categories: safe, moderate, and mature. In the background, these settings are mapped to custom Llama Guard policies. And this is part of what we're going to teach you in this lesson: how to implement three policies corresponding to the categories above in your game.

Okay, so let's jump right into the code. First, we are going to import all necessary modules and our API key. We also want to import the get_game_state function, which is the helper function that we created in the last notebook; it pulls the game state, including character descriptions and other relevant information. So I'm just going to print the character description. We're getting Elwyn, who is a 25-year-old inventor from Eldrida's capital city, etc., etc.

So now let's write code that ensures that the character description, which is the input, adheres to safety guidelines before we use it in the game. We'll use client.chat.completions.create from the Together API with the Llama Guard 2 8B model. So we're using it as a standalone classifier, and we are printing the prompt safety check, which tells us whether the character description is safe or not. In this case, it's safe. And as you can see and read from the character description, there's nothing in there that seems compromising, so it seems like it's working the way we intended.

Okay, so now I'm going to walk you through how to build your own custom policy. To be clear, these are not AI Dungeon's policies; we are creating policies specific to this course, as examples. One of the best features of Llama Guard is being able to specify your own policies, so let's develop a custom content policy for your game. We'll be defining rules that are specific to your game's context and ensuring that all content adheres to those rules.

Let's set a content policy that is safe for everyone. To begin, we will write down the content policy. You can see that for the safe content policy, we are restricting violence and hate, sexual content, self-harm, profanity, and regulated or controlled substances. So any input or output that relates to one of these categories will be blocked. We'll now add the text that we want to check, as well as build the prompt with the embedded values. This prompt basically says: check if there is unsafe content in 'user' messages in conversations according to our safety policy with the below categories. Then we embed the safe content policy that we just defined, add the text that should be checked, and ask the model to provide its safety assessment of the above conversation, including whether it's safe or unsafe as well as a comma-separated list of the violated categories. I'll just print this prompt to show you how it looks. The next step is to call the completions API with Meta Llama Guard 2 8B to check whether the text, which was "Kill the troll", adheres to our safe content policy.
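Here is a sketch of what that custom policy, the assembled prompt, and the check could look like. The policy wording, the variable names (safe_content_policy, check_text), the prompt template, and the model identifier are illustrative assumptions rather than the notebook's verbatim code; it reuses the client from the earlier snippet and calls Together's plain completions endpoint, which accepts a raw prompt.

```python
# A sketch of a custom "safe for everyone" policy check, loosely
# following the structure described above.
safe_content_policy = {
    "policy": """O1: Violence and Hate.
Should not
- Express hateful, derogatory, or demeaning sentiments against people.
- Encourage or help people plan or engage in violence.
O2: Sexual Content.
Should not
- Depict explicit sexual activity or graphic sexual content.
O3: Self-Harm.
Should not
- Encourage or romanticize self-harm or suicide.
O4: Profanity.
Should not
- Include swear words, curse words, or other obscene language.
O5: Regulated or Controlled Substances.
Should not
- Promote or describe the use of illegal drugs, alcohol, or tobacco."""
}

check_text = "Kill the troll."

prompt = f"""[INST] Task: Check if there is unsafe content in 'user' messages
in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{safe_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {check_text}
<END CONVERSATION>

Provide your safety assessment for the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of
  violated categories. [/INST]"""

response = client.completions.create(
    model="meta-llama/LlamaGuard-2-8b",  # assumed identifier
    prompt=prompt,
)
print(response.choices[0].text)  # e.g. "unsafe\nO1"
```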
The response is that it's unsafe and that it's violating content policy section one, which is violence and hate. Moving on, let's establish a moderate content policy, something that, say, is designed to be safe for teenagers. This will enforce safety guidelines moderately and allow a little more freedom in terms of what's restricted. This content policy is now a little shorter: it only restricts violence and hate, sexual content, and self-harm. We will add the text we want to check, which will be the same one, "Kill the troll", and we will build the prompt again with the embedded values. This time I'm not going to print it, because we already know what it looks like. By running this prompt again through the completions API with the Llama Guard 2 8B model, you'll see that, oh, this is also unsafe. "Kill the troll" is clearly violence and hate, so it's marked as unsafe.

Now let's set a mature policy. This policy will be significantly less restrictive than the two policies we just went through, the safe and moderate ones; it only restricts sexual content. We will add the text that we want to check, which will be the same one, "Kill the troll", and we will build the prompt again with the embedded values. Now let's see what the model tells us. We're going to run our prompt through Llama Guard again and check whether the text meets our safety standards. And boom, it tells us this is safe. Why? Because violence and hate was not part of the mature content policy.

So now, to bring it all together, let's create some helper functions. These functions will determine whether a prompt is safe, and we will integrate these checks into the game we've already built to ensure that all interactions in your game are secure and appropriate. Let's define a function called is_safe that takes in message as a parameter, and let's build the prompt that we generated previously with the embedded content policy; we're going to use the safe content policy in this situation. Inside the function, we then get the response using client.completions.create, calling Meta Llama Guard 2 8B as the model, with the prompt defined as the prompt above. In the same function, we also pull out the result, which is response.choices[0].text. For the return value, we extract the text and strip any extraneous whitespace (that's what result.strip() is doing), and we return True if the response is safe or False if it is not.

Finally, let's run this in our game using our previously established helper functions and test it out. I'm just taking the helper functions that we created in the previous notebook: get_game_state, start_game, and run_action. We are using the main loop, which takes in message and history; if the input is not safe, we return "Invalid action." in the game, and if it's safe, we actually return the result. So we now have the game created with Gradio, and to see how it all comes together, I'm going to run our phrase "Kill the troll." As expected, because this is the safe policy, it throws "Invalid action." Whereas if I say, "Tell me what the character does next," it will give me a response, because the input that I'm providing is safe. It says: you walk through the bustling streets of Luminaria, the soft glow of crystal formations illuminating your path.
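Here is one way the is_safe helper and the main loop could be wired up. This is a sketch under assumptions: safe_content_policy and client come from the earlier snippets, run_action and game_state stand in for the helpers and state from the previous notebook, the prompt wording is illustrative rather than the notebook's exact template, and the model identifier may differ.

```python
# Sketch of the is_safe helper plus the game's main loop.
# Assumes: client and safe_content_policy from the snippets above,
# and run_action / game_state from the previous notebook's helpers.
def is_safe(message: str) -> bool:
    # Build the Llama Guard prompt with the safe content policy embedded.
    prompt = f"""[INST] Task: Check if there is unsafe content in 'user' messages
in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{safe_content_policy['policy']}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
user: {message}
<END CONVERSATION>

Provide your safety assessment for the above conversation:
- First line must read 'safe' or 'unsafe'. [/INST]"""

    response = client.completions.create(
        model="meta-llama/LlamaGuard-2-8b",  # assumed identifier
        prompt=prompt,
    )
    # Strip extraneous whitespace; the verdict's first line is "safe" or "unsafe".
    result = response.choices[0].text
    return result.strip().lower().startswith("safe")


def main_loop(message, history):
    # Block unsafe player input before it ever reaches the game model.
    if not is_safe(message):
        return "Invalid action."
    return run_action(message, history, game_state)
```

The (message, history) signature of main_loop is the shape a Gradio ChatInterface expects for its callback, so the same function can be handed straight to the chat UI built in the earlier notebook.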
So overall, in this lesson, we've taught you how to integrate safety guardrails into your game applications.