In this lesson, you will learn how to automatically improve your prompts using Llama's prompt optimization tool. Let's go!

Prompts have a huge impact on how LLMs behave. Even minor wording changes can completely change the model's output, but handcrafting prompts is time-consuming and often inconsistent. That's where Meta's open-source prompt optimization tool comes in. It helps automate the process of refining prompts, making your Llama applications more reliable and effective. The tool is available as a Python package, llama-prompt-ops. You feed it your current prompt and a few sample tasks, then choose an evaluation metric and set other parameters in a YAML configuration file. The tool uses a Llama model to suggest improved versions, runs comparisons, and gives you a final optimized prompt. This is especially useful when migrating prompts from other models or tuning for edge cases.

Let's see this in an example: optimizing a prompt that classifies customer messages for a facility management company. We begin by loading our API keys. You will also need to pip install llama-prompt-ops; since it is already installed on this platform, that line is commented out here. You can then create a sample project. The sample project created by the tool is a facility management classification task, which categorizes customer service messages by urgency level, customer sentiment, and relevant service categories.

All right, the project is now created. It has added a config.yaml file, a prompt.txt file, a dataset, and a README. The README and config.yaml are in my-project, the dataset is inside the data folder, and prompt.txt is in the prompts directory.

Let's take a look at the system prompt in the prompt.txt file. Here it is: "You are a helpful assistant. Extract and return a JSON with the following keys and values," followed by the JSON keys. So for each message that is received, its urgency, sentiment, and categories should be extracted and returned. This is the prompt we want to optimize.

Let's also look at a few lines of the dataset. For each input customer message, an answer is given that shows the urgency (high here), the sentiment (neutral for this message), and the categories, which are false for every option except specialized cleaning services, which is true.

Now let's look at the config.yaml file. Several parameters in this file affect the optimization process: the path to the initial system prompt, the dataset, the task model, which by default is set to Llama 3.3, and the proposer model, which is set to the same Llama 3.3 model. The task model is the model that uses the system prompt from prompt.txt to perform the task on all the input customer messages, returning each message's urgency, sentiment, and categories. The proposer model is the one that proposes new system prompts so that the task model's overall performance on the given dataset improves.

Let's modify this config file and set the task model to Llama 4 Scout and the proposer model to Llama 4 Maverick. Also note that the optimization strategy is set to llama. You can change it to basic for quick results, or to advanced for more extensive prompt modification and more evaluation runs; llama is the recommended strategy for optimizing prompts for Llama models. A sketch of what the edited config might look like is shown below.
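This is a minimal sketch only, following the fields described in this lesson; the exact section names, key names, paths, and model identifier strings in the file generated by llama-prompt-ops may differ, so treat the config.yaml the tool creates as the source of truth.

```yaml
# Sketch only: key names, paths, and model identifiers are illustrative and
# should be checked against the config.yaml generated by llama-prompt-ops.
system_prompt:
  file: prompts/prompt.txt        # initial system prompt to optimize

dataset:
  path: data/dataset.json         # question/answer pairs for the task

model:
  task_model: meta-llama/Llama-4-Scout-17B-16E-Instruct              # model that performs the task
  proposer_model: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8  # model that proposes new prompts

optimization:
  strategy: llama                 # or: basic (quick), advanced (more extensive, more evals)
```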
The prompt optimization tool also supports two built-in metrics: an exact-match metric for simple text matching, and a standard JSON metric for field-specific evaluation. The metric used for the default example task is a custom metric designed for customer service message categorization; it assesses the accuracy of predictions across three key dimensions: urgency level, sentiment, and service categories.

Now that the sample project is set up, we can run the prompt optimization process using the migrate command. The optimization process involves several steps. First, load the system prompt and dataset. Second, analyze the prompt's structure and content. Third, apply the optimization strategy. Fourth, evaluate the optimized prompt against the original prompt. And last, save the optimized prompt to the results directory. Let's run this and wait for it to complete.

Now that the optimization is complete, let's compare the optimized prompt with the original prompt. The optimized prompt is in the prompt field of the JSON file that was stored in the results directory. With a few lines of code we read out the optimized prompt, and we also read the prompt.txt file in the prompts directory into an original_prompt variable. To compare them more easily, we use a block of code that shows the original prompt and the optimized prompt side by side, and here is the result.

Note that besides the optimized prompt, the optimizer also generates few-shot examples. The combination of the optimized prompt and the generated few-shot examples gives you better evaluation results. The few-shot examples are pairs of questions and answers. Let's look at the question of the first few-shot example, and then at its answer: urgency low, sentiment neutral, and the categories identified as facility management issues, training and support requests, and sustainability and environmental practices. Let's see how many few-shot examples were generated: we have five. To use these five examples along with our optimized prompt, we put them all into a single string (a short sketch of this step appears in the first code block below).

Let's now use our eval function to compare the optimized and original prompts. First, load the dataset from the dataset.json file. We have 200 question-answer pairs in the initial dataset. By default, the prompt optimization tool uses the first 50% of the dataset for training and the next 20% for validation, so the test set is the last 30%. Let's take that last 30% and store it in ds_test; this gives us 60 question-answer pairs.

Let's first evaluate the original prompt on the test data. The same evaluate function that was used during prompt optimization is provided in the utils file, so we import it. Using this evaluate function, we run a loop over the test data: create messages using the original prompt, pass them to the Together AI client, get the response, pass the dataset's answer and the model's prediction to the evaluate function, and append the result to results_original (a sketch of this loop is in the second code block below).
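Here is a minimal sketch of reading the optimizer's output and assembling the few-shot examples into one string. The results filename and the few_shot_examples key below are assumptions made for illustration; the lesson only confirms that the optimized prompt sits in a prompt field and that the examples are question/answer pairs, so inspect the JSON in your results directory for the actual names.

```python
import json
from pathlib import Path

# Path to the optimizer's output; the actual filename in your results
# directory will differ (it is typically timestamped), so adjust as needed.
results_file = Path("results/optimized_prompt.json")     # hypothetical name

with open(results_file) as f:
    results = json.load(f)

optimized_prompt = results["prompt"]                      # optimized system prompt (per the lesson)
few_shots = results.get("few_shot_examples", [])          # key name is an assumption; check your JSON

# Concatenate the generated question/answer pairs into one few-shot block
# that can be appended to the optimized system prompt.
few_shot_text = "\n\n".join(
    f"Example:\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
    for ex in few_shots
)

optimized_prompt_with_examples = optimized_prompt + "\n\n" + few_shot_text
```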
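And here is a sketch of the evaluation loop for the original prompt, assuming the Together Python SDK, the evaluate helper from the lesson's utils file, and question/answer field names in dataset.json; adjust the paths, field names, and model identifier to match your setup.

```python
import json
from together import Together   # pip install together
from utils import evaluate      # evaluate(answer, prediction), provided in the lesson's utils file

client = Together()             # reads TOGETHER_API_KEY from the environment

# The original system prompt created by the sample project (adjust the path if needed).
original_prompt = open("prompts/prompt.txt").read()

# Load the dataset and keep the last 30% as the test split (60 of 200 examples).
with open("data/dataset.json") as f:
    ds = json.load(f)
ds_test = ds[int(0.7 * len(ds)):]

results_original = []
for example in ds_test:
    messages = [
        {"role": "system", "content": original_prompt},
        {"role": "user", "content": example["question"]},   # field name assumed; check dataset.json
    ]
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # task model; ID may differ on your provider
        messages=messages,
    )
    prediction = response.choices[0].message.content
    # Score the prediction against the gold answer and keep the per-example result.
    results_original.append(evaluate(example["answer"], prediction))
```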
We run this; it can take some time to complete, so we'll speed it up in post edits. All right, this is now done. Let's repeat the process, this time with the optimized prompt plus the few-shot examples we got from prompt optimization: we pass the messages to the Together AI client, get the prediction, pass that prediction and the answer field from the dataset to the evaluate function, and append the result to results_optimized. This can also take some time to finish.

All right, this is now completed for the optimized prompt plus few-shot examples. Let's look at the first element of results_optimized. For the first example in our test data, the urgency and sentiment are correctly identified, and the score for correct categories is about 0.9, meaning that all but one of the possible categories are correctly identified as true or false, with one mismatch against the answer in the dataset. The total is 0.97, the average of the category score and 1 and 1.

Let's look at the second example. For this one, the urgency is correct, the sentiment is not, and 8 out of 10 potential categories are correctly identified as true or false. The total is 0.6, the average of 0.8, 0, and 1.

Let's now compare the performance of the original prompt and the optimized prompt plus few-shot examples on all 60 question-answer pairs in the test dataset. First, we collect all the keys whose values are integers, floats, or booleans, which in our case is essentially all of the keys. We then loop over results_original and average each of the keys in float_keys (a sketch of this aggregation appears at the end of the lesson). Here are the average scores for the original prompt for categories, sentiment, and urgency, plus the total, which is the average of the previous three. Let's see the same for results_optimized. Here is the average eval using the optimized prompt and the few-shot examples generated by the prompt optimization tool. As you can see, the score on correct sentiment has improved substantially from 0.5, the correct urgency has increased from 0.86 to 0.91, and there is also a small increase in the correct categories. The total average score has increased from 0.76 to 0.84.

In this lesson, you used Llama's prompt optimization tool to optimize the system prompt for a sentiment analysis and categorization use case, and compared the optimized prompt plus the few-shot examples you got from the prompt optimizer against the original prompt. In the next lesson, you are going to use the synthetic data kit to ingest, create, curate, and save synthetic data. See you there.
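As referenced above, here is a minimal sketch of the per-key averaging used to compare the two runs, assuming results_original and results_optimized are lists of per-example dicts whose values are numbers or booleans, as in the results shown in this lesson.

```python
def average_scores(results):
    """Average every numeric or boolean key across a list of per-example result dicts."""
    # Keys whose values are int, float, or bool in the first result dict.
    float_keys = [k for k, v in results[0].items() if isinstance(v, (int, float, bool))]
    return {
        key: sum(float(r[key]) for r in results) / len(results)
        for key in float_keys
    }

print("original :", average_scores(results_original))
print("optimized:", average_scores(results_optimized))
```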