In this lesson, you will use the Llama Synthetic Data Kit to create high-quality datasets for training and fine-tuning. You will learn how to ingest, create, curate, and save data in just a few steps. Let's create some data.

When building LLM applications, we often run into data problems: the data we have may not be quite what we need, or there may simply not be enough of it. Synthetic data offers a flexible alternative: it lets us generate exactly the examples we want and need. Synthetic data is especially useful for model distillation, fine-tuning, or even just stress testing your model on specific edge cases. The Llama Synthetic Data Kit is a command-line tool that makes it easy to generate high-quality training data using the Llama models themselves. Using the tool, you can generate different kinds of datasets, from Q&A pairs to chain-of-thought and summarization data. Let's learn how this works by ingesting a PDF file, creating data from it, curating it by removing low-quality examples, and finally saving it in different formats. Rough sketches of the commands behind each of these steps appear at the end of this lesson.

We will begin by loading our API keys. To use the Synthetic Data Kit, you need to set an environment variable named API_ENDPOINT_KEY, or another Llama cloud provider's API key. You also need to pip install synthetic-data-kit; since it is already installed on this platform, the install command is commented out here.

Extracting relevant data from real-world documents such as PDFs, web pages, and videos is complex and error-prone. Synthetic Data Kit helps you ingest different files, including PDF files and web pages, and extract plain-text data from them. Here we have a paper with 27 pages. With just this single line, Synthetic Data Kit will extract the text and save it. Let's take a look at the last ten of the first 50 lines of the extracted text. And here is the text. Synthetic Data Kit can extract text from web pages too. Here we have a web page on Meta's website. Let's use Synthetic Data Kit to extract the text from this web page. The extraction is done and the result is saved here. Let's take a look at a few lines of the extracted text. And here is the result.

Let's now create a question-and-answer dataset based on the paper we ingested and extracted text from. This is as easy as a single line: using the create command, we pass the path of the extracted text, and for the type of dataset we want to create, we choose QA. There are other dataset types you can choose, including chain of thought and summarization; by changing this qa to cot, you can create a chain-of-thought dataset. Let's run this. The process continues until all the data is generated, and finally you'll see where your data is saved. Let's take a look at this JSON file. Here's the result: a summary of the document, followed by the generated Q&A pairs. Here is another pair, and one more pair here.

In certain use cases, you will need to filter your data and remove low-quality examples from your dataset. You can do this using the curate command from the Synthetic Data Kit. You pass the created data along with a quality threshold, a number between 1 and 10. The higher the threshold, the higher the quality of the retained data, but the lower the retention rate. Let's try a threshold of eight on our generated data. Now the cleaning process runs, and as you can see, the 48 pairs in the created dataset were evaluated and rated, and the 45 of them that scored at or above the threshold of eight were retained. The clean data is saved here. Let's take a look. A rating has now been calculated for each of the curated pairs.
Only pairs with a rating greater than or equal to eight are kept. For each Q&A pair you also have the conversation, with system, user, and assistant messages. And at the end, you have metrics including the total number of questions, the number that passed the given rating threshold, the retention rate, and the average score of the retained questions.

The save-as command converts the curated dataset into different file formats. It supports four popular output formats (JSON, Alpaca, FT, and ChatML) and two storage formats (JSON and HF, for Hugging Face). Let's take our clean Q&A pairs from the previous step and save them in the JSON output format using the JSON storage format. Let's run this. The file in JSON format with JSON storage is saved here. Let's see the first ten questions in this file. And here's the content of the JSON file. By changing the format to other options, including FT, you can save your data in other formats. After running this, the FT version of the data is saved in this JSON file. Let's take a look at a few lines from this file. And here's the FT format of our Q&A dataset. I encourage you to change the format to Alpaca and ChatML and compare the results.

There is a default config file for the Synthetic Data Kit. Let's take a quick look at this file. This configuration file has all the settings you need for synthetic data generation. For example, when the data was generated using the create command, it was automatically stored in the data/generated path. You can control the paths throughout all the stages, from parsing to the final saving. There are many more settings here. I encourage you to review this config file and change any of the settings to enhance your synthetic data creation process. For example, there are parameters that control the creation step and parameters that control the curation process.

In this lesson, you learned how to generate synthetic data using the Synthetic Data Kit. I encourage you to try other data types, like chain of thought, and try saving to different formats.
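
As promised above, here are rough sketches of the commands behind each step of this lesson. First, the setup step. The package name and the API_ENDPOINT_KEY variable come from the lesson itself; the system-check subcommand is an optional extra whose availability may depend on your version, so verify with synthetic-data-kit --help.

```bash
# Install the kit (already installed on the course platform, so this line
# stays commented out there)
pip install synthetic-data-kit

# API key for a hosted Llama endpoint; API_ENDPOINT_KEY is the variable name
# used in this lesson, so substitute the key for your own provider
export API_ENDPOINT_KEY="your-api-key-here"

# Optional: verify the kit can reach its configured LLM backend
synthetic-data-kit system-check
```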
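
The ingest step is one command per source. The file name, the URL, and the output location below are illustrative placeholders, not the exact ones used in the lesson.

```bash
# Extract plain text from the 27-page PDF; the parsed text is saved as a
# .txt file in the kit's parsed-output directory
synthetic-data-kit ingest documents/paper.pdf

# The same command works on a web page (placeholder URL shown)
synthetic-data-kit ingest "https://ai.meta.com/blog/..."

# Inspect the last ten of the first 50 lines of the extracted text
head -n 50 data/output/paper.txt | tail -n 10
```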
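
Creating the dataset is likewise a single line. The --type option is what the lesson switches between qa and cot (with summary as a third choice); paths are again illustrative.

```bash
# Generate Q&A pairs from the parsed text; the result is a JSON file under
# data/generated/ containing a document summary plus the Q&A pairs
synthetic-data-kit create data/output/paper.txt --type qa

# Change the type to produce a chain-of-thought dataset instead
synthetic-data-kit create data/output/paper.txt --type cot

# ...or a summarization dataset
synthetic-data-kit create data/output/paper.txt --type summary
```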
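
The curation step takes the generated JSON plus a quality threshold between 1 and 10. Here is a sketch using the threshold of eight from the lesson; the flag spelling (--threshold) and the file name are assumptions to double-check against your version.

```bash
# Rate every generated pair with the model and keep only pairs rated >= 8;
# in the lesson this retained 45 of the 48 generated pairs
synthetic-data-kit curate data/generated/paper_qa_pairs.json --threshold 8.0
```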
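
The save-as variations from the lesson look roughly like this. The format and storage identifiers follow the lesson's description (JSON, Alpaca, FT, and ChatML output formats; JSON and HF storage); exact spellings can differ slightly between versions, so check --help, and the input path is illustrative.

```bash
# JSON output format, stored as a plain JSON file on disk
synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json \
  --format json --storage json

# The fine-tuning (FT) chat format inspected in the lesson
synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json \
  --format ft --storage json

# Try alpaca and chatml yourself and compare the results
synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json \
  --format alpaca

# Storage can also be a Hugging Face dataset directory
synthetic-data-kit save-as data/cleaned/paper_qa_pairs_cleaned.json \
  --format ft --storage hf
```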
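
Finally, the config file. If I recall the CLI correctly, a customized copy of the default YAML config can be passed with a -c/--config option, which lets you change the data paths and the creation and curation parameters; treat the flag and the file name below as assumptions and confirm them with --help.

```bash
# Run the create step with a customized configuration file (hypothetical
# file name); the config controls paths, creation, and curation settings
synthetic-data-kit -c my_config.yaml create data/output/paper.txt --type qa
```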