In this lesson, you will load tabular data from a CSV file and run natural language queries against it via Azure OpenAI to quickly extract information. By the end of this lesson, you will be able to reuse this agent to analyze your own CSV files. All right, let's start coding.

After completing lesson one, you can see that we are taking a step-by-step approach: you follow the instructions and implement small snippets of code one by one. In this second lesson, we will expand the scope of our activities, with the final goal of creating your database agent; we cannot implement everything all at once. Previously, you connected to Azure OpenAI and implemented a basic LangChain agent. Now, we will add a new component. Since we are building a database agent, we need to think about how to connect to data. In Python, for example, you can load a dataset from a CSV file and then interact with that data by asking questions. This is similar to a database agent, but we haven't connected to an actual database yet. In this lesson, we'll interact with CSV datasets that you can find online, or you can use your own CSV file.

Before we dive into the details, let's go over some basic concepts. I want you to understand the considerations and options available when building a database agent. Imagine you are working at your company and they ask, "How can we build a database agent to connect to our database?" Many companies want to explore their databases without writing SQL, because not everyone is familiar with it. As a software engineer, you will need to decide on a good approach.

We have our baseline model, GPT-4, with its powerful meta-prompt capabilities, which work very well linguistically. To handle SQL tasks, you could customize the model using fine-tuning, a method some companies have experimented with. Fine-tuning a model for a specific SQL task can create valuable intellectual property, but it is too complex for the purposes of this course. Instead, we will use the retrieval-augmented generation (RAG) pattern, using the database or dataset as a source. We will focus on using a CSV file as our data source, and we'll use a LangChain agent to connect to it. This method is straightforward and will be our starting point. The next lesson will apply the same approach to a SQL database. If your SQL database has an API interface, you can use RAG to connect to it; we won't cover this here, but keep it in mind for future projects.

Another option is Azure OpenAI's function calling, which allows you to define specific functions that perform tasks based on your descriptions. This is very useful for SQL tasks, since you can execute the SQL queries in the backend without exposing them directly in your code. Additionally, there is the Assistants API, a feature of the Azure OpenAI and OpenAI APIs. It adds stateful management, providing short-term memory and context to the conversation. With it, we will use function calling and the code interpreter feature, which I will explain later. In this lesson, we'll focus on the CSV file, but over the course we will explore all of these options step by step.

Let's move on to the notebook and proceed as we did in lesson one. For each lesson we prepare the environment; remember, the Jupyter notebook on the DeepLearning.AI platform has all the needed setup ready for you. We'll repeat some of what we did in the previous lesson, such as importing the libraries, setting up Azure OpenAI for LangChain, and defining the endpoint and API version.
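As a rough sketch, that setup looks something like the following. The deployment name, API version, and environment variable names here are assumptions for illustration; use the values already defined in your notebook:

```python
import os
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI

load_dotenv()  # loads AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT from a .env file

# Deployment name and API version are placeholders; match your Azure OpenAI resource.
model = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment="gpt-4",
    api_version="2024-02-01",
)
```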
Now for the interesting part: interacting with a CSV dataset. Remember, you can choose your own dataset. For example, you can download a CSV file from an open data portal and place it in your project folder, or you can connect remotely to a dataset; you can find instructions in the notebook on how to remotely access the data we are using here. For simplicity, the data is available in your lesson directory under `data` as `all-states-history.csv`. This dataset contains COVID-19 statistics for various states in the United States of America for 2020 and 2021. Let's proceed with this setup and explore the data.

Basically, we load the CSV into a pandas DataFrame and import the LangChain helper for creating a pandas DataFrame agent. So you can connect not only to Azure OpenAI, as we did before, but also to the DataFrame. The important thing is that we are creating what LangChain calls an agent, and not just any agent: we are creating a pandas DataFrame agent. We are saying we will use the LLM, the model we created before, together with the DataFrame that already contains the representation of the CSV file. And how do we perform an action on this? We pass the prompt. The invoke function is the same one we used before. Pretty simple. Here we'll ask: "How many rows are there?"

Let's keep in mind the bigger picture. You are interacting with a CSV file in the context of data science, AI, or data analysis. Usually you start with exploratory data analysis, right? You ask questions of the data: what is the maximum, the minimum, the average? How many rows, how many empty values? These are the kinds of things we usually do with tools like pandas profiling. Here, I want you to simply ask questions of the data, any questions you would like, to understand the statistical dimensions but also the nature of the data.

So let's start with a simple one, just to see how this works. I have executed this part for the DataFrame. Look at the traces; they are very interesting. You can see the interaction: entering the new agent executor chain, making an observation, checking the information, having a thought, "I now know the final answer", and the final answer is that there are 20,780 rows. "Finished chain" is the trace you need to check for any interaction with the agent. And here, what we have is the input, the question we asked before, and the output: there are 20,780 rows. Simple as that.

So this is the initial test. We have created something very basic; it's just asking for the number of rows. But obviously, when we interact with these models, we usually do something a bit more, let's say, complex. Let me give you an example. Let's create something we call the prefix and the suffix. In the prefix we say something like, "First, set the pandas display options to show all the columns, and get the column names," so the model sees the column names before answering the question. In the suffix we explain the logic of how we want the model to interact; we say, "Before giving the final answer, try another method and reflect on the answers." You can customize this. The magic is that you can prepare different kinds of prefixes and suffixes depending on how you want the model to interact, and different applications call for different text. This is just text; it's pure linguistics.
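Here is a minimal sketch of this step, reusing the `model` object from the setup above. The prefix and suffix text is illustrative, so adjust it to your liking:

```python
import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

# Load the lesson's CSV into a pandas DataFrame; fillna(0) replaces missing values.
df = pd.read_csv("./data/all-states-history.csv").fillna(value=0)

# Wrap the DataFrame in a LangChain pandas DataFrame agent driven by our LLM.
# Note: recent langchain_experimental releases may also require
# allow_dangerous_code=True, since the agent runs model-generated Python.
agent = create_pandas_dataframe_agent(llm=model, df=df, verbose=True)

# A first, simple exploratory question; verbose=True prints the agent's trace.
agent.invoke("How many rows are there?")

# Instructions placed before the question...
CSV_PROMPT_PREFIX = """
First set the pandas display options to show all the columns,
get the column names, then answer the question.
"""

# ...and behavioral guidance placed after it.
CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the Final Answer, try another method.
Then reflect on the answers of the two methods you tried and ask yourself
if it answers the original question correctly.
- If the methods tried do not give the same result, reflect and
try again until you have two methods that give the same result.
- ALWAYS, as part of your Final Answer, explain how you got to it.
"""
```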
You just want the model to understand things. We have already loaded all of this, so let me pose another question. Remember, we have the prefix, the suffix, and the question, and I'll ask something more complex: how many patients were hospitalized during July 2020 in Texas, and nationwide, as the total of all states? We can execute that, and we have three pieces of text: the prefix, the suffix, and the question. Before, we set just a single instruction; now we are building something more complex, but following the same procedure: we use invoke and we concatenate the three together. So the model will get the instructions, then the question, then the guidance on how we want it to behave. If you need some time, just pause and check the details; you can add whatever you like to the suffix. This is just one example.

So let me run it; we mustn't forget to run it. And again, we see the same kind of behavior: we enter the agent executor chain. The agent is thinking, "Hey, I need to filter the DataFrame for the month of July 2020 and then sum the hospitalizedIncrease values." Look: it even says it has to import pandas before it can continue. The agent is doing all of these kinds of things. You see a lot of information: it's saying, okay, we're looking at the date, because remember, it is July 2020, so we are checking that the date falls in July 2020, and we need to find the information related to Texas. You get all the information, and you start to see the details; take some time to analyze them.

The important thing here is that you see there are no hospitalization reports for Texas in July 2020, according to this data. In the traces we can see that the value is zero for Texas, but the total for the rest of the states is about 63,000. Why? From an exploratory data analysis point of view, you could ask: were we tracking the results for that part of the year in Texas, or any other state? Maybe that's the answer: that's why we don't have any records. And I can tell you from the file that this is the correct information.

If we go to the final part, which is the interesting one because it contains all the information, we are confirming that there were no hospitalizations tracked in the CSV file for Texas. The agent even notes that there presumably were some hospitalizations during July 2020 and that this seems unusual, but it has to go with the data given. So you see the agent is already reasoning about the context: it tells you that Texas is zero but the total is about 63,000, and so on. Then you have the explanation, because remember, in the suffix we asked for some reasoning, for an explanation. So here we get an explanation saying, roughly, "I arrived at this answer by converting the dates and filtering the data"; it explains the logic of the query it performed. That's wonderful. Again, take your time to explore this trace, because it has a lot of information, but it's pretty good. And then you see the JSON answer: it contains all the same information, everything is here and you can use it programmatically. So again, interoperability, pretty cool: you get all the information you need based on the instructions you gave.
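Putting the three pieces together, reusing the agent and the prefix/suffix sketched above, looks roughly like this:

```python
# The question from this lesson; try swapping the state or the month.
QUESTION = (
    "How many patients were hospitalized during July 2020 in Texas, "
    "and nationwide as the total of all states?"
)

# Prefix + question + suffix are concatenated into a single prompt string.
result = agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)

# The returned dict holds the original "input" and the final "output".
print(result["output"])
```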
You can try one more time with different prefixes, suffixes, and questions. I suggest you change the question: put New York, California, or Alabama, and you will start to see different results depending on the state. You could also change the date. And that's pretty cool. That's pretty much what we wanted to do, and now we can move on to lesson three. See you in a bit.