In this lesson, you'll be taking a look at the basics of Pydantic data models, which is to say how you can create them and how they work. You'll be doing this in the context of validating user input. So we're going to run with that same scenario we've been talking about already, the customer support system, where the first thing that happens is a user fills out a form that includes their name, their email, and the text of their request. And then the first step in your system is to validate that the user input matches your expectations, for example, that the email address is formatted correctly. So, let's take a look at how that works.

All right, so to get started, you're going to do some imports from pydantic. You're importing BaseModel, which is going to be your starting point for really any Pydantic data model. BaseModel has all kinds of built-in functionality for data validation, and you can build on top of it to customize your data models. You'll also see how we're going to use ValidationError to catch errors, and EmailStr is a data type you'll use in your models. And then you're importing json to do some JSON parsing later on.

Next, you're going to define a Pydantic data model, in this case a user input model. So this is a class called UserInput that inherits from BaseModel. And then you have three fields, name, email, and query, where name and query are set to be strings, and email is this EmailStr type from Pydantic. And with that, you can create an instance of your UserInput model like this, where you say user_input equals your class UserInput, where name is set to Joe User, email is set to
[email protected] and query is I forgot my password. And then you can print that out. So here you have something that doesn't look very exciting, but the cool thing that's happened behind the scenes, when you create an instance of your Pydantic data model like this, is that Pydantic has validated that the data you put in matches the model's expectations: in this case, that name is a string, email matches this EmailStr format, and query is also a string. So you know you're working with valid data.

Now, you can try creating another instance of your model where you have email set to not-an-email, so invalid data in this case. What happens when you run that? You get a ValidationError. So if you scroll all the way down to the bottom of this mess, you see ValidationError: one validation error for UserInput, email, value is not a valid email address. An email address must have an at sign. Okay, so apparently Pydantic is looking for an at sign in this email string format. So you can try putting in an at sign and see what happens. Let's run that. You get another validation error. Okay, the value for email is still not a valid email address: the part after the at sign is not valid, it should have a period. So here's another clue as to what's going on inside the email string format. So let's add a period somewhere after the at sign and run that. Okay, yet another validation error. And now it says an email address cannot end with a period. Okay, but we're making progress here. Let's try putting something after the period and running that. Okay, that worked.

So this little experiment gives you a sense of what Pydantic is doing behind the scenes, in this case, to check that your email field matches the expectations of that email string format. So now you can see that it's looking for an at sign somewhere in there, a period somewhere after the at sign, and then something that comes after the period.
So this is not saying that you now have a valid email address that's actually going to go somewhere. It's just saying, okay, this matches the basic expectations for an email string format. And so EmailStr is a built-in data type in Pydantic, but you can define your own string patterns or other specific data types for whatever field you want in your model.

And so rather than running headfirst into validation errors like we were just doing with EmailStr, the next thing you're going to do is create a function. Here you have a function called validate_user_input where you're going to pass in input_data. In this case, this is going to be a Python dictionary. And then you're going to try parsing that input data into your UserInput model, which just means unpacking that Python dictionary into your UserInput model. If that works, you'll say valid user input created and dump out the JSON representation of the contents of your model. And if there's a problem, you'll say validation error occurred and print out the contents of the error. And so this is more like what you might be doing inside of a software application, where you're trying to create an instance of your model, and if there's an issue with that, you want to capture that error and decide what to do next. So let's define that.

And then you can try using that function to create an instance of your data model. So here you have input data defined as a Python dictionary. This is just the same input that we started with up top. And then you're using the validate_user_input function to validate the input data that you've got here and print out the contents of the model. When you run that, it says valid user input created and prints out the contents of the model. Great.

So then you can start experimenting with some other versions of user input. In this case, you have input data that contains name and email, but not a query field. So what happens when you run that?
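In code, the function described above might look something like this. It is a sketch under a few assumptions: Pydantic v2, a return value of the model instance or None (the lesson only says the function prints), and email typed as a plain str so the example needs nothing beyond pydantic itself (the lesson uses EmailStr). The email address is a placeholder. The second call leaves out query, which is the case being asked about.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str


def validate_user_input(input_data: dict) -> Optional[UserInput]:
    """Try to build a UserInput and report the outcome instead of raising."""
    try:
        # Unpack the dictionary into keyword arguments for the model.
        user_input = UserInput(**input_data)
        print("Valid user input created:")
        print(user_input.model_dump_json(indent=2))
        return user_input
    except ValidationError as e:
        print("Validation error occurred:")
        for error in e.errors():
            print(f"  - {error['loc']}: {error['msg']}")
        return None


# Same input as the first example in the lesson (placeholder email address).
complete = validate_user_input(
    {
        "name": "Joe User",
        "email": "joe.user@example.com",
        "query": "I forgot my password.",
    }
)

# Input with the query field left out.
missing_query = validate_user_input(
    {"name": "Joe User", "email": "joe.user@example.com"}
)
```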
Well, you get a validation error: the query field is required. So it turns out that when you define a Pydantic data model as you did above with these three fields, each of those fields is going to be required in your model. And it turns out you can include optional fields as well. So that's what we're going to do next here.

So here you're going to do a few more imports. From pydantic you're importing Field, from typing you're importing Optional, and from datetime you're importing date, and these are what you're going to use to customize your model a bit further.

Now you're going to define a new version of UserInput that has the same three fields at the top here, but now with a new field called order_id that's defined to be an optional field. It's going to be an integer. And then with this Field object, you can add some more customization and definition to, in this case, the order_id field. So you're saying the default value is going to be None. This is a description of what this field is all about: it's going to be a five-digit order number that cannot start with zero. And then these are a couple of rules that you're setting in order to put constraints around what the input can be. So it's saying this has to be an integer greater than or equal to 10,000 and less than or equal to 99,999. So that enforces this rule of it being a five-digit order number that cannot start with zero. And then you have purchase_date, which is another optional field, and it's going to be this date object from datetime with a default of None. So let's define that.

And then you can try out this new UserInput model with some new input. So here you have the exact same input that you started with up top, and you're running it through validate_user_input again, now with your new UserInput model. And you get no problem at all, even though the optional fields were omitted in the input data.
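The extended model described above could be written along these lines. This is a sketch assuming Pydantic v2, with ge and le as the two constraints on order_id; email is typed as a plain str here (the lesson uses EmailStr) so the example runs with pydantic alone, and order_id_is_accepted is a hypothetical helper used only to probe the boundaries.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str
    # Optional field: defaults to None, otherwise must be 10000..99999.
    order_id: Optional[int] = Field(
        None,
        description="5-digit order number (cannot start with 0)",
        ge=10000,
        le=99999,
    )
    # Optional field with no extra constraints.
    purchase_date: Optional[date] = None


def order_id_is_accepted(order_id: int) -> bool:
    """Hypothetical helper: True if the model accepts this order_id."""
    try:
        UserInput(
            name="Joe User",
            email="joe.user@example.com",
            query="I forgot my password.",
            order_id=order_id,
        )
        return True
    except ValidationError:
        return False


# Omitting the optional fields is fine; they fall back to None.
minimal = UserInput(
    name="Joe User",
    email="joe.user@example.com",
    query="I forgot my password.",
)
print(minimal.order_id, minimal.purchase_date)

# Probe just below, at, and just above the allowed range.
for candidate in (9999, 10000, 99999, 100000):
    print(candidate, order_id_is_accepted(candidate))
```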
And so you have valid user input created with name, email, and query populated, and order_id and purchase_date as null. So one thing that might look confusing here is that these are printed out as null when up here you've defined the default to be None. It turns out that this is just the JSON representation of the contents of your model. So if you go back up here and have a look at validate_user_input, you see that what you're doing is using this built-in model_dump_json method of your Pydantic data model, and that prints out the JSON representation of your model. If you go back down here and print out the actual contents of the model itself, so user_input, you see that in fact order_id is None and purchase_date is None. So this is the Python representation, and this is the JSON representation of the same data in your model.

And so now you can play around with this new UserInput model a bit. Let's try user input data that includes all five fields, in this case with valid values for order_id and purchase_date, and run that. Okay, great. So now you have valid user input that contains all five fields.

And it's also interesting to take a look at what happens when you have some additional fields. So now you have input data that contains all five fields you expect and a couple of fields you don't expect, system message and iteration. What happens when you run that? Well, no problem. Valid user input created, and Pydantic just ignored these extra fields. So this is a feature of Pydantic, and it's actually a really common way in which people use Pydantic. You might have data in your system that's coming to you in the form of a Python dictionary or JSON data that has a zillion different fields, some of which you care about and some of which you don't, and you can define a Pydantic data model that grabs the fields you care about, validates those, and just ignores the fields that you don't care about.
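Both points from this part of the lesson, None versus null and the ignored extra fields, can be seen side by side in a short sketch. It assumes Pydantic v2 with its default handling of unknown fields. The key spellings system_message and iteration, and all the values, are made up for illustration, and email is typed as a plain str (the lesson uses EmailStr) so that only pydantic is needed.

```python
import json
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str
    order_id: Optional[int] = Field(None, ge=10000, le=99999)
    purchase_date: Optional[date] = None


# Python representation vs JSON representation of unset optional fields.
minimal = UserInput(
    name="Joe User",
    email="joe.user@example.com",
    query="I forgot my password.",
)
print(minimal)                    # order_id and purchase_date show as None
minimal_json = minimal.model_dump_json()
print(minimal_json)               # the same fields show as null

# Input with all five expected fields plus two unexpected ones
# (hypothetical key names and values).
raw = {
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "I bought a keyboard and mouse and was overcharged.",
    "order_id": 12345,
    "purchase_date": date(2025, 12, 31),
    "system_message": "logging status regarding order processing...",
    "iteration": 1,
}
user_input = UserInput(**raw)

as_dict = user_input.model_dump()       # Python values, extras dropped
as_json = user_input.model_dump_json()  # JSON text, date becomes a string
print(as_dict)
print(as_json)
```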
So in this case, you have valid user input where you just ignored system message and iteration. One thing that you might have noticed here is that purchase_date is getting printed out as a string representation of the date, and this looks different from the date object that you're putting in. So what's going on there? Well, it turns out this is just another example of the JSON representation versus the Python representation. So if, in this case, you go ahead and print out what the user_input model contains, then you see that purchase_date is a date object just like you put in.

And so then it can be interesting to look at what happens when you put in input where the purchase date is that string representation instead. Turns out, no problem. You can put in a date that looks like this or a date object, and what's happening behind the scenes here is that Pydantic is converting that string into a date object. And this is an example of what's called data type coercion. So Pydantic in this case is handling multiple different formats for a date, and this is a feature of Pydantic that allows you to be a little bit more flexible with what the inputs can look like.

So Pydantic has a number of different data types where there's automatic coercion like this, and you can try it for other data types as well. For example, if you put in your order_id as a string and run that, you see also no problem at all. Pydantic can handle a string representation of an integer and automatically convert that for you. So data type coercion is something that happens automatically for certain data types and certain formats, and it's something that you can customize yourself if you want to have a specific set of formats be acceptable for a field in your model.
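Here is a small sketch of the coercion just described, assuming Pydantic v2 in its default (non-strict) mode. The order number and date are made-up values, and email is typed as a plain str (the lesson uses EmailStr) to keep the example dependent on pydantic alone.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str
    order_id: Optional[int] = Field(None, ge=10000, le=99999)
    purchase_date: Optional[date] = None


# Both optional fields are passed as strings rather than int / date.
user_input = UserInput(
    name="Joe User",
    email="joe.user@example.com",
    query="I forgot my password.",
    order_id="12345",
    purchase_date="2025-12-31",
)

# After validation the model holds the coerced Python types.
print(type(user_input.order_id).__name__, user_input.order_id)
print(type(user_input.purchase_date).__name__, user_input.purchase_date)
```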
With Pydantic you also have the option of turning off data type coercion if you want to be very strict about the format that you're accepting in a particular field. So in this case, you've seen that you can put in an integer as a string, but what about putting in an integer where a string is expected? So the name field here is defined to be a string field, and if you try it with 99999, then you'll find that you get a validation error on name: the input should be a valid string. So in this case, strings can be coerced into integers, but integers will not be coerced into strings.

And next, you can take a step in the direction of where we're going with validating LLM output and start with JSON data. So in this case you're defining a JSON string that contains the fields for your model, and then you can parse that into a Python dictionary and print out what you get. So in this case, you've parsed the JSON into a Python dictionary, and then you can go ahead and run your validate_user_input function on that Python dictionary and see that you have valid user input. So when we move on to looking at LLM responses, you'll be getting data as a string from the LLM and you'll be using that to populate your Pydantic data model.

And so you can play around with different JSON input data. For example, in this case, data that contains an order ID that starts with zero. So if you recall, the rule for order ID is that it must not start with zero. You can go ahead and parse that into a Python dictionary and see that there's no problem with that. Of course, this is valid JSON data and so there's no problem loading it into a Python dictionary, but then when you try to populate an instance of your UserInput model, you run into a validation error that order_id should be greater than or equal to 10,000. So it's sort of a two-step process when you're starting with a string representation of JSON data: first you parse that JSON, and then you populate an instance of your model.
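The three ideas in this part of the lesson can be sketched together, assuming Pydantic v2. Several things here are assumptions rather than the lesson's own code: StrictOrder is a hypothetical model that uses ConfigDict(strict=True) as one way of switching coercion off, is_valid is a hypothetical helper, the order ID is written as the JSON string "01234" (a guess at what the lesson's JSON looked like), and email is a plain str where the lesson uses EmailStr.

```python
import json
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str
    order_id: Optional[int] = Field(None, ge=10000, le=99999)


class StrictOrder(BaseModel):
    """Hypothetical model with coercion switched off for every field."""

    model_config = ConfigDict(strict=True)
    order_id: int


def is_valid(model_cls, data: dict) -> bool:
    """Hypothetical helper: True if model_cls accepts the dictionary."""
    try:
        model_cls(**data)
        return True
    except ValidationError:
        return False


base = {
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "I forgot my password.",
}

# 1. Default mode: an integer is not coerced into the string field name.
int_as_name_ok = is_valid(UserInput, {**base, "name": 99999})

# 2. Strict mode: a numeric string is no longer coerced into an int.
strict_str_ok = is_valid(StrictOrder, {"order_id": "12345"})
strict_int_ok = is_valid(StrictOrder, {"order_id": 12345})

# 3. Two-step process for JSON text: parse first, then validate.
json_data = (
    '{"name": "Joe User", "email": "joe.user@example.com", '
    '"query": "I forgot my password.", "order_id": "01234"}'
)
parsed = json.loads(json_data)              # step 1: valid JSON, loads fine
json_dict_ok = is_valid(UserInput, parsed)  # step 2: fails model validation

print(int_as_name_ok, strict_str_ok, strict_int_ok, json_dict_ok)
```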
But the way you're going to do that in practice is to use this built-in model_validate_json method from Pydantic. So here you're taking your UserInput model and applying the model_validate_json method directly on that json_data. And of course, when you run that, you get a validation error because you have a problem with the order_id field. But what's happening behind the scenes here is you're doing this two-step process of first parsing the JSON and then populating your data model with the contents of the JSON. So you can go up and fix the issue. In this case, I could just turn this to a one, redefine that JSON string, and run this last cell again to validate the data that's in that JSON string.

You can also play around with what happens in this case if the JSON itself is not valid. So for example, if you're missing one of these brackets, and I'm just going to comment out these other lines for the moment, and run that to define new JSON data that has a problem. Then, when you run model_validate_json down here, you get a different type of validation error, in this case a JSON end-of-file problem, and it tries to describe what happened with the problem loading the JSON. So you'll get different types of validation errors depending on whether you have issues with the JSON input or issues with the data that's contained in the JSON. And that's where we're going next: validating the responses you get back from an LLM using this model_validate_json method from Pydantic.

And there you have it. Those are the basics of Pydantic data models. You saw how you can create a data model by defining a class that inherits from BaseModel, and then defining the fields and the data types for that class. And you saw how you can populate an instance of your data model using either a Python dictionary or JSON data as input. In the next lesson, you're going to take these skills and validate the output you're getting back in the response from an LLM. I'll see you there.
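To recap the final step in code, here is a sketch of model_validate_json handling well-formed JSON with bad data versus JSON that is itself broken. It assumes Pydantic v2. The JSON strings are made up, error_types is a hypothetical helper that collects the type of each reported error, and email is typed as a plain str where the lesson uses EmailStr.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class UserInput(BaseModel):
    name: str
    email: str  # simplification: the lesson uses EmailStr here
    query: str
    order_id: Optional[int] = Field(None, ge=10000, le=99999)


def error_types(json_text: str) -> List[str]:
    """Hypothetical helper: error types raised for this JSON text, if any."""
    try:
        UserInput.model_validate_json(json_text)
        return []
    except ValidationError as e:
        return [error["type"] for error in e.errors()]


good_json = (
    '{"name": "Joe User", "email": "joe.user@example.com", '
    '"query": "I forgot my password.", "order_id": 12345}'
)
# Well-formed JSON whose order_id breaks the model's rules.
bad_order_json = good_json.replace("12345", '"01234"')
# Broken JSON: the closing brace is missing.
malformed_json = good_json[:-1]

good_errors = error_types(good_json)
bad_order_errors = error_types(bad_order_json)
malformed_errors = error_types(malformed_json)

print("good:", good_errors)
print("bad order_id:", bad_order_errors)
print("malformed:", malformed_errors)
```

Both failures surface as a ValidationError, but the error types differ, which is what lets calling code tell a parsing problem apart from a data problem.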