In this lesson, we'll go through what a knowledge graph is and how it helps represent and retrieve relationships from your data. You'll then explore the dataset you'll use to build a knowledge graph. Let's dive in.
So, what is a knowledge graph? Let's take a step back for a moment and reflect on relational databases and what a relational schema looks like. We're all probably familiar with a scenario like this: there's a table on the left, a table on the right, and a join table in the middle that allows for multiple connections between the persons and the products in this particular example.
So let's think about these two tables with the join table. If you ask a question like, what products has this person purchased? You'd first start with the Person table and select this person here, identified by abk. Then you do a join over to the Person_Product table, and you'd see that the Person_Product table has a join between abk and a chair, abk and a lamp, and abk and a desk. Fantastic. One more join, from Person_Product across to the Product table, and now you've got a complete answer: starting from abk, you can see the products that abk has purchased, a Chair, a Lamp, and a Desk, and you can get the details about those purchases.
What if you extended that question and asked, who else bought products that abk bought? Well, you start with the same joins going from abk to the products, then you join back to the Person_Product table, and from there back to the Person table. Now you get this person ee, who has bought some products that abk also bought. But notice that ee has also bought one extra product. Hmm. That's interesting. So we could extend this one more step and ask, what product should we recommend that abk buy next? This is the basics of a recommendation query.
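
To make those joins concrete, here is a minimal sketch using Python's built-in sqlite3 module. The three tables and the abk and ee purchases follow the example; the column names, the numeric product ids, and exactly which of abk's products ee also bought are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding the three tables from the example.
# Column names and id values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id TEXT PRIMARY KEY);
CREATE TABLE Product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Person_Product (
    person_id  TEXT    REFERENCES Person(id),
    product_id INTEGER REFERENCES Product(id)
);
INSERT INTO Person VALUES ('abk'), ('ee');
INSERT INTO Product VALUES (1, 'Chair'), (2, 'Lamp'), (3, 'Desk'), (4, 'Couch');
INSERT INTO Person_Product VALUES
    ('abk', 1), ('abk', 2), ('abk', 3),
    ('ee', 1), ('ee', 2), ('ee', 4);
""")


def products_bought_by(person_id):
    """Person -> Person_Product -> Product: two joins."""
    rows = conn.execute(
        """
        SELECT prod.name
        FROM Person AS per
        JOIN Person_Product AS pp ON pp.person_id = per.id
        JOIN Product AS prod ON prod.id = pp.product_id
        WHERE per.id = ?
        ORDER BY prod.name
        """,
        (person_id,),
    ).fetchall()
    return [name for (name,) in rows]


def others_who_bought_same(person_id):
    """Same joins, then back through Person_Product to Person."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.id
        FROM Person AS per
        JOIN Person_Product AS mine ON mine.person_id = per.id
        JOIN Product AS prod ON prod.id = mine.product_id
        JOIN Person_Product AS theirs ON theirs.product_id = prod.id
        JOIN Person AS other ON other.id = theirs.person_id
        WHERE per.id = ? AND other.id <> per.id
        ORDER BY other.id
        """,
        (person_id,),
    ).fetchall()
    return [pid for (pid,) in rows]


print(products_bought_by("abk"))
print(others_who_bought_same("abk"))
```

Notice that the second question already needs four joins, and the recommendation question adds more on top of that.
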
If you've got some purchase pattern from abk that matches a purchase pattern from some other person, or rather a group of people, the thing you want to recommend to abk is what that other group of people bought that abk has not. So the same joins apply, from abk through to the products and back to the other people who have bought those products. There are too many joins, and it's getting a little confusing. There's a nicer way to think about this and to query it.
Let's turn this into a graph. The first step is to drop all the records that don't matter. All those gray records that weren't part of the original query, let's put those aside for the moment and focus on the records that matter for this query about recommending something to abk. Get rid of all the join table records in the middle and turn them into arrows, connecting abk directly to the products he's purchased, and do the same for ee and the products they've purchased. Now we'll rearrange the records a little bit, so you've got abk on one side and ee on the other. It looks much nicer, and it's much easier to understand what's going on. Both abk and ee are connected to these products in the middle, but ee is also connected to another product that has no connection to abk.
You can start to turn this into a query by rephrasing it as a pattern match. Starting with abk on the left side, you can describe a query in a query language called Cypher, which looks a little bit like SQL, if SQL had pattern-matching abilities. So here we're going to write a pattern that describes the data records you want to find. We're going to match a data record that we'll call abk, that has a label of person, and that has a property named key whose value is abk. That's in round parentheses, so we know it's a node in graph terms.
Then we have this right-facing arrow with colon purchased in it. That's a relationship type, purchased. So we match a pattern where abk purchased, going over to another set of parentheses that we're going to call abkProducts. That's a pattern match from abk to the products that abk purchased. We can return those directly, and that's going to be the products in the middle there.
We can extend that pattern to include the other people who bought those products. Since we went from abk to the products that abk purchased, we can also have relationships coming in to those products and say, okay, there are some other people who bought those products. This pattern will find all the products that people have in common with abk.
We can extend it a little bit more to ask that recommendation question. Instead of just finding these other products, we want the other products to have a particular property: they're something that abk does not have. So we have the same pattern match from abk to some products, to some other people, and then extend that to other products. And now we include a predicate saying we want a negative pattern to be true, that abk has not purchased those other products. So: where not abk purchased those other products. Down there at the bottom, that dotted line from abk to the couch, we want that to not exist. Now we're describing a pattern that has both records that should exist and something that should not exist. The other products will be reduced to only the products that abk has not purchased. That's what we're going to recommend that he buy.
So, what really is a knowledge graph? It's a kind of database that represents information as nodes, representing things like people, products, blogs, whatever you might want to have as records in a database.
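
The recommendation pattern just described can be written out roughly as follows. This is a sketch, not the lesson's exact query: the Cypher text is held in a Python string and is not executed here (running it would need a graph database and a driver), and the uppercase spelling of the labels and relationship type and the variable names others and otherProducts are assumptions. The small Python function underneath applies the same logic to an in-memory set of purchases, where ee's exact purchases are also assumed.

```python
# Cypher for the pattern described above, kept as text only.
# Label/type casing and the names `others` / `otherProducts` are assumptions.
RECOMMEND_QUERY = """
MATCH (abk:Person {key: 'abk'})-[:PURCHASED]->(abkProducts)
      <-[:PURCHASED]-(others)-[:PURCHASED]->(otherProducts)
WHERE NOT (abk)-[:PURCHASED]->(otherProducts)
RETURN otherProducts
"""

# Each (person key, product name) pair stands for one PURCHASED relationship.
PURCHASED = {
    ("abk", "Chair"), ("abk", "Lamp"), ("abk", "Desk"),
    ("ee", "Chair"), ("ee", "Lamp"), ("ee", "Couch"),
}


def recommend(person, purchased=PURCHASED):
    """Products bought by people who share a purchase with `person`,
    minus the products `person` already has."""
    # (person)-[:PURCHASED]->(mine)
    mine = {prod for who, prod in purchased if who == person}
    # (mine)<-[:PURCHASED]-(others)
    others = {who for who, prod in purchased if prod in mine and who != person}
    # (others)-[:PURCHASED]->(other_products)
    other_products = {prod for who, prod in purchased if who in others}
    # WHERE NOT (person)-[:PURCHASED]->(other_products)
    return sorted(other_products - mine)


print(recommend("abk"))
```

Each step of the function lines up with one hop of the pattern, and the final set difference plays the role of the negative predicate.
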
But then there are also relationships, which aren't simply a convention of using a join table. They're first-class citizens: they're actual data records inside the database, with a semantic meaning that represents how two nodes are connected, and they add information about those two nodes. Both nodes and relationships have key-value properties. Nodes can have multiple labels. Relationships always have a direction, and they have a single type. And the query language you use for pattern matching is called Cypher.
It turns out that this is super nice for mapping to natural language, and really convenient for working with GenAI and LLMs, because LLMs are really good at natural language. If we go back for a moment and look at that query at the top, you can read it out loud and make sense of it: match abk, who's a person whose key is abk, who bought some products that some other people have bought; they've bought some other products, where abk has not purchased those other products.
The other thing that's interesting about knowledge graphs is that they're really convenient for combining unstructured data with structured data. Whatever the structured data is, products or people or anything else, you can add any kind of chunked text, create a vector representation of it, store that in the database as well, and combine vector similarity search and pattern matching for all kinds of powerful access patterns.
For example, say you want to do root-cause analysis. Imagine you're a furniture manufacturer looking to understand customer complaints about your products. You turn to your team of experts, the data engineering and analysis team, and say, hey, could you do some root-cause analysis? I've heard a lot of complaints out on the internet about our products. Let's try to figure out what's going on.
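
As a rough illustration of how vector similarity search and pattern matching could combine for a task like this, the sketch below first finds the review chunk closest to a query vector and then follows a relationship from that chunk to a product. Everything here is a toy assumption: the chunk texts, the three-dimensional vectors (real ones would come from an embedding model), and the MENTIONS relationship type.

```python
import math

# Toy review chunks with made-up 3-dimensional embeddings.
CHUNKS = {
    "review-1": {"text": "The table wobbles.", "vector": [0.9, 0.1, 0.0]},
    "review-2": {"text": "The lamp is lovely.", "vector": [0.1, 0.9, 0.1]},
}

# (chunk id, relationship type, product name); MENTIONS is a hypothetical type.
RELATIONSHIPS = {
    ("review-1", "MENTIONS", "Table"),
    ("review-2", "MENTIONS", "Lamp"),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def products_for_query(query_vector):
    # Step 1: vector similarity search over the chunk embeddings.
    best_chunk = max(
        CHUNKS, key=lambda cid: cosine(CHUNKS[cid]["vector"], query_vector)
    )
    # Step 2: pattern match (best_chunk)-[:MENTIONS]->(product).
    products = sorted(
        end
        for start, rel, end in RELATIONSHIPS
        if start == best_chunk and rel == "MENTIONS"
    )
    return best_chunk, products


print(products_for_query([1.0, 0.0, 0.0]))
```
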
So the root-cause analysis would include questions like, which products have the most issues? Then, based on those products, what part of the product is the actual problem? And then you might ask, is there a problem with the part itself? Perhaps other things are going on, and people just don't like the product. What you're looking for here are corrective measures you can take, so you do root-cause analysis to find something you can improve, such as something in the manufacturing process. If it's a design problem, you could turn that over to the design team, and they could take a look at what people are complaining about. But if it's a manufacturing problem, or a problem with a part somewhere, you can do some analysis on that, find the problem part, and either swap out the part for a new one or figure out how to improve the process itself.
So as a data engineer, if you're handed this task, you would start with the usual thing and ask some clarifying questions. First, why are we really doing this? You want to be able to understand whether there are possible manufacturing problems in the supply chain. Cool. Okay. The second question is, if that's what we want to find out, what data is available for this analysis? It turns out that in this scenario we've got a bill of materials, some CSV files that connect each product all the way to the suppliers for that product. Maybe they came from some spreadsheets, or maybe from some relational database that isn't really suited for doing analysis. That's okay too. You also have some data from user reviews that you scraped from websites on the internet. You want to be able to combine those into one dataset that you can then analyze. To do that analysis, you're going to build a knowledge graph that connects all the data. So here's what the target data model is going to look like.
On the left, in the beige boxes over there, you see the CSV files that are available. Each one of those is going to turn into nodes or relationships in the graph. That's going to be what we call the domain graph. On the right-hand side, at the bottom, you'll also see some beige boxes that represent markdown files. Those markdown files are going to be chunked up, and they're going to create some documents inside the graph as well. And then all of this is going to get connected. We'll look at some of the details as we go through the agent descriptions themselves.
What I want to call out for you right now is that these are very connected, but also very distinct, parts of the graph. We've got the structured data sources on one side. That's going to turn into what we call the domain graph: the structured data that we know we can query with pattern matching. On the other side, the unstructured data sources are not going to be queried in the same way, and they're not going to be imported in the same way either. They're going to turn into what we call the lexical graph, which represents the original textual data plus the structure around that textual data.
To connect those two pieces, the unstructured data and the structured data, we've got this part in the middle that we're going to call the subject graph. The subject graph is going to hold subjects, which are entities that we pull out of the chunks. The chunks will have some text that talks about products, and maybe there'll be user names and things. All of those might be entities. We're going to find those and call them subjects. And those subjects are going to be connected to objects, because the subjects talk about something. So, some user, let's say it's me, user abk, loves this table.
So there'll be a subject, predicate, object in the subject graph that says abk loves some table. And because abk loves the table, the table can correspond to a product that we have in the structured data. If what abk mentions is a table, and the table is a product that exists in the structured data, we can then connect the entities we've extracted from the text all the way over to the structured data that we've loaded from the CSV files. Don't worry. We're going to do this step by step. At a high level, this is what you're going to end up with: a graph with subgraphs within it, a domain graph, a subject graph, and a lexical graph.
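
To picture how the three subgraphs fit together, here is a minimal in-memory sketch. It is not the course's implementation: the Node and Graph classes, the file name, and the relationship types PART_OF, MENTIONED_IN, LOVES, and CORRESPONDS_TO are all hypothetical, the (abk, loves, table) triple is written by hand rather than extracted from text, and entities are linked to products by a simple key match.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    graph: str  # which subgraph: "domain", "lexical", or "subject"
    label: str
    key: str


@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    rels: set = field(default_factory=set)  # (start Node, type, end Node)

    def add_rel(self, start, rel_type, end):
        self.nodes.update((start, end))
        self.rels.add((start, rel_type, end))


g = Graph()

# Domain graph: structured data that would come from the CSV files.
# Parts and suppliers from the bill of materials are left out for brevity.
table = Node("domain", "Product", "table")
g.nodes.add(table)

# Lexical graph: a markdown review file split into chunks.
doc = Node("lexical", "Document", "table_reviews.md")
chunk = Node("lexical", "Chunk", "table_reviews.md#0")
g.add_rel(chunk, "PART_OF", doc)

# Subject graph: the triple (abk, loves, table) found in that chunk.
user = Node("subject", "Entity", "abk")
thing = Node("subject", "Entity", "table")
g.add_rel(user, "LOVES", thing)
g.add_rel(user, "MENTIONED_IN", chunk)
g.add_rel(thing, "MENTIONED_IN", chunk)


def link_entities_to_products(graph):
    """Connect subject-graph entities to domain products with the same key."""
    products = {
        n.key: n
        for n in graph.nodes
        if n.graph == "domain" and n.label == "Product"
    }
    for n in list(graph.nodes):
        if n.graph == "subject" and n.key in products:
            graph.add_rel(n, "CORRESPONDS_TO", products[n.key])


def chunks_about(graph, product):
    """Walk product <- entity -> chunk across the three subgraphs."""
    entities = {
        s for s, t, e in graph.rels if t == "CORRESPONDS_TO" and e == product
    }
    return sorted(
        e.key for s, t, e in graph.rels if t == "MENTIONED_IN" and s in entities
    )


link_entities_to_products(g)
print(chunks_about(g, table))
```

Once the link exists, a question about a product in the domain graph can reach the review text in the lexical graph by passing through the subject graph.
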