In this lesson, you will continue the knowledge graph construction. The previous lesson created the domain graph from CSV files according to the construction plan. Now you will process the markdown files, chunking them into the lexical graph and extracting entities into the subject graph. You will learn how to use the neo4j_graphrag library to perform the chunking and entity extraction, and you'll also learn about techniques for entity resolution.

Ultimately, this will not be an agentic part of the workflow; it will be handled entirely by tools. So we're going to define some tools, plus the helper functions those tools need to do their work. The first tool constructs a knowledge graph builder: we'll have one builder per file and have it process that file, doing the chunking and extraction. The second tool correlates subject nodes with domain nodes. Together, these two tools build the graph out of the markdown files and then take the resulting extracted entities and correlate them between the subject graph and the existing domain graph that was previously created from the CSV files. That might be a lot to take in right now; it'll make sense as we go through each step.

As with the other notebooks, you'll begin by importing the needed libraries. Go through the common setup, import all the libraries you need, and wait for those imports to finish. Make sure that OpenAI is ready to go, and also check that Neo4j is ready. That all looks good.

Now, as part of the setup, we need more than just the state created in the previous lessons, because the previous lesson didn't only run the workflow, it actually created part of the graph in Neo4j. Since Neo4j will be fresh when we load this notebook, we need to re-add some of the product nodes that were created in the previous lesson, and we've got a helper function for that. Load the helper function load_product_nodes and then call it. The expectation is that the only nodes that exist inside the graph have the label Product. Okay, that worked correctly. That's super.

You also need some of the initial state for this agent, and that initial state should be part of what was created by the previous parts of the workflow. It's the same kinds of things we've loaded before: the construction plan and the approved files to work with, and, because we've already done the planning for entity extraction, the approved entities and the approved fact types. So we're going to create each of those pieces of initial state: the approved construction plan, the approved files, the approved entities, and the approved fact types.

Okay. You can now start to define all of the functions needed to create a pipeline that will process the markdown files, chunking them up and then extracting the entities. The neo4j_graphrag library has a convenient SimpleKGPipeline, which you can use to do all the processing of the chunks and the entity extraction. For all the markdown files you'll be processing, you'll create some helper functions to set that up correctly. But first, let's take a look at the interface of the SimpleKGPipeline. This is just a little sample code that won't run as-is because it doesn't have valid values; it's only meant to show the shape of creating an instance of the SimpleKGPipeline.
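Something like this minimal sketch; every value here is a placeholder, and the exact parameter names can vary between neo4j_graphrag versions, so treat it as illustrative rather than the course's exact code.

```python
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder = SimpleKGPipeline(
    llm=llm,                          # LLM used for entity and relationship extraction
    driver=driver,                    # Neo4j driver the graph gets written to
    embedder=embedder,                # embedding model for the chunk vectors
    from_pdf=True,                    # True because we supply a custom "PDF" loader...
    pdf_loader=markdown_loader,       # ...that actually reads markdown files
    text_splitter=markdown_splitter,  # custom regex-based splitter
    schema=entity_schema,             # approved node types, relationship types, patterns
    prompt_template=custom_prompt,    # extraction prompt with document-level context
)
```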
It needs an LLM, of course, for doing the entity extraction. It needs a Neo4j driver to write the graph out to. You'll also need an embedder interface, which could be the same LLM provider or a different one if you want. And because we're processing markdown files instead of PDFs, you might expect from_pdf to be false; we're actually going to set it to true and supply a custom PDF loader that loads markdown files instead of PDFs. That's just a little quirk of the way the interface works. We'll also define a custom splitter, because we have some notion of what the markdown looks like. If you've done any kind of chunking before, you know that text splitting is an art in itself; there are probably multiple courses here that really dig into that topic. We're going to do a simplified version, making some assumptions about the data that we have. You can also pass in a schema, which will be based on what the previous lesson produced: the proposed schema for what kinds of facts and entities can be pulled out of the text. And then, to put all of that together, we'll pass in a custom prompt that lets the LLM know exactly what we're looking for and how to think about doing the extraction.

Let's walk through the end-to-end workflow of the Neo4j KG Builder to help you understand what these components do. First, each document is loaded. There's built-in support for PDFs, but you'll use a custom markdown loader. Next, a text splitter component performs the chunking; this is the common chunking step that happens in many different frameworks. For each chunk, the chunk embedder calculates a vector embedding. The chunk is then analyzed using an LLM, as instructed by the entity and relationship extractor. This component will be configured to align with our knowledge extraction plan. Once the graph has been constructed, an optional graph pruner can be used to clean it up. All of this happens in memory. A KG writer is the component responsible for saving the in-memory graph into Neo4j itself. Finally, an entity resolver component will merge nodes that are likely the same entity.

Okay, so now you can start to define some of the custom functions we need to pass into the pipeline. The first is a custom text splitter. If you remember the markdown files, they had an H1 header at the beginning and then page breaks separating the reviews, so we're going to use a simple regular-expression text splitter. The neo4j_graphrag library has a base class called TextSplitter that you can extend with custom functionality, and that's what we'll do here: extend TextSplitter and put our custom behavior in its run function, since that's what the pipeline calls. The run function simply splits the text using a regular expression that's passed in when the splitter is created.
Whatever regular expression is passed in becomes the set of split points within the text. The final step uses a list comprehension to take each of those plain-text splits and wrap it in the object the neo4j_graphrag library expects, called TextChunk. A TextChunk holds the text string itself and an index for that particular chunk, and all of them go into a single TextChunks result.

You'll also load the markdown files with a special data loader, and, similar to the text splitter, it extends a base class from the neo4j_graphrag library. There are a couple of loaders available out of the box, but we're going to define our own. We import the DataLoader base class along with the types needed for the output. We're going to make it look as if we've been parsing a PDF, so we also load the PdfDocument type. The key thing this MarkdownDataLoader does is pull some metadata out of the markdown that we want to add to the context when the chunking happens. It has a small utility method, extract_title, that uses a basic regular expression and assumes the first H1 headline it finds is the title of the document. The run function is the part that loads the data from the data source: it reads the entire markdown file from the file system, extracts the title, and turns that into a DocumentInfo. The DocumentInfo is the interesting part here; it's the metadata for the document. That gets combined with the text of the document, which we've loaded into markdown_text, and packaged into a PdfDocument.

The other things you need to set up the knowledge graph construction pipeline are, of course, an LLM, an embedder, and the Neo4j driver. We'll use OpenAI again for both the LLM and the embeddings; the neo4j_graphrag library has support for OpenAI and other model providers, so we just create instances of its OpenAI LLM and embedder classes. And we grab the Neo4j driver from the graphdb singleton we've been using to send queries to Neo4j; that singleton lets you grab the internal driver it uses.

Okay. You've now defined the utility functions needed for the knowledge graph pipeline, as well as some of the other components like the LLM and the embedder.
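Here is a rough sketch of those pieces. The module paths, base-class signatures, and model names are assumptions about the neo4j_graphrag version in use, so check them against your installed library; the course's actual helpers may differ in detail, and in the notebook the driver comes from the graphdb singleton rather than a fresh connection.

```python
import re
from pathlib import Path
from typing import Optional

# Import paths may vary slightly between neo4j_graphrag versions
from neo4j_graphrag.experimental.components.text_splitters.base import TextSplitter
from neo4j_graphrag.experimental.components.types import TextChunk, TextChunks
from neo4j_graphrag.experimental.components.pdf_loader import (
    DataLoader,
    DocumentInfo,
    PdfDocument,
)
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j import GraphDatabase


class RegexTextSplitter(TextSplitter):
    """Split text wherever a configurable regular expression matches."""

    def __init__(self, split_pattern: str) -> None:
        self.split_pattern = split_pattern

    async def run(self, text: str) -> TextChunks:
        pieces = re.split(self.split_pattern, text)
        return TextChunks(
            chunks=[
                TextChunk(text=piece, index=i)
                for i, piece in enumerate(pieces)
                if piece.strip()
            ]
        )


class MarkdownDataLoader(DataLoader):
    """Read a markdown file while presenting it to the pipeline as a 'PDF'."""

    def extract_title(self, text: str) -> str:
        # Assume the first H1 headline is the document title
        match = re.search(r"^#\s+(.+)$", text, flags=re.MULTILINE)
        return match.group(1).strip() if match else "Untitled"

    async def run(
        self, filepath: Path, metadata: Optional[dict] = None
    ) -> PdfDocument:
        markdown_text = Path(filepath).read_text()
        title = self.extract_title(markdown_text)
        document_info = DocumentInfo(path=str(filepath), metadata={"title": title})
        return PdfDocument(text=markdown_text, document_info=document_info)


# LLM, embedder, and driver (model names and connection details are illustrative)
llm = OpenAILLM(model_name="gpt-4o", model_params={"temperature": 0})
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
```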
Now we turn to the context required for the knowledge graph entity extraction, starting with the entity schema. The entity schema required by the neo4j_graphrag package has a couple of different components. First, it wants to know the node types it should look for inside the text, and here we'll just use an alias of the approved_entities. We also need the schema's relationship types, which we extract from the approved_fact_types. The approved_fact_types live in a dictionary where each key is the relationship type itself, so we just take the keys and those become the schema relationship types, and you can see them here.

You'll also use the approved_fact_types as what the knowledge graph builder pipeline calls schema patterns. This is really just a repackaging of the same information: from each fact we build a triple of the subject label, the predicate label, and the object label, with the predicate label uppercased, which is the convention for relationship types in Neo4j. The complete schema required by the knowledge graph builder then has the node types, the relationship types, and the patterns describing how those fit together. We also set a flag to false that says: only use the types we've defined, don't go adding new ones.

The next part of the context for the entity extraction is a custom prompt, which will be injected with both the text of each chunk and the schema. We'll also have a special utility function that adds file-level context. The pipeline processes one chunk at a time, and adding context from the overall file to the prompt lets the LLM know, when it's looking at a chunk, what document that chunk was part of. For that we have the file_context helper. All it really does is grab the first few lines of text from the file, which include the title and a bit of introductory information. That's enough context for each of the individual chunks.
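A sketch of how that schema and the file-context helper could be assembled. The internal structure of approved_fact_types (a dict keyed by predicate, with subject and object label fields) and the schema field names are assumptions, not the course's exact code.

```python
schema_node_types = approved_entities  # alias of the approved entity types

schema_relationship_types = list(approved_fact_types.keys())

schema_patterns = [
    (
        fact["subject_label"],
        predicate.upper().replace(" ", "_"),  # Neo4j relationship-type convention
        fact["object_label"],
    )
    for predicate, fact in approved_fact_types.items()
]

entity_schema = {
    "node_types": schema_node_types,
    "relationship_types": schema_relationship_types,
    "patterns": schema_patterns,
    # keep the extractor from inventing types beyond the approved ones
    # (the exact flag name depends on the neo4j_graphrag version)
    "additional_node_types": False,
}


def file_context(file_path: str, num_lines: int = 6) -> str:
    """Return the first few lines of a markdown file (title plus intro)
    to use as document-level context in the extraction prompt."""
    with open(file_path, encoding="utf-8") as f:
        return "".join(f.readlines()[:num_lines])
```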
Then you can define the prompt itself. We treat the prompt as one big block of text, following roughly the same format we've used before: a statement of the LLM's role and the goal it's working toward, followed by design hints and the context, where the context includes both the schema definition and that little slice of the file we extract. As you look through the prompt, you'll see instructions about the role and goal, design instructions about what it should do, and a specification of the output format. The schema is injected for every chunk, and in the output we ask the LLM to generate unique IDs for each node it creates, along with instructions about how to handle the properties. You can see in the output format that each node should have an ID, the label the LLM has picked, and the properties that should be assigned to that entity. This is also where the document-level context from the file gets injected. Because this helper function builds the overall prompt, the document context is inserted once as static text rather than re-injected for every chunk; only the chunk data itself is filled into the template each time.

Okay, with all the context set up and all the helper functions the knowledge graph builder needs for pipeline processing, we can make the knowledge graph builder and use it. Because we create one knowledge graph builder per file, so that each gets a specialized prompt based on that file's content, we set this up in a helper function whose output is a SimpleKGPipeline, the neo4j_graphrag class. The first thing this helper does is get the document-level context using the file_context utility we created, which grabs the first few lines of the data file. From that it creates the contextualized_prompt, passing the context in as the only argument; that gives us the full prompt we want to use. Then, when we create the SimpleKGPipeline, we pass in the LLM we defined, the driver we're using, and the embedder. We pretend we're processing a PDF because our custom PDF loader actually loads markdown. We use the custom regex splitter, and the regular expression just looks for the dash-dash-dash markdown page break that separates the reviews in the files we saw earlier. We pass the schema we assembled as the entity_schema, and finally the contextualized_prompt constructed just above. Putting all that together gives us a complete pipeline that can do the chunking and then the entity extraction.

Now that you have a helper function for generating the knowledge graph pipeline for a particular file, you just need to loop through all the files in the import directory. For each one, we get the full file path, print a short status message, create the knowledge graph builder, and then run it. This will take a few minutes: there are about ten files to process, and each one can take up to a minute or so, depending on how responsive OpenAI is, to do the natural language processing, the extraction, and, along the way, the chunking.
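A sketch of the per-file builder and the processing loop. Names like custom_prompt_template, IMPORT_DIR, and approved_files stand in for whatever the notebook actually defines, and the pipeline parameters carry the same version caveat as before.

```python
import os


def make_kg_builder(file_path: str) -> SimpleKGPipeline:
    # Document-level context from the first few lines of the file
    context = file_context(file_path)
    contextualized_prompt = custom_prompt_template(context)  # hypothetical prompt helper
    return SimpleKGPipeline(
        llm=llm,
        driver=driver,
        embedder=embedder,
        from_pdf=True,                            # custom loader reads markdown, not PDFs
        pdf_loader=MarkdownDataLoader(),
        text_splitter=RegexTextSplitter(r"---"),  # split on markdown page breaks
        schema=entity_schema,
        prompt_template=contextualized_prompt,
    )


for file_name in approved_files:
    file_path = os.path.join(IMPORT_DIR, file_name)
    print(f"Processing {file_path} ...")
    kg_builder = make_kg_builder(file_path)
    # run_async performs loading, chunking, embedding, extraction, and writing
    await kg_builder.run_async(file_path=file_path)
```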
Once this loop finishes, you end up with a complete lexical graph, the part of the graph containing the chunks, connected to each other and to a document node, and a subject graph containing all the extracted entities and how they relate to each other. With all the files processed, you now have a lexical graph, a subject graph, and a domain graph.

However, the knowledge graph isn't complete, because the subject graph and the domain graph are not connected. Recall that the domain graph was created from the CSV files, while we just created the subject graph by extracting data from the markdown files. The next step is to connect the entities we extracted into the subject graph with the domain graph: if there are products in the domain graph that refer to the same product as extracted entities in the subject graph, we want to connect them. We're going to define a couple of tools to do that.

For each type of entity in the subject graph, you'll devise a strategy for correlating it with the right node in the domain graph. For example, you should expect products with product names to exist in the subject graph, and these should correlate with the products in the domain graph. To do this, you'll first find all the unique entity labels in the subject graph. Similarly, you'll find all the unique node labels in the domain graph. Then you'll attempt to correlate the property keys between those two sets of labels, so that if there's a product with a few properties in the subject graph, you can figure out how it correlates with a product and its properties in the domain graph. The very final step is performing entity resolution by analyzing the similarity of property values.

Okay, the first step is to find the unique entity labels in the subject graph. Let's take a look at what the subject-graph nodes look like. After the neo4j_graphrag library has done its work, the resulting nodes in the subject graph carry an extra label identifying them: they're labeled with __Entity__, plus additional labels identifying what type of entity each one is. So if we run a query matching any nodes, with a predicate requiring that the node has the __Entity__ label, we can return the distinct sets of labels and call those the entity labels. Looking at the result, you can see a couple of special labels that have been added: there's __Entity__, and there's also __KGBuilder__, indicating that the KG builder is what created those nodes. But what we really care about is that we see Product, Location, Issue, and Feature.

Now, we already know this is what to expect from the previous steps. But it's important to remember that some of those steps might have intended to find things like locations, and yet none ended up in the graph because the LLM failed to find any. So instead of assuming, we query the graph for what actually happened, and the labels we find in the graph become the unique entity labels of the subject graph.

Let's walk through a couple of elaborations of this query. The first thing to realize is that each of these rows is really a list of labels. So for each row we use an UNWIND clause: UNWIND the entity_labels, the labels for every node, into individual rows called entity_label. We can then return the DISTINCT entity_label.
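Here's that progression as a sketch. The run_cypher helper is a stand-in for however the notebook sends Cypher to Neo4j; here it just wraps the driver's execute_query method.

```python
def run_cypher(query: str, params: dict | None = None) -> list[dict]:
    """Run a Cypher query and return the records as plain dictionaries."""
    records, _, _ = driver.execute_query(query, params or {})
    return [record.data() for record in records]


# Distinct label combinations on the extracted entity nodes
print(run_cypher("""
    MATCH (n)
    WHERE n:__Entity__
    RETURN DISTINCT labels(n) AS entity_labels
"""))

# UNWIND each list of labels into individual rows of single labels
print(run_cypher("""
    MATCH (n)
    WHERE n:__Entity__
    WITH labels(n) AS entity_labels
    UNWIND entity_labels AS entity_label
    RETURN DISTINCT entity_label
"""))
```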
We should see all the same values, but now, rather than a bunch of lists, we get individual rows, each with a single label: the KG builder label, the entity label, Product, and so on. You can see it right there. Let's elaborate a little further on this query, because we want to filter out those double-underscore labels. We start with the same MATCH, keep the predicate selecting only the nodes that actually are entities, UNWIND the label lists into individual entity labels, and then add another predicate that drops any label starting with an underscore. That gets rid of __KGBuilder__ and __Entity__, leaving exactly the list we're looking for: just Product, Location, Issue, and Feature. You can then define a utility function that simply wraps that call to Neo4j, and if you try it out, you should see exactly that list.

For each of those unique entity labels, we also want to find the unique property keys that nodes with that label have. So we define a similar utility function: it looks for all nodes with a particular label, and instead of collecting labels, it collects the unique keys that occur on those nodes. The function takes one argument, the entity label we're looking for. The MATCH now finds only nodes with that label, and we also require that the label co-occurs with the __Entity__ label so we stay within the subject graph. Then, rather than the distinct set of labels, we return the distinct set of keys on those nodes. As before, we UNWIND the lists of keys into individual rows, collect them back into a single list, and return that as the result.

Let's define that function and give it a run, looking for the unique keys related to Product within the subject graph. Okay, it looks like we didn't constrain this very much during the entity extraction, because the LLM found lots of different keys when it created those entities. Across the different products, here are all the kinds of properties it derived from the reviews: it figured out things like the material, and apparently the shelf depth appears somewhere; you'd have to look at the text of the markdown to see why the LLM discovered these particular properties. Depending on the review being processed during extraction, some of these properties get mentioned and others don't, so the keys won't be consistent across all product nodes, though some of them, particularly name, almost certainly will be. You'll see later how we take this list of keys and correlate it with the keys available on the domain graph; that's how we're going to sync everything up.
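For reference, a sketch of the two subject-graph utilities just described, built on the run_cypher stand-in from the previous sketch. Labels can't be parameterized in Cypher, so the label is interpolated into the query string here.

```python
def get_unique_entity_labels() -> list[str]:
    """Labels actually present on extracted entity nodes, minus the __...__ bookkeeping labels."""
    rows = run_cypher("""
        MATCH (n)
        WHERE n:__Entity__
        WITH labels(n) AS entity_labels
        UNWIND entity_labels AS entity_label
        WITH DISTINCT entity_label
        WHERE NOT entity_label STARTS WITH '_'
        RETURN collect(entity_label) AS unique_entity_labels
    """)
    return rows[0]["unique_entity_labels"]


def get_unique_entity_keys(entity_label: str) -> list[str]:
    """Distinct property keys found on subject-graph nodes with the given label."""
    rows = run_cypher(f"""
        MATCH (n:`{entity_label}`)
        WHERE n:__Entity__
        WITH keys(n) AS node_keys
        UNWIND node_keys AS node_key
        WITH DISTINCT node_key
        RETURN collect(node_key) AS unique_keys
    """)
    return rows[0]["unique_keys"]


print(get_unique_entity_labels())         # e.g. ['Product', 'Location', 'Issue', 'Feature']
print(get_unique_entity_keys("Product"))  # whatever keys the LLM attached to Product entities
```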
The next utility function you'll define is very similar, but now we turn our attention to the domain graph. On the domain graph, we already know how to find the relevant labels: they're the labels on nodes that do not carry the extra __Entity__ label. So to find the unique domain keys for, say, Product, we match on that label, filter out any nodes that have __Entity__ (because those belong to the subject graph), and then, just as in the previous function, collect the unique keys on those domain nodes into a list. If we look at the unique property keys for the domain labels, the result should be consistent, because these were all imported from a single CSV file. It's a much smaller list, and it correlates exactly with what the CSV file had: the product name, the price, the description, and the product ID.

To help the correlation along, we define a helper function called normalize_key. You pass in a label and the key you want to normalize. The function lowercases the key, trims unhelpful extra whitespace, and, if the key is prefixed with the label itself, removes that prefix. That means product_name ends up as just name, and 'product name' also ends up as name, which makes it easy to correlate with the name key from the subject graph. A property key like price is unaffected and stays price. The implementation is straightforward; its whole purpose is to make property keys easy to compare. You can then look at a few examples as a sanity check. That looks good.
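A sketch of the domain-side key lookup and the normalizer, again using the run_cypher stand-in; the exact normalization rules in the course helper may differ slightly.

```python
def get_unique_domain_keys(domain_label: str) -> list[str]:
    """Distinct property keys on domain-graph nodes (no __Entity__ label) with the given label."""
    rows = run_cypher(f"""
        MATCH (n:`{domain_label}`)
        WHERE NOT n:__Entity__
        WITH keys(n) AS node_keys
        UNWIND node_keys AS node_key
        WITH DISTINCT node_key
        RETURN collect(node_key) AS unique_keys
    """)
    return rows[0]["unique_keys"]


def normalize_key(label: str, key: str) -> str:
    """Lowercase, trim, and strip a leading '<label>_' or '<label> ' prefix."""
    normalized = key.lower().strip()
    prefix = label.lower()
    for separator in ("_", " "):
        if normalized.startswith(prefix + separator):
            normalized = normalized[len(prefix) + 1:]
    return normalized.strip()


print(normalize_key("Product", "product_name"))   # -> 'name'
print(normalize_key("Product", "Product name"))   # -> 'name'
print(normalize_key("Product", "price"))          # -> 'price'
```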
The next utility function you'll put together correlates keys for a given label, and it takes advantage of the utilities we already have. Let's look at the implementation. We import a new Python library called rapidfuzz, which is used for text similarity scoring; it provides simple edit-based scores. There are a few different ways to judge whether two strings are very similar, and rapidfuzz is a good library for this. Inside the function, given a particular label, some keys from the entity nodes, and some keys from the domain nodes, we compute a similarity score for each pairing using rapidfuzz's fuzz scoring, compare it against a threshold, and keep the pairs that score highly; those are the keys whose values are probably worth comparing.

In detail: correlated_keys, the list of pairs we want to build, starts out empty. Then, in classic style, we have a for loop inside a for loop: we loop through all the entity_keys, and inside that, loop through all the domain_keys, considering how closely each pairing correlates. We normalize each of the keys first. Because the rapidfuzz score runs from zero up to 100, we scale it down to a zero-to-one range so it reads like a similarity score. We compute the ratio of the normalized domain key and the normalized entity key; that's the fuzzy_similarity. We want that fuzzy_similarity to be greater than the similarity threshold passed into the function, which defaults to 0.9. If it passes the threshold, we add the pair to correlated_keys as one we think is valid. We then sort the pairs so it's easy to see the most highly correlated ones versus the less correlated ones, and later on we'll use the top correlated pair.

After defining that function, this next part just tries it out for Product: find all the unique entity keys, find the domain keys, pass them into the function, and see the results. For Product, name and product_name correlate perfectly, price and price correlate perfectly, and description and description correlate perfectly. But design to description is a pretty low correlation, and dimensions to description is low as well; those aren't great. What threshold did we pass in? Right, we asked for a similarity of 0.5. You can try different values here to see what kinds of results you get; the default of 0.9 is one of those classic high thresholds you want to set.
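A sketch of that key-correlation helper, assuming the get_unique_*_keys and normalize_key helpers from the earlier sketches; the exact return shape of the course's function may differ.

```python
from rapidfuzz import fuzz


def correlate_keys(
    label: str,
    entity_keys: list[str],
    domain_keys: list[str],
    similarity_threshold: float = 0.9,
) -> list[tuple[float, str, str]]:
    """Pair up entity and domain property keys whose normalized names look alike."""
    correlated_keys = []
    for entity_key in entity_keys:
        for domain_key in domain_keys:
            normalized_entity_key = normalize_key(label, entity_key)
            normalized_domain_key = normalize_key(label, domain_key)
            # rapidfuzz returns 0-100, so scale it down to a 0-1 similarity
            fuzzy_similarity = (
                fuzz.ratio(normalized_domain_key, normalized_entity_key) / 100.0
            )
            if fuzzy_similarity > similarity_threshold:
                correlated_keys.append((fuzzy_similarity, entity_key, domain_key))
    # Highest-scoring pairs first; the top pair is the one used later
    return sorted(correlated_keys, reverse=True)


# Try it out for Product with a looser threshold, as in the lesson
print(correlate_keys(
    "Product",
    get_unique_entity_keys("Product"),
    get_unique_domain_keys("Product"),
    similarity_threshold=0.5,
))
```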
Okay, one more bit of background, and then we'll put all of this together. We can now take a given label in the subject graph and in the domain graph and decide which pairs of keys are good candidates for checking whether two nodes are actually the same node. This is all in service of what's called entity resolution. The approach here is one technique among many; it's a solid baseline and it works well for the current data set. For a different data set you'd probably want multiple techniques, and you could imagine an entire agent built around figuring out the right resolution technique for the data you have. For what we're doing today, we'll make some assumptions.

The next piece we need comes into play when we process the data inside Neo4j with Cypher, because Cypher has support for string comparison as well. One of the available value-similarity functions is the Jaro-Winkler distance. It's a string comparison method: where vector similarity requires calculating vector embeddings, you can instead measure text distance directly in lots of different ways. Jaro-Winkler looks at how much two strings differ, conceptually similar to an edit distance: roughly, how much change would it take to turn string A into string B. The values fall between zero and one and run in an inverse way: zero means an exact match, and one means no match whatsoever between the two strings.

Neo4j's library of text similarity functions includes this Jaro-Winkler calculation. You can also use the Hamming distance, the Levenshtein distance, the Sørensen-Dice similarity, and fuzzy matching, and of course you can do cosine similarity using vector embeddings if you want. All of these have different characteristics, and, like everything else in this part of the knowledge graph construction, you could pick and choose which one works best for a particular dataset. For the dataset we have, the Jaro-Winkler distance will be just fine.

So let's look at Jaro-Winkler in action using pure Cypher. In the MATCH clause of this query, we match pairs of entity and domain nodes for a particular label: we look for the same label on both, and the entity node must also carry the extra __Entity__ label. With the entity and the domain node in hand, we calculate the Jaro-Winkler distance score and pass it along with a filter requiring the score to be less than 0.4. Remember, a score of zero means a perfect match, so the scoring runs the opposite way from what you might expect: we're looking for low values. We filter on that and return the values to take a look. In this example, we pass in query parameters: the entity label Product on both the subject graph and the domain graph, the entity key name, and the domain key product_name. You can see that these are perfect scores: Gothenburg Table and Gothenburg Table look exactly the same, so of course they get a very low score. You can try different values to see what a slightly less strict threshold like 0.5 would give you: apparently Gothenburg Table and Västerås Bookshelf aren't terrible, and neither are Gothenburg Table and Stockholm Chair. I don't know how those are very similar text-wise, but that's what the distance gives you. This is why, in practice, you usually want a very low threshold like 0.1, so the values are as close as possible to being the same thing.

Let's elaborate on that query a little. It's exactly the same query with one extra part: whenever we find an entity-domain pair with a very low distance, we create a relationship between them. Things are repackaged slightly, so the WHERE clause now contains just the predicate on the Jaro-Winkler distance, with the threshold set to 0.1, and for any pair that passes, we MERGE a relationship from the entity node to the domain node. We use a fixed, non-parameterized relationship type, CORRESPONDS_TO, so the relationship says this entity node corresponds to this domain node.
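A sketch of that elaborated query, wrapped as a parameterized Cypher string. It assumes the APOC plugin's apoc.text.jaroWinklerDistance function is what provides the score; if the course uses a different similarity function, swap it in here.

```python
correspondence_query = """
MATCH (e:__Entity__), (d)
WHERE $entity_label IN labels(e)
  AND $entity_label IN labels(d)
  AND NOT d:__Entity__
WITH e, d,
     apoc.text.jaroWinklerDistance(
         toString(e[$entity_key]), toString(d[$domain_key])
     ) AS score
WHERE score < $threshold
MERGE (e)-[r:CORRESPONDS_TO]->(d)
ON CREATE SET r.created_at = timestamp()
ON MATCH  SET r.updated_at = timestamp()
RETURN e[$entity_key] AS entity_value, d[$domain_key] AS domain_value, score
"""

print(run_cypher(correspondence_query, {
    "entity_label": "Product",
    "entity_key": "name",
    "domain_key": "product_name",
    "threshold": 0.1,
}))
```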
There are two sub-clauses used with the MERGE, which is helpful in case you decide to run this multiple times. MERGE in Neo4j is like an upsert, and it has two sub-clauses keyed to that upsert behavior: if the MERGE finds an existing match for the pattern you've described, the ON MATCH sub-clause runs; if it's the first time and the pattern is being created, ON CREATE runs. So the very first time you run this, ON CREATE fires and sets a created_at timestamp; on subsequent runs, ON MATCH fires and updates an updated_at timestamp every time. Okay, and that runs across all of the corresponding products. That looks pretty good; the Cypher query is doing exactly what we want.

So go ahead and wrap it inside a function, passing in the label, the entity key, the domain key, and the similarity threshold you want to use, as sketched below. The query is essentially the same, with the function arguments passed in as query parameters. We can run it on Product, name, and product_name, the two keys from the entity side and the domain side, and it finds ten relationships that make sense between the subject graph and the domain graph.

For completeness, although we know we can connect the products, we want to do this for all of the entity types that are available. So to correlate and connect all of the subject nodes to their corresponding domain nodes, we put in a for loop that goes through all of the unique entity labels and tries to correlate each one from the entity side over to the domain side, connecting the subject nodes to the domain nodes. You can see that Product was connected, of course. Location doesn't have any correlation, and neither does Issue or Feature, but that's what we expected.

As a result of all this work, you finally have a complete domain graph constructed from CSV files, plus a lexical graph and a subject graph created from markdown files, and with this final step you've connected the entities in the subject graph with the corresponding nodes in the domain graph. Now you have a completely connected knowledge graph. Well done.
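For reference, a rough end-to-end sketch of that wrapped-up correlation function and the final loop, reusing the correspondence_query and helpers from the earlier sketches; the correlate_keys call used to pick the top key pair per label is an assumption about how the pieces fit together.

```python
def correlate_subjects_with_domain(
    label: str, entity_key: str, domain_key: str, threshold: float = 0.1
) -> list[dict]:
    """MERGE CORRESPONDS_TO relationships for close entity/domain value pairs."""
    return run_cypher(correspondence_query, {
        "entity_label": label,
        "entity_key": entity_key,
        "domain_key": domain_key,
        "threshold": threshold,
    })


for label in get_unique_entity_labels():
    key_pairs = correlate_keys(
        label,
        get_unique_entity_keys(label),
        get_unique_domain_keys(label),
    )
    if not key_pairs:
        print(f"{label}: no correlated keys, skipping")
        continue
    _, entity_key, domain_key = key_pairs[0]   # use the top-scoring key pair
    results = correlate_subjects_with_domain(label, entity_key, domain_key)
    print(f"{label}: connected {len(results)} node pairs via {entity_key} ~ {domain_key}")
```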