We'd like to know you better so we can create more relevant courses. What do you do for work?

Loading...

We'd like to know you better so we can create more relevant courses. What do you do for work?

AI Python for Beginners is a sequence of 0 connected courses. You can navigate to the other courses by clicking on the cards below

of vector or semantic search using the brute-force k-nearest-neighbors algorithm. You code up an implementation of brute-force KNN, and see how it can be used to accurately obtain the nearest vectors in embedding space to a query vector. You then explore the issues around the runtime complexity of the brute-force KNN algorithms and this leads you to the class of approximate nearest-neighbors algorithms that lie at heart of vector database technology. All right, let's dive in. Vectors capture the meaning behind our data and thus in order to find data points similar in meaning to our query we can search and retrieve the closest objects in vector space and return them. This is known as semantic or vector search. And by semantic search I mean search that utilizes the meaning of words or images in question. One way to find similar vectors is through brute force which goes in following steps. Step one, given a query find the distances between all the vectors and the query vector. Step two, sort the distances and finally return the top K best matching objects with the best distance. This is known in classical machine learning as the K nearest neighbor algorithm. The thing is that brute-force searches comes with a large computational cost. You can see here that the overall query time grows with the amount of objects that we have in our star. And if over time the amount of data doubles or triples, the query time will also double and triple. Let's demonstrate this algorithm in code and try to scale it up both with number of data points and dimensions. So, we already have a number of libraries loaded into our notebook and the one that is worth looking into is the nearest neighbor, which we'll use to demonstrate brute force algorithm and how it works. Let's start by generating 20 random points of two dimensions. And then, we can plot it nicely on a graph, so that you can see how they are distributed across the screen and across the vector space. And now, let's add all the data points into nearest neighbors' index. And you can see here that we are using the brute force algorithm. And then, if I run this it returns an instance that is ready to query. Now, this is the query that finds the four nearest vectors is our case set to four and this is the query vector that we are looking for and if we run this, we will see that vectors 10, 4, 19 and 15 are the four nearest ones with the following distances. And even though we only have 20 objects we could already measure how long this query took. So, if I run this and grab the time before and after we could see that like searching across 20 vectors takes very little time in this case. And now, let's see the time complexity for brute force for much larger data set than just 20 objects. And for this we have this handy function called speedtest which takes the count of the number of objects that we are tested on. And it works in three steps. First, we are randomly generating a number of objects based on the count. Then again, we are building the index using the nearest neighbors. And then finally, we measure the time of the actual query and that's basically what we return back as a result. So, let's test it on 20,000 objects and we can see that it run pretty fast but to really give it a go. Let's test it on way bigger data sets so we're going to run it on 200,000, 2 million, 20 million, 200 million objects and you can already see that like we've asked the number of objects increase the query time takes longer and longer and even between 2 million 20 million that's already a 10x increase and a 200 million objects took 12 seconds. So, you can imagine what would happen if we actually had a billion objects or way more than that. This will get very quickly out of hands. The complexity that you just saw was only for two dimensions. So, what will happen if we increase the dimensionality of our vector embeddings let's say to 768 dimensions and see what happens. So, let's start with 1,000 documents with 768 dimensions so in here we are generating those vectors over here and then we are normalizing it and then also we have our query that we'll use to test the performance. And now, let's run a query where at the beginning we start a timer, then we use dot product, and we calculate the distances across all 4,000 vector embeddings and then finally once we have the results we'll sort all of them and then stop the timer and this will give us the time it takes to find the top five nearest results, and we can see that searching across 1,000 vector embeddings. It took us one half millisecond, and these are the nearest matches. So now, let's really test out how long it takes. To run a vector query across 1,000, 10,000, 100,000 and half a million objects with the 768 dimensions. And we can see that like up to 100,000 that returns pretty fast. But half a million takes a bit longer and you can see a single query of just half a million vectors takes almost two seconds. And if we were to run a thousand queries across half a million objects that would take in total about half an hour. Which is not great. In conclusion, we saw how the number of vectors affects the query time. So, the more vectors we have, the longer it takes for the query to complete. And then, it got really tricky when we got to the use cases that are closer to real-life scenarios where the dimensionality of our vectors was 768 vectors. And then in this case, as soon as we reached half a million objects, brute force was just not cutting it as it was taking almost two seconds per query. And in real-life scenarios, you'd be dealing with tens or hundreds of millions of objects. In the next lesson, we'll cover different methods on how you can query across many vectors and still return results in a reasonable time.