In this lesson, you will learn about using sentence embedding models in production and how the two different encoders, the question encoder and the answer encoder, are used in a retrieval pipeline such as a RAG system. All right, let's go.

Okay, so we trained the dual encoder embedding model. Now we have two encoders ready to go: the question encoder and the answer encoder. As we can see here, during ingest we encode each text chunk using the answer encoder and store the resulting embedding vector in the vector database. Then, when a user issues a query, we use the question encoder to generate the query embedding vector. That vector is then used to retrieve the matching facts or text segments that are sent to the LLM as part of the RAG flow.

How do we find the matching chunks of text after the question embedding has been computed? The naive approach would be to compute the similarity between the question embedding and all answer embeddings, but that is computationally heavy and may take too long for a real production system. Thankfully, we have quite a few implementations of approximate nearest neighbor, or ANN, algorithms, like HNSW, Annoy, FAISS, and others. These algorithms approximate the nearest neighbor search with high accuracy but significantly lower compute time, and they are widely used for this task. Most ANN algorithms are in-memory, so when you implement this in production on a very large dataset, you have the additional requirement of implementing your ANN approach with a persistent data store on disk.

Let's see all this in the code. So in this notebook, we first suppress the warnings, and we're going to import, as always, a bunch of packages. Specifically, I want you to note these two new classes, DPRContextEncoder and DPRQuestionEncoder. We'll use those to load a pre-trained dual encoder. You will also need the cosine similarity matrix function from the other lab. It's exactly the same thing, just a helper function to compute similarity.

So let's put together an example. Here we have five different potential answers and a question: "What is the tallest mountain in the world?" Now you can take this model called all-MiniLM-L6-v2, which is a pure similarity model, compute the question embedding of the question, then compute the answer embedding for each of the answers, and then compute the similarity between the question and each of these answers. When you do this, you will see that the best answer, the one that's closest in terms of similarity, is the one that is identical to the question, "What is the tallest mountain in the world?", and the similarity is actually 1.0. That's what we would expect.

Contrast that with a dual encoder that has a different answer encoder and question encoder. In this case, we use the fully pre-trained DPR model for this purpose. After loading the model, we can compute the tokens of the question and then the embedding of the question. As you can see, some of the embedding values are printed here, and the embedding is a 768-dimensional vector. So now you can do the same with the answers: take each answer, tokenize it, use the answer encoder in this case to get the answer embedding, compute the similarity, and figure out what the best answer is. And drum roll. Yay! This time answer zero, the one that simply repeats the question, did not get the highest score, and the best answer we got is exactly what we wished for: "The tallest mountain in the world is Mount Everest."
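To make that comparison concrete, here is a minimal sketch of what the notebook does, assuming the publicly available facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base checkpoints stand in for the pre-trained DPR dual encoder (the notebook's exact checkpoints may differ, and the three distractor answers below are placeholders, not the notebook's list):

```python
# Sketch: similarity model vs. DPR dual encoder on the "tallest mountain" example.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

question = "What is the tallest mountain in the world?"
answers = [
    "What is the tallest mountain in the world?",           # echoes the question
    "The tallest mountain in the world is Mount Everest.",  # the answer we want
    "Mount Everest is in the Himalayas.",                   # placeholder distractor
    "I enjoy hiking in the mountains.",                     # placeholder distractor
    "I am going to a yoga class.",                          # placeholder distractor
]

# 1) Pure similarity model: the question is most similar to itself (score 1.0).
sim_model = SentenceTransformer("all-MiniLM-L6-v2")
q_vec = sim_model.encode(question, convert_to_tensor=True)
a_vecs = sim_model.encode(answers, convert_to_tensor=True)
print(util.cos_sim(q_vec, a_vecs))  # answer 0, the echoed question, scores 1.0

# 2) DPR dual encoder: separate question and answer (context) encoders.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
a_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
a_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output  # shape (1, 768)
    a_emb = torch.cat([
        a_enc(**a_tok(a, return_tensors="pt")).pooler_output for a in answers
    ])                                                                    # shape (5, 768)

scores = torch.nn.functional.cosine_similarity(q_emb, a_emb)
print(scores, "-> best answer:", answers[int(scores.argmax())])
```

As in the lesson's run, the pure similarity model ranks the echoed question first, while the dual encoder ranks the actual answer, "The tallest mountain in the world is Mount Everest," highest.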
Okay. So now, taking a step back, let's look at the full RAG pipeline. During ingest (the blue lines), we take the input documents or text, chunk them, and use the answer encoder to encode these chunks into embedding vectors, which are then stored in the vector database. Upon receiving a user query (the green lines), we use the question encoder to get the question embedding and then use ANN to retrieve the most relevant chunks of text. Those are then included as context in the prompt and sent to the LLM, which generates the desired response. (A minimal code sketch of this retrieval flow appears below.)

In practice, there are a few ways you can go about building a RAG pipeline. You can code it yourself from scratch, you can use one of the do-it-yourself frameworks like LangChain or LlamaIndex, or you can use a RAG-as-a-service platform like Vectara, which does most of the heavy lifting for you.

So in this lesson you saw how to use the question encoder and answer encoder in a production RAG pipeline, and the importance of ANN algorithms in keeping retrieval latency acceptable. We will conclude the course in the next lesson and explain how a two-stage retrieval pipeline works. See you there.
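Here is the promised sketch of the retrieval portion of the pipeline, assuming the same DPR checkpoints as above and FAISS's HNSW index as the in-memory ANN structure; the chunks, the prompt template, and the LLM call are placeholders, not the notebook's code:

```python
# Sketch: ingest chunks with the context (answer) encoder, index them with HNSW,
# then answer queries with the question encoder + ANN search.
import faiss
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# --- Ingest (blue lines): encode each text chunk and add it to the ANN index.
chunks = [  # placeholder chunks; in practice these come from chunking your documents
    "The tallest mountain in the world is Mount Everest.",
    "Mount Everest is 8,849 meters above sea level.",
    "Annapurna is a peak in the Himalayas.",
]
with torch.no_grad():
    chunk_vecs = torch.cat([
        ctx_enc(**ctx_tok(c, return_tensors="pt")).pooler_output for c in chunks
    ]).numpy().astype("float32")

# 768-dim vectors, 32 neighbors per HNSW node, dot-product scoring as used by DPR.
index = faiss.IndexHNSWFlat(chunk_vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(chunk_vecs)

# --- Query (green lines): encode the question, retrieve top-k chunks, build the prompt.
def retrieve(question: str, k: int = 2) -> list[str]:
    with torch.no_grad():
        q_vec = q_enc(**q_tok(question, return_tensors="pt")).pooler_output.numpy().astype("float32")
    _, idx = index.search(q_vec, k)
    return [chunks[i] for i in idx[0]]

context = retrieve("What is the tallest mountain in the world?")
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)  # this prompt would be sent to the LLM to generate the response
```

Because the HNSW index lives in memory, a production deployment over a very large corpus would back it with a persistent vector store on disk, or delegate the whole flow to a framework like LangChain or LlamaIndex, or a platform like Vectara.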