In this lesson, you will learn about sentence embeddings, how early and naive attempts to create them failed, and what led to the successful approach of using a dual encoder architecture for sentence embeddings. All right, let's go.

So far we've discussed words and tokens interchangeably. In reality, NLP systems deal with tokens, and a token can be a word, but it doesn't always have to be. For example, here we have the sentence "We love training deep learning networks." We can tokenize it into whole English words and assign each word an integer value like 23 or 112, and so on. Subword tokenization means that a token is not always a complete word, but can be a subword or any sequence of characters. When tokenizing, you define a vocabulary of possible words or subwords and break up the text according to this vocabulary. The result is that each sentence is represented by a sequence of integer values. The tokenization techniques commonly used in LLMs and embedding models are usually subword tokenizers like BPE (Byte Pair Encoding), WordPiece, or a recently popular variant called SentencePiece.

Let's see how token embeddings work in BERT, which has a vocabulary of about 30,000 tokens and an embedding dimension of 768. The tokenized input sentence is prepended with a special first token called CLS, and then all tokens are converted to token embeddings. Think of this first level of embeddings as fixed token embeddings that focus only on the word itself. The output of each encoder layer in BERT also provides embeddings for each token in the input sequence, but these embeddings now integrate information about the rest of the sentence, so we call them contextualized embeddings, as we've seen before. As we go from layer to layer, these representations become better and better at integrating context from the whole sentence.

After the success of word embeddings like Word2Vec or GloVe, the next question researchers explored was: can we create embedding vectors for sentences, such that cosine similarity or dot product similarity in an embedding space represents semantic similarity between sentences? Initial attempts were naive. Some tried taking the output embeddings of the last layer of a transformer model for all tokens in the sentence and averaging them out; this is also known as mean pooling. Others tried just using the embedding of the CLS token as the representative of the sentence. These approaches all failed.

Let's see this in practice. First, let's ignore some of the warnings, and then I'm going to import the libraries you'll need here, like PyTorch, SciPy, and seaborn, and again our BERT model and tokenizer. Here we have the model name we're going to use, bert-base-uncased, and we're loading the tokenizer and the model for it. Then we have a helper function that essentially performs the mean pooling operation to produce a sentence embedding. It takes the sentence, encodes it into tokens, creates the attention mask, runs it through the BERT model, gets the output embeddings from the last hidden state (that is, the last layer of the BERT model), and then applies the mean pooling operation to get the final output of the function. The second function I'm going to set up here is just a helper to compute the cosine similarity matrix: every row of the input matrix holds the feature vector of one sentence, and with cosine similarity we have to normalize each row and then just compute the dot products between them.
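To make this walkthrough concrete, here is a minimal sketch of what these two helpers might look like, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint mentioned above; the exact function names and signatures in the lesson notebook may differ.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Subword tokenization in action: some words get split into several tokens.
print(tokenizer.tokenize("We love training deep learning networks."))


def mean_pooled_embedding(sentence: str) -> np.ndarray:
    """Encode a sentence and mean-pool the last-layer token embeddings."""
    encoded = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**encoded)
    token_embeddings = outputs.last_hidden_state            # (1, seq_len, 768)
    mask = encoded["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
    # Average only over real tokens (the mask excludes any padding).
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).squeeze(0).numpy()


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for a matrix with one embedding per row."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-9, None)
    return normalized @ normalized.T
```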
The last function you'll use here is plot_similarity, which creates a heatmap so you can visualize the similarity matrix between every pair of sentences. It uses seaborn (the sns package) and draws the heatmap according to the labels and features that we have.

So let's try a couple of things. Here we have a few messages, or sentences, that we will use. You can see some are about smartphones, some about the weather, and so on. Okay, so here you take each one of these messages individually, compute its mean-pooling embedding, and do this for all the messages. Then you take these embeddings and plot the similarity heatmap. Let's see what that looks like. This heatmap shows the similarity between each sentence and every other sentence, and you can see that most of it is red and orange, which is really high similarity. That doesn't make sense: these sentences don't have the same semantic meaning, which means that mean pooling doesn't work.

Now you can look at this in another way. Here we have the STS benchmark dataset, which already contains a lot of sentence pairs: sentence one and sentence two. If you load this dataset, you can see some of the sentences here; each pair also has a score, which is the ground truth score of how similar the two sentences are. Now you will use the sim_two_sentences function to compute the mean pooling of BERT embeddings on the STS sentence examples. We'll use a 50-example subset here just to speed things up, but you can change this variable later and try any subset of the STS dataset. Now you can print the scores, and as you can see, the numbers here are pretty high. The mean pooling of BERT embeddings gives high scores, which indicates that this approach considers all these pairs of sentences to be very, very similar to each other, which again demonstrates that mean pooling of BERT embeddings doesn't really work for sentences. Another way you can look at this is through Pearson correlation, by computing the Pearson correlation between the ground truth scores and the mean pooling scores of BERT. Doing that, you can see that the Pearson correlation is very low, which indicates that these two sets of scores are not correlated, and the p-value is also significant. So again, this is another indication that mean pooling is not a viable approach.

You saw that mean pooling doesn't work, so now you can try a pre-trained sentence embedding model to see how one that is actually trained for this works. This one is called all-MiniLM-L6-v2. Now you're going to run it over all the messages we had before and plot the same heatmap. As you can see, this looks a lot better. You have some sentences that are similar here, but for the most part the similarity is much more accurate and reflects the fact that a lot of these sentences are not similar to each other.

Now you can run the same approach on the STS dataset and create a third column called the MiniLM score. When you do that, you will see this third column with the scores created by the good dual encoder model. For example, if we look at this example here, you can see that the ground truth score and the MiniLM score are very close: both consider the two sentences not very related, whereas the mean pooling score again considers everything to be pretty much similar, with a very high score. And if you compute the Pearson correlation in this case, you can see that it is much higher, which means that the MiniLM scores and the ground truth scores correlate much better, as we would expect. I would encourage you to play with some other examples in the STS dataset and see for yourself how sentence similarity works for different sentence pairs.
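As a rough sketch of how the heatmap comparison above might be wired up, here is one way to do it with seaborn and the sentence-transformers package, reusing the mean_pooled_embedding and cosine_similarity_matrix helpers sketched earlier. The message list is illustrative, and plot_similarity is reconstructed from the description rather than copied from the lesson notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sentence_transformers import SentenceTransformer


def plot_similarity(labels, features, rotation=90):
    """Heatmap of pairwise cosine similarity, one row/column per sentence."""
    sim = cosine_similarity_matrix(np.asarray(features))
    ax = sns.heatmap(sim, xticklabels=labels, yticklabels=labels,
                     vmin=0, vmax=1, cmap="YlOrRd")
    ax.set_xticklabels(labels, rotation=rotation)
    plt.show()


# Illustrative messages, similar in spirit to the lesson's examples
# (some about smartphones, some about the weather).
messages = [
    "The new smartphone has an impressive camera.",
    "I just upgraded my phone and the photos look great.",
    "It is going to rain all weekend.",
    "The weather forecast predicts heavy showers.",
]

# Mean-pooled BERT embeddings: the heatmap tends to light up everywhere.
bert_embeddings = np.stack([mean_pooled_embedding(m) for m in messages])
plot_similarity(messages, bert_embeddings)

# A pre-trained sentence embedding model gives far more sensible contrasts.
minilm = SentenceTransformer("all-MiniLM-L6-v2")
minilm_embeddings = minilm.encode(messages)
plot_similarity(messages, minilm_embeddings)
```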
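And here is a hedged sketch of the STS benchmark comparison. I'm loading the STS-B pairs through the GLUE collection on Hugging Face as a stand-in for the lesson's data source (the exact dataset, column names, and sim_two_sentences signature in the notebook may differ), scoring 50 pairs with both approaches, and comparing each against the ground truth with scipy's pearsonr.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def sim_two_sentences(s1, s2):
    """Similarity of two sentences using mean-pooled BERT embeddings."""
    return cosine(mean_pooled_embedding(s1), mean_pooled_embedding(s2))


# STS-B pairs with human similarity scores (0-5 scale in the GLUE version).
n_examples = 50
sts = load_dataset("glue", "stsb", split="validation").select(range(n_examples))
data = pd.DataFrame({
    "sentence1": sts["sentence1"],
    "sentence2": sts["sentence2"],
    "ground_truth": sts["label"],
})

# Mean-pooled BERT scores: high for almost every pair.
data["bert_mean_pool_score"] = [
    sim_two_sentences(s1, s2)
    for s1, s2 in zip(data["sentence1"], data["sentence2"])
]

# Pre-trained dual encoder (all-MiniLM-L6-v2) scores.
minilm = SentenceTransformer("all-MiniLM-L6-v2")
emb1 = minilm.encode(list(data["sentence1"]))
emb2 = minilm.encode(list(data["sentence2"]))
data["minilm_score"] = [cosine(u, v) for u, v in zip(emb1, emb2)]

# Pearson correlation with the ground truth for each approach.
# (Pearson is scale-invariant, so the 0-5 vs. cosine ranges don't matter.)
print(pearsonr(data["ground_truth"], data["bert_mean_pool_score"]))
print(pearsonr(data["ground_truth"], data["minilm_score"]))
```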
Real progress in research around sentence embeddings was initially made with the introduction of the Universal Sentence Encoder at Google around 2018, based on the original transformer architecture. In fact, our founder Amin was part of the original team behind the Universal Sentence Encoder. SBERT was the first model trained on data that included pairs of sentences, and the work on SBERT in 2019 quickly led to more innovation in the following years, with additional designs such as DPR, Sentence-T5, E5, ColBERT, and many others.

Now let me explain a subtle but important consideration when building sentence encoders. There are two possible goals. The first is pure sentence similarity, for example if you want to use embeddings to find similar items. The second is to rank relevant sentences as responses to a question, for example in RAG. These are not the same goal. Let's look at an example dataset with four text chunks, shown here as A1 to A4, and the question "What is the tallest mountain in the world?" If you use pure similarity, clearly A1 is exactly the same as the question, so its similarity is the highest, equal to one, and thus A1 will be ranked highest. But that's not really what you want: you want the answer A2, "Mount Everest is the tallest," to be selected. This led to the dual encoder, or bi-encoder, architecture: we have two separate encoders, a question encoder and an answer encoder, and the model is trained using a contrastive loss.

To sum up, in this lesson you saw firsthand how mean pooling of BERT embeddings does not work for sentence embeddings, and how a pre-trained sentence embedding model like MiniLM-L6 does a much better job. You might be curious to know how to build these dual encoder models. That's exactly what you will learn in the next lesson. See you there!