In this lesson, you will learn about the internals of embedding models. You will see how your texts are converted into vectors through the multiple layers of an embedding model. Let's have some fun.

Vector search applications, like Retrieval Augmented Generation, hinge on embedding models. Choosing a suitable one is a pivotal decision when implementing such projects. If you select a model trained on academic papers, but all your data comes from Twitter, there is a high chance that the search quality will suffer. Surprisingly, not that many people perform evaluations, which should be one of the first steps. As we consider optimizations of vector search, everything starts with choosing the proper embeddings for your data.

An embedding model should take any text and convert it into a single vector of fixed dimensionality, so we can compare the distance between any two documents to determine their similarity. However, embedding models do not operate on text directly. Instead, they require an additional process that splits the text into smaller pieces. This translation layer is called a tokenizer. We will have a closer look at the role of tokenizers in the whole process of creating text embeddings.

This image shows a standard transformer architecture and comes from the "Attention Is All You Need" paper. It presents a full encoder-decoder model. However, embedding models are usually encoder-only transformers. Let's see how the input is transformed step by step. These models take an input and convert it into input embeddings. These embeddings are learned during the training process. Each possible value of the input has its own embedding. That means transformers can only digest the input symbols they were trained on. In the case of embedding models, they can only work with the text units they encountered in the training data. If your model was trained solely on English, it won't be able to work with Japanese texts, and vice versa. There are, however, no strict requirements on what these text units should look like. They are usually at the word or subword level.

The inputs are integer identifiers, so they can be mapped to a certain input embedding by the model. The tokenizer is expected to produce a sequence of numerical IDs corresponding to the tokens seen in a given text. These IDs define a sort of contract between the tokenizer and the embedding model itself, so we cannot swap the components easily. Once the training is finished, the IDs matter.

The most straightforward idea for handling the inputs would be to treat each individual letter as a separate entity. Theoretically, an embedding model should be able to work on letters or bytes, and that would result in a very small number of tokens. The size of the vocabulary also impacts the number of input embeddings learned by the model. However, each character may occur in various contexts, so its embedding won't be meaningful. The network parameters would need to learn the relationships between the letters and derive what the sequences mean. In practice, we want more meaningful pieces, so the transformer inputs are initialized with some meaning already. In contrast to character-based tokenization, we could use word tokens. Each word would get its own identifier, and the model would have to learn more embeddings to cover them all. That means more training data and more time spent on training.
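To make that tokenizer-model contract concrete, here is a minimal sketch, assuming the Hugging Face tokenizer that backs all-MiniLM-L6-v2 (the model used later in this lesson). It only shows how a text becomes a sequence of integer IDs and how those IDs map back to tokens; the example text is a placeholder.

```python
# Minimal sketch: text -> token IDs -> tokens, using the tokenizer behind all-MiniLM-L6-v2.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

ids = tokenizer("Vector search optimization")["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)

# Each token is paired with the integer ID the embedding model expects as input.
print(list(zip(tokens, ids)))
```

A different tokenizer would produce different IDs for the same text, which is exactly why the tokenizer and the embedding model cannot be exchanged independently once training is done.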
Moreover, we would not be able to represent any unseen word as a sequence of tokens, as we would only cover full words. That would never happen with character- or byte-level tokenization. This idea also ignores the fact that language has a structure, and certain words may have common parts carrying the meaning. Subword-level tokenization seems like a compromise between these two basic approaches. There are some existing NLP methods, such as stemming or lemmatization, that incorporate linguistic knowledge to find the root forms of words. That may sound appealing, but a method built for one language won't work for another, as languages differ in terms of syntax. However, various algorithms try to find a way to split the words based on the statistics of the training data, and they are commonly used in embedding models and LLMs. We will review the alternatives in the second lesson, but ideally, the tokenization algorithm should find the root forms of the words and assign them separate tokens.

Let's track the process of converting text into input embeddings step by step. First of all, we split the text into tokens using the selected tokenizer. Then, the tokenizer produces a sequence of token IDs and sends them to the model. The model maps each of the token IDs to its learned input embedding. The first step of the transformer is actually a lookup table. It is a pretty interesting exercise to see how these input embeddings look for different input texts. Let's check one of the pre-trained sentence transformers.

You will use one of the existing pre-trained models. It isn't the benchmark-winning one, but its open-source nature helps to deconstruct it into pieces. Sentence Transformers is a Python library providing many embedding models, and you will use all-MiniLM-L6-v2 throughout the whole course. However, changing it to a different one is as simple as passing a different name. This model consists of a transformer, a mean pooling layer, and normalization. This output hides the tokenizer component, which is also important. You can tokenize the text using the tokenize method of the model. It accepts multiple texts at a time and requires passing a list of strings. A list of input IDs is not the most straightforward or human-friendly way of inspecting how the input text was cut into pieces. You can see a list of tokens by passing a list of the corresponding IDs to the convert_ids_to_tokens method of the model's tokenizer. Each word of our example has a corresponding token learned by the tokenizer. There are also some technical tokens marked with square brackets. [CLS] and [SEP] start and end each sequence.

Let's forget about the pooling and normalization layers for the time being. Accessing the first module helps to investigate the inputs a little bit more. This transformer model has an embedding layer at the very beginning. It is responsible for transforming the raw input tokens into embeddings. These embeddings are then transformed by multiple stacked layers, but for now you will mainly focus on the input token embeddings. Let's access them. The input layer has different kinds of embeddings, including word embeddings, position embeddings, and token-type embeddings. The word embeddings are the most important ones, as they capture the meaning of each token. You can access them by calling word_embeddings on the embeddings layer. The input token embeddings for both texts have been calculated.
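Here is a hedged sketch of the steps just described, using the sentence-transformers API; the example texts are placeholders, not necessarily the ones from the course notebook.

```python
# Sketch: load all-MiniLM-L6-v2, tokenize two texts, and look up their
# context-free input token embeddings in the model's first module.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model)  # Transformer -> Pooling (mean) -> Normalize; the tokenizer is hidden inside

texts = ["Vector search optimization", "Tokenization is often ignored"]  # placeholder examples
features = model.tokenize(texts)  # dict with "input_ids", "attention_mask", ...
print(model.tokenizer.convert_ids_to_tokens(features["input_ids"][0].tolist()))

embedding_layer = model[0].auto_model.embeddings   # word, position and token-type embeddings
input_token_embeddings = embedding_layer.word_embeddings(features["input_ids"])
print(input_token_embeddings.shape)                # (batch, sequence length, 384)
```

Because word_embeddings is just a lookup table, the same token ID always maps to the same vector at this stage, regardless of the surrounding text.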
These embeddings are calculated separately for each token in a sequence, and we can calculate the similarity between the tokens based on them. You can see the input token embedding does not change, no matter the context or the order of the words. You can experiment with passing different texts and different sentences to see how similar the input token embeddings are. However, the same token will always get the same vector.

Now let's recap the whole process done in an embedding model. You have just reviewed the input token embeddings, but we mostly care about the output embeddings, as they are supposed to capture the meaning of the text. The sequence of input embeddings does not contain any positional information. We want to enrich the token embeddings so the model can also capture the positional information. This is done by adding positional encodings; there might also be some additional embeddings, depending on the model. As a result, we get a slightly modified set of token embeddings, which should also capture the relationships between the tokens in the whole text. Positional encodings are usually generated with a sine function. Once the input token embeddings and positional encodings are added together, we pass them through a set of stacked layers, each of them taking the input embeddings and producing some output embeddings. Internally, each of these modules uses an attention mechanism to determine the relationships between input embeddings, so this is where all the cross-token information is captured. Please note that each model can only take a certain number of tokens at once. That's also related to the positional encodings, as they are kept in a matrix of fixed dimensionality. If your model was trained to support, let's say, a maximum of 256 tokens, then there are no guarantees it will work for longer sequences. Usually, the tokenizer will drop the trailing tokens if the maximum length is exceeded.

Let's visualize the input token embeddings to get some intuition on how this mechanism captures the meaning of tokens. Input token embeddings are context-free, and they are also parameters of the transformer model. The model learns them during the training phase to represent the meaning of each token in the best way possible. We can access them as a matrix. Since the token embeddings are context-free, we can map each of the vectors to its corresponding token. Our matrix has 30,522 rows, which is equal to the size of the vocabulary. You will now get the vocabulary from the tokenizer and then sort it by token ID. This is how some randomly selected tokens of our model look. The mapping between tokens and vectors is now easy. Visualization is a bit harder, as each vector has 384 dimensions. You will now apply a dimensionality reduction algorithm to compress each vector into two-dimensional space. This may take some time.

We have the results. There are three main groups of tokens in our vocabulary. The first group consists of the technical tokens specific to the model and its training procedure. Then, the subword tokens with the double-hash prefix are the suffix tokens particular to the tokenizer used in our model. Finally, the last group is prefixes and whole words starting with anything except the double hash. You will now assign different colors depending on the group a particular token belongs to. A scatterplot, built with Plotly, helps you interactively visualize them, as in the sketch below.
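Continuing the sketch above, this is one possible way to build the visualization just described. Assumptions: scikit-learn's t-SNE stands in for the dimensionality reduction algorithm (the lesson does not name one here), Plotly Express draws the scatterplot, and the three token groups are assigned with simple prefix rules on the token strings.

```python
# Sketch: project all 30,522 input token embeddings to 2D and color them by token group.
import plotly.express as px
from sklearn.manifold import TSNE

vocab = model.tokenizer.get_vocab()                 # token -> ID, 30,522 entries
tokens = sorted(vocab, key=vocab.get)               # tokens ordered by their ID
matrix = embedding_layer.word_embeddings.weight.detach().cpu().numpy()  # (30522, 384)

points_2d = TSNE(n_components=2, random_state=42).fit_transform(matrix)  # this may take a while

def token_group(token: str) -> str:
    if token.startswith("[") and token.endswith("]"):
        return "technical"            # [CLS], [SEP], [PAD], [unused...], ...
    if token.startswith("##"):
        return "suffix subword"       # continuation pieces such as ##ing
    return "prefix / whole word"

fig = px.scatter(
    x=points_2d[:, 0],
    y=points_2d[:, 1],
    color=[token_group(t) for t in tokens],
    hover_name=tokens,
)
fig.show()
```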
Eyeballing is the only technique we'll apply here, but when you check the clusters, you can clearly see that numbers are placed close together. However, years form a subgroup of the numbers. The Cromwell token is closest to the tokens representing years from the 16th century; if you wonder why, try googling this English politician. Subword tokens occupy some subareas of this space, but some are just closer to full words. Unused tokens are mixed with some non-English characters. Generally, non-English characters are grouped together. The names are close to each other. And actually, each subarea of the semantic space holds tokens with similar meaning or function. This is where our embedding model will start. You can inspect this semantic space even more if you want to build some more intuition.

Let's now discuss all the other steps involved in converting a sequence of tokens into a vector capturing the meaning of the whole text. We already know that the encoder-only embedding model starts with context-independent embeddings. The attention mechanism captures the relationships between tokens and incorporates that information into the output embeddings of each stacked module. Thanks to that, our initial context-independent embeddings become context-aware, and the embedding model can produce a sequence of embeddings with a length equal to the length of the input sequence. Pooling helps to create a single vector for the whole text. It usually averages all the individual vectors assigned to the input tokens.

Let's verify whether the output token embeddings are really context-aware. When we use the encode method of the model, it produces just a single vector. However, we can also generate a vector for each individual token, and those should be context-dependent already. We can achieve that by setting output_value to token_embeddings. If we now check the output token embeddings of different sequences, they should not be identical anymore. Our embedding model captures the meaning of each token in a sequence, so they will be encoded differently depending on the context. It's not a basic word2vec-like mapping anymore. We will use the same examples as before, but you can play and experiment with different examples if you want. Then we will tokenize both sentences using the tokenize method of our model and, finally, calculate the embeddings of both sentences. We are going to specifically retrieve the token embeddings produced for each token in both sequences. Finally, cosine similarity is calculated between each token coming from the first sentence and each token from the second one. These distances will help us visualize how similar the different tokens in both sequences are. The output token embeddings of "vector search" and "optimization" are not identical anymore. You can test different sentences and see how the output token embeddings vary in terms of their similarity; a short sketch of this check follows at the end of the lesson.

Tokenization is an important yet often ignored step. It defines how our model sees the world and how much meaning each individual token can initially get in the model. You will learn about different ways of tokenization in the next lesson. All right. See you there.
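Here is that sketch: a minimal version of the context-awareness check, assuming the sentence-transformers encode API with output_value set to "token_embeddings". The two sentences are placeholders rather than the exact course examples.

```python
# Sketch: per-token output embeddings and a token-by-token cosine similarity matrix.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vector search optimization matters",         # placeholder example
    "We optimize vector search for production",   # placeholder example
]

# One tensor of shape (sequence length, 384) per sentence, context-aware this time.
token_embeddings = model.encode(sentences, output_value="token_embeddings")

# Cosine similarity between every token of the first sentence and every token of the second.
similarities = util.cos_sim(token_embeddings[0], token_embeddings[1])
print(similarities.shape)
```

Unlike the input token embeddings inspected earlier, the rows and columns of this matrix change whenever the surrounding words change, which is exactly the context-awareness the lesson describes.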