In this lesson, you'll go through the matrix math required to calculate self-attention one step at a time. You'll learn both how and why the equation works the way it does. All right, let's go. At first glance, the equation for calculating self-attention can be a little intimidating. But don't worry, we're going to break it down into small, easy-to-understand pieces. We'll start with these variables: Q stands for query, K stands for key, and V stands for value. The terms query, key, and value come from database terminology, so let's talk about databases for a little bit.

Imagine we had a database of guests at a hotel that paired each guest's last name with their room number. Now imagine Stat Squatch is working at the desk one night, and I check in and tell Squatch my last name, Starmer. However, instead of correctly spelling my last name, Starmer, Squatch types Stammer into the computer. Now the computer has to figure out what last name in the database is closest to whatever Squatch typed in. In database terminology, what Squatch typed in, the search term, is called the query, and the actual names in the database that we are searching are the keys. So the computer compares the query to all of the keys in the database and ranks each one. In this case, the query Stammer is closest to the key for Starmer, and so the computer returns my room number, 537. In database terminology, we'd call the room number the value. To summarize the database terminology: the query is the thing we are using to search the database, the computer calculates similarities between the query and all of the keys in the database, and the values are what the database returns as the results of the search. Bam!

Going back to the equation for self-attention, we now have a better idea of what the Q, K, and V variables refer to. Now let's talk about how we determine the queries, keys, and values in the context of a transformer. First, let's remember that self-attention calculates the similarity between each word and itself and all of the other words, and self-attention calculates these similarities for every word in the sentence. That means we need to calculate a query and a key for each word, and just like we saw in the database example, each key needs to return a value.

So, in order to keep our examples small enough that we can easily calculate things by hand, let's use the prompt: "Write a poem." And just like we saw in a previous lesson, the first thing a transformer does is convert each word in the prompt into word embeddings, and then the transformer adds positional encoding to the word embeddings to get these numbers, or encodings, that represent each word in the prompt. Note: in this simple example, we're just going to use two numbers to represent each word in the prompt. However, it's much more common to use 512 or more numbers to represent each word. Anyway, in order to create the queries for each word, we stack the encodings in a matrix and multiply that matrix by a two-by-two matrix of query weights to calculate two query numbers per word. Note: we multiplied the encoded values by a two-by-two matrix because we started with two encoded values per word, and a two-by-two matrix allows us to end up with two query numbers per word. If, instead, we had started with 512 word embeddings per word, and thus 512 encoded values per word, then a common thing to do would be to use a 512-by-512 matrix to create 512 query numbers per word.
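If it helps to see this step in code, here is a minimal PyTorch sketch of creating the queries. The encoding numbers below are made up just for illustration (they are not the lesson's actual numbers), and W_q is simply a hypothetical name for the query weights; a Linear layer with no bias does the matrix multiplication for us.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # any seed works; the random weights are just illustrative

# Hypothetical encodings for the prompt "Write a poem":
# word embeddings plus positional encoding, 2 numbers per token.
encodings = torch.tensor([[1.16,  0.23],   # "Write"
                          [0.57,  1.36],   # "a"
                          [4.41, -2.16]])  # "poem"

# A Linear layer with no bias holds a 2x2 matrix of query weights.
W_q = nn.Linear(in_features=2, out_features=2, bias=False)

# Multiplying the stacked encodings by the query weights gives two query
# numbers per word. Internally, the layer computes encodings @ W_q.weight.T,
# which is why the transpose symbol shows up in the next paragraph.
Q = W_q(encodings)
print(Q.shape)  # torch.Size([3, 2]) -> one row of 2 query numbers per token
```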
That being said, the only rule that you really have to follow is that the matrix math has to be possible. Also, I want to point out that I've labeled the query weights matrix with the transpose symbol. This is because PyTorch prints out the weights in a way that requires them to be transposed before we can get the math to work out correctly. Small bam! Now we create the keys by multiplying the encoded values by a two-by-two matrix of key weights, and we create the values by multiplying the encoded values by a two-by-two matrix of value weights.

Now that we have the queries, keys, and values for each token, we can use them to calculate self-attention. We start by multiplying the query matrix Q by the transpose of the key matrix K. Josh, why do we need to transpose K when we do this multiplication? Well, in this specific case, the obvious reason is that the multiplication wouldn't work if we didn't transpose K: the two numbers in the first row of Q could multiply the top two numbers in the first column of K, but then the bottom number would be left out of the fun. So in this case, multiplying by K without transposing it is a bad idea for technical reasons. However, there's actually a much more important reason to transpose K, and to understand it, let's go through the multiplication one step at a time.

We start with the first row in Q, the query for the word write, and the first column in K transposed, the key for the word write. The matrix multiplication gives us the sum of these products, which is -0.09. This process of multiplying pairs of corresponding numbers together and adding them up, like we did here, is called calculating a dot product. So -0.09 is the dot product of the query and the key for the word write. Dot products can be used as an unscaled measure of similarity between two things, and this metric is closely related to something called the cosine similarity. The big difference is that the cosine similarity scales the dot product to be between -1 and 1. In contrast, the dot product similarity isn't scaled, so that makes -0.09 an unscaled similarity between the query and the key for the word write. Likewise, the unscaled dot product similarity between the query for write and the key for a is 0.06, and the unscaled dot product similarity between the query for write and the key for poem is -0.61. Likewise, we calculate the unscaled dot product similarities between the query for a and all of the keys, and the unscaled dot product similarities between the query for poem and all of the keys. Thus, by multiplying Q by the transpose of K, we end up with the unscaled dot product similarities between all possible combinations of queries and keys for each word. Double bam!

Now the next thing we do is scale the dot product similarities by the square root of d_k. d_k is the dimension of the key matrix, and in this case, dimension refers to the number of values we have for each token, which is two. So we scale each dot product similarity by the square root of two, and that gives us a matrix of scaled dot product similarities. Note: scaling by the square root of the number of values per token doesn't scale the dot product similarities to any specific range. That said, even with this limited scaling, the original authors of the transformer said it improved performance. Small bam! The next thing we do is take the softmax of each row in the matrix of scaled dot product similarities, and taking the softmax of each row gives us these new rows.
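Here is a short sketch of those last two steps in PyTorch: the scaled dot product similarities and the row-wise softmax. The Q and K numbers below are made up just to show the shapes, so the output won't match the lesson's numbers.

```python
import torch
import torch.nn.functional as F

# Hypothetical Q and K matrices: one row of 2 numbers per token ("Write", "a", "poem").
Q = torch.tensor([[ 0.30, -0.24],
                  [ 0.12,  0.36],
                  [-0.68,  0.51]])
K = torch.tensor([[-0.15,  0.53],
                  [ 0.47,  0.28],
                  [ 0.62, -0.89]])

d_k = K.size(-1)                 # number of values per token, here 2
sims = Q @ K.T                   # 3x3 matrix of unscaled dot product similarities
scaled_sims = sims / d_k ** 0.5  # divide each similarity by the square root of d_k
attention_percents = F.softmax(scaled_sims, dim=-1)  # softmax applied to each row
print(attention_percents)
```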
Note: the softmax function makes it so that the values in each row sum to one, so we can think of these values as a summary of the relationships among the tokens. For example, the word write is 36% similar to itself, 40% similar to a, and 24% similar to poem. Bam!

Now let's put these new rows back together to form a matrix. The last thing we do to calculate self-attention is multiply the percentages by the values in matrix V. To understand exactly why we do this multiplication, let's go through it step by step. When we multiply the first row of percentages by the first column in V, we calculate 36% of the first value for the word write, add it to 40% of the first value for a, and then add 24% of the first value for poem. That gives us 1.0, the first self-attention score for the word write. In other words, the percentages that come out of the softmax function tell us how much influence each word should have on the final encoding for any given word. Likewise, we scale the second column of values to get the second self-attention score for the word write. Then we scale the values by the percentages for a to get the self-attention scores for a, and then scale the values by the percentages for poem to get the self-attention scores for poem. And at long last, we have calculated the self-attention scores for each input token. Bam!

In summary, the equation for self-attention may look intimidating, but all it does is calculate the scaled dot product similarities among all of the words, convert those scaled similarities into percentages with the softmax function, and then use those percentages to scale the values, which become the self-attention scores for each word. Triple bam!
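To tie everything together, here is a minimal sketch of the whole calculation, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as a small PyTorch module. The class and variable names are just illustrative, the weights are randomly initialized rather than trained, and the example encodings are made up, so the printed scores won't match the lesson's numbers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """A minimal sketch of the self-attention calculation described above."""

    def __init__(self, d_model=2):
        super().__init__()
        # Weight matrices for the queries, keys, and values (no bias terms).
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_encodings):
        q = self.W_q(token_encodings)           # queries, one row per token
        k = self.W_k(token_encodings)           # keys, one row per token
        v = self.W_v(token_encodings)           # values, one row per token

        sims = q @ k.transpose(-2, -1)          # unscaled dot product similarities
        scaled_sims = sims / k.size(-1) ** 0.5  # scale by the square root of d_k
        percents = F.softmax(scaled_sims, dim=-1)  # each row now sums to 1
        return percents @ v                     # self-attention scores per token

# Hypothetical encodings for "Write a poem" (2 numbers per token).
encodings = torch.tensor([[1.16,  0.23],
                          [0.57,  1.36],
                          [4.41, -2.16]])
print(SelfAttention()(encodings))  # one row of self-attention scores per token
```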