In this lesson, you'll use PyTorch to code a class that implements masked self-attention. Then you'll run some numbers through it and verify that the calculations are correct. Let's code!

Just like we did before, we're going to import torch, torch.nn, and torch.nn.functional. Now we'll code a class that implements masked self-attention. We start by defining a class called MaskedSelfAttention that inherits from nn.Module. Then we create an __init__ method that has the same arguments we used before: d_model, the dimension of the model, or the number of word embedding values per token, plus the row and column indices. Then we call the parent's __init__ method, and we use nn.Linear to create the weight matrix that will later make the queries from the encoded values. Note, just as a reminder, I've labeled the query weight matrix with the transpose symbol because of how PyTorch prints out the weights. Then we do the exact same thing to create the key weights and the value weights, and we save the row and column indices.

The forward method is where we actually calculate the masked self-attention values for each token. Just like before, we accept the token encodings, but now we also accept a mask. By setting the default value for the mask to None, we can use this class to calculate both masked self-attention and the original self-attention. Now, just like before, we pass the token encodings to W_q, the query weight matrix, to create the queries, stored in a variable called q. Then we calculate the keys, k, and the values, v. Using the matrices we just created, we calculate self-attention: we calculate the similarities among the queries and the keys, and then scale the similarities. Then, if mask is not None, we apply the mask to the scaled similarities with masked_fill.

To understand how masked_fill works, imagine that the mask is a matrix of Trues and Falses, where the Trues correspond to the attention values that we want to mask out. masked_fill replaces each scaled similarity where the mask is True with value, which we set to negative one times ten to the ninth (-1e9), and it leaves the scaled similarities where the mask is False unchanged. Then we run the scaled similarities through a softmax function to determine the percentages of influence that each token should have on the others; because the masked-out positions are now huge negative numbers, the softmax gives them essentially zero influence. Lastly, we multiply the attention percentages by the values in v and return the attention scores. All together, the MaskedSelfAttention class looks like this. Bam!

And now let's run some numbers through it and make sure it works as expected. We'll start with the same matrix of encodings for the same tokens that we used before, and, just like before, we'll set the seed for the random number generator. Then we create an object from our MaskedSelfAttention class using the same parameters we used before.

Now we need to make a mask to prevent tokens from looking ahead when calculating attention. Because we have three tokens in the prompt, we start by creating a three-by-three matrix of ones with torch.ones, which looks like this. That matrix of ones is then passed to torch.tril, which turns the ones in the upper triangle into zeros and leaves the ones in the lower triangle as they were. Ultimately, we save a matrix with ones in the lower triangle and zeros in the upper triangle in a variable called mask. Then we use mask == 0 to convert the zeros into Trues and the ones into Falses.
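Pulling those steps together, here's a minimal sketch of what the class might look like. The default values for d_model, row_dim, and col_dim are assumptions standing in for "the same arguments we used before," which aren't reproduced in this lesson.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    # NOTE: the defaults below (d_model=2, row_dim=0, col_dim=1) are assumed,
    # not taken from this lesson.
    def __init__(self, d_model=2, row_dim=0, col_dim=1):
        super().__init__()
        # Weight matrices that will create the queries, keys, and values
        # from the token encodings.
        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        # Save the row and column indices.
        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, token_encodings, mask=None):
        # Create the queries, keys, and values from the token encodings.
        q = self.W_q(token_encodings)
        k = self.W_k(token_encodings)
        v = self.W_v(token_encodings)

        # Similarities among the queries and keys, scaled by sqrt(d_k).
        sims = torch.matmul(q, k.transpose(dim0=self.row_dim, dim1=self.col_dim))
        scaled_sims = sims / (k.size(self.col_dim) ** 0.5)

        if mask is not None:
            # Replace the scaled similarities where the mask is True with -1e9,
            # so they end up with essentially zero influence after the softmax.
            scaled_sims = scaled_sims.masked_fill(mask=mask, value=-1e9)

        # Percentages of influence each token should have on the others.
        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

        # Scale the values by those percentages and return the attention scores.
        return torch.matmul(attention_percents, v)
```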
Note: we can verify that we correctly created the mask by printing it out and seeing that it is what we expect. Bam! Lastly, we pass the encodings matrix and the mask to our masked self-attention object, and these are the masked self-attention values. Double bam! If you got something different, don't panic; you can verify that the math was done correctly using the same methods we used before: print out the weights, and pass the encodings directly to each weight matrix to see what the query, key, and value matrices are. Triple bam!
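Here's a sketch of that test run. The encoding values and the seed below are placeholders, since the actual numbers "we used before" aren't reproduced in this lesson, so the printed output will depend on the values you plug in.

```python
# Placeholder encodings for three tokens with d_model = 2; swap in the
# actual encodings matrix from the earlier lesson.
encodings_matrix = torch.tensor([[1.0, 0.5],
                                 [0.2, 1.3],
                                 [4.0, -2.0]])

torch.manual_seed(42)  # placeholder seed

masked_self_attention = MaskedSelfAttention(d_model=2, row_dim=0, col_dim=1)

# Lower-triangular ones, then convert to True/False:
# True marks the upper-triangle positions we want to mask out.
mask = torch.tril(torch.ones(3, 3)) == 0
print(mask)

# The masked self-attention values for each token.
print(masked_self_attention(encodings_matrix, mask))

# Verify the math: print the weights (transposed, matching how they are
# labeled in the lesson) ...
print(masked_self_attention.W_q.weight.transpose(0, 1))
print(masked_self_attention.W_k.weight.transpose(0, 1))
print(masked_self_attention.W_v.weight.transpose(0, 1))

# ... and pass the encodings directly to each weight matrix to see the
# query, key, and value matrices.
print(masked_self_attention.W_q(encodings_matrix))  # queries
print(masked_self_attention.W_k(encodings_matrix))  # keys
print(masked_self_attention.W_v(encodings_matrix))  # values
```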