In this lesson, you will go through the matrix math required to calculate masked self-attention, one step at a time. Like before, you'll learn both how and why the equation works. Let's have some fun.

Earlier we went through the matrix math equation for self-attention, and now we're going to go through the matrix math for masked self-attention. The good news is that the only difference between the equation for self-attention and the equation for masked self-attention is that we add a new matrix, M, for the mask, to the scaled similarities.

And that means that, just like we did earlier, we start by calculating the query, key, and value matrices. Thus, given the prompt "write a poem," the transformer creates word embeddings. Then it adds positional encoding to give us these encoded values. Now we calculate the query, key, and value matrices just like before.

Now we calculate masked self-attention. And just like we did earlier, we'll calculate the similarities between the queries and the keys. Then we scale the similarities. Now we add a masking matrix, M, to the scaled similarities. The purpose of the mask is to prevent any tokens from including anything that comes after them when calculating attention. In this example, where the prompt is "write a poem," we want the first token, "write," to include only itself and mask out "a poem" from its attention calculations. We want the second token, "a," to include itself and "write" and mask out "poem." And we want the third token, "poem," to include everything.

The masking is done using a matrix that looks like this, which adds zeros to the values we want to include in the attention calculations and negative infinity to any value we need to mask out. Adding zero to something doesn't change it, so those scaled similarities go unchanged, but adding negative infinity turns the masked scaled similarities into negative infinity.

What that means is that when we apply the softmax function to each row, the first token, "write," has 100% similarity to itself and 0% similarity to anything that came after. Likewise, the second token, "a," has 0% similarity to the token that came after it, "poem." And the last token, "poem," has similarities with everything.

And thus, when we finally multiply the percentages by the value matrix, the masked self-attention value for the first token, "write," does not include anything that came after it. Likewise, the masked self-attention value for the second token, "a," only includes "write" and "a," and the masked self-attention value for the last token, "poem," includes everything.

And that is how we calculate masked self-attention. Bam!
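For reference, the steps described above can be written compactly in the standard notation, where Q, K, and V are the query, key, and value matrices, d_k is the key dimension, and M is the masking matrix. This is a sketch of the usual formulation, not an equation shown verbatim in the lesson:

```latex
\text{MaskedAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \quad (\text{token } j \text{ is the token itself or comes before}) \\
-\infty & \text{if } j > i \quad (\text{token } j \text{ comes after, so it is masked out})
\end{cases}
```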
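If you'd like to see the whole calculation in code, here's a minimal NumPy sketch of the same steps for the prompt "write a poem." The encoded values, weight matrices, and dimensions are made-up illustrations, not the numbers from the lesson:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative encoded values (word embeddings + positional encoding)
# for the three tokens in the prompt "write a poem". d_model = 4 here,
# and the numbers are invented for this sketch.
encodings = np.array([
    [1.16, 0.23, -0.70, 0.34],   # "write"
    [0.57, 1.36, 0.12, -0.45],   # "a"
    [4.41, -2.16, 0.83, 0.90],   # "poem"
])

d_model = encodings.shape[1]
d_k = d_model  # key/query dimension (an assumption for this sketch)

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_k))   # query weights (random, for illustration)
W_k = rng.normal(size=(d_model, d_k))   # key weights
W_v = rng.normal(size=(d_model, d_k))   # value weights

# Step 1: calculate the query, key, and value matrices.
Q = encodings @ W_q
K = encodings @ W_k
V = encodings @ W_v

# Step 2: similarities between the queries and the keys, scaled by sqrt(d_k).
scaled_similarities = (Q @ K.T) / np.sqrt(d_k)

# Step 3: the masking matrix M -- zeros where a token may attend
# (itself and earlier tokens), negative infinity for anything after it.
n = encodings.shape[0]
M = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

# Step 4: softmax of the masked, scaled similarities gives the
# attention percentages for each row.
percentages = softmax(scaled_similarities + M, axis=-1)

# Step 5: multiply the percentages by the value matrix.
masked_self_attention = percentages @ V

print(percentages)             # first row is [1, 0, 0]: "write" attends only to itself
print(masked_self_attention)
```

The first row of `percentages` comes out as [1, 0, 0], matching the claim that "write" has 100% similarity to itself and 0% to anything that came after it, and `np.tril` is just a convenient way to build the lower-triangular pattern of zeros and negative infinities described above.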