In this lesson, you will learn about Multi-Head Attention. You'll learn how it is used and how it is incorporated into a transformer. Let's get to it.

So far, we've seen that attention helps establish how each word in the input is related to the others, and for a simple example, what we've seen so far works fine. However, in order to correctly establish how words are related in longer, more complicated sentences and paragraphs, we can apply attention to the encoded values multiple times simultaneously. Each attention unit is called a Head and has its own sets of weights for calculating the queries, keys, and values. And when we have multiple Heads calculating attention, we call it Multi-Head Attention. Bam!

In this example, we have three attention heads. However, in the manuscript that first described transformers, they used eight attention heads. In our example, with three heads and two attention values per head, we end up with six attention values. In order to get back down to the original number of encoded values that we started with, two, we simply connect all six attention values to a fully connected layer that has two outputs. Bam!

Note: another commonly used way to reduce the number of outputs is to modify the shape of the value weight matrix. So far we've used a matrix with two columns of weights, and that gave us a value matrix with two columns, and as a result, each attention head has two outputs. However, if we only use one column of weights, then the value matrix will only have one column, and each head will only output one value. In that case, since we started with two encoded values, we would need at most two Heads to get back to the same number. Or we could just code our transformer to be more flexible about these things. Bam!

Now we understand the ideas behind Multi-Head Attention. Bam!
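To make the first idea concrete, here is a minimal sketch of Multi-Head Attention with three heads, assuming two encoded values per token (so each head produces two attention values, the concatenation gives six, and a fully connected layer brings us back down to two). The class and variable names (MultiHeadAttention, W_q, W_k, W_v, fc) are hypothetical, not from the lesson's own code, and the scaling and layer choices are just one reasonable way to fill in the details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=2, n_heads=3):
        super().__init__()
        # Each Head has its own weights for calculating queries, keys, and values.
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                "W_q": nn.Linear(d_model, d_model, bias=False),
                "W_k": nn.Linear(d_model, d_model, bias=False),
                "W_v": nn.Linear(d_model, d_model, bias=False),
            })
            for _ in range(n_heads)
        ])
        # Fully connected layer that maps the 3 x 2 = 6 attention values
        # per token back down to the original 2 encoded values.
        self.fc = nn.Linear(n_heads * d_model, d_model, bias=False)

    def forward(self, x):
        # x: (n_tokens, d_model) encoded values, one row per token
        head_outputs = []
        for head in self.heads:
            q = head["W_q"](x)                                 # queries
            k = head["W_k"](x)                                 # keys
            v = head["W_v"](x)                                 # values
            scores = q @ k.transpose(0, 1) / k.size(-1) ** 0.5
            attn = F.softmax(scores, dim=-1)                   # attention percentages
            head_outputs.append(attn @ v)                      # this head's attention values
        # Concatenate the heads' outputs (6 values per token) and run them
        # through the fully connected layer (2 outputs per token).
        return self.fc(torch.cat(head_outputs, dim=-1))

# Usage: 3 tokens, each with 2 encoded values in, 2 attention values out.
x = torch.randn(3, 2)
print(MultiHeadAttention()(x).shape)  # torch.Size([3, 2])
```

The fully connected layer at the end is what lets us use any number of heads and still end up with the same number of values per token that we started with.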
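And here is a sketch of the alternative mentioned in the note: instead of adding a fully connected layer, we shrink the value weight matrix to a single column, so each head outputs one attention value per token and two heads already get us back to two values. Again, this assumes two encoded values per token, and the names (SlimHeadAttention, W_q, W_k, W_v) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimHeadAttention(nn.Module):
    def __init__(self, d_model=2, n_heads=2):
        super().__init__()
        self.W_q = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(n_heads)])
        self.W_k = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(n_heads)])
        # Value weight matrices with only one column: d_model -> 1,
        # so each head outputs a single attention value per token.
        self.W_v = nn.ModuleList([nn.Linear(d_model, 1, bias=False) for _ in range(n_heads)])

    def forward(self, x):
        # x: (n_tokens, d_model) encoded values, one row per token
        outs = []
        for W_q, W_k, W_v in zip(self.W_q, self.W_k, self.W_v):
            q, k, v = W_q(x), W_k(x), W_v(x)                            # v: (n_tokens, 1)
            attn = F.softmax(q @ k.transpose(0, 1) / k.size(-1) ** 0.5, dim=-1)
            outs.append(attn @ v)                                        # (n_tokens, 1)
        # With 2 heads, concatenating gives 2 values per token,
        # so no fully connected layer is needed.
        return torch.cat(outs, dim=-1)

x = torch.randn(3, 2)
print(SlimHeadAttention()(x).shape)  # torch.Size([3, 2])
```

The trade-off is that this ties the number of heads to the number of encoded values, which is why coding the transformer to be more flexible about these things is often the nicer option.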