In this lesson, you'll learn about encoder-decoder attention and finally understand where the names encoder-only and decoder-only transformer come from. Bam! Let's go.

So far, we've seen how an encoder-only transformer uses self-attention to create context-aware embeddings, which can cluster similar sentences or similar documents, or can be used as inputs to a classification model. And that's just the start of what we can do with an encoder-only transformer. We've also seen how a decoder-only transformer uses masked self-attention to generate long streams of new tokens. However, before encoder-only and decoder-only transformers existed, the first transformer ever made had one part, called an encoder, that used self-attention, and one part, called a decoder, that used masked self-attention. And the encoder and the decoder were connected to each other so that they could calculate something called encoder-decoder attention. Encoder-decoder attention uses the output from the encoder to calculate the keys and values, and the queries are calculated from the output of the masked self-attention in the decoder. Once the queries, keys, and values are calculated, encoder-decoder attention is calculated just like self-attention, using the similarity between every query and every key.

This first transformer was based on something called a seq2seq, or encoder-decoder, model. Seq2seq models were designed to translate text in one language, like English, into another language, like Spanish. For example, Squatch might say "Pizza is great," and the encoder would calculate self-attention from that, and the decoder would use the output from the encoder to calculate encoder-decoder attention, which was then used to generate a translation: "¡La pizza es genial!"

It wasn't long after this first encoder-decoder transformer was invented that people realized they could build a model that did interesting things with just the encoder, and those models that only use the encoder were called encoder-only transformers. Likewise, it wasn't long before people realized they could generate text, including translations of text, with just the decoder, and those models were called decoder-only transformers. So now we know where the names encoder-only and decoder-only came from. Bam!

And we have also learned about a third type of attention, encoder-decoder attention, which is also called cross-attention. Encoder-decoder attention simply requires us to be flexible with respect to how we calculate the query, key, and value matrices. Double bam!

And although this style of seq2seq model has somewhat fallen out of favor for language modeling, we still see it in what are called multimodal models. In a multimodal model, we might have an encoder that has been trained on images or sound, and its context-aware embeddings could be fed into a text-based decoder via encoder-decoder attention in order to generate captions or respond to audio prompts. Triple bam!
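
To make the calculation a little more concrete, here is a minimal sketch of encoder-decoder attention in Python with NumPy. The function names, weight matrices, and token counts are made up for illustration, so don't treat this as code from any particular library. The key idea is exactly what we just described: the queries come from the decoder's output, the keys and values come from the encoder's output, and the rest works just like self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(encoder_output, decoder_output, W_q, W_k, W_v):
    """Cross-attention: queries from the decoder, keys and values from the encoder."""
    Q = decoder_output @ W_q  # queries from the decoder's masked self-attention output
    K = encoder_output @ W_k  # keys from the encoder's output
    V = encoder_output @ W_v  # values from the encoder's output

    d_k = K.shape[-1]
    # Similarity between every query and every key, scaled, then softmax
    # turns the similarities into attention weights.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Each decoder token gets a weighted sum of the encoder's values.
    return weights @ V

# Hypothetical sizes: 3 encoder tokens (e.g. "Pizza is great"),
# 2 decoder tokens generated so far, and embeddings of size 4.
rng = np.random.default_rng(0)
d_model = 4
encoder_output = rng.normal(size=(3, d_model))  # context-aware embeddings from the encoder
decoder_output = rng.normal(size=(2, d_model))  # output of the decoder's masked self-attention
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

print(encoder_decoder_attention(encoder_output, decoder_output, W_q, W_k, W_v).shape)  # (2, 4)
```

The only thing that changes compared to self-attention is where the query, key, and value matrices come from, which is exactly why we said encoder-decoder attention just requires us to be flexible about how we calculate them.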