In this lesson, you'll learn about encoder-decoder attention and finally understand where the names encoder-only and decoder-only transformers come from. Bam! Let's go.

So far, we've seen how an encoder-only transformer uses self-attention to create context-aware embeddings, which can cluster similar sentences or similar documents, or can be used as inputs to a classification model. And that's just the start of what we can do with an encoder-only transformer. We've also seen how a decoder-only transformer uses masked self-attention to generate long streams of new tokens.

However, before encoder-only and decoder-only transformers existed, the first transformer ever made had one part called an encoder, which used self-attention, and one part called a decoder, which used masked self-attention. The encoder and the decoder were connected to each other so that they could calculate something called encoder-decoder attention. Encoder-decoder attention uses the output from the encoder to calculate the keys and values, and the queries are calculated from the output of the masked self-attention generated by the decoder. Once the queries, keys, and values are calculated, encoder-decoder attention is calculated just like self-attention, using the similarities between the queries and keys to weight the values.

This first transformer was a seq2seq, or sequence-to-sequence, encoder-decoder model. Seq2seq models were designed to translate text in one language, like English, into another language, like Spanish. For example, Squatch might say "Pizza is great." The encoder would calculate self-attention from that, and the decoder would use the output from the encoder to calculate encoder-decoder attention, which was then used to generate a translation: "¡La pizza es genial!"

It wasn't long after this first encoder-decoder transformer was invented that people realized they could build a model that did interesting things with just the encoder, and those models that only use the encoder were called encoder-only transformers. Likewise, it wasn't long before people realized they could generate text, including translations of text, with just the decoder, and those models were called decoder-only transformers. So now we know where the names encoder-only and decoder-only came from. Bam!

We have also learned about a third type of attention, encoder-decoder attention, which is also called cross-attention. Encoder-decoder attention simply requires us to be flexible about where the query, key, and value matrices come from. Double bam!

And although this style of seq2seq model has somewhat fallen out of favor for language modeling, we still see it in what are called multimodal models. In a multimodal model, we might have an encoder that has been trained on images or sound, and its context-aware embeddings can be fed into a text-based decoder via encoder-decoder attention in order to generate captions or respond to audio prompts. Triple bam!
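To make that concrete, here is a minimal, single-head PyTorch sketch of encoder-decoder attention, in the spirit of the course's coding lessons. The class name, the tiny embedding size, and the unbatched toy tensors are illustrative assumptions rather than the lesson's own code; the only change from plain self-attention is that the keys and values are computed from the encoder's output while the queries are computed from the decoder's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoderAttention(nn.Module):
    """Minimal sketch of encoder-decoder (cross) attention.

    Queries come from the decoder's masked self-attention output;
    keys and values come from the encoder's output.
    (Single head, no batching, for illustration only.)
    """
    def __init__(self, d_model=2):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # queries <- decoder
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # keys    <- encoder
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # values  <- encoder

    def forward(self, decoder_out, encoder_out):
        q = self.W_q(decoder_out)   # (decoder_len, d_model)
        k = self.W_k(encoder_out)   # (encoder_len, d_model)
        v = self.W_v(encoder_out)   # (encoder_len, d_model)

        # Same math as self-attention: scaled dot-product similarities,
        # softmax over the encoder tokens, then a weighted sum of the values.
        sims = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        weights = F.softmax(sims, dim=-1)
        return weights @ v

# Toy usage: 3 encoder tokens (e.g. "Pizza is great"), 2 decoder tokens so far.
encoder_out = torch.randn(3, 2)
decoder_out = torch.randn(2, 2)
cross_attn = EncoderDecoderAttention(d_model=2)
print(cross_attn(decoder_out, encoder_out).shape)  # torch.Size([2, 2])
```

Note that the softmax runs across the encoder tokens, so each decoder token ends up with a weighted mix of the encoder's values, which is exactly how the decoder "looks back" at the sentence being translated.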
Attention in Transformers: Concepts and Code in PyTorch
  • Introduction (Video, 6 mins)
  • The Main Ideas Behind Transformers and Attention (Video, 4 mins)
  • The Matrix Math for Calculating Self-Attention (Video, 11 mins)
  • Coding Self-Attention in PyTorch (Video with Code Example, 8 mins)
  • Self-Attention vs Masked Self-Attention (Video, 14 mins)
  • The Matrix Math for Calculating Masked Self-Attention (Video, 3 mins)
  • Coding Masked Self-Attention in PyTorch (Video with Code Example, 5 mins)
  • Encoder-Decoder Attention (Video, 4 mins)
  • Multi-Head Attention (Video, 2 mins)
  • Coding Encoder-Decoder Attention and Multi-Head Attention in PyTorch (Video with Code Example, 4 mins)
  • Conclusion (Video, 1 min)
  • Appendix – Tips and Help (Code Example, 1 min)