
💻   Accessing Utils File and Helper Functions

In each notebook, on the top menu:

1:   Click on "File"

2:   Then, click on "Open"

You will be able to see, in the left sidebar, all the notebook files for the lesson, including any helper functions used in the notebook.


💻   Downloading Notebooks

In each notebook, on the top menu:

1:   Click on "File"

2:   Then, click on "Download as"

3:   Then, click on "Notebook (.ipynb)"


💻   Uploading Your Files

After following the steps in the previous section ("File" => "Open"), click on the "Upload" button to upload your files.


📗   See Your Progress

Once you enroll in this course—or any other short course on the DeepLearning.AI platform—and open it, you can click on 'My Learning' at the top right corner of the desktop view. There, you will be able to see all the short courses you have enrolled in and your progress in each one.

Additionally, your progress in each short course is displayed at the bottom-left corner of the learning page for each course (desktop view).


📱   Features to Use

🎞   Adjust Video Speed: Click on the gear icon (⚙) on the video and then from the Speed option, choose your desired video speed.

🗣   Captions (English and Spanish): Click on the gear icon (⚙) on the video and then from the Captions option, choose to see the captions either in English or Spanish.

🔅   Video Quality: If you do not have access to high-speed internet, click on the gear icon (⚙) on the video and then, from the Quality option, choose the quality that works best for your internet speed.

🖥   Picture in Picture (PiP): This feature allows you to continue watching the video when you switch to another browser tab or window. Click on the small rectangle shape on the video to go to PiP mode.

✓   Hide and Unhide the Lesson Navigation Menu: If you do not have a large screen, you can click on the small hamburger icon beside the title of the course to hide the left-side navigation menu. You can unhide it by clicking on the same icon again.


🧑   Efficient Learning Tips

The following tips can help you have an efficient learning experience with this short course and other courses.

🧑   Create a Dedicated Study Space: Establish a quiet, organized workspace free from distractions. A dedicated learning environment can significantly improve concentration and overall learning efficiency.

📅   Develop a Consistent Learning Schedule: Consistency is key to learning. Set aside specific times in your day for study and make them a routine. Consistent study times help build a habit and improve information retention.

Tip: Set a recurring event and reminder in your calendar, with clear action items, to get regular notifications about your study plans and goals.

☕   Take Regular Breaks: Include short breaks in your study sessions. The Pomodoro Technique, which involves studying for 25 minutes followed by a 5-minute break, can be particularly effective.

💬   Engage with the Community: Participate in forums, discussions, and group activities. Engaging with peers can provide additional insights, create a sense of community, and make learning more enjoyable.

✍   Practice Active Learning: Don't just read the material, run the notebooks, or watch the videos passively. Engage actively by taking notes, summarizing what you learn, teaching the concepts to someone else, or applying the knowledge to your own projects.


📚   Enroll in Other Short Courses

Keep learning by enrolling in other short courses. We add new short courses regularly. Visit the DeepLearning.AI Short Courses page to see our latest courses and begin learning new topics. 👇

👉👉 🔗 DeepLearning.AI – All Short Courses


🙂   Let Us Know What You Think

Your feedback helps us know what you liked and didn't like about the course. We read all your feedback and use it to improve this course and future courses. Please submit your feedback by clicking on the "Course Feedback" option at the bottom of the lessons list menu (desktop view).

Also, you are more than welcome to join our community 👉👉 🔗 DeepLearning.AI Forum


🎙   Introduction (Video Transcript)

Welcome to Attention in Transformers: Concepts and Code in PyTorch, taught by Josh Starmer. Josh is the CEO of StatQuest, an online provider of educational material in AI, data science, machine learning, and statistics.

It's my pleasure to be here with you, Andrew, and to teach this course. In this course, you'll learn about the attention mechanism, a key technical breakthrough that eventually led to transformers. You'll learn how these ideas developed over time, how attention works, and how it is implemented.

The transformer architecture and the attention algorithm have been hugely important in the development of large language models. Let me start with the history, and then Josh will dive into the details of the algorithms and implementation with some great illustrations and examples.

Back in 2014, a lot of researchers were working on machine translation: the task of, say, taking an input sentence in English and translating it into French. A very basic approach was to take each English word and look up the French word it translates into. But this approach doesn't work that well. For example, the word order may not be the same in English and French: here, the English sentence starts "the European Economic Area was...", but the word order is changed in French. Sentences can also be of different lengths: the three-word English sentence "They arrived late" is five words in French.

To manage these challenges, two research groups, Yoshua Bengio's group at the University of Montreal and Chris Manning's group at Stanford University, independently came up with similar approaches and invented an attention mechanism. The papers are cited here on the slide. Both research groups found that an encoder-decoder mechanism can be effective for translation. Let me show you how this worked.

The encoder reads in one word at a time and produces output vectors, one per word. Earlier approaches had produced a single dense vector that represented the meaning of the entire sentence, but in these new papers the vectors for the individual words were preserved and made available to the decoder. These dense per-word vectors captured the meaning of words in the context of the sentence. Today, we might call these contextual embeddings, where the embedding depends not just on the word but also on the words around it, on the context.

Once the input sentence is converted into vectors, the decoder uses these vectors as inputs and generates the output one word at a time to produce, say, the French output from the English input. Now here's the important part: the decoder has a means of weighting, which we also call attending to or paying attention to, each input word, or really each input word's embedding, independently, based on where that word is in the input and where the decoder is in producing the output. So, for example, when the translation starts and the decoder is trying to generate the first French word, it might weight, or attend to, the first English word in the input most heavily. For the second word, however, in this example it will attend to the fourth input vector, the vector for "Area", due to the change in word order in French. Continuing, the model weights, or attends to, or pays attention to the words most relevant for that step of the translation. And this gave us an early form of attention.
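To make this weighting concrete, here is a minimal PyTorch sketch of one decoder step attending over per-word encoder vectors. The tensor sizes, random values, and simple dot-product scoring are illustrative assumptions, not code from those papers or from this course:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder_vectors = torch.randn(4, 8)  # one contextual vector per input word (4 words, size 8)
decoder_state = torch.randn(8)       # the decoder's state at the current output step

# Score each input vector against the decoder state, then normalize with
# softmax so the attention weights are positive and sum to 1.
scores = encoder_vectors @ decoder_state  # shape: (4,)
weights = F.softmax(scores, dim=0)        # how heavily to attend to each input word

# The context passed to the decoder is the weighted sum of the input vectors.
context = weights @ encoder_vectors       # shape: (8,)
print(weights)  # the largest weight marks the input word attended to most heavily
```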
Just a few years later, in 2017, the paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser (who has actually done some teaching with DeepLearning.AI), and Illia Polosukhin was published by my former team, the Google Brain team. This paper introduced the transformer architecture and a more general form of attention, which Josh will be describing today, and which was designed specifically to be highly scalable using GPUs. Chatting with Aidan, he told me that back when they designed this architecture, the number one criterion for all the large design choices was: can we scale this on a GPU? And that turned out to be a great decision.

This paper also studied machine translation, and the model it described also had an encoder and a decoder. The encoder creates contextual embeddings for the input sentence in a single pass, and the decoder then produces the output one word at a time. Each output is fed back as an input to the decoder to serve as context for the next step, so when the decoder is generating the next word, it also knows the previous words it has already generated. The encoder model would go on to be the basis for BERT, which stands for "Bidirectional Encoder Representations from Transformers" and which in turn is the basis of nearly all of the embedding models you might use to create embedding vectors for RAG or recommender applications today. The decoder model has since been used as the basis for the GPT, or "Generative Pre-trained Transformer", family of large language models that OpenAI has been building, and which you might use in ChatGPT. This decoder is also the basis for most other popular models, such as those from Anthropic, Google, Mistral, and Meta. The original paper used just six layers of attention, while, for example, Llama 3.1-405B by Meta has 126 layers, but the basic architecture is the same as you'll be learning with Josh today.

Here's how the course is laid out. We'll start by describing the main ideas behind transformers and attention, and then go on to work through the matrix math and coding of attention. You will then learn the difference between self-attention and masked self-attention and work through the PyTorch implementation. Then you will learn the details of the encoder-decoder architecture Andrew just described, as well as multi-head attention.

Many people have helped with this course. I'd like to thank Geoff Ladwig, Esmaeil Gargari, and Hawraa Salami.

Hey Andrew, what's with the mask? Oh, I thought you were going to talk about masked self-attention, and I thought I would try to illustrate that. Well, clearly I'm self-attention, but maybe this conversation between the two of us, we'd call that cross-attention. Yeah.
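As a preview of what you'll build in the coding lessons, here is a minimal sketch of scaled dot-product attention with an optional causal mask. The formula softmax(QKᵀ/√d_k)V is from the 2017 paper; the tensor shapes and values below are illustrative assumptions rather than the course's own code:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # block masked positions
    return F.softmax(scores, dim=-1) @ V

torch.manual_seed(0)
x = torch.randn(3, 4)  # 3 tokens, model dimension 4 (illustrative)

# Self-attention: queries, keys, and values all come from the same tokens.
self_out = scaled_dot_product_attention(x, x, x)

# Masked (causal) self-attention: each token attends only to itself and earlier tokens.
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
masked_out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(self_out.shape, masked_out.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```

In a full transformer, Q, K, and V are linear projections of the token embeddings, and multi-head attention runs several such computations in parallel; those details are covered in the lessons below.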
📋   Course Outline
Attention in Transformers: Concepts and Code in PyTorch
  • Introduction (Video ・ 6 mins)
  • The Main Ideas Behind Transformers and Attention (Video ・ 4 mins)
  • The Matrix Math for Calculating Self-Attention (Video ・ 11 mins)
  • Coding Self-Attention in PyTorch (Video with Code Example ・ 8 mins)
  • Self-Attention vs Masked Self-Attention (Video ・ 14 mins)
  • The Matrix Math for Calculating Masked Self-Attention (Video ・ 3 mins)
  • Coding Masked Self-Attention in PyTorch (Video with Code Example ・ 5 mins)
  • Encoder-Decoder Attention (Video ・ 4 mins)
  • Multi-Head Attention (Video ・ 2 mins)
  • Coding Encoder-Decoder Attention and Multi-Head Attention in PyTorch (Video with Code Example ・ 4 mins)
  • Conclusion (Video ・ 1 min)
  • Course Feedback
  • Community