Now that you have a high-level view, let's dive into the details of transformers and state space models and how they have evolved. Understanding these foundations will highlight how Jamba builds on them and solves key challenges in LLM architectures. All right, let's go.

The transformer architecture is the predominant architecture for language models. It is based on the attention mechanism, where each token interacts with every other token in the sequence, enabling the model to capture complex relationships between tokens. It creates a matrix comparing each token with every token that came before, and the weights in that matrix are determined by how relevant each pair of tokens is to one another. So the complexity is quadratic, due to the pairwise interactions between all tokens in the sequence.

Now let's look at how inference works in a transformer. As each token is generated, the model calculates attention across all past tokens in the sequence. Each generated token is then appended to the input, and the process is repeated for the next token. To avoid recalculating attention for all previous tokens at each step, we use a KV cache. The KV cache stores the key and value vectors of previous tokens that were used in the attention calculation, so we only need to compute attention for the new token. With a KV cache, each step has time complexity that is linear in the sequence length, and the complexity over the whole sequence is quadratic. However, this comes at the cost of memory that grows linearly with the sequence length. Revisiting the entire context demands significant memory and compute, and leads to slow inference that is challenging to scale. For long sequences, these demands become a major limitation. To overcome this, let's explore an alternative approach that manages context more efficiently.

To explore this approach, let's start by defining the concept of state. The state is the model's internal memory: it stores relevant past information that helps the model make accurate predictions about future tokens. Different architectures handle this memory in different ways. Transformers keep what is effectively a full state through their attention mechanism: they remember every detail of the history, and we can interpret the KV cache as the transformer's state. This approach, however, is highly inefficient. An alternative to transformers is to employ a fixed-size state representation. These models compress past information into a fixed, manageable state that is updated at each step. For example, whether the context is 200 tokens or 200,000 tokens, it is compressed into the same state size. We will refer to these models as state-based models.

During inference, state-based models only need to process the previous state h_{t-1} and the current input token x_t to update the state to h_t and then generate the next output y_t. They then repeat the same process to generate future outputs. This reduces the computational requirements drastically: it scales efficiently with the input length, avoiding quadratic growth in computation. Specifically, this approach requires constant time per inference step and constant memory that does not scale with the sequence length. This is much more efficient than the time and memory complexity of attention-based architectures. If this concept looks familiar, it's because it lies at the core of RNNs, a traditional architecture that implements the idea of a fixed-size state.
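To make this concrete, here is a minimal sketch of the inference loop of a generic state-based model, written in NumPy with made-up dimensions and random weights purely for illustration. Unlike a KV cache, which keeps vectors for every past token, the loop below carries only a fixed-size state from step to step, so memory stays constant and per-step time does not depend on how many tokens came before.

```python
import numpy as np

# Toy dimensions and random weights, chosen only for illustration.
d_state, d_in, d_out = 16, 8, 8
rng = np.random.default_rng(0)

W_h = rng.normal(size=(d_state, d_state)) * 0.1  # previous state -> new state
W_x = rng.normal(size=(d_state, d_in)) * 0.1     # current input  -> new state
W_y = rng.normal(size=(d_out, d_state)) * 0.1    # new state      -> output

def step(h_prev, x_t):
    """One inference step of a generic state-based model:
    update the fixed-size state, then emit an output."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)  # h_t built from h_{t-1} and x_t only
    y_t = W_y @ h_t                          # y_t read out from the new state
    return h_t, y_t

h = np.zeros(d_state)                        # the state: always d_state numbers
for x_t in rng.normal(size=(200, d_in)):     # 200 tokens or 200,000: same state size
    h, y = step(h, x_t)                      # constant time and memory per step
```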
The core component of an RNN is the recurrent unit, which includes a hidden state that acts as the model's internal memory. Let's look at inference using an RNN. To generate the next token, the model uses the current token, "Jamba", as input to update the hidden state and then generates the output token. The hidden state carries forward compressed information about all previous tokens to help generate the next ones. This means the memory requirements are constant regardless of the sequence length, and the time complexity grows linearly with the sequence length, which is much more efficient. However, because information is summarized into a fixed size, RNNs can lose the ability to capture dependencies across long distances, making them less effective than transformers. In addition, they are costly to train, because the dependencies between time steps prevent parallelization, limiting training efficiency.

Here we can see a comparison between transformers and RNNs. RNNs are more computationally efficient, but they offer lower quality and struggle to scale during training. Interestingly, RNNs were introduced before transformers, but due to these drawbacks, they didn't fully realize the potential of state-based models.

Structured state space models, also known as S4, provide a more efficient way to manage state by imposing a certain structure on the model's parameters and processing the state with linear operations. Here is a diagram showing how inference happens with SSMs. A bar, B bar, and C are the model parameters used at every step to generate an output token. Given an input token x_t, the model updates its state by linearly combining the previous state and the current input, using A bar and B bar respectively. A bar helps determine what to forget and what to remember from the state over time, and B bar helps determine what to remember from the new input. After updating the state, the model uses C to map the current state to the output; C determines how to use the updated state to generate the next token. The forms of the A bar and B bar matrices rely on a parameter labeled delta, which represents the step size. It essentially controls the balance between how much to rely on the previous state versus the current input.

One notable characteristic of SSMs is their dual representation: they can function as a linear recurrence or as a one-dimensional convolution. By using the convolutional view of SSMs, we gain the ability to train LLMs at scale while maintaining the efficient inference of RNNs. Let's return to our comparison. SSMs are efficient both at inference and at training, which is their key advantage over RNNs. However, their quality still lags behind that of transformers. The main reason for this is that the state update in SSMs is independent of the content of the inputs. Let's look again at the diagram of structured SSMs. A bar, B bar, C, and delta are all learned constants. This means that at every step, the model processes every input in exactly the same way. Consider again this phrase: "Jamba is hybrid". With structured SSMs, all of these tokens contribute equally to the state that the model will use to generate the next token. However, the tokens "Jamba" and "hybrid" are more relevant for generating the next token. Selective SSMs address this issue by making the parameters dependent on the input. This concept of selectivity enables the model to focus on or filter out inputs based on their relevance to the task.
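As a deliberately tiny illustration of that recurrence, here is a sketch of a structured SSM with a single scalar input channel and random parameters. It skips the delta-based discretization that actually produces A bar and B bar in the S4 and Mamba papers and simply picks those matrices directly (A_bar, B_bar, and C in the code stand for A bar, B bar, and C). It computes the same outputs two ways, as a linear recurrence and as a one-dimensional convolution, which is exactly the dual representation described above: the recurrent view keeps inference cheap, and the convolutional view allows parallel training.

```python
import numpy as np

# Toy sizes and random parameters, for illustration only.
d_state, seq_len = 4, 8
rng = np.random.default_rng(1)

# Structured SSM: A_bar, B_bar, and C (and the step size delta behind them)
# are learned constants -- the SAME parameters are applied at every step.
A_bar = np.diag(rng.uniform(0.8, 0.99, d_state))  # what to forget / keep in the state
B_bar = rng.normal(size=d_state) * 0.1            # what to take in from the new input
C = rng.normal(size=d_state)                      # how to read the state out

x = rng.normal(size=seq_len)                      # one scalar input channel

# Recurrent view: efficient at inference, one state update per token.
h = np.zeros(d_state)
ys_rec = []
for t in range(seq_len):
    h = A_bar @ h + B_bar * x[t]                  # h_t = A_bar h_{t-1} + B_bar x_t
    ys_rec.append(C @ h)                          # y_t = C h_t

# Convolutional view: the same outputs as a 1-D convolution with kernel
# K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...), which is what lets
# structured SSMs be trained in parallel over the whole sequence.
K = np.array([C @ np.linalg.matrix_power(A_bar, k) @ B_bar for k in range(seq_len)])
ys_conv = [np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(seq_len)]

assert np.allclose(ys_rec, ys_conv)               # both views agree
```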
So, with selection, the state becomes more expressive, focusing only on the most important data while remaining concise. Mamba is a state space model that incorporates selective SSMs into a recurrent architecture that selectively processes information based on the current input. This selection mechanism makes Mamba context aware, allowing it to adaptively determine which information should be stored in the state for future predictions. By combining structured state space modeling with a selective filtering process, Mamba achieves a balance between efficient state management and targeted information retention.

Now, if you check the Mamba paper, you will see this diagram, which shows more details than the simplified diagram you saw at the top so far. For instance, A bar t does not depend only on delta t, as we've previously discussed, but also on another matrix parameter, A. In this lesson, we won't go into the details of what A actually represents, but if you'd like to learn more about how delta t and A are related to A bar t, I encourage you to check the Mamba paper by Gu and Dao. Note how the selection mechanism is specifically applied to delta t, which indirectly affects A bar t. Also, B bar t depends on two parameters, B t and delta t; note how the selection mechanism is applied to delta t and B t. And finally, the two Cs are equivalent; note how the selection mechanism is also applied to C t.

However, the selection mechanism disrupts the ability to compute the convolution, which is what enabled parallelization in structured SSM training. Another important contribution of Mamba is its ability to maintain parallelization during training despite this. This is achieved by applying the parallel scan algorithm together with hardware-aware memory management. Again, I encourage you to check the Mamba paper for more details. As a result, we end up with a robust architecture that excels in training efficiency, inference speed, and memory footprint, while also delivering high-quality performance. This is why Mamba succeeds where transformers fall short, particularly in handling long contexts and real-world production workloads.

But unfortunately, Mamba does have its drawbacks. Mamba falls short when careful handling of specific tokens is required. For some operations, the compressed representation of the hidden state isn't enough, and an attention mechanism is required. One example is copying specific words or sentences from the context. As shown in this paper, Mamba is less successful at predicting a repetitive sequence of words. Here is an example where copying from the input is important and Mamba does worse than transformers. The model is required to classify movie reviews as either positive or negative, and here is a sample output. The transformer excels at such tasks: not only does it succeed in identifying the intent of the review, it also outputs the right label. However, even when Mamba succeeds in identifying the correct intent, it often outputs a non-existent label, because it doesn't attend to the exact options found in the context.

In order to have the advantages of both architectures while mitigating their drawbacks, we implemented our own novel architecture called Jamba. Jamba, which stands for Joint Attention and Mamba, combines both attention layers and Mamba layers. On top of the combination of transformer and Mamba, we also use an additional technology called Mixture of Experts, which allows us to use only a portion of the model weights for each token, as chosen by a router.
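To see what "making the parameters depend on the input" could look like, here is a minimal single-channel sketch. The projections w_delta, w_B, and w_C are made up for illustration, and the discretization of B is simplified; the exact forms of A bar t and B bar t, the parallel scan, and the hardware-aware implementation are the ones described in the Mamba paper by Gu and Dao, not what is shown here.

```python
import numpy as np

# Toy, single-channel sketch of selectivity: the quantities that build the
# state update are now functions of the current input, not fixed constants.
d_state, seq_len = 4, 6
rng = np.random.default_rng(2)

A = -np.abs(rng.normal(size=d_state))        # learned constant (diagonal of A)

# Hypothetical projections that compute delta_t, B_t, and C_t from the input.
w_delta, b_delta = rng.normal() * 0.5, 0.5
w_B = rng.normal(size=d_state) * 0.1
w_C = rng.normal(size=d_state) * 0.1

def softplus(z):
    return np.log1p(np.exp(z))

x = rng.normal(size=seq_len)                      # one scalar input channel
h = np.zeros(d_state)
for t in range(seq_len):
    delta_t = softplus(w_delta * x[t] + b_delta)  # input-dependent step size
    B_t = w_B * x[t]                              # what to take in, per input
    C_t = w_C * x[t]                              # how to read out, per input
    A_bar_t = np.exp(delta_t * A)                 # A bar t built from delta_t and A
    B_bar_t = delta_t * B_t                       # simplified discretization of B
    h = A_bar_t * h + B_bar_t * x[t]              # selective state update
    y_t = C_t @ h                                 # output for this step
```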
Each such portion is called an expert. This way we can improve quality without sacrificing speed or cache size. The core combination of transformer, Mamba, and MoE layers forms a Jamba block, which creates a strong and flexible architecture. This flexibility allows Jamba to balance the sometimes conflicting objectives of low memory usage, high throughput, and high-quality outputs. So there's a trade-off here: adding more transformer layers helps address the issues we discussed with Mamba, but it also increases complexity. The key was finding the optimal number. In the ablations we conducted, as detailed in the paper, we found that a ratio of seven Mamba layers to one attention layer gave the best quality, while being much more efficient in throughput and memory footprint.

Coming back to our table, we see that Jamba achieves top performance, on par with transformers, while also maintaining high efficiency in all aspects. Backing this up with benchmarks, we can see that the Jamba 1.5 models achieved top scores across common quality benchmarks. Moving on, Jamba is the fastest model among all leading competitors, setting a new standard for efficiency without compromising on performance.

All right. In this lesson, you learned that Jamba's hybrid structure of transformer, Mamba, and Mixture-of-Experts components enables it to adaptively manage resources while maintaining performance, making it a powerful choice for scalable language modeling.
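To tie the pieces together, here is a schematic sketch of how such a hybrid block could be laid out. Everything in it is an illustrative assumption rather than the actual Jamba implementation: the layer count of eight, the placement of the MoE layers, and the names are made up, and the sketch only conveys the idea of one attention layer per seven Mamba layers, with MoE replacing some of the dense MLPs.

```python
# A schematic, purely illustrative sketch of how a Jamba-style block could
# interleave its layer types. The exact layer count, MoE placement, and
# naming here are assumptions, not the actual implementation.

LAYERS_PER_BLOCK = 8        # 1 attention + 7 Mamba = the 1:7 ratio discussed above

def build_jamba_block(moe_every: int = 2) -> list[str]:
    """Return the layer sequence of one hypothetical Jamba block."""
    layers = []
    for i in range(LAYERS_PER_BLOCK):
        mixer = "attention" if i == 0 else "mamba"      # 1 attention, 7 Mamba
        mlp = "moe" if i % moe_every == 1 else "dense"  # sparse experts on some layers
        layers.append(f"{mixer} + {mlp}")
    return layers

if __name__ == "__main__":
    for layer in build_jamba_block():
        print(layer)
```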