This lesson focuses on long context: what it takes to support it, train it, and utilize it. You'll also learn about performance metrics for evaluating long context models. Let's dive in.

Long context capabilities unlock the full potential of large language models, enabling them to excel at a range of complex tasks. First, they work really well for processing and reasoning over long documents like financial agreements or legal contracts, while preserving key context and details throughout. In retrieval-augmented generation pipelines, long context models enhance the integration of retrieved information, increasing the chance of producing an answer that is both accurate and contextually coherent. These models can also retain multi-turn interaction history, ensuring consistent and natural conversations. Many-shot learning is another critical use case: instead of fine-tuning, we can embed thousands of examples directly into the context. This approach is cost-effective and flexible, enabling quick adaptation to new tasks without retraining. Finally, long context models are essential for agentic use cases and advanced reasoning techniques such as chain of thought and tree of thought, supporting the extensive token sequences required for decision making and complex problem-solving.

However, training a long context model presents a range of significant challenges. First, there are massive computational costs: longer sequences dramatically increase training time, and training models exclusively on long context quickly becomes extremely expensive. Adding to this, the transformer architecture, the backbone of most LLMs, faces inherent limitations in scaling to long context. These issues are compounded by computational bottlenecks: hardware memory imposes strict limits on the context lengths models can handle. Next is the data challenge: there is simply not enough naturally occurring long context data available for effective training. In addition, balancing long and short context during training is another practical hurdle. Overemphasizing long context may degrade performance on short context tasks, while leaning too heavily on short context data limits a model's ability to generalize to longer sequences. Finally, evaluating long context models is particularly challenging. Standard evaluation methods often don't capture the full complexity of how these models perform in real-world scenarios, and a model's ability to process long context doesn't necessarily equate to its ability to utilize it effectively in practice. Because these challenges demand innovations in architecture, data strategies, training infrastructure and methods, as well as evaluation methods, we should approach them from every angle to fully unlock the potential of long context.

As we look at architectural considerations for long context LLMs, it's important to understand the challenges and innovations driving progress. First, transformer-only architectures face significant hurdles when scaling to long context. They demand substantial memory and computational resources, which limits their ability to scale efficiently. While techniques like sparse attention and sliding windows can extend the context, they often come at the cost of degraded quality. To overcome these limitations, we turn to innovative architectures. The Mamba architecture is highly efficient, not only due to its support for parallelism, but also because its state-based design significantly reduces computational and memory complexity. This combination enables faster training and lower resource usage, making it particularly well-suited for handling long context.
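To make that contrast concrete, here is a minimal Python sketch of why a state-based design scales more gracefully than full attention. The recurrence, state size, and dtype below are illustrative toy choices, not Mamba's actual parameterization or Jamba's configuration.

```python
import numpy as np

def attention_score_memory(seq_len, dtype_bytes=2):
    # Full self-attention materializes a (seq_len x seq_len) score matrix per
    # head, so activation memory grows quadratically with context length.
    return seq_len * seq_len * dtype_bytes

def ssm_scan(x, A, B, C):
    # Toy linear state-space recurrence (the core idea behind Mamba-style
    # layers): a fixed-size state h is updated token by token, so the memory
    # held for the state stays constant no matter how long the sequence is.
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:              # one pass over the sequence: O(seq_len) time
        h = A @ h + B * x_t    # state update touches only the small state
        outputs.append(C @ h)  # per-token readout from the compressed state
    return np.array(outputs)

# Illustrative comparison with toy sizes.
A = 0.9 * np.eye(16)           # 16-dimensional state with decaying memory
B = np.ones(16)
C = np.ones(16) / 16
print(ssm_scan(np.random.randn(1000), A, B, C).shape)  # 1000 outputs from a 16-float state

for n in (8_192, 131_072):
    gib = attention_score_memory(n) / 2**30
    print(f"{n:>7} tokens -> ~{gib:.2f} GiB of attention scores per head per layer")
```

The point of the sketch is the asymmetry: the attention score matrix grows quadratically with the context, while the recurrent state stays the same size, which is exactly what makes the state-based path cheap and the attention path expensive at long lengths.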
However, Mamba alone doesn't consistently match the top-quality output of transformers. A typical use case involves providing the model with a long document and asking a question that requires integrating information from across the entire context. Such cases can be challenging for Mamba alone, as it may struggle to compress all the necessary information effectively. The hybrid Jamba architecture integrates Mamba with transformer layers, leveraging Mamba's efficiency and extended context capabilities alongside the transformer's ability to deliver high-quality responses. This synergy ensures production-grade performance while keeping training computationally efficient and resource-friendly. Returning to the long context example, attention can be seen as performing retrieval on top of Mamba, recovering important information that may have been lost during compression.

To handle the growing computational demands of long context training, adapting the infrastructure to support advanced parallelism is crucial. Several techniques play a key role here, including fully sharded data parallelism, tensor parallelism, sequence parallelism, and expert parallelism. Sequence parallelism, in particular, is vital for long context training. It works by distributing the computation and activation memory across GPUs along the sequence dimension of transformer layers. This approach not only reduces the memory footprint but also enhances performance by parallelizing parts of the transformer that were previously untouched by parallelization.

Moving on to data collection and generation: the quality of the data directly influences the effectiveness of long context models. There are two key types of sources we focus on. We begin with a collection of natural long documents such as books and codebases. These sources provide rich, continuous content that aligns well with the needs of long context models, offering depth and complexity. Additionally, synthetic data generation plays a crucial role in supplementing natural data. By generating diverse and relevant data, we can fill gaps and further train the model at scale.

With the architecture selected, the infrastructure optimized, and the data prepared, we're now ready to dive into the training process. Let's begin by examining the central stages of training a long context LLM. These stages can be broadly divided into pre-training, mid-training, and post-training. In pre-training, the model learns to predict the next token in the sequence, which allows it to acquire a deep understanding of language and general knowledge. When training a long context model, a shorter intermediate phase, mid-training, is introduced between pre-training and post-training. During this phase, pre-training continues with a high proportion of long documents to emphasize the model's long-range capabilities. At the end of this phase, we have a long base LLM. Next, post-training aligns the base model with supervised fine-tuning, or SFT, on instruction-tuning data. In some cases, preference tuning is also applied to further improve performance. For this lesson, we will focus only on SFT during the post-training phase. The goal of post-training is to achieve two objectives simultaneously: first, provide the model with skills and conversational capabilities; second, retain the capabilities from pre-training, particularly the long context abilities developed during mid-training. By the end of this stage, we have a long instruct LLM that can follow instructions across varied context lengths.
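In practice, that second objective usually comes down to what goes into the SFT mixture: keeping some long-context examples alongside the ordinary short instruction data. Here is a minimal, hypothetical sketch of such a batch sampler; the pool contents, lengths, and the 20% long fraction are invented for illustration and are not the actual recipe behind any particular model.

```python
import random

# Hypothetical post-training pools: short instruction examples plus
# long-context examples carried over to protect long-range abilities.
short_sft_pool = [{"prompt": f"short task {i}", "len": 2_000} for i in range(10_000)]
long_ctx_pool  = [{"prompt": f"long-doc task {i}", "len": 120_000} for i in range(500)]

def sample_sft_batch(batch_size, long_fraction=0.2, rng=random.Random(0)):
    # Mix short instruction data with long-context examples so SFT teaches
    # conversational skills without washing out long context performance.
    # The 20% long fraction is an arbitrary illustrative value.
    n_long = int(batch_size * long_fraction)
    batch = rng.sample(long_ctx_pool, n_long)
    batch += rng.sample(short_sft_pool, batch_size - n_long)
    rng.shuffle(batch)
    return batch

batch = sample_sft_batch(batch_size=32)
print(sum(ex["len"] > 32_000 for ex in batch), "long-context examples out of", len(batch))
```

The long fraction is exactly the kind of knob that has to be tuned carefully: set it too low and long context performance degrades; set it too high and short-task quality suffers.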
Balancing short and long context data is key to optimizing long context language models during mid-training and post-training. In mid-training, a higher proportion of long documents strengthens long context capabilities, but incorporating short context data ensures the model retains versatility for short tasks. In post-training, the predominantly short context datasets pose a challenge: fine-tuning solely on short context data risks degrading long context performance. Careful data mixing and performance monitoring are essential to maintaining the balance.

Now let's explore another training strategy: length curriculum, a technique used to progressively increase the context length during model training. The idea behind a length curriculum is to gradually expand the model's ability to handle longer contexts. For instance, start with a 32K context window, then move to 64K, 128K, and so on. This gradual increase helps the model adapt to handling longer sequences of data while preserving its understanding of shorter contexts. For example, in Llama pre-training, the context window started at 8K and was gradually increased in fixed stages, ultimately reaching 128K. By following this approach, the model builds long context proficiency in a controlled manner, ensuring stability and consistent performance across different context lengths.

After reviewing the training aspects, we now turn to the evaluation of long context models. A comprehensive evaluation covers both effectiveness and efficiency. On one hand, the model must leverage the full context to produce high-quality outputs, utilizing and integrating the relevant information appropriately. On the other hand, the model must also be practical in real-world use. This means it should not be too slow in generating answers and should not be too costly in terms of resources.

We now get to an important distinction in evaluating long context quality: claimed context length versus effective context length. Claimed context length refers to the maximum input length the model can technically process without error, that is, its declared input length. Effective context length, however, is the maximum input length at which the model can still perform tasks accurately and effectively, integrating the relevant information into its output. Understanding this distinction is important because long context capability is not only about processing larger inputs, but also about the model's ability to leverage the information in those inputs for high-quality results.

To ensure we are evaluating long context models effectively, it's important to choose a comprehensive benchmark that truly captures the model's capabilities in real-world tasks. The needle-in-a-haystack benchmark is a traditional long context evaluation. It focuses on synthetic retrieval of a fact placed at a specific position in a long context, but it may not fully capture the complexities of real-world applications, making it less comprehensive for practical use cases. A more robust option is the RULER benchmark from NVIDIA. This benchmark evaluates long context models across four key areas critical for real-world performance: retrieval, multi-hop tracing, aggregation, and question answering. RULER also highlights the distinction between claimed and effective context length: it defines effective context length as the longest context window at which the model achieves greater than or equal to 85% accuracy, offering a more meaningful and practical measure of a model's capabilities.
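As a quick illustration of that definition, the following sketch derives an effective context length from per-length accuracy scores. The numbers are made up for illustration; only the 85% threshold comes from RULER.

```python
# Hypothetical per-length accuracies from a RULER-style evaluation;
# the values below are invented, not real benchmark results.
accuracy_by_context_length = {
    4_096: 0.96, 8_192: 0.94, 16_384: 0.91, 32_768: 0.88,
    65_536: 0.86, 131_072: 0.81, 262_144: 0.72,
}

def effective_context_length(scores, threshold=0.85):
    # Effective context length = the longest evaluated context window at which
    # accuracy is still at or above the threshold (RULER uses 85%).
    passing = [length for length, acc in scores.items() if acc >= threshold]
    return max(passing) if passing else None

longest_tested = max(accuracy_by_context_length)
effective = effective_context_length(accuracy_by_context_length)
print(f"tested up to {longest_tested} tokens, effective up to {effective} tokens")
```

In this made-up example the model is exercised out to 262K tokens but only clears the 85% bar up to 64K, which is precisely the gap between a claimed and an effective context length.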
Here we can see that not all claimed context lengths are truly effective; Jamba, however, consistently delivers on its promised context length.

Now let's move on to efficiency measurements for long context models. When evaluating efficiency, we focus on two key aspects: latency, the time taken for an LLM to generate a response after receiving a query, and throughput, the number of queries an LLM can handle in a given time frame. Optimizing both latency and throughput is crucial to ensure that the model remains responsive and efficient, even under heavy workloads. In terms of latency, as the context length grows, Jamba outperforms competitors, demonstrating significant improvements. For throughput, Jamba also leads, and interestingly, as context length increases, the performance gap between Jamba and other models widens.

To wrap up: expanding context windows in LLMs unlocks new applications and enhances performance on complex, lengthy inputs. We examined the key challenges involved in extending context length and walked through the critical components required to successfully train and evaluate these models. By optimizing architecture, data, training, and evaluation, we are pushing LLMs toward broader, high-performance applications.