Documents do not always present information in a strictly top-to-bottom, left-to-right order. In this lesson, you will learn how to use models like LayoutReader to detect text ordering. After identifying a document's layout and locating and ordering the text, you will use a VLM to develop a holistic understanding of the document. Okay, let's get to it.

In lesson two, Andrea walked us through the evolution of OCR, from traditional engines like Tesseract to modern deep learning systems. Now that we understand how text is extracted, we're ready for the next challenge: documents with complex layouts. We'll start by grounding ourselves in what layout detection and reading order actually mean, and why they matter so much for document intelligence. Then, we'll look at how modern systems try to solve these problems using learning-based approaches instead of simple heuristics. Next, we'll cover real-world challenges, dealing with forms, tables, figures, handwriting, and multilingual documents, and how they can be solved with specialized models. From there, we'll introduce Vision-Language Models and talk about how they differ from traditional language-only models. Finally, we'll discuss how you can combine these techniques into a hybrid architecture that you'll implement in the lab.

Let's begin with the pipeline that many teams and organizations still rely on today. First, you extract the text from the document. Then you pass that text into a language model and ask it questions. On the surface, this feels reasonable, simple, even elegant. But the problem is that text extraction can be destructive. The moment we flatten a document, structure is lost: columns and rows get mixed together, tables become meaningless floating text blobs, captions get detached from figures, and reading order becomes unpredictable. For complex documents such as financial reports, research papers, and legal contracts, OCR plus an LLM simply doesn't have enough context to reason correctly.

Layout detection, or Document Layout Analysis, is the first step toward fixing that problem. Rather than treating a document as a corpus of raw text, layout detection identifies meaningful regions on the page and figures out where they are and what they represent. It distinguishes between paragraphs, tables, figures, headers, footers, and captions. In other words, we're no longer just extracting content; we're understanding the structure of the document.

This turns out to be incredibly important. Preserving layout prevents different sections of text from getting mixed together and jumbled. It preserves the narrative flow in multi-column documents. It allows you to target specific parts of a page, like totals in a table or key fields in a form. The core idea here is simple but powerful: layout matters, and once you throw it away, understanding becomes fragile and error-prone.

If you look at real-world documents, such as the ones you've seen in the previous lab, they're rarely just blocks of text. They include columns, tables, charts, images, captions, headers, footers, stamps, and annotations. When you explicitly detect and label these components, downstream models know what kind of information they're looking at before they try to reason over it.

Once we know what's on the page and where it is, the next question is how it should be read. Layout tells us where things are. Reading order tells us in what sequence a human would actually read them. This matters enormously in multi-column layouts or documents with floating figures and captions.
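To make this concrete, here is a minimal, hypothetical sketch of what a layout detector hands to the rest of the pipeline: typed regions with bounding boxes for an imaginary two-column page. The Region class, coordinates, and labels are illustrative placeholders, not the lab's actual data structures. Sorting these regions with a naive top-to-bottom, left-to-right rule already shows how easily a simple heuristic interleaves the two columns.

```python
# Hypothetical illustration of layout-detection output: typed regions with
# bounding boxes for an imaginary two-column page (not the lab's real data).
from dataclasses import dataclass

@dataclass
class Region:
    kind: str        # "paragraph", "table", "figure", "caption", ...
    box: tuple       # (x0, y0, x1, y1) in page coordinates
    text: str = ""

page = [
    Region("paragraph", (50, 100, 280, 300), "Left column, first paragraph"),
    Region("paragraph", (50, 320, 280, 520), "Left column, second paragraph"),
    Region("paragraph", (320, 100, 550, 300), "Right column, first paragraph"),
    Region("figure",    (320, 320, 550, 500)),
    Region("caption",   (320, 510, 550, 540), "Figure 1: quarterly revenue"),
]

# A naive "top-to-bottom, left-to-right" sort interleaves the two columns,
# which is exactly the kind of reading-order mistake discussed next.
naive_order = sorted(page, key=lambda r: (r.box[1], r.box[0]))
for r in naive_order:
    print(r.kind, "-", r.text)
```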
Without a reliable reading order, even a clean layout still leaves room for ambiguity. Historically, reading order was handled with rules: you would sort regions top to bottom, left to right, maybe apply an X-Y cut algorithm, and hope for the best. This works for very clean documents, but breaks down immediately for anything real and complex. Add columns, sidebars, or floating elements, and those heuristics start producing nonsense.

That changed with LayoutReader. Instead of rules, it's a learned model trained on the ReadingBank dataset, over 500,000 annotated pages with correct reading sequences. Each word is represented as a tuple containing the word itself and its index, along with layout features: the bounding box coordinates, width, and height. The model learns to predict the correct reading order from these visual and spatial features. This handles complex layouts, multi-column structures, and irregular reading flows that rule-based systems can't manage.

Let's look at how LayoutReader actually works. The architecture is a sequence-to-sequence model using LayoutLM as its encoder. LayoutLM was developed by Microsoft in 2020 and combines text, layout, and visual information. The purpose is to predict the correct reading order of words or text lines in a document. The process takes OCR-produced bounding boxes, for example from PaddleOCR, and rearranges the document's token sequence to reconstruct a human-readable reading order. The model was trained on ReadingBank, a benchmark dataset Microsoft created specifically for this task; it contains 500,000 document images annotated for reading order. You can see the diversity of the examples: research papers with multi-column layouts, scientific documents with mixed structures, and financial documents with tables and lists. The model recovered the correct reading sequence despite their varying layouts.

Now, let's talk about why OCR plus reading order is still not enough. Reading order detection depends entirely on the quality of the OCR input. But here is the fundamental limitation: OCR only captures text. It misses visual information, images, charts, diagrams, spatial relationships, and visual context. So even with perfect reading order, you're working with incomplete information.

So, what does this look like in practice? You'll see these challenges repeatedly: misaligned forms, complex tables, handwriting, multilingual documents, and figures that require interpretation rather than transcription. These are exactly the points where OCR-based pipelines fall apart, and you've already encountered some of these issues directly in the lab.

Forms are a perfect example of why this is hard. The core challenge is associating labels with values, especially when they're not adjacent. You have a few possible approaches: template-based extraction using fixed coordinates, which is fast but inflexible when layouts change; flexible key-value pair detection, using rules or ML based on proximity and content; or fine-tuned transformers like LayoutLM, trained on datasets like FUNSD, that learn these relationships automatically. And you still need computer vision for non-text elements like checkbox states. Solving this requires combining layout understanding with semantic reasoning and visual detection.
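To give a feel for the proximity-based option, here is a toy sketch of key-value pairing over OCR tokens. The field names, coordinates, and scoring rule are made up for illustration; real systems, and especially the fine-tuned transformer approach, are considerably more robust.

```python
# Toy sketch of proximity-based key-value pairing for forms.
# Field names, boxes, and the scoring rule are illustrative only.

# Each OCR token: (text, (x0, y0, x1, y1))
tokens = [
    ("Invoice No:", (40, 50, 140, 70)),   ("INV-1042", (300, 50, 380, 70)),
    ("Date:",       (40, 90, 90, 110)),   ("2024-05-01", (300, 92, 400, 112)),
    ("Total:",      (40, 400, 95, 420)),  ("$1,250.00", (300, 398, 390, 418)),
]
labels = {"Invoice No:", "Date:", "Total:"}

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def pair_key_values(tokens, labels):
    """Match each label to the nearest candidate value to its right or just below."""
    values = [(t, b) for t, b in tokens if t not in labels]
    pairs = {}
    for text, box in tokens:
        if text not in labels:
            continue
        lx, ly = center(box)
        best, best_score = None, float("inf")
        for vtext, vbox in values:
            vx, vy = center(vbox)
            if vx < lx and vy < ly:
                continue  # skip candidates above and to the left of the label
            score = abs(vy - ly) * 3 + max(vx - lx, 0)  # weight vertical distance more
            if score < best_score:
                best, best_score = vtext, score
        pairs[text] = best
    return pairs

print(pair_key_values(tokens, labels))
# {'Invoice No:': 'INV-1042', 'Date:': '2024-05-01', 'Total:': '$1,250.00'}
```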
Tables are another notorious failure point. Traditional OCR destroys table integrity by flattening row and column relationships, and once structure is lost, the numbers stop making sense. Modern models tackle this differently. Table Transformer uses object detection to find tables, rows, and columns as distinct objects. TableFormer takes an image-to-sequence approach, translating table images directly to HTML. TABLET uses a split-and-merge strategy for large, dense tables. The output can be CSV, JSON, HTML, or a Pandas DataFrame, so we can reason over data instead of text blobs.

Handwriting and multilingual documents represent two more failure modes. Standard OCR, trained on printed text, fails on handwriting: varied styles, cursive, and inconsistent character formation all break the model. Intelligent Character Recognition, or ICR, is a specialized approach trained on handwritten datasets that uses CNN+RNN architectures for sequence modeling and character-level predictions. Multilingual documents bring different challenges: non-standard characters, unique fonts, and alternate reading directions, such as right-to-left for Arabic or vertical text in East Asian languages. This requires robust multilingual models with automatic language detection, language-specific OCR engines, script detection and routing, as well as reading-order adaptation per language.

These specialized models have been the workhorses of Document AI for years, each refined to excel at a specific task. But recently, a new paradigm has emerged: Vision-Language Models. These represent a fundamental shift from specialized tools to general-purpose intelligence that can handle all of these tasks and more.

So what is a Vision-Language Model? Traditional LLMs operate purely over text tokens. VLMs unify vision and language, processing images and text simultaneously to form a shared semantic representation. This allows the model to reason about what's happening in the visual scene, not just what words appear.

Let's discuss what changes when we move from an LLM to a Vision-Language Model. On the left is a regular LLM: it takes text tokens as input, passes them through the transformer, and produces a text output. Everything the model understands comes from text alone. On the right is a Vision-Language Model. The input is image plus text. Before anything reaches the language model, the visual input has to be translated into something the LLM understands using a vision encoder; then the model produces the text output. That's what the three components below represent. First is the vision encoder: models like CLIP or SigLIP convert pixels into visual vectors. Second is the projector: a translation layer that converts visual vectors into token embeddings the LLM can process. Third is the LLM backbone: at this point, a standard LLM reasons over those visual tokens and produces text output. The key takeaway is simple: a VLM is still an LLM, but with a vision stack in front of it. It can see images, but how well it understands documents still depends on how structure, layout, and reading order are handled.
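To make those three components concrete, here is a schematic sketch of the data flow in PyTorch. The module sizes, patch counts, and linear layers are placeholder choices, not any particular model; the point is simply that visual vectors get projected into the LLM's embedding space and concatenated with the text tokens.

```python
# Schematic sketch of the three VLM components (vision encoder -> projector
# -> LLM backbone). Dimensions and modules are illustrative placeholders.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    """Stands in for CLIP/SigLIP: turns image patches into visual vectors."""
    def __init__(self, patch_dim=768, vision_dim=1024):
        super().__init__()
        self.embed = nn.Linear(patch_dim, vision_dim)

    def forward(self, patches):           # (batch, num_patches, patch_dim)
        return self.embed(patches)        # (batch, num_patches, vision_dim)

class Projector(nn.Module):
    """Maps visual vectors into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_vectors):
        return self.proj(visual_vectors)  # (batch, num_patches, llm_dim)

# The "LLM backbone" would then consume [visual tokens ; text tokens]:
vision, projector = TinyVisionEncoder(), Projector()
patches = torch.randn(1, 196, 768)        # fake image patches
text_embeds = torch.randn(1, 32, 4096)    # fake text-token embeddings
visual_tokens = projector(vision(patches))
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM
print(llm_input.shape)                    # torch.Size([1, 228, 4096])
```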
Vision-Language Models are powerful, but they're not a complete solution on their own. If you give a VLM a visually rich document all at once, it can struggle. VLMs hallucinate when visual cues are missing or ambiguous. They lack deterministic grounding, meaning they can't reliably tie their answers back to specific regions of the page. They struggle with nested layouts, multi-page structure, and small text. VLMs excel when paired with structure, but they can't replace the structural reasoning required for real-world document tasks.

One possible solution is to combine layout detection with VLM reasoning. The layout analysis we discussed earlier provides the structural foundation, identifying document regions, their types, and the reading order. That structural information can then decide how you process each region: charts and visualizations might go to a VLM with a targeted prompt; tables could use either a VLM or a specialized transformer, depending on their complexity; and text regions might use traditional OCR or VLM-based extraction. In conclusion, layout detection provides the deterministic grounding, while the VLM handles the elements that benefit from visual reasoning.

One way to orchestrate this workflow is through an agentic framework, and in the lab you will implement a pipeline that demonstrates this. Starting with an input document, you will extract text using PaddleOCR, which gives you all the OCR text along with bounding boxes and confidence scores. You will then use LayoutReader to reorder the extracted text. You'll also run region detection using PaddleOCR's LayoutDetect to identify tables, charts, and text blocks. You'll provide all of the above, the OCR text in the correct order plus region IDs and chunk types, as context for a LangChain agent. The agent will have access to two specialized tools: analyze chart, which sends cropped chart images to a VLM to extract the chart type, axes, data points, and trends; and analyze table, which does the same for tables, extracting headers, rows, values, and notes. Based on the user's question, the agent decides which regions need VLM analysis and which tools to invoke; you'll find a rough sketch of these two tools below. So, let's get to it. By the end of the lab, you'll see exactly how these pieces work together and how this sets the foundation for ADE in the next lesson with Andrea.
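Before you open the lab, here is a rough, hypothetical sketch of what those two tools could look like as LangChain tools. The identifiers analyze_chart and analyze_table, the crop helper, and vlm_describe are placeholders standing in for the lab's actual prompts and VLM client; the real agent wiring will differ.

```python
# Hypothetical sketch of the two agent tools described above.
# Names, prompts, and vlm_describe() are placeholders, not the lab code.
from langchain_core.tools import tool
from PIL import Image

def vlm_describe(image: Image.Image, prompt: str) -> str:
    """Placeholder for a call to a Vision-Language Model with an image and a prompt."""
    raise NotImplementedError("Swap in your VLM client here.")

def crop_region(page_image_path: str, x0: int, y0: int, x1: int, y1: int) -> Image.Image:
    """Crop a detected region out of the full page image."""
    return Image.open(page_image_path).crop((x0, y0, x1, y1))

@tool
def analyze_chart(page_image_path: str, x0: int, y0: int, x1: int, y1: int) -> str:
    """Send a cropped chart region to the VLM and return chart type, axes,
    data points, and trends."""
    chart = crop_region(page_image_path, x0, y0, x1, y1)
    return vlm_describe(chart, "Describe this chart: type, axes, data points, trends.")

@tool
def analyze_table(page_image_path: str, x0: int, y0: int, x1: int, y1: int) -> str:
    """Send a cropped table region to the VLM and return headers, rows,
    values, and any notes."""
    table = crop_region(page_image_path, x0, y0, x1, y1)
    return vlm_describe(table, "Extract this table: headers, rows, values, notes.")

# These tools, together with the ordered OCR text and region metadata, would be
# handed to a tool-calling LangChain agent that decides, per question, which
# regions need VLM analysis and which tool to invoke.
tools = [analyze_chart, analyze_table]
```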