Welcome to Lab 3. We're now ready to build an intelligent document analysis agent that combines OCR, layout detection, and VLM tools. You'll follow an agentic approach, which allows you to recognize different content types automatically, apply the right tools for each content type, and combine insights into coherent answers. This is similar to how a human analyst would approach a complex report: scanning the structure first, then diving deep into specific sections as needed.

The overall pipeline has three components. First, text extraction: we'll use PaddleOCR for text extraction and LayoutLM for reading order. Second, layout detection: we'll use PaddleOCR's layout detection to find tables, charts, and other regions. Third, agentic processing: we'll use a LangChain agent with two specialized tools, AnalyzeChart for VLM-based chart and figure analysis, and AnalyzeTable for VLM-based table extraction.

Let's begin by loading the environment variables from a .env file. This file contains the API key for OpenAI, which we will need later when we define the agent. Next, we import the essential libraries for our pipeline: PIL for loading and manipulating images, cv2 from OpenCV for image processing and drawing bounding boxes, matplotlib for visualizing results, numpy for numerical operations on image arrays, dataclass, Python's built-in decorator for creating clean data structures, and typing for type hints that make our code more readable.

With our environment set up, let's dive into the first stage of our pipeline: extracting text from the document and determining the correct reading order using LayoutLM. First, we create a PaddleOCR instance configured for English text. Then we load our sample document, which is the same economics report you've seen in the previous lab. Now, let's run the OCR engine on our document. As you've already seen in the previous lab, the OCR engine returns three key pieces of information for each detected region: the recognized text string, a confidence score, and the bounding box coordinates.

Visual verification is important. Here, we draw green bounding boxes around each detected text region. This helps us confirm that OCR is correctly identifying all text areas in the document. Notice how each line of text, table cell, and label gets its own bounding box.

Raw OCR output is just a list of values. To make our code cleaner and more maintainable, we create an OCRRegion dataclass. This gives us a clean typed structure for each text region and a convenient bounding-box property that converts the four-corner polygon to a simple (x1, y1, x2, y2) format. This structured format will be used throughout the rest of our pipeline.
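To make this stage concrete, here is a minimal sketch of the OCR run and the green-box verification just described. It assumes a local image file named economics_report.png and the classic PaddleOCR result format of [polygon, (text, confidence)] entries, which can differ across PaddleOCR versions.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt
from paddleocr import PaddleOCR

ocr_engine = PaddleOCR(lang="en")     # English text recognition
image_path = "economics_report.png"   # assumed filename

result = ocr_engine.ocr(image_path)   # run detection + recognition
regions = result[0]                   # entries for the first (and only) page

# Visual verification: draw a green box around every detected text region.
image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
for polygon, _ in regions:
    pts = np.array(polygon, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(image, [pts], isClosed=True, color=(0, 255, 0), thickness=2)

plt.figure(figsize=(10, 14))
plt.imshow(image)
plt.axis("off")
plt.show()
```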
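And here is one way the OCRRegion dataclass might look. The field and property names are illustrative rather than the lab's exact ones, and the last step builds on the regions list from the sketch above.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OCRRegion:
    text: str                            # recognized text string
    confidence: float                    # recognition confidence score
    polygon: List[Tuple[float, float]]   # four corner points from PaddleOCR

    @property
    def bbox_xyxy(self) -> Tuple[int, int, int, int]:
        """Collapse the four-corner polygon into (x1, y1, x2, y2)."""
        xs = [p[0] for p in self.polygon]
        ys = [p[1] for p in self.polygon]
        return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))

# Convert the raw PaddleOCR output into structured regions.
ocr_regions = [
    OCRRegion(text=text, confidence=conf, polygon=polygon)
    for polygon, (text, conf) in regions
]
```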
Now that we have the raw OCR output, we need to determine the correct reading order. Let's use LayoutReader, based on LayoutLMv3, for this task. We load the pre-trained LayoutReader model from Hugging Face. This model takes bounding box coordinates as input and predicts the reading-order position for each box. It has been trained to understand common document layouts, including single-column, multi-column, tables, and mixed layouts.

With the model loaded, let's define the core function for reading-order detection. Here is what it does, step by step. First, we calculate the image dimensions: we estimate the image size from the bounding boxes, with 10% padding. Second, we normalize the coordinates: LayoutLM expects coordinates in a 0 to 1000 range, so we scale our bounding boxes. Third, we prepare the inputs, converting them to the format the transformer expects. Fourth, we run inference and get the model's predictions. Last but not least, we parse the results, extracting the reading order from the model's output logits. The result is a list that tells us the reading order for each OCR text region.

Let's see the reading order in action. This visualization overlays numbers on each text region, showing the sequence in which they should be read. The numbers in red indicate the reading position. Notice the logical flow: title first, then the content, following the natural way a human would read the document. Of course, it's not perfect, with some jumping around, and this shows some of the limitations you may face with this approach. Depending on the complexity, you may have to fine-tune your own layout model for chunking and reading order, which can be hard to develop, scale, and maintain.

Finally, let's combine our OCR text with the reading order to produce properly sequenced text. This function pairs each OCR region with its corresponding reading position, then sorts all regions by position and returns a list of dictionaries with position, text, confidence, and bounding box info. This ordered text will become part of our agent's context, allowing it to understand the document content without needing to call a VLM for basic text questions.

Now that we can extract and order text, we need to understand what types of content exist in the document. This is where layout detection comes in. As you've seen in the previous lab, PaddleOCR provides a separate LayoutDetection class specifically for identifying document structure. Let's initialize it now. Next, we define the process_document function to run layout detection and return a list of detected regions. Each region includes a label (the type of content, like text, table, chart, or figure), a score (the confidence score for the detection), and a bbox (the bounding box coordinates in xyxy format). Let's take a closer look at the top five detected regions. Notice the different content types: text blocks, a chart, and a paragraph title.

Similar to our OCR results, let's create a LayoutRegion dataclass for clean data handling. Each region gets a unique ID that our tools will use to reference specific regions. Then we loop through each result and store it in a structured format. Now let's visualize all the layout regions with color-coded boxes. Each region type gets a unique color, and we display the region ID, type, and confidence score. As you can see, we identified titles, tables, charts, and bodies of text.

With our layout regions identified, we need to prepare them for the agent. For our agent to analyze a chart or table, we need to send a cropped region to the VLM. This approach has several benefits. First is focused analysis: the VLM only sees the relevant content. Second is reduced noise: surrounding text does not interfere. Third is lower cost: smaller images mean lower API costs. Now let's crop each region using the crop_region function. We also convert images to base64 encoding, which is the format vision APIs expect. Then we load the image and use these two functions to create a dictionary containing information for each cropped region. Keep in mind that even though VLMs are powerful, they're not great at localization. This approach may help improve accuracy, but you'll need to fine-tune your prompts for various edge cases to deal with greater complexity and variation. Let's display all the cropped regions to see what content the agent will have access to. Each region is labeled with its ID and type.
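As a rough sketch of the reading-order plumbing described above: the LayoutReader inference call itself is elided (its output is represented here by a predicted_positions list), so what follows covers only the normalization to LayoutLM's 0 to 1000 coordinate range and the final sort of OCR regions by predicted position.

```python
from typing import Dict, List, Tuple

def normalize_boxes(boxes: List[Tuple[int, int, int, int]]) -> List[List[int]]:
    """Scale xyxy boxes into the 0-1000 range LayoutLM expects,
    estimating the page size from the boxes with 10% padding."""
    width = max(b[2] for b in boxes) * 1.1
    height = max(b[3] for b in boxes) * 1.1
    return [
        [int(1000 * x1 / width), int(1000 * y1 / height),
         int(1000 * x2 / width), int(1000 * y2 / height)]
        for x1, y1, x2, y2 in boxes
    ]

def order_regions(ocr_regions: List["OCRRegion"],
                  predicted_positions: List[int]) -> List[Dict]:
    """Pair each OCR region with its predicted reading position and sort."""
    paired = sorted(zip(predicted_positions, ocr_regions), key=lambda p: p[0])
    return [
        {
            "position": pos,
            "text": region.text,
            "confidence": region.confidence,
            "bbox": region.bbox_xyxy,
        }
        for pos, region in paired
    ]
```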
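And here is a hedged sketch of the LayoutRegion structure plus the cropping and base64 helpers. The field names, the PNG encoding, and the layout_regions variable (assumed to be the list produced by layout detection) are illustrative choices, not the lab's exact code.

```python
import base64
import io
from dataclasses import dataclass
from typing import Tuple
from PIL import Image

@dataclass
class LayoutRegion:
    region_id: str                     # unique ID the tools reference
    label: str                         # e.g. "table", "chart", "text"
    score: float                       # detection confidence
    bbox: Tuple[int, int, int, int]    # xyxy coordinates

def crop_region(image: Image.Image, region: LayoutRegion) -> Image.Image:
    """Crop the page image down to a single layout region."""
    return image.crop(region.bbox)

def to_base64(image: Image.Image) -> str:
    """Encode a PIL image as base64, the format vision APIs expect."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Build the lookup of cropped regions that the agent's tools will use.
page = Image.open("economics_report.png")
cropped_regions = {
    region.region_id: {
        "label": region.label,
        "image_b64": to_base64(crop_region(page, region)),
    }
    for region in layout_regions   # layout_regions: assumed output of layout detection
}
```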
Now that we have the text extracted and ordered, and the layout detected, we can start building our agent, starting with the tools. We'll create two specialized tools, each designed for a specific type of content analysis: AnalyzeChart uses a vision language model to interpret charts and figures, and AnalyzeTable uses the same VLM to extract structured data from tables. By creating specialized tools, we can use optimized prompts for each content type and return structured JSON that is easy to process. For the VLM, we will use gpt-4o-mini from OpenAI.

Next, let's define the prompts for each tool. The prompts we use are critical for getting useful, structured output from the VLM. Each prompt has three components. First, we define the role, in this case, "You are a chart analysis specialist." Second, we define what to extract: specific fields like chart type, axes, and data points. Then we provide a JSON template to ensure a consistent output format. Well-defined prompts help ensure the VLM returns data our agent can reliably use. We'll use the same logic for the second prompt.

Before creating our tools, let's define a utility function that handles the mechanics of calling the VLM with an image. It creates a multimodal message containing both the text prompt and the base64-encoded image, then returns the model's response. The @tool decorator from LangChain transforms our function into an agent-usable tool. It takes in the region_id, checks whether the region exists in the dictionary of regions we just created, gets the region data, and then passes the cropped image in base64, along with the corresponding prompt, to the VLM. Following the same pattern, let's create our second tool for table extraction. It uses the table-specific prompt to guide the VLM in extracting structured data with proper headers and row organization.

Before integrating our tools into the agent, let's test them individually. This verifies that the VLM connection works and our prompts produce useful output. We'll find the first chart region and analyze it. Notice that the data points are close but not 100% accurate. This is due to the limited visual reasoning and localization capabilities I mentioned earlier. Similarly, let's test the table tool. It does a decent job of extracting the information from this simple table. However, as complexity grows, the agent may have a harder time localizing and extracting data, and may be prone to hallucination.

With our tools tested and working, we're ready to build the agent that will orchestrate everything. So how does the agent work? First, the agent receives a question about the document. Second, it reads the system prompt containing all the OCR text and the layout info. Third, it decides whether it can answer from text alone or needs to use a tool. Fourth, for visual content like charts and tables, it calls the appropriate tool. And last but not least, it combines all the information into a coherent response.

Before building the agent, let's verify that our data structures are ready. We need both the ordered text from OCR plus LayoutLM, and the layout regions from layout detection. Now, let's prepare the context for our agent. The agent needs a well-formatted system prompt, so these helper functions convert our data structures into readable strings. The formatted context serves as the agent's memory of the document.
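Here is a minimal sketch of the VLM helper and one @tool-decorated function, assuming LangChain's OpenAI integration and the cropped_regions dictionary built earlier. The prompt text is abbreviated compared to the lab's full chart prompt with its JSON template.

```python
import json
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

vlm = ChatOpenAI(model="gpt-4o-mini")

CHART_PROMPT = (
    "You are a chart analysis specialist. Identify the chart type, axes, "
    "and data points, and respond with JSON only."
)

def call_vlm(prompt: str, image_b64: str) -> str:
    """Send a text prompt plus a base64-encoded image to the VLM."""
    message = HumanMessage(content=[
        {"type": "text", "text": prompt},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ])
    return vlm.invoke([message]).content

@tool("AnalyzeChart")
def analyze_chart(region_id: str) -> str:
    """Analyze a chart or figure region and return structured JSON."""
    if region_id not in cropped_regions:
        return json.dumps({"error": f"Unknown region: {region_id}"})
    region = cropped_regions[region_id]
    return call_vlm(CHART_PROMPT, region["image_b64"])
```

The table tool follows the same pattern with a table-specific prompt.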
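And a small sketch of what the context-formatting helpers might look like; the function names are illustrative, and ordered_text is assumed to be the output of the ordering step above.

```python
def format_ordered_text(ordered_text) -> str:
    """Render the ordered OCR output as numbered lines."""
    return "\n".join(
        f"{item['position']:>3}. {item['text']}" for item in ordered_text
    )

def format_layout_regions(layout_regions) -> str:
    """Summarize each detected region so the agent can reference it by ID."""
    return "\n".join(
        f"[{r.region_id}] {r.label} (confidence {r.score:.2f})"
        for r in layout_regions
    )

document_context = (
    "DOCUMENT TEXT (reading order):\n"
    + format_ordered_text(ordered_text)
    + "\n\nLAYOUT REGIONS:\n"
    + format_layout_regions(layout_regions)
)
```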
Now we construct the system prompt, which is the foundation of our agent's behavior. It contains the role definition (in this case, "You are a document intelligence agent"), the document context (all OCR text in reading order), the layout information (region types and IDs), the tool descriptions (when to use each tool), and the instructions (how to handle different content types).

Finally, let's put it all together. We'll use LangChain's create_tool_calling_agent to build our agent. This involves the tools list (our AnalyzeChart and AnalyzeTable functions), the LLM (gpt-4o-mini, for cost efficiency), and the prompt template, which combines the system prompt, user input, and agent scratchpad. Then we create an agent that can call tools. Last but not least, we set up the AgentExecutor to run the tool-enabled loop, with verbose=True so we can see the agent's reasoning process. A minimal sketch of this assembly appears at the end of this lab.

With our agent assembled, let's put it to the test. First, we'll ask a general question about the document. This should be answerable from the OCR text alone, without calling any tools. Watch how the agent uses its context to respond. Now let's ask the agent to extract table data. Watch how it identifies the table region and calls the AnalyzeTable tool to get the structured data. The verbose output shows the agent's reasoning, its quote-unquote thinking process. For our final test, let's ask about the chart. The agent will use the AnalyzeChart tool to extract visual information about the trends and data points that cannot be determined from OCR text alone.

And that's it. In this lab, we built a complete agentic document intelligence pipeline that combines multiple AI technologies to understand complex documents. Together, these components form a hybrid system that can deeply understand more complex reports: tables, figures, multi-column layouts, captions, and narrative flow. However, as documents become more diverse and more variable, these multi-stage agentic pipelines can begin to break down. They become hard to maintain, brittle across edge cases, and difficult to scale, because each component must be tuned, monitored, and orchestrated manually.

In the next lesson, Andrea will introduce LandingAI's agentic document extraction, which unifies all the essential steps for document understanding (layout analysis, text extraction, region segmentation, reading-order reconstruction, multimodal reasoning, and schema-based field extraction) into a single coordinated agentic workflow. See you all there.
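As promised above, here is a minimal sketch of the agent assembly, assuming the document_context string from the formatting sketch, the analyze_chart tool defined earlier, and an analogous analyze_table tool built the same way.

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
tools = [analyze_chart, analyze_table]

system_text = (
    "You are a document intelligence agent.\n"
    "Answer from the document text when possible. For charts and figures, "
    "call AnalyzeChart; for tables, call AnalyzeTable, passing the region ID.\n\n"
    + document_context
)

# Passing the system prompt as a message object keeps any braces in the OCR
# text from being interpreted as template variables.
prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=system_text),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# A general question, answerable from the OCR text alone, so no tool call
# is expected.
response = executor.invoke({"input": "What is this document about?"})
print(response["output"])
```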