In this lesson, you'll learn how to parse documents using optical character recognition and how to integrate its results into agentic workflows. You'll build an agent to parse text from documents using OCR, and then extract information from that text using LLMs. Along the way, you'll identify cases like handwriting, tables, and scanned images that make these steps really challenging. Let's dive in.

Today is lesson one, document processing basics. We're going to start simple today, but everything we cover is foundational to the production-grade systems used by enterprises and developers today, which we will cover in subsequent lessons. So here's the agenda: document processing, what it is and why it matters; parsing, extraction, and output formats, JSON and Markdown, and how they are used; OCR, how it works under the hood, its workflow, and its limitations, where it can break; and agentic AI with the ReAct framework, adding a brain on top of OCR. We'll also review some practical demos and failure modes, what still challenges real systems. And after all that, we'll roll up our sleeves and do some coding, where you'll build your first simple document agent. You'll notice this is a bottom-up journey: from pixels to text, to structure, and then to reasoning.

Let's ground this in the real-world pain these systems are meant to solve. Modern organizations are flooded with digital documents: invoices, receipts, contracts, and reports. These documents live in a massive digital filing cabinet, usually as PDFs, PowerPoints, Word docs, or even images. They're built for human eyes, not machines, which means the information is hard to search, hard to analyze, and of course, hard to automate. If your data is trapped inside unstructured documents, someone has to manually open, read, and retype it into a different system, which does not scale. So if we want to use this information in analytics, automation, or AI, we need to convert unstructured documents into structured, machine-readable data.

So, what does a solution look like? Document processing is turning unstructured documents into structured, machine-readable data, typically JSON or Markdown. It's more than just grabbing text. Parsing must understand what pieces of text actually mean, how they're related, and how to organize them into a predictable structure. If you're parsing an invoice, you don't just want a blob of text. You want to extract the vendor name, invoice date, total amount, and line items. Extraction assumes the text is already machine-readable. But if your document is a scan or a photo, the computer sees only image pixels. So before extraction can happen, we need OCR to turn pixels into text.

How do we represent the output? Parsing and extraction typically produce two useful formats. First is Markdown or HTML, which is designed for humans like us and for LLMs. It preserves structures like headers, tables, and lists, and is perfect for feeding into LLMs or showing to end users. Second is JSON, for machines and APIs. It is hierarchical, easy to traverse programmatically, and perfect for downstream pipelines and applications. So remember: JSON is for machines; Markdown or HTML is for humans and LLMs. A good rule of thumb: if you're thinking about analytics or databases, JSON is a great option, and if you're thinking about building a RAG solution or a chat user interface, Markdown or HTML could be a great option. But all of this assumes we already have the text. What if all we have is just pixels?
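Before turning to OCR, here is a minimal sketch of what those two output formats might look like for the invoice example above. The field names and values are illustrative, not taken from the lesson; the point is just that the same parsed data can be serialized as JSON for machines or rendered as Markdown for humans and LLMs.

```python
import json

# Hypothetical result of parsing a simple invoice (field names are illustrative).
invoice = {
    "vendor": "Acme Office Supply",
    "invoice_date": "2024-03-14",
    "total": 118.50,
    "line_items": [
        {"description": "Printer paper", "quantity": 10, "amount": 45.00},
        {"description": "Toner cartridge", "quantity": 1, "amount": 73.50},
    ],
}

# JSON: hierarchical and easy for machines and APIs to traverse.
print(json.dumps(invoice, indent=2))

# Markdown: preserves structure (headers, tables) for humans and LLMs.
rows = "\n".join(
    f"| {item['description']} | {item['quantity']} | {item['amount']:.2f} |"
    for item in invoice["line_items"]
)
markdown = (
    f"# Invoice from {invoice['vendor']} ({invoice['invoice_date']})\n\n"
    f"| Description | Qty | Amount |\n|---|---|---|\n{rows}\n\n"
    f"**Total:** {invoice['total']:.2f}\n"
)
print(markdown)
```

Notice that both outputs carry the same information; they differ only in who consumes them most easily.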
OCR, or optical character recognition, is the technology that takes an image of text and converts it into machine-readable text. It typically works in two steps. First is image cleanup: deskewing, denoising, and contrast adjustment. Then comes text recognition, which is essentially pattern matching: does this shape look like an eight or a B? Once it recognizes the text, it produces editable text as a digital output, or a searchable PDF.

But we also need to be honest about what OCR cannot do. OCR is really great at reading clean documents, but it does not understand structure, meaning, or relationships. After OCR, you typically get a wall of text. If you want to find totals, extract tables, identify headings, and classify documents, you need intelligence on top of OCR. You can think of OCR as the eyes, but not the brain.

And beyond its lack of understanding, it can also just fail. OCR breaks in predictable ways: poor image quality, like blurry photos, shadows, and noise; complex layouts and skew, like multi-column text and nested tables; and of course non-standard text, like handwriting, stamps, and stylized fonts. These failures cascade down into parsing and extraction errors. So OCR is necessary, but nowhere near sufficient, for full document understanding. If you've ever tried to OCR a picture of a receipt taken in a dim restaurant, you've probably experienced all three failure modes at once.

All of this leads to a key idea: processing is not the same as understanding. OCR can read characters, but it doesn't comprehend them. It doesn't know what is a header versus a value, which number is the total, or whether text belongs in a table or a footnote. OCR gives you perception, pixels to characters, but no cognitive layer. To turn that wall of text into meaningful, structured data, we need a brain. And that's where agentic AI comes in.

Agentic AI adds the missing cognitive layer. An agent is an autonomous system that can perceive its environment, reason about goals, and take actions. In document processing specifically, the agent reads the document via OCR if needed, thinks about what the user asked, chooses which tools to call, and iterates until it reaches the goal. If OCR is the eyes, the agent is the brain. Rule-based pipelines can collapse the moment you hit an edge case, whereas agents can reason through them.

So what does an agentic system look like under the hood? Agentic document systems typically have three parts. The brain is an LLM, which is responsible for reasoning, planning, and decision making. The eyes are OCR, which converts visual content into text. And the hands are the tools the agent can use: APIs, database lookups, file operations, and function calls. When all three of these are wired together, you can say, "Find the total amount on this invoice," and the agent will decide to run OCR, inspect the text, locate the total, and return the answer. You don't have to hardcode every step or every edge case. This mental model of brain, eyes, and hands will appear again and again in subsequent labs and lessons.

But how does the agent think, not just once, but step by step? The ReAct (reason and act) framework describes how agents think. First is a thought: what do I need to do next? Then it takes an action, choosing and calling the tools it needs. And once that is done, it makes an observation, examining the result, and repeats. Think, act, observe, and think again. This loop gives agents agency, adaptability, and the ability to correct mistakes.
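To make the loop concrete, here is a minimal sketch of think, act, observe in plain Python. The "model" below is a scripted stand-in so the control flow is visible without any API calls, and the tool names and outputs are illustrative, not from the lesson; the lab uses a real LLM for the thought step.

```python
# A minimal sketch of the ReAct loop: think -> act -> observe -> repeat.
# `fake_llm` is a scripted stand-in for a real model; the tools return canned data.

def ocr_tool(path: str) -> str:
    return "INVOICE\nVendor: Acme Office Supply\nTotal: $118.50"  # pretend OCR output

def find_total_tool(text: str) -> str:
    return next(line for line in text.splitlines() if "Total" in line)

TOOLS = {"ocr": ocr_tool, "find_total": find_total_tool}

SCRIPT = [
    ("I need the document text first, so I should run OCR.", "ocr", "invoice.png"),
    ("Now I should look for the total in the OCR text.", "find_total", None),
    ("I found the total; I can answer now.", "finish", None),
]

def fake_llm(step: int, observation: str):
    thought, action, arg = SCRIPT[step]
    return thought, action, (arg if arg is not None else observation)

observation = ""
for step in range(len(SCRIPT)):
    thought, action, arg = fake_llm(step, observation)   # think
    print(f"Thought: {thought}")
    if action == "finish":
        print(f"Answer: {observation}")
        break
    observation = TOOLS[action](arg)                      # act
    print(f"Observation: {observation!r}")                # observe, then repeat
```

Reading the printed thoughts and observations is exactly what makes this style of agent debuggable.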
Most modern agentic frameworks are built on some variation of ReAct. And of course, one major benefit is that it's debuggable: you can literally read the agent's thoughts and tool calls.

All right, that's enough theory; let's build something. Now it's lab time. We're going to take everything we covered on the slides, parsing, OCR, and agentic reasoning, and build a simple document agent together. By the end of the lab, you'll have an agent that can read a document, use OCR as a tool, and extract structured information. And don't worry if this is your first time; we'll go step by step. Let's switch to the notebook.
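As a preview only, and not the lab's actual code, here is one way such an agent's eyes and brain could be wired together. This sketch assumes pytesseract and Pillow for OCR and an OpenAI-style chat client for extraction; the notebook may use different libraries, models, and prompts, and the file name and model name below are placeholders.

```python
# A sketch of a simple document agent: OCR as the eyes, an LLM as the brain.
# Assumes `pip install pytesseract pillow openai`, the Tesseract binary, and an API key;
# the actual lab may use different tools, models, or prompts.
import json

import pytesseract
from PIL import Image, ImageFilter
from openai import OpenAI

def ocr_tool(image_path: str) -> str:
    """Eyes: light image cleanup (grayscale + denoise), then character recognition."""
    image = Image.open(image_path).convert("L")        # grayscale
    image = image.filter(ImageFilter.MedianFilter(3))  # simple denoising
    return pytesseract.image_to_string(image)

def extract_invoice_fields(raw_text: str) -> dict:
    """Brain: ask the LLM to turn the OCR 'wall of text' into structured JSON."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "Extract vendor, invoice_date, total, and line_items from this invoice text. "
        "Respond with JSON only.\n\n" + raw_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # keeps the reply parseable as JSON
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    text = ocr_tool("invoice.png")          # pixels -> text
    fields = extract_invoice_fields(text)   # text -> structure
    print(json.dumps(fields, indent=2))
```

In the lab, the LLM will also decide when to call the OCR tool rather than being hardcoded to run it first, which is where the ReAct loop from the previous sketch comes in.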