So first, a little bit of setup for our modern OCR stack. We're going to import PIL for images, OpenCV and NumPy for image processing, matplotlib for plotting, and of course PaddleOCR from the paddleocr package. This cell loads the API key that we'll need for the agent, very similar to the previous lab. Having completed our setup, let's get into it.

All right, so here we create a PaddleOCR object with the language specified as English. And notice that the two models from the architecture diagram in the slides are represented here: here's that _det text detection model, and here's the _rec recognition model. Now, you won't actually call these models individually. PaddleOCR handles all of the pre-processing and calls these models as one pipeline.

This receipt image will look familiar from lab one. This cell runs the OCR. The result will be a list where each item corresponds to a processed page, so for a single-page image, it's only going to contain one dictionary. This cell prints some of the contents of result[0]: specifically the text that was recognized, the confidence score associated with it, and the bounding box coordinates. And as we scroll down, we'll see all of the text that was recognized across the entire receipt.

All right, before we execute this cell, let me explain what to expect. Recall from the architecture diagram that the pipeline also pre-processes the image where necessary, and it can correct rotation issues or deskew or unwarp the image. So what we're going to see now is actually the pre-processed image with the bounding boxes and the text recognition overlaid. So let's go ahead and request the image, and then execute another cell where we can view it with all of those items overlaid. Now, if you compare this image carefully to the original, you'll actually see that some of the background is removed and there is a slight clockwise rotation to make it a little more horizontal. The code then draws the bounding boxes on the processed image and places the recognized text on top.

So why repeat an example such as this one? Well, notice the addition of the bounding boxes: we now have localization information about where in the receipt a particular text field is found. And also notice that we are getting the value correct. Previously, this 795 was incorrect.

All right, so now we're going to turn PaddleOCR into a tool that can be used with an agent. This will be very similar to lab one. So this long function, we're setting it up as a tool, and here the result is going to be coming from PaddleOCR, and much of the rest is the same: we're printing out the text, the boxes, and the confidence scores, as we saw in the prior example. Having defined that function, we'll now specify that it's the tool that will be available to the agent. The rest of this is the same as lesson one, and note that we're using gpt-5-mini as our LLM.

This task should look familiar as well: we're going to use the receipt and evaluate whether the total is correct. As I scroll down here, we see the PaddleOCR output in turquoise, and the green output is coming from the LLM. The task, again, was to perform some basic addition, and because the inputs are correct, the addition has a much greater chance of being correct, and we do get the correct total. So this example really shows how a combination of PaddleOCR and an LLM can handle a real-world receipt.
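For reference, the core of this setup and first OCR call boils down to something like the sketch below. This is a minimal approximation rather than the lab's exact code: it assumes the PaddleOCR 3.x predict API, and the result keys (rec_texts, rec_scores, rec_polys), the receipt.png filename, and the small ocr_tool wrapper are my stand-ins that may differ from the notebook.

```python
from paddleocr import PaddleOCR

# Detection and recognition models are chained into one pipeline behind this object.
ocr = PaddleOCR(lang="en")

result = ocr.predict("receipt.png")  # one entry per processed page
page = result[0]                     # single-page image -> a single dict-like result

# Recognized text, confidence score, and bounding polygon for each detected text line.
# Exact key names may vary slightly between PaddleOCR versions.
for text, score, box in zip(page["rec_texts"], page["rec_scores"], page["rec_polys"]):
    print(f"{text!r}  confidence={score:.2f}  box={box}")

def ocr_tool(image_path: str) -> str:
    """Sketch of the agent tool: run OCR and return text, confidence, and boxes
    as a single string the LLM can read (the lab's actual tool function is longer)."""
    page = ocr.predict(image_path)[0]
    lines = [
        f"{t} | confidence={s:.2f} | box={b}"
        for t, s, b in zip(page["rec_texts"], page["rec_scores"], page["rec_polys"])
    ]
    return "\n".join(lines)
```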
Remember the table exercise and the student handwriting? We're going to do those two next. We're going to define another helper function. There's actually nothing new to see here: the OCR result is still coming from PaddleOCR, we've still got some printed output, we're still working with the post-processed image, and then we're doing some annotations on top of that using the bounding box coordinates. But now it's all wrapped into one function called run_ocr, and you'll see this one repeatedly through the rest of the lesson.

All right, so let's apply all of that to the table. This is a reminder of what it looks like. And our second chunk of output is the PaddleOCR printed information. Again, we've got the recognized text, confidence scores, and bounding boxes. Scrolling on past that, we have the actual overlaid bounding boxes with the recognized text in blue. So as you visually inspect this, you may detect one or more OCR errors. The first one that I see actually continues the trouble with exponential notation; we had problems with this previously. The 10 to the 20th over here is recognized as 1020, and that's off from the real value by many orders of magnitude. It also seems like most of the exponents are recognized in this way. But let's go ahead and power forward. We'll give the agent the task of again extracting the FLOPs from the English-to-German column, noted as EN-DE in the table.

All right, let's scroll through this output. Again, the turquoise is coming from PaddleOCR, the green is coming from the LLM, and in our function we also asked for this printed result down at the bottom. You can do the visual inspection for yourself, or you can trust me that this is actually fully correct. And let me draw your attention to two aspects that were incorrect in the previous run. First, this ByteNet and this Deep-Att are correctly noted as not found, because they are blank areas in the original table. And fortunately, our scientific notation has been corrected: instead of 1020, this is now reported as 10 to the 20th power. And how is that possible? This actually nicely demonstrates the reasoning power of LLMs, because for something like FLOPs, which stands for floating-point operations, it simply doesn't make sense for the result to be 1020; it needs to be a much larger number. So the agent has been able to apply reasoning to the OCR output and remedy that situation.

Let's see if we can also remedy the issue with the student's handwritten grammar exercise. Here's a reminder of what it looks like. Let's take a look at the output, and notice at the top here that the name is John Smith, and it seems pretty legible to me. But the student name here is detected as Myar, and I'm not even going to try to say that. It's clearly not John Smith. Let's scroll down to the overlaid bounding boxes and call attention to a few other things. Question number one is very clearly "I am", but in the text recognition, I think we're getting "I_" and then what looks like "uan" to me. So that's not good. Number two, we've got an extra underscore in here. We could probably clean that up in post-processing, but it would certainly be nice if it were accurate from the beginning. And for item number nine, we are able to recognize this person's handwriting as is, which is an improvement over the Tesseract performance.
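For reference, a helper like run_ocr boils down to roughly the following sketch, not the lab's exact implementation. The lab overlays its annotations on PaddleOCR's preprocessed image, while this sketch simply draws on the original input, and the result keys are again assumed from the PaddleOCR 3.x API.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def run_ocr(ocr, image_path):
    """Rough equivalent of the lab's run_ocr helper: run PaddleOCR, print what it
    found, and overlay the detected boxes and recognized text for inspection.
    (The lab draws on PaddleOCR's preprocessed image; here we approximate by
    drawing on the original input image instead.)"""
    page = ocr.predict(image_path)[0]

    image = cv2.imread(image_path)
    for text, score, poly in zip(page["rec_texts"], page["rec_scores"], page["rec_polys"]):
        print(f"{text!r}  confidence={score:.2f}")
        pts = np.array(poly, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(image, [pts], isClosed=True, color=(0, 0, 255), thickness=2)
        x, y = pts[0][0]
        cv2.putText(image, text, (int(x), int(y) - 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

    plt.figure(figsize=(10, 12))
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.axis("off")
    plt.show()
    return page
```

You would call it as something like run_ocr(ocr, "table.png"), where the filename is just a placeholder for whichever image you're inspecting.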
So now let's give the agent the task of extracting the student responses as JSON, and notice that the task instructs it not to correct any grammatical errors. All right, let's scroll down to the responses, and we'll actually just skip right ahead to the LLM response here. So, question one here is somewhat nonsensical, and we've already noticed that the name is completely messed up. But there's some good news here too: the JSON formatting is correct, and the grammatically incorrect answer has stayed true to the student's original response. Again, this was a grammar worksheet, so we don't want the LLM overwriting the student's response.

So to wrap up this section, we've repeated three exercises from lesson one. Recall from the PaddleOCR architecture diagram that there is a detection stage and a recognition stage. The detection is what's giving us the bounding boxes, and that is clearly helping with the understanding of the overall document. And the recognition here is also better: there are fewer character-level mistakes in transcribing or recognizing the text. So it's definitely performing better on these identical examples. And now we'll move on to some more difficult examples to expose some weaknesses.

So in this section, we're going to look at three new examples, and we have specifically selected them to expose some weaknesses. But these are things that you need to be aware of if you're building with PaddleOCR. This is a new example. It's called report.png, and it appears to be an interior page from a report about the US economy. At the top there's a table, the center has some basic text, and at the bottom there's a line chart with a caption to the left. Let's take a look at how PaddleOCR has extracted this document. At first glance, the table output actually looks quite good, and the text is fairly straightforward. But let's take a look at this line chart. There's no box around the whole thing, so that's my first clue that it's not being recognized as a single unit. But there are boxes around some of the x- and y-axis labels, such as this 0, -2, -4, -6. And these are now completely out of context, which we can see if we scroll back up. So indeed, here's that 0, completely disconnected from this -2, separated by some other content, and then -4. There's really no way of understanding that these were y-axis labels, or that they actually belonged to a chart, which was ignored completely. So we've exposed one weakness.

All right, let's take a look at article.jpg. What does this contain? This looks like the front page of an academic article about teeth. Okay, so what do we notice about this article? There are multiple columns of text here. At the top of the article, there are roughly two columns: we've got the abstract next to this callout. And then the body, of course, is presented as three columns of text, which is interrupted by this table. So let's scroll down to view the bounding box output and see how well it does with this multi-column document. All right, there are a lot of boxes here, but I'm just going to draw your attention to the first and second columns. And of course, the reading order here: you would read "in most of the westernised countries that undertake oral health surveys", right? So we would be reading all the way down the left-hand column. But in the OCR, we're actually going to see it reading straight across, if you're following my mouse.
So it would read "in most of the westernized countries that system based on some of the interview", which of course makes absolutely no sense, and manages to garble the entire article if you continue that way for, you know, another 10 or 15 pages. So what we've learned about PaddleOCR is that it is not able to handle these multi-column layouts and runs the risk of garbling your text if this kind of layout is part of your documents.

So we've surfaced these weaknesses around layout, and we're starting to realize that layout-aware text detection is really the cornerstone of accurate real-world OCR, because of course, layouts like this are very common. But the PaddleOCR text pipeline doesn't have vision at the level of the whole page; we're actually going to need some sort of vision model for more complex documents. So, good news: PaddleOCR actually does have its own version of layout detection. As we mentioned before, it has been under active development, and we haven't been using the layout detection previously. So here's our chance to import it and start to use it in conjunction with the text detection and recognition.

So here we initialize that LayoutDetection model. We'll define a new function called process_document, and we'll send our image to this layout_engine. As part of the response, we'll get back the label, the score, and the bounding boxes. The score and the bounding boxes we've seen before, but now we're also going to get a label for each region, and that'll make more sense once you see it. All right, let's apply process_document to that interior page from the report on the US economy. Hopefully now the meaning of label is clearer: we're getting back things such as text, chart, paragraph_title, number, or even footer. So this is identifying the different regions of the document and labeling them for what they are. And the purpose of this long function is just to aid in visualization.

So let's look again at that economic report, but now with the layout detection overlaid. Let's take a look at things here. I'm seeing several text blocks, I'm seeing a paragraph_title, I'm seeing a table up here, and, very importantly, I'm seeing a chart and an entire bounding box around that chart. And then also small details such as number and footer. The layout model really has identified the major regions of this document correctly.

And now we'll do the same thing with the article about teeth. As I'm visually taking this in, I'm seeing labels that I've seen before, such as text and paragraph_title. I'm seeing a few new ones, such as doc_title and abstract, and I'm seeing high confidence scores associated with those. And then as I scroll down, I'm seeing some footnote and footer labels, and also table. But I will note that the table, to me, is one table, and it's being recognized here as two tables with slightly lower confidence. All in all, though, the layout model is now really going to help keep this text together. So we no longer have the problem of jumping from the word "that" to the word "system", because this entire text block indicates that the word after "that" is "undertake".
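For reference, the layout-detection step we just walked through looks roughly like the sketch below, assuming the PaddleOCR 3.x LayoutDetection interface; the PP-DocLayout-L model name, the result fields, and the report.png filename are my assumptions and may not match the lab's exact code.

```python
from paddleocr import LayoutDetection

# Layout model: returns labeled regions (text, table, chart, ...) rather than text lines.
layout_engine = LayoutDetection(model_name="PP-DocLayout-L")  # assumed model choice

def process_document(image_path):
    """Return (label, score, bounding box) for each region the layout model finds."""
    output = layout_engine.predict(image_path)
    regions = []
    for res in output:                  # one result per page
        for box in res["boxes"]:        # one entry per detected region
            regions.append((box["label"], box["score"], box["coordinate"]))
    return regions

# Apply it to the report page; each region comes back with a label for what it is.
for label, score, coord in process_document("report.png"):
    print(f"{label:<18} score={score:.2f} box={coord}")
```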
So let's see how far we can push this with one more example. This one is a bank statement, and typically when dealing with bank statements, the ultimate downstream use case is to extract a certain set of key-value pairs out of it. And I'm actually seeing the same challenge with the table. In this case, it's all being detected as one large table, but for me, you know, with my human vision and human experience looking at bank statements, I can actually tell that the headers should have been right here: Date, Description, Category, Amount, and Balance. That's actually the table break and the table headers. Everything up above could be interpreted as a table, but it should definitely be a separate table from the one below. So there are definitely still some weaknesses here. Oddly, in this example, the small text at the bottom is entirely ignored, and sometimes that's where a legal footnote or other important information might appear.

All right, so that brings us to the end of lab two, and let me recap a few of the main takeaways. PaddleOCR is from the deep learning OCR era, and it's clearly a strong engine that beats traditional OCR on many of these real-world images. But it still mostly thinks in terms of individual lines of text. Then we added layout detection, and that gave us some region-level structure: where the paragraphs, the tables, and the figures are. But it's still not full semantic understanding, and it doesn't really represent the way that humans see documents; humans operate very much on a visual system. So we'll continue to bring in more of these concepts from vision as we move forward in the course. In the next lesson, we'll hand it back to David, and he'll double-click on some of these concepts around layout and reading order in a lesson called Layout Detection and Reading Order.