Having learned how to parse and extract information from your files, you'll now learn how to answer questions about those files. In this lesson, you'll build a RAG application to query your PDFs. You'll first use ADE to parse the documents, then store the parsed chunks in a vector database and retrieve them to answer questions about your documents. Let's dive in.

Up to now, we've been learning about document AI: why OCR fails, how to best utilize layout, why reading order matters, and how LandingAI's Agentic Document Extraction gives us clean, grounded output by handling all the complexities of document processing in a single unified workflow. So now the obvious question is, what do we do with all this structured data? One possible answer is building real systems. Specifically, we'll build a Retrieval-Augmented Generation pipeline, or RAG system, that turns a 74-page financial filing into a living, breathing, queryable knowledge base. This is the same architectural pattern powering modern document QA systems in production today across industries and verticals.

Let's frame this in a real scenario. Imagine you're building an internal platform for a hedge fund. Your analysts have SEC filings like this Apple 10-K, which contains the detailed financial and business information that publicly traded companies must disclose to the Securities and Exchange Commission. The analysts want to upload the files to the platform and ask questions like: what were Apple's net sales in 2023? What are the biggest business risks? How did services revenue trend year over year? The information they need is in the document, somewhere. Maybe in a table on page 28, maybe in a footnote on page 45, or spread across three different risk disclosures on pages 12, 15, and 18.

Now, why does traditional keyword search fail here? First is semantic mismatch. If an analyst searches for revenue but the document uses a phrase like net sales, traditional search returns nothing. The concepts are identical, but the words don't match. Second is context blindness. The word revenue may appear 75 times in this document, and keyword search can't tell which mention is relevant to the analyst's question without additional guidance. That doesn't scale. Third is fragmentation. Information is scattered: to fully answer a question like what are the risks, you need to synthesize content from multiple pages, and traditional search can't do that. You need semantic understanding, and you need context-aware retrieval, which means a system that understands what the user is asking, not one that just matches strings. This is where RAG comes in, and it's what you will build in the lab for this lesson.

So what is RAG and why does it matter? RAG, or retrieval-augmented generation, is the architecture behind nearly every modern document Q&A system today. The pipeline has six steps across three core phases.

First is the preprocessing phase: you parse, embed, then store. Parsing takes raw documents and extracts clean, structured text. This is where ADE comes in, and clean input is critical: garbage in, garbage out. If your parsing is noisy, with the OCR errors and mangled tables you saw in previous lessons, everything downstream suffers. Embedding converts the parsed content into embeddings, vectors that capture semantic meaning. Storing saves these vectors in a vector database optimized for similarity search. You will use ChromaDB, a local, open-source option that's perfect for development and learning.

Second is the retrieval phase: query and retrieve. Querying embeds the user's question and searches the database for the top-ranking vectors as measured by similarity, in other words, the vectors most similar to the query. Retrieving filters out any vectors whose similarity is too low, then fetches the rest. Third is the generation phase: it feeds the retrieved content as context to a language model to generate a natural language answer, alongside the corresponding retrieved content for grounding and verification. This is critical for heavily regulated organizations, or HROs, like financial services, healthcare, and life sciences.
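To make the embed step concrete, here's a minimal sketch of turning one parsed chunk's text into a vector with OpenAI's text-embedding-3-small model, the model used in the lab. The chunk text and variable names are illustrative placeholders, not the lab's exact code.

```python
# Minimal sketch of the "embed" step, assuming OPENAI_API_KEY is set in the
# environment. The chunk text below is an illustrative placeholder.
from openai import OpenAI

client = OpenAI()

chunk_text = "Total net sales were $383 billion in fiscal 2023."

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunk_text,
)
vector = response.data[0].embedding  # a list of 1,536 floats encoding the text's meaning
print(len(vector))                   # 1536
```

Semantically similar chunks end up with nearby vectors, which is what makes the similarity search in the retrieval phase possible.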
Now here's a key insight that ties back to everything you've learned: ADE's clean, grounded output in phase one is what makes this entire pipeline possible. If your parsing is unreliable, if you're feeding the system distorted OCR output or missing the tables, charts, and graphs that carry the visual storytelling, no amount of clever embeddings, prompt engineering, or sophisticated retrieval will save you.

By the end of this lesson's lab, you will understand how ADE output flows directly into a RAG system, build a local RAG pipeline using ChromaDB and embeddings generated by an OpenAI model, run semantic search, not keyword search, over real documents, and verify every result visually using grounding images. You'll then be fully prepared to scale and deploy this system to AWS, which we'll cover in lesson six.

So you may be wondering, why are we building locally? Why not jump straight to AWS for a real production system? That's a great question, and there are three reasons. Reason number one is faster iteration: you can change code, rerun a cell, and see results in a couple of seconds, with no deployment overhead slowing down your debugging. When you're learning, speed matters. Reason number two is lower cost: experimenting locally is essentially free beyond the initial API calls, so you're not burning cloud compute credits while you learn and debug. Reason number three is clearer learning: you strip away the cloud complexity to focus purely on the RAG mechanics and data flow. In lesson six, we'll take this exact logic and productionize it on AWS, the same data flow at cloud scale, but you'll build the foundation in this lab.

In particular, here's what you'll use in the lab. The input is Apple's 10-K from the previous fiscal year, a 74-page PDF of dense financial reporting: a real document with real-world complexity. The parser is LandingAI's ADE API; the outputs are provided to you, so you won't be doing the parsing step yourself, since that process was well covered in the previous lesson with Andrea. The output is markdown text plus JSON chunks with metadata: clean, structured data ready for embedding. Embeddings will be generated using OpenAI's text-embedding-3-small model; we're using their API for convenience, but you could swap in any open-source alternative. The vector database is ChromaDB, running locally in the lab environment with persistent storage.

Let's start with the first step of the preprocessing phase: parsing with ADE. When you parse a document with ADE, you get back a ParseResponse object. Let's break down its structure. At the top, you have parse_result.splits, which is a list with one split per page when you pass the split="page" parameter. Each split has two key attributes. First is .markdown, the clean markdown text extracted from that page: tables become markdown tables with pipes and dashes, and headings become markdown headings with hash symbols. It's human-readable, structured text, not a wall of unformatted OCR mess, and it's LLM-ready for downstream use cases. Second is .chunks, a list of chunk IDs. Each chunk is a piece of content on the page: a paragraph, a table, a figure, or a caption. For each chunk, you also have metadata: chunk_id, a UUID that uniquely identifies this piece of content; text, the extracted content itself; chunk_type, which tells you what kind of content it is (text, table, figure, attestation, logo, and so on); bbox, the bounding box coordinates showing exactly where on the page the chunk appeared; and page, the page number where the chunk appears.
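Here's a small sketch of what inspecting that structure can look like. It assumes parse_result is the ParseResponse described above and that chunks_by_id is a lookup built from the JSON chunk metadata provided in the lab; these names, and the dictionary-style field access, are assumptions for illustration rather than the lab's exact code.

```python
# Sketch of walking ADE output. `parse_result` is the ParseResponse described
# above; `chunks_by_id` is an assumed dict mapping chunk_id -> chunk record,
# built from the JSON chunk metadata provided with the lab.
for page_number, split in enumerate(parse_result.splits, start=1):
    print(f"--- Page {page_number} ---")
    print(split.markdown[:200])            # clean, LLM-ready markdown for this page

    for chunk_id in split.chunks:          # chunk IDs for this page's content
        chunk = chunks_by_id[chunk_id]     # hypothetical lookup into the provided JSON
        print(chunk["chunk_type"], "at", chunk["bbox"], "on page", chunk["page"])
        print(chunk["text"][:120])         # the extracted content itself
```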
Importantly, for every chunk you can generate a grounding image, a visual crop from the original PDF. This gives you debugging: while you're building your pipeline, you can visually verify that extraction worked correctly, no guessing. It gives you trust: when your system tells a user that net sales were 383 billion dollars, you can show them the exact table the information came from, and seeing it with their own eyes builds massive confidence in your system. And it gives you compliance: in regulated industries like finance, healthcare, and legal, you need to prove where information came from, and grounding images give you that audit trail. No hallucinations. Since you've already seen how to call the parse API with Andrea, we'll skip this step in the lab; you'll be provided with these ADE outputs directly.

The next step in preprocessing is to embed the text of each ADE chunk. You will transform each chunk's text into an embedding vector of size 1,536. Each embedding vector mathematically encodes the meaning of the parsed text, so that semantically similar text gets a similar vector. Then, when you ask a question about your document, you can retrieve the chunks whose embeddings are similar to the embedding of your question. This is called chunk-level embedding. The other option is page-level embedding, meaning you embed the entire text of a page. Let's understand the tradeoffs. In practice, page-level is simpler to implement, with fewer embeddings and a faster database build, and it can be effective for broad questions and shorter documents. Chunk-level provides more precise retrieval, exact table and paragraph matching, and more granular context, and it is better for focused questions on complex documents. And with ADE, we can even give you cell-level groundings for complex tables. In the lab, you'll use the chunk-level approach, but in production you can tune this based on your use case.

The last step in preprocessing is to store the vectors in a vector database. We're using ChromaDB because it's ideal for learning and prototyping. You get a one-line install. You get persistent storage, since ChromaDB saves your vectors to disk automatically: if you close your notebook and come back tomorrow, your data will still be there, and you don't have to re-embed everything, which is crucial for efficient iteration. You get fast similarity search: ChromaDB uses HNSW indexing under the hood, Hierarchical Navigable Small World graphs, a state-of-the-art algorithm for approximate nearest neighbor search, so queries return in milliseconds even with thousands of vectors. You get the same API locally and in production: ChromaDB has a client-server mode, and the same API you use locally works in production with a remote ChromaDB server. This is huge: there's no mental-model shift when you scale. And finally, you get rich metadata support: for each chunk, you can store metadata that helps with filtering, allowing you to retrieve items based on specific criteria within their metadata. In the lab, you'll store the chunk_type, the page number, and the bounding box coordinates as metadata for each chunk. When you add a chunk to ChromaDB, you need to specify an ID, and you'll use the exact chunk ID provided by ADE. Note that in lesson six we'll swap this for an AWS Bedrock Knowledge Base, but the concept stays identical.
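Here's a minimal sketch of that storage step. It assumes chunks is a list of chunk records with the fields described earlier; the collection name, path, and helper function are assumptions for illustration, not the lab's exact code.

```python
# Sketch of the "store" step with ChromaDB. Assumes `chunks` is a list of chunk
# records (chunk_id, text, chunk_type, page, bbox); names are illustrative.
import json

import chromadb
from openai import OpenAI

openai_client = OpenAI()

def embed(text: str) -> list[float]:
    # Wraps the embeddings call shown earlier; returns a 1,536-dimensional vector.
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

chroma_client = chromadb.PersistentClient(path="./chroma_db")   # persists vectors to disk
collection = chroma_client.get_or_create_collection(name="apple_10k")

collection.add(
    ids=[c["chunk_id"] for c in chunks],                        # reuse ADE's chunk IDs
    documents=[c["text"] for c in chunks],                      # the parsed text itself
    embeddings=[embed(c["text"]) for c in chunks],              # chunk-level embeddings
    metadatas=[
        {
            "chunk_type": c["chunk_type"],
            "page": c["page"],
            # ChromaDB metadata values must be scalars, so store the
            # bounding box coordinates as a JSON string.
            "bbox": json.dumps(c["bbox"]),
        }
        for c in chunks
    ],
)
```

Because the client is persistent, you can close the notebook, come back later, and query the same collection without re-embedding anything.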
After you set up the vector database, you'll write a function that you'll use in the retrieval step. Let's walk through exactly what happens when you run it for the question, what were Apple's net sales in 2023? First, embed: the question is transformed into an embedding vector, using the same embedding model, and the same vector size, as during preprocessing. Second, search: you pass that query vector to ChromaDB and say, find me the top k chunks closest to this query vector. By default k equals 3, so we want the three most similar chunks; you can adjust this based on your use case. Third, score: ChromaDB also returns distance metrics, which you convert to similarities (similarity equals 1 minus distance), so higher similarity means a better match. Fourth, filter: you remove results that fall below a similarity threshold, another parameter you can adjust to further fine-tune your RAG engine. Finally, visualize: you display the chunk text, IDs, scores, page numbers, and types for the returned results, and you use the bounding box coordinates to display the grounding images.

Grounding images are one of my favorite features of ADE, and something that differentiates LandingAI from commodity document AI systems. For every chunk that ADE extracts, you can generate a grounding image, a PNG file showing the visual crop of that specific chunk from the original PDF. Let me tell you why this is so crucial in practice. It builds massive trust: users aren't just blindly accepting answers or getting fooled by LLM hallucinations, they're verifying them, and once they verify a few answers and see they're correct, they trust the system for future queries. That ensures adoption. Grounding images also give you a paper trail and opportunities to introduce a human in the loop for risk mitigation. Imagine a financial analyst using your system to pull data for a quarterly report. Six months later, if the auditors ask where a number came from, they can say it came from page 28, table 3, row 5, column 6 of the Q3 10-K filing, and here's the visual proof. That's really powerful.

Finally, you will incorporate what you've learned into a full RAG pipeline. For that, you'll use LangChain for orchestration, taking advantage of a pre-built chain that combines the retrieval and generation phases of a RAG pipeline. You will create a retriever object from the ChromaDB database. Retrievers are LangChain components that fetch information and slot it into the prompt as added context. Sometimes a retriever provides more chunks than an LLM's limited context window can hold; LangChain can combine multiple chunks into a single prompt, or sequence many chunks over several prompts iteratively. All right, let's build a real RAG system with ADE.
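Before you head into the lab, here's a minimal sketch of how those LangChain pieces might be wired over the ChromaDB collection built above. The chat model, collection name, and the RetrievalQA chain are assumptions for illustration; the chain you use in the lab may differ depending on your LangChain version.

```python
# Minimal sketch of combining retrieval and generation with LangChain, assuming
# the persistent ChromaDB collection built earlier. Names and model choices are
# illustrative, not the lab's exact code.
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

vector_store = Chroma(
    collection_name="apple_10k",
    persist_directory="./chroma_db",
    # Must match the model used to build the collection, or dimensions won't line up.
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# The retriever fetches the top-k most similar chunks and slots them into the prompt.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# A pre-built chain that combines retrieval and generation; "stuff" packs the
# retrieved chunks into a single prompt for the language model.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,  # keep the retrieved chunks for grounding and verification
)

result = qa_chain.invoke({"query": "What were Apple's net sales in 2023?"})
print(result["result"])                  # the generated answer
for doc in result["source_documents"]:   # the chunks used as context, with page and bbox metadata
    print(doc.metadata.get("page"), doc.page_content[:80])
```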