Welcome to Preprocessing Unstructured Data for LLM Applications. Retrieval Augmented Generation, or RAG, has been widely adopted in many enterprises. The typical RAG pipeline has key components like data loading, chunking, embedding, storing in the vector database, and then retrieval. In this course, you'll learn techniques for representing all sorts of unstructured data, like text, images, and tables, from many different sources, like PDF, PowerPoint, and Word, in a way that lets your LLM RAG pipeline access all of this information.

A particularly challenging task in RAG is data loading and chunking, because data is stored in many different file types and data formats. For example, you may have numeric data in Excel spreadsheets, or text reports in PDF or Markdown, or presentations in PowerPoint or Slides or Keynote, or communications in Outlook or Slack or Teams, and so on. Each of these file types might in turn store data in different formats. A PDF or PowerPoint file, for example, may itself contain tables, images, or bulleted lists.

So a data loader must first be able to parse many different file formats. But once it's parsed that data, then what? It turns out that it's very useful to normalize the data from these different sources. So when you normalize tables from, say, a PDF or a PowerPoint or another format, they can all be represented in a similar way. Or a bulleted list, whether from a PDF or from an email, can also be represented in a similar way. It is also useful to maintain some of the structure of the original documents by preserving that structural information in metadata. For example, you might record that a paragraph has a parent, which is the title of the chapter.
A query that matches that chapter can then be expanded to return the child text as well, with your data organized, in this example, in a hierarchical tree structure.

With us to explain how all this is done is Matt Robinson, who's head of product at Unstructured. Matt's team has been responsible for Unstructured's tools for ingesting data for LLMs to use, and he's helped many developers build LLM applications that use and combine data from diverse sources.

Thanks, Andrew. I'm excited to work with you and your team on this. This course tackles the critical yet often overlooked aspect of LLM app development: data preprocessing. You'll learn how to extract and normalize content from a wide variety of document types, including PDFs, PowerPoints, Word, and HTML, enabling your LLM to access a broad range of information. You'll also learn how to enrich this content with metadata, enhancing RAG results and supporting nuanced search capabilities. This course covers document image analysis techniques like [...] Finally, you'll apply these techniques to build a RAG bot using documents like PDFs, [...]

Many people have worked to create this course. I'd like to thank, from Unstructured, Brian Raymond and Ronny Hoesada there, [...] In the first lesson, you'll learn how you can extract and normalize content from a diverse range of document types, so your LLM can reference information from PDFs, PowerPoints, Word docs, HTML, and more.

Data engineering is a key aspect of getting the context you need to your LLM so that it does well on your application. I hope you enjoy learning these leading-edge techniques, which I think you'll find useful for building many applications. Let's go on to the next video and get started.
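
The normalization and parent metadata ideas from the introduction can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Unstructured library's API: the `Element` schema, its field names, the category labels, the helper functions `search_titles` and `expand_with_children`, and the sample data are all hypothetical and made up for this example. The sketch shows content from different file types stored in one common shape, with each child element pointing at its chapter title so that a match on the title can be expanded to its child text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One normalized piece of document content (hypothetical schema)."""
    element_id: str
    category: str                    # e.g. "Title", "NarrativeText", "ListItem"
    text: str
    source: str                      # file the element was parsed from
    parent_id: Optional[str] = None  # id of the enclosing chapter title, if any


def search_titles(query: str, elements: List[Element]) -> List[Element]:
    """Toy keyword match on titles; a real pipeline would likely use embeddings."""
    q = query.lower()
    return [e for e in elements if e.category == "Title" and q in e.text.lower()]


def expand_with_children(matches: List[Element], elements: List[Element]) -> List[Element]:
    """Return each matched element followed by the elements that name it as parent."""
    results: List[Element] = []
    for match in matches:
        results.append(match)
        results.extend(e for e in elements if e.parent_id == match.element_id)
    return results


# Made-up sample data: a list item looks the same whether it came from
# a PDF or a PowerPoint file, and every child records its parent title.
elements = [
    Element("t1", "Title", "Chapter 1: Revenue", "report.pdf"),
    Element("p1", "NarrativeText", "Revenue grew this quarter.", "report.pdf", parent_id="t1"),
    Element("l1", "ListItem", "Key driver: new customers", "report.pdf", parent_id="t1"),
    Element("t2", "Title", "Chapter 2: Costs", "slides.pptx"),
    Element("l2", "ListItem", "Cloud spend", "slides.pptx", parent_id="t2"),
]

hits = search_titles("revenue", elements)
expanded = expand_with_children(hits, elements)
for e in expanded:
    print(e.category, "|", e.source, "|", e.text)
```

Storing only a `parent_id` on each child keeps the elements flat and easy to index, while still allowing the tree to be rebuilt at query time. Other designs, such as nesting children inside their parent, would work as well.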

Preprocessing Unstructured Data for LLM Applications

Introduction ・ Video ・ 4 mins
Overview of LLM Data Preprocessing ・ Video ・ 3 mins
Normalizing the Content ・ Video with Code Example ・ 14 mins
Metadata Extraction and Chunking ・ Video with Code Example ・ 21 mins
Preprocessing PDFs and Images ・ Video with Code Example ・ 10 mins
Extracting Tables ・ Video with Code Example ・ 8 mins
Build Your Own RAG Bot ・ Video with Code Example ・ 9 mins
Conclusion ・ Video ・ 1 min
Quiz ・ Graded Quiz ・ 10 mins