Spotlights:
Anurag Patnaik
May 1, 2024
The Imperative for ETL in AI Enrichment
In the heart of every modern organization lie unimaginably large troves of data, spread across disparate sources and formats. We're not just talking about well-structured databases or cleanly formatted CSVs. Unstructured and semi-structured data is vast and varied, encompassing everything from casually written emails to complex technical manuals. Simply traversing this data is daunting enough, but how do you make it usable for LLMs? Over the past year the data science community has made progress maturing Retrieval Augmented Generation (RAG) architectures, which introduce a retriever module to query prompt-relevant data and provide it as context to an LLM.
The challenge over the coming year will be productionizing these systems: ensuring that all of an enterprise's data is available to foundation models while operating efficiently at scale. This is where new approaches to data ingestion and preprocessing are absolutely critical. Effectively preprocessing and structuring this data is essential not only for making it accessible to your foundation models, but also for dramatically improving the efficacy of the end-user applications that accelerate and enhance workflows. In this article, we unpack the core issues associated with data ingestion and preprocessing for LLMs and RAG architectures.
What does it mean to be RAG-ready?
To fully leverage LLMs, it's essential to convert unstructured and semi-structured documents into a format that is machine-readable and optimized for use with LLMs.
First, we need to extract the text from a file and convert it into a predefined, structured format with associated metadata. This constitutes the Transform stage. Transformation is not enough, though; making documents truly RAG-ready also involves Cleaning, Chunking, Summarizing, and Generating Embeddings. In the following section, we'll detail each of these critical steps.
Transform
Extract: This step involves pulling a representation of the document's content out of the source file. The complexity varies depending on the document type.
Partition: This entails breaking the document down into smaller, logical units along semantically meaningful boundaries. While one could simply provide the extracted content as a "wall of text," it is recommended to classify documents into the smallest possible logical units (elements). Elements provide a solid, atomic-level foundation for preparing data for a range of machine learning use cases, and they can be leveraged for fine-grained cleaning and chunking strategies. Some tools can also generate novel element-level metadata that's useful for cleaning as well as retrieval.
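To make this concrete, here is a minimal sketch of extraction and partitioning using the open-source unstructured library's automatic partitioner; the file name is purely illustrative.

```python
# A minimal partitioning sketch using the open-source `unstructured` library
# (assumes it is installed with the extras needed for your file types).
from unstructured.partition.auto import partition

# Extract and partition the source file into semantically meaningful elements
# (titles, narrative text, list items, tables, and so on).
elements = partition(filename="quarterly-report.pdf")  # illustrative file name

for element in elements:
    # Each element carries a type, its text, and element-level metadata
    # (file name, page number, etc.) useful for cleaning and retrieval.
    print(element.category, "|", element.text[:60])
```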
Structure: This involves writing the partitioned results to a structured format like JSON, which enables more efficient manipulation by code in subsequent preprocessing stages. JSON also has the virtue of being human readable, facilitating manual inspection and analysis of interim or final results. Critical for enterprise-scale RAG is rendering data into a common data schema, irrespective of the original file format (e.g., .docx, .xml, .xlsx, .pdf).
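Continuing the sketch above, the partitioned elements can be serialized into a common JSON schema regardless of the source format:

```python
# Serialize the partitioned elements to JSON so every document, whatever its
# original format, lands in the same schema (type, text, metadata).
import json

element_dicts = [element.to_dict() for element in elements]

with open("quarterly-report.json", "w") as f:
    json.dump(element_dicts, f, indent=2)
```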
Clean: Cleaning data involves removing unwanted content (like headers, footers, duplicative templates, or irrelevant sections) that is costly to process and store and increases the likelihood of polluting context windows with unnecessary or irrelevant data. Historically, data scientists have had to hard-code hundreds or thousands of regular expressions or integrate custom Python scripts into preprocessing pipelines to clean their data, a laborious approach prone to breaking if document layouts or file formats change.
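As a hedged sketch of the alternative, element-level cleaning can lean on built-in cleaners and element types rather than hand-rolled regular expressions; this reuses the `elements` list from the partitioning sketch above.

```python
# A sketch of element-level cleaning with `unstructured`'s built-in cleaners.
from unstructured.cleaners.core import clean

for element in elements:
    # Normalize extra whitespace, stray dashes, and bullet characters.
    element.text = clean(element.text, extra_whitespace=True, dashes=True, bullets=True)

# Drop boilerplate such as page headers and footers by element type.
cleaned_elements = [el for el in elements if el.category not in ("Header", "Footer")]
```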
Chunk: Chunking refers to breaking a document down into segments. Often, chunking is done naively: dividing the extracted "wall of text" into segments of an arbitrary character count, or relying on natural delimiters such as periods to chunk by sentence. A better approach is to use logical, contextual boundaries, a technique known as smart-chunking. Storing data in this fashion enhances the performance of RAG applications by allowing more relevant segments of data to be retrieved and passed as context to the LLM. Smart-chunking can be accomplished with an LLM API or other tools.
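One way to smart-chunk, sketched below, is to group the cleaned elements along section boundaries rather than splitting on a fixed character count; the size limit shown is illustrative.

```python
# A smart-chunking sketch: combine elements into chunks at section/title
# boundaries, using the cleaned elements from the sketch above.
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(cleaned_elements, max_characters=1000)

for chunk in chunks:
    print(len(chunk.text), chunk.text[:60])
```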
Summarize: Generating summaries of chunks has emerged as a powerful technique for improving the performance of Multi-Vector Retrieval systems, which use distilled representations for retrieval and the full text of the chunks for answer synthesis. In such systems, summaries provide a condensed yet rich representation of the data that enables efficient matching of queries with the summarized content and the associated raw data. In the case of images and tables, summarization can also greatly improve discoverability and retrieved context.
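A hedged sketch of chunk summarization using the OpenAI chat completions API follows; the model name is illustrative, and any capable LLM could be substituted.

```python
# Summarize each chunk for multi-vector retrieval (assumes OPENAI_API_KEY
# is set in the environment; the model name is an illustrative choice).
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the passage in two or three sentences."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

summaries = [summarize(chunk.text) for chunk in chunks]
```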
Generate Embeddings: This process involves using ML models to represent text as vectors, known as embeddings, which are lists of floating-point numbers that encode semantic information about the underlying data. Embeddings allow text to be searched based on semantic similarity, not just keyword matching, and are central to many LLM applications. Developers need the flexibility to experiment with various combinations of chunking techniques and embedding models to identify the combination best suited to a specific task (considering factors like speed, data specialization, and language complexity). Common embedding models are available from model hosts such as Hugging Face, AWS Bedrock, and OpenAI, allowing users to specify their preferred model and parameters while the preprocessing platform handles the rest.
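For example, here is a minimal embedding sketch using a Hugging Face sentence-transformers model; swapping the model name is how one would experiment with different embedding spaces.

```python
# Embed the chunk summaries (or the chunks themselves) as dense vectors.
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

vectors = embedding_model.encode(summaries)
print(vectors.shape)  # (number_of_chunks, 384) for this particular model
```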
The Preprocessing Workflow
So far, we've discussed the journey of rendering a single file RAG-ready. In practice, developers want to unlock the ability to continuously preprocess an ever-changing and ever-growing repository of files. For these production use cases, developers require a robust preprocessing solution: one that systematically fetches all of their files from their various locations, ushers them through the document processing stages described above, and finally writes them to one or more destinations. The comprehensive sequence now looks like this: Connect (sources) → Transform → Clean/Curate → Chunk → Summarize → Generate Embeddings → Connect (destinations).
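Tying the earlier sketches together, the single-file version of this sequence might look like the following; `summarize` and `embedding_model` refer to the helpers sketched above, and a production pipeline would wrap this in connectors, scheduling, and error handling.

```python
# An end-to-end, single-file sketch of Transform -> Clean -> Chunk ->
# Summarize -> Generate Embeddings, reusing `summarize` and `embedding_model`
# from the earlier sketches. Connectors handle the steps on either side.
from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean
from unstructured.chunking.title import chunk_by_title

def preprocess_file(path: str) -> list[dict]:
    elements = partition(filename=path)                          # Transform
    for element in elements:                                     # Clean
        element.text = clean(element.text, extra_whitespace=True)
    chunks = chunk_by_title(elements, max_characters=1000)       # Chunk
    records = []
    for chunk in chunks:
        summary = summarize(chunk.text)                          # Summarize
        vector = embedding_model.encode(summary).tolist()        # Generate embeddings
        records.append({
            "text": chunk.text,
            "summary": summary,
            "embedding": vector,
            "metadata": chunk.metadata.to_dict(),
        })
    return records

records = preprocess_file("quarterly-report.pdf")  # illustrative file name
```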
Source Connectors: Organizations store files in dozens of locations, such as S3, Google Drive, SharePoint, Notion, and Confluence. Connectors are specialized components that connect to these sources and ingest documents into the preprocessing pipeline. They play a pivotal role in ensuring a smooth, continuous data flow by maintaining the state of ingested documents and processing many documents in parallel for high-volume datasets. Enterprise-grade source connectors need to capture and embed source metadata elements so they can be leveraged within the preprocessing workflow or in the broader application it powers. For instance, a workflow might use version metadata to determine whether upstream sources have been updated since a previous run and require re-processing. Metadata like the source URL and document name allows an LLM to directly cite and link to sources in its response. Creation date and last-modified metadata are also helpful when filtering or searching for content by date or document name. Additionally, these connectors must be resilient to interruptions, able to handle tens or hundreds of thousands of files, and designed to be computationally efficient. Finally, developers must plan for how to support these connectors once they've moved them into production. In some cases, community-supported open-source connectors will be adequate, but in the majority of instances developers prefer to have SLAs attached to these connectors to ensure minimal downtime when they inevitably break.
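As a hedged illustration of the state-keeping a source connector performs, the sketch below lists objects in an S3 bucket and skips anything unchanged since the last run; the bucket name and state-file path are assumptions made for the example.

```python
# A minimal S3 source-connector sketch: discover objects, skip files whose
# ETag hasn't changed since the last run, and download the rest for
# preprocessing. Bucket name and paths are illustrative.
import json
import pathlib
import boto3

BUCKET = "company-docs"                       # illustrative bucket name
STATE_FILE = pathlib.Path("ingest_state.json")

s3 = boto3.client("s3")
seen = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    key, etag = obj["Key"], obj["ETag"]
    if seen.get(key) == etag:
        continue  # unchanged since the previous run; no re-processing needed
    local_path = f"/tmp/{pathlib.Path(key).name}"
    s3.download_file(BUCKET, key, local_path)
    seen[key] = etag  # record the version we just ingested

STATE_FILE.write_text(json.dumps(seen, indent=2))
```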
Destination Connectors: At the other end of the workflow, destination connectors write RAG-ready document elements and their corresponding metadata to target storage systems or databases. A destination connector should explicitly handle all available metadata, ensuring it's indexed and available for searching and filtering in downstream applications. When writing to a vector database, the destination connector should also leverage available embeddings to enable semantic search across the data. As with source connectors, in most cases developers prefer that these connectors be maintained (and fixed) by a third party to minimize the risk of downtime when they inevitably break.
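A destination connector might then write the preprocessed records, embeddings and metadata included, to a vector store; the sketch below uses a local Chroma collection and assumes the `records` produced by the earlier pipeline sketch.

```python
# A minimal destination-connector sketch: write chunk text, embeddings, and
# metadata (from the pipeline sketch above) to a local Chroma vector store.
import chromadb

client = chromadb.PersistentClient(path="./vector-store")  # illustrative path
collection = client.get_or_create_collection("rag_documents")

collection.add(
    ids=[str(i) for i in range(len(records))],
    documents=[record["text"] for record in records],
    embeddings=[record["embedding"] for record in records],
    metadatas=[{"filename": record["metadata"].get("filename", "unknown")} for record in records],
)
```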
Orchestration: Beyond just connecting to sources and destinations, a preprocessing platform should also handle workflow orchestration, including automation, scheduling, scaling, logging, error handling, and more. Orchestration is particularly important when moving RAG prototypes to production settings. In most cases, organizations won't rely on a single batch upload of data to a vector database; rather, as new files are created and existing ones updated, the system should automatically update the vector stores. Production RAG requires scheduling to routinely discover and process or reprocess net-new data. Scalability is also an important consideration: a workflow may need to process files in parallel to handle large repositories in a timely manner. Effective pipeline orchestration and scalability are fundamental to maintaining an up-to-date, efficient data preprocessing system that can adapt to the dynamic nature of enterprise raw data stores.
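To make the scheduling piece concrete, here is a hedged sketch of an Apache Airflow DAG (Airflow 2.x) that re-runs the ingest, preprocess, and write steps daily; the task callables are placeholders standing in for the connector and pipeline sketches above.

```python
# An orchestration sketch: a daily Apache Airflow DAG chaining ingest ->
# preprocess -> write. The callables are placeholders for the earlier sketches.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_new_files():
    ...  # e.g., the S3 source-connector sketch

def preprocess_documents():
    ...  # e.g., partition, clean, chunk, summarize, embed

def write_to_vector_store():
    ...  # e.g., the Chroma destination-connector sketch

with DAG(
    dag_id="rag_preprocessing",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # routinely re-discover and process net-new data
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_new_files)
    preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_documents)
    write = PythonOperator(task_id="write", python_callable=write_to_vector_store)

    ingest >> preprocess >> write
```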
Exploring the ETL Toolkit for AI Data Preparation
Navigating the vast landscape of ETL tools suitable for preparing data for Generative AI, particularly for embeddings, requires an understanding of the tools' capabilities and their alignment with specific needs. Here's a closer look at the types of ETL tools available, each catering to unique aspects of data preparation for AI contexts.
Cloud-Native ETL Tools: For organizations leveraging cloud-based LLMs, tools like Matillion and Hevo Data offer a blend of power and simplicity. Their cloud-native design ensures seamless integration with cloud data warehouses, providing a robust platform for processing vast datasets efficiently. These tools are particularly adept at handling structured data and offer extensive connectivity options to various data sources.
Open-Source Solutions for Unstructured Data: Unstructured.io emerges as a beacon for managing the complexity of unstructured data. Designed to tackle text documents, emails, and other non-traditional data forms, this open-source tool excels at extracting valuable information and transforming it into a structured format conducive to AI training. Its focus on unstructured data makes it an indispensable resource for enriching LLM contexts with diverse and deep content.
Versatile General ETL Tools: Apache Airflow and Pentaho represent the Swiss Army knives of the ETL world. While not exclusively focused on AI data preparation, their extensive transformation capabilities and the ability to be tailored to specific tasks make them highly valuable. Their adaptability ensures that organizations can mold the tools to their precise requirements, whether it involves structured or unstructured data, thereby supporting a wide range of AI applications.