🤖 AI Snack 🍿 : Elevating Your RAG - The Significance of Proper Document Parsing

When building a Retrieval-Augmented Generation (RAG) system, one crucial step is parsing your source documents (primarily PDF and HTML files) effectively. This involves breaking the text into meaningful "chunks" that will later be indexed. So what makes a quality chunk, and how do you create one?

Properties of a Quality Knowledge Item:

  • Logical Boundaries: Chunks should begin and end at natural sentence breaks, ensuring a smooth reading experience and avoiding truncation of ideas.
  • Optimal Size: Balance chunk length against your embedding model's context window. Smaller chunks provide fine-grained detail but might miss broader context, while larger chunks offer a more general view but risk diluting specific information. The right balance depends on the nature of the documents you are working with.
  • Self-Containment: Each chunk should convey a complete idea or fact, acting as a mini-summary that's understandable on its own. This is crucial for effective knowledge retrieval.
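To make the first two properties concrete, here is a minimal sketch of a chunker that only cuts at sentence boundaries and keeps each chunk under a size budget. It is an illustration under stated assumptions, not part of ChatFAQ: the function names, the regex-based sentence splitter, and the use of a word count as a stand-in for a token count are all hypothetical choices made for this example.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations such as "e.g." will confuse it.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentences(text, max_words=50, overlap=0):
    """Pack whole sentences into chunks of roughly max_words words.

    overlap is the number of trailing sentences repeated at the start of
    the next chunk. Words are used as a rough proxy for tokens.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            # Close the chunk at a sentence boundary, never mid-sentence.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "A b c. D e f. G h i. J k l."
print(chunk_by_sentences(sample, max_words=6))
print(chunk_by_sentences(sample, max_words=6, overlap=1))
```

Note that a single sentence longer than max_words, or a chunk that starts with overlapping sentences, can exceed the budget. A production chunker would count tokens with the embedding model's own tokenizer and handle those cases explicitly.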

Chunking Methods: Your Options

  • Token-Based Splitting: This is a simple approach, but it requires careful tuning of chunk size and overlap. Smaller chunks with high overlap result in precise embeddings that capture specific details, while larger chunks with less overlap provide more general embeddings but might struggle with retrieving specific information. While this method can achieve the first two properties (logical boundaries and size), it often falls short on self-containment.
  • Semantic Chunking: This method aims for self-contained chunks using techniques like embedding models or large language models (LLMs). While it's the ideal approach for achieving all three properties, current implementations still have limitations and might be expensive to run.
  • Custom Parsers: Developing a custom parser is the most effective approach for achieving high-quality knowledge items, but it also requires the greatest effort. The key prerequisite is that your documents must share a consistent structure. By creating a document-specific parser, you can meticulously analyze the documents to extract chunks that satisfy all three essential properties while aligning with your specific use case. For instance, if your documents contain numbered sections (e.g., 1, 1.1, 1.2, 2.1), you can design your parser to detect these sections and return each section as an individual chunk or group sections together based on what makes the most sense for your application.
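The numbered-section idea from the custom parser option can be sketched as follows. This is a hedged illustration only: it assumes plain-text input where each heading sits on its own line in the form "1.2 Title", and the names split_by_sections and group_by_top_level are hypothetical, not part of any ChatFAQ API.

```python
import re

# Assumed heading format: a section number such as "1", "1.1" or "2.3.4",
# then whitespace, then a title. A body line that happens to start with a
# number (e.g. "3 apples ...") would be misread as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def split_by_sections(text):
    """Return one chunk per numbered section.

    Lines before the first heading are ignored in this sketch.
    """
    sections, current = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            current = {"section": match.group(1), "title": match.group(2), "body": []}
            sections.append(current)
        elif current is not None and line:
            current["body"].append(line)
    return [
        {"section": s["section"], "title": s["title"], "text": " ".join(s["body"])}
        for s in sections
    ]


def group_by_top_level(chunks):
    """Merge subsections (1.1, 1.2, ...) into their top-level section."""
    grouped = {}
    for chunk in chunks:
        top = chunk["section"].split(".")[0]
        grouped.setdefault(top, []).append(f'{chunk["title"]}: {chunk["text"]}')
    return {top: " ".join(parts) for top, parts in grouped.items()}


doc = (
    "1 Introduction\nThis is the intro.\n"
    "1.1 Scope\nScope text line one.\nScope line two.\n\n"
    "2 Methods\nMethod text."
)
for chunk in split_by_sections(doc):
    print(chunk["section"], "|", chunk["title"], "|", chunk["text"])
```

Whether to return each section on its own or to merge subsections with group_by_top_level depends on how long the sections are and how self-contained each one reads in your documents.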

The level of customization a custom parser offers makes it the most reliable way to ensure that the resulting knowledge items are well structured for your RAG system. This is why ChatFAQ's SDK lets you register custom parsers when building a custom RAG SDK:

Python Code Example
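# The imported class name must match the one instantiated below.
from .parsers import MyCustomParser
from .. import make_chatfaq_sdk
from .fsm_definition import fsm_definition


def main():
    # Build the SDK and register the custom parser under the key "my_custom_parser".
    sdk = make_chatfaq_sdk(
        fsm_name="custom_fsm",
        fsm_definition=fsm_definition,
        data_source_parsers={"my_custom_parser": MyCustomParser()},
    )
    # Connect the SDK.
    sdk.connect()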

If you are looking for a RAG pipeline that can work with your own company documents, clone ChatFAQ's open-source repository!