
Splitting markdown documents for RAG


Tags: rag, markdown, pdf

1. The Problem
2. The Solution
   1. Splitting
   2. Retrieval
   3. Further Improvements

I wanted to add a feature to Glama that allows users to upload documents and ask questions about them.

Ask a PDF question
https://glama.ai allows users to upload documents and then ask those documents a question by referencing the document in the prompt.

I've built similar features before, but they were always domain-specific, e.g. looking up recipes or searching for products. A generalized solution had a few unexpected challenges: converting documents to markdown, splitting documents, indexing documents, and retrieving documents all turned out to be quite complex.

In this post, I'll walk through the strategy for splitting documents into smaller chunks, since this took me a while to figure out.

The Problem

When you have a domain-specific RAG, it is typically easy to create a dedicated record for every entity in the domain. For example, if you are building a recipe RAG, you might have a record for each recipe, ingredient, and step. You don't have to worry about splitting the document into chunks, since you already know the semantic structure of the document.

However, when you have a generalized RAG, your input is just a document. Any document. Even when you convert the document to markdown (which has some structure), you still have to figure out how to split the document into context-aware chunks.

Suppose a user uploads a document like this:

```markdown
# Recipe Book

## Recipe 1

Name: Chocolate Chip Cookies

### Ingredients

* 2 cups all-purpose flour
* 1 cup granulated sugar
* 1 cup unsalted butter, at room temperature
* 1 cup light brown sugar, packed
* 2 large eggs
* 2 teaspoons vanilla extract
* 2 cups semi-sweet chocolate chips

### Instructions

1. Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.
2. In a medium bowl, whisk together flour, sugar, and butter.
3. In a large bowl, beat the egg yolks and the egg whites together.
4. Stir in the vanilla.
5. Gradually stir in the flour mixture until a dough forms.
6. Fold in the chocolate chips.
7. Drop the dough by rounded tablespoons onto the prepared baking sheet.
8. Bake for 8-10 minutes, or until the edges are golden brown.
9. Let cool for a few minutes before transferring to a wire rack to cool completely.

## Recipe 2

...
```

If we knew it was a recipe book, we could simply split the document into chunks based on the `## Recipe 1` and `## Recipe 2` headers. However, since we don't know the structure of the document, we can't split it based on headers alone:

1. If we split too high (at h2), we might end up with chunks that are too large
2. If we split too low (at h3), we might end up with many small chunks that lack the context needed to answer the question

So we need to split the document such that:

1. Each chunk has useful embeddings
2. Each chunk can pull in sufficient context to answer the question

Sounds like an impossible task, right? Well, it is. But I found a solution that works pretty well.

The Solution

The solution is a combination of several techniques.

Splitting

Splitting is done in two steps:

1. Parsing the document into a tree structure
2. Splitting each node in the tree into semantically meaningful chunks

Using our example document, the tree structure would look like this:

          { "children": [ { "children": [ { "children": [], "content": "### Ingredients\n\n* 2 cups all-purpose flour\n* 1 cup granulated sugar\n* 1 cup unsalted butter, at room temperature\n* 1 cup light brown sugar, packed\n* 2 large eggs\n* 2 teaspoons vanilla extract\n* 2 cups semi-sweet chocolate chips\n", "heading": { "depth": 3, "title": "Ingredients" } }, { "children": [], "content": "### Instructions\n\n1. Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.\n2. In a medium bowl, whisk together flour, sugar, and butter.\n3. In a large bowl, beat the egg yolks and the egg whites together.\n4. Stir in the vanilla.\n5. Gradually stir in the flour mixture until a dough forms.\n6. Fold in the chocolate chips.\n7. Drop the dough by rounded tablespoons onto the prepared baking sheet.\n8. Bake for 8-10 minutes, or until the edges are golden brown.\n9. Let cool for a few minutes before transferring to a wire rack to cool completely.\n", "heading": { "depth": 3, "title": "Instructions" } } ], "content": "## Recipe 1\n\nName: Chocolate Chip Cookies\n", "heading": { "depth": 2, "title": "Recipe 1" } }, { "children": [], "content": "## Recipe 2 ...\n", "heading": { "depth": 2, "title": "Recipe 2" } } ], "content": null, "heading": { "depth": 1, "title": "Recipe Book" } }

The benefit of this structure is that we can now store these sections in a database while retaining their hierarchical structure. Here is the database schema:

          Table "public.document_section" Column | Type | Collation | Nullable | Default ----------------------------+---------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity uploaded_document_id | integer | | not null | parent_document_section_id | integer | | | heading_title | text | | not null | heading_depth | integer | | not null | content | text | | | sequence_number | integer | | not null | path | ltree | | not null |

The path column is a PostgreSQL ltree column that stores each section's position in the document hierarchy. This is useful for querying later on.
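For example (a hedged sketch with node-postgres; the query shapes are mine, not the post's actual queries), ltree turns hierarchy lookups into one-liners, such as fetching a section's ancestors for extra context, or all of its descendants:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// All ancestors of a section (plus the section itself): ltree's @> operator
// means "is an ancestor of or equal to", so path '1.4' matches '1.4.9'.
const findAncestors = async (path: string) => {
  const { rows } = await pool.query(
    `SELECT heading_title, heading_depth, content
     FROM document_section
     WHERE path @> $1::ltree
     ORDER BY heading_depth`,
    [path],
  );
  return rows;
};

// All descendants: <@ is the mirror operator, "is a descendant of or equal to".
const findDescendants = async (path: string) => {
  const { rows } = await pool.query(
    `SELECT heading_title, content
     FROM document_section
     WHERE path <@ $1::ltree
     ORDER BY path`,
    [path],
  );
  return rows;
};
```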

However, this alone is not enough. Since a section can be arbitrarily long, we need to split sections into smaller chunks. This also allows us to create more granular embeddings for each chunk.

I ended up using mdast to split each section into chunks of between 1,000 and 2,000 characters, with exceptions for tables, code blocks, blockquotes, and lists so that they are not split mid-structure.
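The post doesn't show the splitting code, but a rough sketch of the idea with the same remark tooling might look like this (the size constants match the post; the boundary policy is my own simplification):

```typescript
import { unified } from 'unified';
import remarkParse from 'remark-parse';
import { toMarkdown } from 'mdast-util-to-markdown';

const MIN_CHUNK = 1_000;
const MAX_CHUNK = 2_000;

export const splitSectionIntoChunks = (sectionMarkdown: string): string[] => {
  const tree = unified().use(remarkParse).parse(sectionMarkdown);
  const chunks: string[] = [];
  let current = '';

  for (const node of tree.children) {
    const text = toMarkdown(node);

    // Flush once the chunk is big enough and the next node would overflow it.
    // Packing whole top-level nodes means tables, code blocks, blockquotes,
    // and lists are never split internally.
    if (current.length >= MIN_CHUNK && current.length + text.length > MAX_CHUNK) {
      chunks.push(current);
      current = '';
    }
    current += text;
  }

  if (current.length > 0) {
    chunks.push(current);
  }

  return chunks;
};
```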

Here is the resulting database schema:

          Table "public.document_section_chunk" Column | Type | Collation | Nullable | Default ---------------------+--------------+-----------+----------+------------------------------ id | integer | | not null | generated always as identity document_section_id | integer | | not null | chunk_index | integer | | not null | content | text | | not null | embedding | vector(1024) | | not null |

The embedding column is a PostgreSQL vector column (from the pgvector extension) that stores the embedding of the chunk. I used jina-embeddings-v3 to create the embeddings: it scores relatively well on the MTEB leaderboard while being relatively light on memory.
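A hedged sketch of embedding and storing a chunk. The request shape follows Jina's public embeddings API as I understand it (the endpoint and `task` parameter are assumptions worth checking against their docs), and `pool` is a node-postgres pool:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

const embed = async (
  texts: string[],
  task: 'retrieval.passage' | 'retrieval.query' = 'retrieval.passage',
): Promise<number[][]> => {
  const response = await fetch('https://api.jina.ai/v1/embeddings', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.JINA_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'jina-embeddings-v3',
      task,
      input: texts,
    }),
  });

  const { data } = (await response.json()) as {
    data: { embedding: number[] }[];
  };

  return data.map((item) => item.embedding);
};

const storeChunk = async (
  documentSectionId: number,
  chunkIndex: number,
  content: string,
) => {
  const [embedding] = await embed([content]);

  await pool.query(
    `INSERT INTO document_section_chunk
       (document_section_id, chunk_index, content, embedding)
     VALUES ($1, $2, $3, $4::vector)`,
    // pgvector accepts the '[1,2,3]' text format, which JSON.stringify
    // happens to produce for an array of numbers.
    [documentSectionId, chunkIndex, content, JSON.stringify(embedding)],
  );
};
```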

Okay, so now we have a database that stores the document sections and their embeddings. The next step is to build the retrieval logic that surfaces the relevant sections/chunks for a given question.

Retrieval

Retrieval is the process of finding the relevant chunks for a given question.

My process was to:

1. Use an LLM to generate several search queries based on the user's input. For example, if the user asks "What is the recipe for chocolate chip cookies?", the LLM generates queries that break the question down into smaller parts, e.g. "chocolate chip cookies ingredients", "chocolate chip cookies instructions", etc.
2. Query the database to find the top N chunks that match the generated queries.
3. Use the document_section_chunk and document_section relationship to identify which sections the chunks belong to, and which sections are referenced by chunks most frequently (a sketch of steps 2 and 3 follows below).
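A sketch of steps 2 and 3, assuming pgvector's `<=>` cosine-distance operator and the hypothetical `embed()` helper from the earlier sketch (the per-query limit and names are illustrative):

```typescript
const retrieve = async (uploadedDocumentId: number, queries: string[]) => {
  const embeddings = await embed(queries, 'retrieval.query');

  const matches: { chunkId: number; sectionId: number; distance: number }[] = [];

  for (const embedding of embeddings) {
    // Top N chunks for each generated query, ordered by cosine distance.
    const { rows } = await pool.query(
      `SELECT
         dsc.id AS "chunkId",
         dsc.document_section_id AS "sectionId",
         dsc.embedding <=> $1::vector AS distance
       FROM document_section_chunk dsc
       JOIN document_section ds ON ds.id = dsc.document_section_id
       WHERE ds.uploaded_document_id = $2
       ORDER BY distance
       LIMIT 10`,
      [JSON.stringify(embedding), uploadedDocumentId],
    );
    matches.push(...rows);
  }

  // Step 3: count how often each parent section appears among the matches.
  const sectionHits = new Map<number, number>();
  for (const { sectionId } of matches) {
    sectionHits.set(sectionId, (sectionHits.get(sectionId) ?? 0) + 1);
  }

  return { matches, sectionHits };
};
```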

At this point, we know:

1. which chunks are the most relevant to the question
2. which sections are the most relevant to the question

Both rankings are driven by cosine distance: chunks are ordered by the distance between their embeddings and the query embeddings, and sections by how often their chunks appear among the best matches.

However, we don't know which sections/chunks can actually be used to answer the question, i.e. just because a chunk has a low cosine distance to the question, it does not mean that the chunk answers the question. For this step, I ended up using another LLM prompt: it contains the question and the candidate chunks, and asks the LLM to rank the chunks based on how well they answer the question.
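The post doesn't include the prompt itself; here is a sketch of the shape I'd expect (wording and names are mine):

```typescript
// Builds the ranking prompt: the model sees the question and the numbered
// candidate chunks, and is asked to return the useful ones, best first.
const buildRankingPrompt = (question: string, chunks: string[]): string => {
  const candidates = chunks
    .map((chunk, index) => `[${index}] ${chunk}`)
    .join('\n\n');

  return [
    'You are ranking retrieved document fragments.',
    '',
    `Question: ${question}`,
    '',
    'Candidates:',
    '',
    candidates,
    '',
    'Return a JSON array of candidate indexes, ordered from the fragment',
    'that best answers the question to the worst. Omit fragments that do',
    'not help answer the question at all.',
  ].join('\n');
};
```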

I later learned that Jina has a Reranker API that does essentially the same thing. I compared the two approaches and found that both solutions perform equally well. However, if you prefer a higher level of abstraction, Reranker is a good choice.

Finally, I have a handful of sections/chunks that answer the question. The last step is to decide which sections/chunks to include in the final answer. I do this by assigning a finite budget to each question (e.g. 1000 tokens) and then adding the most relevant sections/chunks in priority order. Sections and chunks are treated separately because sometimes a single section answers the whole question and fits in the budget, while other times we need to include the more granular chunks in the answer.
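A sketch of the budgeting step; `countTokens` stands in for whatever tokenizer is used, and every name here is illustrative:

```typescript
type Candidate = {
  content: string;
  // Relevance from the reranking step; higher is better.
  score: number;
};

// Greedily packs the highest-scoring candidates into a fixed token budget.
const fillBudget = (
  candidates: Candidate[],
  budget: number,
  countTokens: (text: string) => number,
): string[] => {
  const selected: string[] = [];
  let remaining = budget;

  for (const candidate of [...candidates].sort((a, b) => b.score - a.score)) {
    const cost = countTokens(candidate.content);

    // Skip anything that doesn't fit; a smaller chunk later may still fit.
    if (cost <= remaining) {
      selected.push(candidate.content);
      remaining -= cost;
    }
  }

  return selected;
};
```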

Further Improvements

As I started typing this post, I realized that there are too many subtle details to cover without making it too long.

A few things I want to mention that helped me improve the solution:

• I use a simple LLM prompt to generate a brief description of each section. I then create embeddings for those descriptions and use them as part of the logic that determines which sections to include in the answer.
• I include meta information about each section in the generated answer, e.g. the section title, depth, and the surrounding section names.
• I provide multiple tools to the LLM to help answer the question, e.g. a tool to look up all mentions of a term in the document, a tool to look up the next section in the document, etc.

Overall, I think the biggest innovation of this approach is splitting markdown documents into a hierarchical structure and then splitting each section into smaller chunks. This makes it possible to build a generalized RAG that can answer questions about any markdown document.