de en es ja ko ru zh

docs-mcp-server

by arabold

TypeScript

MIT License

542

676

Overview InspectNew Endpoints Schema Related Servers Reviews Score

Need Help?View Source Code Report Issue

docs-mcp-server
docs

splitter-hierarchy.md•7.87 kB

# Splitter Hierarchy System ## Overview The splitter hierarchy system provides a standardized way to organize and relate document chunks through two key properties: `level` and `path`. This hierarchical structure enables semantic chunk reassembly, context-aware search, and document structure preservation across different content types. ## Core Concepts ### Hierarchy Structure Every `ContentChunk` contains a `section` property with: ```typescript section: { level: number, // Hierarchical depth starting from 0 path: string[] // Array representing the hierarchical path } ``` ### Root Level All documents start at: - **Level**: `0` - **Path**: `[]` (empty array) This represents the document root and is used for: - Plain text content with no semantic structure - Source code at global scope (imports, global variables, unstructured code) - Base level for structured content ### Level Progression Levels increase with hierarchical depth: - **Level 0**: Document root, global/unstructured content - **Level 1**: Top-level structures (H1 headings, classes, JSON root objects) - **Level 2**: Sub-structures (H2 headings, class methods, nested objects) - **Level 3+**: Further nested content (H3+ headings, inner functions, deep nesting) ## Content Type Examples ### Markdown Documents ```typescript // # Chapter 1 -> level: 1, path: ["Chapter 1"] // ## Section 1.1 -> level: 2, path: ["Chapter 1", "Section 1.1"] // ### Subsection 1.1.1 -> level: 3, path: ["Chapter 1", "Section 1.1", "Subsection 1.1.1"] // Plain text content -> level: 0, path: [] ``` ### Source Code ```typescript // File: UserService.ts [ { content: "class UserService {", section: { level: 1, path: ["UserService", "opening"] }, }, { content: " getUser(id) { return db.find(id); }", section: { level: 2, path: ["UserService", "getUser"] }, }, { content: "}", section: { level: 1, path: ["UserService", "closing"] }, }, { content: "// Global code or imports", section: { level: 0, path: [] }, }, ]; ``` ### JSON Documents ```typescript // {"users": [{"name": "Alice"}, {"name": "Bob"}]} [ { content: '"Alice"', section: { level: 4, path: ["root", "users", "[0]", "name"] }, }, { content: '"Bob"', section: { level: 4, path: ["root", "users", "[1]", "name"] }, }, ]; ``` **Note**: JSON documents use `["root"]` as the starting path because the root object/array is a real structural element. This differs from source code where global content uses `[]` (empty path) because there's no meaningful structural container. ## Level and Path Relationship ### Independent but Related **Important**: Level and path are independent properties that don't need to match exactly: - Path length often correlates with level, but not always - A chunk can have `level: 1` with `path: ["A", "B", "C"]` (3 elements) - Level represents conceptual hierarchy depth - Path represents the actual navigational structure ### When They Diverge ```typescript // Example: Content inheriting parent context { content: "Some text under a deep heading", section: { level: 2, // Conceptually a level-2 section path: ["Chapter", "Section", "Subsection", "Details"] // But deep path } } ``` ## Chunk Merging Rules The `GreedySplitter` uses sophisticated rules when combining chunks: ### Level Selection Always uses the **lowest (most general) level** between chunks: ```typescript // Merging level 1 + level 3 = level 1 (most general) merge( { level: 1, path: ["Section"] }, { level: 3, path: ["Section", "Sub", "Detail"] } ); // Result: { level: 1, path: ["Section", "Sub", "Detail"] } ``` ### Path Selection Rules 1. **Identical sections**: Preserve original metadata 2. **Parent-child relationship**: Use the child's (deeper) path 3. **Sibling sections**: Use common parent path 4. **Unrelated sections**: Use root path `[]` ### Detailed Merging Examples #### Parent-Child Merging ```typescript // Parent content + Child content current: { level: 1, path: ["Section 1"] } next: { level: 2, path: ["Section 1", "SubSection 1.1"] } // Result: { level: 1, path: ["Section 1", "SubSection 1.1"] } ``` #### Sibling Merging ```typescript // Two sibling sections current: { level: 2, path: ["Section 1", "Sub 1.1"] } next: { level: 2, path: ["Section 1", "Sub 1.2"] } // Result: { level: 2, path: ["Section 1"] } // Common parent ``` #### Unrelated Merging ```typescript // Completely different sections current: { level: 1, path: ["Section 1"] } next: { level: 1, path: ["Section 2"] } // Result: { level: 1, path: [] } // Root level ``` #### Deep Hierarchy Merging ```typescript // Deep nested content current: { level: 1, path: ["S1"] } next: { level: 2, path: ["S1", "S1.1"] } final: { level: 3, path: ["S1", "S1.1", "S1.1.1"] } // Result: { level: 1, path: ["S1", "S1.1", "S1.1.1"] } // Lowest level, deepest path ``` ## Relationship Detection ### Parent-Child Detection ```typescript // One path is a prefix of another isParentChild(["Section"], ["Section", "SubSection"]); // true isParentChild(["A", "B"], ["A", "B", "C", "D"]); // true isParentChild(["X"], ["Y"]); // false ``` ### Common Path Finding ```typescript // Longest shared prefix findCommonPrefix(["A", "B", "C"], ["A", "B", "D"]); // ["A", "B"] findCommonPrefix(["X", "Y"], ["A", "B"]); // [] findCommonPrefix(["Same"], ["Same"]); // ["Same"] ``` ## Storage and Retrieval ### Database Usage The hierarchy enables sophisticated search operations: - **Parent chunks**: `path.slice(0, -1)` - **Child chunks**: Find paths starting with current path + one more element - **Sibling chunks**: Same path length with shared parent - **Context retrieval**: Automatically include related chunks in search results ### Search Context When retrieving search results, the system provides: 1. **Direct match**: The chunk that matched the search 2. **Parent context**: Broader context for understanding 3. **Sibling navigation**: Related content at the same level 4. **Child exploration**: Deeper content for more details ## Implementation Notes ### Path Uniqueness - **Paths don't need to be unique** within a document - Multiple chunks can share the same path if they're at the same level - Example: Multiple paragraphs under the same heading share `path` and `level` ### Content Type Agnostic The hierarchy system works consistently across: - Markdown documents (heading-based hierarchy) - Source code (function/class-based hierarchy) - JSON files (object structure hierarchy) - Plain text (root level only) ### Chunk Reassembly The hierarchy enables perfect document reconstruction: - Chunks maintain their original structural relationships - Merged chunks preserve the most specific path information - Level information indicates conceptual importance/generality ## Best Practices ### For Splitter Implementations 1. **Start with root**: Initialize with `level: 0, path: []` 2. **Increment consistently**: Each structural boundary increases appropriate level 3. **Build paths incrementally**: Add to path array as structure deepens 4. **Preserve semantics**: Level should reflect conceptual hierarchy depth ### For Chunk Processing 1. **Respect boundaries**: Don't merge across major structural breaks (H1/H2) 2. **Use hierarchy for context**: Include parent/child information in search results 3. **Maintain relationships**: Preserve path information during processing 4. **Leverage for navigation**: Enable users to explore related content ## Validation The hierarchy system ensures: - **Consistency**: All chunks follow the same structural rules - **Navigability**: Users can move between related content sections - **Context preservation**: Document structure survives the chunking process - **Search enhancement**: Hierarchical information improves result relevance

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/arabold/docs-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server