Qdrant RAG MCP Server

ast-chunking-implementation.md•8.11 KiB

# AST-Based Hierarchical Chunking Implementation

This document describes the AST-based chunking feature implemented in v0.1.5-v0.1.6 of the Qdrant RAG MCP Server.

## Overview

AST (Abstract Syntax Tree) based chunking indexes code according to its structure rather than treating it as plain text. This results in:

- **40-70% fewer chunks** by preserving complete code structures
- **Better search results** as chunks represent meaningful code units
- **Hierarchical context** with relationships between classes, methods, and functions
- **Progressive retrieval** - fetch only the specific method/class needed
- **Multi-language support** - Python (v0.1.5), Shell scripts, and Go (v0.1.6)

## Architecture

### Core Components

1. **AST Chunker** (`src/utils/ast_chunker.py`)
   - Parses Python code using the built-in `ast` module
   - Creates hierarchical chunks based on code structure
   - Preserves complete functions, classes, and methods
   - Falls back to text splitting on parse errors

2. **Enhanced Code Indexer** (`src/indexers/code_indexer.py`)
   - Detects files in supported languages (Python in v0.1.5; Shell and Go added in v0.1.6) and applies AST chunking
   - Falls back to traditional chunking for other languages
   - Preserves all hierarchical metadata

3. **Metadata Storage**
   - Stores hierarchy information (e.g., `['module', 'ClassName', 'method_name']`)
   - Preserves function signatures, decorators, and docstrings
   - Maintains line number mappings for precise navigation

## How It Works

### Traditional Chunking (Before)

```python
# File gets split arbitrarily every 1500 characters
# Chunk 1:
class UserManager:
    def __init__(self):
        self.users = {}

    def add_user(self, user_id, name):
        """Add a new user to the system"""
        if user_id in self.users:
            raise ValueError("User already exists")

        self.users[user_id] = {
            'name': name,
            'created': datetime.now(),
            'last_login': None,
            'preferences': {}
        }
        self._log_action('add_user', user_

# Chunk 2:
id)
        return True

    def update_user(self, user_id, **kwargs):
        """Update user information"""
        if user_id not in self.users:
```

### AST Chunking (After)

```python
# Chunk 1: Class definition
class UserManager:
    """Manages user accounts and preferences"""
    ...

# Chunk 2: Complete method
def add_user(self, user_id, name):
    """Add a new user to the system"""
    if user_id in self.users:
        raise ValueError("User already exists")

    self.users[user_id] = {
        'name': name,
        'created': datetime.now(),
        'last_login': None,
        'preferences': {}
    }
    self._log_action('add_user', user_id)
    return True

# Chunk 3: Another complete method
def update_user(self, user_id, **kwargs):
    """Update user information"""
    if user_id not in self.users:
        raise ValueError("User not found")
    # ... rest of method
```

## Supported Languages

### Python (v0.1.5)

- Uses built-in `ast` module for parsing
- Preserves complete functions, classes, and methods
- Groups imports together
- Maintains docstrings and decorators

### Shell Scripts (v0.1.6)

- Supports `.sh` and `.bash` files
- Extracts shell functions with proper boundaries
- Separates setup/configuration code from functions
- Handles both `function name() {}` and `name() {}` syntax

### Go (v0.1.6)

- Supports `.go` files
- Extracts packages, imports, functions, methods, structs, and interfaces
- Preserves Go's visibility rules (exported vs unexported)
- Maintains receiver types for methods

## Chunk Types

The AST chunker creates language-specific chunk types:

### Python

1. **imports** - All import statements grouped together
2. **class** - Complete class (if small enough)
3. **class_definition** - Class signature and docstring (for large classes)
4. **function** - Standalone functions
5. **method** - Class methods
6. **module** - Module-level code and scripts

### Shell Scripts

1. **setup** - Top-level variables and configuration before first function
2. **function** - Shell functions
3. **script** - Complete script (when no functions are found)

### Go

1. **package** - Package declaration and imports
2. **function** - Standalone functions
3. **method** - Methods with receivers
4. **struct** - Struct definitions
5. **interface** - Interface definitions
6. **module** - Fallback for unparseable files

## Metadata Structure

Each AST chunk includes rich metadata:

### Python Example

```json
{
  "chunk_type": "method",
  "name": "add_user",
  "hierarchy": ["module", "UserManager", "add_user"],
  "async": false,
  "decorators": ["login_required"],
  "args": {
    "args": ["self", "user_id", "name"],
    "defaults": 0,
    "kwonly": [],
    "vararg": null,
    "kwarg": null
  },
  "returns": "bool",
  "is_method": true,
  "language": "python",
  "line_start": 15,
  "line_end": 28
}
```

### Shell Script Example

```json
{
  "chunk_type": "function",
  "name": "setup_environment",
  "hierarchy": ["script", "setup_environment"],
  "language": "shell",
  "is_exported": false,
  "line_start": 14,
  "line_end": 24
}
```

### Go Example

```json
{
  "chunk_type": "method",
  "name": "Info",
  "hierarchy": ["package", "SimpleLogger", "Info"],
  "language": "go",
  "is_method": true,
  "receiver_type": "SimpleLogger",
  "is_exported": true,
  "line_start": 34,
  "line_end": 36
}
```

## Benefits

### 1. Token Efficiency

- **61.7% fewer chunks** in the test file (see Performance Characteristics)
- Complete code structures in single chunks
- No more split functions or classes

### 2. Search Quality

- Searches find complete, runnable code
- Better understanding of code relationships
- Hierarchical context for navigation

### 3. Developer Experience

- See entire functions/classes at once
- Navigate code hierarchy naturally
- Understand code structure better

## Performance Characteristics

Based on testing with the `qdrant_mcp_context_aware.py` file:

- Traditional chunks: 60
- AST chunks: 23
- Reduction: 61.7%

While total tokens may sometimes increase slightly (because complete functions are preserved), the quality improvement and the reduction in chunk count more than compensate.
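To make the Python chunk types and metadata fields described above more concrete, here is a minimal sketch of how structure-aware chunk boundaries could be derived with the built-in `ast` module. It is not the code in `src/utils/ast_chunker.py`: the function name `chunk_python_source`, the `SAMPLE` input, and the decision to emit only `function`, `class_definition`, and `method` entries are assumptions made for illustration.

```python
import ast
from typing import Dict, List

# Hypothetical sample input, not taken from the project.
SAMPLE = '''\
class UserManager:
    """Manages user accounts and preferences"""

    def add_user(self, user_id, name):
        return True

    def update_user(self, user_id, **kwargs):
        return None


def helper():
    pass
'''

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def describe(node, chunk_type: str, hierarchy: List[str]) -> Dict:
    """Build a metadata dict using a subset of the fields shown above."""
    return {
        "chunk_type": chunk_type,
        "name": node.name,
        "hierarchy": hierarchy,
        "language": "python",
        "line_start": node.lineno,    # line of the `def` / `class` keyword
        "line_end": node.end_lineno,  # available on Python 3.8+
    }


def chunk_python_source(source: str) -> List[Dict]:
    """Illustrative only: one entry per top-level function, class, and method."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            chunks.append(describe(node, "function", ["module", node.name]))
        elif isinstance(node, ast.ClassDef):
            chunks.append(describe(node, "class_definition", ["module", node.name]))
            for item in node.body:
                if isinstance(item, FUNCTION_NODES):
                    chunks.append(
                        describe(item, "method", ["module", node.name, item.name])
                    )
    return chunks


if __name__ == "__main__":
    for chunk in chunk_python_source(SAMPLE):
        print(chunk["chunk_type"], chunk["hierarchy"],
              chunk["line_start"], chunk["line_end"])
```

Because boundaries come from the parse tree, each method is reported with its full line range, which is what lets a whole function land in a single chunk instead of being cut at a character offset.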
## API Usage

### Indexing with AST

```python
# AST chunking is applied automatically to supported files (here, a Python file)
response = await mcp.index_code({"file_path": "/path/to/file.py"})
```

### Searching with Hierarchy

When searching, results now include hierarchical information:

```python
results = await mcp.search({"query": "user authentication"})
# Results include hierarchy: ['module', 'AuthManager', 'authenticate']
```

## Future Enhancements

1. **Additional Language Support**
   - JavaScript/TypeScript (using babel parser or tree-sitter)
   - Java (using JavaParser)
   - Rust (using syn or tree-sitter)
   - C/C++ (using tree-sitter)

2. **Advanced Features**
   - Cross-reference analysis
   - Dependency graph building
   - Call hierarchy tracking
   - Symbol resolution

3. **Optimizations**
   - Incremental AST updates
   - Cached parse trees
   - Parallel parsing
   - Language server protocol integration

## Configuration

AST chunking is enabled by default for supported languages (Python, Shell, Go). To disable:

```python
# In code_indexer.py initialization
indexer = CodeIndexer(use_ast_chunking=False)
```

Supported file extensions:

- Python: `.py`
- Shell: `.sh`, `.bash`
- Go: `.go`

## Migration

Existing indexed data remains compatible. To take advantage of AST chunking:

```bash
# Reindex your project
"Reindex this directory"
```

## Technical Details

### Fallback Behavior

The system gracefully falls back to text-based chunking when:

- AST parsing fails (syntax errors)
- Files in unsupported languages are indexed
- Files are too small for meaningful AST analysis

### Size Limits

- Maximum chunk size: 2000 characters (configurable)
- Minimum chunk size: 100 characters
- Large functions are truncated with indicators

### Error Handling

All AST parsing errors are logged and the system falls back to traditional chunking, ensuring indexing always succeeds.

## Conclusion

AST-based chunking represents a significant advancement in code understanding for RAG systems. By treating code as structured data rather than text, we achieve better search results, more efficient token usage, and a superior developer experience.
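As a closing illustration of the fallback behavior described under Technical Details, the sketch below tries AST parsing first and drops back to fixed-size text splitting when parsing fails or the input is very short. All names here (`split_text`, `chunk_with_fallback`, the logger name) are hypothetical and not taken from the project. The 1500-character split size mirrors the "Traditional Chunking" example. Treating the 100-character minimum chunk size as the "too small" threshold is an assumption.

```python
import ast
import logging
from typing import Dict, List

logger = logging.getLogger("ast_chunker_sketch")  # hypothetical logger name


def split_text(source: str, size: int = 1500) -> List[Dict]:
    """Traditional strategy: cut the text every `size` characters."""
    return [
        {"chunk_type": "text", "content": source[i:i + size]}
        for i in range(0, len(source), size)
    ]


def chunk_with_fallback(source: str, min_size: int = 100) -> List[Dict]:
    """Illustrative only: prefer AST chunks, fall back to text chunks."""
    # Assumption: inputs shorter than min_size skip AST analysis entirely.
    if len(source) < min_size:
        return split_text(source)
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Log the failure and keep indexing with the text strategy.
        logger.warning("AST parsing failed (%s); using text chunking", exc)
        return split_text(source)

    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "chunk_type": "ast",
                "name": node.name,
                # Exact source text of the node (Python 3.8+).
                "content": ast.get_source_segment(source, node),
            })
    # A file with no functions or classes still yields text chunks.
    return chunks or split_text(source)
```

Catching only `SyntaxError` keeps the fallback narrow: parse failures degrade to text chunks, while unrelated bugs still surface instead of being silently swallowed.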
