# Smart Reindex Implementation Plan (v0.2.4)
## Overview
Implement a hybrid incremental reindexing system that avoids clearing collections by tracking file changes and only updating what's necessary. This will make reindexing 90%+ faster for typical use cases while preserving existing embeddings.
## Goals
1. **Preserve Unchanged Data**: Keep existing embeddings for unchanged files
2. **Fast Incremental Updates**: Only process changed files
3. **Clean State Management**: Remove stale data from deleted/moved files
4. **Zero Downtime**: No service interruption during reindex
5. **Backward Compatible**: Support existing full reindex when needed
## Technical Approach: Hybrid Strategy
### 1. File State Tracking
Store additional metadata in each chunk:
```python
{
    "file_path": "/path/to/file.py",
    "file_hash": "sha256:abc123...",  # Content hash
    "file_mtime": 1735425600.0,       # Modification time
    "index_time": 1735425700.0,       # When indexed
    "chunk_index": 0,
    "total_chunks": 5,
    # ... existing metadata
}
```
### 2. Change Detection Algorithm
```python
def detect_changes(directory: str, collection_name: str) -> Dict[str, List[str]]:
    """
    Compare current filesystem state with indexed state.
    Returns a dict mapping each status (added/modified/unchanged/deleted)
    to the list of affected file paths.
    """
    # Get all indexed files from Qdrant
    indexed_files = get_indexed_files_metadata(collection_name)
    # Scan current directory
    current_files = scan_directory_with_hashes(directory)
    changes = {
        "added": [],
        "modified": [],
        "unchanged": [],
        "deleted": []
    }
    # Find additions and modifications
    for file_path, current_hash in current_files.items():
        if file_path not in indexed_files:
            changes["added"].append(file_path)
        elif indexed_files[file_path]["file_hash"] != current_hash:
            changes["modified"].append(file_path)
        else:
            changes["unchanged"].append(file_path)
    # Find deletions
    for file_path in indexed_files:
        if file_path not in current_files:
            changes["deleted"].append(file_path)
    return changes
```
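The two helpers referenced above do not exist yet; here is a minimal sketch of each, assuming a configured `qdrant_client` and the `calculate_file_hash` function defined in section 4 below. The Qdrant scroll API pages through points without fetching vectors, which keeps the metadata query cheap:
```python
import os

def get_indexed_files_metadata(collection_name: str) -> Dict[str, Dict[str, Any]]:
    """Build a file_path -> metadata map from chunk payloads already in Qdrant."""
    indexed: Dict[str, Dict[str, Any]] = {}
    offset = None
    while True:
        points, offset = qdrant_client.scroll(
            collection_name=collection_name,
            limit=1000,
            offset=offset,
            with_payload=True,
            with_vectors=False,  # payloads only; no need to pull embeddings
        )
        for point in points:
            payload = point.payload or {}
            if "file_path" in payload:
                indexed[payload["file_path"]] = payload
        if offset is None:
            break
    return indexed

def scan_directory_with_hashes(directory: str) -> Dict[str, str]:
    """Walk the directory and hash every file (pattern filtering omitted here)."""
    hashes: Dict[str, str] = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            hashes[path] = calculate_file_hash(path)
    return hashes
```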
### 3. Incremental Update Process
```python
def smart_reindex_directory(
    directory: str,
    incremental: bool = True,
    force_full: bool = False
) -> Dict[str, Any]:
    """Smart reindex with incremental updates."""
    if force_full or not incremental:
        # Fall back to current behavior
        return full_reindex_directory(directory)
    # Detect changes
    collection_name = get_collection_name()
    changes = detect_changes(directory, collection_name)
    # Apply changes
    results = {
        "added": 0,
        "modified": 0,
        "deleted": 0,
        "unchanged": len(changes["unchanged"]),
        "errors": []
    }
    # Delete chunks for removed/modified files
    for file_path in changes["deleted"] + changes["modified"]:
        delete_file_chunks(file_path, collection_name)
        if file_path in changes["deleted"]:
            results["deleted"] += 1
    # Index new/modified files
    for file_path in changes["added"] + changes["modified"]:
        try:
            index_file_with_hash(file_path)
            if file_path in changes["added"]:
                results["added"] += 1
            else:
                results["modified"] += 1
        except Exception as e:
            results["errors"].append({
                "file": file_path,
                "error": str(e)
            })
    return results
```
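`index_file_with_hash` above is the piece that writes the new metadata schema from section 1. A minimal sketch follows, where `chunk_file` and `embed_text` stand in for the project's existing chunking and embedding helpers (both hypothetical names here):
```python
import os
import time
import uuid

def index_file_with_hash(file_path: str) -> None:
    """Chunk, embed, and upsert a file, attaching change-tracking metadata."""
    file_hash = calculate_file_hash(file_path)
    file_mtime = os.path.getmtime(file_path)
    chunks = chunk_file(file_path)       # hypothetical existing helper
    points = []
    for i, chunk in enumerate(chunks):
        points.append(models.PointStruct(
            id=str(uuid.uuid4()),
            vector=embed_text(chunk),    # hypothetical existing helper
            payload={
                "file_path": file_path,
                "file_hash": file_hash,
                "file_mtime": file_mtime,
                "index_time": time.time(),
                "chunk_index": i,
                "total_chunks": len(chunks),
            },
        ))
    qdrant_client.upsert(collection_name=get_collection_name(), points=points)
```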
### 4. Hash Calculation
Use SHA256 for reliable change detection:
```python
import hashlib

def calculate_file_hash(file_path: str) -> str:
    """Calculate SHA256 hash of file content."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in 4 KB blocks so large files never load fully into memory
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return f"sha256:{sha256_hash.hexdigest()}"
```
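Hashing every file on every run can dominate the cost for large repos. Since `file_mtime` is stored alongside the hash, a cheap pre-check can skip hashing files whose mtime has not moved. Treating an identical mtime as identical content is a deliberate trade-off, sketched here:
```python
import os

def file_changed(file_path: str, indexed_meta: Dict[str, Any]) -> bool:
    """mtime pre-check first; hash only the files whose mtime differs."""
    if os.path.getmtime(file_path) == indexed_meta.get("file_mtime"):
        return False  # assumption: identical mtime implies identical content
    return calculate_file_hash(file_path) != indexed_meta.get("file_hash")
```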
### 5. Stale Data Cleanup
```python
def delete_file_chunks(file_path: str, collection_name: str):
    """Delete all chunks for a specific file."""
    qdrant_client.delete(
        collection_name=collection_name,
        points_selector=models.FilterSelector(
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="file_path",
                        match=models.MatchValue(value=file_path)
                    )
                ]
            )
        )
    )
```
## Implementation Steps
### Phase 1: Core Infrastructure (Day 1-2)
1. **Update Metadata Schema**
- Add file_hash field to all indexers
- Add index_time tracking
- Ensure backward compatibility
2. **Implement Hash Calculation**
- SHA256 hashing function
- Integrate into all indexers
- Handle large files efficiently
3. **Create Change Detection**
- Query indexed files metadata
- Compare with filesystem state
- Return change report
### Phase 2: Incremental Logic (Day 2-3)
1. **Implement Smart Reindex**
- Add incremental parameter
- Detect changes before processing
- Apply incremental updates
2. **Update Delete Operations**
- Create delete_file_chunks function
- Handle collection filtering
- Ensure complete cleanup
3. **Progress Tracking**
- Report files processed
- Show time savings
- Log detailed operations
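For the first two points of the Progress Tracking step, a one-line summary built from the `results` dict returned by `smart_reindex_directory` would suffice; a sketch:
```python
def format_reindex_summary(results: Dict[str, Any], elapsed_s: float) -> str:
    """One-line summary showing work done and work avoided."""
    processed = results["added"] + results["modified"] + results["deleted"]
    skipped = results["unchanged"]
    total = processed + skipped
    pct = (100 * skipped / total) if total else 0
    return (f"Reindexed {processed} file(s) in {elapsed_s:.1f}s; "
            f"skipped {skipped} unchanged ({pct:.0f}% of {total}).")
```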
### Phase 3: Optimizations (Day 3-4)
1. **Git Integration** (sketch after this list)
- Use git status for faster change detection
- Handle git-ignored files properly
- Optimize for git workflows
2. **Batch Operations** (sketch after this list)
- Batch deletions for efficiency
- Parallel file processing
- Memory-efficient scanning
3. **Error Handling** (sketch after this list)
- Graceful fallback to full reindex
- Detailed error reporting
- Recovery mechanisms
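For Git Integration, `git status --porcelain` gives a near-instant change list for tracked files. One caveat: it reports changes relative to HEAD, not relative to the last index run, so a complete integration would also record the commit hash at index time and diff against it. A minimal sketch:
```python
import subprocess
from typing import Optional

def detect_changes_via_git(repo_dir: str) -> Optional[Dict[str, List[str]]]:
    """Fast-path change detection; returns None so callers can fall back to hashing."""
    try:
        out = subprocess.run(
            ["git", "-C", repo_dir, "status", "--porcelain"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None  # not a git repo or git unavailable
    changes: Dict[str, List[str]] = {"added": [], "modified": [], "deleted": []}
    for line in out.splitlines():
        status, path = line[:2], line[3:]
        if " -> " in path:  # rename: old path deleted, new path added
            old, path = path.split(" -> ", 1)
            changes["deleted"].append(old)
            changes["added"].append(path)
        elif "?" in status or "A" in status:
            changes["added"].append(path)
        elif "D" in status:
            changes["deleted"].append(path)
        else:
            changes["modified"].append(path)
    return changes
```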
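For Batch Operations, Qdrant's `MatchAny` condition lets a single delete call cover all removed/modified files instead of issuing one request per file:
```python
def delete_file_chunks_batch(file_paths: List[str], collection_name: str) -> None:
    """Delete the chunks of many files in a single request."""
    if not file_paths:
        return
    qdrant_client.delete(
        collection_name=collection_name,
        points_selector=models.FilterSelector(
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="file_path",
                        match=models.MatchAny(any=file_paths),
                    )
                ]
            )
        ),
    )
```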
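For Error Handling, the graceful fallback can be a thin wrapper; `logger` here stands in for whatever logging the project already uses:
```python
def reindex_with_fallback(directory: str) -> Dict[str, Any]:
    """Try the incremental path; fall back to a full reindex on any failure."""
    try:
        return smart_reindex_directory(directory, incremental=True)
    except Exception as exc:
        logger.warning("Incremental reindex failed (%s); running full reindex", exc)
        return full_reindex_directory(directory)
```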
### Phase 4: Testing & Polish (Day 4-5)
1. **Comprehensive Testing** (test sketch after this list)
- Unit tests for change detection
- Integration tests for reindex
- Performance benchmarks
2. **Documentation**
- Update API documentation
- Add usage examples
- Performance comparisons
3. **User Experience**
- Clear progress messages
- Helpful error messages
- Performance statistics
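A unit test for change detection can stub out the Qdrant and filesystem helpers entirely; a pytest sketch, assuming the functions live in a module named `reindex` (hypothetical name):
```python
import reindex  # hypothetical module containing detect_changes

def test_detect_changes_classifies_all_states(monkeypatch):
    indexed = {
        "a.py": {"file_hash": "sha256:aaa"},
        "b.py": {"file_hash": "sha256:bbb"},
        "gone.py": {"file_hash": "sha256:ccc"},
    }
    current = {"a.py": "sha256:aaa", "b.py": "sha256:changed", "new.py": "sha256:ddd"}
    monkeypatch.setattr(reindex, "get_indexed_files_metadata", lambda _: indexed)
    monkeypatch.setattr(reindex, "scan_directory_with_hashes", lambda _: current)
    changes = reindex.detect_changes(".", "code-chunks")
    assert changes["added"] == ["new.py"]
    assert changes["modified"] == ["b.py"]
    assert changes["unchanged"] == ["a.py"]
    assert changes["deleted"] == ["gone.py"]
```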
## API Changes
### Updated reindex_directory
```python
@mcp.tool()
async def reindex_directory(
    directory: str = ".",
    patterns: Optional[List[str]] = None,
    recursive: bool = True,
    force: bool = False,
    incremental: bool = True  # NEW parameter
) -> Dict[str, Any]:
    """
    Reindex a directory with smart incremental updates.

    Args:
        directory: Directory to reindex
        patterns: File patterns to include
        recursive: Search subdirectories
        force: Skip confirmation
        incremental: Use smart incremental reindex (default: True)

    Returns:
        Reindex results with statistics
    """
```
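For illustration, a successful incremental run would return the statistics dict built in `smart_reindex_directory` above (values here are made up):
```python
{
    "added": 2,
    "modified": 5,
    "deleted": 1,
    "unchanged": 992,
    "errors": []
}
```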
## Performance Expectations
### Typical Reindex Scenarios
1. **No Changes**: <1 second (metadata check only)
2. **Few Files Changed**: 5-10 seconds (only changed files)
3. **Major Refactor**: 30-60 seconds (many files)
4. **Full Reindex**: Same as current (all files)
### Projected Speedups
Estimates relative to the current full-reindex baseline, to be validated by the Phase 4 benchmarks:
- 1,000 files, 10 changed: ~95% faster
- 5,000 files, 50 changed: ~90% faster
- 10,000 files, 500 changed: ~80% faster
## Risk Mitigation
1. **Data Integrity**
- Verify hash calculations
- Test deletion logic thoroughly
- Add data validation checks
2. **Performance**
- Monitor memory usage
- Optimize for large directories
- Add progress indicators
3. **Compatibility**
- Maintain backward compatibility
- Support force full reindex
- Handle edge cases gracefully
## Success Criteria
1. **Functional**
- ✓ Correctly detects all file changes
- ✓ Updates only changed files
- ✓ Removes stale data properly
- ✓ Maintains data integrity
2. **Performance**
- ✓ 90%+ faster for typical updates
- ✓ Sub-second for no changes
- ✓ Memory efficient for large repos
3. **User Experience**
- ✓ Clear progress reporting
- ✓ Helpful error messages
- ✓ Seamless upgrade path
## Next Steps After v0.2.4
This smart reindex capability enables:
- Continuous indexing workflows
- Watch mode for real-time updates
- Integration with file watchers
- Background index maintenance
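As a sketch of where the file-watcher integration could go, the incremental primitives above plug directly into a `watchdog` handler (the wiring here is illustrative, not part of v0.2.4):
```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class IndexUpdateHandler(FileSystemEventHandler):
    """Keep the index fresh by reusing the incremental primitives."""

    def on_created(self, event):
        if not event.is_directory:
            index_file_with_hash(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            delete_file_chunks(event.src_path, get_collection_name())
            index_file_with_hash(event.src_path)

    def on_deleted(self, event):
        if not event.is_directory:
            delete_file_chunks(event.src_path, get_collection_name())

observer = Observer()
observer.schedule(IndexUpdateHandler(), path=".", recursive=True)
observer.start()
```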