# Files-DB-MCP: Architecture Overview

## System Purpose

Files-DB-MCP is a local vector database system that gives LLM coding agents fast, efficient search over software project files. Instead of traditional grep/glob searching, it enables semantic understanding and retrieval of code through vector embeddings, all accessible via the Model Context Protocol (MCP).

The system focuses on zero-configuration deployment to any existing project: a single-command startup immediately begins indexing and provides search capabilities that improve progressively as indexing proceeds.

## High-Level Architecture

```
┌─────────────────┐     ┌─────────────────────┐     ┌───────────────────┐
│                 │     │                     │     │                   │
│  Client Tools   │◄────┤    MCP Interface    │◄────┤   Vector Search   │
│ (Claude Code,   │     │ (stdio/SSE Handler) │     │      Engine       │
│  Cursor, etc.)  │     │                     │     │                   │
└─────────────────┘     └─────────────────────┘     └─────────┬─────────┘
                                                              │
                                                              │
                        ┌─────────────────────┐     ┌─────────▼─────────┐
                        │                     │     │                   │
                        │   File Processor    │────►│  Vector Database  │
                        │ (Indexer/Chunker)   │     │                   │
                        │                     │     │                   │
                        └─────────────────────┘     └───────────────────┘
```

## Core Components

### 1. MCP Interface

Responsible for communication with client tools using the Model Context Protocol.

**Key Functions:**

- Parse incoming MCP messages from clients
- Route queries to the vector search engine
- Format search results as MCP responses
- Handle both stdio and SSE communication modes
- Manage connection lifecycle and error handling

**Technologies:**

- Python with standard libraries for stdio handling
- FastAPI or similar for SSE endpoints (if needed)
- JSON for message formatting
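As a minimal sketch of the stdio mode, the loop below reads one JSON message per line and writes one JSON response per line. The `method`/`params` field names and the `search_engine.query()` callable are illustrative assumptions, not the actual MCP schema:

```python
import json
import sys


def serve_stdio(search_engine) -> None:
    """Line-delimited JSON loop: one request per stdin line, one response per stdout line.

    `search_engine` is a hypothetical object exposing
    query(text, limit) -> list of result dicts.
    """
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
            if message.get("method") == "search":
                params = message.get("params", {})
                results = search_engine.query(
                    params["query"], limit=params.get("limit", 10)
                )
                response = {"id": message.get("id"), "result": results}
            else:
                response = {
                    "id": message.get("id"),
                    "error": f"unknown method: {message.get('method')!r}",
                }
        except (json.JSONDecodeError, KeyError) as exc:
            # Malformed message: report the error rather than crash the loop.
            response = {"error": str(exc)}
        print(json.dumps(response), flush=True)
```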
### 2. File Processor

Handles scanning, parsing, and embedding generation for project files.

**Key Functions:**

- Automatically start scanning project directories on container launch
- Recursively traverse files with smart exclusions (.git, node_modules, etc.)
- Continuously monitor the file system for new, modified, and deleted files
- Detect file changes and perform real-time incremental updates to the index
- Parse different file types appropriately
- Chunk files into appropriately sized segments
- Generate embeddings for file chunks using open source models
- Store file metadata and embeddings in the database
- Provide immediate search capabilities while indexing continues in the background

**Technologies:**

- Python for file operations
- Language-specific parsers for code understanding
- Open source embedding models from Hugging Face optimized for code
- Sentence Transformers with quantization and binary embedding support
- Configurable model selection and embedding parameters
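A minimal sketch of the scan-chunk-embed pipeline, assuming the `sentence-transformers` package. The model name, the fixed-size line chunking, and the exclusion set are illustrative placeholders for what is configurable in practice:

```python
from pathlib import Path

from sentence_transformers import SentenceTransformer

EXCLUDED_DIRS = {".git", "node_modules", "__pycache__"}  # smart exclusions
CHUNK_LINES = 40  # naive fixed-size chunking; real strategies are file-type-aware

# Model choice is an assumption; the actual model is configurable.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def chunk_file(path: Path) -> list[str]:
    """Split a file into fixed-size line windows (deliberately simple)."""
    lines = path.read_text(errors="ignore").splitlines()
    return [
        "\n".join(lines[i : i + CHUNK_LINES])
        for i in range(0, len(lines), CHUNK_LINES)
    ]


def index_project(root: Path):
    """Yield (path, chunk_index, chunk_text, embedding) for every indexable file."""
    for path in root.rglob("*"):
        if not path.is_file() or EXCLUDED_DIRS & set(path.parts):
            continue
        chunks = chunk_file(path)
        if not chunks:
            continue
        embeddings = model.encode(chunks)  # one vector per chunk
        for i, vector in enumerate(embeddings):
            yield str(path), i, chunks[i], vector
```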
### 3. Vector Search Engine

Core search functionality using vector similarity.

**Key Functions:**

- Convert search queries to embeddings
- Perform vector similarity search
- Filter results based on metadata
- Rank results by relevance
- Format results with file paths and snippets

**Technologies:**

- Vector similarity algorithms (cosine, dot product)
- Query preprocessing techniques
- Relevance scoring mechanisms

### 4. Vector Database

Storage for file embeddings and metadata.

**Key Functions:**

- Store and retrieve vector embeddings
- Maintain file metadata (path, type, modified date)
- Support fast similarity search
- Handle database versioning
- Manage persistence and updates
- Support different embedding formats (full, quantized, binary)

**Technologies:**

- Qdrant, Milvus, or a similar vector database
- SQLite for metadata (if needed)
- File-based storage for persistence
- Support for HNSW and other efficient indexing algorithms

## Data Flow

1. **Initial Indexing Process:**
   - The File Processor scans project directories immediately on startup
   - Files are parsed and chunked
   - Chunks are converted to embeddings
   - Embeddings and metadata are stored in the Vector Database
   - Search becomes available for indexed files while indexing continues

2. **Continuous Monitoring Process:**
   - A file system watcher monitors for changes (new, modified, and deleted files)
   - Changed files are immediately processed and re-indexed
   - The index is updated in real time as developers modify the codebase
   - Search results reflect the current state of the project

3. **Search Process:**
   - A client sends a query via the MCP Interface
   - The query is routed to the Vector Search Engine
   - The query is converted to an embedding
   - The Vector Database performs a similarity search
   - Results are ranked and formatted
   - The response is sent back via the MCP Interface
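The search process can be sketched end to end with a brute-force cosine-similarity scan standing in for the vector database; `model` and `index_entries` are assumed to come from an indexing step like the one sketched above:

```python
import numpy as np


def search(query: str, model, index_entries, limit: int = 10):
    """Rank indexed chunks by cosine similarity to the query embedding.

    `index_entries` is a list of (path, chunk_index, chunk_text, vector)
    tuples, standing in for what a real vector database (Qdrant, Milvus, ...)
    would store and search with an HNSW index instead of this linear scan.
    """
    query_vec = model.encode([query])[0]
    vectors = np.stack([vec for _, _, _, vec in index_entries])
    # Cosine similarity: normalize both sides, then take dot products.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = vectors @ query_vec
    best = np.argsort(scores)[::-1][:limit]
    return [
        {
            "path": index_entries[i][0],
            "snippet": index_entries[i][2][:200],
            "score": float(scores[i]),
        }
        for i in best
    ]
```

In production the linear scan is the part a dedicated vector database replaces: an HNSW index answers the same nearest-neighbor query in sublinear time.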
## Deployment Architecture

The system is designed to run locally as a Docker-based service with zero configuration required:

```
┌─────────────────────────────────────────────────────────────────┐
│                   Docker Compose Environment                    │
│                                                                 │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │                 │        │                 │               │
│   │  Main Service   │◄──────►│ Vector Database │               │
│   │ (Auto-starting) │        │                 │               │
│   └────────┬────────┘        └─────────────────┘               │
│            │                                                   │
└────────────┼────────────────────────────────────────────────────┘
             │                         ▲
             │                         │  Single Command
             ▼                         │  Startup
    ┌──────────────────┐       ┌──────────────┐     ┌─────────────┐
    │                  │       │              │     │             │
    │   Client Tools   │       │   Project    │◄────┤  Developer  │
    │                  │       │  Directory   │     │  Workspace  │
    └──────────────────┘       └──────────────┘     └─────────────┘
```

- **One-Command Startup**: Simple CLI command that works in any project directory
- **Main Service Container**: Auto-starting container that runs the MCP Interface, File Processor, and Vector Search Engine
- **Vector Database Container**: Runs the vector database service with automatic initialization
- **Shared Volume**: Persistent storage for embeddings and metadata
- **Project Directory Mounting**: Automatic mounting of project files for indexing
- **Network Bridge**: Communication between containers and LLM tools

## Key Technical Considerations

1. **Performance Optimization**
   - Efficient chunking strategies for different file types
   - Caching mechanisms for frequent queries
   - Incremental indexing to minimize resource usage
   - Query optimization for fast responses
   - Embedding quantization and binary embedding options for space efficiency

2. **Embedding Quality**
   - Selection of appropriate open source models from Hugging Face
   - Configurable embedding parameters and model selection
   - Sentence Transformers with quantization and binary embedding support
   - Chunking strategies that preserve semantic meaning
   - Handling of different programming languages

3. **Scalability**
   - Efficient handling of large codebases
   - Resource usage optimization
   - Pagination for large result sets

4. **Security**
   - Running with appropriate container permissions
   - Avoiding exposure of sensitive code
   - Validation of inputs and paths

5. **Extensibility**
   - Pluggable architecture for embedding models
   - Support for different vector databases
   - Extensible query interface

6. **Zero-Configuration Design**
   - Auto-detection of project structure and languages
   - Smart defaults for different project types
   - Progressive indexing with immediate search availability
   - Continuous file monitoring and real-time index updates
   - Background processing with minimal resource impact
   - One-command startup for any project

## Future Extensions

1. **Advanced Filtering**
   - Language-specific code search
   - Function/class level search
   - Code structure awareness

2. **Integration Options**
   - Support for additional LLM tools
   - IDE plugin support
   - CI/CD integration

3. **Intelligent Indexing**
   - Language-aware parsing
   - Code structure understanding
   - Dependency relationship mapping

4. **Performance Enhancements**
   - Distributed search for very large codebases
   - Optimized embedding models for code
   - Advanced caching strategies