MCP Indexer

Overview InspectNew Schema Related Servers Score

MIT License

mcpIndexer
docs

ARCHITECTURE.md•6.84 kB

# Architecture This document describes the internal architecture of Scout. ## Components ### 1. Parser (`parser.py`) Tree-sitter based multi-language AST parsing. Extracts symbols, functions, classes, and imports from source files. **Supported Languages**: Python, JavaScript, TypeScript, Ruby, Go ### 2. Chunker (`chunker.py`) Intelligent code chunking that respects AST boundaries. Splits code into semantic units while: - Preserving context (imports, class definitions) - Maintaining reasonable token sizes for embeddings - Tracking line numbers and parent symbols ### 3. Embeddings (`embeddings.py`) ChromaDB + sentence-transformers for semantic search. Manages: - Embedding generation from code chunks - Vector storage and retrieval - Metadata tracking (file paths, symbols, languages) ### 4. Indexer (`indexer.py`) Orchestrates the indexing pipeline: parsing → chunking → embedding → storage. Provides: - Single repository indexing (`RepoIndexer`) - Multi-repository management (`MultiRepoIndexer`) - Incremental updates based on git commits ### 5. Dependency Analyzer (`dependency_analyzer.py`) Tracks imports and dependencies within and across repositories: - Extracts import statements - Maps internal vs external dependencies - Identifies cross-repository dependencies - Suggests missing repositories to add ### 6. Stack Config (`stack_config.py`) Persistent configuration management stored at `~/.scout/stack.json`: - Repository metadata (name, path, status) - Git commit tracking for incremental updates - Auto-reindex settings ### 7. MCP Server (`server.py`) Exposes 13 tools via Model Context Protocol: - Semantic search and Q&A - Symbol lookup (definitions and references) - Repository management - Cross-repo dependency analysis ### 8. CLI (`cli.py`) Command-line interface for: - Interactive setup wizard - Repository management - Status monitoring - Git hook installation ## Indexing Pipeline ``` Code File ↓ Parser (Tree-sitter AST) ↓ Chunker (Semantic Units) ↓ Embeddings (sentence-transformers) ↓ ChromaDB Store ``` ### Pipeline Details 1. **Parsing**: Tree-sitter parses source files into Abstract Syntax Trees (ASTs) 2. **Symbol Extraction**: Extracts functions, classes, methods, and import statements 3. **Chunking**: Groups code into semantic units (functions, classes, modules) 4. **Context Preservation**: Includes imports and parent class context in chunks 5. **Embedding Generation**: sentence-transformers creates vector embeddings 6. **Storage**: ChromaDB stores embeddings with metadata for fast retrieval ## Data Flow ### Indexing a Repository ``` User: scout add /path/to/repo ↓ CLI validates path and creates RepoIndexer ↓ Scanner walks file tree (filters by language) ↓ For each file: Parser → Chunker → Embeddings ↓ Store chunks in ChromaDB ↓ Update stack.json with metadata ``` ### Semantic Search Query ``` User: "authentication logic" ↓ MCP Server receives query ↓ EmbeddingStore generates query embedding ↓ ChromaDB performs vector similarity search ↓ Results ranked by cosine similarity ↓ Returns code chunks with metadata ``` ## Performance Based on testing with real-world repositories: | Metric | Value | |--------|-------| | **Indexing Speed** | ~56 files/sec | | **Small Repo** (162 files) | 302 chunks in 1.86s | | **Multi-Repo** (3 repos, 255 files) | 595 chunks in 4.58s | | **Search Latency** | ~100-200ms per query | ### Performance Characteristics **Indexing**: - CPU-bound during parsing and chunking - I/O-bound during embedding generation - Memory usage scales with repo size (~1-2GB for large monoliths) **Searching**: - Sub-second for most queries - Scales logarithmically with corpus size (ChromaDB vector index) - GPU acceleration available for embedding generation ## Storage ### ChromaDB Collection Schema Each code chunk is stored with: - **id**: Unique identifier (`repo_name:file_path:chunk_index`) - **embedding**: 384-dimensional vector (sentence-transformers) - **metadata**: - `file_path`: Relative path within repository - `repo_name`: Repository identifier - `language`: Source language (python, javascript, etc.) - `start_line`, `end_line`: Line number range - `symbol_name`: Function/class name (if applicable) - `symbol_type`: function, class, method, etc. - `parent_symbol`: Parent class for methods ### Stack Configuration (`~/.scout/stack.json`) ```json { "repos": { "repo-name": { "path": "/absolute/path/to/repo", "name": "repo-name", "status": "indexed", "indexed_at": "2024-01-15T10:30:00", "last_commit": "abc123...", "files_indexed": 162, "chunks_indexed": 302, "auto_reindex": true } } } ``` ## Incremental Updates Scout tracks git commits to enable incremental reindexing: 1. **Commit Tracking**: Stores last indexed commit hash in stack.json 2. **Change Detection**: Compares current HEAD with stored commit 3. **Selective Reindex**: Only reindexes if commits have changed 4. **Git Hooks**: Optional post-merge hook for automatic reindexing ## Dependency Analysis ### Internal Dependencies Tracks imports within a repository to understand module relationships. ### External Dependencies Identifies imported packages (e.g., `requests`, `lodash`) to understand external dependencies. ### Cross-Repository Dependencies Detects when one indexed repository imports from another indexed repository, enabling: - Dependency graph visualization - Missing repository suggestions - Impact analysis for changes ## Extension Points ### Adding Language Support To add a new language: 1. Install tree-sitter grammar: `pip install tree-sitter-<language>` 2. Add language to `parser.py` LANGUAGE_MAP 3. Implement symbol extraction for language's AST structure 4. Add file extension mapping ### Custom Chunking Strategies Override `CodeChunker` methods: - `chunk_by_function()`: Per-function chunking - `chunk_by_class()`: Per-class chunking - `chunk_by_symbols()`: Custom symbol-based chunking ### Alternative Embedding Models Replace sentence-transformers in `embeddings.py`: - OpenAI embeddings (requires API key) - Local LLM embeddings (ollama, llama.cpp) - Custom fine-tuned models ## Security Considerations - **Code Privacy**: All embeddings stored locally, no external API calls - **File Access**: Respects .gitignore patterns by default - **Credentials**: No credentials stored or transmitted - **Sandboxing**: Parser runs in same process (no sandboxing currently) ## Future Improvements See open issues for planned enhancements: - Language Server Protocol (LSP) integration - IDE plugins (VS Code, JetBrains) - Distributed indexing for large organizations - Advanced query syntax (boolean operators, filters) - Code change impact analysis

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/gkatechis/mcpIndexer'

If you have feedback or need assistance with the MCP directory API, please join our Discord server