# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Codebase Contextifier 9000 is a Docker-based Model Context Protocol (MCP) server for semantic code search. It uses AST-aware chunking with tree-sitter, local embedding models served via Ollama, and a Qdrant vector database for fast semantic search across codebases.
## Development Commands
### Docker Operations
```bash
# Build and start services
docker-compose up -d
# View logs
docker-compose logs -f mcp-server
docker-compose logs -f qdrant
# Restart after code changes
docker-compose restart mcp-server
# Rebuild after dependency changes
docker-compose up -d --build
# Stop all services
docker-compose down
# Clean start (removes volumes)
docker-compose down -v
```
### Local Development (without Docker)
```bash
# Install dependencies
pip install -r requirements.txt
# Start Qdrant locally
docker run -p 6333:6333 qdrant/qdrant
# Set environment variables
export QDRANT_HOST=localhost
export OLLAMA_HOST=http://localhost:11434
export INDEX_PATH=./index
export CACHE_PATH=./cache
export WORKSPACE_PATH=/path/to/codebase
# Run MCP server
python -m src.server
```
### Testing and Code Quality
```bash
# Run tests
pytest
# Format code
black src/
# Lint code
ruff check src/
```
## Architecture Overview
### Core Components
The system has 8 main modules:
1. **MCP Server** (`src/server.py`): FastMCP server exposing tools via stdio protocol
2. **AST Chunker** (`src/indexer/ast_chunker.py`): Tree-sitter-based semantic code parsing
3. **Merkle Tree Indexer** (`src/indexer/merkle_tree.py`): Blake3-based incremental change detection
4. **Embeddings** (`src/indexer/embeddings.py`): Ollama integration with content-addressable caching
5. **Vector DB** (`src/vector_db/qdrant_client.py`): Qdrant client for semantic search (code collections)
6. **Knowledge DB** (`src/vector_db/knowledge_db.py`): Separate Qdrant collection for dependency knowledge
7. **Job Manager** (`src/indexer/job_manager.py`): Background job orchestration and Docker container spawning
8. **Dependency Detector** (`src/indexer/dependency_detector.py`): WordPress, Composer, npm dependency detection
### Container-Based Indexing Architecture
The system uses on-demand Docker containers for indexing any repository on the host machine:
```
MCP Client → index_repository(host_path="/Users/you/projects/my-app")
↓
JobManager (src/server.py)
↓
Spawns ephemeral indexer container via Docker API
↓
Mounts host path → /workspace
↓
Indexer script (scripts/indexer.py) runs inside container
↓
AST Chunker → Embeddings → Shared Qdrant/Volumes
↓
Container auto-removes on completion
↓
Job status available via get_job_status()
```
**Key features:**
- **On-demand spawning**: No need to pre-configure docker-compose volumes
- **Shared backend**: All repositories write to same Qdrant instance
- **Background jobs**: Track progress with job_id, supports cancellation
- **Auto-cleanup**: Containers are removed after indexing completes
### Data Flow
```
File → AST Chunker → Semantic Chunks → Embeddings → Qdrant (code_chunks collection)
  ↓                                                       ↓
Merkle Tree ← IncrementalIndexingSession              SearchTool
  ↓
Per-repo state stored in /index/{repo_name}/
```
1. **Indexing**: Files are parsed with tree-sitter to extract semantic chunks (functions, classes, methods)
2. **Hashing**: Each file gets a Blake3 content hash stored in merkle tree for change detection (per-repo)
3. **Embedding**: Chunks are sent to Ollama for embedding generation (cached by content hash in shared /cache)
4. **Storage**: Embeddings + metadata stored in Qdrant vector database with repo_name metadata
5. **Search**: Query embeddings are compared against stored vectors using cosine similarity (filter by repo_name if needed)
### AST-Aware Chunking
The chunker (`ast_chunker.py`) respects semantic boundaries (see the sketch after this list):
- Functions, methods, classes, interfaces are extracted as complete units
- Each chunk includes context (e.g., parent class name for methods)
- Avoids splitting functions mid-code (30% better accuracy than fixed-size chunking)
- Supports 10+ languages via tree-sitter grammars
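A minimal sketch of this approach, assuming the `tree_sitter` and `tree_sitter_python` packages (the real chunker generalizes across languages via `config/languages.json`, and constructor details vary slightly between py-tree-sitter versions):
```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

def chunk_source(source: bytes) -> list[dict]:
    """Extract top-level functions and classes as complete chunks."""
    tree = parser.parse(source)
    chunks = []
    for node in tree.root_node.children:
        if node.type in ("function_definition", "class_definition"):
            chunks.append({
                "chunk_type": node.type,
                "content": source[node.start_byte:node.end_byte].decode("utf-8"),
                "start_line": node.start_point[0] + 1,  # tree-sitter rows are 0-based
            })
    return chunks
```
The production chunker additionally records context, such as the parent class name for methods, per the list above.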
### Incremental Indexing
Uses a Merkle-tree pattern for efficiency (hash-comparison sketch after this list):
- **First indexing**: All files are parsed, embedded, and stored
- **Re-indexing**: Only changed files are re-processed (detected via Blake3 hash comparison)
- **Cache hit rate**: Typically 80-95% on subsequent runs
- Session pattern: `IncrementalIndexingSession` context manager handles batch updates
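The detection step itself is simple; a sketch using the `blake3` package (function and dict names here are illustrative, the real logic lives in `src/indexer/merkle_tree.py`):
```python
from blake3 import blake3

def changed_files(paths: list[str], stored: dict[str, str]) -> list[str]:
    """Return paths whose Blake3 content hash differs from the stored state."""
    changed = []
    for path in paths:
        with open(path, "rb") as f:
            digest = blake3(f.read()).hexdigest()
        if stored.get(path) != digest:
            changed.append(path)
    return changed
```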
### File Watcher
Real-time monitoring (`src/indexer/file_watcher.py`):
- Uses `watchdog` library to monitor filesystem events
- Debounces changes (default 2 seconds) to avoid re-indexing during rapid edits (see the sketch below)
- Automatically re-indexes modified files and removes deleted files from vector DB
- Runs async in background task started at server initialization
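A stripped-down sketch of the debounce idea using `watchdog` (class and method names are illustrative, not the actual implementation):
```python
import time
from watchdog.events import FileSystemEventHandler

class DebouncedHandler(FileSystemEventHandler):
    """Record the last event time per path; a background task re-indexes
    paths whose most recent event is older than the debounce window."""

    def __init__(self, debounce_seconds: float = 2.0):
        self.debounce_seconds = debounce_seconds
        self.pending: dict[str, float] = {}

    def on_modified(self, event):
        if not event.is_directory:
            self.pending[event.src_path] = time.monotonic()

    def due_for_reindex(self) -> list[str]:
        now = time.monotonic()
        due = [p for p, t in self.pending.items() if now - t >= self.debounce_seconds]
        for p in due:
            del self.pending[p]
        return due
```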
### Content-Addressable Caching
Embeddings are cached using content hashing:
```python
from blake3 import blake3

cache_key = blake3((model_name + chunk_content).encode("utf-8")).hexdigest()
```
Benefits:
- Team members can share cache across machines
- Re-indexing after git operations is fast (unchanged files use cache)
- Deterministic: same content always produces same cache key
## MCP Tools Reference
The server exposes the following MCP tools:
**Indexing:**
- `index_repository(host_path, repo_name, incremental, exclude_patterns)` - Spawn container to index any repository on host
- `get_job_status(job_id)` - Get progress of background indexing job
- `list_indexing_jobs()` - List all indexing jobs (past and present)
- `cancel_indexing_job(job_id)` - Cancel a running indexing job
**Search:**
- `search_code(query, limit, repo_name, language, file_path_filter, chunk_type)` - Semantic search across all indexed repos
**Symbols:**
- `get_symbols(file_path, symbol_type)` - Extract AST symbols from file
**Status:**
- `get_indexing_status()` - Get index statistics (code_db, knowledge_db, cache)
- `clear_index()` - Clear all indexed data
- `get_watcher_status()` - Check file watcher status
- `health_check()` - Check component health
**Dependencies (WordPress, Composer, npm):**
- `detect_dependencies(workspace_path)` - Detect available dependencies in workspace
- `list_indexed_dependencies()` - List dependencies already indexed in knowledge base
- `index_dependencies(dependency_names, workspace_id, workspace_path)` - Index specific dependencies
All tools are auto-exposed by FastMCP via the `@mcp.tool` decorator.
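Registration looks roughly like this (a sketch assuming the `fastmcp` package; the server name and tool body are illustrative):
```python
from fastmcp import FastMCP

mcp = FastMCP("codebase-contextifier-9000")

@mcp.tool
async def health_check() -> dict:
    """Check component health."""
    # The real tool pings Ollama, Qdrant, and the cache.
    return {"status": "ok"}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```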
## Configuration
Key environment variables (set in `.env` or `docker-compose.yml`; an example `.env` follows the list):
- `CODEBASE_PATH` - Path to codebase to mount (Docker host path)
- `WORKSPACE_PATH` - Path inside container where codebase is mounted (default: `/workspace`)
- `OLLAMA_HOST` - Ollama API endpoint (default: `http://host.docker.internal:11434`)
- `EMBEDDING_MODEL` - Ollama model name (default: `embeddinggemma:latest`)
- `QDRANT_HOST` / `QDRANT_PORT` - Qdrant connection settings
- `INDEX_PATH` - Directory for merkle tree state (default: `/index`)
- `CACHE_PATH` - Directory for embedding cache (default: `/cache`)
- `MAX_CHUNK_SIZE` - Maximum characters per chunk (default: 2048)
- `BATCH_SIZE` - Embedding batch size (default: 32)
- `MAX_CONCURRENT_EMBEDDINGS` - Concurrent requests to Ollama (default: 4)
- `ENABLE_FILE_WATCHER` - Enable real-time file monitoring (default: `true`)
- `WATCHER_DEBOUNCE_SECONDS` - Delay before processing changes (default: 2.0)
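For local (non-Docker) development, a minimal `.env` might look like this (values illustrative; the defaults above apply when a variable is unset):
```bash
QDRANT_HOST=localhost
QDRANT_PORT=6333
OLLAMA_HOST=http://localhost:11434
EMBEDDING_MODEL=embeddinggemma:latest
INDEX_PATH=./index
CACHE_PATH=./cache
MAX_CONCURRENT_EMBEDDINGS=4
ENABLE_FILE_WATCHER=true
```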
## Adding New Languages
To support a new language:
1. Add tree-sitter grammar to `requirements.txt`:
```
tree-sitter-kotlin>=0.20.0
```
2. Add language config to `config/languages.json`:
```json
{
  "kotlin": {
    "name": "kotlin",
    "extensions": [".kt", ".kts"],
    "tree_sitter_language": "kotlin",
    "chunkable_nodes": ["function_declaration", "class_declaration"],
    "name_fields": {
      "function_declaration": "simple_identifier",
      "class_declaration": "type_identifier"
    }
  }
}
```
3. Import module in `ast_chunker.py`:
```python
import tree_sitter_kotlin as tskotlin
```
4. Add to `LANGUAGE_MODULES` dict:
```python
"kotlin": tskotlin,
```
## Multi-Repository Workflow
The system supports two deployment patterns:
### Pattern A: Centralized Server (Recommended)
1. Start backend once: `docker-compose up -d`
2. Connect Claude Desktop/Code to MCP server container
3. Index any repository: `index_repository(host_path="/Users/you/projects/my-app")`
4. Search across all indexed repos: `search_code(query="auth logic")`
### Pattern B: Per-Project Setup
1. Start shared backend: `docker-compose up -d`
2. Copy `.mcp.json` to each project directory (see the example below)
3. Open project in Claude Code to trigger indexing
4. Switch between projects as needed
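A hypothetical `.mcp.json` pointing at the shared server container (the exact command depends on your setup; `MULTI_PROJECT_SETUP.md` has the canonical version):
```json
{
  "mcpServers": {
    "codebase-contextifier": {
      "command": "docker",
      "args": ["exec", "-i", "codebase-mcp-server", "python", "-m", "src.server"]
    }
  }
}
```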
**Key insight:** All projects share the same Qdrant/cache/index volumes, enabling:
- Cross-repository search
- Shared embedding cache (faster indexing)
- Centralized query server
See `MULTI_PROJECT_SETUP.md` for details.
## Background Jobs
Indexing large codebases runs in background jobs:
```python
# Start indexing
job = await index_repository(host_path="/Users/you/projects/large-app")
# → {"job_id": "abc123", "status": "queued"}
# Check progress
await get_job_status(job_id="abc123")
# → {
#   "status": "running",
#   "progress": {
#     "current_file": 45,
#     "total_files": 100,
#     "progress_pct": 45.0,
#     "chunks_indexed": 234,
#     "cache_hit_rate": "82.50%"
#   }
# }
# List all jobs
await list_indexing_jobs()
# Cancel if needed
await cancel_indexing_job(job_id="abc123")
```
See `BACKGROUND_JOBS.md` for detailed workflow patterns.
## Dependency Knowledge Base
For WordPress/PHP projects with many dependencies:
```python
# Detect available dependencies
await detect_dependencies()
# → Lists WordPress plugins, themes, Composer packages, npm packages
# Index specific dependencies into knowledge base
await index_dependencies(
    dependency_names=["woocommerce", "acf"],
    workspace_id="my-site"
)
# Check what's indexed
await list_indexed_dependencies()
```
**Architecture:**
- Code chunks → `code_chunks` collection (searchable via `search_code`)
- Dependencies → `dependency_knowledge` collection (separate namespace)
- Deduplication: the same dependency version is indexed once and linked to multiple workspaces (sketched below)
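The deduplication idea, sketched with hypothetical names:
```python
def dependency_key(name: str, version: str) -> str:
    """One knowledge-base record per name+version pair."""
    return f"{name}@{version}"

def link_workspace(record: dict, workspace_id: str) -> None:
    # Indexing an already-known version only links the workspace;
    # no re-chunking or re-embedding happens.
    if workspace_id not in record["workspaces"]:
        record["workspaces"].append(workspace_id)
```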
## Common Development Tasks
### Debugging Container-Based Indexing
```bash
# List indexer containers (ephemeral, auto-removed)
docker ps -a | grep indexer-
# View logs from specific job container
docker logs indexer-abc123
# Check job status via MCP tools
# Use get_job_status(job_id="abc123") in Claude
# View shared volumes
docker volume ls | grep codebase-contextifier
docker volume inspect codebase-contextifier-9000_index_data
```
### Debugging Indexing Issues
```bash
# Check what files are detected
docker exec -it codebase-mcp-server python -c "
from src.indexer.grammars import get_language_registry
registry = get_language_registry()
print('Supported extensions:', registry.get_supported_extensions())
"
# Check merkle tree state (per-repo)
docker exec -it codebase-mcp-server ls -la /index/
# Each subdirectory is a repo_name
# View cache contents (shared across all repos)
docker exec -it codebase-mcp-server ls -lh /cache
# Check Qdrant collections
docker exec -it codebase-qdrant wget -qO- http://localhost:6333/collections
```
### Testing Search Quality
Search quality depends on:
- **Embedding model**: `embeddinggemma:latest` (recommended), `mxbai-embed-large` (higher accuracy), `nomic-embed-text` (faster)
- **Chunk size**: Smaller chunks match more precisely but carry less surrounding context
- **Query phrasing**: Natural language works better than keywords
- **Multi-repo search**: Filter by the `repo_name` parameter when searching a specific project (see the examples below)
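Example calls (parameters as in the tool reference above; repo and query values illustrative):
```python
# Broad natural-language query across all indexed repos
await search_code(query="where do we validate JWT tokens?", limit=5)

# Narrowed to one repo and language
await search_code(query="checkout payment flow", repo_name="my-app", language="php")
```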
### Performance Tuning
Indexing performance:
- Increase `MAX_CONCURRENT_EMBEDDINGS` if CPU allows
- Increase `BATCH_SIZE` if RAM allows (embeddings processed in batches)
- Use SSD for `/cache` and `/index` volumes
- Container-based indexing: Spawn multiple indexer containers in parallel, each indexing a different repo (see below)
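Because each `index_repository` call spawns its own container, repositories can be indexed concurrently (paths illustrative):
```python
# Both calls return immediately with a job_id; poll with get_job_status()
job_a = await index_repository(host_path="/Users/you/projects/app-a")
job_b = await index_repository(host_path="/Users/you/projects/app-b")
```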
Search performance:
- Qdrant is already optimized (sub-10ms typically)
- Reduce `limit` parameter if only top results needed
- Use `repo_name` filter to limit search scope
## Important Implementation Notes
### Container Spawning Pattern
The `index_repository` tool spawns ephemeral Docker containers via `JobManager`:
```python
# In IndexingTool (src/tools/index_tool.py)
job = self.job_manager.create_job(repo_name, host_path)
success = await self.job_manager.spawn_indexer_container(
    job_id=job.job_id,
    host_path=host_path,            # Host machine path
    repo_name=repo_name,
    qdrant_host="codebase-qdrant",  # Container network name
    incremental=incremental,
    exclude_patterns=exclude_patterns,
)

# JobManager spawns the container with:
# - Volume mount: host_path -> /workspace (read-only)
# - Shared volumes: index_data, cache_data (read-write)
# - Network: codebase-contextifier-9000_default
# - Auto-remove: container deletes itself after completion
# - Monitor task: polls container status, updates job progress
```
**Key considerations:**
- MCP server needs Docker socket access (`/var/run/docker.sock`)
- Indexer image must be pre-built (`codebase-contextifier-9000-indexer`)
- Container runs `scripts/indexer.py` as entrypoint
- Job status tracked in-memory (lost on MCP server restart, but indexed data persists)
### Session Pattern for Index Updates
Always use `IncrementalIndexingSession` context manager when updating the index:
```python
with IncrementalIndexingSession(merkle_indexer) as session:
    # Plan updates
    files_to_index, files_to_remove = session.plan_incremental_update(all_files)

    # Drop removed files from the merkle tree and the vector DB
    for file_path in files_to_remove:
        session.remove_file(file_path)
        vector_db.delete_by_file_path(file_path)

    # Re-chunk, re-embed, and upsert changed files
    for file_path in files_to_index:
        chunks = chunker.chunk_file(file_path)
        embeddings_list = await embeddings.generate_embeddings_batched(...)
        vector_db.upsert_chunks(chunks, embeddings_list)
        chunk_ids = [chunk.chunk_id for chunk in chunks]  # attribute name illustrative
        session.update_file(file_path, chunk_ids)
# Automatically commits on __exit__
```
This ensures merkle tree state is only saved on success.
### Async/Sync Boundaries
- FastMCP tools can be `async def` or regular `def`
- Embedding generation is async (uses httpx for Ollama API)
- Vector DB operations are sync (qdrant-client is sync; offload sketch after this list)
- File watcher uses dual-threading model (watchdog observer + asyncio debounce processor)
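When a sync call such as qdrant-client must run inside an async tool without blocking the loop, the standard pattern is `asyncio.to_thread`; whether the server does this everywhere is an implementation detail, so treat this as a pattern sketch with illustrative names:
```python
import asyncio

async def search(vector_db, query_vector: list[float], limit: int = 10):
    # Offload the sync qdrant-client call to a worker thread so the
    # FastMCP event loop stays responsive.
    return await asyncio.to_thread(vector_db.search, query_vector, limit=limit)
```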
### File Watcher Threading Model
The file watcher (`src/indexer/file_watcher.py`) uses a complex threading model:
```python
# In src/server.py main block:
asyncio.run(initialize_components())  # Initialize the file_watcher object
start_file_watcher_sync()             # Start the watchdog observer thread

# Start the debounce processor in a separate thread (runs its own asyncio loop)
watcher_thread = threading.Thread(
    target=run_watcher_debounce_in_thread,
    daemon=True,
)
watcher_thread.start()

# Then run the MCP server (FastMCP has its own event loop)
mcp.run()
```
**Why this architecture?**
- `watchdog` observer runs in background thread (synchronous)
- Debounce logic needs async/await for delays and callback
- FastMCP server runs in main thread with its own event loop
- Can't share event loops across threads, so debounce processor gets its own thread + loop
**Alternative considered:** Running the debounce processor in FastMCP's event loop; however, FastMCP initialization happens after the `if __name__ == "__main__"` block, making it hard to hook into.
### Error Handling Philosophy
- Log errors but continue processing the remaining files (sketch after this list)
- Invalid files are skipped (unsupported language, parse errors)
- Health checks warn but don't block startup (Ollama may be temporarily down)
- File watcher exceptions are caught to prevent crashes
- Container spawn failures mark job as failed but don't crash MCP server
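The first bullet in practice (a sketch with illustrative names):
```python
import logging

logger = logging.getLogger(__name__)

def index_files(files_to_index, chunker):
    for file_path in files_to_index:
        try:
            chunks = chunker.chunk_file(file_path)  # may raise on parse errors
        except Exception as exc:
            # Skip the invalid file; keep indexing the rest.
            logger.warning("Skipping %s: %s", file_path, exc)
            continue
        ...  # embed and upsert chunks
```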