Codebase Contextifier 9000
A Docker-based Model Context Protocol (MCP) server for semantic code search with AST-aware chunking, relationship tracking via Neo4j graph database, local LLM support, and incremental indexing.
Documentation
📚 Quick Start Guide - Get running in 5 minutes
🔧 Multi-Project Setup - Index multiple projects with shared backend
⚙️ Background Jobs - Job-based indexing for large codebases
👁️ File Watcher Guide - Real-time monitoring and auto-indexing
🔬 Research & Methodology - Deep dive into semantic code search
📖 Full Documentation - Complete docs directory
Table of Contents
Indexing Tools: `index_repository`, `get_job_status`, `list_indexing_jobs`, `cancel_indexing_job`
Search Tools: `search_code`, `get_symbols`
Graph Query Tools: `find_usages`, `find_dependencies`, `query_graph`
Dependency Tools: `detect_dependencies`, `index_dependencies`, `list_indexed_dependencies`
Status Tools: `get_indexing_status`, `clear_index`, `get_watcher_status`, `health_check`
Features
AST-Aware Chunking: Uses tree-sitter to respect function and class boundaries, maintaining semantic integrity
Relationship Tracking: Neo4j graph database tracks function calls, imports, inheritance, and dependencies across your codebase
External Dependency Mapping: Automatically creates placeholder nodes for external functions (WordPress, npm packages, etc.)
Job-Based Indexing: Background indexing with progress tracking for large codebases
On-Demand Container Spawning: Index any repository on your system without manual mounting
Multi-Repository Search: Index and search across multiple projects with a shared backend
Real-Time Updates: File system watcher automatically re-indexes changed files (optional)
Local-First: All processing happens locally using Ollama for embeddings (no data leaves your machine)
Polyglot Support: Supports 10+ programming languages including TypeScript, Python, PHP, Go, Rust, Java, C++, and more
Incremental Indexing: Merkle tree-based change detection with 80%+ cache hit rates
Production-Grade: Uses Qdrant vector database for sub-10ms search latency and Neo4j for relationship queries
Dependency Knowledge Base: Special collection for indexing WordPress plugins, Composer packages, and npm modules
Flexible Deployment: Per-project or centralized server deployment options
MCP Integration: Works with Claude Desktop, Cursor, VS Code, and other MCP-compatible tools
Architecture
Key Architectural Features
Dual Database Architecture: Qdrant for semantic vector search, Neo4j for relationship graph queries
Container Orchestration: MCP server spawns lightweight indexer containers on-demand via Docker socket
Multi-Repository Support: Each repository gets its own merkle tree state, but shares the vector & graph databases
Shared Backend: All projects use the same Qdrant & Neo4j instances, enabling cross-repository search and relationship tracking
Job-Based Processing: Background jobs with progress tracking for large codebases
Content-Addressable Caching: Embeddings are cached by content hash, shared across all repositories
Relationship Extraction: AST-based extraction of CALLS, IMPORTS, EXTENDS, and IMPLEMENTS relationships
External Dependency Tracking: Automatic creation of placeholder nodes for unresolved function calls
Quick Start
See the Quick Start Guide for the full 5-minute walkthrough.
Prerequisites
Docker Desktop (or Docker + Docker Compose)
Ollama running locally with an embedding model:
```bash
# Install Ollama: https://ollama.ai

# Recommended: Google's Gemma embedding model (best quality)
ollama pull embeddinggemma:latest

# Alternative: Nomic Embed (faster, smaller)
ollama pull nomic-embed-text
```
Two Deployment Options
Option A: Centralized Server (Recommended)
Best for: Indexing from the MCP server, querying across all repositories
Option B: Per-Project Setup
Best for: Each project manages its own indexing
See MULTI_PROJECT_SETUP.md for details.
Claude Desktop Configuration
For Centralized Server (Option A):
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
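The exact block depends on how you run the server; as a sketch (the server name, container name, and launch command below are assumptions, not the project's published values):

```json
{
  "mcpServers": {
    "codebase-contextifier": {
      "command": "docker",
      "args": ["exec", "-i", "codebase-mcp-server", "python", "-m", "mcp_server"]
    }
  }
}
```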
For Per-Project Setup (Option B):
Just copy .mcp.json.template to your project directory - no manual configuration needed!
Usage
Once configured, you can use these tools in Claude Desktop or Claude Code:
Index any repository on your system:
The system spawns a container, indexes the repository in the background, and reports progress.
Monitor indexing progress:
Search for code across all indexed repositories:
Search with filters:
Extract symbols from a file:
Find all usages of a function (graph query):
Find dependencies of a function (graph query):
Detect and index external dependencies:
Check system status:
MCP Tools
Indexing Tools
index_repository
Index a repository from any directory on your host machine by spawning a lightweight indexer container.
Parameters:
- `host_path` (string, required): Absolute path on host machine to repository (e.g., `/Users/me/projects/my-app`)
- `repo_name` (string, optional): Unique identifier for this repository (defaults to directory name)
- `incremental` (bool): Use incremental indexing to only re-index changed files (default: `true`)
- `exclude_patterns` (string, optional): Comma-separated glob patterns to exclude (e.g., `"node_modules/*,dist/*"`)
Returns:
Example:
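For instance, a call to `index_repository` might pass arguments like these (the path and patterns are illustrative):

```json
{
  "host_path": "/Users/me/projects/my-app",
  "repo_name": "my-app",
  "incremental": true,
  "exclude_patterns": "node_modules/*,dist/*"
}
```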
get_job_status
Get the status and progress of an indexing job.
Parameters:
- `job_id` (string, required): Job identifier returned from `index_repository`
Returns:
Status values: "queued", "running", "completed", "failed", "cancelled"
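A response might look roughly like this (field names beyond `job_id` and `status` are assumptions about the payload shape):

```json
{
  "job_id": "a1b2c3",
  "status": "running",
  "files_processed": 120,
  "total_files": 480
}
```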
list_indexing_jobs
List all indexing jobs (past and present).
Returns:
cancel_indexing_job
Cancel a running indexing job.
Parameters:
- `job_id` (string, required): Job identifier to cancel
Returns:
Search Tools
search_code
Search code using natural language queries with semantic understanding across all indexed repositories.
Parameters:
- `query` (string, required): Natural language search query (e.g., "authentication logic", "error handling")
- `limit` (int): Maximum number of results to return (default: 10)
- `repo_name` (string, optional): Filter by repository name (searches all repos if not specified)
- `language` (string, optional): Filter by programming language (e.g., "python", "typescript", "php")
- `file_path_filter` (string, optional): Filter by file path pattern (e.g., "src/components")
- `chunk_type` (string, optional): Filter by chunk type (e.g., "function", "class", "method")
Returns:
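As an illustration, a filtered search request might look like this (values are hypothetical):

```json
{
  "query": "authentication logic",
  "limit": 10,
  "repo_name": "my-app",
  "language": "python",
  "chunk_type": "function"
}
```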
get_symbols
Extract symbols from a file using AST parsing.
Parameters:
- `file_path` (string): Path to source file
- `symbol_type` (string, optional): Filter by type (e.g., `"function"`, `"class"`)
Returns:
Graph Query Tools
find_usages
Find all places where a function, class, or symbol is used across the codebase using the graph database.
Parameters:
- `symbol_name` (string, required): Name of the function/class to find usages for
- `repo_name` (string, optional): Filter by repository name
Returns:
find_dependencies
Find all functions, classes, or imports that a symbol depends on using the graph database.
Parameters:
- `symbol_name` (string, required): Name of the function/class to analyze
- `repo_name` (string, optional): Filter by repository name
Returns:
query_graph
Execute custom Cypher queries against the Neo4j graph database for advanced relationship analysis.
Parameters:
- `cypher_query` (string, required): Cypher query to execute
- `limit` (int, optional): Maximum number of results (default: 100)
Returns:
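As an illustration, a query over the `CALLS` relationships described above might look like this (node labels and property names are assumptions about the graph schema):

```cypher
// Ten most-called symbols across all indexed repositories
MATCH (caller)-[:CALLS]->(callee)
RETURN callee.name AS symbol, count(caller) AS callers
ORDER BY callers DESC
LIMIT 10
```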
Dependency Tools
detect_dependencies
Detect available dependencies in the workspace (WordPress plugins/themes, Composer packages, npm modules).
Parameters:
- `workspace_path` (string, optional): Path to workspace (defaults to current workspace)
Returns:
index_dependencies
Index specific dependencies into the knowledge base for better understanding of external APIs.
Parameters:
- `dependency_names` (array, required): List of dependency names to index (e.g., `["woocommerce", "react"]`)
- `workspace_id` (string, required): Unique identifier for the workspace/project
- `workspace_path` (string, optional): Path to workspace
Returns:
list_indexed_dependencies
List all dependencies that have been indexed in the knowledge base.
Returns:
Status Tools
get_indexing_status
Get statistics about the index, including vector DB, graph DB, and cache metrics.
Returns:
clear_index
Clear the entire index (useful for fresh start).
get_watcher_status
Get status of the real-time file watcher.
Returns:
health_check
Check health status of all components (Ollama, Qdrant, Neo4j).
Supported Languages
| Language | Extensions | Support Level |
|---|---|---|
| Python | `.py` | Full |
| TypeScript | `.ts`, `.tsx` | Full |
| JavaScript | `.js`, `.jsx` | Full |
| PHP | `.php` | Full |
| Go | `.go` | Full |
| Rust | `.rs` | Full |
| Java | `.java` | Full |
| C++ | `.cpp`, `.hpp` | Full |
| C | `.c`, `.h` | Full |
| C# | `.cs` | Full |
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `CODEBASE_PATH` | | Path to codebase to index |
| | | Ollama API endpoint |
| | | Ollama embedding model to use |
| | | Qdrant server hostname |
| | | Qdrant server port |
| `ENABLE_GRAPH_DB` | | Enable Neo4j graph database |
| `NEO4J_URI` | | Neo4j connection URI |
| `NEO4J_USER` | `neo4j` | Neo4j username |
| `NEO4J_PASSWORD` | `codebase123` | Neo4j password |
| | | Path for index metadata |
| | | Path for embedding cache |
| | | Path to mounted codebase |
| | | Maximum chunk size in characters |
| `BATCH_SIZE` | | Embedding batch size |
| `MAX_CONCURRENT_EMBEDDINGS` | | Concurrent embedding requests |
| | | Enable real-time file watching |
| | | Delay before processing file changes |
| | | Logging level |
Recommended Embedding Models
- `embeddinggemma:latest` (recommended: best quality)
- `nomic-embed-text` (good balance of speed and quality)
- `mxbai-embed-large` (higher accuracy, slower)
- `all-minilm` (fastest, lower accuracy)
Performance
Indexing Performance
Medium codebase (5K-50K files): 2-10 minutes initial indexing
Incremental updates: 10-60 seconds for typical changes
Cache hit rate: 80-95% on subsequent runs
Embedding generation: ~100-500 chunks/minute (depends on Ollama performance)
Search Performance
Latency: Sub-second semantic search
Throughput: 10-50 queries/second
Accuracy: 30% better than fixed-size chunking (from research)
Troubleshooting
"Ollama health check failed"
1. Make sure Ollama is running: `ollama serve`
2. Pull the embedding model: `ollama pull embeddinggemma:latest`
3. Check that Docker can reach the host: test with `curl http://host.docker.internal:11434`
"Qdrant connection failed"
1. Check that the Qdrant container is running: `docker-compose ps`
2. Check Qdrant logs: `docker-compose logs qdrant`
3. Restart services: `docker-compose restart`
"Graph database not enabled"
1. Set `ENABLE_GRAPH_DB=true` in your `.env` file or `.mcp.json`
2. Ensure Neo4j environment variables are configured: `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`
3. Check that the Neo4j container is running: `docker-compose ps`
4. Check Neo4j logs: `docker-compose logs neo4j`
5. Test the Neo4j connection: `docker exec codebase-neo4j cypher-shell -u neo4j -p codebase123 "RETURN 1"`
"No supported files found"
1. Check that `CODEBASE_PATH` is correct in `.env`
2. Verify files have supported extensions
3. Check that `.gitignore` isn't excluding too much
Slow indexing
1. Reduce `BATCH_SIZE` if running low on RAM
2. Increase `MAX_CONCURRENT_EMBEDDINGS` if you have spare CPU
3. Use `incremental=true` for re-indexing
Development
Running Locally (Without Docker)
Running Tests
Code Quality
Architecture Details
AST-Aware Chunking
The system uses tree-sitter to parse code into Abstract Syntax Trees (ASTs), then extracts semantic chunks that respect:
Function boundaries
Class definitions
Method boundaries
Interface/trait definitions
This achieves 30% better accuracy than fixed-size chunking according to research (arXiv:2506.15655).
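The project uses tree-sitter for polyglot parsing; as a single-language sketch of the same idea, Python's stdlib `ast` module can chunk source along function and class boundaries:

```python
import ast

def chunk_by_boundaries(source: str) -> list[str]:
    """Split Python source into chunks that respect top-level function
    and class boundaries (a stdlib stand-in for tree-sitter)."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact source span of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Each chunk is then embedded as one semantic unit instead of being cut at an arbitrary character offset.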
Incremental Indexing
Uses Merkle tree-based change detection:
Compute Blake3 hash of each file
Compare with previous state
Only re-index changed files
Update vector database incrementally
Typical cache hit rates: 80-95%
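The steps above can be sketched as follows (sha256 stands in for Blake3 so the example stays stdlib-only, and the walk is limited to `*.py` for brevity):

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    # The project uses Blake3; sha256 is a stdlib stand-in here.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(root: Path, state_file: Path) -> list[str]:
    """Return files whose content hash differs from the saved state,
    then persist the new state for the next run."""
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {str(p): file_hash(p) for p in root.rglob("*.py")}
    changed = [p for p, h in current.items() if previous.get(p) != h]
    state_file.write_text(json.dumps(current))
    return changed  # only these need re-chunking and re-embedding
```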
Content-Addressable Storage
Embeddings are cached using content hashing:
This enables:
Team sharing of cached embeddings
Fast re-indexing after git operations
Deterministic caching across machines
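A minimal in-memory sketch of such a cache, keyed purely by chunk content (the real system persists entries and shares them across repositories):

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Content-addressable cache: identical chunks map to one embedding,
    regardless of which file or repository they came from."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    @staticmethod
    def key(chunk: str) -> str:
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def get_or_compute(self, chunk: str, embed: Callable[[str], list[float]]) -> list[float]:
        k = self.key(chunk)
        if k not in self._store:  # cache miss: call the embedding model once
            self._store[k] = embed(chunk)
        return self._store[k]
```

Because the key is derived only from the chunk's bytes, renaming or moving a file never invalidates its cached embeddings.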
Roadmap
Real-time file system watcher for instant updates
Multi-repo search with shared backend
Job-based background indexing with progress tracking
On-demand container spawning for flexible repository indexing
Neo4j integration for relationship tracking: function calls, imports, and inheritance, with placeholder nodes for external dependencies
Dependency knowledge base - Index WordPress plugins, Composer packages, npm modules
Reranking with cross-encoders for improved accuracy
Fine-tuned embeddings for domain-specific code
HTTP transport for remote MCP servers
Web UI for search and visualization
Graph-based code navigation UI (Neo4j Browser or custom visualization)
Research & References
Based on cutting-edge research in semantic code search:
cAST (arXiv:2506.15655): AST-aware chunking methodology
CodeRAG (arXiv:2504.10046): Graph-augmented retrieval
Model Context Protocol: Anthropic's standard for AI tool integration
Qdrant: High-performance vector database
tree-sitter: Incremental parsing library
License
MIT
Contributing
Contributions welcome! Please open an issue or PR.
Support
For issues, questions, or feature requests, please open a GitHub issue.