Orchestrates containerized indexing workloads by spawning on-demand indexer containers via Docker socket to process and index code repositories with shared backend services.
Uses Neo4j graph database to track code relationships including function calls, imports, inheritance, and dependencies across codebases, enabling relationship queries and dependency analysis.
Generates local embeddings for semantic code search using Ollama's embedding models, enabling privacy-focused vector representations of code without sending data to external services.
Codebase Contextifier 9000
A Docker-based Model Context Protocol (MCP) server for semantic code search with AST-aware chunking, relationship tracking via Neo4j graph database, local LLM support, and incremental indexing.
Documentation
Quick Start Guide - Get running in 5 minutes
Multi-Project Setup - Index multiple projects with shared backend
Background Jobs - Job-based indexing for large codebases
File Watcher Guide - Real-time monitoring and auto-indexing
Research & Methodology - Deep dive into semantic code search
Full Documentation - Complete docs directory
Table of Contents
Indexing Tools: index_repository, get_job_status, list_indexing_jobs, cancel_indexing_job
Search Tools: search_code, get_symbols
Graph Query Tools: find_usages, find_dependencies, query_graph
Dependency Tools: detect_dependencies, index_dependencies, list_indexed_dependencies
Status Tools: get_indexing_status, clear_index, get_watcher_status, health_check
Features
AST-Aware Chunking: Uses tree-sitter to respect function and class boundaries, maintaining semantic integrity
Relationship Tracking: Neo4j graph database tracks function calls, imports, inheritance, and dependencies across your codebase
External Dependency Mapping: Automatically creates placeholder nodes for external functions (WordPress, npm packages, etc.)
Job-Based Indexing: Background indexing with progress tracking for large codebases
On-Demand Container Spawning: Index any repository on your system without manual mounting
Multi-Repository Search: Index and search across multiple projects with a shared backend
Real-Time Updates: File system watcher automatically re-indexes changed files (optional)
Local-First: All processing happens locally using Ollama for embeddings (no data leaves your machine)
Polyglot Support: Supports 10+ programming languages including TypeScript, Python, PHP, Go, Rust, Java, C++, and more
Incremental Indexing: Merkle tree-based change detection with 80%+ cache hit rates
Production-Grade: Uses Qdrant vector database for sub-10ms search latency and Neo4j for relationship queries
Dependency Knowledge Base: Special collection for indexing WordPress plugins, Composer packages, and npm modules
Flexible Deployment: Per-project or centralized server deployment options
MCP Integration: Works with Claude Desktop, Cursor, VS Code, and other MCP-compatible tools
Architecture
┌─────────────────────────────────────────────────────────────────┐
│  MCP Client (Claude Code, Claude Desktop, Cursor, etc.)         │
└──────────────────────────────┬──────────────────────────────────┘
                               │ MCP Protocol (stdio)
                               │
┌──────────────────────────────▼──────────────────────────────────┐
│  MCP Server Container (codebase-mcp-server)                     │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  FastMCP Server - Exposes MCP Tools:                     │   │
│  │  • index_repository (spawns indexer containers)          │   │
│  │  • search_code (semantic search across all repos)        │   │
│  │  • find_usages, find_dependencies (graph queries)        │   │
│  │  • detect_dependencies, index_dependencies               │   │
│  │  • get_job_status, list_indexing_jobs, cancel_job        │   │
│  │  • get_symbols, get_indexing_status, health_check        │   │
│  └──────────────┬───────────────────────────────────────────┘   │
│                 │                                               │
│                 │ Spawns via Docker Socket                      │
│                 ▼                                               │
│  ┌─────────────────────────────────────────────────────┐        │
│  │  On-Demand Indexer Containers (ephemeral)           │        │
│  │  • Mounts any host directory                        │        │
│  │  • AST-aware chunking with tree-sitter              │        │
│  │  • Extracts relationships (CALLS, IMPORTS, etc.)    │        │
│  │  • Generates embeddings via Ollama                  │        │
│  │  • Updates shared Qdrant & Neo4j databases          │        │
│  │  • Reports progress back to MCP server              │        │
│  └──────────────────────┬──────────────────────────────┘        │
└─────────────────────────┼───────────────────────────────────────┘
                          │
          ┌───────────────┴───────────────────┐
          │                                   │
   ┌──────▼──────┐               ┌────────────▼────────┐
   │   Qdrant    │               │       Neo4j         │
   │  Container  │               │     Container       │
   │  (Vectors)  │               │  (Relationships)    │
   └──────┬──────┘               └──────┬──────────────┘
          │                             │
   ┌──────▼─────────────────────────────▼──────────┐
   │  Persistent Docker Volumes:                   │
   │  • qdrant_data (vector DB)                    │
   │  • neo4j_data (graph DB)                      │
   │  • index_data (merkle trees)                  │
   │  • cache_data (embeddings cache)              │
   └───────────────────────────────────────────────┘

   ┌────────────────────────┐
   │     Ollama (Host)      │
   │    Embedding Model     │
   └────────────────────────┘
Key Architectural Features
Dual Database Architecture: Qdrant for semantic vector search, Neo4j for relationship graph queries
Container Orchestration: MCP server spawns lightweight indexer containers on-demand via Docker socket
Multi-Repository Support: Each repository gets its own merkle tree state, but shares the vector & graph databases
Shared Backend: All projects use the same Qdrant & Neo4j instances, enabling cross-repository search and relationship tracking
Job-Based Processing: Background jobs with progress tracking for large codebases
Content-Addressable Caching: Embeddings are cached by content hash, shared across all repositories
Relationship Extraction: AST-based extraction of CALLS, IMPORTS, EXTENDS, and IMPLEMENTS relationships
External Dependency Tracking: Automatic creation of placeholder nodes for unresolved function calls
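To make the relationship extraction and placeholder ideas above more concrete, here is a minimal sketch. It uses Python's standard `ast` module instead of tree-sitter, handles only direct calls by bare name, and the function name `extract_calls` is hypothetical. It illustrates the idea and is not the project's extractor.

```python
import ast

def extract_calls(source: str) -> list[tuple[str, str, bool]]:
    """Return (caller, callee, is_external) triples for direct calls by name.

    A callee that is not defined in the same source is flagged as external,
    which is roughly where a placeholder node would be created in the graph.
    """
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    # Names of all functions defined in this source count as internal targets.
    defined = {n.name for n in ast.walk(tree) if isinstance(n, func_types)}
    edges = []
    for func in ast.walk(tree):
        if not isinstance(func, func_types):
            continue
        # Calls inside nested functions are also attributed to the outer one.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                edges.append((func.name, callee, callee not in defined))
    return edges

sample = (
    "def helper():\n"
    "    return 1\n"
    "\n"
    "def main():\n"
    "    helper()\n"
    "    wp_enqueue_script()\n"
)
edges = extract_calls(sample)
```

Each triple maps naturally onto a CALLS edge, with the external flag marking targets that would need a placeholder node.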
Quick Start
See the Quick Start Guide listed under Documentation.
Prerequisites
Docker Desktop (or Docker + Docker Compose)
Ollama running locally with an embedding model:
# Install Ollama: https://ollama.ai
# Recommended: Google's Gemma embedding model (best quality)
ollama pull embeddinggemma:latest
# Alternative: Nomic Embed (faster, smaller)
ollama pull nomic-embed-text
Two Deployment Options
Option A: Centralized Server (Recommended)
Best for: Indexing from the MCP server, querying across all repositories
# 1. Start the backend
cd codebase-contextifier-9000
docker-compose up -d
# 2. Configure Claude Desktop (see below)
# 3. Index any repository
# In Claude: "Index the repository at /Users/me/projects/my-app"
Option B: Per-Project Setup
Best for: Each project manages its own indexing
# 1. Start shared backend (once)
cd codebase-contextifier-9000
docker-compose up -d
# 2. Copy .mcp.json to each project
cp .mcp.json.template ~/projects/my-app/.mcp.json
# 3. Open project in Claude Code
cd ~/projects/my-app
claude-code .
See MULTI_PROJECT_SETUP.md for details.
Claude Desktop Configuration
For Centralized Server (Option A):
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"codebase-contextifier": {
"command": "docker",
"args": [
"exec",
"-i",
"codebase-mcp-server",
"python",
"-m",
"src.server"
]
}
}
}
For Per-Project Setup (Option B):
Just copy .mcp.json.template to your project directory - no manual configuration needed!
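For the centralized setup (Option A), the entry can also be merged into an existing Claude Desktop config programmatically, so that other configured servers are kept. The sketch below works on the config text only. The helper name `add_mcp_server` is hypothetical, and reading or writing the file at the paths mentioned above is left to the caller.

```python
import json

def add_mcp_server(config_text: str) -> str:
    """Merge the codebase-contextifier entry into a Claude Desktop config.

    Existing servers under "mcpServers" are preserved. An empty string is
    treated as an empty config.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["codebase-contextifier"] = {
        "command": "docker",
        "args": ["exec", "-i", "codebase-mcp-server", "python", "-m", "src.server"],
    }
    return json.dumps(config, indent=2)

existing = '{"mcpServers": {"other": {"command": "x"}}}'
merged = json.loads(add_mcp_server(existing))
```

Back up the original file before overwriting it, since Claude Desktop reads this config at startup.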
Usage
Once configured, you can use these tools in Claude Desktop or Claude Code:
Index any repository on your system:
Claude, index the repository at /Users/me/projects/my-app
The system spawns a container, indexes the repository in the background, and reports progress.
Monitor indexing progress:
Claude, show me the status of job abc123
Search for code across all indexed repositories:
Claude, search for "authentication logic" in the codebase
Search with filters:
Claude, search for "error handling" filtering by language=python and repo_name=my-api
Extract symbols from a file:
Claude, get all functions from /workspace/src/utils.py
Find all usages of a function (graph query):
Claude, find all places where authenticate_user is called
Find dependencies of a function (graph query):
Claude, show me all functions that processPayment depends on
Detect and index external dependencies:
Claude, detect available WordPress plugins in this project
Claude, index the woocommerce plugin into the knowledge base
Check system status:
Claude, show me the indexing status and list all jobs
MCP Tools
Indexing Tools
index_repository
Index a repository from any directory on your host machine by spawning a lightweight indexer container.
Parameters:
host_path (string, required): Absolute path on host machine to repository (e.g., /Users/me/projects/my-app)
repo_name (string, optional): Unique identifier for this repository (defaults to directory name)
incremental (bool): Use incremental indexing to only re-index changed files (default: true)
exclude_patterns (string, optional): Comma-separated glob patterns to exclude (e.g., "node_modules/*,dist/*")
Returns:
{
"success": true,
"job_id": "abc123def456",
"repo_name": "my-app",
"status": "queued",
"message": "Background indexing started for 'my-app'"
}
Example:
# Index a WordPress site, excluding plugins and uploads
await index_repository(
host_path="/Users/me/sites/my-wordpress",
repo_name="my-wordpress",
exclude_patterns="wp-content/plugins/*,wp-content/uploads/*,wp-includes/*"
)
get_job_status
Get the status and progress of an indexing job.
Parameters:
job_id (string, required): Job identifier returned from index_repository
Returns:
{
"success": true,
"job_id": "abc123def456",
"repo_name": "my-app",
"repo_path": "/Users/me/projects/my-app",
"status": "running",
"created_at": 1698765432.123,
"started_at": 1698765433.456,
"elapsed_seconds": 45.2,
"progress": {
"current_file": 45,
"total_files": 100,
"progress_pct": 45.0,
"current_file_path": "/workspace/src/api/auth.py",
"chunks_indexed": 234,
"failed_files_count": 2,
"cache_hit_rate": "35.50%"
}
}
Status values: "queued", "running", "completed", "failed", "cancelled"
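Because index_repository returns immediately with a job_id, a client usually polls get_job_status until the job reaches a terminal state. The sketch below shows one way to do that. `wait_for_job` and its `get_status` callable are hypothetical stand-ins for whatever MCP client call you use, and the polling interval and timeout are arbitrary defaults.

```python
import time
from typing import Callable

# Terminal values taken from the status list above.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(get_status: Callable[[str], dict], job_id: str,
                 poll_seconds: float = 2.0,
                 timeout_seconds: float = 600.0) -> dict:
    """Poll get_status(job_id) until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_seconds}s")
        time.sleep(poll_seconds)

# Usage with a fake status source that walks through three states.
_responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "job_id": "abc123"},
])
result = wait_for_job(lambda _job_id: next(_responses), "abc123", poll_seconds=0)
```

Passing the status function in as a parameter keeps the loop independent of any particular MCP client library.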
list_indexing_jobs
List all indexing jobs (past and present).
Returns:
{
"success": true,
"total_jobs": 3,
"jobs": [
{
"job_id": "abc123",
"repo_name": "my-api",
"status": "completed",
"progress": { "progress_pct": 100.0, ... }
},
{
"job_id": "def456",
"repo_name": "frontend",
"status": "running",
"progress": { "progress_pct": 67.5, ... }
}
]
}
cancel_indexing_job
Cancel a running indexing job.
Parameters:
job_id (string, required): Job identifier to cancel
Returns:
{
"success": true,
"message": "Job abc123 cancelled successfully"
}
Search Tools
search_code
Search code using natural language queries with semantic understanding across all indexed repositories.
Parameters:
query (string, required): Natural language search query (e.g., "authentication logic", "error handling")
limit (int): Maximum number of results to return (default: 10)
repo_name (string, optional): Filter by repository name (searches all repos if not specified)
language (string, optional): Filter by programming language (e.g., "python", "typescript", "php")
file_path_filter (string, optional): Filter by file path pattern (e.g., "src/components")
chunk_type (string, optional): Filter by chunk type (e.g., "function", "class", "method")
Returns:
{
"success": true,
"query": "authentication logic",
"total_results": 5,
"results": [
{
"rank": 1,
"score": 0.8234,
"repo_name": "backend-api",
"file": "/workspace/src/auth/login.ts",
"lines": "42-68",
"language": "typescript",
"type": "function",
"context": "class:AuthService",
"code": "async function authenticateUser(username, password) { ... }"
}
]
}
get_symbols
Extract symbols from a file using AST parsing.
Parameters:
file_path (string): Path to source file
symbol_type (string, optional): Filter by type (e.g., "function", "class")
Returns:
{
"success": true,
"file_path": "/workspace/src/utils.py",
"total_symbols": 15,
"symbols": [
{
"name": "format_date",
"type": "function_definition",
"start_line": 42,
"end_line": 58,
"context": "N/A",
"language": "python"
}
]
}
Graph Query Tools
find_usages
Find all places where a function, class, or symbol is used across the codebase using the graph database.
Parameters:
symbol_name (string, required): Name of the function/class to find usages for
repo_name (string, optional): Filter by repository name
Returns:
{
"success": true,
"symbol_name": "authenticate_user",
"total_usages": 12,
"usages": [
{
"caller": "LoginController.handleLogin",
"caller_file": "/workspace/src/controllers/login.ts",
"line_number": 42,
"relationship_type": "CALLS"
}
]
}
find_dependencies
Find all functions, classes, or imports that a symbol depends on using the graph database.
Parameters:
symbol_name (string, required): Name of the function/class to analyze
repo_name (string, optional): Filter by repository name
Returns:
{
"success": true,
"symbol_name": "processPayment",
"total_dependencies": 8,
"dependencies": [
{
"target": "validateCard",
"target_file": "/workspace/src/utils/validation.ts",
"relationship_type": "CALLS",
"is_external": false
},
{
"target": "stripe.charges.create",
"relationship_type": "CALLS",
"is_external": true
}
]
}
query_graph
Execute custom Cypher queries against the Neo4j graph database for advanced relationship analysis.
Parameters:
cypher_query (string, required): Cypher query to execute
limit (int, optional): Maximum number of results (default: 100)
Returns:
{
"success": true,
"query": "MATCH (f:Function)-[:CALLS]->(ext:ExternalFunction) WHERE ext.name =~ 'wp_.*' RETURN f.name, ext.name",
"results": [
{"f.name": "enqueue_scripts", "ext.name": "wp_enqueue_script"},
{"f.name": "setup_theme", "ext.name": "wp_register_nav_menu"}
],
"total_results": 2
}
Dependency Tools
detect_dependencies
Detect available dependencies in the workspace (WordPress plugins/themes, Composer packages, npm modules).
Parameters:
workspace_path (string, optional): Path to workspace (defaults to current workspace)
Returns:
{
"success": true,
"dependencies": {
"wordpress_plugins": ["woocommerce", "advanced-custom-fields"],
"wordpress_themes": ["twentytwentyfour"],
"composer_packages": ["symfony/console", "guzzlehttp/guzzle"],
"npm_packages": ["react", "typescript"]
},
"total_dependencies": 6
}
index_dependencies
Index specific dependencies into the knowledge base for better understanding of external APIs.
Parameters:
dependency_names (array, required): List of dependency names to index (e.g., ["woocommerce", "react"])
workspace_id (string, required): Unique identifier for the workspace/project
workspace_path (string, optional): Path to workspace
Returns:
{
"success": true,
"indexed_dependencies": ["woocommerce"],
"total_chunks": 1247,
"message": "Successfully indexed 1 dependencies with 1247 chunks"
}
list_indexed_dependencies
List all dependencies that have been indexed in the knowledge base.
Returns:
{
"success": true,
"dependencies": [
{
"name": "woocommerce",
"version": "8.5.0",
"type": "wordpress_plugin",
"workspaces": ["my-store", "test-site"],
"chunks_count": 1247,
"indexed_at": "2024-01-15T10:30:00Z"
}
],
"total_dependencies": 1
}
Status Tools
get_indexing_status
Get statistics about the index, including vector DB, graph DB, and cache metrics.
Returns:
{
"success": true,
"code_db": {
"total_chunks": 2450,
"vectors_count": 2450,
"status": "green"
},
"knowledge_db": {
"total_chunks": 1247,
"indexed_dependencies": ["woocommerce"]
},
"graph_db": {
"enabled": true,
"total_nodes": 2230,
"total_relationships": 4407,
"node_types": {
"Function": 1459,
"ExternalFunction": 771
}
},
"index": {
"indexed_files": 150,
"total_chunks": 2450
},
"cache": {
"enabled": true,
"cached_embeddings": 2450,
"total_size_mb": 18.5
}
}
clear_index
Clear the entire index (useful for fresh start).
get_watcher_status
Get status of the real-time file watcher.
Returns:
{
"success": true,
"enabled": true,
"running": true,
"watch_path": "/workspace",
"debounce_seconds": 2.0
}
health_check
Check health status of all components (Ollama, Qdrant, Neo4j).
Supported Languages
Language | Extensions | Support Level |
Python | | Full |
TypeScript | | Full |
JavaScript | | Full |
PHP | | Full |
Go | | Full |
Rust | | Full |
Java | | Full |
C++ | | Full |
C | | Full |
C# | | Full |
Configuration
Environment Variables
Variable | Default | Description |
 | | Path to codebase to index |
 | | Ollama API endpoint |
 | | Ollama embedding model to use |
 | | Qdrant server hostname |
 | | Qdrant server port |
 | | Enable Neo4j graph database |
 | | Neo4j connection URI |
 | | Neo4j username |
 | | Neo4j password |
 | | Path for index metadata |
 | | Path for embedding cache |
 | | Path to mounted codebase |
 | | Maximum chunk size in characters |
 | | Embedding batch size |
 | | Concurrent embedding requests |
 | | Enable real-time file watching |
 | | Delay before processing file changes |
 | | Logging level |
Recommended Embedding Models
embeddinggemma:latest (recommended - best quality)
nomic-embed-text (good balance of speed and quality)
mxbai-embed-large (higher accuracy, slower)
all-minilm (fastest, lower accuracy)
Performance
Indexing Performance
Medium codebase (5K-50K files): 2-10 minutes initial indexing
Incremental updates: 10-60 seconds for typical changes
Cache hit rate: 80-95% on subsequent runs
Embedding generation: ~100-500 chunks/minute (depends on Ollama performance)
Search Performance
Latency: Sub-second semantic search
Throughput: 10-50 queries/second
Accuracy: 30% better than fixed-size chunking, according to the cAST research (arXiv:2506.15655)
Troubleshooting
"Ollama health check failed"
Make sure Ollama is running:
ollama serve
Pull the embedding model:
ollama pull embeddinggemma:latest
Check Docker can access host: Test with
curl http://host.docker.internal:11434
"Qdrant connection failed"
Check Qdrant container is running:
docker-compose ps
Check Qdrant logs:
docker-compose logs qdrant
Restart services:
docker-compose restart
"Graph database not enabled"
Set ENABLE_GRAPH_DB=true in your .env file or .mcp.json
Ensure Neo4j environment variables are configured: NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD
Check Neo4j container is running:
docker-compose ps
Check Neo4j logs:
docker-compose logs neo4j
Test Neo4j connection:
docker exec codebase-neo4j cypher-shell -u neo4j -p codebase123 "RETURN 1"
"No supported files found"
Check CODEBASE_PATH is correct in .env
Verify files have supported extensions
Check .gitignore isn't excluding too much
Slow indexing
Reduce BATCH_SIZE if running low on RAM
Increase MAX_CONCURRENT_EMBEDDINGS if you have spare CPU
Use incremental=true for re-indexing
Development
Running Locally (Without Docker)
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export QDRANT_HOST=localhost
export OLLAMA_HOST=http://localhost:11434
export INDEX_PATH=./index
export CACHE_PATH=./cache
export WORKSPACE_PATH=/path/to/your/codebase
# Start Qdrant
docker run -p 6333:6333 qdrant/qdrant
# Run server
python -m src.server
Running Tests
pip install -e ".[dev]"
pytest
Code Quality
# Format code
black src/
# Lint code
ruff check src/
Architecture Details
AST-Aware Chunking
The system uses tree-sitter to parse code into Abstract Syntax Trees (ASTs), then extracts semantic chunks that respect:
Function boundaries
Class definitions
Method boundaries
Interface/trait definitions
This achieves 30% better accuracy than fixed-size chunking according to research (arXiv:2506.15655).
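As a rough illustration of boundary-respecting chunking, the sketch below splits Python source into one chunk per top-level function or class. It deliberately swaps in Python's built-in `ast` module for tree-sitter, so it covers Python only. The function name `chunk_python_source` and the chunk fields are assumptions modeled loosely on the get_symbols output shown earlier, not the project's actual chunker.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split source into chunks aligned with top-level functions and classes."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({
                "name": node.name,
                "type": kind,
                "start_line": node.lineno,      # 1-based, like editor line numbers
                "end_line": node.end_lineno,
                # The slice covers the whole definition, never a partial body.
                "code": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

sample = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
chunks = chunk_python_source(sample)
```

A fixed-size splitter could cut `class B` in the middle of a method. Here every chunk starts and ends on a definition boundary, which is the property the section describes.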
Incremental Indexing
Uses Merkle tree-based change detection:
Compute Blake3 hash of each file
Compare with previous state
Only re-index changed files
Update vector database incrementally
Typical cache hit rates: 80-95%
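The steps above can be sketched as a comparison between two snapshots that map file paths to content digests. This is a simplified illustration under stated assumptions: `hashlib.blake2b` from the standard library stands in for Blake3, the "tree" is flattened to a single root digest over all leaves, and the function names are hypothetical.

```python
import hashlib

def file_digest(content: bytes) -> str:
    # Assumption: blake2b stands in for Blake3, which needs a third-party package.
    return hashlib.blake2b(content, digest_size=32).hexdigest()

def root_digest(state: dict[str, str]) -> str:
    """Single digest over all (path, file digest) pairs, in sorted path order."""
    h = hashlib.blake2b(digest_size=32)
    for path in sorted(state):
        h.update(path.encode("utf-8") + b"\0" + state[path].encode("ascii") + b"\n")
    return h.hexdigest()

def diff_states(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Return which paths need re-indexing and which should be dropped."""
    # Matching roots mean nothing changed, so the per-file scan is skipped.
    if root_digest(previous) == root_digest(current):
        return {"changed": [], "removed": []}
    changed = sorted(p for p, d in current.items() if previous.get(p) != d)
    removed = sorted(p for p in previous if p not in current)
    return {"changed": changed, "removed": removed}

old = {"a.py": file_digest(b"x = 1\n"), "b.py": file_digest(b"y = 2\n")}
new = {"a.py": file_digest(b"x = 1\n"), "b.py": file_digest(b"y = 3\n"),
       "c.py": file_digest(b"")}
delta = diff_states(old, new)
```

Only the paths in `changed` would be re-chunked and re-embedded, and vectors for `removed` paths would be deleted. A full Merkle tree adds per-directory digests so that unchanged subtrees can be skipped as well.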
Content-Addressable Storage
Embeddings are cached using content hashing:
cache_key = blake3(model_name + file_content)
This enables:
Team sharing of cached embeddings
Fast re-indexing after git operations
Deterministic caching across machines
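A minimal sketch of such a content-addressed key follows. Two details are assumptions rather than the project's implementation: `hashlib.blake2b` replaces Blake3 so the example needs only the standard library, and a separator byte is inserted between model name and content so that different splits of the same characters cannot collide.

```python
import hashlib

def cache_key(model_name: str, content: str) -> str:
    """Derive a deterministic key from the embedding model and the content."""
    h = hashlib.blake2b(digest_size=32)
    h.update(model_name.encode("utf-8"))
    h.update(b"\0")  # separator: ("ab", "c") and ("a", "bc") must not collide
    h.update(content.encode("utf-8"))
    return h.hexdigest()

# A toy in-memory cache keyed by content rather than by file path.
embedding_cache: dict[str, list[float]] = {}
key = cache_key("nomic-embed-text", "def add(a, b):\n    return a + b\n")
embedding_cache[key] = [0.1, 0.2, 0.3]  # placeholder vector, not a real embedding
```

Because the key depends only on the model name and the text, the same chunk yields the same key on any machine and in any repository, which is what makes the cache shareable.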
Roadmap
Real-time file system watcher for instant updates
Multi-repo search with shared backend
Job-based background indexing with progress tracking
On-demand container spawning for flexible repository indexing
Neo4j integration for relationship tracking - Track function calls, imports, inheritance, with external dependency placeholders
Dependency knowledge base - Index WordPress plugins, Composer packages, npm modules
Reranking with cross-encoders for improved accuracy
Fine-tuned embeddings for domain-specific code
HTTP transport for remote MCP servers
Web UI for search and visualization
Graph-based code navigation UI (Neo4j Browser or custom visualization)
Research & References
Based on cutting-edge research in semantic code search:
cAST (arXiv:2506.15655): AST-aware chunking methodology
CodeRAG (arXiv:2504.10046): Graph-augmented retrieval
Model Context Protocol: Anthropic's standard for AI tool integration
Qdrant: High-performance vector database
tree-sitter: Incremental parsing library
License
MIT
Contributing
Contributions welcome! Please open an issue or PR.
Support
For issues, questions, or feature requests, please open a GitHub issue.