Code-Index-MCP

Code-Index-MCP
scripts

README_BM25_POPULATE.md•5.46 KiB

# BM25 Index Population Script ## Overview The `populate_bm25_index.py` script efficiently indexes all files from the SQLite database into BM25 full-text search tables using parallel processing. This enables fast keyword-based searching with BM25 ranking. ## Features - **Parallel Processing**: Uses multiple workers to index files concurrently - **Progress Tracking**: Real-time progress updates with ETA - **Error Handling**: Graceful handling of file read errors and encoding issues - **Metadata Preservation**: Maintains all original file metadata from the database - **Language-Aware**: Extracts language-specific symbols, imports, and comments - **Incremental Updates**: Skips files with unchanged content (unless forced) - **Memory Efficient**: Processes files in batches to handle large codebases ## Usage ### Basic Usage ```bash # Index all files in the database python scripts/populate_bm25_index.py # With verbose output python scripts/populate_bm25_index.py --verbose ``` ### Advanced Options ```bash # Index only Python files python scripts/populate_bm25_index.py --language python # Index files from a specific repository python scripts/populate_bm25_index.py --repository /path/to/repo # Force reindex all files python scripts/populate_bm25_index.py --force # Custom batch size and worker count python scripts/populate_bm25_index.py --batch-size 200 --workers 8 # Use a different database python scripts/populate_bm25_index.py --db-path /path/to/database.db ``` ## Command Line Options - `--db-path PATH`: Path to SQLite database (default: code_index.db) - `--batch-size SIZE`: Number of files to process per batch (default: 100) - `--workers NUM`: Number of parallel workers (default: CPU count) - `--language LANG`: Only index files of specific language - `--repository REPO`: Only index files from specific repository path - `--force`: Force reindex even if content hasn't changed - `--verbose`: Enable verbose logging ## How It Works 1. **File Discovery**: Queries the SQLite database for all files matching the filters 2. **Batch Processing**: Divides files into batches for parallel processing 3. **Content Reading**: Reads file content from disk with encoding detection 4. **Metadata Extraction**: Uses language-specific plugins to extract: - Symbol definitions (classes, functions, methods) - Import statements - Comments and documentation 5. **BM25 Indexing**: Indexes content into FTS5 tables with: - Full file content - Extracted symbols - File metadata 6. **Progress Tracking**: Reports progress with statistics: - Files processed - Success/failure counts - Processing rate - Estimated time remaining 7. **Index Optimization**: Optimizes FTS5 tables after bulk loading ## Performance The script is optimized for large codebases: - **Parallel Processing**: Utilizes all CPU cores by default - **Batch Operations**: Reduces database transaction overhead - **Memory Efficiency**: Processes files in streams, not loading entire codebase - **Skip Large Files**: Automatically skips files over 10MB - **Encoding Detection**: Tries multiple encodings to maximize readable files ### Expected Performance - Small projects (<1000 files): 1-5 seconds - Medium projects (1000-10000 files): 10-60 seconds - Large projects (10000-100000 files): 1-10 minutes - Very large monorepos (>100000 files): 10-30 minutes ## Error Handling The script handles various error conditions: - **Missing Files**: Logs warning and continues - **Encoding Errors**: Tries multiple encodings, skips if all fail - **Large Files**: Skips files over 10MB with warning - **Plugin Errors**: Logs warning but still indexes file content - **Database Errors**: Rolls back transaction and logs error ## Integration After indexing, you can search using: ```python from mcp_server.storage.sqlite_store import SQLiteStore from mcp_server.indexer.bm25_indexer import BM25Indexer storage = SQLiteStore("code_index.db") indexer = BM25Indexer(storage) # Search for content results = indexer.search("TODO", limit=10) # Search for symbols symbol_results = indexer.search("MyClass", search_type="symbols") # Phrase search phrase_results = indexer.search_phrase("import sqlite3") ``` ## Monitoring The script provides detailed progress information: ``` 2024-01-10 10:30:00 - Starting BM25 indexing with 8 workers 2024-01-10 10:30:00 - Found 5432 files to index 2024-01-10 10:30:05 - Progress: 500/5432 files (9.2%) - Rate: 100.0 files/sec - ETA: 49s 2024-01-10 10:30:10 - Progress: 1000/5432 files (18.4%) - Rate: 102.3 files/sec - ETA: 43s ... 2024-01-10 10:30:54 - Indexing completed in 54.2 seconds 2024-01-10 10:30:54 - Total files: 5432 2024-01-10 10:30:54 - Successfully indexed: 5420 2024-01-10 10:30:54 - Failed: 12 2024-01-10 10:30:54 - Average rate: 100.2 files/second ``` ## Troubleshooting ### Common Issues 1. **"File not found" errors** - Files may have been moved/deleted since database was populated - Re-run main indexing to update file paths 2. **Encoding errors** - Some files may use uncommon encodings - Script tries UTF-8, UTF-8-BOM, Latin-1, and CP1252 3. **Memory usage** - Reduce batch size: `--batch-size 50` - Reduce workers: `--workers 2` 4. **Slow performance** - Check disk I/O performance - Ensure database is on fast storage (SSD) - Run with `--verbose` to identify bottlenecks ## Related Tools - `mcp_server_cli.py index`: Initial file discovery and indexing - `demo_populate_bm25.py`: Demo script showing BM25 usage - `test_bm25_hybrid_search.py`: Test hybrid search capabilities

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ViperJuice/Code-Index-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

README_BM25_POPULATE.md•5.46 KiB