# Semantic Indexing Summary
## Current Status
### ✅ Completed (9 repositories)
- **c_phoenix**: 302 embeddings (30 files)
- **c_redis**: 5,810 embeddings (766 files)
- **cpp_grpc**: 3,278 embeddings (partial; indexing timed out)
- **go_gin**: 416 embeddings
- **code-embeddings**: 495 embeddings (django)
- **typescript collections**: ~19,000 embeddings across multiple repositories (from earlier runs)
### 📋 Ready to Index (13 repositories)
Small to medium repositories that can be indexed quickly:
- c/curl
- cpp/json
- dart/flutter_examples
- go/terraform
- java/kafka
- javascript/express
- kotlin/kotlin
- python/flask
- python/requests
- rust/rust
- rust/tokio
- scala/akka
- swift/alamofire
### ⏭️ Skipped - Large Repositories (7)
These require dedicated indexing sessions due to their size; the file counts below are estimates (a counting sketch follows the list):
- **typescript/TypeScript**: ~50,000 files
- **javascript/react**: ~6,000 files
- **python/django**: ~5,000 files
- **csharp/aspnetcore**: ~5,000 files
- **ruby/rails**: ~5,000 files
- **php/laravel**: ~4,000 files
- **java/spring-boot**: ~3,000 files
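For reference, file counts like those above can be reproduced with a short script. This is an illustrative sketch; the extension filter and the `test_repos` checkout layout are assumptions, not details taken from the indexing scripts:

```python
from pathlib import Path

# Extensions assumed to match what the indexer treats as source files.
SOURCE_EXTS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".php", ".cs"}

def count_source_files(repo_root: str) -> int:
    """Recursively count source files under a repository checkout."""
    root = Path(repo_root)
    return sum(1 for p in root.rglob("*") if p.is_file() and p.suffix in SOURCE_EXTS)

if __name__ == "__main__":
    # Example: estimate the TypeScript repo before committing to indexing it.
    print(count_source_files("test_repos/typescript/TypeScript"))
```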
## SQL Indexing Status
- **17 out of 29 repositories** have SQL indexes
- **Total documents**: ~220,000
- SQL indexing appears complete, with no artificial limits applied (a verification sketch follows)
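One way to sanity-check these totals is to query the store directly. The sketch below assumes a SQLite backend with a `documents` table keyed by a `repo` column; both the database path and the schema are assumptions, since neither is shown here:

```python
import sqlite3

# Hypothetical store: documents(repo TEXT, path TEXT, content TEXT, ...).
# Substitute the real database path and table/column names.
conn = sqlite3.connect("index.db")
rows = conn.execute(
    "SELECT repo, COUNT(*) FROM documents GROUP BY repo ORDER BY 2 DESC"
).fetchall()
for repo, n_docs in rows:
    print(f"{repo}: {n_docs} documents")
print("total:", sum(n for _, n in rows))
```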
## Key Findings
1. **Semantic indexing is NOT complete** - only 9/29 repositories have embeddings
2. **SQL indexing has no limits** - uses `recursive=True` without file restrictions
3. **Semantic indexing challenges**:
   - Time-intensive (Redis took 6 minutes for 766 files)
   - API costs (Voyage AI charges per token)
   - Large repositories cause timeouts (a batching sketch follows this list)
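Cost and timeout pressure both come down to how many API calls are made and how much text each carries. Below is a minimal batching sketch assuming the `voyageai` Python client and the `voyage-code-2` model; both are assumptions about what the indexing scripts actually use:

```python
import os
import voyageai

MODEL = "voyage-code-2"  # assumed model name
BATCH_SIZE = 64          # texts per request; fewer round-trips, lower latency

client = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed text chunks in batches to bound per-request size and latency."""
    embeddings: list[list[float]] = []
    for i in range(0, len(chunks), BATCH_SIZE):
        result = client.embed(chunks[i : i + BATCH_SIZE], model=MODEL)
        embeddings.extend(result.embeddings)
    return embeddings
```

Batching reduces round-trip overhead; since pricing is per token, the total cost is the same either way, but fewer, larger requests finish faster and are less likely to hit per-request timeouts.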
## Scripts Updated
### `index_all_repos_semantic_simple.py`
- ✅ Updated to process entire files (no 2000-line limit)
- ✅ Chunks large files appropriately (see the chunking sketch after this list)
- ✅ Processes test_repos directory directly
- ✅ Handles both SQL-indexed and filesystem sources
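The chunking could work along these lines: split a file into fixed-size line windows with a small overlap so no single chunk grows too large. This is an illustrative sketch, not the script's actual implementation; the window and overlap sizes are assumptions:

```python
def chunk_file(text: str, max_lines: int = 200, overlap: int = 20) -> list[str]:
    """Split a file into overlapping line windows.

    Sizes are illustrative; the real script may chunk by tokens or by
    syntactic units instead.
    """
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return [text]
    chunks = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start : start + max_lines]))
        if start + max_lines >= len(lines):
            break
    return chunks
```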
### New Scripts
- `index_missing_repos_semantic.py` - Identifies repositories that still lack embeddings (sketched below)
- `index_all_repos_semantic_full.py` - Full indexing without limits
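The core of `index_missing_repos_semantic.py` is presumably a set difference between repositories that have a SQL index and those that already have embeddings. A sketch under that assumption (the two input sets are placeholders for whatever the real script queries):

```python
def find_missing_repos(sql_indexed: set[str], embedded: set[str]) -> list[str]:
    """Repositories with a SQL index but no semantic embeddings yet."""
    return sorted(sql_indexed - embedded)

# Hypothetical usage; in the real script these sets would come from the
# SQL store and the vector store respectively.
missing = find_missing_repos(
    sql_indexed={"c_redis", "go_gin", "python_flask"},
    embedded={"c_redis", "go_gin"},
)
print(missing)  # -> ['python_flask']
```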
## Recommendations
1. **Complete small/medium repos first** - Run the updated script for the 13 ready repositories
2. **Handle large repos separately** - Create dedicated scripts (a checkpointing sketch follows this list) with:
- Resume capability
- Progress tracking
- Batch processing
3. **Consider partial indexing** for very large repos:
- Index only key directories
- Sample files strategically
- Focus on most-used code
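Resume capability can be as simple as persisting the set of already-embedded files and skipping them on restart. A minimal sketch; the checkpoint location and JSON format are assumptions:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/django.json")  # hypothetical location

def load_done() -> set[str]:
    """Return the file paths already embedded by a previous run."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(done: set[str], path: str) -> None:
    """Record one more finished file so an interrupted run can resume."""
    done.add(path)
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(sorted(done)))
```

Rewriting the whole checkpoint after every file is O(n) per write, which is fine at a few thousand files; an append-only log would scale better for the ~50,000-file TypeScript repo.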
## Next Steps
To complete semantic indexing:
```bash
# Load environment variables (e.g. the Voyage AI API key) from .env
export $(grep -v '^#' .env | xargs)

# Index the 13 small/medium repositories
python scripts/index_all_repos_semantic_simple.py
# For large repositories, use dedicated sessions with a purpose-built script
# (index_large_repo.py is the proposed script from the recommendations above,
# not yet written):
python scripts/index_large_repo.py --repo django --batch-size 100
```