# Embeddings Searcher for Claude Code Documentation
A focused embeddings-based search system for navigating markdown documentation in code repositories.
## Features
- **Semantic Search**: Uses sentence transformers to find relevant documentation based on meaning, not just keywords
- **Markdown-Focused**: Optimized for markdown documentation with intelligent chunking
- **Repository-Aware**: Organizes and searches across multiple repositories
- **MCP Integration**: Provides an MCP server for use with Claude Code and other MCP clients such as Cursor
- **UV Package Management**: Uses UV for fast dependency management
## Quick Start for Claude Code
### 1. Clone and setup
```bash
git clone <this-repo>
cd kb
uv sync
```
### 2. Add your documentation
Place your documentation repositories in the `repos/` directory.
### 3. Index your documentation
```bash
uv run python embeddings_searcher.py --index
```
### 4. (Optional) Convert to ONNX for faster inference
```bash
uv run python onnx_convert.py --convert --test
```
### 5. Add MCP server to Claude Code
```bash
claude mcp add documentation-searcher -- uv run --directory /absolute/path/to/kb python mcp_server.py
```
*Replace `/absolute/path/to/kb` with the actual path to your project directory.*
### 6. Use in Claude Code
Ask Claude Code questions like:
- "Search for authentication patterns"
- "Find API documentation"
- "Look up configuration options"
The MCP server will automatically search through your indexed documentation and return relevant results.
## Command-Line Usage
### 1. Index Documentation
First, index all markdown documentation in your repositories:
```bash
uv run python embeddings_searcher.py --index
```
This will:
- Find all `.md`, `.markdown`, and `.txt` files in the `repos/` directory
- Chunk them intelligently based on markdown structure
- Generate embeddings using sentence transformers
- Store everything in a SQLite database (see the schema sketch below)
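The database layout is defined in `embeddings_searcher.py`; as a rough illustration, the stored data amounts to something like the following. The table and column names here are hypothetical:
```python
import sqlite3

# Hypothetical schema for illustration only -- the real tables are
# created by embeddings_searcher.py and may differ.
conn = sqlite3.connect("embeddings_docs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        INTEGER PRIMARY KEY,
        repo      TEXT NOT NULL,   -- e.g. "my-project.git"
        path      TEXT NOT NULL,   -- file path relative to repos/
        heading   TEXT,            -- nearest enclosing markdown header
        content   TEXT NOT NULL,   -- chunk text
        embedding BLOB NOT NULL    -- float32 vector, serialized
    )
""")
conn.commit()
```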
### 2. Search Documentation
```bash
# Basic search
uv run python embeddings_searcher.py --query "API documentation"
# Search within a specific repository
uv run python embeddings_searcher.py --query "authentication" --repo "my-project.git"
# Limit results and set similarity threshold
uv run python embeddings_searcher.py --query "configuration" --max-results 5 --min-similarity 0.2
```
### 3. Get Statistics
```bash
# Show indexing statistics
uv run python embeddings_searcher.py --stats
# List indexed repositories
uv run python embeddings_searcher.py --list-repos
```
## MCP Server Integration
The project includes an MCP server for integration with Claude Code and other MCP clients such as Cursor:
```bash
# Start the MCP server
uv run python mcp_server.py
```
### MCP Tools Available
1. **search_docs**: Search through documentation using semantic similarity
2. **list_repos**: List all indexed repositories
3. **get_stats**: Get indexing statistics
4. **get_document**: Retrieve full document content by path
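Under the hood, an MCP client invokes these tools with JSON-RPC `tools/call` requests. The request shape below is standard MCP; the argument names (`query`, `max_results`) are assumptions, so check `mcp_server.py` for the exact tool schemas:
```python
# Illustrative MCP "tools/call" request sent over JSON-RPC by the client.
# The argument names below are assumptions; the authoritative schemas are
# whatever mcp_server.py registers for each tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "user authentication", "max_results": 5},
    },
}
```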
## Project Structure
```
kb/
├── embeddings_searcher.py # Main searcher implementation
├── mcp_server.py # MCP server for Claude Code integration
├── onnx_convert.py # ONNX model conversion utility
├── pyproject.toml # UV project configuration
├── embeddings_docs.db # SQLite database with embeddings
├── sentence_model.onnx # ONNX model (generated)
├── model_config.json # Model configuration (generated)
├── tokenizer/ # Tokenizer files (generated)
├── repos/ # Your documentation repositories
│ ├── project1.git/
│ ├── project2.git/
│ └── documentation.git/
└── README.md # This file
```
## How It Works
### Intelligent Chunking
The system chunks markdown documents based on:
- Header structure (H1, H2, H3, etc.)
- Content length (500 words per chunk)
- Semantic boundaries
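A minimal sketch of this strategy, treating the 500-word figure as a per-chunk cap (the real chunker in `embeddings_searcher.py` also tracks header hierarchy and semantic boundaries):
```python
def chunk_markdown(text: str, max_words: int = 500) -> list[str]:
    """Split markdown into chunks at header boundaries, capping size."""
    chunks: list[str] = []
    current: list[str] = []
    words = 0
    for line in text.splitlines():
        # Flush the running chunk at each header or once the cap is hit.
        if current and (line.startswith("#") or words >= max_words):
            chunks.append("\n".join(current))
            current, words = [], 0
        current.append(line)
        words += len(line.split())
    if current:
        chunks.append("\n".join(current))
    return chunks
```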
### Embedding Generation
- Uses the `all-MiniLM-L6-v2` sentence transformer model by default
- Supports ONNX models for faster inference
- Caches embeddings for efficient updates
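In sentence-transformers terms, the core of this step looks like the following sketch; `all-MiniLM-L6-v2` produces 384-dimensional vectors:
```python
from sentence_transformers import SentenceTransformer

# The default model; override with --model. Weights are downloaded
# on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() returns one 384-dimensional float32 vector per input text.
embeddings = model.encode(["chunk one text", "chunk two text"])
print(embeddings.shape)  # (2, 384)
```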
### Search Algorithm
1. Generates embedding for your query
2. Compares against all document chunks using cosine similarity
3. Returns ranked results with context and metadata
4. Supports repository-specific searches
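A sketch of the ranking step with NumPy (the actual searcher loads the stored chunk embeddings from the SQLite database):
```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray,
          k: int = 10, min_similarity: float = 0.1) -> list[tuple[int, float]]:
    """Rank chunk embeddings against a query vector by cosine similarity."""
    # Normalize so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    best = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in best if sims[i] >= min_similarity]
```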
## CLI Options
### embeddings_searcher.py
```bash
# Indexing
--index # Index all repositories
--force # Force reindex of all documents
# Search
--query "search terms" # Search query
--repo "repo-name" # Search within specific repository
--max-results 10 # Maximum results to return
--min-similarity 0.1 # Minimum similarity threshold
# Information
--stats # Show indexing statistics
--list-repos # List indexed repositories
# Configuration
--kb-path /path/to/kb # Path to knowledge base
--db-path embeddings.db # Path to embeddings database
--model model-name # Sentence transformer model
--ignore-dirs [DIRS...] # Directories to ignore during indexing
```
### mcp_server.py
```bash
--kb-path /path/to/kb # Path to knowledge base
--docs-db-path embeddings.db # Path to docs embeddings database
--model model-name # Sentence transformer model
```
## ONNX Model Conversion
For faster inference, you can convert the sentence transformer model to ONNX format:
```bash
# Convert model to ONNX
uv run python onnx_convert.py --convert
# Test ONNX model
uv run python onnx_convert.py --test
# Convert and test in one command
uv run python onnx_convert.py --convert --test
```
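Once converted, inference can run through `onnxruntime`. A sketch, assuming the exported model's first output is the token-embedding tensor; the actual input and output names are recorded in `model_config.json`:
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Both artifacts are generated by onnx_convert.py.
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
session = ort.InferenceSession("sentence_model.onnx")

encoded = tokenizer(["example query"], padding=True, truncation=True,
                    return_tensors="np")
# Feed only the inputs the exported graph actually declares.
input_names = {i.name for i in session.get_inputs()}
feed = {k: v for k, v in encoded.items() if k in input_names}
token_embeddings = session.run(None, feed)[0]  # assumed first output

# Mean-pool token embeddings into one sentence vector, ignoring padding.
mask = encoded["attention_mask"][..., None]
sentence_vec = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
```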
## Example Usage
```bash
# Index documentation
uv run python embeddings_searcher.py --index
# Search for API documentation
uv run python embeddings_searcher.py --query "API endpoints"
# Search for authentication in specific repository
uv run python embeddings_searcher.py --query "user authentication" --repo "my-project.git"
# Get detailed statistics
uv run python embeddings_searcher.py --stats
```
## Performance
- **Indexing**: ~1400 documents in ~1 minute
- **Search**: Sub-second response times
- **Storage**: ~50MB for embeddings database with 6500+ chunks
- **Memory**: ~500MB during indexing, ~200MB during search
## Troubleshooting
### Unicode Errors
Some files may have encoding issues. The system automatically falls back to latin-1 encoding for problematic files.
### Large Files
Files larger than 1MB are automatically skipped to prevent memory issues.
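Taken together, those two safeguards behave roughly like this sketch (`MAX_FILE_SIZE` and `read_doc` are illustrative names, not the actual identifiers):
```python
from pathlib import Path

MAX_FILE_SIZE = 1024 * 1024  # files above 1 MB are skipped

def read_doc(path: Path) -> str | None:
    """Read a documentation file, or return None if it is too large."""
    if path.stat().st_size > MAX_FILE_SIZE:
        return None  # keep memory use bounded
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte, so this read cannot fail to decode.
        return path.read_text(encoding="latin-1")
```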
### Model Loading
If sentence-transformers is not available, the system will attempt to use ONNX models or fall back to dummy embeddings for testing.