Semantic Search MCP Server

ARCHITECTURE.md•6.48 KiB

# Architecture: codesight

> Guided tour of the codebase. WHY things are built, not just WHAT.
>
> **Last Updated:** 2026-02-24

---

## System Overview

```
Claude Code (user)
    │
    │  MCP JSON-RPC over STDIO
    ▼
semantic_search_mcp server (src/semantic_search_mcp/)
    │
    ├── search(query) ────────────────────────────┐
    │       │                                     │
    │       ├──► BM25 via SQLite FTS5 → top 20    │
    │       └──► Vector via LanceDB   → top 20    │
    │       │                                     │
    │   RRF merge                                 │
    │       │                                     │
    │   top K chunks ◄────────────────────────────┘
    │
    ├── index(repo_path)
    │       ├── Walk files (.gitignore-aware)
    │       ├── Chunk (language-aware regex)
    │       ├── Embed (sentence-transformers)
    │       └── Store (LanceDB + FTS5 sidecar)
    │
    └── status(repo_path) → freshness check

Storage: ~/.semantic-search/data/<repo_hash>/
    ├── lancedb/   (vectors)
    └── fts.db     (SQLite with FTS5)
```

---

## Source Layout (`src/semantic_search_mcp/`)

| File            | Purpose                                                                  |
| --------------- | ------------------------------------------------------------------------ |
| `server.py`     | FastMCP entry point. Registers the 3 MCP tools.                          |
| `indexer.py`    | Orchestrates the index pipeline: walk → chunk → embed → store.           |
| `search.py`     | Hybrid retrieval: BM25 + vector → RRF merge → return chunks.             |
| `chunker.py`    | Language-aware regex chunking. Prepends context headers to chunks.       |
| `embeddings.py` | sentence-transformers wrapper. Handles model loading + MPS acceleration. |
| `store.py`      | LanceDB + SQLite FTS5 dual-write. Content hash deduplication.            |
| `config.py`     | Pydantic settings from env vars. Single config object across the app.    |
| `git_utils.py`  | .gitignore-aware file walking via `pathspec`.                            |
| `types.py`      | Shared Pydantic models (Chunk, SearchResult, IndexStatus).               |
| `__main__.py`   | `python -m semantic_search_mcp` entry point.                             |

**Do not add modules at the top level.** New capabilities go inside existing modules or as new submodules imported from `server.py`.

---

## The 5 MCP Tools

Three tools (`search`, `index`, `status`) are registered today; `watch` and `unwatch` are planned for v0.3.

```python
search(query, repo_path?, top_k?, file_glob?) → list[SearchResult]
index(repo_path?, force_rebuild?)             → IndexStatus
status(repo_path?)                            → IndexStatus
watch(repo_path?)                             → None  # planned v0.3
unwatch(repo_path?)                           → None  # planned v0.3
```

Tool signatures are the public API contract. **Never change them without a spec and human approval.** Claude Code caches tool definitions, so a signature change breaks active sessions.

---

## Retrieval Pipeline (The Core)

```
query string
    │
    ├──────────────────────────────────────────────┐
    │                                              │
    ▼                                              ▼
SQLite FTS5                                    LanceDB
BM25 keyword matching                          vector similarity
(exact function names,                         (semantic meaning,
 error codes, literals)                         concept proximity)
    │                                              │
    └──────────────┬───────────────────────────────┘
                   ▼
        Reciprocal Rank Fusion
        score = Σ 1/(k + rank_i)  where k=60
                   │
                   ▼
        top K chunks (with file path + line range)
```

**Why hybrid matters:** Pure vector search misses exact keyword matches (function names, error codes). Pure BM25 misses semantic synonyms. RRF merges both with zero extra infrastructure: SQLite FTS5 is available through Python's built-in `sqlite3` module wherever the underlying SQLite library was compiled with FTS5.

---

## Storage Layout

All indexes live in `~/.semantic-search/data/` (outside the indexed repo; never write inside it).

```
~/.semantic-search/data/
└── <sha256(repo_path)[:16]>/
    ├── lancedb/              ← LanceDB vector tables
    │   └── chunks.lance      ← chunk_id, embedding, metadata
    └── fts.db                ← SQLite with FTS5 virtual table
        ├── chunks            ← chunk_id, content, file_path, lines
        └── chunks_fts        ← FTS5 index (auto-synced via triggers)
```

**Content hashing:** Each chunk is hashed with `sha256(content)[:16]`. On re-index, unchanged chunks are skipped entirely: no re-embedding, no write.

---

## Chunking Strategy

Language-aware regex splits on scope boundaries (class/function/block) for 10 languages: Python, JS, TS, Go, Rust, Java, Ruby, PHP, C, C++. Unknown languages fall back to a sliding window with overlap.

Each chunk gets a context header prepended before embedding:

```
# File: src/auth/jwt.py
# Scope: function validate_token
# Lines: 45-82
```

**Why context headers:** The embedding model needs to know WHERE a chunk lives, not just what it says. Stripping the header gives the user the raw source; keeping it improves retrieval relevance.

---

## Embedding Model

Default: `all-MiniLM-L6-v2` (384 dims, fast, no API key). Configurable via the `CODESIGHT_EMBEDDING_MODEL` env var. A stronger option for code: `jina-embeddings-v2-base-code` (768 dims, code-specific).

**Model mismatch guard:** If a repo was indexed with model A and the current model is B, the server detects the dimension mismatch and forces a full rebuild.

---

## What NOT to Change Without Discussion

1. **RRF k=60 constant**: changing this shifts the recall/precision tradeoff. Benchmark before changing.
2. **Data directory location** (`~/.semantic-search/data/`): changing this invalidates all existing user indexes.
3. **MCP tool signatures**: Claude Code caches them. A breaking change requires a version bump.
4. **Content hash algorithm**: changing from `sha256[:16]` invalidates all deduplication state.
5. **FTS5 trigger schema**: the SQLite triggers that sync FTS5 from the chunks table. Incorrect triggers cause silent search failures.
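For anyone benchmarking the RRF constant named in item 1, the merge step can be sketched in a few lines. This is an illustrative sketch, not the code in `search.py`: the function name `rrf_merge`, the 1-based rank convention, the default `top_k`, and the sample chunk IDs are all assumptions made for the example. Only the formula `score = Σ 1/(k + rank_i)` with `k=60` comes from the notes above.

```python
from collections import defaultdict

RRF_K = 60  # the k constant stated in the architecture notes


def rrf_merge(rankings, top_k=10, k=RRF_K):
    """Fuse best-first ranked lists of chunk IDs with Reciprocal Rank Fusion.

    A chunk's score is the sum of 1 / (k + rank) over every list it
    appears in. Ranks start at 1 here, which is an assumption.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


# Hypothetical result lists standing in for the FTS5 and vector retrievers.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
merged = rrf_merge([bm25_hits, vector_hits], top_k=3)
print([chunk_id for chunk_id, _ in merged])  # ['b', 'a', 'd']
```

Chunks that appear in both lists (`a` and `b`) outrank chunks found by only one retriever, which is the behavior the hybrid design relies on. A larger `k` flattens the difference between adjacent ranks, so changing it alters how much a top position in one list can outweigh agreement across lists.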



