# Semantic Search Roadmap
Technical roadmap for implementing production-grade semantic search and RAG capabilities in `doc-agent`.
---
## Architecture
### Vector Store: Provider Pattern
Support multiple vector store backends via a common interface. The vector store is decoupled from chunk storage—it only knows about IDs and embeddings.
```typescript
interface VectorStoreItem {
id: string; // maps to Chunk.id
embedding: number[];
metadata?: Record<string, unknown>; // for filtering
}
interface VectorStoreResult {
id: string;
score: number;
}
interface VectorStore {
name: string;
insert(items: VectorStoreItem[]): Promise<void>;
search(
queryEmbedding: number[],
topK: number,
filters?: Record<string, unknown>
): Promise<VectorStoreResult[]>;
delete(ids: string[]): Promise<void>;
}
```
Implementations:
- `CustomVectorStore` — brute-force cosine similarity in Phase 1, upgraded to HNSW in Phase 2
- `LanceDBVectorStore` — baseline for comparison benchmarks
The search orchestrator hydrates results by joining `VectorStoreResult.id` against the `chunks` table.
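A minimal sketch of the Phase 1 `CustomVectorStore`: brute-force cosine similarity over an in-memory array. The `cosine` helper and the exact-match filter semantics are illustrative assumptions; Phase 2 replaces the linear scan with HNSW.
```typescript
// Brute-force VectorStore sketch (Phase 1): linear scan + cosine similarity.
class CustomVectorStore implements VectorStore {
  name = 'custom';
  private items: VectorStoreItem[] = [];

  async insert(items: VectorStoreItem[]): Promise<void> {
    this.items.push(...items);
  }

  async search(
    queryEmbedding: number[],
    topK: number,
    filters?: Record<string, unknown>
  ): Promise<VectorStoreResult[]> {
    // Exact-match metadata filtering (assumed semantics).
    const candidates = filters
      ? this.items.filter((it) =>
          Object.entries(filters).every(([k, v]) => it.metadata?.[k] === v)
        )
      : this.items;
    return candidates
      .map((it) => ({ id: it.id, score: cosine(queryEmbedding, it.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }

  async delete(ids: string[]): Promise<void> {
    const drop = new Set(ids);
    this.items = this.items.filter((it) => !drop.has(it.id));
  }
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1); // guard zero vectors
}
```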
### Chunking Strategies
| Strategy | Flag | Implementation | Best For |
|----------|------|----------------|----------|
| **Line** | `--chunk line` | Split on `\n`, group runs between blank lines | Receipts, invoices |
| **Sentence** | `--chunk sentence` | NLP tokenizer | Natural text |
| **Semantic** | `--chunk semantic` | LLM-assisted boundary detection | Contracts, reports |
Auto-routing by document type:
- Receipts/invoices → `line`
- Bank statements → `line` or `sentence`
- Contracts/reports → `semantic`
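As a concrete example, a sketch of the `line` strategy implementing the `Chunker` interface from the Types Reference: runs of non-empty lines form one chunk, blank lines are boundaries. The `${documentId}:${index}` ID scheme is an assumption, not fixed by this roadmap.
```typescript
// `line` chunker sketch: blank lines delimit chunks.
const lineChunker: Chunker = {
  strategy: 'line',
  chunk(text, documentId, metadata = {}) {
    const chunks: Chunk[] = [];
    let buffer: string[] = [];
    const flush = () => {
      if (buffer.length === 0) return;
      const index = chunks.length;
      chunks.push({
        id: `${documentId}:${index}`, // illustrative id scheme
        documentId,
        content: buffer.join('\n'),
        index,
        metadata: {
          ...metadata,
          source: String(metadata.source ?? documentId),
        } as Chunk['metadata'],
      });
      buffer = [];
    };
    for (const line of text.split('\n')) {
      if (line.trim() === '') flush(); // blank line closes the current chunk
      else buffer.push(line);
    }
    flush(); // trailing chunk
    return chunks;
  },
};
```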
### Embedding Providers
```typescript
interface EmbeddingProvider {
name: string;
dims: number;
embed(texts: string[]): Promise<number[][]>;
}
```
| Provider | Models | Notes |
|----------|--------|-------|
| **Ollama** (default) | `nomic-embed-text`, `mxbai-embed-large` | Local, no API key |
| **OpenAI** | `text-embedding-3-small` | High quality |
| **Gemini** | `text-embedding-005`, `text-multilingual-embedding-002` | Multilingual support |
| **Transformers.js** | Local ONNX | Zero external deps |
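A hedged sketch of the default provider, assuming Ollama's batch embedding endpoint `POST /api/embed` (body `{ model, input }`, response `{ embeddings: number[][] }`) and 768 dimensions for `nomic-embed-text`:
```typescript
// Default Ollama embedding provider sketch.
class OllamaEmbeddingProvider implements EmbeddingProvider {
  name = 'ollama';
  dims = 768; // assumed for nomic-embed-text

  constructor(
    private model = 'nomic-embed-text',
    private baseUrl = 'http://localhost:11434'
  ) {}

  async embed(texts: string[]): Promise<number[][]> {
    const res = await fetch(`${this.baseUrl}/api/embed`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: this.model, input: texts }),
    });
    if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
    const { embeddings } = (await res.json()) as { embeddings: number[][] };
    return embeddings;
  }
}
```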
### LLM Providers
```typescript
interface LLMProvider {
name: string;
generate(prompt: string, options?: { system?: string }): Promise<string>;
}
```
| Provider | Models | Notes |
|----------|--------|-------|
| **Ollama** (default) | `llama3.2`, `mistral` | Local |
| **OpenAI** | `gpt-4o-mini` | High quality |
| **Gemini** | `gemini-1.5-flash` | Fast |
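A matching sketch for the default LLM provider, assuming Ollama's non-streaming `POST /api/generate` endpoint (response field `response`):
```typescript
// Default Ollama LLM provider sketch.
class OllamaLLMProvider implements LLMProvider {
  name = 'ollama';

  constructor(
    private model = 'llama3.2',
    private baseUrl = 'http://localhost:11434'
  ) {}

  async generate(prompt: string, options?: { system?: string }): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: this.model,
        prompt,
        system: options?.system,
        stream: false, // single JSON response instead of a token stream
      }),
    });
    if (!res.ok) throw new Error(`Ollama generate failed: ${res.status}`);
    const { response } = (await res.json()) as { response: string };
    return response;
  }
}
```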
### Storage Model
```sql
CREATE TABLE chunks (
id TEXT PRIMARY KEY,
document_id INTEGER REFERENCES documents(id),
content TEXT NOT NULL,
metadata JSON,
chunk_index INTEGER NOT NULL
);
```
**Embedding storage:**
- **Phase 1:** SQLite BLOB (brute-force search; round-trip sketched below)
- **Phase 2+:** Vector store's native format (HNSW memory-mapped files)
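A sketch of the Phase 1 BLOB round-trip, storing each embedding as raw `Float32Array` bytes (platform-endian, i.e. little-endian on typical hardware):
```typescript
// Serialize an embedding to a SQLite BLOB and back.
function embeddingToBlob(embedding: number[]): Buffer {
  return Buffer.from(new Float32Array(embedding).buffer);
}

function blobToEmbedding(blob: Buffer): number[] {
  return Array.from(
    new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4)
  );
}
```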
### Hybrid Search
An FTS5 external-content index over `chunks.content` sits alongside vector search. Because `chunks.id` is TEXT (not a rowid alias), the FTS table keys on the implicit rowid of `chunks`:
```sql
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  content,
  content='chunks'
);
```
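A sketch of Phase 2's `bm25Search()`, assuming `better-sqlite3` as the SQLite driver. FTS5's built-in `bm25()` ranks better matches with numerically lower values, so it is negated into a higher-is-better keyword score and joined back to `chunks.id` through the shared rowid:
```typescript
import Database from 'better-sqlite3';

// Keyword search via FTS5 BM25; results map back to chunks.id.
function bm25Search(
  db: Database.Database,
  query: string,
  topK: number
): { id: string; score: number }[] {
  return db
    .prepare(
      `SELECT c.id, -bm25(chunks_fts) AS score
       FROM chunks_fts
       JOIN chunks c ON c.rowid = chunks_fts.rowid
       WHERE chunks_fts MATCH ?
       ORDER BY bm25(chunks_fts)
       LIMIT ?`
    )
    .all(query, topK) as { id: string; score: number }[];
}
```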
Both retrieval paths therefore resolve to `chunk.id` — FTS5 via the rowid join, vector search directly — enabling fusion:
```typescript
interface HybridSearchResult {
chunk: Chunk;
vectorScore?: number;
keywordScore?: number;
combinedScore: number;
ranks: {
vectorRank?: number;
keywordRank?: number;
};
}
```
Search modes:
- `--mode vector` — Cosine similarity only
- `--mode keyword` — BM25 only
- `--mode hybrid` — RRF fusion (sketched below)
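A sketch of `rrfFusion()`: each chunk scores `1 / (k + rank)` summed across the ranked lists it appears in, with the conventional `k = 60` from the RRF paper:
```typescript
// Reciprocal Rank Fusion over two best-first id lists.
function rrfFusion(
  vectorIds: string[],
  keywordIds: string[],
  k = 60
): { id: string; combinedScore: number }[] {
  const scores = new Map<string, number>();
  for (const ids of [vectorIds, keywordIds]) {
    ids.forEach((id, i) => {
      // i is 0-based, so rank = i + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, combinedScore]) => ({ id, combinedScore }))
    .sort((a, b) => b.combinedScore - a.combinedScore);
}
```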
### Reranking
```typescript
interface Reranker {
rerank(query: string, candidates: ScoredChunk[]): Promise<ScoredChunk[]>;
}
```
The reranker receives already-scored candidates rather than bare text, preserving retrieval scores for debugging and score blending.
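A hedged sketch of an LLM-based reranker (Phase 3's `rerank/ollama.ts`); the 0-10 scoring prompt and the 50/50 blend with `combinedScore` are illustrative choices, not fixed by this roadmap:
```typescript
// LLM-as-judge reranker sketch: rescore each candidate, blend, re-sort.
class OllamaReranker implements Reranker {
  constructor(private llm: LLMProvider) {}

  async rerank(query: string, candidates: ScoredChunk[]): Promise<ScoredChunk[]> {
    const rescored = await Promise.all(
      candidates.map(async (c) => {
        const raw = await this.llm.generate(
          `Rate 0-10 how relevant this passage is to the query.\n` +
            `Query: ${query}\nPassage: ${c.chunk.content}\n` +
            `Answer with a single number.`
        );
        const llmScore = Number.parseFloat(raw) / 10 || 0; // NaN -> 0
        // Keep retrieval scores on the chunk; blend in the LLM judgment.
        return { ...c, combinedScore: 0.5 * c.combinedScore + 0.5 * llmScore };
      })
    );
    return rescored.sort((a, b) => b.combinedScore - a.combinedScore);
  }
}
```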
### RAG Pipeline
```typescript
interface RAGResponse {
answer: string;
chunks: RAGChunk[];
debug?: {
vectorResults: ScoredChunk[];
keywordResults: ScoredChunk[];
rerankedResults: ScoredChunk[];
stats: {
vectorLatencyMs: number;
keywordLatencyMs?: number;
rerankLatencyMs?: number;
totalLatencyMs: number;
};
};
}
```
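A sketch of how `runRAG()` might wire retrieval into generation: retrieve, build a numbered context block, and ask the LLM to answer with `[n]` citations. The prompt wording and the injected `search` callback (standing in for the hybrid search orchestrator) are assumptions; debug and timing capture are elided.
```typescript
// RAG engine sketch: retrieval -> numbered context -> cited answer.
async function runRAG(
  req: RAGRequest,
  search: (query: string, topK: number) => Promise<ScoredChunk[]>,
  llm: LLMProvider
): Promise<RAGResponse> {
  const scored = await search(req.query, req.topK ?? 5);
  const context = scored
    .map((s, i) => `[${i + 1}] ${s.chunk.content}`)
    .join('\n\n');
  const answer = await llm.generate(
    `Context:\n${context}\n\nQuestion: ${req.query}`,
    { system: 'Answer using only the context. Cite passages as [n].' }
  );
  return {
    answer,
    chunks: scored.map((s) => ({
      id: s.chunk.id,
      content: s.chunk.content,
      score: s.combinedScore,
      source: {
        documentId: s.chunk.documentId,
        filename: s.chunk.metadata.source,
        page: s.chunk.metadata.page,
      },
    })),
  };
}
```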
Exposed via:
1. CLI: `doc search "query" --rag`
2. MCP: `search_documents` tool
3. HTTP: `POST /rag` (optional)
### Evaluation
```typescript
interface EvalQuery {
id: string;
query: string;
relevantChunkIds: string[];
category?: string;
}
interface EvalDataset {
name: string;
chunks: Chunk[];
queries: EvalQuery[];
}
interface EvalResult {
recallAtK: Record<number, number>;
precisionAtK: Record<number, number>;
mrr: number;
byCategory?: Record<string, EvalResult>;
}
```
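Sketches of the two core metrics, assuming `retrieved` lists are ordered best-first and `relevant` comes from `EvalQuery.relevantChunkIds`:
```typescript
// recall@k: fraction of relevant chunks found in the top k results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  return relevant.filter((id) => topK.has(id)).length / relevant.length;
}

// MRR: mean of 1 / (rank of first relevant chunk) across queries;
// a query contributes 0 when nothing relevant is retrieved.
function mrr(runs: { retrieved: string[]; relevant: string[] }[]): number {
  let total = 0;
  for (const { retrieved, relevant } of runs) {
    const hits = new Set(relevant);
    const rank = retrieved.findIndex((id) => hits.has(id));
    if (rank >= 0) total += 1 / (rank + 1);
  }
  return runs.length === 0 ? 0 : total / runs.length;
}
```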
---
## Phase 1: Vector Search Core
### Scope
- Chunking module (`line`, `sentence`)
- Embedding provider abstraction + Ollama implementation
- Custom vector store with brute-force cosine similarity
- `chunks` table in SQLite
- CLI: `doc ingest` and `doc search`
- Evaluation harness
### File Structure
```
packages/vector-store/src/
├── chunking/
│ ├── types.ts
│ ├── line.ts
│ └── sentence.ts
├── embeddings/
│ ├── types.ts
│ └── ollama.ts
├── stores/
│ ├── types.ts
│ └── custom.ts
├── eval/
│ ├── types.ts
│ ├── dataset.ts
│ └── metrics.ts
├── search.ts
└── index.ts
```
### Deliverables
- [ ] `Chunk` and `ChunkingStrategy` types
- [ ] Line chunker
- [ ] Sentence chunker
- [ ] `EmbeddingProvider` interface
- [ ] Ollama embedding provider
- [ ] `VectorStore` interface
- [ ] Brute-force cosine similarity
- [ ] `chunks` schema migration
- [ ] Search orchestrator
- [ ] `doc ingest <file>` command
- [ ] `doc search <query>` command
- [ ] Evaluation dataset
- [ ] `doc eval` command
### Benchmarks
- Chunk size vs recall@k
- Embedding latency by provider
---
## Phase 2: Hybrid Search
### Scope
- FTS5 integration for keyword search
- BM25 scoring
- Reciprocal Rank Fusion (RRF)
- HNSW index
- Metadata filtering
### File Structure Additions
```
packages/vector-store/src/
├── ranking/
│ ├── bm25.ts
│ ├── rrf.ts
│ └── hybrid.ts
├── stores/
│ └── hnsw.ts
```
### Deliverables
- [ ] FTS5 virtual table + sync triggers
- [ ] `bm25Search()` function
- [ ] `rrfFusion()` function
- [ ] `HybridSearchResult` type
- [ ] `hybridSearch()` orchestrator
- [ ] `--mode vector | keyword | hybrid` flag
- [ ] HNSW vector store
- [ ] `--filter` metadata filtering
### Benchmarks
- Vector vs keyword vs hybrid recall
- HNSW accuracy vs brute-force
- HNSW latency vs `ef` parameter
- Custom vs LanceDB comparison
---
## Phase 3: RAG & Evaluation
### Scope
- LLM provider abstraction
- Reranking
- RAG engine with citations
- MCP tool integration
- Provider comparison
### File Structure Additions
```
packages/vector-store/src/
├── llm/
│ ├── types.ts
│ └── ollama.ts
├── rerank/
│ ├── types.ts
│ └── ollama.ts
├── rag/
│ ├── types.ts
│ ├── engine.ts
│ └── prompts.ts
```
### Deliverables
- [ ] `LLMProvider` interface
- [ ] Ollama LLM provider
- [ ] `Reranker` interface
- [ ] Ollama reranker
- [ ] `runRAG()` engine
- [ ] RAG prompt templates
- [ ] `doc search --rag` command
- [ ] MCP `search_documents` tool
- [ ] Provider comparison report
### Benchmarks
- Reranking impact on precision
- Context window size vs answer quality
- Embedding provider comparison (recall, latency)
---
## Future
- [ ] HTTP server (`POST /rag`)
- [ ] Search debugging UI
- [ ] OpenAI / Gemini providers
- [ ] Transformers.js embeddings
- [ ] Semantic chunking
- [ ] Index persistence
- [ ] Embeddings versioning
- [ ] Query caching
- [ ] Multi-modal search
---
## Types Reference
```typescript
// ─────────────────────────────────────────────────────────────
// Chunking
// ─────────────────────────────────────────────────────────────
interface Chunk {
id: string;
documentId: string;
content: string;
index: number;
metadata: {
page?: number;
section?: string;
source: string;
[key: string]: unknown;
};
}
type ChunkingStrategy = 'line' | 'sentence' | 'semantic';
interface Chunker {
strategy: ChunkingStrategy;
chunk(text: string, documentId: string, metadata?: Record<string, unknown>): Chunk[];
}
// ─────────────────────────────────────────────────────────────
// Embeddings
// ─────────────────────────────────────────────────────────────
interface EmbeddingProvider {
name: string;
dims: number;
embed(texts: string[]): Promise<number[][]>;
}
// ─────────────────────────────────────────────────────────────
// Vector Store
// ─────────────────────────────────────────────────────────────
interface VectorStoreItem {
id: string;
embedding: number[];
metadata?: Record<string, unknown>;
}
interface VectorStoreResult {
id: string;
score: number;
}
interface VectorStore {
name: string;
insert(items: VectorStoreItem[]): Promise<void>;
search(
queryEmbedding: number[],
topK: number,
filters?: Record<string, unknown>
): Promise<VectorStoreResult[]>;
delete(ids: string[]): Promise<void>;
}
// ─────────────────────────────────────────────────────────────
// LLM
// ─────────────────────────────────────────────────────────────
interface LLMProvider {
name: string;
generate(prompt: string, options?: { system?: string }): Promise<string>;
}
// ─────────────────────────────────────────────────────────────
// Ranking
// ─────────────────────────────────────────────────────────────
interface ScoredChunk {
chunk: Chunk;
vectorScore?: number;
keywordScore?: number;
combinedScore: number;
}
interface HybridSearchResult extends ScoredChunk {
ranks: {
vectorRank?: number;
keywordRank?: number;
};
}
interface Reranker {
rerank(query: string, candidates: ScoredChunk[]): Promise<ScoredChunk[]>;
}
// ─────────────────────────────────────────────────────────────
// RAG
// ─────────────────────────────────────────────────────────────
interface RAGRequest {
query: string;
topK?: number;
mode?: 'vector' | 'keyword' | 'hybrid';
filters?: Record<string, unknown>;
rerank?: boolean;
}
interface RAGChunk {
id: string;
content: string;
score: number;
source: {
documentId: string;
filename: string;
page?: number;
};
}
interface RAGResponse {
answer: string;
chunks: RAGChunk[];
debug?: {
vectorResults: ScoredChunk[];
keywordResults: ScoredChunk[];
rerankedResults: ScoredChunk[];
stats: {
vectorLatencyMs: number;
keywordLatencyMs?: number;
rerankLatencyMs?: number;
totalLatencyMs: number;
};
};
}
// ─────────────────────────────────────────────────────────────
// Evaluation
// ─────────────────────────────────────────────────────────────
interface EvalQuery {
id: string;
query: string;
relevantChunkIds: string[];
category?: string;
}
interface EvalDataset {
name: string;
description?: string;
chunks: Chunk[];
queries: EvalQuery[];
}
interface EvalResult {
recallAtK: Record<number, number>;
precisionAtK: Record<number, number>;
mrr: number;
byCategory?: Record<string, EvalResult>;
}
```
---
## CLI Reference
```bash
# Ingestion
doc ingest <file>
doc ingest <file> --chunk line|sentence|semantic
doc ingest <file> --embed-provider ollama|openai|gemini|transformers
doc ingest <file> --embed-model <model-name>
# Search
doc search <query>
doc search <query> --mode vector|keyword|hybrid
doc search <query> --vector-store custom|lancedb
doc search <query> --filter "key:value"
doc search <query> --rag
doc search <query> --rerank
doc search <query> --top-k 10
doc search <query> --json
# Evaluation
doc eval --dataset <path>
doc eval --compare ollama,openai,gemini
# Servers
doc mcp
doc serve --port 3000
```
---
## References
- [HNSW Paper](https://arxiv.org/abs/1603.09320) — Hierarchical Navigable Small World graphs
- [BM25 Explained](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
- [RRF Paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) — Reciprocal Rank Fusion
- [RAG Paper](https://arxiv.org/abs/2005.11401) — Retrieval-Augmented Generation