# Sprint 11: Document-Level Embeddings & Keywords
**Status**: In Progress
**Priority**: High
**Dependencies**: Sprint 10 (Semantic Extraction) completed
## Executive Summary
Implement document-level embeddings and keyword extraction that can process documents with 100+ chunks without unbounded memory growth. The system reuses chunk-level keyword embeddings for maximum efficiency while producing high-quality document-level semantic data.
## Problem Statement
Currently, our system only stores chunk-level embeddings and keywords, missing the document-level perspective that's crucial for:
- Document-wide semantic search
- Document summarization
- Better relevance ranking
- Document clustering and categorization
### Key Challenges
1. **Memory Efficiency**: Documents can have 100+ chunks - loading all embeddings causes memory explosion
2. **Keyword Quality**: Need to select the best keywords from thousands of candidates
3. **Embedding Reuse**: Avoid regenerating embeddings we already computed at chunk level
4. **Database Evolution**: Add document-level fields without breaking existing functionality
## Research Findings
### Critical Discoveries
1. **We're already doing MORE work at chunk level**:
- Testing 7,850 keyword candidates (50 per chunk × 157 chunks)
- Document level only needs 150-200 candidates
- This is a 52x reduction in computation!
2. **Keyword embeddings already exist**:
- Generated during chunk processing
- Stored in LRU cache
- Can be reused for document-level scoring
3. **Limited deduplication benefit**:
- Only 16.8% reduction (1,570 → 1,307 unique)
- Most keywords are already unique
- Quality matters more than quantity
4. **Incremental averaging solves the memory problem**:
- Welford's algorithm keeps only a running mean vector, so memory stays constant regardless of chunk count
- Chunks are processed sequentially, one embedding at a time
- The running average is numerically stable
## Technical Design
### Architecture Overview
```
┌─────────────────────────────────────────────────────┐
│                Indexing Orchestrator                │
│ ┌─────────────────────────────────────────────┐     │
│ │            Chunk Processing Loop            │     │
│ │ ┌──────────────────────────────────┐        │     │
│ │ │ 1. Process chunk                 │        │     │
│ │ │ 2. Extract keywords              │        │     │
│ │ │ 3. Generate embeddings           │        │     │
│ │ │ 4. Update document averager      │◄───────┼─────┼── Incremental
│ │ │ 5. Collect keyword candidates    │        │     │   Averaging
│ │ └──────────────────────────────────┘        │     │
│ └─────────────────────────────────────────────┘     │
│                                                     │
│ ┌─────────────────────────────────────────────┐     │
│ │          Document-Level Processing          │     │
│ │ ┌──────────────────────────────────┐        │     │
│ │ │ 1. Finalize document embedding   │        │     │
│ │ │ 2. Score collected keywords      │◄───────┼─────┼── Reuse cached
│ │ │ 3. Select top 20-30              │        │     │   embeddings
│ │ │ 4. Store in database             │        │     │
│ │ └──────────────────────────────────┘        │     │
│ └─────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────┘
```
### Welford's Incremental Averaging Algorithm
```typescript
/**
 * Incrementally compute the document embedding as the mean of chunk
 * embeddings using Welford's numerically stable online algorithm.
 * Memory complexity: O(dimension), regardless of chunk count.
 */
class IncrementalEmbeddingAverager {
  private n = 0;                              // Number of chunks processed
  private mean: Float32Array | null = null;   // Running mean

  add(embedding: Float32Array): void {
    this.n++;
    if (this.mean === null) {
      // First chunk: copy so later updates don't mutate the caller's array
      this.mean = new Float32Array(embedding);
    } else {
      // Welford's update: new_mean = old_mean + (value - old_mean) / n
      for (let i = 0; i < embedding.length; i++) {
        this.mean[i] += (embedding[i] - this.mean[i]) / this.n;
      }
    }
  }

  getAverage(): Float32Array {
    if (this.mean === null) {
      throw new Error('getAverage() called before any chunk was added');
    }
    return this.mean;
  }
}
```
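A usage sketch (the `chunks` collection and its `embedding` field are illustrative names, not the project's actual types):
```typescript
// Stream chunk embeddings through the averager: memory stays at a single
// Float32Array no matter how many chunks the document contains.
const averager = new IncrementalEmbeddingAverager();
for (const chunk of chunks) {
  averager.add(chunk.embedding);
}
const documentEmbedding = averager.getAverage();
```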
### Keyword Collection & Scoring Strategy
```typescript
interface DocumentKeywordCandidate {
  text: string;             // Keyword text
  embedding?: Float32Array; // Reused from chunk processing cache
  chunkFrequency: number;   // How many chunks contain this keyword
  chunkScores: number[];    // Scores from each chunk
  avgChunkScore: number;    // Average score across chunks
  documentScore?: number;   // Cosine similarity to document embedding
  finalScore?: number;      // Combined score for ranking
}
// Scoring formula:
// finalScore = 0.7 * documentScore + 0.2 * avgChunkScore + 0.1 * log(chunkFrequency)
```
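A minimal sketch of how this formula could be applied; `cosineSimilarity` and `scoreCandidate` are illustrative helpers, not existing project code:
```typescript
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

function scoreCandidate(
  candidate: DocumentKeywordCandidate,
  documentEmbedding: Float32Array,
): number {
  // How well the keyword matches the document as a whole
  const documentScore = candidate.embedding
    ? cosineSimilarity(candidate.embedding, documentEmbedding)
    : 0;
  // log() dampens frequency so keywords appearing in many chunks get a
  // boost without dominating the semantic terms (log(1) = 0)
  return 0.7 * documentScore
    + 0.2 * candidate.avgChunkScore
    + 0.1 * Math.log(candidate.chunkFrequency);
}
```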
## Implementation Plan
### Phase 1: Database Schema Evolution
**File**: `src/infrastructure/embeddings/sqlite-vec/schema.ts`
```sql
-- Add columns to documents table
ALTER TABLE documents ADD COLUMN document_embedding TEXT;
ALTER TABLE documents ADD COLUMN document_keywords TEXT; -- JSON array
ALTER TABLE documents ADD COLUMN keywords_extracted INTEGER DEFAULT 0;
ALTER TABLE documents ADD COLUMN embedding_generated INTEGER DEFAULT 0;
ALTER TABLE documents ADD COLUMN document_processing_ms INTEGER;
-- Add indexes for new fields
CREATE INDEX idx_documents_keywords_extracted ON documents(keywords_extracted);
CREATE INDEX idx_documents_embedding_generated ON documents(embedding_generated);
```
**Schema Version**: Increment from 2 to 3 to trigger rebuild
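One way the version bump could gate the rebuild, assuming the schema version lives in SQLite's `user_version` pragma (a better-sqlite3-style sketch; the project may track versions differently):
```typescript
import Database from 'better-sqlite3';

const SCHEMA_VERSION = 3; // bumped from 2; a mismatch forces a rebuild

// True when the on-disk schema predates the document-level fields.
function needsRebuild(db: Database.Database): boolean {
  const onDisk = db.pragma('user_version', { simple: true }) as number;
  return onDisk < SCHEMA_VERSION;
}

// After a successful rebuild, stamp the new version:
// db.pragma(`user_version = ${SCHEMA_VERSION}`);
```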
### Phase 2: Document Embedding Service
**New File**: `src/domain/semantic/document-embedding-service.ts`
Core responsibilities (a sketch follows the list):
- Implement Welford's algorithm for incremental averaging
- Handle both ONNX and Python model outputs
- Provide streaming interface for chunk-by-chunk processing
- Calculate embedding statistics (magnitude, sparsity)
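A minimal sketch of the service shape implied by these responsibilities (all names here, including the import path, are illustrative, not the project's actual interfaces):
```typescript
import { IncrementalEmbeddingAverager } from './incremental-embedding-averager';

interface EmbeddingStats {
  magnitude: number; // L2 norm of the final document embedding
  sparsity: number;  // Fraction of near-zero components
}

class DocumentEmbeddingService {
  private averager = new IncrementalEmbeddingAverager();

  // Streaming interface: called once per chunk, so ONNX and Python model
  // outputs can both feed in as plain Float32Arrays.
  addChunkEmbedding(embedding: Float32Array): void {
    this.averager.add(embedding);
  }

  finalize(): { embedding: Float32Array; stats: EmbeddingStats } {
    const embedding = this.averager.getAverage();
    let sumSq = 0;
    let nearZero = 0;
    for (const v of embedding) {
      sumSq += v * v;
      if (Math.abs(v) < 1e-6) nearZero++;
    }
    return {
      embedding,
      stats: {
        magnitude: Math.sqrt(sumSq),
        sparsity: nearZero / embedding.length,
      },
    };
  }
}
```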
### Phase 3: Document Keyword Scorer
**New File**: `src/domain/semantic/document-keyword-scorer.ts`
Core responsibilities (the optional MMR step is sketched after the list):
- Collect unique keywords from all chunks
- Retrieve cached embeddings (no regeneration!)
- Calculate cosine similarity to document embedding
- Apply combined scoring formula
- Optional MMR for diversity
- Return top 20-30 keywords with scores
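The optional MMR step is the standard Carbonell & Goldstein formulation: repeatedly pick the candidate maximizing `λ·relevance − (1−λ)·max-similarity-to-already-selected`. A sketch reusing the `cosineSimilarity` helper from above (again illustrative, not project code):
```typescript
// Select `count` keywords balancing relevance against redundancy.
// lambda = 1 is pure relevance; lower values favor diversity.
function mmrSelect(
  candidates: DocumentKeywordCandidate[],
  count: number,
  lambda = 0.7,
): DocumentKeywordCandidate[] {
  const selected: DocumentKeywordCandidate[] = [];
  const pool = candidates.filter(c => c.embedding && c.finalScore !== undefined);
  while (selected.length < count && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const c = pool[i];
      // Penalty: similarity to the most similar already-selected keyword
      const redundancy = selected.length === 0 ? 0 : Math.max(
        ...selected.map(s => cosineSimilarity(c.embedding!, s.embedding!)),
      );
      const mmr = lambda * c.finalScore! - (1 - lambda) * redundancy;
      if (mmr > bestScore) { bestScore = mmr; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```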
### Phase 4: Orchestrator Integration
**Modify**: `src/application/indexing/orchestrator.ts`
Integration points:
During the chunk processing loop (around lines 700-900):
- Initialize the document averager before the loop
- After each chunk: `averager.add(chunkEmbedding)`
- Collect keyword candidates along with their embeddings

After all chunks are processed (around line 944):
- Finalize the document embedding
- Score all collected keywords
- Select the top 20-30
- Store in the database in a single transaction
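Sketched end-to-end, with `doc.chunks` and `processChunk` standing in for the orchestrator's existing per-chunk work (assumed names, inside an async context):
```typescript
const averager = new IncrementalEmbeddingAverager();
const candidates = new Map<string, DocumentKeywordCandidate>();

for (const chunk of doc.chunks) {
  const { embedding, keywords } = await processChunk(chunk); // assumed helper
  averager.add(embedding);
  for (const kw of keywords) {
    const existing = candidates.get(kw.text);
    if (existing) {
      existing.chunkFrequency++;
      existing.chunkScores.push(kw.score);
    } else {
      candidates.set(kw.text, {
        text: kw.text,
        embedding: kw.embedding, // reused from the chunk-level cache
        chunkFrequency: 1,
        chunkScores: [kw.score],
        avgChunkScore: kw.score,
      });
    }
  }
}

// Document-level pass: finalize, score, select, then persist
const documentEmbedding = averager.getAverage();
for (const c of candidates.values()) {
  c.avgChunkScore =
    c.chunkScores.reduce((a, b) => a + b, 0) / c.chunkScores.length;
  c.finalScore = scoreCandidate(c, documentEmbedding);
}
const top = [...candidates.values()]
  .sort((a, b) => b.finalScore! - a.finalScore!)
  .slice(0, 30);
```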
### Phase 5: Database Queries
**Update**: `src/infrastructure/embeddings/sqlite-vec/schema.ts`
New queries:
```typescript
export const QUERIES = {
  // ... existing queries ...
  updateDocumentEmbedding: `
    UPDATE documents
    SET document_embedding = ?,
        embedding_generated = 1,
        document_processing_ms = ?
    WHERE id = ?
  `,
  updateDocumentKeywords: `
    UPDATE documents
    SET document_keywords = ?,
        keywords_extracted = 1
    WHERE id = ?
  `,
  getDocumentSemantics: `
    SELECT document_embedding, document_keywords
    FROM documents
    WHERE id = ?
  `,
  getDocumentsNeedingSemantics: `
    SELECT id, file_path
    FROM documents
    WHERE keywords_extracted = 0 OR embedding_generated = 0
  `
};
```
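How these might be exercised atomically, continuing the names from the integration sketch above (a better-sqlite3-style sketch; JSON-serializing the embedding into the TEXT column is an assumption about the storage format):
```typescript
const updateEmbedding = db.prepare(QUERIES.updateDocumentEmbedding);
const updateKeywords = db.prepare(QUERIES.updateDocumentKeywords);

// One transaction per document keeps the two updates atomic.
const storeDocumentSemantics = db.transaction(
  (docId: number, embedding: Float32Array, keywords: object[], elapsedMs: number) => {
    updateEmbedding.run(JSON.stringify(Array.from(embedding)), elapsedMs, docId);
    updateKeywords.run(JSON.stringify(keywords), docId);
  },
);

storeDocumentSemantics(docId, documentEmbedding, top, Date.now() - start);
```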
## Testing Strategy
### Smoke Test Procedure
Using the 5 indexed test folders:
- `tmp/test-cpu-xenova-multilingual-e5-large`
- `tmp/test-cpu-xenova-gte-multilingual-base`
- `tmp/test-onnx-minilm-l6`
- `tmp/test-gpu-bge-m3`
- `tmp/test-gpu-e5-large`
**Test Steps**:
1. Remove all `.folder-mcp` directories to force re-indexing
2. Run `npm run daemon:restart` to start a fresh daemon
3. Monitor daemon logs for indexing completion
4. Verify database contents:
```sql
-- Check document-level data
SELECT
  file_path,
  keywords_extracted,
  embedding_generated,
  LENGTH(document_embedding) AS embedding_size,
  JSON_ARRAY_LENGTH(document_keywords) AS keyword_count
FROM documents
WHERE keywords_extracted = 1;

-- Verify keyword quality
SELECT
  file_path,
  JSON_EXTRACT(document_keywords, '$[0].text') AS top_keyword,
  JSON_EXTRACT(document_keywords, '$[0].score') AS top_score
FROM documents
WHERE document_keywords IS NOT NULL;
```
### Memory Testing
Monitor memory usage during processing of large documents:
```bash
# Start monitoring before indexing
while true; do
  ps aux | grep "node.*daemon" | grep -v grep | awk '{print $6}'
  sleep 1
done > memory_usage.log

# The RSS column logged above should stay constant even for 100+ chunk documents
```
### Quality Metrics
Expected outcomes:
- Multi-word phrase ratio: >80% (vs current 11%)
- Average keywords per document: 20-30
- Keyword score range: 0.5-1.0
- Processing time: <100ms per document (after chunks)
- Memory usage: Constant regardless of chunk count
## Performance Optimizations
### Optimization Strategies
1. **Embedding Cache Reuse** (see the sketch after this list)
- Never regenerate embeddings that are already in the LRU cache
- Cache hit rate should be >90% for keywords
2. **Incremental Processing**
- O(1) memory complexity for document embedding
- No array concatenation or bulk operations
3. **Batch Database Operations**
- Single transaction for all document-level updates
- Prepared statements for efficiency
4. **Lazy Loading**
- Only load embeddings when needed for scoring
- Stream chunks instead of loading all at once
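A sketch of the cache-first lookup behind the reuse strategy; the plain `Map` stands in for the actual LRU cache and `generate` for the model call (both assumed, not the project's API):
```typescript
// Check the chunk-level cache before generating, so document-level
// scoring rarely pays for a fresh embedding.
function getKeywordEmbedding(
  keyword: string,
  cache: Map<string, Float32Array>,         // stand-in for the LRU cache
  generate: (text: string) => Float32Array, // stand-in for the model call
): { embedding: Float32Array; cacheHit: boolean } {
  const cached = cache.get(keyword);
  if (cached) return { embedding: cached, cacheHit: true };
  const embedding = generate(keyword);      // should be the rare path (<10%)
  cache.set(keyword, embedding);
  return { embedding, cacheHit: false };
}
```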
## Success Criteria
### Functional Requirements
- ✅ Document embeddings generated for all indexed documents
- ✅ 20-30 high-quality keywords per document
- ✅ Keywords have semantic scores (0.0-1.0)
- ✅ Multi-word phrase ratio >80%
- ✅ Database schema properly evolved
### Performance Requirements
- ✅ Memory usage constant regardless of document size
- ✅ Processing time <100ms per document (post-chunks)
- ✅ No regeneration of existing embeddings
- ✅ Cache hit rate >90% for keyword embeddings
### Quality Requirements
- ✅ Keywords represent document themes accurately
- ✅ Diversity in selected keywords (via MMR if needed)
- ✅ Scores correlate with human relevance judgment
## Rollback Plan
If issues arise:
1. Schema version can be reverted to 2
2. New columns are nullable or have defaults - they won't break existing code
3. Document-level processing is independent of chunk processing
4. Can disable document-level extraction via config flag
## Future Enhancements (Sprint 12+)
1. **Document Clustering**
- Use document embeddings for similarity clustering
- Auto-categorization of documents
2. **Smart Summarization**
- Use top keywords to generate summaries
- Extract key sentences near top keywords
3. **Cross-Document Linking**
- Find related documents via embedding similarity
- Build knowledge graphs from keyword relationships
4. **Query Expansion**
- Use document keywords for query enhancement
- Improve search recall with synonym matching
## Notes & Observations
### Why This Approach is Efficient
The key insight is that we're already doing the hard work at chunk level:
- Generating embeddings for 50 candidates per chunk
- Testing 7,850 total candidates for a 157-chunk document
- Document level only needs to test the "winners" from chunks
This is like a tournament structure:
- **Chunk Level**: Regional qualifiers (very thorough)
- **Document Level**: National finals (only the best compete)
### Avoiding Common Pitfalls
1. **Don't load all embeddings at once** - Use incremental averaging
2. **Don't regenerate embeddings** - Reuse from cache
3. **Don't over-optimize keywords** - Quality > Quantity
4. **Don't break existing functionality** - Add columns, don't modify
## Implementation Checklist
- [ ] Update database schema with new columns
- [ ] Increment schema version to trigger rebuild
- [ ] Implement IncrementalEmbeddingAverager
- [ ] Create DocumentKeywordScorer service
- [ ] Integrate into indexing orchestrator
- [ ] Add database queries for document semantics
- [ ] Run smoke test on all 5 test folders
- [ ] Verify memory usage stays constant
- [ ] Check keyword quality metrics
- [ ] Document any edge cases found
## References
- Welford's Online Algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
- KeyBERT Paper: https://arxiv.org/abs/2003.07278
- MMR (Maximal Marginal Relevance): Carbonell & Goldstein, 1998
- Cosine Similarity: https://en.wikipedia.org/wiki/Cosine_similarity