# Sprint: Semantic Extraction Performance Optimization
## Sprint Overview
**Goal**: Optimize semantic extraction to reduce indexing time by 80-90% while maintaining quality
**Priority**: CRITICAL - The system is currently unusable due to a 20-260x performance regression
**Duration**: 1-2 days
**Success Metric**: Indexing 5 files in under 2 seconds (down from the current 2-26 seconds, depending on backend)
## Problem Statement
After implementing semantic extraction in Sprint 0, indexing performance degraded catastrophically:
- **Before**: ~100ms for 5 files
- **After (ONNX)**: 2-5 seconds for 5 files (20-50x slower)
- **After (PyTorch)**: 10-26 seconds for 5 files (100-260x slower)
### Root Causes Identified
1. **Individual Embedding Calls**: Generating embeddings for each n-gram candidate individually
2. **No Batching**: Missing batch processing at multiple levels
3. **No Caching**: Re-computing embeddings for common phrases
4. **Excessive Candidates**: Processing 50-200 n-grams per chunk before filtering
5. **Double Processing**: Chunks embedded twice (semantic extraction + storage)
## Implementation Tasks
### Task 1: Batch N-gram Embeddings (ONNX Models)
**Priority**: HIGH
**Impact**: 70-80% performance improvement
**File**: `src/domain/semantic/algorithms/ngram-cosine-extractor.ts`
**Changes**:
- Replace individual `generateEmbedding()` calls with batched `generateEmbeddings()`
- Modify `generateCandidateEmbeddings()` method (lines 165-189)
- Send all candidates in a single batch instead of using `Promise.all()` on individual calls
**Implementation**:
```typescript
// OLD: one embedding call per candidate
const batchEmbeddings = await Promise.all(
  batch.map(candidate => this.embeddingModel!.generateEmbedding(candidate))
);

// NEW: a single batched call for the whole candidate batch
const batchEmbeddings = await this.embeddingModel!.generateEmbeddings(batch);
```
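A fuller sketch of the batched path, assuming the model exposes `generateEmbeddings(texts: string[])` as described above; the `BATCH_SIZE` constant, the `EmbeddingModel` interface shape, and the standalone helper are illustrative, not the existing class layout:

```typescript
// Illustrative sketch: embed all n-gram candidates in fixed-size slices
// instead of issuing one generateEmbedding() call per candidate.
interface EmbeddingModel {
  generateEmbeddings(texts: string[]): Promise<Float32Array[]>;
}

const BATCH_SIZE = 64; // assumed tuning knob, adjust to available memory

async function generateCandidateEmbeddings(
  model: EmbeddingModel,
  candidates: string[]
): Promise<Map<string, Float32Array>> {
  const result = new Map<string, Float32Array>();
  for (let i = 0; i < candidates.length; i += BATCH_SIZE) {
    const batch = candidates.slice(i, i + BATCH_SIZE);
    // One model call per slice; the runtime processes the slice as a batch.
    const embeddings = await model.generateEmbeddings(batch);
    batch.forEach((candidate, j) => result.set(candidate, embeddings[j]));
  }
  return result;
}
```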
### Task 2: Batch KeyBERT Processing (PyTorch Models)
**Priority**: HIGH
**Impact**: 70-80% performance improvement
**Files**:
- `src/domain/semantic/extraction-service.ts`
- `src/application/indexing/orchestrator.ts`
**Changes**:
- Process multiple chunks in a single Python service call (see the sketch below)
- Batch KeyBERT extraction at document level
- Reduce IPC overhead
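A minimal sketch of the batched call path, assuming the Python service wrapper can expose a single method that takes all chunk texts of a document and returns keyword lists in the same order; `extractKeywordsBatch` is an assumed name, not the current API:

```typescript
// Illustrative sketch: one IPC round trip per document instead of one per chunk.
interface KeywordResult {
  keywords: string[];
}

interface PythonSemanticService {
  // Assumed batched endpoint; the real service may expose a different signature.
  extractKeywordsBatch(texts: string[]): Promise<KeywordResult[]>;
}

async function extractForDocument(
  service: PythonSemanticService,
  chunks: { id: string; text: string }[]
): Promise<Map<string, string[]>> {
  // Send every chunk of the document in a single request to amortize IPC overhead.
  const results = await service.extractKeywordsBatch(chunks.map(c => c.text));
  const byChunk = new Map<string, string[]>();
  chunks.forEach((chunk, i) => byChunk.set(chunk.id, results[i].keywords));
  return byChunk;
}
```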
### Task 3: Implement Embedding Cache
**Priority**: MEDIUM
**Impact**: 20-30% improvement for similar content
**File**: `src/domain/semantic/algorithms/ngram-cosine-extractor.ts`
**Implementation**:
```typescript
class NGramCosineExtractor {
  private embeddingCache: Map<string, Float32Array> = new Map();
  private readonly MAX_CACHE_SIZE = 1000;

  private async getCachedOrGenerate(text: string): Promise<Float32Array> {
    const cached = this.embeddingCache.get(text);
    if (cached) {
      // Refresh recency: re-inserting moves the entry to the "most recently used" end.
      this.embeddingCache.delete(text);
      this.embeddingCache.set(text, cached);
      return cached;
    }

    const embedding = await this.embeddingModel!.generateEmbedding(text);

    // Evict the least recently used entry (the oldest key) if the cache is full.
    if (this.embeddingCache.size >= this.MAX_CACHE_SIZE) {
      const oldestKey = this.embeddingCache.keys().next().value;
      if (oldestKey !== undefined) {
        this.embeddingCache.delete(oldestKey);
      }
    }

    this.embeddingCache.set(text, embedding);
    return embedding;
  }
}
```
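Because a JavaScript `Map` iterates keys in insertion order, deleting and re-inserting an entry on each hit keeps the oldest key at the front, so evicting `keys().next().value` removes the least recently used entry. The `MAX_CACHE_SIZE` constant should eventually become configurable, in line with the risk mitigation below.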
### Task 4: Reduce N-gram Candidates
**Priority**: MEDIUM
**Impact**: 30-50% fewer embeddings needed
**File**: `src/domain/semantic/algorithms/ngram-utils.ts`
**Changes**:
- Limit candidates to the top 30-50 by frequency (see the sketch below)
- Apply TF-IDF pre-filtering
- Stricter quality filters before embedding generation
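A sketch of the pre-filter; the `MAX_CANDIDATES` value and the specific quality checks are illustrative, and a TF-IDF score could replace the plain frequency sort:

```typescript
// Illustrative sketch: cheap filters first, then keep only the most frequent
// candidates so far fewer strings reach the embedding model.
const MAX_CANDIDATES = 40;      // assumed limit within the 30-50 range above
const MIN_CANDIDATE_LENGTH = 3; // drop very short fragments

function preFilterCandidates(ngrams: string[]): string[] {
  // Count occurrences of each normalized candidate.
  const frequency = new Map<string, number>();
  for (const ngram of ngrams) {
    const normalized = ngram.trim().toLowerCase();
    if (normalized.length < MIN_CANDIDATE_LENGTH) continue;
    if (/^\d+$/.test(normalized)) continue; // skip purely numeric candidates
    frequency.set(normalized, (frequency.get(normalized) ?? 0) + 1);
  }
  // Keep the top candidates by frequency; TF-IDF pre-filtering would slot in here.
  return [...frequency.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, MAX_CANDIDATES)
    .map(([ngram]) => ngram);
}
```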
### Task 5: Eliminate Double Embedding (ONNX)
**Priority**: LOW
**Impact**: 50% reduction in embedding calls for ONNX
**File**: `src/application/indexing/orchestrator.ts`
**Changes**:
- Reuse embeddings from semantic extraction for vector storage
- Pass embeddings through the pipeline instead of regenerating them (see the sketch below)
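A sketch of the reuse path in the orchestrator, assuming chunks can carry the embedding produced during semantic extraction; the `embedding` field and the `storeVector` callback are illustrative names:

```typescript
// Illustrative sketch: reuse the embedding computed during semantic extraction
// instead of calling the model a second time before vector storage.
interface IndexedChunk {
  id: string;
  text: string;
  embedding?: Float32Array; // filled in by the semantic extraction step
}

async function storeChunk(
  chunk: IndexedChunk,
  generateEmbedding: (text: string) => Promise<Float32Array>,
  storeVector: (id: string, vector: Float32Array) => Promise<void>
): Promise<void> {
  // Fall back to a fresh embedding only when extraction did not produce one.
  const vector = chunk.embedding ?? (await generateEmbedding(chunk.text));
  await storeVector(chunk.id, vector);
}
```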
### Task 6: Add Performance Metrics
**Priority**: LOW
**Impact**: Better monitoring and debugging
**Files**: Various
**Changes**:
- Add timing measurements for each stage
- Log batch sizes and cache hit rates
- Track embeddings per minute (EPM) (see the sketch below)
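A small sketch of the instrumentation intended here; the stage names, the `timed` helper, and the metrics shape are illustrative rather than existing code:

```typescript
import { performance } from 'node:perf_hooks';

// Illustrative sketch: time each stage and derive embeddings-per-minute (EPM).
interface StageMetrics {
  stage: string;
  durationMs: number;
  batchSize?: number;
}

const stageLog: StageMetrics[] = [];

async function timed<T>(
  stage: string,
  batchSize: number | undefined,
  fn: () => Promise<T>
): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    stageLog.push({ stage, durationMs: performance.now() - start, batchSize });
  }
}

function embeddingsPerMinute(embeddingCount: number, totalDurationMs: number): number {
  return totalDurationMs > 0 ? (embeddingCount / totalDurationMs) * 60_000 : 0;
}
```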
## Testing Strategy
### Performance Benchmarks
1. **Baseline**: Record current timings with the test folders listed below
2. **After Each Task**: Measure improvement
3. **Target Metrics**:
- ONNX: < 1 second for 5 files
- PyTorch: < 2 seconds for 5 files
- Cache hit rate: > 30% for similar content
### Test Folders
- `/tmp/test-gpu-multilingual-e5-large` (PyTorch)
- `/tmp/test-gpu-bge-m3` (PyTorch)
- `/tmp/test-gpu-xenova-multilingual-e5-small` (ONNX)
- `/tmp/test-gpu-xenova-multilingual-e5-large` (ONNX)
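A minimal benchmark sketch over these folders, assuming an `indexFolder(path)` entry point; the function name is an assumption about the indexing API:

```typescript
import { performance } from 'node:perf_hooks';

// Illustrative sketch: time indexing for each test folder and print the result.
const TEST_FOLDERS = [
  '/tmp/test-gpu-multilingual-e5-large',
  '/tmp/test-gpu-bge-m3',
  '/tmp/test-gpu-xenova-multilingual-e5-small',
  '/tmp/test-gpu-xenova-multilingual-e5-large',
];

async function benchmark(indexFolder: (path: string) => Promise<void>): Promise<void> {
  for (const folder of TEST_FOLDERS) {
    const start = performance.now();
    await indexFolder(folder);
    console.log(`${folder}: ${(performance.now() - start).toFixed(0)} ms`);
  }
}
```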
### Validation
- Verify semantic extraction quality remains consistent
- Check multiword phrase ratio stays > 60%
- Ensure no regression in search quality
## Rollback Plan
If optimizations cause issues:
1. **Quick disable**: Add an environment variable to skip semantic extraction (see the sketch after this list)
2. **Gradual rollback**: Revert individual optimizations
3. **Emergency**: Full revert to commit before Sprint 0
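For the quick-disable path, a sketch of an environment-variable gate; the `SKIP_SEMANTIC_EXTRACTION` name is an assumption, not an existing flag:

```typescript
// Illustrative sketch: skip the semantic extraction stage entirely when the
// kill switch is set, leaving the rest of the indexing pipeline untouched.
const semanticExtractionEnabled =
  process.env.SKIP_SEMANTIC_EXTRACTION !== 'true';

async function maybeExtract(
  chunkText: string,
  extract: (text: string) => Promise<string[]>
): Promise<string[]> {
  if (!semanticExtractionEnabled) return [];
  return extract(chunkText);
}
```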
## Success Criteria
- [ ] Indexing time reduced by 80% or more
- [ ] No degradation in semantic extraction quality
- [ ] All existing tests pass
- [ ] Performance metrics logged for monitoring
- [ ] Cache implementation tested with memory limits
## Code Quality Requirements
- Clean, readable code with clear comments
- Proper error handling for batch operations
- Memory management for caches
- Backward compatibility maintained
## Implementation Order
1. **Immediate Fix**: Task 1 (Batch n-grams) - Biggest impact
2. **Quick Win**: Task 4 (Reduce candidates) - Easy to implement
3. **Major Fix**: Task 2 (Batch KeyBERT) - High impact for PyTorch
4. **Enhancement**: Task 3 (Caching) - Good for repeated content
5. **Optimization**: Task 5 (Eliminate double embedding)
6. **Monitoring**: Task 6 (Metrics)
## Risk Mitigation
- **Risk**: Batching might exceed memory limits
- **Mitigation**: Implement adaptive batch sizing (see the sketch after this list)
- **Risk**: Cache might grow too large
- **Mitigation**: LRU eviction with configurable size
- **Risk**: Quality degradation from fewer candidates
- **Mitigation**: Monitor multiword ratio and adjust thresholds
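For the batching risk, a sketch of adaptive batch sizing that halves the batch and retries when a batch fails; treating any thrown error as memory pressure is a simplifying assumption:

```typescript
// Illustrative sketch: start with a large batch and halve it whenever a batch
// fails (e.g. out of memory), retrying the same slice at the smaller size.
async function embedWithAdaptiveBatching(
  texts: string[],
  generateEmbeddings: (batch: string[]) => Promise<Float32Array[]>,
  initialBatchSize = 64
): Promise<Float32Array[]> {
  const results: Float32Array[] = [];
  let batchSize = initialBatchSize;
  let i = 0;
  while (i < texts.length) {
    const batch = texts.slice(i, i + batchSize);
    try {
      results.push(...(await generateEmbeddings(batch)));
      i += batch.length;
    } catch (error) {
      if (batchSize === 1) throw error; // cannot shrink further, give up
      batchSize = Math.max(1, Math.floor(batchSize / 2));
    }
  }
  return results;
}
```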
## Notes
- Priority is to make the system usable again
- Semantic extraction should be optimized, not removed
- Focus on batching as primary optimization
- Cache is secondary but helpful for similar content
- Monitor performance continuously during implementation