folder-mcp

Overview Schema Related Servers Score Discussions

Phase-10-Sprint-7.5-Vec0-Infrastructure-Migration.md•19.4 KiB

# Phase 10 Sprint 7.5: Vec0 Infrastructure Migration **Sprint Type**: Ad-hoc Infrastructure Enhancement **Phase**: 10 - Semantic Endpoint Navigation **Sprint**: 7.5 (Infrastructure preparation) **Priority**: High - Foundation for search endpoint (Sprint 8) **Duration**: 6-8 hours **Pre-production**: YES - Databases are disposable, can reindex anytime --- ## 🎯 Sprint Goal Migrate embedding storage from TEXT format to SQLite vec0 virtual tables, preparing the infrastructure for native vector similarity search implementation in Sprint 8. Focus on **write path stability** - ensuring embeddings are stored correctly in vec0 format while maintaining all existing folder lifecycle operations. **Out of Scope**: Search endpoint implementation (that's Sprint 8) --- ## 📋 Context & Motivation ### Current State - **Embeddings stored as TEXT**: Chunk embeddings stored as JSON strings, document embeddings as base64 - **Manual similarity search**: MultiFolderVectorSearchService iterates ALL embeddings in JavaScript - **No indexing**: Linear scan with manual cosine similarity for every search - **Performance bottleneck**: Slow for large datasets (thousands of embeddings) ### Why Vec0 Now? 1. **Infrastructure readiness**: Before implementing search endpoint, storage layer must be ready 2. **Single model per database**: Each folder's database uses one model with fixed dimension 3. **Disposable databases**: Source files are truth, can always reindex 4. **Write path focus**: Get embeddings into vec0 format now, read path (search) comes in Sprint 8 ### Key Constraint **Single Model Per Database**: Each `.folder-mcp/embeddings.db` uses ONE model at a time, enforced by: ```sql CREATE TABLE embedding_config ( id INTEGER PRIMARY KEY CHECK (id = 1), -- Single row! model_name TEXT NOT NULL, model_dimension INTEGER NOT NULL ); ``` This means vec0 virtual tables can use a **fixed dimension per database**. --- ## 📊 Model Dimensions Overview From `src/config/curated-models.json`: | Model ID | Dimensions | Provider | |----------|------------|----------| | `gpu:bge-m3` | **1024** | Python GPU | | `gpu:multilingual-e5-large` | **1024** | Python GPU | | `gpu:paraphrase-multilingual-minilm` | **384** | Python GPU | | `cpu:xenova-multilingual-e5-small` | **384** | ONNX CPU | | `cpu:xenova-multilingual-e5-large` | **1024** | ONNX CPU | **Two dimension sizes**: 384 or 1024 --- ## 🏗️ Architecture Changes ### Current Schema (TEXT Storage) ```sql -- OLD: Regular table with TEXT column CREATE TABLE embeddings ( chunk_id INTEGER PRIMARY KEY, embedding TEXT -- JSON serialized array ); -- Document embeddings stored in documents table CREATE TABLE documents ( ... document_embedding TEXT, -- base64 encoded Float32Array embedding_generated INTEGER DEFAULT 0 ); ``` ### New Schema (Vec0 Virtual Tables with Metadata) ```sql -- ✅ NEW: Vec0 virtual table for chunk embeddings -- Uses INTEGER metadata column for explicit chunk_id storage CREATE VIRTUAL TABLE chunk_embeddings USING vec0( chunk_id INTEGER, -- Metadata column for JOIN embedding FLOAT32[${dimension}] -- Dimension from embedding_config ); -- ✅ NEW: Vec0 virtual table for document embeddings -- Uses INTEGER metadata column for explicit document_id storage CREATE VIRTUAL TABLE document_embeddings USING vec0( document_id INTEGER, -- Metadata column for JOIN embedding FLOAT32[${dimension}] -- Same dimension as chunks ); -- UNCHANGED: Semantic metadata preserved CREATE TABLE chunks ( ... key_phrases TEXT, -- JSON array (UNCHANGED) readability_score REAL, -- Score 0-100 (UNCHANGED) semantic_processed INTEGER -- Flag (UNCHANGED) ); CREATE TABLE documents ( ... document_keywords TEXT, -- JSON array (UNCHANGED) keywords_extracted INTEGER -- Flag (UNCHANGED) -- document_embedding REMOVED - now in document_embeddings table -- embedding_generated REMOVED - derive from document_embeddings existence ); ``` ### Schema Version - **Current**: Version 3 (with document-level semantics) - **Target**: Version 4 (with vec0 virtual tables) ### Vec0 Metadata Columns Design **Key Design Decision**: Vec0 supports metadata columns alongside vector data, eliminating the need for rowid synchronization. **Benefits**: - ✅ **Explicit IDs**: Store `chunk_id` and `document_id` directly in vec0 tables - ✅ **Direct JOINs**: `JOIN chunk_embeddings ON chunks.id = chunk_embeddings.chunk_id` - ✅ **No Sync Complexity**: No need to keep rowids synchronized with insertion order - ✅ **Standard SQL**: DELETE, UPDATE, and SELECT work with metadata columns - ✅ **Fail-Loud**: Can't accidentally join wrong embeddings to wrong chunks **Example**: ```sql -- Query with metadata JOIN SELECT c.content, ce.chunk_id, distance FROM chunks c JOIN chunk_embeddings ce ON c.id = ce.chunk_id WHERE ce.embedding MATCH vec_f32(?) AND k = 10; -- Delete by metadata DELETE FROM chunk_embeddings WHERE chunk_id = ?; ``` --- ## 🔧 Implementation Tasks ### Phase 1: Schema Updates (Low Risk) **Files**: `src/infrastructure/embeddings/sqlite-vec/schema.ts` **Tasks**: 1. ✅ Add `chunk_embeddings` vec0 virtual table template 2. ✅ Add `document_embeddings` vec0 virtual table template 3. ✅ Remove `embeddings` TEXT table 4. ✅ Remove `document_embedding` and `embedding_generated` columns from documents table 5. ✅ Update schema version to 4 6. ✅ Update `getAllTableStatements()` to use dynamic dimension injection **Expected Changes**: ```typescript // BEFORE export const EMBEDDINGS_TABLE_TEMPLATE = ` CREATE TABLE IF NOT EXISTS embeddings ( chunk_id INTEGER PRIMARY KEY, embedding TEXT );`; // AFTER - Vec0 with metadata columns (no PRIMARY KEY needed) export const CHUNK_EMBEDDINGS_TABLE_TEMPLATE = ` CREATE VIRTUAL TABLE IF NOT EXISTS chunk_embeddings USING vec0( chunk_id INTEGER, embedding FLOAT32[\${dimension}] );`; export const DOCUMENT_EMBEDDINGS_TABLE_TEMPLATE = ` CREATE VIRTUAL TABLE IF NOT EXISTS document_embeddings USING vec0( document_id INTEGER, embedding FLOAT32[\${dimension}] );`; export function getAllTableStatements(embeddingDimension: number): string[] { return [ DOCUMENTS_TABLE, CHUNKS_TABLE, CHUNK_EMBEDDINGS_TABLE_TEMPLATE.replace('${dimension}', embeddingDimension.toString()), DOCUMENT_EMBEDDINGS_TABLE_TEMPLATE.replace('${dimension}', embeddingDimension.toString()), EMBEDDING_CONFIG_TABLE, FILE_STATES_TABLE, SCHEMA_VERSION_TABLE, ...INDEXES ]; } ``` --- ### Phase 2: Database Manager Updates (Medium Risk) **Files**: `src/infrastructure/embeddings/sqlite-vec/database-manager.ts` **Tasks**: 1. ✅ Update `initialize()` to check schema version 2. ✅ Add `createVec0Schema()` for new databases 3. ✅ Add `verifyModelConsistency()` for existing databases 4. ✅ Add model change detection with helpful error messages 5. ✅ Update schema validation to check vec0 tables **New Methods**: ```typescript private async createVec0Schema(): Promise<void> { // 1. Get dimension from config const dimension = this.config.modelDimension; // 2. Insert embedding_config first db.prepare(` INSERT OR REPLACE INTO embedding_config (id, model_name, model_dimension, updated_at) VALUES (1, ?, ?, CURRENT_TIMESTAMP) `).run(this.config.modelName, dimension); // 3. Create vec0 tables with correct dimension const statements = getAllTableStatements(dimension); for (const statement of statements) { db.exec(statement); } // 4. Set schema version to 4 db.prepare(` INSERT OR REPLACE INTO schema_version (id, version) VALUES (1, 4) `).run(); } private async verifyModelConsistency(): Promise<void> { const configRow = db.prepare('SELECT model_name, model_dimension FROM embedding_config WHERE id = 1').get(); if (configRow.model_name !== this.config.modelName) { const storedDim = configRow.model_dimension; const newDim = this.config.modelDimension; if (storedDim !== newDim) { throw new Error( `MODEL_CHANGED: Database created with ${configRow.model_name} (${storedDim}d), ` + `now configured for ${this.config.modelName} (${newDim}d). ` + `Database rebuild required - delete .folder-mcp and reindex.` ); } } } ``` --- ### Phase 3: Write Path Updates (High Risk - Core Functionality) **Files**: `src/infrastructure/embeddings/sqlite-vec/sqlite-vec-storage.ts` **Tasks**: 1. ✅ Update `insertEmbeddingsWithMetadata()` to use vec0 tables 2. ✅ Update `updateDocumentSemantics()` to use document_embeddings table 3. ✅ Keep semantic metadata writes unchanged 4. ✅ Add transaction consistency checks **Key Changes**: ```typescript // BEFORE (chunk embeddings) const embeddingJson = JSON.stringify(embeddingArray); insertEmbeddingStmt.run(chunkId, embeddingJson); // AFTER (chunk embeddings) - Use vec_f32() SQL function const insertChunkEmbedding = db.prepare( 'INSERT INTO chunk_embeddings (chunk_id, embedding) VALUES (?, vec_f32(?))' ); insertChunkEmbedding.run(chunkId, JSON.stringify(embeddingArray)); // BEFORE (document embeddings) UPDATE documents SET document_embedding = ?, embedding_generated = 1 WHERE file_path = ? // AFTER (document embeddings) - Use vec_f32() SQL function const insertDocEmbedding = db.prepare( 'INSERT INTO document_embeddings (document_id, embedding) VALUES (?, vec_f32(?))' ); insertDocEmbedding.run(documentId, JSON.stringify(embeddingArray)); ``` **Critical**: Must use `vec_f32()` SQL function to convert JSON array to proper FLOAT32 BLOB format. --- ### Phase 4: Delete Path Updates (High Risk - Data Integrity) **Files**: `src/infrastructure/embeddings/sqlite-vec/sqlite-vec-storage.ts` **Tasks**: 1. ✅ Implement manual CASCADE delete for vec0 tables 2. ✅ Update `deleteDocument()` with vec0-aware deletion 3. ✅ Ensure transactional integrity 4. ✅ Add deletion verification **Critical Change** - Vec0 virtual tables don't support foreign keys: ```typescript async deleteDocument(filePath: string): Promise<void> { const db = this.dbManager.getDatabase(); const deleteTransaction = db.transaction(() => { // 1. Get document ID const doc = db.prepare('SELECT id FROM documents WHERE file_path = ?').get(filePath); if (!doc) return; const docId = (doc as any).id; // 2. Get all chunk IDs const chunks = db.prepare('SELECT id FROM chunks WHERE document_id = ?').all(docId); const chunkIds = chunks.map((c: any) => c.id); // 3. ⚠️ CRITICAL: Manual CASCADE for vec0 tables for (const chunkId of chunkIds) { db.prepare('DELETE FROM chunk_embeddings WHERE chunk_id = ?').run(chunkId); } db.prepare('DELETE FROM document_embeddings WHERE document_id = ?').run(docId); // 4. Delete document (regular CASCADE handles chunks table) db.prepare('DELETE FROM documents WHERE id = ?').run(docId); }); deleteTransaction(); } ``` --- ### Phase 5: Testing & Validation (Critical) **Files**: - `tests/unit/infrastructure/embeddings/sqlite-vec/sqlite-vec-storage.test.ts` - `tests/integration/folder-lifecycle-*.test.ts` **Tasks**: 1. ✅ Unit tests for vec0 schema creation 2. ✅ Unit tests for model consistency checks 3. ✅ Unit tests for manual CASCADE delete 4. ✅ Integration tests for full lifecycle with vec0 5. ✅ Verify semantic metadata still accessible 6. ✅ Verify no regressions in existing tests **Test Categories**: #### Unit Tests ```typescript describe('Vec0 Schema Creation', () => { it('should create vec0 tables with correct dimension', async () => { const storage = new SQLiteVecStorage({ folderPath: testFolder, modelName: 'gpu:paraphrase-multilingual-minilm', modelDimension: 384 }); await storage.initialize(); // Verify vec0 tables exist const tables = db.prepare( "SELECT name FROM sqlite_master WHERE type='table' AND name LIKE '%embeddings'" ).all(); expect(tables).toContainEqual({ name: 'chunk_embeddings' }); expect(tables).toContainEqual({ name: 'document_embeddings' }); }); it('should detect model dimension mismatch', async () => { // Create database with 384d model await createDatabase(384); // Try to open with 1024d model const storage = new SQLiteVecStorage({ folderPath: testFolder, modelName: 'gpu:bge-m3', modelDimension: 1024 }); await expect(storage.initialize()).rejects.toThrow('MODEL_CHANGED'); }); it('should CASCADE delete vec0 embeddings', async () => { await storage.addEmbeddings(embeddings, metadata); await storage.deleteDocument(filePath); // Verify vec0 tables are empty const chunkCount = db.prepare('SELECT COUNT(*) as count FROM chunk_embeddings').get(); const docCount = db.prepare('SELECT COUNT(*) as count FROM document_embeddings').get(); expect(chunkCount.count).toBe(0); expect(docCount.count).toBe(0); }); }); ``` #### Integration Tests ```typescript describe('Folder Lifecycle with Vec0', () => { it('should complete full indexing with vec0 storage', async () => { const lifecycleService = new FolderLifecycleService( folderId, folderPath, orchestrator, fileSystem, sqliteVecStorage, fileStateService, logger, 'gpu:paraphrase-multilingual-minilm' // 384d model ); await lifecycleService.start(); await waitForIndexing(); // Verify vec0 tables populated const stats = await sqliteVecStorage.getStats(); expect(stats.embeddingCount).toBeGreaterThan(0); // Verify semantic metadata still accessible const search = await multiFolderSearch.searchInFolder( queryVector, folderPath, 10, 0.7 ); expect(search[0].keyPhrases).toBeDefined(); expect(search[0].readabilityScore).toBeGreaterThan(0); }); }); ``` --- ## 🚨 Critical Risks & Mitigations ### Risk 1: Vec0 Virtual Tables Don't Support Foreign Keys **Impact**: CASCADE deletes won't work automatically **Mitigation**: Manual CASCADE delete in application code using metadata columns (Phase 4) **Testing**: Unit tests verify all vec0 entries deleted when document removed **Note**: Using metadata columns (`chunk_id`, `document_id`) eliminates rowid synchronization issues. DELETE operations use: `DELETE FROM chunk_embeddings WHERE chunk_id = ?` ### Risk 2: Model Change Breaks Database **Impact**: User switches model, database has wrong dimension **Mitigation**: Model consistency check throws helpful error with reindex instructions **Testing**: Unit test verifies dimension mismatch detection ### Risk 3: Semantic Metadata Lost **Impact**: Key phrases, readability scores accidentally deleted **Mitigation**: Semantic data stored in separate columns (chunks/documents tables), not in vec0 **Testing**: Integration tests verify semantic metadata accessible after vec0 migration ### Risk 4: Write Path Breaks Indexing **Impact**: Files fail to index after vec0 changes **Mitigation**: Incremental rollout with thorough testing at each phase **Testing**: Full folder lifecycle tests must pass before completion --- ## 📁 Files Modified ### Schema Layer - ✅ `src/infrastructure/embeddings/sqlite-vec/schema.ts` - Add vec0 table definitions ### Database Layer - ✅ `src/infrastructure/embeddings/sqlite-vec/database-manager.ts` - Schema initialization - ✅ `src/infrastructure/embeddings/sqlite-vec/sqlite-vec-storage.ts` - Write/delete operations ### Tests - ✅ `tests/unit/infrastructure/embeddings/sqlite-vec/sqlite-vec-storage.test.ts` - ✅ `tests/unit/infrastructure/embeddings/sqlite-vec/database-manager.test.ts` - ✅ `tests/integration/folder-lifecycle-*.test.ts` - Verify no regressions --- ## ✅ Definition of Done ### Infrastructure Readiness - [ ] Vec0 virtual tables created with correct dimension at database initialization - [ ] Model dimension lookup from `curated-models.json` works correctly - [ ] Schema version 4 properly tracked and validated - [ ] Model consistency check detects dimension mismatches ### Write Path Stability - [ ] Chunk embeddings stored in `chunk_embeddings` vec0 table - [ ] Document embeddings stored in `document_embeddings` vec0 table - [ ] Semantic metadata (key phrases, readability) still stored in chunks/documents tables - [ ] Full folder lifecycle completes without errors ### Delete Path Integrity - [ ] Manual CASCADE delete removes vec0 entries when document deleted - [ ] Transactional integrity maintained across all tables - [ ] No orphaned embeddings in vec0 tables ### Testing Coverage - [ ] All unit tests pass (sqlite-vec-storage, database-manager) - [ ] All integration tests pass (folder-lifecycle) - [ ] New tests for vec0 schema creation - [ ] New tests for manual CASCADE delete - [ ] New tests for model consistency checks ### Error Handling - [ ] Helpful error message when model dimension mismatches - [ ] Clear instructions for database rebuild when model changes - [ ] Graceful handling of corrupted/missing embedding_config --- ## 🔮 Future Work (Out of Scope) ### Sprint 8: Search Endpoint Implementation Once vec0 infrastructure is stable, implement native vector search: - Read from vec0 tables using `MATCH` operator - SIMD-accelerated similarity search - Performance benchmarking vs current manual approach - Search endpoint using pre-computed embeddings ### Model Change UX - TUI detection of model mismatch - One-click reindex button - Progress tracking during rebuild --- ## 📊 Success Metrics ### Pre-Sprint Baseline - **Embedding storage**: TEXT format (JSON/base64) - **Search method**: Manual iteration + cosine similarity in JavaScript - **Indexing time**: ~X seconds for test folder (measure before sprint) - **Database size**: ~Y MB for test folder (measure before sprint) ### Post-Sprint Target - **Embedding storage**: Vec0 virtual tables with native format - **Search method**: Still manual (search implementation is Sprint 8) - **Indexing time**: ≤ baseline (write path should not slow down) - **Database size**: Similar to baseline (vec0 should be comparable) - **Test coverage**: All lifecycle tests pass with vec0 ### Validation Method - Run full test suite before sprint (baseline) - Run full test suite after each phase - Compare results to ensure no regressions - Verify semantic metadata still accessible in all tests --- ## 🎯 Sprint Execution Order 1. **Phase 1** (1 hour): Schema updates - get table definitions ready 2. **Phase 2** (1-2 hours): Database manager - initialization and validation 3. **Phase 3** (2-3 hours): Write path - critical functionality 4. **Phase 4** (2 hours): Delete path - data integrity 5. **Phase 5** (2-3 hours): Testing and validation - comprehensive coverage **Total: 6-8 hours** --- ## 💡 Notes - **Pre-production mode**: No backward compatibility needed, can break things - **Databases are disposable**: Source files are truth, always safe to delete `.folder-mcp` and reindex - **Semantic data independence**: Key phrases and readability scores are completely separate from embeddings - **Single model constraint**: Makes implementation much simpler than multi-model support would be - **Infrastructure focus**: This sprint prepares storage, Sprint 8 implements search --- **Ready to implement?** Start with Phase 1 (Schema Updates) and work through phases sequentially, testing at each step.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/okets/folder-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

Phase-10-Sprint-7.5-Vec0-Infrastructure-Migration.md•19.4 KiB