# Data Storage

## Overview

The storage system uses SQLite with a normalized, four-table schema design for efficient document storage, retrieval, and version management. The schema supports page-level metadata tracking, ETag-based change detection, and hierarchical document chunking.

## Database Schema

The database consists of four core tables with normalized relationships:

```mermaid
erDiagram
    libraries ||--o{ versions : has
    versions ||--o{ pages : contains
    pages ||--o{ documents : has_chunks
    libraries {
        int id PK
        text name UK
        datetime created_at
        datetime updated_at
    }
    versions {
        int id PK
        int library_id FK
        text name
        text status
        int progress_pages
        int progress_max_pages
        text error_message
        datetime started_at
        datetime created_at
        datetime updated_at
        text source_url
        json scraper_options
    }
    pages {
        int id PK
        int version_id FK
        text url UK
        text title
        text etag
        text last_modified
        text content_type
        int depth
        datetime created_at
        datetime updated_at
    }
    documents {
        int id PK
        int page_id FK
        text content
        json metadata
        int sort_order
        blob embedding
        datetime created_at
    }
```

### Libraries Table

Core library metadata and organization.

**Schema:**

- `id` (INTEGER PRIMARY KEY): Auto-increment identifier
- `name` (TEXT UNIQUE): Library name (case-insensitive)
- `created_at` (DATETIME): Creation timestamp
- `updated_at` (DATETIME): Last update timestamp

**Purpose:** Library name normalization and metadata storage.

**Code Reference:** `src/store/types.ts` - Type definitions used throughout DocumentManagementService

### Versions Table

Version tracking with comprehensive status and configuration.

**Schema:**

- `id` (INTEGER PRIMARY KEY): Auto-increment identifier
- `library_id` (INTEGER FK): References libraries(id)
- `name` (TEXT): Version name (NULL for unversioned content)
- `status` (TEXT): Version indexing status (not_indexed, queued, running, completed, failed, cancelled, updating)
- `progress_pages` (INTEGER): Current page count during indexing
- `progress_max_pages` (INTEGER): Maximum pages to index
- `error_message` (TEXT): Error details if indexing fails
- `started_at` (DATETIME): When indexing job started
- `created_at` (DATETIME): Creation timestamp
- `updated_at` (DATETIME): Last update timestamp
- `source_url` (TEXT): Original scraping URL
- `scraper_options` (JSON): Stored scraper configuration for reproducibility

**Purpose:** Job state management, progress tracking, and scraper configuration persistence.

**Code Reference:** `src/store/types.ts` lines 184-201 (DbVersion interface)

### Pages Table

Page-level metadata for each unique URL within a version.

**Schema:**

- `id` (INTEGER PRIMARY KEY): Auto-increment identifier
- `version_id` (INTEGER FK): References versions(id)
- `url` (TEXT): Page URL (unique per version)
- `title` (TEXT): Page title extracted from content
- `etag` (TEXT): HTTP ETag for change detection
- `last_modified` (TEXT): HTTP Last-Modified header
- `content_type` (TEXT): MIME type of the content
- `depth` (INTEGER): Crawl depth from source URL (0 = root page)
- `created_at` (DATETIME): Creation timestamp
- `updated_at` (DATETIME): Last update timestamp

**Purpose:** Page-level metadata tracking, ETag-based refresh support, and depth tracking for scoping.

**Code Reference:**

- `db/migrations/009-add-pages-table.sql` - Initial pages table creation
- `db/migrations/010-add-depth-to-pages.sql` - Depth column addition
- `src/store/types.ts` lines 9-20 (DbPage interface)
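The stored `etag` and `last_modified` values let a refresh run skip pages that have not changed. A minimal sketch of such a conditional check, assuming an illustrative `StoredPage` shape and helper name (not the project's actual code):

```typescript
// Illustrative only: how a stored page row's etag/last_modified could drive
// a conditional re-fetch during a refresh run.
interface StoredPage {
  url: string;
  etag: string | null;
  lastModified: string | null;
}

async function hasPageChanged(page: StoredPage): Promise<boolean> {
  const headers: Record<string, string> = {};
  if (page.etag) headers["If-None-Match"] = page.etag;
  if (page.lastModified) headers["If-Modified-Since"] = page.lastModified;

  const response = await fetch(page.url, { method: "HEAD", headers });

  // 304 Not Modified means the stored chunks for this page are still current,
  // so re-scraping and re-embedding can be skipped.
  return response.status !== 304;
}
```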
### Documents Table

Document chunks with embeddings and hierarchical metadata.

**Schema:**

- `id` (INTEGER PRIMARY KEY): Auto-increment identifier
- `page_id` (INTEGER FK): References pages(id)
- `content` (TEXT): Chunk content text
- `metadata` (JSON): Chunk-specific metadata (level, path, types)
- `sort_order` (INTEGER): Ordering within page
- `embedding` (BLOB): Vector embedding as binary data
- `created_at` (DATETIME): Creation timestamp

**Purpose:** Content storage with vector embeddings, hierarchical metadata, and search optimization.

**Code Reference:** `src/store/types.ts` lines 39-48 (DbChunk interface)

## Schema Evolution

### Migration System

Sequential SQL migrations in `db/migrations/`:

1. `000-initial-schema.sql` - Base schema with documents, FTS, and vector tables
2. `001-add-indexed-at-column.sql` - Indexing timestamp tracking
3. `002-normalize-library-table.sql` - Library and version normalization
4. `003-normalize-vector-table.sql` - Vector storage optimization
5. `004-complete-normalization.sql` - Remove redundant columns, finalize schema
6. `005-add-status-tracking.sql` - Job status and progress tracking
7. `006-add-scraper-options.sql` - Configuration persistence for reproducibility
8. `007-dedupe-unversioned-versions.sql` - Enforce unique unversioned content
9. `008-case-insensitive-names.sql` - Case-insensitive library name handling
10. `009-add-pages-table.sql` - Page-level metadata normalization
11. `010-add-depth-to-pages.sql` - Crawl depth tracking for refresh operations

**Code Reference:** All migration files in `db/migrations/` directory

### Migration Application

Automatic migration execution on startup:

- Check current schema version against available migrations
- Apply pending migrations sequentially
- Validate schema integrity after each migration
- Handle migration failures with detailed error messages
- Trigger-based FTS index maintenance

## Data Location

### Storage Directory Resolution

Database location determined by priority:

1. Project-local `.store` directory (development)
2. OS-specific application data directory (production)
3. Temporary directory as fallback

### Default Storage Location

On macOS: `~/Library/Application Support/docs-mcp-server/`

**Code Reference:** `src/utils/paths.ts` - `resolveStorePath()` function

## Document Management

### DocumentManagementService

Handles document lifecycle operations with normalized schema access.

**Core Operations:**

- Document chunk insertion via pages table
- Version management and cleanup
- Library organization with case-insensitive handling
- Page-level metadata management
- Duplicate detection using unique constraints

**Version Resolution:**

- Exact version name matching
- Semantic version range queries
- Latest version fallback logic
- Unversioned content handling (NULL version name)

**Code Reference:** `src/store/DocumentManagementService.ts`

### Document Storage Flow

1. Create or resolve library record (case-insensitive name)
2. Create version record with job configuration
3. Create page records for each unique URL
4. Process and store document chunks linked to pages
5. Generate and store embeddings as binary BLOB
6. Update version status and progress
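A minimal sketch of steps 3-5 using better-sqlite3, assuming the table and column names described above; the real write path lives in `DocumentManagementService`:

```typescript
import Database from "better-sqlite3";

const db = new Database("docs.db");

// Hypothetical sketch: one transaction per page so a failure rolls back the
// page row and all of its chunks together.
const insertPage = db.prepare(
  `INSERT INTO pages (version_id, url, title, etag, depth)
   VALUES (?, ?, ?, ?, ?)`
);
const insertChunk = db.prepare(
  `INSERT INTO documents (page_id, content, metadata, sort_order, embedding)
   VALUES (?, ?, ?, ?, ?)`
);

const storePage = db.transaction(
  (
    versionId: number,
    page: { url: string; title: string; etag: string | null; depth: number },
    chunks: { content: string; metadata: object; embedding: Float32Array }[]
  ) => {
    const { lastInsertRowid: pageId } = insertPage.run(
      versionId, page.url, page.title, page.etag, page.depth
    );
    chunks.forEach((chunk, i) => {
      insertChunk.run(
        pageId,
        chunk.content,
        JSON.stringify(chunk.metadata),   // flexible chunk metadata as JSON
        i,                                // sort_order preserves chunk order
        Buffer.from(chunk.embedding.buffer) // embedding stored as raw BLOB bytes
      );
    });
  }
);
```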
## Embedding Management

### Vector Storage

Embeddings stored as BLOB in documents table:

- 1536-dimensional vectors (OpenAI embedding size)
- Provider-agnostic binary serialization
- NULL handling for documents without embeddings
- Direct storage eliminates need for separate vector table

**Code Reference:** `src/store/types.ts` line 4 (VECTOR_DIMENSION constant)

### EmbeddingFactory

Centralized embedding generation supporting multiple providers.

**Supported Providers:**

- **OpenAI:** text-embedding-3-small (default), text-embedding-3-large, custom endpoints
- **Google:** Gemini embedding models, Vertex AI with service account auth
- **Azure:** Azure OpenAI service with custom deployments
- **AWS:** Bedrock embedding models with IAM authentication

**Code Reference:** `src/store/embeddings/EmbeddingFactory.ts`

## Search Implementation

### DocumentRetrieverService

Handles search and retrieval operations with hybrid ranking.

**Search Methods:**

- Vector similarity search using sqlite-vec extension
- Full-text search using FTS5 virtual table
- Hybrid search with Reciprocal Rank Fusion (RRF)
- Context-aware result assembly

**Search Architecture:**

1. Query embeddings generated for vector search
2. FTS5 query for keyword matching
3. Results combined using RRF algorithm
4. Chunks assembled with hierarchical context
5. Results ranked by combined score

**Code Reference:** `src/store/DocumentRetrieverService.ts`

### FTS Implementation

Full-text search using SQLite FTS5:

- Porter stemmer for English language
- Unicode61 tokenizer for international support
- Trigger-based index maintenance (automatic updates)
- External content mode (FTS references documents table)

**Indexed Fields:** content, title, url, path (from metadata)
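A minimal sketch of the Reciprocal Rank Fusion step described under Search Architecture above, merging the vector and FTS result lists; the smoothing constant `k = 60` is the conventional default and an assumption here, not confirmed from the codebase:

```typescript
// Illustrative RRF combination of two ranked result lists (vector + FTS).
function reciprocalRankFusion(
  vectorIds: number[], // chunk ids ordered by vector similarity
  ftsIds: number[],    // chunk ids ordered by FTS5 rank
  k = 60
): number[] {
  const scores = new Map<number, number>();

  const accumulate = (ids: number[]) => {
    ids.forEach((id, rank) => {
      // Each list contributes 1 / (k + rank) to the fused score.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  };

  accumulate(vectorIds);
  accumulate(ftsIds);

  // Higher fused score first; chunks found by both searches rank highest.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```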
## Data Consistency

### Write-Through Architecture

Immediate persistence of state changes via database transactions:

- Job status updates to versions table
- Progress tracking during indexing
- Configuration changes with full audit trail
- Error information for debugging

### Transaction Management

Database transactions ensure consistency:

- Atomic page and document insertions
- Version state transitions with validation
- Batch operations for performance
- Automatic rollback on errors

### Concurrent Access

Safe concurrent database access:

- Better-sqlite3 with synchronous API
- Transaction-based locking
- Read operations don't block each other
- Write operations serialize automatically

**Code Reference:** `src/store/` - All service classes use transaction blocks

## Performance Considerations

### Index Strategy

Database indexes optimize query performance:

- Primary keys on all tables (automatic)
- Foreign key indexes for join performance
- FTS5 indexes for text search
- Composite indexes for common query patterns (library_id + status)

**Code Reference:** Index creation statements in migration files

### Query Optimization

Efficient query patterns throughout the codebase:

- Prepared statements for repeated queries
- Batch operations for bulk inserts
- JOIN queries to minimize round trips
- Query result pagination for large result sets

### Storage Efficiency

Space-efficient data storage:

- Binary embedding storage (BLOB format)
- JSON metadata for flexible chunk properties
- Normalized schema eliminates redundant data
- SQLite VACUUM operations for space reclamation

## Backup and Recovery

### Data Export

Export functionality through DocumentManagementService:

- Complete database export via SQLite backup API
- Library-specific export using filtered queries
- Version-specific export for portability
- Metadata preservation in JSON format

### Data Import

Import from external sources:

- Database restoration from backups
- Configuration-based re-indexing
- Duplicate detection during import (unique constraints)
- Automatic migration application

### Disaster Recovery

Recovery mechanisms:

- Database integrity checks on startup
- Transaction log for crash recovery (SQLite WAL mode)
- Schema validation after migration
- Automatic repair for corrupted indexes

## Monitoring and Maintenance

### Database Health

Health monitoring capabilities:

- Storage space utilization tracking
- Query performance metrics via logging
- Connection status monitoring
- Error rate tracking in version records

### Maintenance Operations

Regular maintenance tasks:

- VACUUM operations for space recovery
- Index rebuilding via REINDEX
- Orphaned record cleanup via foreign key constraints
- Performance analysis using EXPLAIN QUERY PLAN

### Diagnostics

Debugging and diagnostic capabilities:

- Query execution analysis
- Storage space breakdown by table
- Relationship integrity checks via PRAGMA foreign_key_check
- Performance bottleneck identification
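A minimal sketch of how the maintenance and diagnostic commands listed above could be run through better-sqlite3; the wrapper function and the example query are illustrative, not part of the project's code:

```typescript
import Database from "better-sqlite3";

// Illustrative maintenance pass using the SQLite commands referenced above.
function runMaintenance(dbPath: string): void {
  const db = new Database(dbPath);

  // Structural and referential integrity checks.
  const integrity = db.pragma("integrity_check");
  const fkViolations = db.pragma("foreign_key_check");
  console.log({ integrity, fkViolations });

  // Rebuild indexes and reclaim unused space.
  db.exec("REINDEX");
  db.exec("VACUUM");

  // Query-plan inspection for a common lookup pattern (library_id + status).
  const plan = db
    .prepare(
      "EXPLAIN QUERY PLAN SELECT * FROM versions WHERE library_id = ? AND status = ?"
    )
    .all(1, "completed");
  console.log(plan);

  db.close();
}
```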
