KnowledgeMCP


data-model.md•12.2 KiB

# Data Model: MCP Knowledge Server

**Date**: 2025-10-26
**Feature**: MCP Knowledge Server
**Phase**: 1 - Design

## Overview

This document defines the core entities, their relationships, and state transitions for the MCP Knowledge Server.

## Core Entities

### 1. Document

Represents a file added to the knowledge base with all associated metadata.

**Attributes**:

- `id: str` - Unique identifier (UUID)
- `filename: str` - Original filename
- `file_path: str` - Path to original file in storage
- `content_hash: str` - SHA-256 hash of file content (for deduplication)
- `format: DocumentFormat` - File format enum (PDF, DOCX, PPTX, XLSX, HTML, JPG, PNG, SVG)
- `size_bytes: int` - File size in bytes
- `date_added: datetime` - When document was added to knowledge base
- `date_modified: datetime` - When document was last modified
- `processing_status: ProcessingStatus` - Current processing state
- `processing_method: ProcessingMethod` - How content was extracted (TEXT_EXTRACTION, OCR, HYBRID, IMAGE_ANALYSIS)
- `chunk_count: int` - Number of chunks generated
- `embedding_ids: List[str]` - IDs of embeddings in vector database
- `metadata: Dict[str, Any]` - Additional metadata (author, title, page count, etc.)
- `error_message: Optional[str]` - Error message if processing failed

**Enums**:

```python
from enum import Enum


class DocumentFormat(Enum):
    PDF = "pdf"
    DOCX = "docx"
    PPTX = "pptx"
    XLSX = "xlsx"
    HTML = "html"
    JPG = "jpg"
    PNG = "png"
    SVG = "svg"


class ProcessingStatus(Enum):
    PENDING = "pending"        # Queued for processing
    PROCESSING = "processing"  # Currently being processed
    COMPLETED = "completed"    # Successfully processed
    FAILED = "failed"          # Processing failed
    PARTIAL = "partial"        # Partially processed (some chunks failed)


class ProcessingMethod(Enum):
    TEXT_EXTRACTION = "text_extraction"  # Direct text from document
    OCR = "ocr"                          # OCR applied
    HYBRID = "hybrid"                    # Combination of both
    IMAGE_ANALYSIS = "image_analysis"    # Image content analysis
```

**Validation Rules**:

- `filename`: Non-empty, valid characters only
- `file_path`: Must exist in storage directory
- `format`: Must match file extension
- `size_bytes`: Must be > 0 and no larger than the configured `max_file_size_mb` limit (converted to bytes)
- `chunk_count`: >= 0
- `embedding_ids`: Length must equal `chunk_count` when processing is complete

**State Transitions**:

```
PENDING → PROCESSING → COMPLETED
              ↓
            FAILED
              ↓
           PARTIAL
```

**Relationships**:

- Document has many Embeddings (1:N)
- Document belongs to one KnowledgeBase (N:1)

---

### 2. Embedding

Represents a vector embedding of a document chunk with context.

**Attributes**:

- `id: str` - Unique identifier (UUID)
- `document_id: str` - Reference to parent Document
- `chunk_index: int` - Position of this chunk in document (0-indexed)
- `chunk_text: str` - The text that was embedded
- `vector: List[float]` - Embedding vector (384 dimensions for all-MiniLM-L6-v2)
- `metadata: Dict[str, Any]` - Chunk metadata (start_pos, end_pos, page_num, etc.)
- `created_at: datetime` - When embedding was created

**Validation Rules**:

- `document_id`: Must reference existing Document
- `chunk_index`: >= 0, must be unique per document
- `chunk_text`: Non-empty, <= max_chunk_size (config)
- `vector`: Length must be 384 (model dimensionality)
- `metadata`: Must contain source document context

**Storage**:

- Vectors stored in ChromaDB
- Metadata stored alongside for filtering
- Full text searchable via metadata

**Relationships**:

- Embedding belongs to one Document (N:1)

---

### 3. SearchResult

Represents a single result from a semantic search query.

**Attributes**:

- `document_id: str` - ID of source document
- `chunk_id: str` - ID of matching embedding/chunk
- `chunk_text: str` - The matching text excerpt
- `relevance_score: float` - Similarity score (0.0 to 1.0, higher is better)
- `document_metadata: Dict[str, Any]` - Document filename, format, date added
- `chunk_metadata: Dict[str, Any]` - Chunk position, page number, etc.
- `highlight: Optional[str]` - Text with query terms highlighted (future enhancement)

**Validation Rules**:

- `relevance_score`: 0.0 <= score <= 1.0
- `chunk_text`: Non-empty
- `document_id` and `chunk_id`: Must reference existing entities

**Relationships**:

- SearchResult references one Document (N:1)
- SearchResult references one Embedding (N:1)

---

### 4. KnowledgeBase

Represents the aggregate of all documents and embeddings with statistics.
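
As a rough illustration of this aggregate, the sketch below models only the counters and the two computed properties listed for this entity. The class name `KnowledgeBaseStats` and the methods `record_added` and `record_removed` are hypothetical simplifications (the spec's own operations take a `Document`), and returning `0.0` for an empty knowledge base is an assumption, since the spec does not say what `average_chunks_per_document` should be when `document_count` is 0.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeBaseStats:
    """Hypothetical holder for the KnowledgeBase counters (not the spec's class)."""

    document_count: int = 0
    embedding_count: int = 0
    total_size_bytes: int = 0

    @property
    def average_chunks_per_document(self) -> float:
        # Assumption: an empty knowledge base reports 0.0 rather than
        # raising ZeroDivisionError.
        if self.document_count == 0:
            return 0.0
        return self.embedding_count / self.document_count

    @property
    def storage_size_mb(self) -> float:
        # Same formula as the spec: total_size_bytes / (1024 * 1024)
        return self.total_size_bytes / (1024 * 1024)

    def record_added(self, size_bytes: int, chunk_count: int) -> None:
        """Update counters after a document has been stored and embedded."""
        self.document_count += 1
        self.embedding_count += chunk_count
        self.total_size_bytes += size_bytes

    def record_removed(self, size_bytes: int, chunk_count: int) -> None:
        """Update counters after a document and its embeddings are deleted."""
        if self.document_count == 0:
            raise ValueError("no documents to remove")
        self.document_count -= 1
        self.embedding_count -= chunk_count
        self.total_size_bytes -= size_bytes
```

Keeping the derived values as properties means they cannot drift out of sync with the counters, which is one way to satisfy the `>= 0` validation rules without storing redundant state.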

**Attributes**:

- `id: str` - Knowledge base identifier (typically "default" for single KB)
- `name: str` - Human-readable name
- `document_count: int` - Total number of documents
- `embedding_count: int` - Total number of embeddings/chunks
- `total_size_bytes: int` - Sum of all document sizes
- `storage_path: str` - Path to storage directory
- `vector_db_path: str` - Path to ChromaDB storage
- `created_at: datetime` - When knowledge base was created
- `last_updated: datetime` - When last document was added/removed
- `configuration: Dict[str, Any]` - KB-specific configuration

**Computed Properties**:

- `average_chunks_per_document: float` - embedding_count / document_count
- `storage_size_mb: float` - total_size_bytes / (1024 * 1024)

**Validation Rules**:

- `name`: Non-empty
- `document_count`: >= 0
- `embedding_count`: >= 0
- `storage_path` and `vector_db_path`: Must be valid directories

**Operations**:

- `add_document(document: Document)` - Add document and update stats
- `remove_document(document_id: str)` - Remove document and update stats
- `clear()` - Remove all documents and embeddings
- `get_statistics()` - Return summary statistics

**Relationships**:

- KnowledgeBase has many Documents (1:N)

---

### 5. ProcessingStrategy

Represents the decision logic for how to process a document (OCR vs text extraction).

**Attributes**:

- `document_format: DocumentFormat` - Type of document
- `has_text_layer: bool` - Whether document has extractable text
- `extracted_text_length: int` - Length of extracted text (if any)
- `confidence_score: float` - Confidence in text extraction quality
- `recommended_method: ProcessingMethod` - Recommended processing approach

**Decision Logic**:

```python
# extract_text, is_gibberish, and has_embedded_images are helpers
# assumed to be defined elsewhere in the server.
def determine_processing_method(document: Document) -> ProcessingMethod:
    # Image formats have no text layer to extract
    if document.format in (DocumentFormat.JPG, DocumentFormat.PNG, DocumentFormat.SVG):
        return ProcessingMethod.IMAGE_ANALYSIS

    # Try text extraction first
    extracted_text = extract_text(document)

    if len(extracted_text) < 100:
        # Too little text, likely scanned
        return ProcessingMethod.OCR

    if is_gibberish(extracted_text):
        # Text extraction failed, try OCR
        return ProcessingMethod.OCR

    if has_embedded_images(document):
        # Use both methods
        return ProcessingMethod.HYBRID

    return ProcessingMethod.TEXT_EXTRACTION
```

**Validation Rules**:

- `extracted_text_length`: >= 0
- `confidence_score`: 0.0 <= score <= 1.0

---

### 6. ProcessingTask

Represents an async document processing task with progress tracking.
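
To make the task lifecycle concrete, here is a minimal sketch of how the status enum, state diagram, and validation rules given for this entity might be enforced. The names `ALLOWED_TRANSITIONS`, `can_transition`, and `compute_progress` are hypothetical, and the transition table is one assumed reading of the diagram: `QUEUED` may only move to `RUNNING`, and `RUNNING` may end in any of the three terminal states.

```python
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumed reading of the state diagram; terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    TaskStatus.QUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {
        TaskStatus.COMPLETED,
        TaskStatus.FAILED,
        TaskStatus.CANCELLED,
    },
    TaskStatus.COMPLETED: set(),
    TaskStatus.FAILED: set(),
    TaskStatus.CANCELLED: set(),
}


def can_transition(current: TaskStatus, new: TaskStatus) -> bool:
    """Return True if moving from `current` to `new` is permitted."""
    return new in ALLOWED_TRANSITIONS[current]


def compute_progress(completed_steps: int, total_steps: int) -> float:
    """Derive progress in [0.0, 1.0] from the step counters."""
    if total_steps <= 0:
        # Assumption: a task with no known steps yet reports zero progress.
        return 0.0
    if not 0 <= completed_steps <= total_steps:
        raise ValueError("completed_steps must be between 0 and total_steps")
    return completed_steps / total_steps
```

Deriving `progress` from the step counters, rather than storing it separately, keeps the `0.0 <= progress <= 1.0` and `completed_steps <= total_steps` rules in one place. Whether a queued task can be cancelled directly is not stated in the spec, so that edge is left out here.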

**Attributes**:

- `task_id: str` - Unique task identifier (UUID)
- `document_id: str` - Document being processed
- `status: TaskStatus` - Current task status
- `progress: float` - Progress as a fraction (0.0 to 1.0)
- `current_step: str` - Description of current step
- `total_steps: int` - Estimated total steps
- `completed_steps: int` - Number of completed steps
- `started_at: datetime` - When task started
- `completed_at: Optional[datetime]` - When task finished
- `error: Optional[str]` - Error message if failed

**Enums**:

```python
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
```

**State Transitions**:

```
QUEUED → RUNNING → COMPLETED
            ↓
          FAILED
            ↓
        CANCELLED
```

**Validation Rules**:

- `progress`: 0.0 <= progress <= 1.0
- `completed_steps`: <= total_steps
- `completed_at`: Must be after `started_at` if set

---

## Relationships Diagram

```
KnowledgeBase (1) ──────── (N) Document (1) ──────── (N) Embedding
                                   │
                                   │ references
                                   ↓
                              SearchResult ←─ references ─→ Embedding

ProcessingTask (N) ──references──→ Document (1)
ProcessingStrategy ──analyzes──→ Document (1)
```

## Data Persistence

### ChromaDB Schema

**Collection**: `knowledge_base_documents`

Each document's chunks are stored as:

```python
{
    "ids": ["chunk_uuid_1", "chunk_uuid_2", ...],
    "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...], ...],
    "metadatas": [
        {
            "document_id": "doc_uuid",
            "filename": "example.pdf",
            "chunk_index": 0,
            "chunk_text": "...",
            "format": "pdf",
            "page_num": 1,
            "processing_method": "text_extraction",
            "date_added": "2025-10-26T10:00:00Z"
        },
        ...
    ],
    "documents": ["chunk text 1", "chunk text 2", ...]  # Full text for reference
}
```

**Indexes**:

- Vector index (HNSW) for similarity search
- Metadata filters on: `document_id`, `filename`, `format`, `date_added`

### Filesystem Storage

**Document Storage**:

```
{storage_path}/
├── documents/
│   ├── {document_id}/
│   │   ├── original.{ext}        # Original file
│   │   ├── metadata.json         # Document metadata
│   │   └── extracted_text.txt    # Extracted text (optional)
└── metadata.db                   # SQLite for document metadata (optional)
```

**Vector Database Storage**:

```
{vector_db_path}/
└── chroma.sqlite3    # ChromaDB persistent storage
```

## Data Access Patterns

### Common Queries

1. **Add Document**:
   - Validate and store file
   - Extract text/OCR
   - Chunk text
   - Generate embeddings
   - Store in ChromaDB
   - Update KnowledgeBase stats

2. **Search**:
   - Generate query embedding
   - Query ChromaDB for similar vectors
   - Retrieve top N results
   - Hydrate with document metadata
   - Return SearchResults

3. **Remove Document**:
   - Get document embedding IDs
   - Delete embeddings from ChromaDB
   - Delete document files
   - Update KnowledgeBase stats

4. **List Documents**:
   - Query all documents from ChromaDB metadata
   - Group by document_id
   - Return document metadata

5. **Get Status**:
   - Count documents in ChromaDB
   - Sum storage sizes
   - Return statistics

## Data Validation

### Input Validation

- File format matches extension
- File size within limits (< 100 MB default)
- File is readable and not corrupted
- Filename contains no path traversal characters

### Output Validation

- Embeddings have correct dimensionality (384)
- All chunks have valid document references
- Metadata is complete and well-formed
- State transitions are valid

## Migration Strategy

### Version 1.0 Schema

Current schema as defined above.

### Future Versions

- Add `schema_version` to metadata
- Implement migration scripts for schema changes
- Maintain backward compatibility for 1 major version

### Data Export/Import

Support export to JSON format:

```json
{
  "version": "1.0",
  "knowledge_base": {...},
  "documents": [...],
  "embeddings": [...]
}
```

## Performance Considerations

### Indexing Performance

- Batch insert embeddings (32-64 at a time)
- Async processing to avoid blocking
- Progress tracking every 10 chunks

### Query Performance

- ChromaDB HNSW index for fast similarity search
- Metadata filters before vector search (if applicable)
- Limit results to top 10-20 for UI responsiveness

### Storage Optimization

- Compress original documents (optional)
- Store only essential metadata
- Periodic cleanup of orphaned embeddings

## Testing Data Requirements

### Unit Tests

- Mock documents with known embeddings
- Test data validation edge cases
- Test state transitions

### Integration Tests

- Sample documents: PDF (10 pages), DOCX (5 pages), scanned PDF, image
- Test with 100+ documents for performance
- Test concurrent operations

### Performance Tests

- 1000+ documents for search latency testing
- Large documents (100+ pages) for memory profiling
- Concurrent operations (10+ clients)
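
For the "mock documents with known embeddings" item under Unit Tests, one possible approach is a deterministic stand-in for the embedding model, so that tests do not need to load all-MiniLM-L6-v2. The sketch below is an assumption, not part of the spec: `mock_embedding` is a hypothetical helper that hashes the input text into a unit-length 384-dimensional vector. The vectors are reproducible but carry no semantic meaning, so they suit validation and storage tests rather than relevance tests.

```python
import hashlib
import math
import struct

EMBEDDING_DIM = 384  # dimensionality the spec states for all-MiniLM-L6-v2


def mock_embedding(text: str, dim: int = EMBEDDING_DIM) -> list[float]:
    """Return a deterministic, unit-length vector derived from `text`.

    Hypothetical test helper: identical text always yields the same vector,
    but similar texts do NOT yield similar vectors.
    """
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Each SHA-256 digest is 32 bytes, i.e. eight unsigned 32-bit integers.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        for (number,) in struct.iter_unpack(">I", digest):
            # Scale each integer into the range [-1.0, 1.0].
            values.append(number / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    values = values[:dim]
    # Normalize to unit length, as is common for cosine-similarity search.
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]
```

A unit test can then assert the dimensionality rule directly, for example `assert len(mock_embedding("chunk text 1")) == 384`, without depending on model weights.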
