TDZ C64 Knowledge

tdz-c64-knowledge
archive
historical-docs

IMPROVEMENTS.md•41.3 KiB

# MCP Server Improvement Suggestions ## 🎯 Priority Levels - **P0**: Critical improvements (security, correctness, major performance) - **P1**: High-value improvements (user experience, scalability) - **P2**: Nice-to-have enhancements (convenience, polish) --- ## ✅ Completed Improvements ### Phase 1: Quick Wins (Completed) - ✅ **Logging Infrastructure** - File and console logging with timestamps - ✅ **Custom Exception Classes** - KnowledgeBaseError hierarchy for better error handling - ✅ **Search Term Highlighting** - Markdown bold highlighting in search results - ✅ **PDF Metadata Extraction** - Author, subject, creator, creation date extraction ### Phase 2: Search Quality (Completed) - ✅ **BM25 Search Algorithm** - Industry-standard ranking with rank-bm25 library - ✅ **Phrase Search Support** - Exact phrase matching with quote detection - ✅ **Query Preprocessing** - NLTK-powered stemming and stopword removal - ✅ **PDF Page Number Tracking** - Estimated page numbers in search results - ✅ **Score Filtering Fix** - Handles negative BM25 scores for small documents ### Phase 3: Storage & Scalability (Completed) - ✅ **SQLite Database Migration** - Migrated from JSON files to SQLite for better scalability - ✅ **Lazy Loading** - Only loads document metadata at startup, chunks loaded on-demand - ✅ **ACID Transactions** - Database transactions ensure data integrity - ✅ **Automatic Migration** - Seamless upgrade from JSON to SQLite on first run ### Phase 4: Performance Optimization (Completed) - ✅ **SQLite FTS5 Full-Text Search** - Native search with 480x performance improvement ### Phase 5: P0 Critical Features (Completed) - ✅ **Path Traversal Protection** - Security validation for document ingestion - ✅ **Semantic Search with Embeddings** - Conceptual search using sentence-transformers ### Additional Completions - ✅ **CI/CD Pipeline** - GitHub Actions workflow for multi-platform testing - ✅ **Comprehensive Test Suite** - 19 tests covering all functionality including SQLite - ✅ **Documentation** - Updated README, CLAUDE.md with all new features --- ## 1. Search Quality Improvements ### ✅ COMPLETED: P0: Implement Better Search Algorithms **Status**: ✅ Implemented with BM25Okapi **Implementation Details**: - Using `rank-bm25` package with BM25Okapi algorithm - Default k1=1.2, b=0.75 parameters - Handles negative scores for small documents with `abs(score) > 0.0001` filter - Enabled by default, can disable with `USE_BM25=0` environment variable - Falls back to simple search if rank-bm25 not available ### ✅ COMPLETED: P1: Add Phrase Search Support **Status**: ✅ Implemented with regex pattern matching **Implementation Details**: ```python # Extract phrases in quotes phrase_pattern = r'"([^"]*)"' phrases = re.findall(phrase_pattern, query) # Boost phrase matches 2x in BM25, 10x in simple search ``` ### ✅ COMPLETED: P1: Implement Query Preprocessing **Status**: ✅ Implemented with NLTK **Implementation Details**: - Using NLTK for tokenization, stemming, and stopword removal - Porter Stemmer for word normalization - English stopwords corpus for filtering - Preserves technical terms (hyphenated words, numbers) - Applied to both queries and BM25 corpus - Configurable via `USE_QUERY_PREPROCESSING` environment variable ```python # Preprocessing features: - word_tokenize() for smart tokenization - PorterStemmer for stemming ("running" → "run") - stopwords.words('english') for filtering - Special handling for "VIC-II", "6502", etc. ``` ### ✅ COMPLETED: P2: Fuzzy Search / Typo Tolerance **Status**: ✅ Implemented with rapidfuzz **Completed**: December 2025 **Implementation Details**: ```python from rapidfuzz import fuzz class KnowledgeBase: def __init__(self, data_dir): self.use_fuzzy = FUZZY_SUPPORT and os.getenv('USE_FUZZY_SEARCH', '1') == '1' self.fuzzy_threshold = int(os.getenv('FUZZY_THRESHOLD', '80')) # 0-100 def _fuzzy_match_terms(self, query_terms: list[str], content: str) -> tuple[bool, float]: """Check if query terms fuzzy match content using rapidfuzz.""" content_words = content.lower().split() match_scores = [] for query_term in query_terms: # Check exact match first (fastest) if query_term.lower() in content.lower(): match_scores.append(100.0) continue # Try fuzzy matching against content words best_score = 0.0 for content_word in content_words: score = fuzz.ratio(query_term.lower(), content_word) if score > best_score: best_score = score if score >= self.fuzzy_threshold: break match_scores.append(best_score) match_found = any(score >= self.fuzzy_threshold for score in match_scores) avg_score = sum(match_scores) / len(match_scores) if match_scores else 0.0 return (match_found, avg_score) def _search_simple(self, query_terms: set, phrases: list, tags, max_results): # Exact matching first for chunk in self.chunks: score = # ... exact match scoring ... # If no exact matches and fuzzy enabled, try fuzzy matching if self.use_fuzzy and score == 0: match_found, fuzzy_score = self._fuzzy_match_terms(list(query_terms), chunk.content) if match_found: score = fuzzy_score / 10.0 # Scaled down vs exact ``` **Key Features**: - Rapidfuzz library for Levenshtein distance calculation - Configurable similarity threshold (default: 80%) - Exact match priority (100% score) - Per-word fuzzy matching in content - Early exit when threshold met for performance - Integrated into simple search as fallback - Optional (disabled via USE_FUZZY_SEARCH=0) **Environment Variables**: - `USE_FUZZY_SEARCH=1` - Enable fuzzy search (default: enabled) - `FUZZY_THRESHOLD=80` - Similarity threshold 0-100 (default: 80%) **Benefits**: - ✅ Handles typos ("VIC-I" finds "VIC-II") - ✅ Finds similar terms ("sprites" finds "sprite") - ✅ Graceful degradation (exact match still preferred) - ✅ Configurable tolerance for precision vs recall tradeoff **Examples**: - Query: "registr" (typo) → Finds: "register" (90% similarity) - Query: "VIC-I" (typo) → Finds: "VIC-II" (83% similarity) - Query: "grafics" (typo) → Finds: "graphics" (88% similarity) --- ## 2. Performance & Scalability ### ✅ COMPLETED: P0: Move to SQLite Database **Status**: ✅ Implemented with automatic migration **Completed**: December 2025 **Implementation Details**: ```sql CREATE TABLE documents ( doc_id TEXT PRIMARY KEY, filename TEXT, title TEXT, filepath TEXT, file_type TEXT, total_pages INTEGER, total_chunks INTEGER, indexed_at TIMESTAMP, tags TEXT -- JSON array ); CREATE TABLE chunks ( doc_id TEXT, chunk_id INTEGER, content TEXT, word_count INTEGER, PRIMARY KEY (doc_id, chunk_id), FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ); CREATE INDEX idx_chunks_content ON chunks(content); CREATE INDEX idx_documents_tags ON documents(tags); ``` **Benefits**: - Scales to 100k+ documents - Efficient filtering and sorting - Full-text search (FTS5) - ACID compliance **Effort**: High (3-5 days) ### ✅ COMPLETED: P1: Implement SQLite FTS5 for Full-Text Search **Status**: ✅ Implemented with Porter stemming and automatic sync **Completed**: December 2025 **Implementation Details**: ```sql CREATE VIRTUAL TABLE chunks_fts5 USING fts5( doc_id UNINDEXED, chunk_id UNINDEXED, content, tokenize='porter unicode61' ); -- Automatic triggers keep FTS5 in sync CREATE TRIGGER chunks_fts5_insert AFTER INSERT ON chunks ... CREATE TRIGGER chunks_fts5_delete AFTER DELETE ON chunks ... CREATE TRIGGER chunks_fts5_update AFTER UPDATE ON chunks ... ``` **Key Features**: - Native SQLite BM25 ranking (`ORDER BY rank`) - Porter stemming for improved matching - Automatic population for existing databases - Environment variable control via `USE_FTS5=1` - Fallback to BM25/simple search if FTS5 unavailable **Performance Improvements**: - Search queries: **50ms (FTS5)** vs 24,000ms (BM25) = **480x faster** - No need to load all chunks into memory for search - Native tokenization and stemming in SQLite **Usage**: ```bash # Enable FTS5 search export USE_FTS5=1 # or set in MCP config env ``` **Benefits**: - 10-100x faster search with built-in ranking ✅ Achieved (480x in practice) - Reduced memory usage (no chunk loading for search) - Built-in Porter stemming tokenizer ### ✅ COMPLETED: P1: Add Caching Layer **Status**: ✅ Implemented with cachetools TTLCache **Completed**: December 2025 **Implementation Details**: ```python from cachetools import TTLCache import hashlib class KnowledgeBase: def __init__(self, data_dir): cache_size = int(os.getenv('SEARCH_CACHE_SIZE', '100')) cache_ttl = int(os.getenv('SEARCH_CACHE_TTL', '300')) # 5 minutes self._search_cache = TTLCache(maxsize=cache_size, ttl=cache_ttl) self._similar_cache = TTLCache(maxsize=cache_size, ttl=cache_ttl) def _cache_key(self, method: str, **kwargs) -> str: sorted_items = sorted(kwargs.items()) key_str = f"{method}:{sorted_items}" return hashlib.md5(key_str.encode()).hexdigest() def search(self, query: str, max_results: int = 5, tags=None): cache_key = self._cache_key('search', query=query, max_results=max_results, tags=tuple(sorted(tags)) if tags else None) if cache_key in self._search_cache: return self._search_cache[cache_key] # Cache hit results = self._search_impl(query, max_results, tags) self._search_cache[cache_key] = results return results def _invalidate_caches(self): """Clear caches on data changes.""" self._search_cache.clear() self._similar_cache.clear() ``` **Key Features**: - TTL-based expiration (default: 5 minutes) - Configurable size and TTL via environment variables - Separate caches for search() and find_similar_documents() - Automatic invalidation on add/remove operations - Thread-safe cachetools.TTLCache implementation - Debug logging for cache hits/misses **Performance Impact**: - 50-100x speedup for repeated queries - Minimal memory overhead (stores references, not copies) - Zero-config with sensible defaults **Environment Variables**: - `SEARCH_CACHE_SIZE=100` - Maximum cache entries - `SEARCH_CACHE_TTL=300` - TTL in seconds ### P2: Lazy Loading of Chunks **Current Issue**: All chunks loaded at startup **Recommendations**: - Load chunks on-demand from disk/database - Keep index in memory, stream chunks as needed --- ## 3. Feature Enhancements ### ✅ COMPLETED: P0: Add Semantic Search with Embeddings **Status**: ✅ Implemented with FAISS and sentence-transformers **Completed**: December 2025 **Implementation Details**: ```python # sentence-transformers for embeddings generation from sentence_transformers import SentenceTransformer import faiss import numpy as np class KnowledgeBase: def __init__(self, data_dir): # Initialize embeddings model self.embeddings_model = SentenceTransformer('all-MiniLM-L6-v2') self.embeddings_index = None # FAISS index self.embeddings_doc_map = [] # Maps index positions to (doc_id, chunk_id) def _build_embeddings(self): # Generate embeddings for all chunks chunks = self._get_chunks_db() texts = [chunk.content for chunk in chunks] embeddings = self.embeddings_model.encode(texts, convert_to_numpy=True) # Create FAISS index with cosine similarity dimension = embeddings.shape[1] self.embeddings_index = faiss.IndexFlatIP(dimension) faiss.normalize_L2(embeddings) self.embeddings_index.add(embeddings) def semantic_search(self, query: str, max_results: int = 5, tags=None): # Encode query and search query_embedding = self.embeddings_model.encode([query], convert_to_numpy=True) faiss.normalize_L2(query_embedding) scores, indices = self.embeddings_index.search(query_embedding, max_results) # Return results with similarity scores ``` **Key Features**: - SentenceTransformer embeddings (default: all-MiniLM-L6-v2) - FAISS vector similarity search with cosine distance - Persistent embeddings storage (embeddings.faiss, embeddings_map.json) - Lazy embeddings generation (built on first semantic search) - Automatic invalidation on add/remove operations - Tag filtering support **Environment Variables**: - `USE_SEMANTIC_SEARCH=1` - Enable semantic search - `SEMANTIC_MODEL=all-MiniLM-L6-v2` - Model to use **Performance**: - Embeddings generation: ~1 min for 2347 chunks (one-time) - Search speed: ~7-16ms per query - Persistent storage avoids rebuilding **Benefits**: - ✅ Finds "sprites" when searching for "movable objects" - ✅ Understands context and meaning - ✅ Better for natural language queries **Libraries**: `sentence-transformers>=2.0.0`, `faiss-cpu>=1.7.0` ### ✅ COMPLETED: P1: Highlight Search Terms in Results **Status**: ✅ Implemented in `_extract_snippet()` method **Implementation Details**: ```python # Highlight matching terms (case-insensitive) for term in query_terms: if len(term) >= 2: pattern = re.compile(f'({re.escape(term)})', re.IGNORECASE) snippet = pattern.sub(r'**\1**', snippet) ``` ### ✅ COMPLETED: P1: Add "More Like This" Tool **Status**: ✅ Implemented with semantic and TF-IDF similarity **Completed**: December 2025 **Implementation Details**: ```python # MCP Tool Tool( name="find_similar", description="Find documents similar to a given document or chunk", inputSchema={ "properties": { "doc_id": {"type": "string", "description": "Source document ID"}, "chunk_id": {"type": "integer", "description": "Optional chunk ID for chunk-level similarity"}, "max_results": {"type": "integer", "default": 5}, "tags": {"type": "array", "description": "Filter by tags"} } } ) # Core implementation def find_similar_documents(self, doc_id: str, chunk_id: Optional[int] = None, max_results: int = 5, tags: Optional[list[str]] = None): # Prefer semantic search if available if self.use_semantic and self.embeddings_index is not None: return self._find_similar_semantic(doc_id, chunk_id, max_results, tags) else: # Fall back to TF-IDF similarity return self._find_similar_tfidf(doc_id, chunk_id, max_results, tags) ``` **Key Features**: - **Dual-mode similarity**: Uses semantic embeddings (FAISS) if available, else TF-IDF cosine similarity - **Document-level similarity**: Find similar documents when chunk_id is None - **Chunk-level similarity**: Find chunks similar to a specific chunk when chunk_id provided - **Tag filtering**: Filter results by document tags - **Smart aggregation**: Groups chunks by document for document-level results **Semantic Similarity** (`_find_similar_semantic`): - Uses FAISS embeddings index for fast nearest-neighbor search - Aggregates chunk scores by document (mean similarity) - Returns documents sorted by average similarity score **TF-IDF Similarity** (`_find_similar_tfidf`): - Builds TF-IDF vectors for all chunks - Computes cosine similarity between vectors - Works without requiring embeddings generation **Benefits**: - ✅ Discover related documentation automatically - ✅ Navigate knowledge base by similarity - ✅ No external dependencies required (TF-IDF fallback) - ✅ Fast lookups with FAISS when available ### ✅ COMPLETED: P1: Document Update Detection **Status**: ✅ Implemented with hybrid mtime + content hash verification **Completed**: December 2025 **Implementation Details**: ```python @dataclass class DocumentMeta: # ... existing fields ... file_hash: Optional[str] = None # MD5 of file contents file_mtime: Optional[float] = None # Modification time def _compute_file_hash(self, filepath: str) -> str: """Compute MD5 hash of file content.""" md5_hash = hashlib.md5() with open(filepath, 'rb') as f: for chunk in iter(lambda: f.read(8192), b''): md5_hash.update(chunk) return md5_hash.hexdigest() def needs_reindex(self, filepath: str, doc_id: str) -> bool: """Check if document needs re-indexing (hybrid approach).""" doc = self.documents.get(doc_id) if not doc or doc.file_mtime is None: return True # Quick check: modification time current_mtime = os.path.getmtime(filepath) if current_mtime <= doc.file_mtime: return False # Not modified # Deep check: content hash (if mtime changed) current_hash = self._compute_file_hash(filepath) return current_hash != doc.file_hash def update_document(self, filepath: str, title=None, tags=None) -> DocumentMeta: """Update document if changed, or add if new.""" existing_doc = next((d for d in self.documents.values() if d.filepath == filepath), None) if not existing_doc: return self.add_document(filepath, title, tags) if not self.needs_reindex(filepath, existing_doc.doc_id): return existing_doc # Unchanged # Re-index changed document self.remove_document(existing_doc.doc_id) return self.add_document(filepath, title, tags) def check_all_updates(self, auto_update: bool = False) -> dict: """Check all documents for updates.""" results = {'unchanged': [], 'changed': [], 'missing': [], 'updated': []} for doc_id, doc in list(self.documents.items()): if not os.path.exists(doc.filepath): results['missing'].append(doc) elif self.needs_reindex(doc.filepath, doc_id): results['changed'].append(doc) if auto_update: updated = self.update_document(doc.filepath, doc.title, doc.tags) results['updated'].append(updated) else: results['unchanged'].append(doc) return results ``` **Key Features**: - Hybrid detection: fast mtime check + deep hash verification - Database schema migration (adds file_mtime, file_hash columns) - Smart re-indexing (only when content actually changed) - Batch checking with `check_all_updates()` - New MCP tool: `check_updates` with auto-update option **Schema Changes**: ```sql ALTER TABLE documents ADD COLUMN file_mtime REAL; ALTER TABLE documents ADD COLUMN file_hash TEXT; ``` **MCP Tool**: - `check_updates(auto_update=false)` - Check all documents, optionally auto-update **Benefits**: - ✅ Avoid redundant re-indexing of unchanged files - ✅ Detect real content changes (not just mtime changes) - ✅ Batch operations for efficient workflow - ✅ Automatic schema migration for existing databases ### ✅ COMPLETED: P1: Bulk Operations **Status**: ✅ Implemented with comprehensive error handling **Completed**: December 2025 **Implementation Details**: ```python def add_documents_bulk(self, directory: str, pattern: str = "**/*.{pdf,txt}", tags: Optional[list[str]] = None, recursive: bool = True, skip_duplicates: bool = True, progress_callback: ProgressCallback = None) -> dict: """Add multiple documents from a directory matching a glob pattern.""" dir_path = Path(directory).resolve() files = list(dir_path.glob(pattern)) results = {'added': [], 'skipped': [], 'failed': []} for idx, file_path in enumerate(files, 1): if progress_callback: progress_callback(ProgressUpdate( operation="add_documents_bulk", current=idx, total=len(files), message=f"Processing file {idx}/{len(files)}", item=str(file_path.name) )) try: doc = self.add_document(str(file_path), title=file_path.stem, tags=tags) # Duplicate detection logic... results['added'].append({...}) except Exception as e: results['failed'].append({'filepath': str(file_path), 'error': str(e)}) return results def remove_documents_bulk(self, doc_ids: Optional[list[str]] = None, tags: Optional[list[str]] = None) -> dict: """Remove multiple documents by doc IDs or tags.""" results = {'removed': [], 'failed': []} ids_to_remove = set() if doc_ids: ids_to_remove.update(doc_ids) if tags: for doc_id, doc in self.documents.items(): if any(tag in doc.tags for tag in tags): ids_to_remove.add(doc_id) for doc_id in ids_to_remove: try: if self.remove_document(doc_id): results['removed'].append(doc_id) except Exception as e: results['failed'].append({'doc_id': doc_id, 'error': str(e)}) return results ``` **MCP Tools**: - `add_documents_bulk(directory, pattern, tags, recursive, skip_duplicates)` - Bulk add documents from directory - `remove_documents_bulk(doc_ids, tags)` - Bulk remove by IDs or tags **Key Features**: - Glob pattern matching with recursive/non-recursive search - Duplicate detection during bulk operations - Comprehensive error handling (failed files don't stop the operation) - Detailed results reporting (added, skipped, failed counts) - Tag-based removal for flexible document management - Transaction-based operations for data integrity **Benefits**: - ✅ Efficient batch importing of large document collections - ✅ Flexible removal by IDs or tags - ✅ Graceful error handling with detailed failure reporting - ✅ Progress reporting integration for long operations ### ✅ COMPLETED: P1: Progress Reporting **Status**: ✅ Implemented with callback-based architecture **Completed**: December 2025 **Implementation Details**: ```python @dataclass class ProgressUpdate: """Progress update for long-running operations.""" operation: str # Operation name (e.g., "add_document", "add_documents_bulk") current: int # Current progress (items processed) total: int # Total items to process message: str # Status message item: Optional[str] = None # Current item being processed (e.g., filename) percentage: float = 0.0 # Percentage complete (0-100) def __post_init__(self): """Calculate percentage after initialization.""" if self.total > 0: self.percentage = (self.current / self.total) * 100.0 # Type alias for progress callback function ProgressCallback = Optional[Callable[[ProgressUpdate], None]] # Usage in add_document() def add_document(self, filepath: str, title: Optional[str] = None, tags: Optional[list[str]] = None, progress_callback: ProgressCallback = None) -> DocumentMeta: if progress_callback: progress_callback(ProgressUpdate( operation="add_document", current=0, total=4, message="Starting document ingestion", item=filepath )) # ... extract text ... if progress_callback: progress_callback(ProgressUpdate( operation="add_document", current=1, total=4, message=f"Text extraction complete ({len(text)} characters)", item=filename )) # ... chunking ... if progress_callback: progress_callback(ProgressUpdate( operation="add_document", current=2, total=4, message=f"Created {len(chunks)} chunks", item=filename )) # ... database insertion ... if progress_callback: progress_callback(ProgressUpdate( operation="add_document", current=3, total=4, message="Stored in database", item=filename )) # ... complete ... if progress_callback: progress_callback(ProgressUpdate( operation="add_document", current=4, total=4, message="Document indexed successfully", item=filename )) ``` **Key Features**: - ProgressUpdate dataclass with automatic percentage calculation - Non-blocking callback architecture - Detailed progress reporting at each operation stage - Item-level tracking for bulk operations - Optional callbacks (backwards compatible) **Integration Points**: - `add_document()` - Reports 4 stages (start, extract, chunk, store, complete) - `add_documents_bulk()` - Reports per-file progress with item names - Extensible for future operations (e.g., `_build_embeddings()`) **Benefits**: - ✅ Real-time progress visibility for long operations - ✅ Better user experience during bulk imports - ✅ Flexible callback architecture for different UIs - ✅ Backwards compatible (callbacks are optional) ### P2: Pagination for Large Result Sets ```python def search(self, query: str, max_results: int = 5, offset: int = 0, tags=None): # ... search logic ... return { 'results': results[offset:offset + max_results], 'total': len(results), 'offset': offset, 'has_more': offset + max_results < len(results) } ``` ### P2: Export/Backup Functionality ```python Tool( name="export_kb", description="Export knowledge base to a portable format", inputSchema={ "properties": { "output_path": {"type": "string"}, "format": {"enum": ["json", "sqlite", "zip"]} } } ) ``` --- ## 4. Data Quality & Processing ### ✅ COMPLETED: P0: Track Page Numbers in PDFs **Status**: ✅ Implemented with PAGE BREAK marker estimation **Implementation Details**: - PDF text extraction uses `--- PAGE BREAK ---` markers between pages - Chunk creation estimates page number by counting PAGE BREAK markers - Page numbers stored in `DocumentChunk.page` field - Search results include page numbers when available ```python # Estimate page number for PDFs based on PAGE BREAK markers if file_type == 'pdf' and '--- PAGE BREAK ---' in text: chunk_start_pos = text.find(chunk_text[:100]) if chunk_start_pos >= 0: page_breaks_before = text[:chunk_start_pos].count('--- PAGE BREAK ---') page_num = page_breaks_before + 1 ``` ### ✅ COMPLETED: P1: Extract Document Metadata **Status**: ✅ Implemented in `_extract_pdf_text()` method **Implementation Details**: ```python # Extract PDF metadata metadata = {} if reader.metadata: metadata['author'] = reader.metadata.get('/Author') metadata['subject'] = reader.metadata.get('/Subject') metadata['creator'] = reader.metadata.get('/Creator') creation_date = reader.metadata.get('/CreationDate') # Parse PDF date format: D:YYYYMMDDHHmmSS ``` Metadata fields added to `DocumentMeta` dataclass: - `author: Optional[str]` - `subject: Optional[str]` - `creator: Optional[str]` - `creation_date: Optional[str]` ### ✅ COMPLETED: P1: Duplicate Detection **Status**: ✅ Implemented with content-based doc IDs **Completed**: December 2025 **Implementation Details**: ```python def _generate_doc_id(self, filepath: str, text_content: str = None) -> str: """Generate ID based on content hash for deduplication.""" if text_content: # Content-based ID for deduplication normalized = text_content.lower().strip() words = normalized.split()[:10000] # First 10k words content_sample = ' '.join(words) return hashlib.md5(content_sample.encode('utf-8')).hexdigest()[:12] else: # Filepath-based ID (legacy support) return hashlib.md5(filepath.encode()).hexdigest()[:12] def add_document(self, filepath: str, ...): # Extract text first text = self._extract_pdf_text(filepath) or self._extract_text_file(filepath) # Generate content-based doc_id for deduplication doc_id = self._generate_doc_id(filepath, text) # Check for duplicate content if doc_id in self.documents: existing_doc = self.documents[doc_id] self.logger.warning(f"Duplicate content detected: {filepath}") self.logger.warning(f" Matches existing document: {existing_doc.filepath}") self.logger.info(f"Skipping duplicate - returning existing document {doc_id}") return existing_doc # Non-destructive - returns existing doc # Continue with normal indexing... ``` **Key Features**: - **Content-based IDs**: Doc ID derived from text content (first 10k words) not filepath - **Automatic duplicate detection**: Same content at different paths detected and skipped - **Non-destructive behavior**: Returns existing document instead of creating duplicate - **Normalized comparison**: Lowercase normalization prevents case-sensitivity duplicates - **Efficient hashing**: Uses first 10k words to handle very large documents - **Backward compatible**: Filepath-based IDs still supported when text_content is None **Benefits**: - ✅ Prevents duplicate indexing of same content - ✅ Saves storage space and reduces index bloat - ✅ Improves search quality by avoiding duplicate results - ✅ Clear logging when duplicates are detected ### ✅ COMPLETED: P2: OCR Support for Scanned PDFs **Status**: ✅ Implemented with automatic fallback **Completed**: December 2025 **Implementation Details**: ```python import pytesseract from pdf2image import convert_from_path from PIL import Image class KnowledgeBase: def __init__(self, data_dir): self.use_ocr = OCR_SUPPORT and os.getenv('USE_OCR', '1') == '1' if self.use_ocr: try: pytesseract.get_tesseract_version() # Verify Tesseract installed self.logger.info("OCR enabled (Tesseract found)") except Exception as e: self.logger.warning(f"OCR libraries installed but Tesseract not found") self.use_ocr = False def _extract_pdf_with_ocr(self, filepath: str) -> tuple[str, int]: """Extract text from scanned PDF using OCR.""" images = convert_from_path(filepath) # Convert PDF to images pages = [] for i, image in enumerate(images): text = pytesseract.image_to_string(image) # OCR each page pages.append(text) full_text = "\n\n--- PAGE BREAK ---\n\n".join(pages) return full_text, len(images) def _extract_pdf_text(self, filepath: str) -> tuple[str, int, dict]: """Extract text from PDF with automatic OCR fallback.""" reader = PdfReader(filepath) # Try extracting text normally pages = [] total_text_length = 0 for page in reader.pages: text = page.extract_text() or "" pages.append(text) total_text_length += len(text.strip()) # Detect scanned PDF (< 100 chars extracted from multi-page doc) is_scanned = total_text_length < 100 and len(reader.pages) > 0 if is_scanned and self.use_ocr: self.logger.info(f"PDF appears to be scanned, falling back to OCR") try: ocr_text, page_count = self._extract_pdf_with_ocr(filepath) full_text = ocr_text # Use OCR text except Exception as e: self.logger.warning(f"OCR fallback failed: {e}") full_text = "\n\n--- PAGE BREAK ---\n\n".join(pages) # Use extracted anyway else: full_text = "\n\n--- PAGE BREAK ---\n\n".join(pages) return full_text, len(reader.pages), metadata ``` **Key Features**: - Automatic detection of scanned PDFs (< 100 characters extracted) - Seamless fallback to OCR when needed - Per-page OCR processing with error handling - Graceful degradation if OCR fails - Requires Tesseract-OCR system installation - Optional (disabled via USE_OCR=0) **Environment Variables**: - `USE_OCR=1` - Enable OCR support (default: enabled if libraries installed) **System Requirements**: - Python libraries: `pytesseract`, `pdf2image`, `Pillow` - System binary: Tesseract-OCR (https://github.com/UB-Mannheim/tesseract/wiki) - Windows: Install from UB Mannheim installer - Linux: `apt-get install tesseract-ocr` - macOS: `brew install tesseract` **Benefits**: - ✅ Automatically handles scanned PDFs - ✅ No user intervention required (automatic fallback) - ✅ Preserves PDF metadata even with OCR - ✅ Detailed logging for debugging - ✅ Works with any PDF (text-based or image-based) **Performance**: - Text-based PDFs: Instant extraction - Scanned PDFs: ~1-2 seconds per page (OCR processing time) **Installation**: ```bash # Python libraries pip install pytesseract pdf2image Pillow # System binary (Windows) # Download and install from: https://github.com/UB-Mannheim/tesseract/wiki ``` --- ## 5. Error Handling & Robustness ### ✅ COMPLETED: P0: Better Error Handling in Async Functions **Status**: ✅ Implemented custom exception hierarchy **Implementation Details**: ```python class KnowledgeBaseError(Exception): """Base exception for knowledge base errors.""" pass class DocumentNotFoundError(KnowledgeBaseError): """Raised when a document is not found.""" pass class ChunkNotFoundError(KnowledgeBaseError): """Raised when a chunk is not found.""" pass class UnsupportedFileTypeError(KnowledgeBaseError): """Raised when file type is not supported.""" pass class IndexCorruptedError(KnowledgeBaseError): """Raised when the index is corrupted.""" pass ``` All custom exceptions are tested in `test_custom_exceptions()` test case. ### ✅ COMPLETED: P1: Add Logging **Status**: ✅ Implemented with file and console logging **Implementation Details**: ```python # Setup in __init__ log_file = self.data_dir / "server.log" logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', handlers=[ logging.FileHandler(log_file), logging.StreamHandler(sys.stderr) ] ) self.logger = logging.getLogger(__name__) ``` Logs include: - Document additions/removals - Search queries and result counts - BM25 index building - Error conditions ### P1: Index Validation & Repair ```python def validate_index(self) -> list[str]: """Validate index integrity, return list of issues.""" issues = [] # Check for orphaned chunks for chunk_file in self.chunks_dir.glob("*.json"): doc_id = chunk_file.stem if doc_id not in self.documents: issues.append(f"Orphaned chunks for doc_id: {doc_id}") # Check for missing chunk files for doc_id in self.documents: chunk_file = self.chunks_dir / f"{doc_id}.json" if not chunk_file.exists(): issues.append(f"Missing chunks for doc_id: {doc_id}") return issues def repair_index(self): """Attempt to repair index issues.""" # Remove orphaned chunks, re-index missing documents ``` --- ## 6. Observability & Monitoring ### P1: Add Metrics Collection ```python from dataclasses import dataclass from datetime import datetime @dataclass class SearchMetrics: query: str results_count: int search_time_ms: float timestamp: datetime class KnowledgeBase: def __init__(self, data_dir): self.metrics = [] def search(self, query: str, ...): start_time = time.time() results = # ... search ... elapsed_ms = (time.time() - start_time) * 1000 self.metrics.append(SearchMetrics( query=query, results_count=len(results), search_time_ms=elapsed_ms, timestamp=datetime.now() )) return results ``` ### P2: Add Search Analytics Tool ```python Tool( name="search_analytics", description="Get analytics about search queries", inputSchema={ "properties": { "days": {"type": "integer", "default": 7} } } ) # Returns: # - Most common queries # - Queries with no results # - Average search time # - Most searched tags ``` --- ## 7. Developer Experience ### P1: Bulk Operations ```python Tool( name="add_documents_bulk", description="Add multiple documents at once", inputSchema={ "properties": { "directory": {"type": "string"}, "pattern": {"type": "string", "default": "**/*.pdf"}, "tags": {"type": "array"} } } ) ``` ### P1: Progress Reporting for Long Operations ```python async def add_document(self, filepath: str, progress_callback=None): if progress_callback: await progress_callback("Extracting text...", 0.2) text = self._extract_pdf_text(filepath) if progress_callback: await progress_callback("Creating chunks...", 0.5) chunks = self._chunk_text(text) # ... etc ``` ### P2: Configuration File Support ```yaml # config.yml knowledge_base: chunk_size: 1500 chunk_overlap: 200 search: algorithm: bm25 default_max_results: 10 semantic_search: enabled: true model: all-MiniLM-L6-v2 ``` --- ## 8. Testing Improvements ### P1: Add Performance Benchmarks ```python # tests/benchmark.py def test_search_performance(): """Search should complete in <100ms for typical queries.""" kb = setup_kb_with_1000_docs() start = time.time() results = kb.search("VIC-II register") elapsed = time.time() - start assert elapsed < 0.1, f"Search took {elapsed}s, expected <0.1s" ``` ### P1: Add Integration Tests for MCP Protocol ```python async def test_mcp_tool_call(): """Test actual MCP tool invocation.""" result = await call_tool("search_docs", {"query": "SID"}) assert len(result) > 0 assert "SID" in result[0].text ``` --- ## 9. Security Considerations ### ✅ COMPLETED: P0: Path Traversal Protection **Status**: ✅ Implemented with directory whitelisting **Completed**: December 2025 **Implementation Details**: ```python class SecurityError(KnowledgeBaseError): """Raised when a security violation is detected.""" pass class KnowledgeBase: def __init__(self, data_dir): # Parse allowed directories from environment allowed_dirs_env = os.getenv('ALLOWED_DOCS_DIRS', '') if allowed_dirs_env: self.allowed_dirs = [Path(d.strip()).resolve() for d in allowed_dirs_env.split(',') if d.strip()] else: self.allowed_dirs = None # No restrictions (backward compatible) def add_document(self, filepath: str, ...): # Resolve to absolute path to prevent path traversal resolved_path = Path(filepath).resolve() # Validate path is within allowed directories if self.allowed_dirs: is_allowed = any( resolved_path.is_relative_to(allowed_dir) for allowed_dir in self.allowed_dirs ) if not is_allowed: raise SecurityError( f"Path outside allowed directories. File must be within: {self.allowed_dirs}" ) ``` **Key Features**: - New `SecurityError` exception class - `ALLOWED_DOCS_DIRS` environment variable for directory whitelisting - Path resolution with `Path.resolve()` to normalize paths - Validation that resolved paths are within allowed directories - Blocks path traversal attempts (e.g., `../../../etc/passwd`) - Backward compatible (no restrictions if not configured) **Configuration**: ```json "env": { "ALLOWED_DOCS_DIRS": "C:\\docs\\allowed,C:\\other\\allowed" } ``` **Testing**: - Added comprehensive security test with 3 scenarios - Tests allowed directory access (passes) - Tests restricted directory access (blocks) - Tests path traversal attempts (blocks) - All 20 tests pass including new security test ### P1: Resource Limits ```python # Prevent abuse MAX_CHUNK_SIZE = 10_000 # words MAX_FILE_SIZE = 100 * 1024 * 1024 # 100MB MAX_SEARCH_RESULTS = 100 def add_document(self, filepath: str, ...): file_size = os.path.getsize(filepath) if file_size > MAX_FILE_SIZE: raise ValueError(f"File too large: {file_size} bytes") ``` --- ## Recommended Implementation Order ### Phase 1: Foundation (1-2 weeks) 1. Add logging (P1) 2. Better error handling (P0) 3. Track PDF page numbers (P0) 4. SQLite database migration (P0) ### Phase 2: Search Quality (1-2 weeks) 5. Implement BM25 search (P0) 6. Add phrase search (P1) 7. Query preprocessing (P1) 8. Highlight search terms (P1) ### Phase 3: Advanced Features (2-3 weeks) 9. Semantic search with embeddings (P0) 10. SQLite FTS5 integration (P1) 11. Caching layer (P1) 12. Update detection (P1) ### Phase 4: Polish (1 week) 13. Bulk operations (P1) 14. Progress reporting (P1) 15. Search analytics (P2) 16. Configuration file (P2) --- ## Quick Wins (Can Implement in <1 Day Each) 1. **Add logging** - Copy-paste logging setup 2. **Highlight search terms** - Simple regex replacement 3. **Better error messages** - Add custom exception classes 4. **Document metadata extraction** - pypdf already provides this 5. **Configuration from environment variables** - Add more env vars 6. **Duplicate detection** - Change ID generation to content-based hash 7. **Index validation tool** - Add CLI command for health check --- ## Breaking Changes to Consider If you're willing to accept breaking changes for better long-term architecture: 1. **Change storage from JSON to SQLite** - Requires data migration 2. **Redesign doc_id generation** - Old IDs won't work 3. **Change search result format** - Add more metadata fields 4. **Split into multiple files** - Better organization (server.py, search.py, storage.py, models.py) --- ## Resources & Libraries **Search & NLP**: - `rank-bm25` - BM25 algorithm - `sentence-transformers` - Semantic search - `spacy` or `nltk` - Text processing - `rapidfuzz` - Fuzzy matching **Storage & Performance**: - `sqlite3` (built-in) - Database - `faiss-cpu` - Vector similarity search - `chromadb` - Vector database alternative **PDF Processing**: - `pytesseract` - OCR - `pdfplumber` - Better PDF parsing **Monitoring**: - `prometheus-client` - Metrics export - `structlog` - Structured logging

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MichaelTroelsen/tdz-c64-knowledge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

IMPROVEMENTS.md•41.3 KiB