TDZ C64 Knowledge

tdz-c64-knowledge
archive
historical-docs

FEATURES.md•16.7 KiB

# Complete Feature Documentation **TDZ C64 Knowledge Base v2.12.0 - All Features & Capabilities** --- ## Table of Contents 1. [Search Features](#search-features) 2. [Document Management](#document-management) 3. [Data Extraction](#data-extraction) 4. [User Interfaces](#user-interfaces) 5. [Advanced Features](#advanced-features) 6. [Performance Features](#performance-features) 7. [Security Features](#security-features) 8. [Integration Features](#integration-features) --- ## Search Features ### 1. Full-Text Search (FTS5) **Status:** ✅ Enabled | **Performance:** 50ms (480x faster than BM25) Native SQLite FTS5 implementation with Porter stemming tokenizer. **Features:** - Word stemming (running → run, programming → program) - Phrase matching ("VIC-II chip") - Boolean operators (AND, OR, NOT) - Ranking by relevance **Example:** ```bash search "memory mapping" # Returns: Documents about memory organization ``` **Technical Details:** - Tokenizer: Porter stemming - Language: English - Storage: Virtual FTS5 table in SQLite - Sync: Automatic triggers keep index current --- ### 2. Semantic/Conceptual Search **Status:** ✅ Installed | **Performance:** 7-16ms per query Meaning-based search using embeddings and vector similarity. **Features:** - Concept matching ("movable objects" → "sprites") - Intent understanding - Related documents discovery - No keyword matching required **Example:** ```bash set USE_SEMANTIC_SEARCH=1 search "how do I make sounds?" # Returns: SID programming documentation ``` **Technical Details:** - Model: sentence-transformers (all-MiniLM-L6-v2) - Vector DB: FAISS with cosine similarity - Embeddings: Pre-computed for all chunks - Search Speed: ~7-16ms --- ### 3. BM25 Ranking Algorithm **Status:** ✅ Enabled | **Performance:** 24000ms (fallback) Industry-standard probabilistic ranking for keyword matching. **Features:** - Document length normalization - Term frequency scoring - Inverse document frequency weighting - Parameter tuning (k1=1.2, b=0.75) **When Used:** - Primary search if FTS5 returns no results - Fallback when FTS5 unavailable - Configurable via USE_BM25 environment variable --- ### 4. Hybrid Search **Status:** ✅ Enabled | **Performance:** 60-180ms Combines FTS5 keyword search with semantic search. **Features:** - Weighted scoring (default: 30% semantic, 70% keyword) - Intelligent result merging - Best of both worlds: exact + conceptual - Configurable semantic weight (0.0-1.0) **Example:** ```bash search "SID sound programming" --hybrid # Returns: Both exact keyword matches AND related concepts ``` **Algorithm:** 1. FTS5 search for exact matches 2. Semantic search for concepts 3. Normalize both scores (0-1) 4. Weighted combination 5. Sort by combined score --- ### 5. Fuzzy/Typo-Tolerant Search **Status:** ✅ Enabled | **Threshold:** 80% Handles misspellings using Levenshtein distance. **Features:** - Automatic typo detection - Configurable similarity threshold (0-100%) - Word-level matching - Smart fallback to exact matching **Examples:** - "registr" → finds "register" (89% similar) - "VIC-I" → finds "VIC-II" (83% similar) - "asembly" → finds "assembly" (88% similar) **Configuration:** ```bash set USE_FUZZY_SEARCH=1 set FUZZY_THRESHOLD=80 # 0-100, default 80% ``` --- ### 6. Phrase Search **Status:** ✅ Enabled Exact phrase matching with quote detection. **Features:** - Exact phrase matching (whole words) - Score boost (2x for BM25, 10x for simple) - Quote detection via regex - Combined with term search **Example:** ```bash search "VIC-II chip register" # Exact phrase gets higher score than individual words ``` --- ### 7. Query Preprocessing **Status:** ✅ Enabled NLTK-powered text normalization. **Features:** - Tokenization (word segmentation) - Stopword removal (the, a, an, etc.) - Porter stemming (running → run) - Technical term preservation (VIC-II, 6502) **Applied To:** - User queries - Indexed corpus during search - Both search methods **Example:** ```bash Query: "How to program the SID chip?" Processed: ["program", "sid", "chip"] # Stopwords removed ``` --- ### 8. Search Result Ranking **Status:** ✅ Enabled | **Performance:** 50-100x speedup for cached results Results automatically ranked by relevance. **Ranking Factors:** 1. Exact phrase matches (highest weight) 2. Term frequency in document 3. Inverse document frequency (IDF) 4. Query term coverage 5. Document length normalization **Search Caching:** - TTL: 5 minutes (configurable) - Size: 100 entries (configurable) - Hit rate: 50-100x speedup for repeated queries - Auto-invalidation on document changes --- ### 9. Search Term Highlighting **Status:** ✅ Enabled Matching terms highlighted in results. **Features:** - Markdown bold (`**term**`) - Case-insensitive matching - Whole-word boundary detection - Multiple term highlighting **Example Result:** ``` ...The **SID** chip controls **sound** synthesis on the **C64**... ``` --- ### 10. Tag-Based Filtering **Status:** ✅ Enabled Filter search results by document tags. **Features:** - Multiple tag support - AND/OR filtering - Tag autocomplete (in GUI) - Hierarchical tags (category/subtopic) **Example:** ```bash search "programming" --tags assembly --max 10 # Only returns assembly-tagged documents ``` **Available Tags:** - c64, c128 (platform) - reference, codebase64 (category) - assembly, basic (language) - sid, vic-ii, cia (hardware) - music, graphics (topic) --- ## Document Management ### 1. Multi-Format Support **Status:** ✅ Enabled Support for 5+ document formats. **Formats:** - ✅ **PDF** (.pdf) - Via pypdf with OCR fallback - ✅ **Text** (.txt) - With encoding detection - ✅ **Markdown** (.md) - Treated as text - ✅ **HTML** (.html, .htm) - Via BeautifulSoup4 - ✅ **Excel** (.xlsx, .xls) - Via openpyxl **Detection:** Automatic by file extension --- ### 2. Bulk Document Import **Status:** ✅ Enabled Batch import entire directories. **Features:** - Recursive folder scanning - Glob pattern matching (default: `**/*.{pdf,txt,md,html,xlsx}`) - Duplicate skipping - Progress reporting - Error handling (failed files don't stop process) **Example:** ```bash add-folder "C:\docs" --tags reference --recursive # Processes all supported formats recursively ``` --- ### 3. Duplicate Detection **Status:** ✅ Enabled Content-based deduplication prevents duplicate indexing. **Algorithm:** - MD5 hash of normalized content (first 10k words) - Case-insensitive comparison - Lowercase normalization - Returns existing document ID **Benefits:** - Saves storage (same file at different paths) - Improves search quality (no duplicate results) - Detects truly same content --- ### 4. Document Metadata Extraction **Status:** ✅ Enabled Automatic extraction of document metadata from PDFs. **Extracted Fields:** - Author - Subject - Creator - Creation date - Page count - File type - File size **Storage:** In SQLite documents table --- ### 5. PDF Page Number Tracking **Status:** ✅ Enabled Tracks PDF page numbers for accurate referencing. **Features:** - PAGE BREAK marker insertion during extraction - Page number calculation based on markers - Returned in search results - Accurate to actual page **Example Result:** ``` Page: 47 (of 504) Chunk: 142 ...content from page 47... ``` --- ### 6. OCR for Scanned PDFs **Status:** ✅ Enabled (Tesseract installed) Automatic fallback OCR for scanned documents. **Features:** - Automatic detection (< 100 chars = scanned) - Tesseract OCR processing - Per-page processing with error handling - Graceful fallback if OCR unavailable - Optional Poppler requirement (not installed) **Configuration:** ```bash set USE_OCR=1 set POPPLER_PATH=C:\poppler\Library\bin # Optional ``` --- ## Data Extraction ### 1. Table Extraction **Status:** ✅ Enabled | **Performance:** ~1-2 sec per table Automatic extraction of structured tables from PDFs. **Features:** - pdfplumber-based extraction - Markdown format conversion - Row/column metadata - Full-text indexing - Page tracking **Extraction Details:** - Tables detected per page - Converted to markdown format - Stored in document_tables table - FTS5 indexed for searching **Example:** ``` Current KB has 209+ extracted tables - Memory mapping tables - Register reference tables - Opcode tables - Hardware specifications ``` **Search:** ```bash search "memory address" --max 5 # Returns: Tables with memory addresses ``` --- ### 2. Code Block Detection **Status:** ✅ Enabled | **Performance:** Instant detection Automatic detection of code blocks in documents. **Supported Code Types:** - **BASIC** - Line-numbered programs (10 PRINT, 20 GOTO) - **Assembly** - 6502 mnemonics (LDA, STA, JMP, BRK) - **Hex** - Memory dumps with addresses ($D000: 00 01 02) **Detection Method:** - Regex pattern matching - Minimum 3 consecutive lines - Mnemonic dictionary for assembly **Detection Examples:** - BASIC: Lines starting with numbers + keywords - Assembly: 6502 instruction mnemonics - Hex: Address format ($XXXX:) with bytes **Storage:** - document_code_blocks table - FTS5 indexed - Block type, content, line count tracked **Current KB:** - 1,876+ detected code blocks - 50+ BASIC blocks - 208+ Assembly blocks - 0 Hex blocks --- ### 3. Facet Extraction **Status:** ✅ Enabled | **Performance:** Instant Extraction of technical metadata (facets). **Facet Types:** - **Hardware Components** (VIC-II, SID, CIA, 6502) - **Assembly Instructions** (LDA, STA, JMP, BRK, etc.) - **Register References** ($D000, $D020, $1C04, etc.) **Current KB:** - 1,876+ total facets - 6 hardware components - 55 instructions - 1,829 registers **Use Cases:** - Hardware-specific documentation discovery - Instruction reference - Memory address lookup --- ## User Interfaces ### 1. Web GUI (Streamlit) **Status:** ✅ Running | **URL:** http://localhost:8501 Beautiful web-based dashboard. **Pages:** - **Search** - Full search interface with filters - **Documents** - Browse and manage documents - **Tags** - Tag management interface - **Dashboard** - Statistics and analytics - **Settings** - Feature configuration - **Analytics** - Query metrics and trends **Features:** - Real-time search - Filter by tags - Sort results - Preview documents - Responsive design - Dark/light mode (Streamlit default) **Launch:** ```bash launch-gui-full-features.bat ``` --- ### 2. Command-Line Interface (CLI) **Status:** ✅ Enabled Powerful command-line tool. **Commands:** - `search` - Search knowledge base - `list` - List all documents - `add` - Add single document - `add-folder` - Bulk import from directory - `remove` - Delete document - `stats` - Show statistics **Usage:** ```bash launch-cli-full-features.bat search "query" --max 10 --tags tag1 launch-cli-full-features.bat add-folder "path" --tags reference --recursive launch-cli-full-features.bat list launch-cli-full-features.bat stats ``` --- ### 3. MCP Server (Model Context Protocol) **Status:** ✅ Ready | **Transport:** stdio Integration with Claude Code and Claude Desktop. **Exposed Tools:** - search_docs - semantic_search - hybrid_search - find_similar - search_tables - search_code - add_document - add_documents_bulk - remove_document - list_docs - get_document - kb_stats - health_check - check_updates **Configuration:** ```json { "command": "path/to/.venv/Scripts/python.exe", "args": ["path/to/server.py"], "env": { "USE_FTS5": "1", "USE_SEMANTIC_SEARCH": "1" } } ``` --- ## Advanced Features ### 1. Hybrid Search Combination **Status:** ✅ Enabled Smart combination of multiple search methods. **Algorithm:** 1. Execute FTS5 search (fast, exact) 2. Execute semantic search (conceptual) 3. Normalize scores (0-1 range) 4. Weighted combination (default 70% FTS5, 30% semantic) 5. Merge duplicate results 6. Sort by combined score **Configuration:** ```bash set USE_SEMANTIC_SEARCH=1 # Enables hybrid # semantic_weight defaults to 0.3 ``` --- ### 2. Smart Similarity Search **Status:** ✅ Enabled Find similar documents automatically. **Modes:** - **Semantic** - Uses FAISS embeddings (if available) - **TF-IDF** - Fallback using sklearn - **Both** - Automatic selection **Features:** - Document-level similarity - Chunk-level similarity (optional) - Tag filtering - Configurable result count **Example:** ```bash find_similar <doc_id> --max 5 # Returns: 5 most similar documents ``` --- ### 3. Health Monitoring **Status:** ✅ Enabled Comprehensive system diagnostics. **Monitored Systems:** - Database integrity (table checksums, orphan detection) - Feature availability (FTS5, semantic, BM25, embeddings) - Performance metrics (cache utilization, index status) - Disk space (warnings if < 1GB free) **Information Provided:** - System status (OK/WARNING/ERROR) - Detailed metrics - Issue list with suggestions --- ## Performance Features ### 1. Search Result Caching **Status:** ✅ Enabled TTL-based caching for query results. **Configuration:** ```bash set SEARCH_CACHE_SIZE=100 # Max cached queries set SEARCH_CACHE_TTL=300 # 5-minute expiry ``` **Benefits:** - 50-100x speedup for repeated queries - Automatic invalidation on document changes - Minimal memory overhead **Cache Stats:** - Entry count tracking - Hit/miss ratio - Automatic cleanup of expired entries --- ### 2. Index Building Optimization **Status:** ✅ Enabled Lazy-loaded, efficient index building. **Optimizations:** - On-demand BM25 index building (~50 seconds one-time) - FTS5 indexes (pre-built, instantly searchable) - Semantic embeddings (one-time generation ~1 min) - Caching of built indexes **Process:** - First search builds BM25 index - Subsequent searches use cache - Document changes invalidate caches --- ### 3. Database Query Optimization **Status:** ✅ Enabled SQLite optimizations for fast queries. **Optimizations:** - FTS5 virtual tables (native search) - Automatic trigger sync (keeps indexes current) - Query result caching - Prepared statement reuse - Transaction batching **Performance:** - Search queries: 50ms (FTS5) vs 24s (BM25) - Document addition: ~5-10 seconds per document - Duplicate detection: < 1 second --- ## Security Features ### 1. Path Traversal Protection **Status:** ✅ Enabled Prevents access to files outside allowed directories. **Features:** - Directory whitelisting via ALLOWED_DOCS_DIRS - Path normalization and resolution - Blocks path traversal attempts (../ sequences) - Clear error messages **Configuration:** ```bash set ALLOWED_DOCS_DIRS=C:\docs\allowed,C:\research ``` --- ### 2. Content-Based Deduplication **Status:** ✅ Enabled Prevents duplicate content indexing. **Security Benefits:** - Prevents accidental re-indexing of malicious files - Content hash comparison (MD5) - Normalized comparison (case-insensitive) --- ### 3. Database Integrity **Status:** ✅ Enabled SQLite ACID compliance ensures data safety. **Features:** - Atomic transactions - Consistency checks - Durability (writes to disk) - Isolation between transactions --- ## Integration Features ### 1. Claude Code Integration **Status:** ✅ Ready Full integration with Claude Code MCP system. **Features:** - MCP server on stdio - All tools exposed to Claude - Full context awareness - Async support **Usage:** > "Search the C64 knowledge base for assembly examples" --- ### 2. Environment Variable Configuration **Status:** ✅ Enabled Comprehensive configuration system. **Configurable Items:** - Search algorithms (FTS5, semantic, BM25) - Performance settings (cache size/TTL) - Security (allowed directories) - LLM integration (provider, API key, model) --- ### 3. Logging & Diagnostics **Status:** ✅ Enabled Comprehensive logging system. **Log Levels:** - INFO - Normal operations - WARNING - Potential issues - ERROR - Error conditions **Output:** - File: server.log - Console: stderr --- ## Feature Status Summary | Feature | Status | Performance | |---------|--------|-------------| | FTS5 Search | ✅ | 50ms | | Semantic Search | ✅ | 7-16ms | | BM25 Ranking | ✅ | 24s (cached) | | Hybrid Search | ✅ | 60-180ms | | Fuzzy Search | ✅ | Configurable | | Query Preprocessing | ✅ | 5-10ms | | Result Caching | ✅ | 50-100x speedup | | Table Extraction | ✅ | 1-2s per table | | Code Detection | ✅ | Instant | | Facet Extraction | ✅ | Instant | | OCR Support | ✅ | 1-2s per page | | Duplicate Detection | ✅ | <1s | | Streamlit GUI | ✅ | - | | CLI Tool | ✅ | - | | MCP Server | ✅ | - | --- ## Next Generation Features (Planned) Coming in future releases: 1. **RAG Question Answering** - Natural language Q&A with citations 2. **Document Summarization** - Auto-generate summaries 3. **Entity Extraction** - Extract names, dates, technical terms 4. **VICE Emulator Integration** - Link to running C64 emulator 5. **REST API** - HTTP interface for web apps 6. **Plugin System** - Extensible architecture --- **Last Updated:** 2025-12-17 **Version:** 2.12.0 **Status:** ✅ All Features Operational

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MichaelTroelsen/tdz-c64-knowledge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

FEATURES.md•16.7 KiB