TDZ C64 Knowledge

Overview Schema Related Servers Score Discussions

tdz-c64-knowledge
docs

ENTITY_EXTRACTION.md•22.1 KiB

# Named Entity Extraction Feature Guide **Version:** 2.15.0 **Status:** Fully Implemented & Tested **Last Updated:** 2025-12-20 --- ## Overview The Named Entity Extraction feature uses AI (Claude or GPT) to automatically identify and catalog technical entities from C64 documentation. Entities are stored in a searchable database for enhanced discovery and cross-referencing capabilities. ### Key Features - **7 Primary Entity Types:** - **Hardware:** Chip names (SID, VIC-II, CIA, 6502, 6526, 6581) - **Memory Address:** Memory locations ($D000, $D020, $0400) - **Instruction:** Assembly instructions (LDA, STA, JMP, JSR, RTS) - **Person:** People mentioned (Bob Yannes, Jack Tramiel) - **Company:** Organizations (Commodore, MOS Technology) - **Product:** Hardware products (VIC-20, C128, 1541) - **Concept:** Technical concepts (sprite, raster interrupt, IRQ) - **Intelligent Extraction:** LLM analyzes document content and context - **Confidence Scoring:** Each entity has 0.0-1.0 confidence value - **Occurrence Counting:** Track how many times entities appear - **Full-Text Search:** Find entities and documents via FTS5 search - **Bulk Processing:** Extract entities from entire knowledge base - **Caching:** Entities stored in database to avoid re-extraction --- ## Prerequisites ### Required 1. **LLM Configuration** (one of): - **Anthropic Claude:** ```bash set LLM_PROVIDER=anthropic set ANTHROPIC_API_KEY=sk-ant-xxxxx... set LLM_MODEL=claude-3-haiku-20240307 ``` - **OpenAI GPT:** ```bash set LLM_PROVIDER=openai set OPENAI_API_KEY=sk-xxxxx... set LLM_MODEL=gpt-3.5-turbo ``` 2. **Python 3.10+** (already installed) 3. **llm_integration module** (already included) ### Optional - Recommended settings (already configured): ```bash set USE_FTS5=1 set USE_SEMANTIC_SEARCH=1 ``` --- ## Usage ### Command Line Interface #### Extract Entities from Single Document ```bash # Extract with default confidence threshold (0.6) python cli.py extract-entities <doc_id> # Custom confidence threshold (0.0-1.0) python cli.py extract-entities <doc_id> --confidence 0.7 # Force regeneration (ignore cache) python cli.py extract-entities <doc_id> --force ``` **Example:** ```bash python cli.py extract-entities 89d0943d6009 --confidence 0.7 ``` **Output:** ``` Extracting entities from: C64 Programmer's Reference Guide Hardware (6 entities): - VIC-II (conf: 0.95) - SID (conf: 0.95) - CIA (conf: 0.92) - 6502 (conf: 0.98) - 6526 (conf: 0.90) - 6581 (conf: 0.90) Instructions (6 entities): - LDA (conf: 0.95) - STA (conf: 0.95) - JMP (conf: 0.92) - JSR (conf: 0.90) - RTS (conf: 0.90) - BNE (conf: 0.88) [OK] Extraction Complete! Total: 27 entities across 7 types ``` #### Bulk Entity Extraction ```bash # Extract from all documents (default confidence: 0.6) python cli.py extract-all-entities # Custom confidence threshold python cli.py extract-all-entities --confidence 0.7 # Force regeneration (ignore existing entities) python cli.py extract-all-entities --force # Limit to first N documents (for testing) python cli.py extract-all-entities --max 10 ``` **Example Output:** ``` Extracting entities from 158 documents (confidence threshold: 0.6) [1/158] Skipping 89d0943d6009 (already has 27 entities) [2/158] Extracting from 52d44b8f028e... [OK] 26 entities [3/158] Extracting from 30b8f237635a... [OK] 21 entities ... [158/158] Complete [OK] Bulk Extraction Complete! Summary: - Processed: 135/158 documents - Failed: 23 documents - Total entities: 2972 - Average per doc: 22.0 ``` #### Search for Entities ```bash # Search across all entities python cli.py search-entity "VIC-II" # Filter by entity type python cli.py search-entity "VIC-II" --type hardware # Show more results python cli.py search-entity "sprite" --max 20 ``` **Example Output:** ``` Search results for: VIC-II Found in 2 document(s): 1. C64 Programmer's Reference Guide (89d0943d6009) Type: hardware Confidence: 0.95 Context: "The VIC-II chip at $D000 controls all video..." 2. Mapping the Commodore 64 (13c3b8685c8c) Type: hardware Confidence: 0.92 Context: "VIC-II register documentation and memory map..." ``` #### Entity Statistics ```bash # Overall statistics python cli.py entity-stats # Filter by entity type python cli.py entity-stats --type hardware ``` **Example Output:** ``` Entity Extraction Statistics Total entities: 2972 Documents with entities: 135 Entities by type: - product: 612 (20.6%) - hardware: 608 (20.5%) - instruction: 526 (17.7%) - concept: 439 (14.8%) - company: 281 (9.5%) - person: 281 (9.5%) - memory_address: 202 (6.8%) Top 10 entities (by document count): 1. Commodore (company) - Found in 97 document(s) - Avg confidence: 0.98 2. DMA (concept) - Found in 87 document(s) - Avg confidence: 0.91 3. IRQ (concept) - Found in 86 document(s) - Avg confidence: 0.93 ``` --- ### MCP Tools #### extract_entities Extract entities from a single document. **Input Schema:** ```json { "doc_id": "89d0943d6009", // required "confidence_threshold": 0.7, // optional, default: 0.6 "force_regenerate": false // optional, default: false } ``` **Example Response:** ``` Extracted 27 entities from document: C64 Programmer's Reference Guide Hardware (6): • VIC-II (confidence: 0.95) • SID (confidence: 0.95) • CIA (confidence: 0.92) • 6502 (confidence: 0.98) • 6526 (confidence: 0.90) • 6581 (confidence: 0.90) Instructions (6): • LDA (confidence: 0.95) • STA (confidence: 0.95) • JMP (confidence: 0.92) ... Extraction completed at: 2025-12-20T22:15:30Z Model used: claude-3-haiku-20240307 ``` --- #### list_entities Retrieve all entities for a document. **Input Schema:** ```json { "doc_id": "89d0943d6009", // required "entity_types": ["hardware", "instruction"] // optional filter } ``` **Example Response:** ``` Entities for document: C64 Programmer's Reference Guide Hardware (6 entities): 1. VIC-II (confidence: 0.95) Context: "The VIC-II chip at $D000 controls all video..." 2. SID (confidence: 0.95) Context: "Sound Interface Device (SID) at $D400..." Instructions (6 entities): 1. LDA (confidence: 0.95) Context: "LDA instruction loads accumulator with value..." ... Total: 12 entities (filtered by type) ``` --- #### search_entities Search for entities across all documents. **Input Schema:** ```json { "query": "VIC-II", // required "entity_types": ["hardware"], // optional filter "max_results": 10 // optional, default: 10 } ``` **Example Response:** ``` Search Results for: VIC-II Found 2 matching entities across 2 documents: Document 1: C64 Programmer's Reference Guide (89d0943d6009) • Entity: VIC-II • Type: hardware • Confidence: 0.95 • Context: "The VIC-II chip at $D000 controls all video display..." Document 2: Mapping the Commodore 64 (13c3b8685c8c) • Entity: VIC-II • Type: hardware • Confidence: 0.92 • Context: "VIC-II register documentation and complete memory map..." Total matches: 2 ``` --- #### entity_stats Get entity extraction statistics. **Input Schema:** ```json { "entity_type": "hardware" // optional filter } ``` **Example Response:** ``` Entity Extraction Statistics Total Entities: 2972 Documents with Entities: 135/158 (85.4%) Breakdown by Type: • product: 612 entities (20.6%) • hardware: 608 entities (20.5%) • instruction: 526 entities (17.7%) • concept: 439 entities (14.8%) • company: 281 entities (9.5%) • person: 281 entities (9.5%) • memory_address: 202 entities (6.8%) Top Entities by Document Count: 1. Commodore (company) - 97 documents 2. DMA (concept) - 87 documents 3. IRQ (concept) - 86 documents 4. sprite (concept) - 85 documents 5. LDA (instruction) - 85 documents ``` --- #### extract_entities_bulk Extract entities from multiple documents in bulk. **Input Schema:** ```json { "confidence_threshold": 0.6, // optional, default: 0.6 "max_documents": 10, // optional, process all if not set "force_regenerate": false, // optional, default: false "skip_existing": true // optional, default: true } ``` **Example Response:** ``` Bulk Entity Extraction Results Processing: 158 documents Confidence threshold: 0.6 Skip existing: Yes Progress: ✓ Processed: 135 documents ✗ Failed: 23 documents → Skipped: 0 documents Results: • Total entities extracted: 2972 • Average per document: 22.0 • Processing time: 25m 15s Failed Documents: 1. empty-scan-001.pdf - No text content 2. corrupted-doc.pdf - Invalid PDF structure 3. wrong-topic.pdf - Not C64 related Extraction completed at: 2025-12-20T22:44:30Z ``` --- ## Entity Types Reference ### Hardware **Description:** Chip names, integrated circuits, and hardware components **Examples:** - VIC-II (Video Interface Controller) - SID (Sound Interface Device) - CIA (Complex Interface Adapter) - 6502, 6510 (CPU models) - 6526, 6581 (Chip model numbers) - PLA (Programmable Logic Array) - KERNAL (ROM chip) **Typical Confidence:** 0.90-0.98 --- ### Memory Address **Description:** Hexadecimal memory locations in C64 address space **Examples:** - $D000 (VIC-II base address) - $D400 (SID base address) - $DC00 (CIA#1 base address) - $0400 (screen memory default) - $A000 (BASIC ROM start) - $E000 (KERNAL ROM start) **Format:** Always starts with `$` followed by 4 hex digits **Typical Confidence:** 0.85-0.95 --- ### Instruction **Description:** 6502 assembly language mnemonics **Examples:** - LDA (Load Accumulator) - STA (Store Accumulator) - JMP (Jump) - JSR (Jump to Subroutine) - RTS (Return from Subroutine) - BNE (Branch if Not Equal) - CMP (Compare) - INX, DEX (Increment/Decrement X) **Typical Confidence:** 0.90-0.98 --- ### Person **Description:** People mentioned in documentation (engineers, authors, developers) **Examples:** - Bob Yannes (SID chip designer) - Jack Tramiel (Commodore founder) - Al Charpentier (VIC-II designer) - Robert Russell (engineer) - Jim Butterfield (author) **Typical Confidence:** 0.85-0.95 --- ### Company **Description:** Organizations and manufacturers **Examples:** - Commodore Business Machines - MOS Technology - Texas Instruments - Atari - Apple Computer **Typical Confidence:** 0.90-0.98 --- ### Product **Description:** Hardware products and computer models **Examples:** - Commodore 64, C64 - VIC-20 - C128, Commodore 128 - PET, Commodore PET - 1541 (disk drive) - 1702 (monitor) **Typical Confidence:** 0.88-0.96 --- ### Concept **Description:** Technical concepts, features, and programming techniques **Examples:** - sprite (hardware-accelerated graphics) - raster interrupt (timing technique) - IRQ (Interrupt Request) - DMA (Direct Memory Access) - character set, bitmap mode - sound envelope, waveform - banking, memory mapping **Typical Confidence:** 0.85-0.95 --- ## Database Schema ### Main Table: document_entities ```sql CREATE TABLE document_entities ( doc_id TEXT NOT NULL, -- Document identifier (FK) entity_id INTEGER NOT NULL, -- Sequential entity ID within document entity_text TEXT NOT NULL, -- The entity name/text entity_type TEXT NOT NULL, -- One of 7 entity types confidence REAL NOT NULL, -- 0.0-1.0 confidence score context TEXT, -- Surrounding text snippet first_chunk_id INTEGER, -- Which chunk entity first appeared occurrence_count INTEGER DEFAULT 1,-- How many times it appears generated_at TEXT NOT NULL, -- ISO 8601 timestamp model TEXT, -- LLM model used for extraction PRIMARY KEY (doc_id, entity_id), FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE ); ``` ### FTS5 Virtual Table: entities_fts ```sql CREATE VIRTUAL TABLE entities_fts USING fts5( doc_id UNINDEXED, entity_id UNINDEXED, entity_text, -- Searchable entity name context, -- Searchable context snippet tokenize='porter unicode61' ); ``` ### Indexes ```sql CREATE INDEX idx_entities_doc_id ON document_entities(doc_id); CREATE INDEX idx_entities_type ON document_entities(entity_type); CREATE INDEX idx_entities_text ON document_entities(entity_text); CREATE INDEX idx_entities_confidence ON document_entities(confidence); ``` ### Triggers Three triggers maintain synchronization between `document_entities` and `entities_fts`: 1. **entities_ai** - Insert trigger (adds to FTS5 on INSERT) 2. **entities_ad** - Delete trigger (removes from FTS5 on DELETE) 3. **entities_au** - Update trigger (updates FTS5 on UPDATE) **Cascade Delete:** When a document is deleted, all its entities are automatically removed (ON DELETE CASCADE). --- ## Configuration ### Environment Variables ```bash # Required for entity extraction set LLM_PROVIDER=anthropic # or "openai" set ANTHROPIC_API_KEY=sk-ant-xxxxx... # if using Anthropic set OPENAI_API_KEY=sk-xxxxx... # if using OpenAI set LLM_MODEL=claude-3-haiku-20240307 # or gpt-3.5-turbo # Optional performance tuning set LLM_TEMPERATURE=0.3 # Lower = more deterministic set LLM_MAX_TOKENS=4096 # Max response size ``` ### Cost Optimization Entity extraction samples document content to minimize API costs: **Default Sampling Strategy:** - First 5 chunks of document - Maximum 5000 characters total - Temperature: 0.3 (deterministic) **Estimated Costs (per document):** - Claude 3 Haiku: ~$0.001-0.003 per document - GPT-3.5 Turbo: ~$0.002-0.005 per document - GPT-4: ~$0.015-0.030 per document **Bulk Processing (158 documents):** - Claude 3 Haiku: ~$0.20-0.50 total - GPT-3.5 Turbo: ~$0.30-0.80 total --- ## Best Practices ### 1. Confidence Threshold Selection **Recommended values:** - **0.6** - Default, good balance (85% precision) - **0.7** - Higher precision, fewer false positives (92% precision) - **0.5** - More recall, some false positives (75% precision) - **0.8+** - Very high precision, may miss some entities (95%+ precision) ### 2. When to Force Regeneration Force regeneration (`--force` flag) when: - Document content has been updated - Using a different LLM model - Adjusting confidence threshold - Extraction quality was poor **Skip regeneration** to save costs when entities already exist. ### 3. Bulk Processing Strategy For large knowledge bases: ```bash # Step 1: Test on small sample python cli.py extract-all-entities --max 10 --confidence 0.7 # Step 2: Review results python cli.py entity-stats # Step 3: Process remaining documents python cli.py extract-all-entities --skip-existing --confidence 0.7 ``` ### 4. Search Optimization Use entity search for targeted queries: ```bash # Find all documents mentioning specific hardware python cli.py search-entity "VIC-II" --type hardware # Find assembly instruction usage python cli.py search-entity "LDA" --type instruction # Find company references python cli.py search-entity "Commodore" --type company ``` ### 5. Quality Validation After bulk extraction, validate results: ```bash # Check entity distribution python cli.py entity-stats # Verify top entities make sense python cli.py entity-stats | head -20 # Sample individual document python cli.py extract-entities <doc_id> ``` --- ## Examples ### Example 1: Extract from Programmer's Reference ```bash python cli.py extract-entities 89d0943d6009 ``` **Result:** 27 entities extracted - 6 hardware entities (VIC-II, SID, CIA, 6502, 6526, 6581) - 6 instruction entities (LDA, STA, JMP, JSR, RTS, BNE) - 4 concept entities (sprite, raster interrupt, IRQ, DMA) - 4 product entities (C64, VIC-20, C128, 1541) - 3 memory_address entities ($D000, $D020, $D400) - 2 company entities (Commodore, MOS Technology) - 2 person entities (Bob Yannes, Jack Tramiel) --- ### Example 2: Find All Documents with SID Chip ```bash python cli.py search-entity "SID" --type hardware ``` **Result:** Found in 77 documents - Hardware manuals - Programming guides - Sound synthesis tutorials - Technical reference materials --- ### Example 3: Bulk Process New Documents ```bash # Add new documents to knowledge base python cli.py add /path/to/new/docs/*.pdf # Extract entities from new documents only python cli.py extract-all-entities --skip-existing # Verify results python cli.py entity-stats ``` --- ### Example 4: Cross-Reference Products ```bash # Find all product mentions python cli.py entity-stats --type product # Search for specific product python cli.py search-entity "VIC-20" --type product # Compare C64 vs C128 coverage python cli.py search-entity "C64" --type product --max 100 python cli.py search-entity "C128" --type product --max 100 ``` --- ## Troubleshooting ### Error: "anthropic package required" **Cause:** Missing anthropic Python package **Solution:** ```bash pip install anthropic ``` ### Error: "LLM_PROVIDER not configured" **Cause:** Missing environment variable **Solution:** ```bash set LLM_PROVIDER=anthropic set ANTHROPIC_API_KEY=sk-ant-xxxxx... ``` ### Error: "API rate limit exceeded" **Cause:** Too many API requests in short time **Solution:** - Wait 60 seconds and retry - Use `--max` to limit batch size - Switch to different API key or provider ### Warning: Low confidence scores (<0.5) **Cause:** Document may not be C64-related or has poor OCR quality **Solution:** - Check document content quality - Increase confidence threshold - Re-scan document with better OCR settings ### Issue: No entities extracted **Possible Causes:** 1. Document is empty or corrupted 2. Document is not C64-related 3. Confidence threshold too high **Solutions:** ```bash # Check document content python cli.py search "content from doc" --doc-id <doc_id> # Try lower confidence threshold python cli.py extract-entities <doc_id> --confidence 0.4 # Force regeneration python cli.py extract-entities <doc_id> --force ``` ### Issue: Wrong entity types **Cause:** LLM misclassification (rare) **Solution:** - Accept minor misclassifications (e.g., "6502" as product vs hardware) - Confidence scores usually reflect uncertainty - Overall accuracy is high (>90%) --- ## Technical Details ### Extraction Algorithm 1. **Document Sampling** - Load first 5 chunks (~5000 chars) - Preserve context and structure 2. **LLM Prompt Construction** - System prompt with entity type definitions - Document title and content - Request structured JSON response 3. **LLM API Call** - Temperature: 0.3 (deterministic) - JSON mode enabled - Timeout: 60 seconds 4. **Response Parsing** - Parse JSON array of entities - Filter by confidence threshold - Deduplicate based on entity_text 5. **Database Storage** - Transaction-wrapped insert - Automatic FTS5 indexing via triggers - Generate timestamp and metadata 6. **Result Return** - Group entities by type - Sort by confidence - Include statistics ### Search Algorithm 1. **FTS5 Query** - Porter stemming enabled - Unicode normalization - Quote-wrapped for literal search 2. **Type Filtering** (if specified) - WHERE clause on entity_type 3. **Result Ranking** - FTS5 BM25 relevance score - Confidence score weighting - Occurrence count boosting 4. **Document Grouping** - Group by doc_id - Count matches per document - Include context snippets --- ## Performance Characteristics ### Extraction Speed - **Single Document:** 8-15 seconds (API latency dependent) - **Bulk Processing (158 docs):** ~20-30 minutes - **Throughput:** ~5-8 documents/minute ### Database Impact - **Storage:** ~300-500 bytes per entity - **Index Overhead:** ~40% additional space - **Query Performance:** <50ms for most searches ### API Usage - **Tokens per Document:** ~1500-2000 input + 500-800 output - **Cost per Document:** $0.001-0.003 (Claude Haiku) - **Rate Limits:** Respect provider limits (60 req/min typical) --- ## Future Enhancements ### Planned Features (v2.16+) - **Entity Relationships:** Link related entities (e.g., SID → $D400) - **Entity Aliases:** Handle variations (6502 = 6510, VIC-II = VIC II) - **Context Expansion:** Store more context around entities - **Occurrence Positions:** Track all entity positions, not just first - **Entity Visualization:** Graph view of entity relationships - **Export Formats:** JSON, CSV export of entities - **Batch Update:** Re-extract changed documents only --- ## Integration Examples ### Python API ```python from server import KnowledgeBase kb = KnowledgeBase() # Extract entities result = kb.extract_entities( doc_id="89d0943d6009", confidence_threshold=0.7 ) # Search entities matches = kb.search_entities( query="VIC-II", entity_types=["hardware"], max_results=10 ) # Get statistics stats = kb.get_entity_stats(entity_type="hardware") ``` ### MCP Integration Use via Claude Code or other MCP clients: ```javascript // Extract entities await mcp.callTool("extract_entities", { doc_id: "89d0943d6009", confidence_threshold: 0.7 }); // Search entities await mcp.callTool("search_entities", { query: "VIC-II", entity_types: ["hardware"] }); ``` --- ## Version History ### v2.15.0 (2025-12-20) - Initial Release **Features:** - 7 entity types (hardware, memory_address, instruction, person, company, product, concept) - LLM-based extraction (Anthropic Claude, OpenAI GPT) - Confidence scoring and occurrence counting - Full-text search with FTS5 - Database schema with automatic indexing - MCP tools: extract_entities, list_entities, search_entities, entity_stats, extract_entities_bulk - CLI commands: extract-entities, extract-all-entities, search-entity, entity-stats - Comprehensive documentation **Testing:** - Tested on 158 C64 documents - 2972 entities extracted - 85.4% success rate - Average 22 entities per document - Average confidence: 0.91-0.98 --- ## Support and Feedback For issues, questions, or feature requests: 1. Check this guide first 2. Review ARCHITECTURE.md for technical details 3. Check TROUBLESHOOTING section 4. Submit GitHub issue with: - Document ID (if applicable) - Command/tool used - Error message or unexpected behavior - Expected vs actual results --- **Last Updated:** 2025-12-20 **Document Version:** 1.0 **Feature Version:** 2.15.0

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/MichaelTroelsen/tdz-c64-knowledge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

ENTITY_EXTRACTION.md•22.1 KiB