TDZ C64 Knowledge


POPPLER_SETUP.md•5.01 KiB

# Poppler + OCR Setup Documentation

## Summary

Poppler has been successfully configured for OCR functionality in your C64 Knowledge Base MCP server!

## What Was Done

1. **Located Poppler Installation**
   - Path: `C:\Users\mit\claude\c64server\tdz-c64-knowledge\poppler-25.12.0\Library\bin`
   - Version: 25.12.0
   - Contains all necessary binaries (pdftoppm.exe, etc.)

2. **Updated server.py**
   - Added support for `POPPLER_PATH` environment variable
   - Automatically uses Poppler when available for OCR operations
   - Falls back gracefully if Poppler path not specified

3. **Verified OCR Functionality**
   - Successfully extracted text from scanned PDFs
   - Tested with: `40_Best_Machine_Code_Routines_for_C64.pdf` (175 pages, 135,451 characters)
   - Tested with: `Commodore_64_BASIC_Quick_Reference_Guide.pdf` (2 pages, 22,579 characters)

## How to Use

### Option 1: Set Environment Variables (Recommended)

When running CLI commands or MCP server, set these environment variables:

```cmd
set USE_OCR=1
set POPPLER_PATH=C:\Users\mit\claude\c64server\tdz-c64-knowledge\poppler-25.12.0\Library\bin
```

### Option 2: Configure in MCP Settings

Add to your MCP configuration (e.g., `.claude/settings.local.json` or Claude Desktop config):

```json
{
  "mcpServers": {
    "tdz-c64-knowledge": {
      "command": "C:\\Users\\mit\\claude\\c64server\\tdz-c64-knowledge\\.venv\\Scripts\\python.exe",
      "args": ["C:\\Users\\mit\\claude\\c64server\\tdz-c64-knowledge\\server.py"],
      "env": {
        "TDZ_DATA_DIR": "C:\\Users\\mit\\.tdz-c64-knowledge",
        "USE_OCR": "1",
        "POPPLER_PATH": "C:\\Users\\mit\\claude\\c64server\\tdz-c64-knowledge\\poppler-25.12.0\\Library\\bin",
        "USE_FTS5": "1"
      }
    }
  }
}
```

### Option 3: Add to System PATH

Add Poppler to your system PATH environment variable:

```
C:\Users\mit\claude\c64server\tdz-c64-knowledge\poppler-25.12.0\Library\bin
```

This makes Poppler available system-wide without needing to set POPPLER_PATH.
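
To illustrate how the three options relate, here is a minimal sketch of one way a script could decide which Poppler location is in effect. It is not the code in `server.py`: the function name `resolve_poppler_path` is hypothetical, and the lookup order (an explicit `POPPLER_PATH` first, then whatever `pdftoppm` is found on PATH) is an assumption made for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_poppler_path(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the directory that holds pdftoppm, or None if none is found.

    Hypothetical helper: checks POPPLER_PATH first (Options 1 and 2),
    then falls back to a PATH lookup (Option 3).
    """
    env = os.environ if env is None else env

    # Options 1 and 2: an explicit POPPLER_PATH pointing at Poppler's bin folder.
    candidate = env.get("POPPLER_PATH")
    if candidate:
        folder = Path(candidate)
        # Accept both the Windows and the non-Windows binary name.
        for name in ("pdftoppm.exe", "pdftoppm"):
            if (folder / name).is_file():
                return str(folder)

    # Option 3: Poppler's bin folder was added to PATH.
    found = shutil.which("pdftoppm", path=env.get("PATH"))
    return str(Path(found).parent) if found else None
```

Calling `resolve_poppler_path()` with no arguments reads the current process environment. A `None` result suggests that neither `POPPLER_PATH` nor PATH currently points at a folder containing `pdftoppm`.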

## Testing

### Quick Test

```cmd
cd C:\Users\mit\claude\c64server\tdz-c64-knowledge
.venv\Scripts\activate
python test_poppler_ocr.py
```

### End-to-End Test

```cmd
python test_ocr_end_to_end.py
```

## Adding Scanned PDFs with OCR

### Using CLI

```cmd
set USE_OCR=1
set POPPLER_PATH=C:\Users\mit\claude\c64server\tdz-c64-knowledge\poppler-25.12.0\Library\bin
.venv\Scripts\python.exe cli.py add "path\to\scanned.pdf" --title "Document Title" --tags reference c64
```

### Using MCP

When configured properly, the MCP server will automatically use OCR for scanned PDFs when you add documents through the `add_document` tool.

## How It Works

1. **PDF Detection**: When adding a PDF, the system first tries to extract text normally
2. **OCR Fallback**: If the PDF has little or no extractable text (< 10 characters), it's treated as scanned
3. **Image Conversion**: Poppler's `pdftoppm` converts PDF pages to images
4. **Text Extraction**: Tesseract OCR extracts text from each image
5. **Indexing**: Extracted text is chunked and indexed in the knowledge base

## Log Output

When OCR is working correctly, you'll see log messages like:

```
INFO - OCR enabled (Tesseract found)
INFO - Using Poppler from: C:\Users\mit\claude\c64server\tdz-c64-knowledge\poppler-25.12.0\Library\bin
INFO - PDF appears to be scanned (0 chars extracted), falling back to OCR
INFO - Using OCR to extract text from scanned PDF: document.pdf
INFO - OCR extraction complete: 22579 characters from 2 pages
```

## Troubleshooting

### Issue: "Unable to get page count. Is poppler installed and in PATH?"

**Solution**: Set the `POPPLER_PATH` environment variable as described above.

### Issue: OCR not working

**Checklist**:

1. Verify `USE_OCR=1` is set
2. Verify Tesseract is installed: `.venv\Scripts\python.exe -c "import pytesseract; print(pytesseract.get_tesseract_version())"`
3. Verify Poppler path is correct: Check that `pdftoppm.exe` exists in the specified path
4. Run `python check_ocr.py` to diagnose issues

### Issue: OCR is slow

This is normal. OCR processing is CPU-intensive:

- Small documents (2-10 pages): 10-30 seconds
- Medium documents (50-100 pages): 2-5 minutes
- Large documents (200+ pages): 5-15 minutes

## Additional Resources

- Poppler documentation: https://poppler.freedesktop.org/
- Tesseract OCR: https://github.com/tesseract-ocr/tesseract
- pdf2image: https://github.com/Belval/pdf2image

## Changes Made to Code

### server.py

**Added** (lines 285-292):

```python
self.poppler_path = os.getenv('POPPLER_PATH', None)
if self.use_ocr:
    # ... existing code ...
    if self.poppler_path:
        self.logger.info(f"Using Poppler from: {self.poppler_path}")
```

**Modified** `_extract_pdf_with_ocr()` method (lines 850-853):

```python
if self.poppler_path:
    images = convert_from_path(filepath, poppler_path=self.poppler_path)
else:
    images = convert_from_path(filepath)
```

## Status

- ✅ **Poppler installed and configured**
- ✅ **server.py updated to support POPPLER_PATH**
- ✅ **OCR tested and working with scanned PDFs**
- ✅ **Ready for production use**

---

Last Updated: 2025-12-12
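
As a supplement to the "How It Works" steps, the following is a rough sketch of the detect-then-fall-back flow. It is not the actual `server.py` implementation. The names `looks_scanned` and `extract_text` and the three callables passed in are hypothetical, and stripping whitespace before counting characters is an assumption. Only the 10-character threshold comes from the description above.

```python
from typing import Callable, Iterable, TypeVar

Image = TypeVar("Image")

# Threshold taken from the "How It Works" description (< 10 characters).
MIN_TEXT_CHARS = 10


def looks_scanned(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Treat a PDF as scanned when normal extraction yields almost no text.

    Assumption: surrounding whitespace is ignored when counting characters.
    """
    return len(text.strip()) < min_chars


def extract_text(
    pdf_path: str,
    extract_plain: Callable[[str], str],
    render_pages: Callable[[str], Iterable[Image]],
    ocr_image: Callable[[Image], str],
    use_ocr: bool = True,
) -> str:
    """Try normal text extraction first and fall back to OCR if needed."""
    # Step 1: normal text extraction.
    text = extract_plain(pdf_path)

    # Step 2: keep that text unless OCR is enabled and the PDF looks scanned.
    if not use_ocr or not looks_scanned(text):
        return text

    # Steps 3 and 4: render each page to an image, then OCR each image.
    return "\n".join(ocr_image(page) for page in render_pages(pdf_path))
```

In a real setup, `render_pages` could wrap pdf2image's `convert_from_path` (optionally with `poppler_path=...`) and `ocr_image` could wrap `pytesseract.image_to_string`. Passing them in as callables keeps the sketch runnable without either library installed.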
