OCR-MCP

OCR-MCP_MASTER_PLAN.md•19.9 KiB

# OCR-MCP: Revolutionary Document Understanding Server ## 🎯 Executive Summary **OCR-MCP** is a revolutionary **dual-repository OCR ecosystem** providing comprehensive document digitization and understanding capabilities: ### **🖥️ OCR-MCP Server (This Repo)** - **FastMCP 2.13+ Backend**: Specialized server with scanner control and OCR processing - **Hardware Integration**: Direct flatbed scanner control via WIA/TWAIN - **Multiple OCR Backends**: GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR, TrOCR - **Advanced Processing**: Plain text, formatted text, layout preservation, HTML rendering - **MCP Ecosystem**: Serve any MCP client with OCR capabilities ### **🌐 OCR-MCP Frontend (Future Repo)** - **Web Application**: Modern React-based interface for scanner control and OCR workflow - **Windows 64-bit App**: Native desktop application with full scanner integration - **Batch Processing**: Multi-document scanning and processing pipelines - **Results Management**: Organized storage and export of OCR results ### **🔧 Key Innovations** - **Scanner Control**: Direct hardware integration with flatbed scanners - **One-Touch Digitization**: Scan → OCR → Export in single workflow - **Multi-Format Output**: Text, HTML, Markdown, JSON, XML, PDF - **Batch Operations**: Process multiple documents with progress tracking - **Open-Source Democratization**: Bring professional OCR to everyone ## 🏗️ Architecture Overview ### Core Design Principles - **Modular Backend System**: Pluggable OCR engines with unified interface - **Specialized Tools**: Dedicated tools for different OCR use cases - **Resource Management**: Smart model loading and GPU memory management - **Error Resilience**: Graceful fallbacks and comprehensive error handling - **Performance Optimization**: Batch processing, caching, and async operations ### Tool Architecture #### **Scanner Control Tools** 1. **`list_scanners`** - Discover and enumerate available scanners 2. **`scanner_properties`** - Get scanner capabilities and settings 3. **`configure_scan`** - Configure scan parameters (DPI, color, size) 4. **`scan_document`** - Perform single document scan 5. **`scan_batch`** - Batch scanning with automatic processing 6. **`preview_scan`** - Preview scan for positioning and cropping #### **Primary OCR Tools (Portmanteau Pattern)** 7. **`process_document`** - Main OCR processing with multiple backends 8. **`process_batch`** - Batch document processing 9. **`analyze_layout`** - Document layout and structure analysis 10. **`extract_regions`** - Fine-grained region-based OCR #### **Specialized OCR Tools** 11. **`got_ocr_processor`** - GOT-OCR2.0 specific operations 12. **`tesseract_processor`** - Tesseract OCR operations 13. **`paddle_processor`** - PaddleOCR operations 14. **`easyocr_processor`** - EasyOCR operations #### **Utility Tools** 15. **`ocr_health_check`** - Backend availability and diagnostics 16. **`model_manager`** - Model loading and management 17. **`format_converter`** - OCR result format conversion 18. **`scan_to_ocr_pipeline`** - Complete scan-to-OCR workflow ### Backend Architecture ``` OCR-MCP Server ├── Core Engine │ ├── Backend Manager (GOT-OCR, Tesseract, Paddle, EasyOCR) │ ├── Model Manager (GPU memory, loading/unloading) │ └── Resource Pool (connection pooling, rate limiting) │ ├── Processing Pipeline │ ├── Preprocessing (image enhancement, rotation correction) │ ├── OCR Processing (backend routing, fallback logic) │ └── Postprocessing (format conversion, validation) │ ├── Specialized Tools │ ├── Document Analysis (layout, structure, entities) │ ├── Batch Processing (queue management, progress tracking) │ └── Format Conversion (HTML, Markdown, JSON, XML) │ └── Integration Layer ├── MCP Protocol Handler (FastMCP 2.13+) ├── Result Formatting (structured data, error handling) └── Client Adaptation (different MCP clients) ``` ## 📷 Scanner Control Architecture ### **Windows Image Acquisition (WIA) Integration** **Primary Scanner API for Windows:** - **Native Windows Support**: Built-in to Windows XP and later - **Device Discovery**: Automatic detection of connected scanners - **Parameter Control**: DPI, color depth, paper size, brightness/contrast - **Preview Scanning**: Live preview for document positioning - **Batch Scanning**: ADF (Automatic Document Feeder) support **Implementation Approach:** ```python # WIA Scanner Manager class WIAScannerManager: def __init__(self): self._wia_manager = None self._devices = {} def discover_scanners(self) -> List[ScannerInfo]: """Discover all connected scanners via WIA""" def get_scanner_properties(self, device_id: str) -> ScannerProperties: """Get scanner capabilities and current settings""" def configure_scan(self, device_id: str, settings: ScanSettings) -> bool: """Configure scanner parameters""" def scan_document(self, device_id: str, settings: ScanSettings) -> PIL.Image: """Perform document scan and return image""" async def scan_batch(self, device_id: str, count: int) -> List[PIL.Image]: """Batch scan multiple documents""" ``` ### **TWAIN Support (Cross-Platform Compatibility)** **Fallback and Advanced Features:** - **Industry Standard**: Supported by 99% of scanner manufacturers - **Cross-Platform**: Windows, macOS, Linux compatibility - **Advanced Features**: Custom driver features, specialized scanners - **Vendor Extensions**: Manufacturer-specific capabilities ### **Scanner Hardware Support Matrix** | Scanner Type | WIA Support | TWAIN Support | ADF Support | Duplex Support | |-------------|-------------|----------------|-------------|----------------| | **Flatbed** | ✅ Native | ✅ Full | ❌ | ❌ | | **ADF Scanner** | ✅ Native | ✅ Full | ✅ | ❌ | | **Duplex Scanner** | ✅ Native | ✅ Full | ✅ | ✅ | | **Network Scanner** | ⚠️ Limited | ✅ Full | ✅ | ✅ | | **Camera/Photo** | ✅ Native | ⚠️ Limited | ❌ | ❌ | ### **Scan Parameter Configuration** **Resolution Settings:** - **Draft Quality**: 150-200 DPI (fast scanning) - **Standard Quality**: 300 DPI (document scanning) - **High Quality**: 600+ DPI (archival, photos) **Color Modes:** - **Black & White**: 1-bit (text documents) - **Grayscale**: 8-bit (mixed content) - **Color**: 24-bit RGB (photos, color documents) **Paper Size Presets:** - **Letter**: 8.5" × 11" (216mm × 279mm) - **A4**: 210mm × 297mm - **Legal**: 8.5" × 14" (216mm × 356mm) - **Custom**: User-defined dimensions ## 🎨 Key Features ### **Multi-Backend OCR Processing** - **GOT-OCR2.0**: Revolutionary vision-language model for formatted OCR - **Tesseract**: Proven open-source OCR with extensive language support - **PaddleOCR**: Industrial-grade OCR with high accuracy - **EasyOCR**: User-friendly OCR with 80+ language support - **TrOCR**: Microsoft transformer-based OCR ### **Advanced Processing Modes** - **Plain Text OCR**: Basic text extraction - **Formatted OCR**: Preserve layout, tables, columns - **Fine-grained OCR**: Extract specific regions by coordinates - **Multi-crop OCR**: Process large images by intelligent splitting - **Batch Processing**: Process multiple documents efficiently ### **Output Formats** - **Text**: Plain text extraction - **HTML**: Formatted HTML with styling and structure - **Markdown**: Structured markdown with tables and formatting - **JSON**: Structured data with coordinates and confidence scores - **XML**: Industry-standard document markup ### **Specialized Capabilities** - **Document Layout Analysis**: Detect tables, columns, headers - **Language Detection**: Automatic language identification - **Text Orientation**: Auto-rotate and correct orientation - **Quality Assessment**: OCR confidence scoring and validation - **Progress Tracking**: Real-time progress for long operations ## 🛠️ Implementation Plan ### **Phase 1: Core Infrastructure (Week 1-2)** ✅ **COMPLETED** - [x] Project structure setup (FastMCP 2.13+, dependencies) - [x] Basic server skeleton with MCP protocol - [x] Backend manager framework - [x] Model management system - [x] Basic error handling and logging ### **Phase 2: Scanner Control & Image Acquisition (Week 3-4)** ✅ **COMPLETED** - [x] WIA (Windows Image Acquisition) backend implementation - [x] Scanner device discovery and enumeration - [x] Scan parameter configuration (DPI, color mode, paper size) - [x] Batch scanning capabilities - [ ] TWAIN scanner support for broader compatibility - [ ] Image preprocessing pipeline (deskew, enhance, crop) ### **Phase 3: GOT-OCR2.0 Integration (Week 5)** - [ ] GOT-OCR2.0 backend implementation - [ ] Model loading and caching - [ ] Basic OCR processing pipeline - [ ] Plain text and formatted OCR modes - [ ] HTML rendering capability ### **Phase 3: Multi-Backend Support (Week 4)** - [ ] Tesseract backend integration - [ ] EasyOCR backend integration - [ ] Backend auto-selection logic - [ ] Fallback mechanisms - [ ] Performance benchmarking ### **Phase 4: Advanced Features (Week 5)** - [ ] Fine-grained region OCR - [ ] Multi-crop processing - [ ] Batch processing system - [ ] Progress tracking and cancellation - [ ] Quality assessment ### **Phase 5: Specialized Tools (Week 6)** - [ ] Document layout analysis - [ ] Format conversion tools - [ ] Model management tools - [ ] Health check and diagnostics - [ ] Performance monitoring ### **Phase 6: Ecosystem Integration (Week 7)** - [ ] MCPB packaging for easy installation - [ ] Documentation and examples - [ ] Integration guides for different MCP clients - [ ] Docker containerization - [ ] CI/CD pipeline setup ## 📋 Technical Specifications ### **Dependencies** ```toml [tool.poetry.dependencies] python = "^3.11" fastmcp = {extras = ["server"], version = "^2.13.0"} transformers = "^4.35.0" torch = "^2.1.0" pillow = "^10.0.0" accelerate = "^0.24.0" pytesseract = "^0.3.10" easyocr = "^1.7.0" paddlepaddle = "^2.5.0" paddleocr = "^2.7.0" ``` ### **Model Requirements** - **GOT-OCR2.0**: ~7GB VRAM for base model, ~14GB for large - **Tesseract**: Minimal resources, CPU-only - **EasyOCR**: ~2GB VRAM, supports CPU fallback - **PaddleOCR**: ~4GB VRAM, GPU acceleration recommended ### **API Design** #### **Main OCR Tool** ```python @mcp.tool() async def process_document( image_path: str, backend: str = "auto", # auto, got-ocr, tesseract, easyocr, paddle mode: str = "text", # text, format, fine-grained, layout output_format: str = "text", # text, html, markdown, json, xml region: Optional[List[int]] = None, # [x1,y1,x2,y2] for fine-grained language: Optional[str] = None, enhance_image: bool = True ) -> Dict[str, Any]: """Process document with specified OCR backend and mode.""" ``` #### **Batch Processing Tool** ```python @mcp.tool() async def process_batch( image_paths: List[str], backend: str = "auto", mode: str = "text", output_format: str = "json", max_concurrent: int = 4, progress_callback: Optional[str] = None ) -> Dict[str, Any]: """Process multiple documents in batch with progress tracking.""" ``` ## 🎯 Use Cases & Integration Points ### **MCP Ecosystem Integration** - **CalibreMCP**: Enhanced OCR backend (already integrated) - **Document MCP**: General document processing - **Research MCP**: Academic paper OCR and analysis - **Image MCP**: Image-to-text conversion - **Note-taking MCP**: Scanned note digitization ### **Real-World Applications** - **Digital Library Management**: OCR scanned books and documents - **Document Digitization**: Convert paper documents to searchable text - **Research Paper Processing**: Extract text from academic PDFs - **Receipt/Invoice Processing**: Automated data extraction - **Accessibility Tools**: Convert images to readable text for screen readers ### **Performance Targets** - **Single Image**: < 5 seconds for GOT-OCR2.0, < 2 seconds for Tesseract - **Batch Processing**: < 30 seconds for 10 images (concurrent processing) - **Memory Usage**: < 8GB VRAM for typical operations - **Accuracy**: > 95% for clean documents, > 85% for complex layouts ## 🏗️ Dual-Repository Architecture ### **Repository Structure** ``` ocr-mcp-ecosystem/ ├── ocr-mcp-server/ # This repo - FastMCP backend │ ├── src/ocr_mcp/ │ │ ├── core/ # Backend management, config │ │ ├── backends/ # OCR engines (GOT-OCR, Tesseract, etc.) │ │ ├── scanners/ # Scanner control (WIA, TWAIN) │ │ └── tools/ # MCP tool definitions │ └── pyproject.toml │ └── ocr-mcp-frontend/ # Future repo - User interface ├── webapp/ # React-based web application ├── desktop/ # Electron/Tauri Windows app ├── mobile/ # React Native iOS/Android app └── shared/ # Shared components and utilities ``` ### **OCR-MCP Server (This Repository)** **Role**: FastMCP backend providing scanner control and OCR processing - **Scanner Hardware Integration**: Direct WIA/TWAIN control of flatbed scanners - **Multi-OCR Backend Support**: GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR - **Image Processing Pipeline**: Acquisition → Preprocessing → OCR → Output - **MCP Ecosystem Integration**: Serve any MCP client with OCR capabilities ### **OCR-MCP Frontend (Future Repository)** **Role**: User interfaces for scanner control and OCR workflow management - **Web Application**: Modern React interface for browser-based scanning - **Windows Desktop App**: Native Windows application with full scanner integration - **Mobile Apps**: iOS/Android apps for mobile scanning workflows - **Shared Components**: Reusable UI components and utilities ## 🔧 Development Roadmap ### **Week 1-2: Foundation** - [ ] Initialize repository with FastMCP 2.13+ - [ ] Set up project structure and dependencies - [ ] Implement basic MCP server with health check - [ ] Create backend manager framework - [ ] Add comprehensive logging and error handling ### **Week 3: GOT-OCR2.0 Core** - [ ] Implement GOTOCRProcessor class - [ ] Add model loading and caching - [ ] Create basic OCR processing pipeline - [ ] Support plain text and formatted OCR modes - [ ] Add HTML rendering for formatted results ### **Week 4: Multi-Backend Expansion** - [ ] Integrate Tesseract backend - [ ] Add EasyOCR support - [ ] Implement auto-selection logic - [ ] Add backend fallback mechanisms - [ ] Performance testing and optimization ### **Week 5: Advanced Capabilities** - [ ] Fine-grained region OCR implementation - [ ] Multi-crop processing for large images - [ ] Batch processing with progress tracking - [ ] Quality assessment and confidence scoring - [ ] Document layout analysis ### **Week 6: Tool Ecosystem** - [ ] Specialized OCR tools for different backends - [ ] Format conversion utilities - [ ] Model management and monitoring tools - [ ] Health check and diagnostic tools - [ ] Performance benchmarking suite ### **Week 7: Production Ready** - [ ] Comprehensive documentation - [ ] MCPB packaging for distribution - [ ] Docker containerization - [ ] CI/CD pipeline with automated testing - [ ] Integration examples for different MCP clients ## 📊 Success Metrics ### **Technical Metrics** - **Tool Count**: 11 specialized OCR tools - **Backend Support**: 5 OCR engines (GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR, TrOCR) - **Processing Modes**: 4 modes (text, format, fine-grained, layout) - **Output Formats**: 5 formats (text, HTML, Markdown, JSON, XML) - **Performance**: < 5 seconds average processing time ### **Adoption Metrics** - **GitHub Stars**: Target 500+ stars within 6 months - **Downloads**: 10,000+ MCPB package downloads - **Ecosystem Integration**: 5+ MCP servers using OCR-MCP - **Community Contributions**: 10+ community backend integrations ### **Quality Metrics** - **Test Coverage**: > 90% code coverage - **Documentation**: 100% tool documentation compliance - **Error Rate**: < 1% processing failures - **User Satisfaction**: > 4.5/5 average rating ## 🎉 Vision & Impact **OCR-MCP represents the future of document understanding in the MCP ecosystem:** - **Democratization**: Bring state-of-the-art OCR to every MCP client - **Innovation**: Enable new document processing workflows and applications - **Integration**: Seamlessly connect with existing MCP servers and tools - **Performance**: Optimized for both speed and accuracy - **Open Source**: Community-driven development and improvement **By creating a dedicated OCR-MCP server, we transform document understanding from a niche capability into a fundamental building block of the MCP ecosystem, enabling powerful new workflows across research, content management, accessibility, and automation domains.** --- ## 📷 Scanner Control Implementation ### **WIA Backend Architecture** ```python # Core WIA Implementation class WIABackend: def __init__(self): self._wia_manager = None self._devices = {} def discover_scanners(self) -> List[ScannerInfo]: """Enumerate all WIA-compatible scanners""" def configure_scan(self, device_id: str, settings: ScanSettings): """Set DPI, color mode, paper size, brightness/contrast""" def scan_document(self, device_id: str) -> PIL.Image: """Acquire image from scanner and return PIL Image""" async def scan_batch(self, device_id: str, count: int) -> List[PIL.Image]: """Batch scan multiple documents with ADF support""" ``` ### **Scanner Tool Integration** - **`list_scanners`**: Discover and enumerate connected scanners - **`scanner_properties`**: Get detailed scanner capabilities - **`configure_scan`**: Set scan parameters (DPI, color, size) - **`scan_document`**: Single document scanning - **`scan_batch`**: Multi-document batch scanning - **`preview_scan`**: Low-resolution preview for positioning ### **Supported Scanner Types** - **Flatbed Scanners**: Epson, Canon, HP, Brother, etc. - **ADF Scanners**: Automatic Document Feeders for batch scanning - **Duplex Scanners**: Two-sided scanning capability - **Network Scanners**: IP-connected multifunction devices ### **Scan Workflow Integration** ``` User Request → MCP Tool → Scanner Manager → WIA Backend → Hardware → Image Processing → OCR → Results ``` --- ## 📊 Current Implementation Status ### **✅ Completed Features** - **OCR Backends**: GOT-OCR2.0, Tesseract (EasyOCR pending installation) - **Scanner Control**: Complete WIA implementation for Windows flatbed scanners - **Tool Ecosystem**: 9 MCP tools registered and functional - **Architecture**: Modular backend system with unified interfaces ### **🔧 Tool Count Breakdown** - **OCR Tools (2)**: `process_document`, `ocr_health_check` - **Scanner Tools (6)**: `list_scanners`, `scanner_properties`, `configure_scan`, `scan_document`, `scan_batch`, `preview_scan` - **Utility Tools (1)**: `list_backends` - **Total**: 9 tools registered and ready for use ### **🎯 Next Development Phases** **Phase 3: Enhanced OCR & Document Processing (Week 5-6)** ✅ **COMPLETED** - [x] Document processor backend for PDFs, CBZ, CBR, images - [x] Multi-format input handling (scanner, PDF, CBZ, JPG, etc.) - [x] Comic/manga processing modes with GOT-OCR2.0 optimization - [x] Advanced manga features (scaffold separation, panel analysis) - [x] Batch document processing with concurrency control - [x] File type auto-detection and routing - [ ] Complete GOT-OCR2.0 model integration with actual inference - [ ] Image preprocessing pipeline (deskew, enhance, crop) - [ ] TWAIN scanner support for broader compatibility **Phase 4: Production Readiness (Week 7-8)** - [ ] Comprehensive testing and error handling - [ ] Documentation and usage examples - [ ] Performance benchmarking - [ ] Docker containerization - [ ] MCPB packaging for distribution --- **Status**: Phase 2 Complete - OCR-MCP server functional with scanner control **Timeline**: 6-8 weeks to production-ready dual-repo ecosystem **Priority**: High - Unique scanner + OCR combination creates market differentiation **Next Step**: Begin Phase 3: Complete GOT-OCR2.0 integration and image preprocessing

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/sandraschi/ocr-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

OCR-MCP_MASTER_PLAN.md•19.9 KiB