Skip to main content
Glama
OCR-MCP_MASTER_PLAN.mdβ€’20.4 kB
# OCR-MCP: Revolutionary Document Understanding Server ## 🎯 Executive Summary **OCR-MCP** is a revolutionary **dual-repository OCR ecosystem** providing comprehensive document digitization and understanding capabilities: ### **πŸ–₯️ OCR-MCP Server (This Repo)** - **FastMCP 2.13+ Backend**: Specialized server with scanner control and OCR processing - **Hardware Integration**: Direct flatbed scanner control via WIA/TWAIN - **Multiple OCR Backends**: GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR, TrOCR - **Advanced Processing**: Plain text, formatted text, layout preservation, HTML rendering - **MCP Ecosystem**: Serve any MCP client with OCR capabilities ### **🌐 OCR-MCP Frontend (Future Repo)** - **Web Application**: Modern React-based interface for scanner control and OCR workflow - **Windows 64-bit App**: Native desktop application with full scanner integration - **Batch Processing**: Multi-document scanning and processing pipelines - **Results Management**: Organized storage and export of OCR results ### **πŸ”§ Key Innovations** - **Scanner Control**: Direct hardware integration with flatbed scanners - **One-Touch Digitization**: Scan β†’ OCR β†’ Export in single workflow - **Multi-Format Output**: Text, HTML, Markdown, JSON, XML, PDF - **Batch Operations**: Process multiple documents with progress tracking - **Open-Source Democratization**: Bring professional OCR to everyone ## πŸ—οΈ Architecture Overview ### Core Design Principles - **Modular Backend System**: Pluggable OCR engines with unified interface - **Specialized Tools**: Dedicated tools for different OCR use cases - **Resource Management**: Smart model loading and GPU memory management - **Error Resilience**: Graceful fallbacks and comprehensive error handling - **Performance Optimization**: Batch processing, caching, and async operations ### Tool Architecture #### **Scanner Control Tools** 1. **`list_scanners`** - Discover and enumerate available scanners 2. **`scanner_properties`** - Get scanner capabilities and settings 3. **`configure_scan`** - Configure scan parameters (DPI, color, size) 4. **`scan_document`** - Perform single document scan 5. **`scan_batch`** - Batch scanning with automatic processing 6. **`preview_scan`** - Preview scan for positioning and cropping #### **Primary OCR Tools (Portmanteau Pattern)** 7. **`process_document`** - Main OCR processing with multiple backends 8. **`process_batch`** - Batch document processing 9. **`analyze_layout`** - Document layout and structure analysis 10. **`extract_regions`** - Fine-grained region-based OCR #### **Specialized OCR Tools** 11. **`got_ocr_processor`** - GOT-OCR2.0 specific operations 12. **`tesseract_processor`** - Tesseract OCR operations 13. **`paddle_processor`** - PaddleOCR operations 14. **`easyocr_processor`** - EasyOCR operations #### **Utility Tools** 15. **`ocr_health_check`** - Backend availability and diagnostics 16. **`model_manager`** - Model loading and management 17. **`format_converter`** - OCR result format conversion 18. **`scan_to_ocr_pipeline`** - Complete scan-to-OCR workflow ### Backend Architecture ``` OCR-MCP Server β”œβ”€β”€ Core Engine β”‚ β”œβ”€β”€ Backend Manager (GOT-OCR, Tesseract, Paddle, EasyOCR) β”‚ β”œβ”€β”€ Model Manager (GPU memory, loading/unloading) β”‚ └── Resource Pool (connection pooling, rate limiting) β”‚ β”œβ”€β”€ Processing Pipeline β”‚ β”œβ”€β”€ Preprocessing (image enhancement, rotation correction) β”‚ β”œβ”€β”€ OCR Processing (backend routing, fallback logic) β”‚ └── Postprocessing (format conversion, validation) β”‚ β”œβ”€β”€ Specialized Tools β”‚ β”œβ”€β”€ Document Analysis (layout, structure, entities) β”‚ β”œβ”€β”€ Batch Processing (queue management, progress tracking) β”‚ └── Format Conversion (HTML, Markdown, JSON, XML) β”‚ └── Integration Layer β”œβ”€β”€ MCP Protocol Handler (FastMCP 2.13+) β”œβ”€β”€ Result Formatting (structured data, error handling) └── Client Adaptation (different MCP clients) ``` ## πŸ“· Scanner Control Architecture ### **Windows Image Acquisition (WIA) Integration** **Primary Scanner API for Windows:** - **Native Windows Support**: Built-in to Windows XP and later - **Device Discovery**: Automatic detection of connected scanners - **Parameter Control**: DPI, color depth, paper size, brightness/contrast - **Preview Scanning**: Live preview for document positioning - **Batch Scanning**: ADF (Automatic Document Feeder) support **Implementation Approach:** ```python # WIA Scanner Manager class WIAScannerManager: def __init__(self): self._wia_manager = None self._devices = {} def discover_scanners(self) -> List[ScannerInfo]: """Discover all connected scanners via WIA""" def get_scanner_properties(self, device_id: str) -> ScannerProperties: """Get scanner capabilities and current settings""" def configure_scan(self, device_id: str, settings: ScanSettings) -> bool: """Configure scanner parameters""" def scan_document(self, device_id: str, settings: ScanSettings) -> PIL.Image: """Perform document scan and return image""" async def scan_batch(self, device_id: str, count: int) -> List[PIL.Image]: """Batch scan multiple documents""" ``` ### **TWAIN Support (Cross-Platform Compatibility)** **Fallback and Advanced Features:** - **Industry Standard**: Supported by 99% of scanner manufacturers - **Cross-Platform**: Windows, macOS, Linux compatibility - **Advanced Features**: Custom driver features, specialized scanners - **Vendor Extensions**: Manufacturer-specific capabilities ### **Scanner Hardware Support Matrix** | Scanner Type | WIA Support | TWAIN Support | ADF Support | Duplex Support | |-------------|-------------|----------------|-------------|----------------| | **Flatbed** | βœ… Native | βœ… Full | ❌ | ❌ | | **ADF Scanner** | βœ… Native | βœ… Full | βœ… | ❌ | | **Duplex Scanner** | βœ… Native | βœ… Full | βœ… | βœ… | | **Network Scanner** | ⚠️ Limited | βœ… Full | βœ… | βœ… | | **Camera/Photo** | βœ… Native | ⚠️ Limited | ❌ | ❌ | ### **Scan Parameter Configuration** **Resolution Settings:** - **Draft Quality**: 150-200 DPI (fast scanning) - **Standard Quality**: 300 DPI (document scanning) - **High Quality**: 600+ DPI (archival, photos) **Color Modes:** - **Black & White**: 1-bit (text documents) - **Grayscale**: 8-bit (mixed content) - **Color**: 24-bit RGB (photos, color documents) **Paper Size Presets:** - **Letter**: 8.5" Γ— 11" (216mm Γ— 279mm) - **A4**: 210mm Γ— 297mm - **Legal**: 8.5" Γ— 14" (216mm Γ— 356mm) - **Custom**: User-defined dimensions ## 🎨 Key Features ### **Multi-Backend OCR Processing** - **GOT-OCR2.0**: Revolutionary vision-language model for formatted OCR - **Tesseract**: Proven open-source OCR with extensive language support - **PaddleOCR**: Industrial-grade OCR with high accuracy - **EasyOCR**: User-friendly OCR with 80+ language support - **TrOCR**: Microsoft transformer-based OCR ### **Advanced Processing Modes** - **Plain Text OCR**: Basic text extraction - **Formatted OCR**: Preserve layout, tables, columns - **Fine-grained OCR**: Extract specific regions by coordinates - **Multi-crop OCR**: Process large images by intelligent splitting - **Batch Processing**: Process multiple documents efficiently ### **Output Formats** - **Text**: Plain text extraction - **HTML**: Formatted HTML with styling and structure - **Markdown**: Structured markdown with tables and formatting - **JSON**: Structured data with coordinates and confidence scores - **XML**: Industry-standard document markup ### **Specialized Capabilities** - **Document Layout Analysis**: Detect tables, columns, headers - **Language Detection**: Automatic language identification - **Text Orientation**: Auto-rotate and correct orientation - **Quality Assessment**: OCR confidence scoring and validation - **Progress Tracking**: Real-time progress for long operations ## πŸ› οΈ Implementation Plan ### **Phase 1: Core Infrastructure (Week 1-2)** βœ… **COMPLETED** - [x] Project structure setup (FastMCP 2.13+, dependencies) - [x] Basic server skeleton with MCP protocol - [x] Backend manager framework - [x] Model management system - [x] Basic error handling and logging ### **Phase 2: Scanner Control & Image Acquisition (Week 3-4)** βœ… **COMPLETED** - [x] WIA (Windows Image Acquisition) backend implementation - [x] Scanner device discovery and enumeration - [x] Scan parameter configuration (DPI, color mode, paper size) - [x] Batch scanning capabilities - [ ] TWAIN scanner support for broader compatibility - [ ] Image preprocessing pipeline (deskew, enhance, crop) ### **Phase 3: GOT-OCR2.0 Integration (Week 5)** - [ ] GOT-OCR2.0 backend implementation - [ ] Model loading and caching - [ ] Basic OCR processing pipeline - [ ] Plain text and formatted OCR modes - [ ] HTML rendering capability ### **Phase 3: Multi-Backend Support (Week 4)** - [ ] Tesseract backend integration - [ ] EasyOCR backend integration - [ ] Backend auto-selection logic - [ ] Fallback mechanisms - [ ] Performance benchmarking ### **Phase 4: Advanced Features (Week 5)** - [ ] Fine-grained region OCR - [ ] Multi-crop processing - [ ] Batch processing system - [ ] Progress tracking and cancellation - [ ] Quality assessment ### **Phase 5: Specialized Tools (Week 6)** - [ ] Document layout analysis - [ ] Format conversion tools - [ ] Model management tools - [ ] Health check and diagnostics - [ ] Performance monitoring ### **Phase 6: Ecosystem Integration (Week 7)** - [ ] MCPB packaging for easy installation - [ ] Documentation and examples - [ ] Integration guides for different MCP clients - [ ] Docker containerization - [ ] CI/CD pipeline setup ## πŸ“‹ Technical Specifications ### **Dependencies** ```toml [tool.poetry.dependencies] python = "^3.11" fastmcp = {extras = ["server"], version = "^2.13.0"} transformers = "^4.35.0" torch = "^2.1.0" pillow = "^10.0.0" accelerate = "^0.24.0" pytesseract = "^0.3.10" easyocr = "^1.7.0" paddlepaddle = "^2.5.0" paddleocr = "^2.7.0" ``` ### **Model Requirements** - **GOT-OCR2.0**: ~7GB VRAM for base model, ~14GB for large - **Tesseract**: Minimal resources, CPU-only - **EasyOCR**: ~2GB VRAM, supports CPU fallback - **PaddleOCR**: ~4GB VRAM, GPU acceleration recommended ### **API Design** #### **Main OCR Tool** ```python @mcp.tool() async def process_document( image_path: str, backend: str = "auto", # auto, got-ocr, tesseract, easyocr, paddle mode: str = "text", # text, format, fine-grained, layout output_format: str = "text", # text, html, markdown, json, xml region: Optional[List[int]] = None, # [x1,y1,x2,y2] for fine-grained language: Optional[str] = None, enhance_image: bool = True ) -> Dict[str, Any]: """Process document with specified OCR backend and mode.""" ``` #### **Batch Processing Tool** ```python @mcp.tool() async def process_batch( image_paths: List[str], backend: str = "auto", mode: str = "text", output_format: str = "json", max_concurrent: int = 4, progress_callback: Optional[str] = None ) -> Dict[str, Any]: """Process multiple documents in batch with progress tracking.""" ``` ## 🎯 Use Cases & Integration Points ### **MCP Ecosystem Integration** - **CalibreMCP**: Enhanced OCR backend (already integrated) - **Document MCP**: General document processing - **Research MCP**: Academic paper OCR and analysis - **Image MCP**: Image-to-text conversion - **Note-taking MCP**: Scanned note digitization ### **Real-World Applications** - **Digital Library Management**: OCR scanned books and documents - **Document Digitization**: Convert paper documents to searchable text - **Research Paper Processing**: Extract text from academic PDFs - **Receipt/Invoice Processing**: Automated data extraction - **Accessibility Tools**: Convert images to readable text for screen readers ### **Performance Targets** - **Single Image**: < 5 seconds for GOT-OCR2.0, < 2 seconds for Tesseract - **Batch Processing**: < 30 seconds for 10 images (concurrent processing) - **Memory Usage**: < 8GB VRAM for typical operations - **Accuracy**: > 95% for clean documents, > 85% for complex layouts ## πŸ—οΈ Dual-Repository Architecture ### **Repository Structure** ``` ocr-mcp-ecosystem/ β”œβ”€β”€ ocr-mcp-server/ # This repo - FastMCP backend β”‚ β”œβ”€β”€ src/ocr_mcp/ β”‚ β”‚ β”œβ”€β”€ core/ # Backend management, config β”‚ β”‚ β”œβ”€β”€ backends/ # OCR engines (GOT-OCR, Tesseract, etc.) β”‚ β”‚ β”œβ”€β”€ scanners/ # Scanner control (WIA, TWAIN) β”‚ β”‚ └── tools/ # MCP tool definitions β”‚ └── pyproject.toml β”‚ └── ocr-mcp-frontend/ # Future repo - User interface β”œβ”€β”€ webapp/ # React-based web application β”œβ”€β”€ desktop/ # Electron/Tauri Windows app β”œβ”€β”€ mobile/ # React Native iOS/Android app └── shared/ # Shared components and utilities ``` ### **OCR-MCP Server (This Repository)** **Role**: FastMCP backend providing scanner control and OCR processing - **Scanner Hardware Integration**: Direct WIA/TWAIN control of flatbed scanners - **Multi-OCR Backend Support**: GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR - **Image Processing Pipeline**: Acquisition β†’ Preprocessing β†’ OCR β†’ Output - **MCP Ecosystem Integration**: Serve any MCP client with OCR capabilities ### **OCR-MCP Frontend (Future Repository)** **Role**: User interfaces for scanner control and OCR workflow management - **Web Application**: Modern React interface for browser-based scanning - **Windows Desktop App**: Native Windows application with full scanner integration - **Mobile Apps**: iOS/Android apps for mobile scanning workflows - **Shared Components**: Reusable UI components and utilities ## πŸ”§ Development Roadmap ### **Week 1-2: Foundation** - [ ] Initialize repository with FastMCP 2.13+ - [ ] Set up project structure and dependencies - [ ] Implement basic MCP server with health check - [ ] Create backend manager framework - [ ] Add comprehensive logging and error handling ### **Week 3: GOT-OCR2.0 Core** - [ ] Implement GOTOCRProcessor class - [ ] Add model loading and caching - [ ] Create basic OCR processing pipeline - [ ] Support plain text and formatted OCR modes - [ ] Add HTML rendering for formatted results ### **Week 4: Multi-Backend Expansion** - [ ] Integrate Tesseract backend - [ ] Add EasyOCR support - [ ] Implement auto-selection logic - [ ] Add backend fallback mechanisms - [ ] Performance testing and optimization ### **Week 5: Advanced Capabilities** - [ ] Fine-grained region OCR implementation - [ ] Multi-crop processing for large images - [ ] Batch processing with progress tracking - [ ] Quality assessment and confidence scoring - [ ] Document layout analysis ### **Week 6: Tool Ecosystem** - [ ] Specialized OCR tools for different backends - [ ] Format conversion utilities - [ ] Model management and monitoring tools - [ ] Health check and diagnostic tools - [ ] Performance benchmarking suite ### **Week 7: Production Ready** - [ ] Comprehensive documentation - [ ] MCPB packaging for distribution - [ ] Docker containerization - [ ] CI/CD pipeline with automated testing - [ ] Integration examples for different MCP clients ## πŸ“Š Success Metrics ### **Technical Metrics** - **Tool Count**: 11 specialized OCR tools - **Backend Support**: 5 OCR engines (GOT-OCR2.0, Tesseract, EasyOCR, PaddleOCR, TrOCR) - **Processing Modes**: 4 modes (text, format, fine-grained, layout) - **Output Formats**: 5 formats (text, HTML, Markdown, JSON, XML) - **Performance**: < 5 seconds average processing time ### **Adoption Metrics** - **GitHub Stars**: Target 500+ stars within 6 months - **Downloads**: 10,000+ MCPB package downloads - **Ecosystem Integration**: 5+ MCP servers using OCR-MCP - **Community Contributions**: 10+ community backend integrations ### **Quality Metrics** - **Test Coverage**: > 90% code coverage - **Documentation**: 100% tool documentation compliance - **Error Rate**: < 1% processing failures - **User Satisfaction**: > 4.5/5 average rating ## πŸŽ‰ Vision & Impact **OCR-MCP represents the future of document understanding in the MCP ecosystem:** - **Democratization**: Bring state-of-the-art OCR to every MCP client - **Innovation**: Enable new document processing workflows and applications - **Integration**: Seamlessly connect with existing MCP servers and tools - **Performance**: Optimized for both speed and accuracy - **Open Source**: Community-driven development and improvement **By creating a dedicated OCR-MCP server, we transform document understanding from a niche capability into a fundamental building block of the MCP ecosystem, enabling powerful new workflows across research, content management, accessibility, and automation domains.** --- ## πŸ“· Scanner Control Implementation ### **WIA Backend Architecture** ```python # Core WIA Implementation class WIABackend: def __init__(self): self._wia_manager = None self._devices = {} def discover_scanners(self) -> List[ScannerInfo]: """Enumerate all WIA-compatible scanners""" def configure_scan(self, device_id: str, settings: ScanSettings): """Set DPI, color mode, paper size, brightness/contrast""" def scan_document(self, device_id: str) -> PIL.Image: """Acquire image from scanner and return PIL Image""" async def scan_batch(self, device_id: str, count: int) -> List[PIL.Image]: """Batch scan multiple documents with ADF support""" ``` ### **Scanner Tool Integration** - **`list_scanners`**: Discover and enumerate connected scanners - **`scanner_properties`**: Get detailed scanner capabilities - **`configure_scan`**: Set scan parameters (DPI, color, size) - **`scan_document`**: Single document scanning - **`scan_batch`**: Multi-document batch scanning - **`preview_scan`**: Low-resolution preview for positioning ### **Supported Scanner Types** - **Flatbed Scanners**: Epson, Canon, HP, Brother, etc. - **ADF Scanners**: Automatic Document Feeders for batch scanning - **Duplex Scanners**: Two-sided scanning capability - **Network Scanners**: IP-connected multifunction devices ### **Scan Workflow Integration** ``` User Request β†’ MCP Tool β†’ Scanner Manager β†’ WIA Backend β†’ Hardware β†’ Image Processing β†’ OCR β†’ Results ``` --- ## πŸ“Š Current Implementation Status ### **βœ… Completed Features** - **OCR Backends**: GOT-OCR2.0, Tesseract (EasyOCR pending installation) - **Scanner Control**: Complete WIA implementation for Windows flatbed scanners - **Tool Ecosystem**: 9 MCP tools registered and functional - **Architecture**: Modular backend system with unified interfaces ### **πŸ”§ Tool Count Breakdown** - **OCR Tools (2)**: `process_document`, `ocr_health_check` - **Scanner Tools (6)**: `list_scanners`, `scanner_properties`, `configure_scan`, `scan_document`, `scan_batch`, `preview_scan` - **Utility Tools (1)**: `list_backends` - **Total**: 9 tools registered and ready for use ### **🎯 Next Development Phases** **Phase 3: Enhanced OCR & Document Processing (Week 5-6)** βœ… **COMPLETED** - [x] Document processor backend for PDFs, CBZ, CBR, images - [x] Multi-format input handling (scanner, PDF, CBZ, JPG, etc.) - [x] Comic/manga processing modes with GOT-OCR2.0 optimization - [x] Advanced manga features (scaffold separation, panel analysis) - [x] Batch document processing with concurrency control - [x] File type auto-detection and routing - [ ] Complete GOT-OCR2.0 model integration with actual inference - [ ] Image preprocessing pipeline (deskew, enhance, crop) - [ ] TWAIN scanner support for broader compatibility **Phase 4: Production Readiness (Week 7-8)** - [ ] Comprehensive testing and error handling - [ ] Documentation and usage examples - [ ] Performance benchmarking - [ ] Docker containerization - [ ] MCPB packaging for distribution --- **Status**: Phase 2 Complete - OCR-MCP server functional with scanner control **Timeline**: 6-8 weeks to production-ready dual-repo ecosystem **Priority**: High - Unique scanner + OCR combination creates market differentiation **Next Step**: Begin Phase 3: Complete GOT-OCR2.0 integration and image preprocessing

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/sandraschi/ocr-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server