Skip to main content
Glama

OCR-MCP: Advanced Document Processing Server

Python FastMCP GOT-OCR2.0 License Status

FastMCP 2.13+ server providing advanced OCR capabilities including GOT-OCR2.0 integration, WIA scanner control, and multi-format document processing.

📋 Table of Contents

What is OCR-MCP?

OCR-MCP is a FastMCP server that provides comprehensive OCR (Optical Character Recognition) capabilities to MCP clients. It processes various document formats and integrates with scanner hardware.

State-of-the-Art OCR Integration

OCR-MCP integrates multiple current state-of-the-art OCR models for comprehensive document processing:

Primary OCR Engines

🔥 DeepSeek-OCR (October 2025) - Current State-of-the-Art

🎯 Florence-2 (June 2024) - Microsoft's Vision Foundation Model

  • Architecture: Unified vision-language model for various vision tasks

  • OCR Capabilities: Excellent text extraction and layout understanding

  • Strengths: Multi-task learning, fine-grained text recognition

  • Repository: https://huggingface.co/microsoft/Florence-2-base

📊 DOTS.OCR (July 2025) - Document Understanding Specialist

🚀 PP-OCRv5 (2025) - Industrial-Grade OCR

🎨 Qwen-Image-Layered (December 2025) - Advanced Image Decomposition

  • Technology: Decomposes images into multiple independent RGBA layers

  • OCR Integration: Isolate text, background, and structural elements for better OCR

  • Capabilities: Layer-independent editing, resizing, repositioning, recoloring

  • Repository: https://huggingface.co/Qwen/Qwen-Image-Layered

  • Paper: https://arxiv.org/abs/2512.15603

  • Use Case: Pre-process complex documents by separating text layers from backgrounds

OCR Capabilities

  • Plain Text OCR: Standard text extraction from images

  • Formatted Text OCR: Preserves layout and formatting structure

  • Fine-Grained OCR: Extract text from specific regions with coordinate precision

  • Multi-Crop OCR: Process documents with complex layouts by dividing into regions

  • HTML Rendering: Generate HTML output with visual layout preservation

  • Document Understanding: Table extraction, formula recognition, layout analysis

Auto-Backend Selection

OCR-MCP automatically selects the best backend based on:

  • Document Type: PDF, image, scanned document, or comic

  • Content Complexity: Plain text vs. structured documents

  • Language Requirements: Multilingual content detection

  • Performance Needs: Speed vs. accuracy trade-offs

Advanced Document Pre-processing

Qwen-Image-Layered Integration revolutionizes OCR through intelligent image decomposition:

  • Layer Separation: Decompose documents into independent RGBA layers (text, background, images, graphics)

  • Selective OCR: Process text layers independently for improved accuracy on complex documents

  • Noise Reduction: Isolate and remove background noise, watermarks, and interfering elements

  • Content Isolation: Separate handwritten notes, stamps, and annotations from main text

  • Layout Preservation: Maintain document structure while enabling targeted OCR processing

  • Multi-modal Enhancement: Combine with traditional OCR for hybrid processing pipelines

Community & Industry Adoption

Current OCR landscape shows rapid evolution:

  • DeepSeek-OCR: Leading downloads indicate community preference

  • Florence-2: Academic and research adoption

  • DOTS.OCR: Document processing industry standard

  • PP-OCRv5: Production deployment in enterprise applications

Key Features

  • Multiple OCR Backends: GOT-OCR2.0, Tesseract, EasyOCR

  • Processing Modes: Plain text, formatted text, layout preservation, HTML rendering, fine-grained region extraction

  • Document Formats: PDF, CBZ/CBR comic archives, JPG/PNG/TIFF images, scanner input

  • Scanner Integration: Direct WIA control for Windows flatbed scanners

  • Batch Processing: Concurrent processing of multiple documents

  • Output Formats: Text, HTML, Markdown, JSON, XML

🏗️ Architecture

Backend Support Matrix

Backend

Plain OCR

Formatted OCR

Multi-language

GPU Support

Offline

GOT-OCR2.0

Tesseract

EasyOCR

PaddleOCR

TrOCR

Tool Ecosystem

  • process_document - Main OCR processing with backend selection

  • process_batch - Batch document processing with progress tracking

  • extract_regions - Fine-grained region-based OCR

  • analyze_layout - Document structure and layout analysis

  • convert_format - OCR result format conversion

  • ocr_health_check - Backend availability and diagnostics

🚀 Quick Start

Prerequisites

  • Python 3.11+

  • GPU recommended (for GOT-OCR2.0 and other ML models)

  • 8GB+ VRAM for optimal performance

Installation

# Clone the repository git clone https://github.com/sandraschi/ocr-mcp.git cd ocr-mcp # Install dependencies with Poetry (recommended) poetry install # For GPU support (optional but recommended) poetry run pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

MCP Configuration

Add to your claude_desktop_config.json:

{ "mcpServers": { "ocr-mcp": { "command": "python", "args": ["-m", "ocr_mcp.server"], "env": { "OCR_CACHE_DIR": "/path/to/model/cache", "OCR_DEVICE": "cuda" } } } }

WebApp Mode

OCR-MCP includes a full-featured web interface for document processing:

# Run the web application poetry run ocr-mcp-webapp # Or use the script directly python scripts/run_webapp.py

The web interface provides:

  • 📤 Drag & drop file upload - Support for PDF, images, CBZ

  • 🔄 Real-time processing - Live status updates and progress

  • 📷 Scanner integration - Direct scanner control via web interface

  • 📊 Batch processing - Process multiple documents simultaneously

  • 🎨 OCR backend selection - Choose from 5 different OCR engines

  • 📋 Results visualization - Text, JSON, and HTML output formats

Access the webapp at: http://localhost:8000

🌐 WebApp Interface

OCR-MCP provides a modern web interface for document processing and scanner control:

Features

  • 📤 File Upload: Drag & drop interface supporting PDF, PNG, JPG, TIFF, BMP, CBZ, CBR

  • 🔄 Live Processing: Real-time status updates with progress indicators

  • 📷 Scanner Control: Discover and control WIA-compatible scanners

  • 📊 Batch Operations: Process multiple documents simultaneously

  • 🎨 Backend Selection: Choose from 5 different OCR engines per task

  • 📋 Multi-format Output: View results as plain text, JSON, or HTML

  • 💾 Export Options: Download results or copy to clipboard

Interface Sections

Upload & Process Tab

  • Single document processing with drag-and-drop upload

  • OCR backend selection (DeepSeek-OCR, Florence-2, DOTS.OCR, PP-OCRv5, Qwen-Image-Layered)

  • Processing mode selection (Text, Formatted, Fine-grained)

  • Real-time processing status and results display

Scanner Control Tab

  • Automatic scanner discovery

  • Scanner properties configuration (DPI, color mode, paper size)

  • Single document scanning

  • Direct integration with OCR processing

Batch Processing Tab

  • Multiple file selection and management

  • Concurrent processing with progress tracking

  • Batch results aggregation

Settings Tab

  • System health monitoring

  • OCR backend availability status

  • Configuration diagnostics

WebApp Architecture

The webapp consists of:

  • FastAPI Backend: RESTful API server with async processing

  • MCP Integration: Direct communication with OCR-MCP server

  • Modern Frontend: Responsive HTML/CSS/JavaScript interface

  • File Management: Secure temporary file handling

  • Real-time Updates: WebSocket-like status polling

💡 Usage Examples

Basic OCR Processing

# Auto-select best available backend result = await process_document( image_path="/path/to/document.png" ) print(result["text"]) # Extracted text

Formatted OCR with HTML Output

# GOT-OCR2.0 formatted text preservation result = await process_document( image_path="/path/to/scanned_page.png", backend="got-ocr", mode="format", output_format="html" ) # Returns: HTML with preserved layout and formatting

Fine-grained Region Extraction

# Extract text from specific coordinates result = await extract_regions( image_path="/path/to/document.png", regions=[ {"x1": 100, "y1": 200, "x2": 400, "y2": 300, "label": "title"}, {"x1": 100, "y1": 350, "x2": 500, "y2": 600, "label": "content"} ] ) # Returns: Structured text extraction by region

Batch Processing

# Process multiple documents results = await process_batch( image_paths=[ "/path/to/doc1.png", "/path/to/doc2.png", "/path/to/doc3.png" ], backend="got-ocr", output_format="json" ) # Returns: Array of OCR results with progress tracking

🎨 Advanced Features

Document Layout Analysis

# Analyze document structure layout = await analyze_layout( image_path="/path/to/complex_document.png" ) # Returns: Detected tables, columns, headers, text blocks

Multi-Backend Comparison

# Compare OCR accuracy across backends comparison = await compare_backends( image_path="/path/to/test_image.png", backends=["got-ocr", "tesseract", "easyocr"] ) # Returns: Accuracy scores, processing times, confidence metrics

Format Conversion

# Convert OCR results between formats html_result = await convert_format( ocr_result=raw_result, from_format="text", to_format="html", preserve_layout=True )

🔧 Configuration Options

Environment Variables

  • OCR_CACHE_DIR: Model cache directory (default: ~/.cache/ocr-mcp)

  • OCR_DEVICE: Computing device (cuda, cpu, auto)

  • OCR_MAX_MEMORY: Maximum GPU memory usage in GB

  • OCR_DEFAULT_BACKEND: Default OCR backend (got-ocr, tesseract, etc.)

  • OCR_BATCH_SIZE: Default batch processing size

Backend-Specific Settings

# config/ocr_config.yaml backends: got_ocr: model_size: "base" # or "large" cache_dir: "/models/got-ocr" device: "cuda:0" tesseract: language: "eng+fra+deu" config: "--psm 6" easyocr: languages: ["en", "fr", "de"] gpu: true

📊 Performance Benchmarks

Single Image Processing (GTX 3080)

Backend

Plain OCR

Formatted OCR

Fine-grained

GOT-OCR2.0

2.3s

3.1s

4.2s

Tesseract

0.8s

N/A

1.2s

EasyOCR

1.5s

N/A

2.1s

PaddleOCR

1.8s

2.9s

3.5s

Accuracy Comparison (Clean Documents)

Backend

Print Text

Handwriting

Mixed Content

GOT-OCR2.0

97.2%

89.1%

94.8%

Tesseract

92.1%

45.3%

78.9%

EasyOCR

94.7%

78.2%

88.5%

PaddleOCR

95.8%

82.1%

91.2%

🛠️ Development Status

  • Planning: Complete master plan and architecture

  • 🟡 Phase 1: Core infrastructure (In Progress)

  • Phase 2: GOT-OCR2.0 integration

  • Phase 3: Multi-backend support

  • Phase 4: Advanced features

  • Phase 5: Specialized tools

  • Phase 6: Production deployment

See OCR-MCP_MASTER_PLAN.md for detailed roadmap.

🤝 Integration with Existing MCP Servers

CalibreMCP Integration

OCR-MCP enhances CalibreMCP's OCR capabilities:

# CalibreMCP can now use OCR-MCP for advanced processing result = await calibre_ocr( source="/path/to/scanned_book.pdf", provider="ocr-mcp", # New option! mode="format", render_html=True )

Document Processing Workflows

  • Research Papers: Extract structured text from academic PDFs

  • Receipt Processing: Automated data extraction from scanned receipts

  • Book Digitization: High-quality OCR for scanned books

  • Accessibility: Convert images to readable text for screen readers

📈 Roadmap

Immediate (Next 4 weeks)

  • Complete core infrastructure

  • GOT-OCR2.0 integration

  • Basic tool implementation

  • Documentation and examples

Medium-term (2-3 months)

  • Multi-backend support

  • Advanced processing modes

  • Batch processing optimization

  • Performance benchmarking

Long-term (6+ months)

  • Community backend integrations

  • Specialized domain models

  • Real-time processing capabilities

  • Mobile app integration

🤝 Contributing

OCR-MCP welcomes contributions! Areas of particular interest:

  • New OCR Backends: Integration of additional OCR engines

  • Performance Optimization: GPU memory management, batch processing

  • Specialized Models: Domain-specific OCR improvements

  • Documentation: Usage examples, integration guides

  • Testing: Comprehensive test coverage and benchmarks

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • GOT-OCR2.0 Team (UCAS): Revolutionary OCR model that inspired this project

  • FastMCP Community: Excellent framework for MCP server development

  • Open Source OCR Community: Tesseract, EasyOCR, PaddleOCR, and others


OCR-MCP: Democratizing state-of-the-art document understanding for the MCP ecosystem! 🌟

See OCR-MCP_MASTER_PLAN.md for technical details and implementation roadmap.

-
security - not tested
A
license - permissive license
-
quality - not tested

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/sandraschi/ocr-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server