Skip to main content
Glama
lihongwen

PDF Reader MCP Server

by lihongwen

PDF Reader MCP Server

An MCP (Model Context Protocol) server that provides comprehensive PDF processing capabilities with 18 powerful tools for text extraction, OCR, image conversion, metadata management, and optimization.

๐Ÿš€ Latest Updates

  • โœ… All 18 tools fully tested and working (September 2025)

  • ๐Ÿ”ง Fixed JSON serialization issues - 100% compatibility achieved

  • โšก Enhanced performance with intelligent caching system

  • ๐ŸŒ Multi-language OCR support with Chinese and English optimization

Features

๐Ÿ“– Smart Text Extraction

  • Intelligent PDF parsing with pdfplumber for high-quality text extraction

  • Automatic quality detection to identify when OCR is needed

  • Page-wise processing with flexible page range syntax

๐Ÿ” Advanced OCR Support

  • Tesseract integration for scanned documents and image-based PDFs

  • Multi-language support with focus on Chinese and English

  • Confidence scoring for OCR quality assessment

  • Windows-friendly installation and setup

โšก Performance Optimized

  • Smart caching system to avoid reprocessing unchanged files

  • Chunking strategies for handling large documents

  • Parallel page processing for improved performance

๐ŸŽฏ Flexible Page Selection

Support for complex page ranges:

  • "1,3,5" - Specific pages

  • "1-10" - Page ranges

  • "-1" - Last page

  • "1,3,5-10,-1" - Combined syntax

Installation

Install and run with uvx (easiest method):

# Install and run directly with uvx (no setup required)
uvx pdfmcp-tools

# Or install globally for repeated use
uv tool install pdfmcp-tools
pdfmcp-tools

Install from PyPI with pip:

# Install from PyPI
pip install pdfmcp-tools

# Run the server (both commands work)
pdfmcp-tools
# or
pdfreadermcp

Prerequisites

  • Python 3.11+ (automatically handled by uvx/pip)

  • Tesseract OCR engine (for OCR functionality)

Install Tesseract OCR Engine

macOS:

# Using Homebrew (recommended)
brew install tesseract tesseract-lang

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra

Windows:

  1. Download from: https://github.com/UB-Mannheim/tesseract/wiki

  2. Install the latest version (recommended: tesseract-ocr-w64-setup-v5.3.3.20231005.exe)

  3. During installation, select "Additional Language Data" and install Chinese language packs

  4. Add Tesseract to your PATH, or note the installation path for configuration

Development Installation (Advanced)

For development or local modification:

  1. Install uv package manager (if not already installed):

    macOS/Linux:

    curl -LsSf https://astral.sh/uv/install.sh | sh

    Windows:

    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  2. Clone and install for development:

    git clone https://github.com/lihongwen/pdfreadermcp.git
    cd pdfreadermcp
    uv sync --dev
    uv run pdfreadermcp

Usage

๐Ÿš€ Running the Server

With uvx (recommended):

# Run directly (auto-downloads and starts)
uvx pdfreadermcp

# Or if globally installed
pdfreadermcp

With pip installation:

# After pip install pdfreadermcp
pdfreadermcp

Development mode:

# In project directory
uv run pdfreadermcp

Integration with Claude Desktop

Add to your Claude Desktop MCP configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json Windows: %APPDATA%\Claude\claude_desktop_config.json

Option 1: Using uvx (recommended):

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "uvx",
      "args": ["pdfmcp-tools"]
    }
  }
}

Option 2: Using global installation:

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "pdfmcp-tools"
    }
  }
}

Option 3: Development/local installation:

{
  "mcpServers": {
    "pdfreadermcp": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/your/pdfreadermcp",
        "run",
        "pdfreadermcp"
      ]
    }
  }
}

๐Ÿ“š Complete Tool Suite (18 Tools)

All tools have been thoroughly tested and are fully functional. The server provides comprehensive PDF processing capabilities across multiple categories:

๐Ÿ“– Text Processing Tools (5 tools)

  • read_pdf - Intelligent text extraction with chunking

  • extract_page_text - Single page text extraction with multiple modes

  • search_pdf_text - Advanced text search with regex and context support

  • find_and_highlight_text - Text search with highlighting coordinates

  • get_pdf_metadata - Comprehensive metadata reading with XMP support

๐Ÿ“„ Document Operations Tools (5 tools)

  • split_pdf - Split PDFs into multiple files by page ranges

  • extract_pages - Extract specific pages to new PDF file

  • merge_pdfs - Combine multiple PDFs into single document

  • set_pdf_metadata - Write/update PDF metadata fields

  • remove_pdf_metadata - Remove specific or all metadata fields

๐Ÿ–ผ๏ธ Image Conversion Tools (3 tools)

  • pdf_to_images - Convert PDF pages to high-quality images

  • images_to_pdf - Convert multiple images to single PDF

  • extract_pdf_images - Extract embedded images from PDF pages

๐Ÿ” OCR Tool (1 tool)

  • ocr_pdf - Advanced OCR with multi-language support and confidence scoring

โšก Optimization Tools (4 tools)

  • optimize_pdf - Comprehensive PDF optimization with multiple levels

  • compress_pdf_images - Image compression within PDF documents

  • remove_pdf_content - Remove specific content to reduce file size

  • analyze_pdf_size - File size analysis and optimization recommendations

Tools

read_pdf - Text Extraction Tool

Extracts text from PDF files with intelligent processing.

Parameters:

  • file_path (required): Path to PDF file

  • pages (optional): Page range string (e.g., "1,3,5-10,-1")

  • chunk_size (optional): Maximum chunk size (default: 1000)

  • chunk_overlap (optional): Chunk overlap (default: 100)

Example:

Extract text from document.pdf, pages 1-5 and last page

ocr_pdf - OCR Recognition Tool

Performs OCR on PDF pages using Tesseract for scanned documents and image-based PDFs.

Parameters:

  • file_path (required): Path to PDF file

  • pages (optional): Page range string (e.g., "1,3,5-10,-1")

  • language (optional): OCR language code (default: "chi_sim" for Chinese)

  • chunk_size (optional): Maximum chunk size (default: 1000)

  • chunk_overlap (optional): Chunk overlap (default: 100)

  • dpi (optional): DPI for PDF to image conversion (default: 200)

Supported Languages:

  • chi_sim: Simplified Chinese (้ป˜่ฎค)

  • chi_tra: Traditional Chinese

  • eng: English

  • chi_sim+eng: Chinese and English mixed

Example:

Perform OCR on scanned_doc.pdf with Chinese text recognition

split_pdf - PDF Splitting Tool

Split PDF into multiple files based on page ranges.

Parameters:

  • file_path (required): Path to source PDF file

  • split_ranges (required): List of page ranges (e.g., ["1-5", "6-10", "11-15"])

  • output_dir (optional): Output directory (defaults to source file directory)

  • prefix (optional): Output file prefix (defaults to source filename)

Example:

Split document.pdf into multiple files: pages 1-10, 11-20, 21-30

extract_pages - Page Extraction Tool

Extract specific pages from PDF to a new file.

Parameters:

  • file_path (required): Path to source PDF file

  • pages (required): Page range (e.g., "1,3,5-7")

  • output_file (optional): Output filename (auto-generated if not provided)

  • output_dir (optional): Output directory (defaults to source file directory)

Example:

Extract pages 1, 5-8, and 15 from document.pdf

merge_pdfs - PDF Merging Tool

Merge multiple PDF files into a single file.

Parameters:

  • file_paths (required): List of PDF file paths to merge

  • output_file (optional): Output filename (auto-generated if not provided)

  • output_dir (optional): Output directory (defaults to first file's directory)

Example:

Merge file1.pdf, file2.pdf, and file3.pdf into a single document

pdf_to_images - PDF to Images Converter

Convert PDF pages to high-quality images using pdf2image.

Parameters:

  • file_path (required): Path to PDF file

  • pages (optional): Page range (e.g., "1,3,5-10,-1")

  • dpi (optional): Resolution for conversion (default: 200)

  • image_format (optional): Output format ('PNG', 'JPEG', etc.)

  • output_dir (optional): Directory to save images

  • save_to_disk (optional): Save to disk or keep in memory (default: True)

Example:

Convert first 5 pages of document.pdf to PNG images at 300 DPI

images_to_pdf - Images to PDF Converter

Convert multiple images into a single PDF document.

Parameters:

  • image_paths (required): List of image file paths

  • output_file (required): Output PDF file path

  • page_size (optional): Page size ('A4', 'Letter', 'Legal', 'auto')

  • quality (optional): JPEG compression quality (1-100, default: 95)

  • title (optional): PDF document title

  • author (optional): PDF document author

Example:

Convert scan1.jpg, scan2.jpg, scan3.jpg to a single PDF with A4 pages

extract_pdf_images - PDF Image Extractor

Extract all embedded images from PDF pages.

Parameters:

  • file_path (required): Path to PDF file

  • pages (optional): Page range (e.g., "1,3,5-10,-1")

  • min_size (optional): Minimum image size ("WIDTHxHEIGHT", default: "100x100")

  • output_dir (optional): Directory to save extracted images

Example:

Extract all images larger than 200x200 pixels from PDF pages 1-10

get_pdf_metadata - PDF Metadata Reader

Read comprehensive metadata information from PDF documents.

Parameters:

  • file_path (required): Path to PDF file

  • include_xmp (optional): Include advanced XMP metadata (default: False)

Example:

Read all metadata from document.pdf including title, author, creation date

set_pdf_metadata - PDF Metadata Writer

Write or update PDF metadata fields.

Parameters:

  • file_path (required): Path to source PDF file

  • output_file (optional): Output PDF file path

  • title (optional): Document title

  • author (optional): Document author

  • subject (optional): Document subject

  • creator (optional): Creator application name

  • producer (optional): Producer application name

  • keywords (optional): Keywords or tags

  • preserve_existing (optional): Preserve existing metadata (default: True)

Example:

Set metadata for report.pdf with title "Annual Report 2024" and author "John Doe"

remove_pdf_metadata - PDF Metadata Remover

Remove specific metadata fields or all metadata from PDF.

Parameters:

  • file_path (required): Path to source PDF file

  • output_file (optional): Output PDF file path

  • fields_to_remove (optional): List of specific fields to remove

  • remove_all (optional): Remove all metadata (default: False)

Example:

Remove author and title metadata from sensitive_document.pdf

search_pdf_text - PDF Text Search Engine

Search for text content across PDF pages with detailed match information.

Parameters:

  • file_path (required): Path to PDF file

  • query (required): Text to search for (or regex pattern)

  • pages (optional): Page range (e.g., "1,3,5-10,-1")

  • case_sensitive (optional): Case-sensitive search (default: False)

  • regex_search (optional): Treat query as regex pattern (default: False)

  • context_chars (optional): Context characters around matches (default: 100)

  • max_matches (optional): Maximum matches to return (default: 100)

Example:

Search for "financial report" in document.pdf with case-insensitive matching

extract_page_text - Single Page Text Extractor

Extract text from a specific PDF page with various extraction options.

Parameters:

  • file_path (required): Path to PDF file

  • page_number (required): Page number to extract (1-based)

  • extraction_mode (optional): Extraction mode ("default", "layout", "simple")

Example:

Extract text from page 5 of document.pdf with layout preservation

find_and_highlight_text - Text Highlighting Tool

Find text and return information for highlighting matches.

Parameters:

  • file_path (required): Path to PDF file

  • query (required): Text to search for

  • pages (optional): Page range (e.g., "1,3,5-10,-1")

  • case_sensitive (optional): Case-sensitive search (default: False)

Example:

Find all instances of "important" in document.pdf for highlighting

optimize_pdf - PDF Optimization Tool

Optimize PDF file using various compression techniques.

Parameters:

  • file_path (required): Path to source PDF file

  • output_file (optional): Output PDF file path

  • optimization_level (optional): Optimization preset ("light", "medium", "heavy", "maximum")

Example:

Optimize large_document.pdf using medium compression level

compress_pdf_images - PDF Image Compression

Compress images in PDF while preserving document structure.

Parameters:

  • file_path (required): Path to source PDF file

  • output_file (optional): Output PDF file path

  • quality (optional): Image compression quality (1-100, default: 80)

Example:

Compress images in photo_heavy.pdf to 60% quality

remove_pdf_content - PDF Content Remover

Remove specific content from PDF to reduce file size.

Parameters:

  • file_path (required): Path to source PDF file

  • output_file (optional): Output PDF file path

  • remove_images (optional): Remove all images (default: False)

  • remove_annotations (optional): Remove annotations (default: False)

  • compress_streams (optional): Compress content streams (default: True)

Example:

Remove all images and annotations from document.pdf to reduce size

analyze_pdf_size - PDF Size Analysis Tool

Analyze PDF file to identify optimization opportunities.

Parameters:

  • file_path (required): Path to PDF file to analyze

Example:

Analyze large_file.pdf to get optimization recommendations

Output Format

All tools return structured JSON containing relevant data. Text extraction and OCR tools return:

{
  "success": true,
  "file_path": "/path/to/file.pdf",
  "total_pages": 10,
  "processed_pages": [1, 2, 3],
  "chunks": [
    {
      "content": "Extracted text...",
      "page_number": 1,
      "chunk_index": 0,
      "metadata": {
        "quality_score": 0.95,
        "word_count": 150
      }
    }
  ],
  "summary": {
    "total_chunks": 5,
    "total_chars": 2500,
    "pages": [1, 2, 3]
  },
  "extraction_method": "text_extraction"
}

Language Support

OCR Languages

The ocr_pdf tool supports multiple languages via Tesseract:

  • Chinese: chi_sim (Simplified), chi_tra (Traditional)

  • English: eng

  • Combined: chi_sim+eng (mixed Chinese and English)

  • Others: Available based on your Tesseract installation

Performance Features

Caching System

  • File-based invalidation - Cache automatically invalidates when files change

  • Operation-specific caching - Different cache entries for different operations

  • Memory management - Configurable cache size and TTL

Text Quality Analysis

The system automatically analyzes extracted text quality using:

  • Character-to-word ratios

  • Sentence structure analysis

  • Letter-to-character ratios

  • Special character detection

Low-quality text triggers OCR recommendations.

Chunking Strategy

  • Recursive character splitting with semantic separators

  • Configurable overlap to preserve context

  • Metadata preservation including page numbers and positions

Error Handling

The server provides detailed error information:

  • Missing file errors

  • Invalid page range errors

  • OCR engine initialization errors

  • Processing timeout errors

Development

Project Structure

pdfreadermcp/
 pyproject.toml              # uv project configuration
 README.md
 src/pdfreadermcp/
     __init__.py
     __main__.py             # Entry point
     server.py               # MCP server implementation
     tools/
        pdf_reader.py       # Text extraction tool
        pdf_ocr.py          # OCR processing tool
          pdf_operations.py   # PDF splitting, merging, extraction
          pdf_image_converter.py  # PDF-image conversion tools
          pdf_metadata.py         # PDF metadata management
          pdf_text_search.py      # PDF text search and highlighting
          pdf_optimizer.py        # PDF compression and optimization
     utils/
         chunker.py          # Text chunking utilities
         cache.py            # Caching system
         file_handler.py     # File operations

Running Tests

# Install with dev dependencies
uv sync --dev

# Run tests (when available)
uv run pytest

Dependencies

Core Dependencies

  • mcp - Model Context Protocol server framework

  • pypdf - PDF text extraction and manipulation

  • pdf2image - PDF to image conversion

  • pytesseract - Python wrapper for Tesseract OCR

  • tesseract - OCR engine

  • pillow - Image processing and manipulation

System Requirements

  • For OCR: Tesseract OCR engine must be installed

  • For PDF conversion: poppler-utils may be required on some systems

Troubleshooting

Common Issues

1. Tesseract OCR Installation Issues

If Tesseract is not found, you may see errors like "TesseractNotFoundError". Solutions:

Windows:

  • Ensure Tesseract is installed and added to PATH

  • Or set the path manually in your environment:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

macOS/Linux:

  • Install via package manager: brew install tesseract (macOS) or apt install tesseract-ocr (Ubuntu)

  • Make sure Chinese language packs are installed

2. pdf2image Dependencies On Linux, you may need to install poppler:

# Ubuntu/Debian
sudo apt-get install poppler-utils

# CentOS/RHEL  
sudo yum install poppler-utils

3. Chinese Language Pack Issues

If OCR fails for Chinese text or produces poor results:

  • Windows: During Tesseract installation, select "Additional Language Data" and install Chinese packs

  • macOS: brew install tesseract-lang

  • Linux: sudo apt install tesseract-ocr-chi-sim tesseract-ocr-chi-tra

Verify language packs are installed:

tesseract --list-langs

4. Memory Issues with Large PDFs

  • Reduce chunk_size parameter

  • Process pages in smaller ranges

  • Ensure sufficient system memory

  • Lower dpi parameter for faster processing

Performance Tips

  1. Use caching - The same file with same parameters will use cached results

  2. Process specific pages - Use page ranges instead of processing entire documents

  3. Adjust chunk sizes - Smaller chunks for memory-constrained environments

  4. Choose appropriate tools - Use read_pdf first, then ocr_pdf if needed

  5. OCR optimization:

    • Lower dpi (150-200) for faster processing

    • Use chi_sim only if document is purely Chinese

    • Process problematic pages only, not entire document

๐Ÿงช Testing & Quality Assurance

This project has been thoroughly tested with comprehensive test coverage:

  • โœ… 18/18 tools fully functional (100% success rate)

  • ๐Ÿ”ง All JSON serialization issues resolved

  • ๐Ÿ“‹ Extensive testing with real PDF documents

  • โšก Performance validation with caching system

  • ๐ŸŒ Multi-language OCR testing (Chinese/English)

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit issues and enhancement requests.

Support

For questions and support:

Install Server
A
license - permissive license
A
quality
D
maintenance

Maintenance

โ€“Maintainers
โ€“Response time
โ€“Release cycle
โ€“Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/lihongwen/pdfreadermcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server