
📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Python 3.11+ · FastMCP · License: MIT · PyPI

Works great with MCP Office Tools


What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

  • Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)

  • Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)

  • OCR for scanned documents via Tesseract

  • Form handling - extract, fill, and create PDF forms

  • Document assembly - merge, split, reorder pages

  • Annotations - sticky notes, highlights, stamps

  • Vector graphics - extract to SVG for schematics and technical drawings


Quick Start

# Install from PyPI
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf

# Or install from source for development
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

| Tool | What it does |
| --- | --- |
| extract_text | Pull text from PDF pages with automatic chunking for large files |
| extract_tables | Extract tables to JSON, CSV, or Markdown |
| extract_images | Extract embedded images |
| extract_links | Get all hyperlinks with page filtering |
| pdf_to_markdown | Convert PDF to Markdown, preserving structure |
| ocr_pdf | OCR scanned documents using Tesseract |
| extract_vector_graphics | Export vector graphics to SVG (schematics, charts, drawings) |

Document Analysis

| Tool | What it does |
| --- | --- |
| extract_metadata | Get title, author, creation date, page count, etc. |
| get_document_structure | Extract table of contents and bookmarks |
| analyze_layout | Detect columns, headers, footers |
| is_scanned_pdf | Check if PDF needs OCR |
| compare_pdfs | Diff two PDFs by text, structure, or metadata |
| analyze_pdf_health | Check for corruption and optimization opportunities |
| analyze_pdf_security | Report encryption, permissions, signatures |

Forms

| Tool | What it does |
| --- | --- |
| extract_form_data | Get form field names and values |
| fill_form_pdf | Fill form fields from JSON |
| create_form_pdf | Create new forms with text fields, checkboxes, dropdowns |
| add_form_fields | Add fields to existing PDFs |

Permit Forms (Coordinate-Based)

These tools are for scanned PDFs or forms without interactive fields. They draw text at (x, y) coordinates instead of filling form fields.

| Tool | What it does |
| --- | --- |
| fill_permit_form | Fill any PDF by drawing at coordinates (works with scanned forms) |
| get_field_schema | Get field definitions for validation or UI generation |
| validate_permit_form_data | Check data against field schema before filling |
| preview_field_positions | Generate PDF showing field boundaries (debugging) |
| insert_attachment_pages | Insert image/text pages with "See page X" references |

Requires: pip install "mcp-pdf[forms]" (adds the reportlab dependency; the quotes keep shells such as zsh from expanding the brackets)

Document Assembly

| Tool | What it does |
| --- | --- |
| merge_pdfs | Combine multiple PDFs with bookmark preservation |
| split_pdf_by_pages | Split by page ranges |
| split_pdf_by_bookmarks | Split at chapter/section boundaries |
| reorder_pdf_pages | Rearrange pages in custom order |

Annotations

| Tool | What it does |
| --- | --- |
| add_sticky_notes | Add comment annotations |
| add_highlights | Highlight text regions |
| add_stamps | Add Approved/Draft/Confidential stamps |
| extract_all_annotations | Export annotations to JSON |


How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

  1. PyMuPDF (fastest)

  2. pdfplumber (better for complex layouts)

  3. pypdf (most compatible)

Table extraction:

  1. Camelot (best accuracy, requires Ghostscript)

  2. pdfplumber (no dependencies)

  3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.


Token Management

Large PDFs can overflow MCP response limits. The server handles this:

  • Automatic chunking splits large documents into page groups

  • Table row limits prevent huge tables from blowing up responses

  • Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
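
To illustrate how a page specification such as "1-10" could be turned into page groups, here is a small sketch. The helper names `parse_pages` and `chunk_pages` and the default chunk size of 10 are assumptions for illustration; the server's own chunking logic may differ.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-3,7" into 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages


def chunk_pages(pages: list[int], chunk_size: int = 10) -> list[list[int]]:
    """Split a page list into consecutive groups of at most chunk_size pages."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


# A 25-page request becomes three groups of 10, 10, and 5 pages
chunks = chunk_pages(parse_pages("1-25"), chunk_size=10)
```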

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
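
One common way to implement this kind of cache is to derive a stable file name from the URL, so that a second request for the same URL finds the file already on disk. The sketch below assumes that approach; `cache_path_for` is a hypothetical helper, and using the default PDF_TEMP_DIR location as the cache directory is also an assumption.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map an HTTPS URL to a deterministic file path inside cache_dir."""
    if urlparse(url).scheme != "https":
        raise ValueError("only HTTPS URLs are accepted")
    # The same URL always hashes to the same name, so later calls can reuse the file
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"


cache_dir = Path("/tmp/mcp-pdf-processing")  # assumed cache location
first = cache_path_for("https://example.com/report.pdf", cache_dir)
second = cache_path_for("https://example.com/report.pdf", cache_dir)
```

A caller would download the file only when the returned path does not exist yet.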


System Dependencies

Some features require system packages:

| Feature | Dependency |
| --- | --- |
| OCR | tesseract-ocr |
| Camelot tables | ghostscript |
| Tabula tables | default-jre-headless |
| PDF to images | poppler-utils |

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless

Configuration

Optional environment variables:

| Variable | Purpose |
| --- | --- |
| MCP_PDF_ALLOWED_PATHS | Colon-separated directories for file output |
| PDF_TEMP_DIR | Temp directory for processing (default: /tmp/mcp-pdf-processing) |
| TESSDATA_PREFIX | Tesseract language data location |


Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT
