
📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Python 3.11+ | FastMCP | License: MIT | PyPI


What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

  • Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)

  • Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)

  • OCR for scanned documents via Tesseract

  • Form handling - extract, fill, and create PDF forms

  • Document assembly - merge, split, reorder pages

  • Annotations - sticky notes, highlights, stamps

  • Vector graphics - extract to SVG for schematics and technical drawings


Quick Start

    # Install from PyPI
    uvx mcp-pdf

    # Or add to Claude Code
    claude mcp add pdf-tools uvx mcp-pdf

    # Or install from source
    git clone https://github.com/rsp2k/mcp-pdf
    cd mcp-pdf
    uv sync

    # System dependencies (Ubuntu/Debian)
    sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

    # Verify
    uv run python examples/verify_installation.py

Tools

Content Extraction

| Tool | What it does |
|------|--------------|
| extract_text | Pull text from PDF pages with automatic chunking for large files |
| extract_tables | Extract tables to JSON, CSV, or Markdown |
| extract_images | Extract embedded images |
| extract_links | Get all hyperlinks with page filtering |
| pdf_to_markdown | Convert PDF to markdown preserving structure |
| ocr_pdf | OCR scanned documents using Tesseract |
| extract_vector_graphics | Export vector graphics to SVG (schematics, charts, drawings) |

Document Analysis

| Tool | What it does |
|------|--------------|
| extract_metadata | Get title, author, creation date, page count, etc. |
| get_document_structure | Extract table of contents and bookmarks |
| analyze_layout | Detect columns, headers, footers |
| is_scanned_pdf | Check if PDF needs OCR |
| compare_pdfs | Diff two PDFs by text, structure, or metadata |
| analyze_pdf_health | Check for corruption, optimization opportunities |
| analyze_pdf_security | Report encryption, permissions, signatures |

Forms

| Tool | What it does |
|------|--------------|
| extract_form_data | Get form field names and values |
| fill_form_pdf | Fill form fields from JSON |
| create_form_pdf | Create new forms with text fields, checkboxes, dropdowns |
| add_form_fields | Add fields to existing PDFs |

Permit Forms (Coordinate-Based)

For scanned PDFs or forms without interactive fields, these tools draw text at (x, y) coordinates.

| Tool | What it does |
|------|--------------|
| fill_permit_form | Fill any PDF by drawing at coordinates (works with scanned forms) |
| get_field_schema | Get field definitions for validation or UI generation |
| validate_permit_form_data | Check data against field schema before filling |
| preview_field_positions | Generate PDF showing field boundaries (debugging) |
| insert_attachment_pages | Insert image/text pages with "See page X" references |

Requires: pip install mcp-pdf[forms] (adds reportlab dependency)

Document Assembly

| Tool | What it does |
|------|--------------|
| merge_pdfs | Combine multiple PDFs with bookmark preservation |
| split_pdf_by_pages | Split by page ranges |
| split_pdf_by_bookmarks | Split at chapter/section boundaries |
| reorder_pdf_pages | Rearrange pages in custom order |

Annotations

| Tool | What it does |
|------|--------------|
| add_sticky_notes | Add comment annotations |
| add_highlights | Highlight text regions |
| add_stamps | Add Approved/Draft/Confidential stamps |
| extract_all_annotations | Export annotations to JSON |


How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

  1. PyMuPDF (fastest)

  2. pdfplumber (better for complex layouts)

  3. pypdf (most compatible)

Table extraction:

  1. Camelot (best accuracy, requires Ghostscript)

  2. pdfplumber (no dependencies)

  3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
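
The fallback order described above can be pictured as a simple chain. The following is a minimal sketch, not the server's actual code: `extract_with_fallback`, `AllExtractorsFailed`, and the three stand-in backends are hypothetical names, and treating an empty result as a failure is an assumption made for illustration.

```python
from typing import Callable, Sequence


class AllExtractorsFailed(Exception):
    """Raised when every backend in the chain fails (hypothetical name)."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any backend error moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Assumption: an empty result counts as a failure too.
        errors.append(f"{name}: empty result")
    raise AllExtractorsFailed("; ".join(errors))


# Stand-in backends so the sketch runs without any PDF library installed.
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse")


def empty_backend(path: str) -> str:
    return ""


def working_backend(path: str) -> str:
    return f"text from {path}"


used, text = extract_with_fallback(
    "report.pdf",
    [("first", broken_backend), ("second", empty_backend), ("third", working_backend)],
)
print(used, "->", text)
```

In a real chain the stand-ins would wrap PyMuPDF, pdfplumber, and pypdf in that order; collecting the per-backend errors makes it easier to report why the earlier attempts failed.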


Token Management

Large PDFs can overflow MCP response limits. The server handles this:

  • Automatic chunking splits large documents into page groups

  • Table row limits prevent huge tables from blowing up responses

  • Summary mode returns structure without full content

    # Get first 10 pages
    result = await extract_text("huge.pdf", pages="1-10")

    # Limit table rows
    tables = await extract_tables("data.pdf", max_rows_per_table=50)

    # Structure only
    tables = await extract_tables("data.pdf", summary_only=True)
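
To make the chunking and row-limit ideas concrete, here is a small sketch of how page groups and truncated tables could be computed. It is an illustration only: `chunk_pages` and `limit_rows` are hypothetical helpers, and the chunk size is an arbitrary example value rather than the server's real setting.

```python
def chunk_pages(total_pages: int, pages_per_chunk: int) -> list[str]:
    """Split pages 1..total_pages into ranges formatted like the pages="1-10" argument."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    ranges = []
    for start in range(1, total_pages + 1, pages_per_chunk):
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


def limit_rows(rows: list[list[str]], max_rows: int) -> tuple[list[list[str]], bool]:
    """Return at most max_rows rows plus a flag saying whether rows were dropped."""
    return rows[:max_rows], len(rows) > max_rows


page_groups = chunk_pages(total_pages=25, pages_per_chunk=10)
print(page_groups)  # ['1-10', '11-20', '21-25']

table = [[str(i), f"row {i}"] for i in range(120)]
kept, truncated = limit_rows(table, max_rows=50)
print(len(kept), truncated)  # 50 True
```

Each range string could then be passed as the `pages` argument of a separate `extract_text` call, keeping every response below the size limit.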

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
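
One way to picture the download-then-cache behaviour is a cache keyed by a hash of the URL. This is a sketch under stated assumptions, not the server's implementation: `cache_path_for`, `fetch_pdf`, the SHA-256 file naming, and the injectable `download` callable are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlparse
from urllib.request import urlopen


def cache_path_for(url: str, cache_dir: str) -> Path:
    """Map an HTTPS URL to a stable file name inside cache_dir."""
    if urlparse(url).scheme != "https":
        raise ValueError("only HTTPS URLs are accepted")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.pdf"


def fetch_pdf(
    url: str,
    cache_dir: str,
    download: Optional[Callable[[str], bytes]] = None,
) -> Path:
    """Return the cached copy of url, downloading it only on first use."""
    target = cache_path_for(url, cache_dir)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    if download is None:
        with urlopen(url, timeout=30) as response:
            data = response.read()
    else:
        data = download(url)  # injectable so the sketch runs offline
    target.write_bytes(data)
    return target


# Offline demonstration with a fake downloader.
calls = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"%PDF-1.4 demo"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_pdf("https://example.com/report.pdf", tmp, download=fake_download)
    second = fetch_pdf("https://example.com/report.pdf", tmp, download=fake_download)
    same_file = first == second
    downloads = len(calls)

print(same_file, downloads)  # True 1
```

The second call finds the file on disk and skips the download, which is the effect the caching note above describes.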


System Dependencies

Some features require system packages:

| Feature | Dependency |
|---------|------------|
| OCR | tesseract-ocr |
| Camelot tables | ghostscript |
| Tabula tables | default-jre-headless |
| PDF to images | poppler-utils |

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
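
A quick way to see which optional features are usable on a machine is to look for the corresponding executables on `PATH`. The sketch below is not the project's `examples/verify_installation.py`; `REQUIRED_BINARIES` and `check_system_dependencies` are hypothetical names, and the binary names (`tesseract`, `gs`, `java`, `pdftoppm`) are the ones these packages usually install on Debian/Ubuntu.

```python
import shutil
from typing import Callable, Optional

# Feature -> executable normally provided by the matching system package.
REQUIRED_BINARIES = {
    "OCR": "tesseract",          # tesseract-ocr
    "Camelot tables": "gs",      # ghostscript
    "Tabula tables": "java",     # default-jre-headless
    "PDF to images": "pdftoppm", # poppler-utils
}


def check_system_dependencies(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict[str, bool]:
    """Report, per feature, whether its executable is found on PATH."""
    return {
        feature: which(binary) is not None
        for feature, binary in REQUIRED_BINARIES.items()
    }


for feature, available in check_system_dependencies().items():
    print(f"{feature}: {'ok' if available else 'missing'}")
```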

Configuration

Optional environment variables:

| Variable | Purpose |
|----------|---------|
| MCP_PDF_ALLOWED_PATHS | Colon-separated directories for file output |
| PDF_TEMP_DIR | Temp directory for processing (default: /tmp/mcp-pdf-processing) |
| TESSDATA_PREFIX | Tesseract language data location |
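
As an illustration of how a colon-separated allow-list such as `MCP_PDF_ALLOWED_PATHS` can be applied, the sketch below parses the variable and checks whether an output path falls inside one of the listed directories. The helper names are hypothetical, and the behaviour when the variable is unset is not documented here; this sketch simply denies everything in that case.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def allowed_output_dirs(env: Optional[Mapping[str, str]] = None) -> list[Path]:
    """Parse MCP_PDF_ALLOWED_PATHS into resolved directory paths."""
    env = os.environ if env is None else env
    raw = env.get("MCP_PDF_ALLOWED_PATHS", "")
    return [Path(part).resolve() for part in raw.split(":") if part]


def is_allowed_output(path: str, allowed: list[Path]) -> bool:
    """True if path, after resolving '..' and symlinks, lies inside an allowed directory."""
    target = Path(path).resolve()
    return any(target.is_relative_to(base) for base in allowed)


allowed = allowed_output_dirs({"MCP_PDF_ALLOWED_PATHS": "/tmp/out:/srv/pdfs"})
print(is_allowed_output("/tmp/out/filled.pdf", allowed))         # True
print(is_allowed_output("/tmp/out/../secret/x.pdf", allowed))    # False
```

Resolving the path before the comparison is what stops `..` segments from escaping an allowed directory.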


Development

    # Run tests
    uv run pytest

    # With coverage
    uv run pytest --cov=mcp_pdf

    # Format
    uv run black src/ tests/

    # Lint
    uv run ruff check src/ tests/

License

MIT
