Which integrations are available for this server?

Supports converting PDF documents and extracted tables into structured Markdown content to preserve document layout in text-based environments. Enables the extraction of vector graphics, such as schematics, charts, and technical drawings, from PDF files into SVG format.

How do I use MCP PDF?

1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@MCP PDF summarize the quarterly report PDF for me".

That's it! The server will respond to your query, and you can continue using it as needed.

📄 MCP PDF

A FastMCP server for PDF processing

46 tools for text extraction, OCR, tables, forms, annotations, and more

Python 3.11+ FastMCP License: MIT PyPI

What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
OCR for scanned documents via Tesseract
Form handling - extract, fill, and create PDF forms
Document assembly - merge, split, reorder pages
Annotations - sticky notes, highlights, stamps
Vector graphics - extract to SVG for schematics and technical drawings

Quick Start

# Install from PyPI
uvx mcp-pdf

# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf

git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

Tool	What it does
`extract_text`	Pull text from PDF pages with automatic chunking for large files
`extract_tables`	Extract tables to JSON, CSV, or Markdown
`extract_images`	Extract embedded images
`extract_links`	Get all hyperlinks with page filtering
`pdf_to_markdown`	Convert PDF to markdown preserving structure
`ocr_pdf`	OCR scanned documents using Tesseract
`extract_vector_graphics`	Export vector graphics to SVG (schematics, charts, drawings)

Document Analysis

Tool	What it does
`extract_metadata`	Get title, author, creation date, page count, etc.
`get_document_structure`	Extract table of contents and bookmarks
`analyze_layout`	Detect columns, headers, footers
`is_scanned_pdf`	Check if PDF needs OCR
`compare_pdfs`	Diff two PDFs by text, structure, or metadata
`analyze_pdf_health`	Check for corruption, optimization opportunities
`analyze_pdf_security`	Report encryption, permissions, signatures

Forms

Tool	What it does
`extract_form_data`	Get form field names and values
`fill_form_pdf`	Fill form fields from JSON
`create_form_pdf`	Create new forms with text fields, checkboxes, dropdowns
`add_form_fields`	Add fields to existing PDFs

Permit Forms (Coordinate-Based)

For scanned PDFs or forms without interactive fields. Draws text at (x, y) coordinates.

Tool	What it does
`fill_permit_form`	Fill any PDF by drawing at coordinates (works with scanned forms)
`get_field_schema`	Get field definitions for validation or UI generation
`validate_permit_form_data`	Check data against field schema before filling
`preview_field_positions`	Generate PDF showing field boundaries (debugging)
`insert_attachment_pages`	Insert image/text pages with "See page X" references

Requires: pip install mcp-pdf[forms] (adds reportlab dependency)

Document Assembly

Tool	What it does
`merge_pdfs`	Combine multiple PDFs with bookmark preservation
`split_pdf_by_pages`	Split by page ranges
`split_pdf_by_bookmarks`	Split at chapter/section boundaries
`reorder_pdf_pages`	Rearrange pages in custom order

Annotations

Tool	What it does
`add_sticky_notes`	Add comment annotations
`add_highlights`	Highlight text regions
`add_stamps`	Add Approved/Draft/Confidential stamps
`extract_all_annotations`	Export annotations to JSON

How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

PyMuPDF (fastest)
pdfplumber (better for complex layouts)
pypdf (most compatible)

Table extraction:

Camelot (best accuracy, requires Ghostscript)
pdfplumber (no dependencies)
Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
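
The fallback order above can be pictured as a loop over candidate extractors. The following is a minimal sketch, not the server's actual code: `extract_with_fallback`, `ExtractionError`, and the three stand-in extractor functions are hypothetical names, and treating empty output as a failure is an assumption made for illustration.

```python
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every extractor in the chain has failed."""


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) pair in order; return (name, text) of the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        # Assumption: empty output counts as a failure and triggers the next library.
        if text.strip():
            return name, text
        errors.append(f"{name}: returned no text")
    raise ExtractionError("; ".join(errors))


# Hypothetical stand-ins for wrappers around PyMuPDF, pdfplumber, and pypdf.
def fast_but_fragile(path: str) -> str:
    raise ValueError("cannot parse xref table")


def layout_aware(path: str) -> str:
    return ""  # simulates a library that opens the file but finds no text


def most_compatible(path: str) -> str:
    return f"text from {path}"


chain = [
    ("pymupdf", fast_but_fragile),
    ("pdfplumber", layout_aware),
    ("pypdf", most_compatible),
]
method, text = extract_with_fallback("report.pdf", chain)
print(method, "->", text)
```

Collecting the per-library error messages means that, if every method fails, the final exception reports why each one failed rather than only the last.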

Token Management

Large PDFs can overflow MCP response limits. The server handles this:

Automatic chunking splits large documents into page groups
Table row limits prevent huge tables from blowing up responses
Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
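
To illustrate what splitting a document into page groups might look like, here is a rough sketch under stated assumptions. The function name `chunk_pages`, the 2000-token budget, and the four-characters-per-token estimate are all hypothetical; the server's actual chunking logic and limits are not documented here.

```python
def chunk_pages(
    page_texts: list[str],
    max_tokens: int = 2000,
    chars_per_token: int = 4,
) -> list[list[int]]:
    """Group 1-based page numbers so each group's estimated tokens stay within max_tokens.

    A single page that exceeds the budget on its own still gets its own group;
    this sketch never splits inside a page.
    """
    chunks: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for page_number, text in enumerate(page_texts, start=1):
        # Crude size estimate (assumption): roughly chars_per_token characters per token.
        page_tokens = max(1, len(text) // chars_per_token)
        if current and current_tokens + page_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(page_number)
        current_tokens += page_tokens
    if current:
        chunks.append(current)
    return chunks


# Five pages with estimated sizes of 1000, 1000, 500, 2000, and 100 tokens.
pages = ["x" * 4000, "x" * 4000, "x" * 2000, "x" * 8000, "x" * 400]
print(chunk_pages(pages, max_tokens=2000))  # [[1, 2], [3], [4], [5]]
```

A caller could then request each group in turn through the `pages` parameter shown above.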

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
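
One plausible way to cache downloads is to key the local file name on a hash of the URL. The sketch below assumes exactly that; `cached_pdf_path`, `fetch_pdf`, and the injected `download` callable are hypothetical names, and rejecting non-HTTPS URLs is an assumption based on the wording above. The example uses a fake downloader so that it runs offline.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def cached_pdf_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.pdf"


def fetch_pdf(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local path for url, calling download only on a cache miss."""
    if urlparse(url).scheme != "https":
        raise ValueError("only HTTPS URLs are accepted")
    target = cached_pdf_path(url, cache_dir)
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(url))
    return target


# Fake downloader that records each call instead of touching the network.
calls: list[str] = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"%PDF-1.7 fake"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = fetch_pdf("https://example.com/report.pdf", cache, fake_download)
    second = fetch_pdf("https://example.com/report.pdf", cache, fake_download)
    print(first == second, len(calls))  # True 1
```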

System Dependencies

Some features require system packages:

Feature	Dependency
OCR	`tesseract-ocr`
Camelot tables	`ghostscript`
Tabula tables	`default-jre-headless`
PDF to images	`poppler-utils`

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless

Configuration

Optional environment variables:

Variable	Purpose
`MCP_PDF_ALLOWED_PATHS`	Colon-separated directories for file output
`PDF_TEMP_DIR`	Temp directory for processing (default: `/tmp/mcp-pdf-processing`)
`TESSDATA_PREFIX`	Tesseract language data location
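
As an illustration of how a colon-separated allow-list such as `MCP_PDF_ALLOWED_PATHS` could be enforced, here is a hedged sketch. The helper names `allowed_output_dirs` and `is_output_allowed` are hypothetical, the example paths are made up, and the server's real validation (including what happens when the variable is unset) may differ. The sketch assumes POSIX-style paths.

```python
import os
from pathlib import Path


def allowed_output_dirs(env=None) -> list[Path]:
    """Parse MCP_PDF_ALLOWED_PATHS into resolved directories (empty list if unset)."""
    source = os.environ if env is None else env
    raw = source.get("MCP_PDF_ALLOWED_PATHS", "")
    return [Path(part).resolve() for part in raw.split(":") if part]


def is_output_allowed(target: str, allowed: list[Path]) -> bool:
    """True if target, after resolving '..' and symlinks, lies inside an allowed directory."""
    resolved = Path(target).resolve()
    return any(resolved == base or base in resolved.parents for base in allowed)


# Hypothetical configuration for demonstration.
env = {"MCP_PDF_ALLOWED_PATHS": "/srv/pdf-out:/home/user/reports"}
dirs = allowed_output_dirs(env)
print(is_output_allowed("/srv/pdf-out/merged.pdf", dirs))
print(is_output_allowed("/srv/pdf-out/../secrets/x.pdf", dirs))
print(is_output_allowed("/etc/passwd", dirs))
```

Resolving the target before comparing is what stops a `..` segment from escaping an allowed directory.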

Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT
