PDF Reader MCP 📄

Production-ready PDF processing server for AI agents

5-10x faster parallel processing • Y-coordinate content ordering • 94%+ test coverage • 103 tests passing

Based on the original project: this project is a modified version of pdf-reader-mcp.

🚀 Overview

PDF Reader MCP is a production-ready Model Context Protocol (MCP) server that lets AI agents extract text, images, and metadata from PDFs. Pages are processed in parallel, and an error on one page does not abort the rest of the document.

The Problem:

// Traditional PDF processing
- Sequential page processing (slow)
- No natural content ordering
- Complex path handling
- Poor error isolation

The Solution:

// PDF Reader MCP
- 5-10x faster parallel processing ⚡
- Y-coordinate based ordering 📐
- Flexible path support (absolute/relative) 🎯
- Per-page error resilience 🛡️
- 94%+ test coverage ✅

Result: Production-ready PDF processing that scales.

⚡ Key Features

Performance

🚀 5-10x faster than sequential with automatic parallelization
⚡ 12,933 ops/sec error handling, 5,575 ops/sec text extraction
💨 Process 50-page PDFs in seconds with multi-core utilization
📦 Lightweight with minimal dependencies

Developer Experience

🎯 Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
🖼️ Smart Ordering - Y-coordinate based content preserves document layout
🛡️ Type Safe - Full TypeScript with strict mode enabled
📚 Battle-tested - 103 tests, 94%+ coverage, 98%+ function coverage
🎨 Simple API - Single tool handles all operations elegantly

📊 Performance Benchmarks

Real-world performance from production testing:

Operation	Ops/sec	Performance	Use Case
Error handling	12,933	⚡⚡⚡⚡⚡	Validation & safety
Extract full text	5,575	⚡⚡⚡⚡	Document analysis
Extract page	5,329	⚡⚡⚡⚡	Single page ops
Multiple pages	5,242	⚡⚡⚡⚡	Batch processing
Metadata only	4,912	⚡⚡⚡	Quick inspection

Parallel Processing Speedup

Document	Sequential	Parallel	Speedup
10-page PDF	~2s	~0.3s	5-8x faster
50-page PDF	~10s	~1s	10x faster
100+ pages	~20s	~2s	Linear scaling with CPU cores

Benchmarks vary based on PDF complexity and system resources.

📦 Installation

# Quick start - zero installation
npx @sylphx/pdf-reader-mcp

# Using pnpm (recommended)
pnpm add @sylphx/pdf-reader-mcp

# Using npm
npm install @sylphx/pdf-reader-mcp

# Using yarn
yarn add @sylphx/pdf-reader-mcp

# For Claude Desktop (easiest)
npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude

🎯 Quick Start

Configuration

Add to your MCP client (claude_desktop_config.json, Cursor, Cline):

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@bachstudio/pdf-reader-mcp"]
    }
  }
}

Basic Usage

{
  "sources": [{ "path": "documents/report.pdf" }],
  "include_full_text": true,
  "include_metadata": true,
  "include_page_count": true
}

Result:

✅ Full text content extracted
✅ PDF metadata (author, title, dates)
✅ Total page count
✅ Structural sharing - unchanged parts preserved

Extract Specific Pages

{
  "sources": [{ "path": "documents/manual.pdf", "pages": "1-5,10,15-20" }],
  "include_full_text": true
}

Absolute Paths (v1.3.0+)

// Windows - Both formats work!
{
  "sources": [{ "path": "C:\\Users\\John\\Documents\\report.pdf" }],
  "include_full_text": true
}

// Unix/Mac
{
  "sources": [{ "path": "/home/user/documents/contract.pdf" }],
  "include_full_text": true
}

No more "Absolute paths are not allowed" errors!

Extract Images with Natural Ordering

{
  "sources": [{ "path": "presentation.pdf", "pages": [1, 2, 3] }],
  "include_images": true,
  "include_full_text": true
}

Response includes:

Text and images in exact document order (Y-coordinate sorted)
Base64-encoded images with metadata (width, height, format)
Natural reading flow preserved for AI comprehension

Batch Processing

{
  "sources": [
    { "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
    { "path": "/home/user/Q2.pdf", "pages": "1-10" },
    { "url": "https://example.com/Q3.pdf" }
  ],
  "include_full_text": true
}

⚡ All PDFs processed in parallel automatically!

✨ Features

Core Capabilities

✅ Text Extraction - Full document or specific pages with intelligent parsing
✅ Image Extraction - Base64-encoded with complete metadata (width, height, format)
✅ Content Ordering - Y-coordinate based layout preservation for natural reading flow
✅ Metadata Extraction - Author, title, creation date, and custom properties
✅ Page Counting - Fast enumeration without loading full content
✅ Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
✅ Batch Processing - Multiple PDFs processed concurrently

Advanced Features

⚡ 5-10x Performance - Parallel page processing with Promise.all
🎯 Smart Pagination - Extract ranges like "1-5,10-15,20"
🖼️ Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
🛡️ Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
🔍 Error Resilience - Per-page error isolation with detailed messages
📏 Large File Support - Efficient streaming and memory management
📝 Type Safe - Full TypeScript with strict mode enabled
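
The list above pairs parallel page processing via Promise.all with per-page error isolation. A minimal sketch of how those two ideas can fit together follows. The `PageResult` type, the `PageExtractor` signature, and the `extractPagesInParallel` name are hypothetical and chosen for illustration; the server's actual internals may differ.

```typescript
// Result for one page: either the extracted text or an error message.
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Hypothetical extractor signature (an assumption, not the server's API).
type PageExtractor = (page: number) => Promise<string>;

async function extractPagesInParallel(
  pages: number[],
  extractPage: PageExtractor,
): Promise<PageResult[]> {
  // Start every page at once. Each task catches its own failure, so
  // Promise.all never rejects and one bad page cannot discard the others.
  const tasks = pages.map(async (page): Promise<PageResult> => {
    try {
      const text = await extractPage(page);
      return { page, ok: true, text };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { page, ok: false, error };
    }
  });
  // Promise.all preserves input order, so results line up with `pages`.
  return Promise.all(tasks);
}
```

Catching inside each task, rather than around the whole `Promise.all`, is what keeps a single corrupt page from failing the entire document.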

🆕 What's New in v1.3.0

🎉 Absolute Paths Now Supported!

// ✅ Windows
{ "path": "C:\\Users\\John\\Documents\\report.pdf" }
{ "path": "C:/Users/John/Documents/report.pdf" }

// ✅ Unix/Mac
{ "path": "/home/john/documents/report.pdf" }
{ "path": "/Users/john/Documents/report.pdf" }

// ✅ Relative (still works)
{ "path": "documents/report.pdf" }

Other Improvements:

🐛 Fixed Zod validation error handling
📦 Updated all dependencies to latest versions
✅ 103 tests passing, 94%+ coverage maintained

v1.2.0 - Content Ordering

Y-coordinate based text and image ordering
Natural reading flow for AI models
Intelligent line grouping

v1.1.0 - Image Extraction & Performance

Base64-encoded image extraction
10x speedup with parallel processing
Comprehensive test coverage (94%+)

View Full Changelog →

📖 API Reference

`read_pdf` Tool

The single tool that handles all PDF operations.

Parameters

Parameter	Type	Description	Default
`sources`	Array	List of PDF sources to process	Required
`include_full_text`	boolean	Extract full text content	`false`
`include_metadata`	boolean	Extract PDF metadata	`true`
`include_page_count`	boolean	Include total page count	`true`
`include_images`	boolean	Extract embedded images	`false`

Source Object

{
  path?: string;             // Local file path (absolute or relative)
  url?: string;              // HTTP/HTTPS URL to PDF
  pages?: string | number[]; // Pages to extract: "1-5,10" or [1,2,3]
}
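
Because both `path` and `url` are optional in the source object, a client may want to check a source before sending it. The sketch below is one possible client-side check, not the server's own validation. It assumes that exactly one of `path` or `url` should be set and that page numbers are 1-based; neither rule is stated above. `checkSource` is a hypothetical helper name.

```typescript
interface PdfSource {
  path?: string;
  url?: string;
  pages?: string | number[];
}

// Returns a list of problems; an empty list means the source looks usable.
function checkSource(source: PdfSource): string[] {
  const problems: string[] = [];
  const hasPath = typeof source.path === "string" && source.path.length > 0;
  const hasUrl = typeof source.url === "string" && source.url.length > 0;

  // Assumption: a source names exactly one location.
  if (hasPath === hasUrl) {
    problems.push("Provide exactly one of `path` or `url`.");
  }
  // The source object documents HTTP/HTTPS URLs only.
  if (hasUrl && !/^https?:\/\//i.test(source.url ?? "")) {
    problems.push("`url` must start with http:// or https://.");
  }
  // Assumption: page numbers in the array form are 1-based integers.
  if (
    Array.isArray(source.pages) &&
    source.pages.some((p) => !Number.isInteger(p) || p < 1)
  ) {
    problems.push("`pages` entries must be positive integers.");
  }
  return problems;
}
```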

Examples

Metadata only (fast):

{
  "sources": [{ "path": "large.pdf" }],
  "include_metadata": true,
  "include_page_count": true,
  "include_full_text": false
}

From URL:

{
  "sources": [{ "url": "https://arxiv.org/pdf/2301.00001.pdf" }],
  "include_full_text": true
}

Page ranges:

{
  "sources": [{
    "path": "manual.pdf",
    "pages": "1-5,10-15,20" // Pages 1,2,3,4,5,10,11,12,13,14,15,20
  }]
}
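
To make the range syntax concrete, here is a small sketch that expands a string such as "1-5,10-15,20" into individual page numbers. `parsePageRanges` is a hypothetical helper, not the server's parser. Deduplicating, sorting, and rejecting page 0 are assumptions of this sketch.

```typescript
// Expand a page spec like "1-5,10-15,20" into a sorted list of page numbers.
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas

    // Either a single page ("10") or an inclusive range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);

    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  // The Set removes duplicates from overlapping ranges; sort numerically.
  return Array.from(pages).sort((a, b) => a - b);
}
```

With this sketch, `parsePageRanges("1-5,10-15,20")` yields the same twelve pages listed in the comment of the example above.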

🔧 Advanced Usage

Content is returned in natural reading order based on Y-coordinates:

Document Layout:
┌─────────────────────┐
│ [Title]    Y:100    │
│ [Image]    Y:150    │
│ [Text]     Y:400    │
│ [Photo A]  Y:500    │
│ [Photo B]  Y:550    │
└─────────────────────┘

Response Order:
[
  { type: "text", text: "Title..." },
  { type: "image", data: "..." },
  { type: "text", text: "..." },
  { type: "image", data: "..." },
  { type: "image", data: "..." }
]

Benefits:

AI understands spatial relationships
Natural document comprehension
Perfect for vision-enabled models
Automatic multi-line text grouping
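
As a rough illustration of Y-coordinate ordering, the sketch below sorts mixed text and image items by a `y` value. The `ContentItem` type, the `y` field, and `orderByY` are hypothetical. The sketch follows the diagram's convention that a smaller Y is nearer the top of the page. PDF's native coordinate space has its origin at the bottom-left, so a real implementation may need to flip the axis first.

```typescript
// Hypothetical item shape: the response items plus a vertical position.
type ContentItem =
  | { type: "text"; text: string; y: number }
  | { type: "image"; data: string; y: number };

// Order items top to bottom, assuming smaller y means higher on the page.
function orderByY(items: ContentItem[]): ContentItem[] {
  // Copy first so the caller's array is left untouched. Array sort is
  // stable in ES2019+, so items with equal y keep their input order.
  return items.slice().sort((a, b) => a.y - b.y);
}

// The layout from the diagram, deliberately shuffled.
const shuffled: ContentItem[] = [
  { type: "image", data: "photo-b", y: 550 },
  { type: "text", text: "Body text", y: 400 },
  { type: "text", text: "Title", y: 100 },
  { type: "image", data: "photo-a", y: 500 },
  { type: "image", data: "figure", y: 150 },
];

const ordered = orderByY(shuffled);
console.log(ordered.map((item) => item.type).join(", "));
// prints "text, image, text, image, image"
```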

Enable extraction:

{
  "sources": [{ "path": "manual.pdf" }],
  "include_images": true
}

Response format:

{
  "images": [{
    "page": 1,
    "index": 0,
    "width": 1920,
    "height": 1080,
    "format": "rgb",
    "data": "base64-encoded-png..."
  }]
}

Supported formats: RGB, RGBA, Grayscale
Auto-detected: JPEG, PNG, and other embedded formats

Absolute paths (v1.3.0+) - Direct file access:

{ "path": "C:\\Users\\John\\file.pdf" }
{ "path": "/home/user/file.pdf" }

Relative paths - Workspace files:

{ "path": "docs/report.pdf" }
{ "path": "./2024/Q1.pdf" }

Configure working directory:

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/documents"
    }
  }
}
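
The difference between absolute and relative paths, and the role of `cwd`, can be sketched as follows. Both helpers are hypothetical and only approximate what a path library does; a real implementation would more likely rely on Node's `path` module. The sketch assumes relative paths are resolved against the configured working directory.

```typescript
// True for Windows drive paths, Windows UNC paths, and Unix/Mac absolute paths.
function isAbsolutePdfPath(p: string): boolean {
  return (
    /^[A-Za-z]:[\\/]/.test(p) || // C:\Users\... or C:/Users/...
    /^\\\\/.test(p) ||           // \\server\share\...
    /^\//.test(p)                // /home/user/...
  );
}

// Resolve a path against a working directory (simplified: no ".." handling).
function resolveAgainstCwd(p: string, cwd: string): string {
  if (isAbsolutePdfPath(p)) return p;          // absolute paths pass through
  const base = cwd.replace(/[\\/]+$/, "");     // drop trailing separators
  const relative = p.replace(/^\.[\\/]/, "");  // drop a leading "./"
  return `${base}/${relative}`;
}

console.log(resolveAgainstCwd("./2024/Q1.pdf", "/path/to/documents"));
// prints "/path/to/documents/2024/Q1.pdf"
```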

Strategy 1: Page ranges

{
  "sources": [{ "path": "big.pdf", "pages": "1-20" }]
}

Strategy 2: Progressive loading

// Step 1: Get page count
{
  "sources": [{ "path": "big.pdf" }],
  "include_full_text": false
}

// Step 2: Extract sections
{
  "sources": [{ "path": "big.pdf", "pages": "50-75" }]
}

Strategy 3: Parallel batching

{
  "sources": [
    { "path": "big.pdf", "pages": "1-50" },
    { "path": "big.pdf", "pages": "51-100" }
  ]
}
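
Strategies 2 and 3 combine naturally: once the page count is known, the `sources` array for a batched request can be generated rather than written by hand. The sketch below shows one way to do that on the client side. `buildPageBatches` and `PageBatchSource` are hypothetical names, and the batch size is the caller's choice.

```typescript
interface PageBatchSource {
  path: string;
  pages: string;
}

// Split pages 1..totalPages into consecutive ranges of at most batchSize pages.
function buildPageBatches(
  path: string,
  totalPages: number,
  batchSize: number,
): PageBatchSource[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const sources: PageBatchSource[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    // The last batch may be shorter than batchSize.
    const end = Math.min(start + batchSize - 1, totalPages);
    sources.push({ path, pages: start === end ? `${start}` : `${start}-${end}` });
  }
  return sources;
}

console.log(JSON.stringify(buildPageBatches("big.pdf", 100, 50)));
// prints [{"path":"big.pdf","pages":"1-50"},{"path":"big.pdf","pages":"51-100"}]
```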

🔧 Troubleshooting

"Absolute paths are not allowed"

Solution: Upgrade to v1.3.0+

npm update @sylphx/pdf-reader-mcp

Restart your MCP client completely.

"File not found"

Causes:

File doesn't exist at path
Wrong working directory
Permission issues

Solutions:

Use absolute path:

{ "path": "C:\\Full\\Path\\file.pdf" }

Or configure cwd:

{
  "pdf-reader-mcp": {
    "command": "npx",
    "args": ["@sylphx/pdf-reader-mcp"],
    "cwd": "/path/to/docs"
  }
}

"No tools showing up"

Solution:

npm cache clean --force
rm -rf node_modules package-lock.json
npm install @sylphx/pdf-reader-mcp@latest

Restart MCP client completely.

🏗️ Architecture

Tech Stack

Component	Technology
Runtime	Node.js 22+ ESM
PDF Engine	PDF.js (Mozilla)
Validation	Zod + JSON Schema
Protocol	MCP SDK
Language	TypeScript (strict)
Testing	Vitest (103 tests)
Quality	Biome (50x faster)
CI/CD	GitHub Actions

Design Principles

🔒 Security First - Flexible paths with secure defaults
🎯 Simple Interface - One tool, all operations
⚡ Performance - Parallel processing, efficient memory
🛡️ Reliability - Per-page isolation, detailed errors
🧪 Quality - 94%+ coverage, strict TypeScript
📝 Type Safety - No any types, strict mode
🔄 Backward Compatible - Smooth upgrades always

🧪 Development

Prerequisites:

Node.js >= 22.0.0
pnpm (recommended) or npm

Setup:

git clone https://github.com/SylphxAI/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install && pnpm build

Scripts:

pnpm run build      # Build TypeScript
pnpm run test       # Run 103 tests
pnpm run test:cov   # Coverage (94%+)
pnpm run check      # Lint + format
pnpm run check:fix  # Auto-fix
pnpm run benchmark  # Performance tests

Quality:

✅ 103 tests
✅ 94%+ coverage
✅ 98%+ function coverage
✅ Zero lint errors
✅ Strict TypeScript

Quick Start:

Fork repository
Create branch: git checkout -b feature/awesome
Make changes, then run the tests: pnpm test
Format: pnpm run check:fix
Commit: Use Conventional Commits
Open PR

Commit Format:

feat(images): add WebP support
fix(paths): handle UNC paths
docs(readme): update examples
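
For contributors who want to check a commit header before pushing, the sketch below parses the `type(scope): description` shape shown above. `parseCommitHeader` is a hypothetical helper, not part of this repository's tooling. It covers only the header line of the Conventional Commits format and does not restrict which types are allowed.

```typescript
interface ParsedCommit {
  type: string;
  scope?: string;
  description: string;
}

// Parse "type(scope): description"; the scope and a "!" marker are optional.
function parseCommitHeader(header: string): ParsedCommit | null {
  const match = /^([a-z]+)(?:\(([^()\s]+)\))?(!)?: (.+)$/.exec(header);
  if (!match) return null;

  const [, type, scope, , description] = match;
  if (!type || !description) return null;

  // Only include `scope` when the header actually has one.
  return scope ? { type, scope, description } : { type, description };
}

console.log(JSON.stringify(parseCommitHeader("feat(images): add WebP support")));
// prints {"type":"feat","scope":"images","description":"add WebP support"}
```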

See CONTRIBUTING.md

📚 Documentation

📖 Full Docs - Complete guides
🚀 Getting Started - Quick start
📘 API Reference - Detailed API
🏗️ Design - Architecture
⚡ Performance - Benchmarks
🔍 Comparison - vs. alternatives

🗺️ Roadmap

✅ Completed

Image extraction (v1.1.0)
5-10x parallel speedup (v1.1.0)
Y-coordinate ordering (v1.2.0)
Absolute paths (v1.3.0)
94%+ test coverage (v1.3.0)

🚀 Next

OCR for scanned PDFs
Annotation extraction
Form field extraction
Table detection
100+ MB streaming
Advanced caching
PDF generation

Vote at Discussions

🏆 Recognition

Featured on:

Smithery - MCP directory
Glama - AI marketplace
MseeP.ai - Security validated

Trusted worldwide • Enterprise adoption • Battle-tested

🤝 Support

GitHub Issues • Discord

Show Your Support: ⭐ Star • 👀 Watch • 🐛 Report bugs • 💡 Suggest features • 🔀 Contribute

📊 Stats

103 Tests • 94%+ Coverage • Production Ready

📄 License

MIT © Sylphx

🙏 Credits

Built with:

PDF.js - Mozilla PDF engine
MCP SDK - Model Context Protocol
Vitest - Fast testing framework

Special thanks to the open source community ❤️