Enables the extraction of full text, metadata, and images from research papers hosted on arXiv by processing their PDF URLs.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@PDF Reader MCP extract the text from pages 1-5 of documents/report.pdf".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
PDF Reader MCP
Production-ready PDF processing server for AI agents
5-10x faster parallel processing • Y-coordinate content ordering • 94%+ test coverage • 103 tests passing
Based on the original project: this project is a modified version of pdf-reader-mcp.
Overview
PDF Reader MCP is a production-ready Model Context Protocol server that empowers AI agents with enterprise-grade PDF processing capabilities. Extract text, images, and metadata with unmatched performance and reliability.
The Problem:
The Solution:
Result: Production-ready PDF processing that scales.
Key Features
Performance
5-10x faster than sequential processing, with automatic parallelization
12,933 ops/sec error handling, 5,575 ops/sec text extraction
Process 50-page PDFs in seconds with multi-core utilization
Lightweight with minimal dependencies
Developer Experience
Path Flexibility - Absolute & relative paths, Windows/Unix support (v1.3.0)
Smart Ordering - Y-coordinate based content preserves document layout
Type Safe - Full TypeScript with strict mode enabled
Battle-tested - 103 tests, 94%+ coverage, 98%+ function coverage
Simple API - Single tool handles all operations elegantly
Performance Benchmarks
Real-world performance from production testing:
| Operation | Ops/sec | Performance | Use Case |
| --- | --- | --- | --- |
| Error handling | 12,933 | ⚡⚡⚡⚡⚡ | Validation & safety |
| Extract full text | 5,575 | ⚡⚡⚡⚡ | Document analysis |
| Extract page | 5,329 | ⚡⚡⚡⚡ | Single page ops |
| Multiple pages | 5,242 | ⚡⚡⚡⚡ | Batch processing |
| Metadata only | 4,912 | ⚡⚡⚡ | Quick inspection |
Parallel Processing Speedup
| Document | Sequential | Parallel | Speedup |
| --- | --- | --- | --- |
| 10-page PDF | ~2s | ~0.3s | 5-8x faster |
| 50-page PDF | ~10s | ~1s | 10x faster |
| 100+ pages | ~20s | ~2s | Linear scaling with CPU cores |
Benchmarks vary based on PDF complexity and system resources.
Installation
Quick Start
Configuration
Add to your MCP client (claude_desktop_config.json, Cursor, Cline):
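The configuration snippet itself is not present in this text. As a rough sketch, MCP clients such as Claude Desktop read an `mcpServers` map from their config file; the TypeScript below builds such an entry and prints it as JSON. The server key `pdf-reader` and the package name are placeholders chosen for illustration, not names confirmed by this document, so substitute the package this project actually publishes.

```typescript
// Shape of one server entry in an MCP client config file.
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpClientConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// ASSUMPTION: placeholder package name; the real one is not given in this text.
const PACKAGE_NAME = "your-pdf-reader-mcp-package";

const config: McpClientConfig = {
  mcpServers: {
    // ASSUMPTION: "pdf-reader" is an arbitrary key chosen for illustration.
    "pdf-reader": {
      command: "npx",
      args: ["-y", PACKAGE_NAME],
    },
  },
};

// Paste the printed JSON into the client's config file.
const configJson = JSON.stringify(config, null, 2);
console.log(configJson);
```

After editing the config file, restart the MCP client so it picks up the new server entry.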
Basic Usage
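The original request example is missing here. A minimal sketch of what a basic `read_pdf` call might look like follows. The argument names (`sources`, `path`, `url`, `pages`, and the `include_*` flags) are assumptions taken from the upstream pdf-reader-mcp project, since the parameter names are not shown in this text; check them against the tool schema the server reports.

```typescript
// ASSUMPTION: field names follow the upstream pdf-reader-mcp project.
interface PdfSource {
  path?: string;  // local file, absolute or relative
  url?: string;   // HTTP/HTTPS URL
  pages?: string; // page selection such as "1-5,10-15,20"
}

interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

const basicRequest: ReadPdfArgs = {
  sources: [{ path: "documents/report.pdf" }],
  include_full_text: true,
  include_metadata: true,
  include_page_count: true,
};

// An MCP client sends the tool name and arguments in a tools/call request.
const toolCall = { name: "read_pdf", arguments: basicRequest };
console.log(JSON.stringify(toolCall, null, 2));
```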
Result:
Full text content extracted
PDF metadata (author, title, dates)
Total page count
Structural sharing - unchanged parts preserved
Extract Specific Pages
Absolute Paths (v1.3.0+)
No more "Absolute paths are not allowed" errors!
Extract Images with Natural Ordering
Response includes:
Text and images in exact document order (Y-coordinate sorted)
Base64-encoded images with metadata (width, height, format)
Natural reading flow preserved for AI comprehension
Batch Processing
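The batch example was lost from this text. The sketch below shows one way a client could assemble a multi-source request from a mixed list of local paths and URLs. The `sources`, `path`, `url`, and `include_full_text` names are assumed from the upstream project, and `toBatchSource` is a hypothetical helper, not part of the server.

```typescript
// ASSUMPTION: `path` and `url` are the source field names used upstream.
type BatchSource = { path: string } | { url: string };

// Hypothetical helper: treat http(s) inputs as URLs, everything else as a file path.
function toBatchSource(input: string): BatchSource {
  return /^https?:\/\//i.test(input) ? { url: input } : { path: input };
}

const batchInputs = [
  "documents/report.pdf",
  "/home/user/papers/notes.pdf",
  "https://example.com/paper.pdf",
];

const batchRequest = {
  sources: batchInputs.map(toBatchSource),
  include_full_text: true, // ASSUMPTION: upstream flag name
};

console.log(JSON.stringify(batchRequest, null, 2));
```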
All PDFs processed in parallel automatically!
Features
Core Capabilities
Text Extraction - Full document or specific pages with intelligent parsing
Image Extraction - Base64-encoded with complete metadata (width, height, format)
Content Ordering - Y-coordinate based layout preservation for natural reading flow
Metadata Extraction - Author, title, creation date, and custom properties
Page Counting - Fast enumeration without loading full content
Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
Batch Processing - Multiple PDFs processed concurrently
Advanced Features
5-10x Performance - Parallel page processing with Promise.all
Smart Pagination - Extract ranges like "1-5,10-15,20"
Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
Path Flexibility - Windows, Unix, and relative paths all supported (v1.3.0)
Error Resilience - Per-page error isolation with detailed messages
Large File Support - Efficient streaming and memory management
Type Safe - Full TypeScript with strict mode enabled
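To illustrate how parallel page processing with `Promise.all` can coexist with per-page error isolation, here is a small sketch. It is not the server's actual code: `extractPage` is a hypothetical stand-in that fails on page 3 so the isolation is visible. Each page's work catches its own error, so one bad page does not reject the whole batch.

```typescript
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Hypothetical stand-in for real page extraction; page 3 fails on purpose.
async function extractPage(page: number): Promise<string> {
  if (page === 3) throw new Error("corrupt content stream");
  return `text of page ${page}`;
}

// Start every page at once. Failures are captured per page, so the
// surrounding Promise.all never rejects because of a single page.
async function extractPages(pages: number[]): Promise<PageResult[]> {
  return Promise.all(
    pages.map(async (page): Promise<PageResult> => {
      try {
        return { page, ok: true, text: await extractPage(page) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { page, ok: false, error: message };
      }
    }),
  );
}

extractPages([1, 2, 3, 4]).then((results) => {
  for (const r of results) {
    console.log(r.ok ? `page ${r.page}: ok` : `page ${r.page}: ${r.error}`);
  }
});
```

`Promise.allSettled` would give similar isolation without the inner try/catch; the version above keeps `Promise.all`, which is what the feature list names.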
What's New in v1.3.0
Absolute Paths Now Supported!
Other Improvements:
Fixed Zod validation error handling
Updated all dependencies to latest versions
103 tests passing, 94%+ coverage maintained
v1.2.0 - Content Ordering
Y-coordinate based text and image ordering
Natural reading flow for AI models
Intelligent line grouping
v1.1.0 - Image Extraction & Performance
Base64-encoded image extraction
10x speedup with parallel processing
Comprehensive test coverage (94%+)
API Reference
read_pdf Tool
The single tool that handles all PDF operations.
Parameters
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| | Array | List of PDF sources to process | Required |
| | boolean | Extract full text content | |
| | boolean | Extract PDF metadata | |
| | boolean | Include total page count | |
| | boolean | Extract embedded images | |
Source Object
Examples
Metadata only (fast):
From URL:
Page ranges:
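Range strings such as "1-5,10-15,20" expand to individual page numbers. The sketch below shows one plausible way to parse them. `parsePageRanges` is a hypothetical helper written for illustration, and the server's own parser may treat edge cases differently.

```typescript
// Parse a range string such as "1-5,10-15,20" into sorted, unique,
// 1-based page numbers. Throws on malformed or reversed ranges.
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(parsePageRanges("1-3,7,10-11")); // [ 1, 2, 3, 7, 10, 11 ]
```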
Advanced Usage
Content is returned in natural reading order based on Y-coordinates:
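The original example is not present in this text. As a sketch of what Y-coordinate ordering with line grouping can look like, the code below sorts mixed text and image items top to bottom and then left to right within a line. The `ContentItem` shape and the tolerance value are assumptions made for illustration. It also assumes PDF-style coordinates, where the origin is at the bottom-left and larger `y` values sit higher on the page.

```typescript
interface ContentItem {
  kind: "text" | "image";
  x: number;
  y: number; // PDF-style: larger y is higher on the page
  value: string; // text content, or an image identifier
}

// Order items top-to-bottom; items whose y is within `tolerance` of the
// first item on a line are grouped into that line and ordered by x.
function orderByLayout(items: ContentItem[], tolerance = 2): ContentItem[] {
  const byHeight = items.slice().sort((a, b) => b.y - a.y);
  const lines: ContentItem[][] = [];
  for (const item of byHeight) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current[0].y - item.y) <= tolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }
  const ordered: ContentItem[] = [];
  for (const line of lines) {
    ordered.push(...line.sort((a, b) => a.x - b.x));
  }
  return ordered;
}

const orderedValues = orderByLayout([
  { kind: "image", x: 50, y: 500, value: "figure-1" },
  { kind: "text", x: 120, y: 650, value: "world" },
  { kind: "text", x: 50, y: 700, value: "Title" },
  { kind: "text", x: 50, y: 651, value: "Hello" },
  { kind: "text", x: 50, y: 480, value: "Figure 1 caption" },
]).map((item) => item.value);

console.log(orderedValues);
// [ 'Title', 'Hello', 'world', 'figure-1', 'Figure 1 caption' ]
```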
Benefits:
AI understands spatial relationships
Natural document comprehension
Perfect for vision-enabled models
Automatic multi-line text grouping
Enable extraction:
Response format:
Supported formats: RGB, RGBA, Grayscale
Auto-detected: JPEG, PNG, and other embedded formats
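Since images come back Base64-encoded, a client has to decode them before saving or inspecting them. The sketch below assumes a hypothetical `data` field holding the Base64 payload next to the `width`, `height`, and `format` metadata the text mentions; the real response field names are not shown here.

```typescript
// ASSUMPTION: `data` is a hypothetical name for the Base64 payload field.
interface ExtractedImage {
  data: string;
  width: number;
  height: number;
  format: string;
}

// Decode the Base64 payload into raw bytes (Node.js Buffer).
function decodeImage(image: ExtractedImage): Buffer {
  return Buffer.from(image.data, "base64");
}

// Build a tiny stand-in payload: six bytes, as a 2x1 RGB image would need.
const sampleBytes = [255, 0, 0, 0, 0, 255];
const sampleImage: ExtractedImage = {
  data: Buffer.from(sampleBytes).toString("base64"),
  width: 2,
  height: 1,
  format: "RGB",
};

const decoded = decodeImage(sampleImage);
console.log(`${sampleImage.format} ${sampleImage.width}x${sampleImage.height}: ${decoded.length} bytes`);
```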
Absolute paths (v1.3.0+) - Direct file access:
Relative paths - Workspace files:
Configure working directory:
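The snippets for these three cases are missing from this text. As a sketch of the usual rule, absolute paths are used as given and relative paths are resolved against the configured working directory. `resolveSourcePath` is a hypothetical helper, and it uses `path.posix` only to keep the output identical on every platform; real code would normally call `path.resolve` so that Windows paths work too.

```typescript
import * as path from "node:path";

// Absolute paths pass through (normalized); relative paths are
// resolved against the given working directory.
function resolveSourcePath(sourcePath: string, cwd: string): string {
  return path.posix.isAbsolute(sourcePath)
    ? path.posix.normalize(sourcePath)
    : path.posix.resolve(cwd, sourcePath);
}

const workspace = "/home/user/workspace";

console.log(resolveSourcePath("documents/report.pdf", workspace));
// /home/user/workspace/documents/report.pdf

console.log(resolveSourcePath("/data/archive/../report.pdf", workspace));
// /data/report.pdf
```

If relative paths produce "File not found", the working directory the server was started in is the first thing to check.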
Strategy 1: Page ranges
Strategy 2: Progressive loading
Strategy 3: Parallel batching
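The examples for these strategies were lost. One way to combine them, sketched below, is to split a large document into fixed-size page ranges and request each chunk separately, either one at a time (progressive loading) or together (parallel batching). `chunkPageRanges` is a hypothetical helper, and the `sources`, `path`, and `pages` field names are assumed from the upstream project.

```typescript
// Split pages 1..totalPages into consecutive range strings,
// each covering at most chunkSize pages.
function chunkPageRanges(totalPages: number, chunkSize: number): string[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    const end = Math.min(start + chunkSize - 1, totalPages);
    ranges.push(start === end ? `${start}` : `${start}-${end}`);
  }
  return ranges;
}

const chunks = chunkPageRanges(25, 10);
console.log(chunks); // [ '1-10', '11-20', '21-25' ]

// ASSUMPTION: `sources`, `path`, and `pages` are the upstream field names.
const chunkRequests = chunks.map((pages) => ({
  sources: [{ path: "documents/large.pdf", pages }],
  include_full_text: true,
}));
console.log(chunkRequests.length); // 3
```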
Troubleshooting
"Absolute paths are not allowed"
Solution: Upgrade to v1.3.0+
Restart your MCP client completely.
"File not found"
Causes:
File doesn't exist at path
Wrong working directory
Permission issues
Solutions:
Use absolute path:
Or configure cwd:
"No tools showing up"
Solution:
Restart MCP client completely.
Architecture
Tech Stack
| Component | Technology |
| --- | --- |
| Runtime | Node.js 22+ ESM |
| PDF Engine | PDF.js (Mozilla) |
| Validation | Zod + JSON Schema |
| Protocol | MCP SDK |
| Language | TypeScript (strict) |
| Testing | Vitest (103 tests) |
| Quality | Biome (50x faster) |
| CI/CD | GitHub Actions |
Design Principles
Security First - Flexible paths with secure defaults
Simple Interface - One tool, all operations
Performance - Parallel processing, efficient memory
Reliability - Per-page isolation, detailed errors
Quality - 94%+ coverage, strict TypeScript
Type Safety - No any types, strict mode
Backward Compatible - Smooth upgrades always
Development
Prerequisites:
Node.js >= 22.0.0
pnpm (recommended) or npm
Setup:
Scripts:
Quality:
103 tests
94%+ coverage
98%+ function coverage
Zero lint errors
Strict TypeScript
Quick Start:
Fork repository
Create branch: git checkout -b feature/awesome
Make changes: pnpm test
Format: pnpm run check:fix
Commit: Use Conventional Commits
Open PR
Commit Format:
See CONTRIBUTING.md
Documentation
Full Docs - Complete guides
Getting Started - Quick start
API Reference - Detailed API
Design - Architecture
Performance - Benchmarks
Comparison - vs. alternatives
Roadmap
Completed
Image extraction (v1.1.0)
5-10x parallel speedup (v1.1.0)
Y-coordinate ordering (v1.2.0)
Absolute paths (v1.3.0)
94%+ test coverage (v1.3.0)
Next
OCR for scanned PDFs
Annotation extraction
Form field extraction
Table detection
100+ MB streaming
Advanced caching
PDF generation
Vote at Discussions
Recognition
Featured on:
Trusted worldwide • Enterprise adoption • Battle-tested
Support
Bug Reports
Discussions
Documentation
Email
Show Your Support: Star • Watch • Report bugs • Suggest features • Contribute
Stats
103 Tests • 94%+ Coverage • Production Ready
License
MIT © Sylphx
Credits
Built with:
Special thanks to the open source community