PDFtotext MCP Server
A reliable Model Context Protocol (MCP) server for PDF text extraction using the proven pdftotext utility from poppler-utils.
🚀 Why This Server?
Unlike other PDF MCP servers that suffer from logging interference, complex dependencies, and reliability issues, pdftotext-mcp is:
- ✅ Actually works - Clean JSON-RPC communication without stdout pollution
- ✅ Reliable - Built on mature pdftotext from poppler-utils (used by millions)
- ✅ Lightweight - Minimal dependencies, maximum compatibility
- ✅ Production tested - Successfully tested with Claude Desktop and other MCP clients
- ✅ Feature complete - Page-specific extraction, layout preservation, encoding options
- ✅ Error handling - Comprehensive validation and helpful error messages
📋 Features
- 📄 Extract text from entire PDF documents or specific pages
- 🎨 Preserve original layout formatting (optional)
- 🔤 Multiple text encoding support (UTF-8, Latin1, ASCII)
- 📊 Comprehensive metadata in responses (word count, file info, etc.)
- 🛡️ File validation and security checks
- ⚡ Fast processing with configurable timeouts
- 🔍 Detailed error reporting with troubleshooting hints
🔧 Prerequisites
You must have pdftotext installed on your system:
Ubuntu/Debian
macOS
Windows
Verify Installation
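The same check can be done from Node. This is a minimal sketch, not code from pdftotext-mcp: the helper name `isCommandAvailable` is hypothetical. It treats a spawn failure (for example ENOENT) as "not installed" and ignores the exit status, which may differ between poppler versions.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: returns true when the command could be started at all.
// "-v" asks pdftotext to print its version information.
function isCommandAvailable(command: string, args: string[] = ["-v"]): boolean {
  const result = spawnSync(command, args, { stdio: "ignore", timeout: 5000 });
  // `error` is set when the process could not be spawned or timed out.
  return result.error === undefined;
}

console.log(
  isCommandAvailable("pdftotext") ? "pdftotext found" : "pdftotext is not available"
);
```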
📦 Installation
Option 1: Global Installation (Recommended)
Option 2: Use with npx (No Installation)
Option 3: Local Development
⚙️ Configuration
Add to your MCP client configuration:
Claude Desktop
Add to claude_desktop_config.json:
Or with npx:
Other MCP Clients
The server works with any MCP-compatible client. Use pdftotext-mcp as the command.
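Claude Desktop reads server definitions from the `mcpServers` object in claude_desktop_config.json. The sketch below generates the two variants mentioned above. The server label `pdftotext` is an arbitrary choice for this example, and the npx variant assumes the npm package name matches the `pdftotext-mcp` command.

```typescript
// Shape of one entry under "mcpServers" in claude_desktop_config.json.
interface McpServerEntry {
  command: string;
  args?: string[];
}

// Assumption: "pdftotext" is just a label chosen for this example.
function buildConfig(useNpx: boolean): { mcpServers: Record<string, McpServerEntry> } {
  const entry: McpServerEntry = useNpx
    ? { command: "npx", args: ["pdftotext-mcp"] } // assumes the package is named pdftotext-mcp
    : { command: "pdftotext-mcp" };
  return { mcpServers: { pdftotext: entry } };
}

console.log(JSON.stringify(buildConfig(false), null, 2)); // global install
console.log(JSON.stringify(buildConfig(true), null, 2)); // npx
```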
🎯 Usage
The server provides a single, powerful tool: read_pdf_text
Basic Usage
Extract entire document
Extract specific page
Preserve layout formatting
Custom encoding
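The four cases above correspond to argument objects along the following lines. The parameter names come from the parameter table in the API Reference; the type name `ReadPdfTextArgs` is hypothetical and the file paths are placeholders.

```typescript
// Argument shape taken from the parameter table in the API Reference.
interface ReadPdfTextArgs {
  path: string;
  page?: number;
  layout?: boolean;
  encoding?: "UTF-8" | "Latin1" | "ASCII";
}

// Placeholder paths; substitute a real PDF.
const examples: Record<string, ReadPdfTextArgs> = {
  entireDocument: { path: "/home/user/document.pdf" },
  specificPage: { path: "/home/user/document.pdf", page: 3 },
  preserveLayout: { path: "/home/user/document.pdf", layout: true },
  customEncoding: { path: "/home/user/document.pdf", encoding: "Latin1" },
};

for (const label of Object.keys(examples)) {
  // An MCP client sends these as the `arguments` of a tools/call request.
  console.log(label, JSON.stringify({ name: "read_pdf_text", arguments: examples[label] }));
}
```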
Response Format
Success Response
Error Response
📚 API Reference
Tool: read_pdf_text
Extracts text content from PDF files using pdftotext.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| path | string | ✅ | - | Path to PDF file (relative or absolute) |
| page | number | ❌ | all pages | Specific page to extract (1-based) |
| layout | boolean | ❌ | false | Preserve original text layout |
| encoding | string | ❌ | "UTF-8" | Output text encoding |
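A client can check arguments against this table before calling the tool. The validator below is a hypothetical sketch and not the server's own validation logic. It only encodes the constraints stated in the table and the list of supported encodings.

```typescript
const SUPPORTED_ENCODINGS = ["UTF-8", "Latin1", "ASCII"];

// Hypothetical validator: returns a list of problems, empty when the arguments look valid.
function validateArgs(args: {
  path?: unknown;
  page?: unknown;
  layout?: unknown;
  encoding?: unknown;
}): string[] {
  const problems: string[] = [];
  const { path, page, layout, encoding } = args;
  if (typeof path !== "string" || path.length === 0) {
    problems.push("path is required and must be a non-empty string");
  }
  if (page !== undefined) {
    // Pages are 1-based, so 0, negatives and fractions are rejected.
    if (typeof page !== "number" || !isFinite(page) || Math.floor(page) !== page || page < 1) {
      problems.push("page must be an integer >= 1");
    }
  }
  if (layout !== undefined && typeof layout !== "boolean") {
    problems.push("layout must be a boolean");
  }
  if (encoding !== undefined) {
    if (typeof encoding !== "string" || SUPPORTED_ENCODINGS.indexOf(encoding) === -1) {
      problems.push("encoding must be one of " + SUPPORTED_ENCODINGS.join(", "));
    }
  }
  return problems;
}

console.log(validateArgs({ path: "report.pdf", page: 2 }));
console.log(validateArgs({ page: 0, encoding: "UTF-16" }));
```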
Supported Encodings
- UTF-8 (default)
- Latin1
- ASCII
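To make the parameters concrete, here is a sketch of how they could map onto pdftotext command-line options. `-f`, `-l`, `-layout` and `-enc` are pdftotext options, and a trailing `-` sends the text to stdout. Whether the server builds its command exactly this way is an assumption, as is passing the encoding name through unchanged. The names a given build accepts can be listed with `pdftotext -listenc`.

```typescript
interface ExtractOptions {
  path: string;
  page?: number;
  layout?: boolean;
  encoding?: string;
}

// Hypothetical mapping from tool parameters to pdftotext arguments.
function buildPdftotextArgs(opts: ExtractOptions): string[] {
  const args: string[] = [];
  if (opts.page !== undefined) {
    // -f and -l set the first and last page; the same value selects a single page.
    args.push("-f", String(opts.page), "-l", String(opts.page));
  }
  if (opts.layout) {
    args.push("-layout");
  }
  args.push("-enc", opts.encoding ?? "UTF-8");
  // Input file, then "-" so the extracted text goes to stdout.
  args.push(opts.path, "-");
  return args;
}

console.log(buildPdftotextArgs({ path: "document.pdf", page: 2, layout: true }).join(" "));
```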
Error Types
- FILE_NOT_FOUND - PDF file doesn't exist
- PERMISSION_DENIED - Cannot read the file
- INVALID_PDF - File is not a valid PDF
- PDFTOTEXT_ERROR - pdftotext utility error
- UNKNOWN_ERROR - Unexpected error
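One plausible way to arrive at the types above is to classify the Node.js error code and the stderr output of pdftotext. The function below is a hypothetical sketch, not the server's implementation. In particular, matching on the "May not be a PDF file" text that poppler prints is an assumption about how invalid files might be detected.

```typescript
type PdfErrorType =
  | "FILE_NOT_FOUND"
  | "PERMISSION_DENIED"
  | "INVALID_PDF"
  | "PDFTOTEXT_ERROR"
  | "UNKNOWN_ERROR";

// Hypothetical classifier: `code` is a Node.js error code, `stderr` is pdftotext output.
function classifyError(err: { code?: string; stderr?: string }): PdfErrorType {
  if (err.code === "ENOENT") return "FILE_NOT_FOUND";
  if (err.code === "EACCES") return "PERMISSION_DENIED";
  const stderr = err.stderr ?? "";
  // Assumption: poppler's warning text is used to spot non-PDF input.
  if (stderr.indexOf("May not be a PDF file") !== -1) return "INVALID_PDF";
  if (stderr.length > 0) return "PDFTOTEXT_ERROR";
  return "UNKNOWN_ERROR";
}

console.log(classifyError({ code: "ENOENT" }));
console.log(classifyError({ stderr: "Syntax Warning: May not be a PDF file (continuing anyway)" }));
```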
🔧 Troubleshooting
"pdftotext is not available"
Solution: Install poppler-utils (see Prerequisites)
"File not found"
Solutions:
- Use absolute paths: /home/user/document.pdf
- Check file exists: ls -la /path/to/file.pdf
- Verify MCP server working directory
"Permission denied"
Solutions:
- Check file permissions: chmod 644 document.pdf
- Ensure directory is readable: chmod 755 /path/to/directory/
"File is not a valid PDF"
Solutions:
- Verify file is actually a PDF: file document.pdf
- Check for file corruption
- Try with a different PDF file
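A quick programmatic check is to look at the first bytes of the file, since PDF files begin with the ASCII marker `%PDF-`. The helper name `looksLikePdf` is hypothetical, and this is a strict sketch: it rejects files with leading junk before the marker, which some PDF readers tolerate.

```typescript
// "%PDF-" as byte values.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true when the buffer starts with the PDF marker.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}

// "%PDF-1.7" as bytes, standing in for the start of a real file.
const sample = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
console.log(looksLikePdf(sample)); // true
```

In practice the bytes would come from reading the start of the file on disk.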
MCP Connection Issues
Solutions:
- Restart your MCP client completely
- Check configuration syntax in config file
- Verify pdftotext-mcp is accessible in PATH
- Check MCP client logs for detailed errors
🧪 Testing
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
Running Locally
Code Style
This project uses ESLint. Run npm run lint to check code style.
📄 License
MIT - see LICENSE file for details.
🙏 Acknowledgments
- Built for the Model Context Protocol ecosystem
- Uses poppler-utils pdftotext utility
- Inspired by the need for reliable PDF processing in MCP environments
🔗 Related
Made for the MCP community