# PDF MCP Server

An MCP server that reads PDF file contents, so that PDF documents can be used as a knowledge base for LLMs.

## Features

- **High-Quality Extraction**: Uses [marker-pdf](https://github.com/VikParuchuri/marker) (via a Python backend) to extract text with layout awareness and high-fidelity LaTeX equation recognition.
- **Robust Fallback**: Automatically switches to a Node.js-based parser (`pdf-parse`) if the Python environment is unavailable or fails, so extraction can still complete (with lower formatting quality).
- **Smart Filtering**: Supports page range extraction to process only the relevant sections of large documents.

## Installation

### Prerequisites

- **Node.js** (v18+)
- **Python** (v3.10+) and `pip` (for high-quality extraction)

### Setup

1. **Install Node.js dependencies:**

   ```bash
   npm install
   ```

2. **Install Python dependencies (Recommended):**

   To enable high-quality extraction (especially for scientific papers with math), install the Python dependencies.

   ```bash
   # Create or activate a virtual environment if desired
   python3 -m pip install -r python/requirements.txt
   ```

   > **Note:** The first time you run the tool with the Python backend, it downloads the necessary AI models (OCR, layout analysis, etc.) to a local cache. **This download is approximately 3.3GB.** Ensure you have a stable internet connection.

3. **Build the server:**

   ```bash
   npm run build
   ```

## Usage

### Configuration for Claude/MCP Clients

Add this to your MCP settings configuration:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "node",
      "args": ["/absolute/path/to/mcpPdf/dist/index.js"],
      "env": {}
    }
  }
}
```

Optional: to override where Python is found (if it is not in a venv or on your path), add `"PYTHON_PATH": "/path/to/python"` to the `env` object. Plain JSON does not allow comments, so keep this note out of the configuration file itself.

### Tool: `read_pdf`

Reads and extracts text content from a PDF file.

**Inputs:**

- `path` (string): Absolute path to the PDF file.
- `start_page` (number, optional): Starting page number (1-based).
- `end_page` (number, optional): Ending page number (1-based).

**How it works:**

1. **Attempt 1 (Python/Marker)**: The server tries to run the internal `convert.py` script.
   - If the Python environment is configured correctly, this loads the `marker` models from the local cache (the `.cache` directory in the project).
   - It converts equations to LaTeX and preserves document structure.
2. **Attempt 2 (Fallback)**: If the Python script fails (e.g., missing dependencies, runtime error), the server catches the error and uses `pdf-parse` (a Node.js library).
   - This extracts raw text. Equations may appear as linearized text, and the layout may be only partly preserved.

## Troubleshooting

- **Permission Errors**: The project is configured to use a local `.cache` directory for models to avoid system permission issues. If you encounter errors, ensure the project directory is writable.
- **Slow Performance**: The high-quality extraction uses deep learning models and can be slow on large documents without a GPU. Use the `start_page` and `end_page` arguments to extract only the pages you need.
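
## Appendix: Sketch of the Fallback Flow

To illustrate the two-attempt flow described under "How it works", here is a minimal TypeScript sketch. It is not the project's actual code: the names `Extractor`, `PageRange`, `resolvePageRange`, and `extractWithFallback` are hypothetical, and clamping out-of-range pages to the document length is an assumption (the real server may reject them instead). The extractors are passed in as functions, so the sketch runs without `marker` or `pdf-parse` installed.

```typescript
// Hypothetical names throughout; none are taken from the project's source.
interface PageRange {
  start: number; // 1-based, inclusive
  end: number; // 1-based, inclusive
}

type Extractor = (path: string, pages?: PageRange) => Promise<string>;

interface ExtractionResult {
  text: string;
  backend: "primary" | "fallback";
}

// Turn optional 1-based start_page/end_page inputs into a concrete range.
// Assumption: out-of-range values are clamped to the document's page count.
function resolvePageRange(
  totalPages: number,
  startPage?: number,
  endPage?: number,
): PageRange {
  const start = Math.max(1, Math.floor(startPage ?? 1));
  const end = Math.min(totalPages, Math.floor(endPage ?? totalPages));
  if (start > end) {
    throw new RangeError(`Empty page range: ${start}-${end} of ${totalPages}`);
  }
  return { start, end };
}

// Try the high-quality extractor first; on any error, use the simpler one.
async function extractWithFallback(
  path: string,
  primary: Extractor,
  fallback: Extractor,
  pages?: PageRange,
): Promise<ExtractionResult> {
  try {
    return { text: await primary(path, pages), backend: "primary" };
  } catch (err) {
    // For example: missing Python dependencies or a runtime error in the script.
    console.error(`Primary extraction failed, using fallback: ${String(err)}`);
    return { text: await fallback(path, pages), backend: "fallback" };
  }
}
```

In this sketch, `primary` would wrap the call to the Python script and `fallback` would wrap `pdf-parse`. Returning the `backend` field alongside the text lets a caller tell which path produced the output, which is useful when equations come back as plain text rather than LaTeX.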
