# PDF MCP Server
An MCP server that enables reading PDF file contents, allowing PDF documents to be used as a knowledge base for LLMs.
## Features
- **High-Quality Extraction**: Uses [marker-pdf](https://github.com/VikParuchuri/marker) (via a Python backend) to extract text with layout awareness and high-fidelity LaTeX equation recognition.
- **Robust Fallback**: Automatically switches to a Node.js-based parser (`pdf-parse`) if the Python environment is unavailable or fails, so extraction can still proceed (albeit with lower formatting quality).
- **Smart Filtering**: Supports page range extraction to process only relevant sections of large documents.
## Installation
### Prerequisites
- **Node.js** (v18+)
- **Python** (v3.10+) and `pip` (for high-quality extraction)
### Setup
1. **Install Node.js dependencies:**
   ```bash
   npm install
   ```
2. **Install Python dependencies (Recommended):**
   To enable high-quality extraction (especially for scientific papers with math), install the Python dependencies.
   ```bash
   # Create or activate a virtual environment if desired
   python3 -m pip install -r python/requirements.txt
   ```
   > **Note:** The first time you run the tool with the Python backend, it will download necessary AI models (OCR, layout analysis, etc.) to a local cache. **This download is approximately 3.3 GB.** Ensure you have a stable internet connection.
3. **Build the server:**
   ```bash
   npm run build
   ```
## Usage
### Configuration for Claude/MCP Clients
Add this to your MCP settings configuration:
```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "node",
      "args": ["/absolute/path/to/mcpPdf/dist/index.js"],
      "env": {}
    }
  }
}
```
Optional: if Python is not found in a virtual environment or on your `PATH`, override its location by setting `PYTHON_PATH` inside `env` (for example, `"PYTHON_PATH": "/path/to/python"`).
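As a rough illustration of how a `PYTHON_PATH` override could be honored, here is a minimal sketch. The helper name `resolvePythonCommand` and the `python3` default are assumptions made for this example; the server's actual lookup order (virtual environment, then `PATH`) is not reproduced here.

```typescript
// Hypothetical helper, not taken from the server's source code.
type Env = Record<string, string | undefined>;

function resolvePythonCommand(env: Env, fallback: string = "python3"): string {
  // Assumption: an explicit PYTHON_PATH takes precedence over any default.
  const override = env["PYTHON_PATH"]?.trim();
  // An empty or whitespace-only override is treated as "not set".
  return override ? override : fallback;
}
```

In a Node.js process this would typically be called as `resolvePythonCommand(process.env)`.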
### Tool: `read_pdf`
Reads and extracts text content from a PDF file.
**Inputs:**
- `path` (string): Absolute path to the PDF file.
- `start_page` (number, optional): Starting page number (1-based).
- `end_page` (number, optional): Ending page number (1-based).
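For clients that talk to the server directly, the inputs above map onto the `arguments` object of an MCP `tools/call` request. The sketch below builds such a request; the helper name `buildReadPdfRequest` and the client-side range checks are assumptions for illustration, and the server may validate its inputs differently.

```typescript
// Argument names mirror the inputs documented above.
interface ReadPdfArgs {
  path: string;
  start_page?: number;
  end_page?: number;
}

// Hypothetical helper: builds a JSON-RPC 2.0 "tools/call" request for read_pdf.
function buildReadPdfRequest(id: number, args: ReadPdfArgs) {
  for (const page of [args.start_page, args.end_page]) {
    // Page numbers are 1-based, so anything below 1 or non-integer is rejected.
    if (page !== undefined && (!Number.isInteger(page) || page < 1)) {
      throw new RangeError("page numbers must be positive integers (1-based)");
    }
  }
  if (
    args.start_page !== undefined &&
    args.end_page !== undefined &&
    args.start_page > args.end_page
  ) {
    throw new RangeError("start_page must not exceed end_page");
  }
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "read_pdf", arguments: args },
  };
}
```

Calling `buildReadPdfRequest(1, { path: "/absolute/path/to/paper.pdf", start_page: 2, end_page: 5 })` produces a request for pages 2 through 5 of that file.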
**How it works:**
1. **Attempt 1 (Python/Marker)**: The server tries to run the internal `convert.py` script.
   - If the Python environment is configured correctly, this loads the `marker` models from the local cache (the `.cache` directory in the project).
   - It accurately converts equations to LaTeX and preserves document structure.
2. **Attempt 2 (Fallback)**: If the Python script fails (e.g., missing dependencies, runtime error), the server catches the error and uses `pdf-parse` (a pure-JavaScript library for Node.js).
   - This extracts raw text. Equations may appear as linearized text, and layout may not be preserved.
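The two-attempt flow above can be sketched as a generic try-then-fallback wrapper. This is a simplified illustration and not the server's actual code: the names `extractWithFallback`, `Extractor`, and `ExtractionResult` are hypothetical, and the two extractors are passed in as functions so the sketch does not depend on `marker` or `pdf-parse` being installed.

```typescript
// Hypothetical types for illustration only.
interface ExtractionResult {
  text: string;
  backend: "marker" | "pdf-parse";
}

type Extractor = (path: string) => Promise<string>;

async function extractWithFallback(
  path: string,
  primary: Extractor, // stands in for the Python/marker attempt
  fallback: Extractor, // stands in for the pdf-parse attempt
): Promise<ExtractionResult> {
  try {
    return { text: await primary(path), backend: "marker" };
  } catch {
    // Any failure of the primary attempt (missing dependencies, runtime error)
    // routes the request to the fallback. Errors thrown by the fallback itself
    // are not caught here and propagate to the caller.
    return { text: await fallback(path), backend: "pdf-parse" };
  }
}
```

Returning the backend name alongside the text is a design choice of this sketch; it lets a caller tell whether equations were converted to LaTeX or only extracted as raw text.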
## Troubleshooting
- **Permission Errors**: The project is configured to use a local `.cache` directory for models to avoid system permission issues. If you still encounter permission errors, ensure the project directory is writable.
- **Slow Performance**: The high-quality extraction uses deep learning models. It can be slow on large documents without a GPU. Use the `start_page` and `end_page` arguments to extract only what you need.
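One way to apply the page-range advice to a large document is to split it into fixed-size ranges and request them one at a time. The sketch below assumes the caller already knows the total page count; the helper name `pageChunks` and the chunk size are hypothetical choices for this example.

```typescript
interface PageRange {
  start_page: number;
  end_page: number;
}

// Hypothetical helper: splits pages 1..totalPages into consecutive 1-based ranges.
function pageChunks(totalPages: number, chunkSize: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    // The last chunk is clamped so it never runs past the final page.
    ranges.push({
      start_page: start,
      end_page: Math.min(start + chunkSize - 1, totalPages),
    });
  }
  return ranges;
}
```

For a 25-page document, `pageChunks(25, 10)` yields the ranges 1 to 10, 11 to 20, and 21 to 25, each of which can be passed as `start_page` and `end_page` to `read_pdf`.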