PDF MCP Server

An MCP server that enables reading PDF file contents, allowing PDF documents to be used as a knowledge base for LLMs.

Features

  • High-Quality Extraction: Uses marker-pdf (via a Python backend) to extract text with layout awareness and high-fidelity LaTeX equation recognition.

  • Robust Fallback: Automatically switches to a Node.js-based parser (pdf-parse) if the Python environment is unavailable or fails, so text can still be extracted (albeit with lower formatting quality).

  • Smart Filtering: Supports page range extraction to process only relevant sections of large documents.

Installation

Prerequisites

  • Node.js (v18+)

  • Python (v3.10+) and pip (for high-quality extraction)

Setup

  1. Install Node.js dependencies:

    npm install
  2. Install Python dependencies (Recommended): To enable high-quality extraction (especially for scientific papers with math), install the Python dependencies.

    # Create or activate a virtual environment if desired
    python3 -m pip install -r python/requirements.txt

    Note: The first time you run the tool with the Python backend, it downloads the required AI models (OCR, layout analysis, etc.) to a local cache. The download is approximately 3.3 GB, so make sure you have a stable internet connection.

  3. Build the server:

    npm run build

Usage

Configuration for Claude/MCP Clients

Add this to your MCP settings configuration:

{
  "mcpServers": {
    "pdf-reader": {
      "command": "node",
      "args": ["/absolute/path/to/mcpPdf/dist/index.js"],
      "env": {
        // Optional: Override where python is found if not in venv or path
        // "PYTHON_PATH": "/path/to/python"
      }
    }
  }
}
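
The optional PYTHON_PATH override can be pictured with a small sketch. The helper name `resolvePythonCommand` and the `python3` default are assumptions made for illustration only; the server's actual lookup order (virtual environment, then PATH) is not documented here and may differ.

```typescript
// Hypothetical helper, not taken from the server's source code.
type Env = Record<string, string | undefined>;

// Returns the interpreter to launch: the PYTHON_PATH override when it is set
// and non-empty, otherwise an assumed default command.
function resolvePythonCommand(env: Env, fallback: string = "python3"): string {
  const override = env["PYTHON_PATH"]?.trim();
  return override ? override : fallback;
}
```

In a Node.js process this would typically be called as `resolvePythonCommand(process.env)`.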

Tool: read_pdf

Reads and extracts text content from a PDF file.

Inputs:

  • path (string): Absolute path to the PDF file.

  • start_page (number, optional): Starting page number (1-based).

  • end_page (number, optional): Ending page number (1-based).
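
As an illustration of these inputs, the sketch below builds the JSON-RPC `tools/call` message that an MCP client would send for read_pdf. The type and function names (`ReadPdfArgs`, `buildReadPdfRequest`) are hypothetical, and the client-side checks are an assumption about sensible validation, not a description of what the server enforces.

```typescript
// Argument shape taken from the input list above.
interface ReadPdfArgs {
  path: string;
  start_page?: number;
  end_page?: number;
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: "read_pdf"; arguments: ReadPdfArgs };
}

// Hypothetical client-side helper: validates the arguments, then wraps them
// in an MCP tools/call request.
function buildReadPdfRequest(id: number, args: ReadPdfArgs): ToolCallRequest {
  const { path, start_page, end_page } = args;
  // Accept POSIX ("/...") and Windows drive ("C:\...") absolute paths.
  if (!/^(\/|[A-Za-z]:[\\/])/.test(path)) {
    throw new Error("path must be absolute");
  }
  for (const page of [start_page, end_page]) {
    if (page !== undefined && (!Number.isInteger(page) || page < 1)) {
      throw new Error("page numbers must be integers starting at 1");
    }
  }
  if (start_page !== undefined && end_page !== undefined && start_page > end_page) {
    throw new Error("start_page must not exceed end_page");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "read_pdf", arguments: args },
  };
}
```

For example, `buildReadPdfRequest(1, { path: "/docs/paper.pdf", start_page: 2, end_page: 5 })` produces a request for pages 2 to 5 of that file.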

How it works:

  1. Attempt 1 (Python/Marker): The server tries to run the internal convert.py script.

    • If the Python environment is configured correctly, this loads the marker models from the local cache (.cache directory in the project).

    • It accurately converts equations to LaTeX and preserves document structure.

  2. Attempt 2 (Fallback): If the Python script fails (e.g., missing dependencies, runtime error), the server catches the error and uses pdf-parse (a pure JavaScript library for Node.js).

    • This extracts raw text only. Equations may appear as linearized text, and less of the layout is preserved.
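
The two attempts above follow a common try-then-fall-back pattern, which can be sketched as follows. This is a minimal illustration, not the server's code: `extractWithFallback`, `Extractor`, and `ExtractionResult` are hypothetical names, and the two extractors are passed in as functions so the sketch does not depend on how convert.py or pdf-parse are actually invoked.

```typescript
// Any function that turns a PDF path into text.
type Extractor = (pdfPath: string) => Promise<string>;

interface ExtractionResult {
  text: string;
  backend: "marker" | "pdf-parse";
}

// Tries the primary (Python/marker) extractor first; if it throws or rejects,
// reports the error and returns the fallback extractor's output instead.
async function extractWithFallback(
  pdfPath: string,
  primary: Extractor,
  fallback: Extractor,
): Promise<ExtractionResult> {
  try {
    return { text: await primary(pdfPath), backend: "marker" };
  } catch (err) {
    // Log to stderr: a stdio-based MCP server keeps stdout for protocol messages.
    console.error(`Primary extraction failed, using fallback: ${String(err)}`);
    return { text: await fallback(pdfPath), backend: "pdf-parse" };
  }
}
```

Returning the backend name alongside the text lets a caller tell whether it received the high-fidelity output or the raw-text fallback. If the fallback also fails, its error propagates to the caller.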

Troubleshooting

  • Permission Errors: The project is configured to use a local .cache directory for models to avoid system permission issues. If you encounter permission errors, ensure the project directory is writable.

  • Slow Performance: The high-quality extraction uses deep learning models. It can be slow on large documents without a GPU. Use the start_page and end_page arguments to extract only what you need.
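
One way to apply the page-range advice to a long document is to split it into consecutive ranges and call read_pdf once per range. The sketch below assumes the caller already knows the total page count and that end_page is inclusive; neither is confirmed by the tool description, and `pageRanges` is a hypothetical helper.

```typescript
interface PageRange {
  start_page: number;
  end_page: number;
}

// Splits pages 1..totalPages into consecutive ranges of at most pagesPerCall
// pages each (1-based, end_page assumed inclusive).
function pageRanges(totalPages: number, pagesPerCall: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (!Number.isInteger(pagesPerCall) || pagesPerCall < 1) {
    throw new Error("pagesPerCall must be a positive integer");
  }
  const ranges: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += pagesPerCall) {
    ranges.push({
      start_page: start,
      end_page: Math.min(start + pagesPerCall - 1, totalPages),
    });
  }
  return ranges;
}
```

For a 25-page document with 10 pages per call, this yields the ranges 1 to 10, 11 to 20, and 21 to 25.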
