pdf2mcp
Uses OpenAI embedding models (e.g., text-embedding-3-small) to generate vector embeddings for PDF content, enabling semantic search.
██████╗ ██████╗ ███████╗██████╗ ███╗ ███╗ ██████╗██████╗
██╔══██╗██╔══██╗██╔════╝╚════██╗████╗ ████║██╔════╝██╔══██╗
██████╔╝██║ ██║█████╗ █████╔╝██╔████╔██║██║ ██████╔╝
██╔═══╝ ██║ ██║██╔══╝ ██╔═══╝ ██║╚██╔╝██║██║ ██╔═══╝
██║ ██████╔╝██║ ███████╗██║ ╚═╝ ██║╚██████╗██║
╚═╝ ╚═════╝ ╚═╝ ╚══════╝╚═╝ ╚═╝ ╚═════╝╚═╝
Turn any PDF folder into a searchable MCP server with semantic, hybrid, or keyword search.
Installation
From PyPI (recommended)
pip install pdf2mcp
Or with uv:
uv tool install pdf2mcp
From source
git clone https://github.com/iSamBa/pdf2mcp.git
uv tool install ./pdf2mcp
To update after pulling new changes:
uv tool install --force ./pdf2mcp
Optional: Tesseract OCR
Tesseract is only needed if you want to extract text from scanned or image-only PDFs. Without it, pdf2mcp works fine for text-based PDFs — image-only pages are simply skipped with a warning.
macOS:
brew install tesseract
Ubuntu / Debian:
sudo apt-get install tesseract-ocr
Windows:
Download the installer from UB-Mannheim/tesseract.
Additional languages: install language packs for non-English PDFs:
# Example: French and German
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
# or on macOS
brew install tesseract-lang
Then set PDF2MCP_OCR_LANGUAGE to the appropriate language code (e.g., fra, deu).
Verify
pdf2mcp --version
Quick Start
Interactive Setup (recommended)
pdf2mcp init -i ./my-project
The interactive wizard walks you through all configuration in 6 steps:
Project directory: confirm or change the target path
OpenAI API key: securely enter your key (masked input) and optional base URL
Documents directory: where your PDFs live (default: docs)
Embedding settings: choose model, chunk size, and overlap
Server settings: name, transport, host, and port
OCR settings: enable/disable OCR for scanned PDFs
After setup, the wizard optionally offers to ingest any PDFs found in your docs directory and generate ready-to-paste MCP client config snippets.
Manual Setup
# 1. Scaffold a project (creates docs/ and .env template)
pdf2mcp init ./my-project
cd my-project
# 2. Add your PDFs to docs/ and set OPENAI_API_KEY in .env
# 3. Ingest
pdf2mcp ingest
# 4. Start the server
pdf2mcp serve
# 5. Get config snippets for your MCP client
pdf2mcp config
Architecture
pdf2mcp separates server and client concerns:
Server (pdf2mcp serve): runs independently and handles PDF ingestion, embedding, and search. Configured via PDF2MCP_* environment variables.
Client (Claude Code, Cursor, VS Code, etc.): connects to a running server over HTTP. Only needs the server URL.
The default transport is streamable-http. The server listens on http://127.0.0.1:8000/mcp and shuts down gracefully on SIGINT/SIGTERM.
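To make the client side concrete, here is a minimal sketch of the first message an MCP client sends to that endpoint: a JSON-RPC `initialize` request over HTTP POST. It only builds the request and does not send it. The protocol revision string and the client name are assumptions for illustration, not values taken from pdf2mcp; real clients such as Cursor or Claude Code handle this handshake for you.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8000/mcp"  # default endpoint stated above


def build_initialize_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build (but do not send) the JSON-RPC `initialize` call that opens an MCP session."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumption: a protocol revision the server accepts.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

With `pdf2mcp serve` running, passing the result to `urllib.request.urlopen(...)` should return the server's response to the handshake.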
OCR / Scanned PDF Support
pdf2mcp automatically detects image-only pages in PDFs and falls back to Tesseract OCR when available:
Per-page strategy: text pages are extracted via pymupdf4llm; image-only pages are OCR'd via Tesseract.
Automatic detection: each page is checked for extractable text (via _page_has_text) and image dominance (via _is_image_dominant). Pages without sufficient text are classified as image-only.
Graceful degradation: if Tesseract is not installed or OCR is disabled, image-only pages are skipped with a warning, while text-based pages are still extracted normally.
Configuration: use the PDF2MCP_OCR_ENABLED, PDF2MCP_OCR_LANGUAGE, and PDF2MCP_OCR_DPI environment variables (see Environment Variables).
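The per-page decision described above can be sketched as a small pure function. This is not the code behind `_page_has_text` or `_is_image_dominant`, whose implementations are not shown here. The `PageInfo` fields, both thresholds, and the way the two checks are combined are assumptions chosen only to illustrate the text / OCR / skip outcomes.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    text_chars: int        # characters of extractable text on the page
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


# Hypothetical thresholds, for illustration only.
MIN_TEXT_CHARS = 20
IMAGE_DOMINANT_COVERAGE = 0.8


def choose_strategy(page: PageInfo, ocr_enabled: bool, tesseract_available: bool) -> str:
    """Return 'text', 'ocr', or 'skip' for a single page."""
    has_text = page.text_chars >= MIN_TEXT_CHARS
    image_dominant = page.image_coverage >= IMAGE_DOMINANT_COVERAGE
    # Assumption: a page counts as text-based only if it has enough text
    # and is not dominated by images.
    if has_text and not image_dominant:
        return "text"
    # Image-only page: OCR it when possible, otherwise skip it (with a warning).
    if ocr_enabled and tesseract_available:
        return "ocr"
    return "skip"
```

Keeping the decision separate from the extraction calls makes the degradation path easy to check: the same page yields "ocr" or "skip" depending only on whether OCR is enabled and Tesseract is present.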
Commands
Command | Description |
pdf2mcp init | Scaffold a working directory with |
pdf2mcp init -i | Launch the interactive setup wizard |
pdf2mcp ingest | Parse PDFs, chunk, embed, and store in vector DB |
pdf2mcp serve | Start the MCP server (HTTP by default) |
pdf2mcp config | Print ready-to-paste config for MCP clients |
pdf2mcp stats | Display index statistics (doc count, chunks, DB size) |
pdf2mcp search | Search the index from the command line |
pdf2mcp delete | Delete a document from the index |
Common Flags
# Override docs directory
pdf2mcp ingest --docs-dir ./my-pdfs
pdf2mcp serve --docs-dir ./my-pdfs
# Force re-ingestion (clears DB and re-ingests all documents)
pdf2mcp ingest --force
# Enable debug logging
pdf2mcp ingest -v
pdf2mcp serve --verbose
# Use stdio transport (for clients that spawn the server)
pdf2mcp serve --transport stdio
# Custom host/port
pdf2mcp serve --host 0.0.0.0 --port 9000
# Custom server name
pdf2mcp serve --name my-docs
# Config for a specific client
pdf2mcp config --client cursor
pdf2mcp config --client claude-desktop --transport stdio
# Interactive setup wizard
pdf2mcp init -i ./my-project
pdf2mcp init --interactive
# View index statistics
pdf2mcp stats
# Search the index from CLI
pdf2mcp search "safety requirements"
pdf2mcp search "torque settings" --filename manual.pdf
pdf2mcp search "installation" -n 10
# Delete a document from the index
pdf2mcp delete old-manual.pdf
pdf2mcp delete old-manual.pdf -y # skip confirmation
Client Configuration
pdf2mcp config generates ready-to-paste JSON for all supported clients. The default is HTTP — clients just need the server URL:
{
  "mcpServers": {
    "pdf-docs": {
      "type": "http",
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
Client | Config File | Top-level Key | HTTP Support |
Claude Code | | | Yes |
Claude Desktop | | | No (stdio only) |
Cursor | | | Yes |
VS Code / Copilot | | | Yes |
Use --transport stdio for clients that need to spawn the server process (e.g., Claude Desktop):
{
  "mcpServers": {
    "pdf-docs": {
      "command": "uv",
      "args": ["run", "pdf2mcp", "serve"]
    }
  }
}
Environment Variables
Server settings (PDF2MCP_*)
These configure the server process. MCP clients never need these.
Variable | Default | Description |
| (required) | OpenAI API key for embeddings |
| | OpenAI API base URL (for Azure, local proxies, or compatible providers) |
| | Directory containing PDF files |
| | Directory for vector database |
| | OpenAI embedding model |
| | Target chunk size in tokens |
| | Overlap between chunks in tokens |
| | Default search results count |
| | MCP server name |
| | Transport protocol |
| | Host to bind to |
| | Port to bind to |
PDF2MCP_SEARCH_MODE | | Search mode: |
PDF2MCP_OCR_ENABLED | | Enable OCR for scanned/image-only pages |
PDF2MCP_OCR_LANGUAGE | | Tesseract language code |
PDF2MCP_OCR_DPI | | DPI for OCR rendering |
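The chunk size and overlap settings in the table interact as a sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below illustrates that arithmetic under simplifying assumptions. It works on an already-tokenized list and is not pdf2mcp's actual chunker, which may also respect headings or sentence boundaries.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, ten tokens with a chunk size of 4 and an overlap of 1 produce three chunks, where the last token of each chunk is repeated as the first token of the next. A larger overlap keeps more context across chunk boundaries at the cost of more chunks to embed.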
Search Modes
pdf2mcp supports three search modes, controlled by the PDF2MCP_SEARCH_MODE environment variable:
Mode | Description | When to use |
semantic | Pure vector similarity search | General natural-language queries |
keyword | Full-text search (no embeddings needed) | Exact terms, acronyms, error codes |
hybrid | Combines vector + full-text search | Best of both worlds |
To switch modes, set PDF2MCP_SEARCH_MODE in your .env and re-ingest:
# In .env
PDF2MCP_SEARCH_MODE=hybrid
# Re-ingest to build the FTS index
pdf2mcp ingest --force
Hybrid and keyword modes automatically create a full-text search index. If you switch modes without re-ingesting, the FTS index is created lazily on the first query.
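How hybrid mode merges its two result lists is not specified here. One common way to combine a vector ranking with a full-text ranking is reciprocal rank fusion (RRF), sketched below. Treat it as an illustration of the general idea rather than pdf2mcp's method. The chunk IDs and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for higher-ranked items.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from the two retrievers.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists (here `c1` and `c3`) rise above chunks found by only one retriever. This is why hybrid search tends to handle both natural-language queries and exact terms.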
MCP Tools
The server exposes six tools:
Tool | Description |
search_docs | Search across all ingested PDFs |
search_in_doc | Search scoped to a single document |
list_docs | List all ingested documents with chunk counts |
get_sections | Get section headings for a specific document |
read_page | Read the full content of a specific page |
read_section | Read the full content of a named section |
Typical workflow
list_docs: discover available documents
get_sections: browse a document's structure
read_section or read_page: read specific content
search_docs or search_in_doc: find information by query
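On the wire, each step of this workflow is an MCP `tools/call` JSON-RPC request. The sketch below builds the four messages in order. The tool names come from the table above, but the argument names (`filename`, `section`, `query`) are hypothetical. Check the tool schemas the server advertises via `tools/list` before relying on them.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build one MCP `tools/call` JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def build_workflow(filename: str, section: str, query: str) -> list[dict]:
    """Return the discover, browse, read, search sequence as request payloads."""
    steps = [
        ("list_docs", {}),
        # Argument names below are assumptions, not the server's documented schema.
        ("get_sections", {"filename": filename}),
        ("read_section", {"filename": filename, "section": section}),
        ("search_in_doc", {"filename": filename, "query": query}),
    ]
    return [tool_call(i, name, args) for i, (name, args) in enumerate(steps, start=1)]


requests = build_workflow("manual.pdf", "Installation", "torque settings")
first_payload = json.dumps(requests[0])
```

An MCP client normally issues these calls on the model's behalf, so this is mainly useful for debugging or scripting against a running server.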
MCP Prompts
The server provides five prompts that guide LLMs through multi-tool workflows:
Prompt | Args | Description |
| | Read all sections and synthesize a summary |
| | Side-by-side comparison of two documents |
| | Extract conclusions, recommendations, and key findings |
| | Exhaustive analysis of a specific topic |
| | Structured table of contents with brief descriptions |
Prompts return step-by-step instructions that reference the existing tools, enabling LLMs to perform complex multi-step document analysis automatically.
MCP Resources
Resource URI | Description |
| Server status: document count, chunk count, embedding model, and docs directory |
Development
git clone https://github.com/iSamBa/pdf2mcp.git
cd pdf2mcp
uv sync --all-extras
uv run pytest
uv run ruff check src/
uv run mypy src/
License