PDF MCP Server

A Model Context Protocol (MCP) server for querying PDF documents using AI-powered retrieval-augmented generation (RAG).

🚀 Features

  • PDF Document Processing: Automatic parsing and indexing of PDF files using Docling

  • Hybrid Retrieval: Combines BM25 and vector search for accurate information retrieval

  • Structured Responses: Returns JSON with answers, source citations, and confidence scores

  • MCP Integration: Exposes query_pdf tool via FastMCP for seamless integration

📋 Prerequisites

  • Python 3.11 or later

  • OpenAI API key

  • PDF documents to query

🛠️ Installation

1. Clone the Repository (if not already done)

git clone <repository-url>
cd pdf_mcpserver

2. Install Dependencies with uv

uv sync

This will automatically:

  • Create a virtual environment (.venv)

  • Install all dependencies from pyproject.toml

  • Set up the project

3. Configure Environment

Copy the example environment file and add your OpenAI API key:

cp .env.example .env

Edit .env and set your OpenAI API key:

OPENAI_API_KEY=your_openai_api_key_here
PDF_DOCUMENTS_DIR=./documents
CHROMA_DB_DIR=./chroma_db
LOG_LEVEL=INFO
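
To confirm these values are picked up before starting the server, a quick check script along these lines can help. It is a minimal sketch, not part of the repository: it assumes python-dotenv (or an equivalent loader) reads the .env file, and the variable names mirror the table in the Configuration section below.

```python
# check_config.py -- hypothetical helper, not shipped with the repo
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pull variables from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PDF_DOCUMENTS_DIR = os.getenv("PDF_DOCUMENTS_DIR", "./documents")
CHROMA_DB_DIR = os.getenv("CHROMA_DB_DIR", "./chroma_db")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

if not OPENAI_API_KEY:
    raise SystemExit("OPENAI_API_KEY is required")

print(f"documents={PDF_DOCUMENTS_DIR} chroma={CHROMA_DB_DIR} log_level={LOG_LEVEL}")
```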

4. Add PDF Documents

Create a documents directory and add your PDF files:

mkdir documents
# Copy your PDF files to the documents/ directory

🎯 Usage

Running the Server

uv run python main.py

Or activate the virtual environment first:

source .venv/bin/activate  # On Windows: .venv\Scripts\activate
python main.py

The server will:

  1. Validate configuration

  2. Load and index all PDF files from the documents/ directory

  3. Build hybrid retriever (BM25 + Vector Search)

  4. Start the MCP server
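
A minimal sketch of that startup sequence is shown below. It assumes FastMCP's decorator-based tool registration and the PDFProcessor / QueryHandler components described under Architecture; helper names such as validate_config, build_retriever, and answer are illustrative, not the repo's exact API.

```python
# startup sketch -- illustrative only; the repo's main.py may differ
from fastmcp import FastMCP

from src.config import validate_config        # assumed helper name
from src.pdf_processor import PDFProcessor    # loads and indexes PDFs
from src.query_handler import QueryHandler    # runs RAG over the retriever

mcp = FastMCP("pdf_mcpserver")

validate_config()                              # 1. validate configuration
processor = PDFProcessor()                     # 2. load and index PDFs from documents/
retriever = processor.build_retriever()        # 3. BM25 + vector retriever (assumed method)
handler = QueryHandler(retriever)

@mcp.tool()
def query_pdf(question: str) -> str:
    """Answer a question about the indexed PDF documents."""
    return handler.answer(question)            # assumed method name

if __name__ == "__main__":
    mcp.run()                                  # 4. start the MCP server
```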

Using the query_pdf Tool

The server exposes a single MCP tool: query_pdf(question: str) -> str

Example Query:

query_pdf("What is the main topic of this document?")

Example Response:

{ "answer": "The main topic is artificial intelligence and machine learning...", "sources": [ { "document_name": "ai_research.pdf", "page_number": 1, "chunk_text": "Artificial intelligence (AI) is the simulation of human intelligence..." } ], "confidence_score": 0.85 }

Response Structure

| Field | Type | Description |
|-------|------|-------------|
| answer | string | Generated answer to the question |
| sources | array | List of source citations with document name, page number, and relevant text |
| confidence_score | float | Estimated confidence (0.0 to 1.0) |
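
Since models.py holds the Pydantic models for responses, the schema above roughly corresponds to models like the following. The class and field layout is inferred from the table; the actual definitions in the repo may differ.

```python
# sketch of the response schema as Pydantic models; models.py in the repo may differ
from pydantic import BaseModel, Field

class SourceCitation(BaseModel):
    document_name: str
    page_number: int
    chunk_text: str

class QueryResponse(BaseModel):
    answer: str
    sources: list[SourceCitation]
    confidence_score: float = Field(ge=0.0, le=1.0)

# Example: validate a JSON string returned by query_pdf
# response = QueryResponse.model_validate_json(raw_json)
```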

๐Ÿ—๏ธ Architecture

pdf_mcpserver/
├── src/
│   ├── config.py          # Configuration management
│   ├── models.py          # Pydantic models for responses
│   ├── pdf_processor.py   # PDF loading and indexing
│   └── query_handler.py   # Query processing and LLM integration
├── main.py                # MCP server entry point
├── requirements.txt       # Python dependencies
└── .env                   # Environment configuration

Key Components

  • PDFProcessor: Singleton class that loads PDFs, converts to Markdown using Docling, and builds hybrid retriever

  • QueryHandler: Processes queries, retrieves relevant chunks, and generates answers using OpenAI GPT-4o-mini

  • FastMCP: MCP server framework that exposes the query_pdf tool
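
For reference, a hybrid BM25 + vector retriever of the kind described above can be assembled with LangChain's EnsembleRetriever on top of ChromaDB and OpenAI embeddings. This is a sketch under the assumption that the project uses LangChain's community integrations; pdf_processor.py may wire things up differently.

```python
# hybrid retrieval sketch -- illustrative, not the repo's exact implementation
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

def build_hybrid_retriever(chunks, persist_dir: str = "./chroma_db"):
    """chunks: LangChain Document objects produced from the parsed PDFs."""
    # Keyword-based (sparse) retrieval over the raw chunk text
    bm25 = BM25Retriever.from_documents(chunks)
    bm25.k = 4

    # Dense retrieval backed by ChromaDB + OpenAI embeddings
    vectorstore = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=persist_dir
    )
    vector = vectorstore.as_retriever(search_kwargs={"k": 4})

    # Blend both result lists with equal weight
    return EnsembleRetriever(retrievers=[bm25, vector], weights=[0.5, 0.5])
```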

🔧 Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| PDF_DOCUMENTS_DIR | ./documents | Directory containing PDF files |
| CHROMA_DB_DIR | ./chroma_db | ChromaDB storage directory |
| LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |

🧪 Testing

Run unit tests:

uv run pytest tests/
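
The test suite itself isn't documented here, but a unit test against the documented JSON contract could look like this. It is a hypothetical example, not a test from the repo:

```python
# tests/test_response_shape.py -- hypothetical example, not part of the repo
import json

def test_query_response_shape():
    # In a real test you would call query_pdf (or QueryHandler) here;
    # this stub only checks the documented response contract.
    raw = '{"answer": "stub", "sources": [], "confidence_score": 0.5}'
    payload = json.loads(raw)
    assert set(payload) == {"answer", "sources", "confidence_score"}
    assert 0.0 <= payload["confidence_score"] <= 1.0
```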

๐Ÿ“ Troubleshooting

No PDF files found

Error: No PDF files found in ./documents

Solution: Add PDF files to the documents/ directory or update PDF_DOCUMENTS_DIR in .env

OpenAI API key missing

Error: OPENAI_API_KEY is required

Solution: Set your OpenAI API key in the .env file

Import errors

Error: ModuleNotFoundError: No module named 'docling'

Solution: Ensure all dependencies are installed: uv sync

📚 Dependencies

  • fastmcp: MCP server framework

  • docling: Document processing and parsing

  • chromadb: Vector database for embeddings

  • langchain: RAG framework

  • openai: LLM provider

  • loguru: Logging

🤝 Contributing

This is a Proof of Concept (PoC) implementation. For production use, consider:

  • Adding caching for processed documents

  • Implementing multi-agent workflow with fact verification

  • Supporting additional document formats (DOCX, TXT, etc.)

  • Adding authentication and rate limiting

📄 License

[Your License Here]

🙏 Acknowledgments

Based on the docchat-docling architecture.
