PDF RAG MCP Server

A Model Context Protocol (MCP) server that provides powerful RAG (Retrieval-Augmented Generation) capabilities for PDF documents. This server uses ChromaDB for vector storage, sentence-transformers for embeddings, and semantic chunking for intelligent text segmentation.

Features

  • Semantic Chunking: Intelligently groups sentences together instead of splitting at arbitrary character limits

  • Vector Search: Find semantically similar content using embeddings

  • Keyword Search: Traditional keyword-based search for exact terms

  • OCR Support: Automatic detection and OCR processing for scanned/image-based PDFs

  • Source Tracking: Maintains document names and page numbers for all chunks

  • Add/Remove PDFs: Easily manage your document collection

  • Persistent Storage: ChromaDB persists your embeddings to disk

  • Multiple Output Formats: Get results in Markdown or JSON format

  • Progress Reporting: Real-time feedback during long operations

Architecture

  • Embedding Model: multi-qa-mpnet-base-dot-v1 (optimized for question-answering)

  • Vector Database: ChromaDB with cosine similarity

  • Chunking Strategy: Semantic chunking with configurable sentence grouping and overlap

  • PDF Extraction: PyMuPDF for text extraction with OCR fallback for scanned PDFs
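
Putting these pieces together, ingestion roughly follows an extract → chunk → embed → store pipeline. The sketch below is illustrative only and not taken from the server code: it assumes NLTK sentence tokenization, the default of 3 sentences per chunk with 1 sentence of overlap, and a ChromaDB collection configured for cosine similarity (names such as pdf_chunks are assumptions).

import fitz  # PyMuPDF
import nltk
import chromadb
from sentence_transformers import SentenceTransformer

def chunk_sentences(sentences, chunk_size=3, overlap=1):
    """Group sentences into overlapping chunks (default: 3 sentences, 1 overlapping)."""
    step = max(chunk_size - overlap, 1)
    return [" ".join(sentences[i:i + chunk_size]) for i in range(0, len(sentences), step)]

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("pdf_chunks", metadata={"hnsw:space": "cosine"})

doc = fitz.open("/absolute/path/to/document.pdf")
for page_number, page in enumerate(doc, start=1):
    sentences = nltk.sent_tokenize(page.get_text())
    chunks = chunk_sentences(sentences)
    if not chunks:
        continue
    collection.add(
        ids=[f"document.pdf-p{page_number}-c{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
        metadatas=[{"filename": "document.pdf", "page": page_number} for _ in chunks],
    )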

Installation

1. Install Python Dependencies

pip install -r requirements.txt

2. Download NLTK Data (Automatic)

The server automatically downloads required NLTK punkt tokenizer data on first run.
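
Behind the scenes this is the standard NLTK download call; if you prefer to fetch the data ahead of time, the equivalent (assumed) call is:

import nltk
nltk.download("punkt", quiet=True)  # sentence tokenizer used for semantic chunking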

3. Install Tesseract (Optional - for OCR)

For scanned PDF support, install Tesseract with your system package manager (for example, brew install tesseract on macOS or apt-get install tesseract-ocr on Debian/Ubuntu).

The server automatically detects scanned pages and uses OCR when Tesseract is available.
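
A rough sketch of how such a fallback can work with PyMuPDF driving Tesseract (illustrative only; the text-length threshold and function name are assumptions, not the server's actual code):

import fitz  # PyMuPDF

def extract_page_text(page):
    """Return embedded text, falling back to OCR for pages that look scanned."""
    text = page.get_text().strip()
    if len(text) < 20:  # heuristic: almost no embedded text suggests a scanned page
        ocr_textpage = page.get_textpage_ocr()  # uses Tesseract if it is installed
        text = page.get_text(textpage=ocr_textpage).strip()
    return text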

4. Test the Server

python pdf_rag_mcp.py --help

Configuration

Database Location

The server stores its ChromaDB database in a configurable location. You can specify the database path using the --db-path command line argument:

# Use default location (~/.dotfiles/files/mcps/pdfrag/chroma_db)
python pdf_rag_mcp.py

# Use custom database location
python pdf_rag_mcp.py --db-path /path/to/your/database

Chunking Parameters

Default chunking settings:

  • Chunk Size: 3 sentences per chunk

  • Overlap: 1 sentence overlap between chunks

These can be customized when adding PDFs:

{ "pdf_path": "/path/to/document.pdf", "chunk_size": 5, # Use 5 sentences per chunk "overlap": 2 # 2 sentences overlap }

Character Limit

Responses are limited to 25,000 characters by default. If exceeded, results are automatically truncated with a warning message.
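
A minimal illustration of that truncation behavior (the constant name and warning wording are assumptions, not taken from the server code):

MAX_RESPONSE_CHARS = 25_000  # default response cap described above

def cap_response(text: str) -> str:
    """Return the text unchanged, or truncate it and append a warning."""
    if len(text) <= MAX_RESPONSE_CHARS:
        return text
    return text[:MAX_RESPONSE_CHARS] + "\n\n[WARNING: results truncated at 25,000 characters]"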

MCP Tools

1. pdf_add

Add a PDF document to the RAG database.

Input:

{ "pdf_path": "/absolute/path/to/document.pdf", "chunk_size": 3, // optional, default: 3 "overlap": 1 // optional, default: 1 }

Output:

{ "status": "success", "message": "Successfully added 'document.pdf' to the database", "document_id": "a1b2c3d4...", "filename": "document.pdf", "pages": 15, "chunks": 127, "chunk_size": 3, "overlap": 1 }

Example Use Cases:

  • Adding research papers for reference

  • Indexing documentation

  • Building a searchable knowledge base

2. pdf_remove

Remove a PDF document from the database.

Input:

{ "document_id": "a1b2c3d4..." // Get from pdf_list }

Output:

{ "status": "success", "message": "Successfully removed 'document.pdf' from the database", "document_id": "a1b2c3d4...", "removed_chunks": 127 }

3. pdf_list

List all PDF documents in the database.

Input:

{ "response_format": "markdown" // or "json" }

Output (Markdown):

# PDF Documents (2 total)

## research_paper.pdf
**Document ID:** a1b2c3d4...
**Chunks:** 127
**Added:** N/A

## documentation.pdf
**Document ID:** e5f6g7h8...
**Chunks:** 89
**Added:** N/A

Output (JSON):

{ "count": 2, "documents": [ { "document_id": "a1b2c3d4...", "filename": "research_paper.pdf", "chunk_count": 127 }, { "document_id": "e5f6g7h8...", "filename": "documentation.pdf", "chunk_count": 89 } ] }

4. pdf_search_similarity

Search using semantic similarity (vector search).

Input:

{ "query": "machine learning techniques for text classification", "top_k": 5, // optional, default: 5 "document_filter": null, // optional, search specific doc "response_format": "markdown" // optional, default: markdown }

Output (Markdown):

# Search Results for: 'machine learning techniques for text classification'

Found 5 relevant chunks:

## Result 1
**Document:** research_paper.pdf
**Page:** 7
**Similarity Score:** 0.8754
**Content:** Machine learning approaches to text classification have evolved significantly...

---

Use Cases:

  • Finding relevant information without exact keywords

  • Discovering related concepts

  • Question answering over documents
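
Under the hood, similarity search amounts to embedding the query and asking ChromaDB for the nearest chunks. A minimal sketch, reusing the assumed collection and metadata names from the ingestion sketch above (not the server's actual code):

from chromadb import PersistentClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
collection = PersistentClient(path="./chroma_db").get_collection("pdf_chunks")

query = "machine learning techniques for text classification"
results = collection.query(
    query_embeddings=[model.encode(query).tolist()],
    n_results=5,  # top_k
    # where={"document_id": "a1b2c3d4..."},  # optional document filter (key name assumed)
)
# With cosine distance, similarity is roughly 1 - distance.
for chunk, meta, dist in zip(results["documents"][0], results["metadatas"][0], results["distances"][0]):
    print(meta["filename"], meta["page"], round(1 - dist, 4))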

5. pdf_search_keywords

Search using keyword matching.

Input:

{ "keywords": "neural network backpropagation", "top_k": 5, // optional, default: 5 "document_filter": null, // optional "response_format": "markdown" // optional, default: markdown }

Output: Similar to pdf_search_similarity, but ranked by keyword occurrence count.
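
A minimal illustration of occurrence-count ranking over stored chunk text (the server's actual scoring may differ):

def keyword_score(chunk_text: str, keywords: str) -> int:
    """Count how often each space-separated keyword appears in a chunk."""
    text = chunk_text.lower()
    return sum(text.count(word) for word in keywords.lower().split())

chunks = [
    "Backpropagation adjusts the weights of a neural network...",
    "Data preprocessing removes noise before training...",
]
ranked = sorted(chunks, key=lambda c: keyword_score(c, "neural network backpropagation"), reverse=True)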

Use Cases:

  • Finding specific technical terms

  • Locating exact phrases or terminology

  • Verifying presence of keywords in documents

Usage with Claude Desktop

1. Add to Claude Desktop Configuration

Edit your Claude Desktop config file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add the server:

{ "mcpServers": { "pdf-rag": { "command": "python", "args": [ "/absolute/path/to/pdf_rag_mcp.py" ] } } }

Custom Database Location:

To use a custom database path with Claude Desktop:

{ "mcpServers": { "pdf-rag": { "command": "python", "args": [ "/absolute/path/to/pdf_rag_mcp.py", "--db-path", "/custom/path/to/database" ] } } }

2. Restart Claude Desktop

After adding the configuration, restart Claude Desktop to load the MCP server.

3. Test the Connection

In Claude Desktop, try:

Can you list the PDFs in the RAG database?

Claude will use the pdf_list tool to show available documents.

Example Workflows

Building a Research Database

  1. Add documents: "Add these PDFs to the database: /research/paper1.pdf, /research/paper2.pdf"

  2. Search for concepts: "Search for information about 'gradient descent optimization' in the database"

  3. Find specific terms: "Search for the keyword 'convolutional neural network' and show me the pages"

Document Q&A

  1. Add documentation: "Add this user manual: /docs/product_manual.pdf"

  2. Ask questions: "How do I configure the network settings according to the manual?"

  3. Find references: "Which page discusses troubleshooting connection errors?"

Knowledge Base Management

  1. List documents: "Show me all documents in the RAG database"

  2. Remove outdated docs: "Remove the document with ID a1b2c3d4..."

  3. Search across all: "Search all documents for information about API authentication"

Advanced Configuration

Custom Chunk Sizes

For different document types:

Technical Documents (code, APIs):

  • Smaller chunks (2-3 sentences)

  • Minimal overlap (0-1 sentences)

  • Preserves code structure

Narrative Documents (articles, books):

  • Larger chunks (5-7 sentences)

  • More overlap (2-3 sentences)

  • Maintains context flow

Scientific Papers:

  • Medium chunks (3-5 sentences)

  • Moderate overlap (1-2 sentences)

  • Balances detail and context

Document Filtering

Search within specific documents:

{ "query": "data preprocessing", "document_filter": "a1b2c3d4..." // Only search this doc }

Output Format Selection

Choose format based on use case:

  • Markdown: Best for human reading and Claude's analysis

  • JSON: Best for programmatic processing and data extraction

Troubleshooting

"File not found" Error

Ensure you're using absolute paths:

"/home/user/documents/paper.pdf" ✅ "~/documents/paper.pdf" ❌ (needs expansion) "./paper.pdf" ❌ (relative path)

Empty PDF Results / Scanned PDFs

The server automatically detects and processes scanned PDFs using OCR. If you get an error about no text being extracted:

  1. Install Tesseract, if it is not already installed (see step 3 of the Installation section above)

  2. Retry adding the PDF - the server will automatically use OCR for pages with minimal text

The error message will indicate if OCR is needed: "ensure tesseract is installed for scanned PDFs"

Out of Memory

If processing large PDFs causes memory issues:

  1. Reduce chunk_size to create more, smaller chunks

  2. Process documents one at a time

  3. Increase system swap space

ChromaDB Errors

If ChromaDB complains about existing collections:

# Remove the database directory
rm -rf ./chroma_db

# Restart the server

Performance Considerations

Embedding Generation

The first time you add a document, the model will be downloaded (~400MB). Subsequent operations are faster.

Typical Times:

  • 10-page PDF: ~5-10 seconds

  • 100-page PDF: ~30-60 seconds

  • 1000-page PDF: ~5-10 minutes

Search Performance

  • Similarity Search: Fast (< 1 second for most queries)

  • Keyword Search: Slower for large collections (scales with document count)

Storage

  • Embeddings: ~1.5KB per chunk (768-dimensional vectors)

  • Text Storage: Depends on chunk size

  • Example: 1000 chunks ≈ 1.5MB in ChromaDB

Best Practices

1. Organize Documents

Use descriptive filenames:

research_ml_2024.pdf  ✅
document (1).pdf      ❌

2. Test Chunk Sizes

Different documents benefit from different chunking:

# Try multiple chunk sizes for the same document
pdf_add(path="doc.pdf", chunk_size=3, overlap=1)  # Test 1
pdf_remove(document_id="...")                     # Remove
pdf_add(path="doc.pdf", chunk_size=5, overlap=2)  # Test 2

3. Use Document Filters

When searching specific documents:

# More focused, faster results
pdf_search_similarity(
    query="...",
    document_filter="specific_doc_id"
)

4. Combine Search Types

Use both search methods for comprehensive results:

  1. Semantic search for concepts

  2. Keyword search for exact terms

Security Notes

  • File Access: Server can read any PDF the Python process can access

  • Storage: Embeddings and text stored unencrypted in ChromaDB

  • No Authentication: MCP servers trust the client (Claude Desktop)

For production use:

  • Restrict file system permissions

  • Use dedicated database directories

  • Consider encryption for sensitive documents

Contributing

To extend this server:

  1. Add New Tools: Follow the @mcp.tool() decorator pattern

  2. Custom Chunking: Implement in semantic_chunking() function

  3. Additional Embeddings: Swap models in initialization

  4. Metadata: Extend metadatas dict in pdf_add()
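
As a starting point, new tools follow the same @mcp.tool() decorator pattern as the existing ones. A minimal sketch, assuming the server is built on FastMCP from the official MCP Python SDK (the pdf_stats tool shown here is hypothetical):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("pdf-rag")

@mcp.tool()
def pdf_stats(document_id: str) -> dict:
    """Hypothetical example tool: report chunk statistics for one document."""
    # Query ChromaDB here and build the response dict.
    return {"document_id": document_id, "chunk_count": 0}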

License

MIT License - See LICENSE file for details

Acknowledgments

  • Anthropic: MCP Protocol and SDK

  • ChromaDB: Vector database

  • Sentence Transformers: Embedding models

  • PyMuPDF: PDF text extraction and OCR support

Support

For issues or questions:

  1. Check the troubleshooting section

  2. Review MCP documentation: https://modelcontextprotocol.io

  3. Check ChromaDB docs: https://docs.trychroma.com


Built with ❤️ using the Model Context Protocol
