PDF RAG MCP Server
A Model Context Protocol (MCP) server that provides powerful RAG (Retrieval-Augmented Generation) capabilities for PDF documents. This server uses ChromaDB for vector storage, sentence-transformers for embeddings, and semantic chunking for intelligent text segmentation.
Features
✅ Semantic Chunking: Intelligently groups sentences together instead of splitting at arbitrary character limits
✅ Vector Search: Find semantically similar content using embeddings
✅ Keyword Search: Traditional keyword-based search for exact terms
✅ OCR Support: Automatic detection and OCR processing for scanned/image-based PDFs
✅ Source Tracking: Maintains document names and page numbers for all chunks
✅ Add/Remove PDFs: Easily manage your document collection
✅ Persistent Storage: ChromaDB persists your embeddings to disk
✅ Multiple Output Formats: Get results in Markdown or JSON format
✅ Progress Reporting: Real-time feedback during long operations
Architecture
Embedding Model: multi-qa-mpnet-base-dot-v1 (optimized for question answering)
Vector Database: ChromaDB with cosine similarity
Chunking Strategy: Semantic chunking with configurable sentence grouping and overlap
PDF Extraction: PyMuPDF for text extraction with OCR fallback for scanned PDFs
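The chunking strategy above can be sketched in a few lines. This is a minimal illustration, not the server's actual code: the function names `split_sentences` and `chunk_sentences` are hypothetical, and a naive regex splitter stands in for the NLTK punkt tokenizer so the example has no dependencies.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption: stands in for NLTK punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: List[str], chunk_size: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into chunks of `chunk_size` that share `overlap` sentences."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the remaining sentences are already covered
    return chunks


sentences = split_sentences("One. Two! Three? Four. Five.")
for chunk in chunk_sentences(sentences, chunk_size=3, overlap=1):
    print(chunk)
```

With five sentences, a chunk size of 3, and an overlap of 1, the window advances two sentences at a time, so the third sentence appears in both chunks.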
Installation
1. Install Python Dependencies
2. Download NLTK Data (Automatic)
The server automatically downloads required NLTK punkt tokenizer data on first run.
3. Install Tesseract (Optional - for OCR)
For scanned PDF support, install Tesseract:
macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
The server automatically detects scanned pages and uses OCR when Tesseract is available.
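One plausible way to decide whether a page needs OCR is to check how much text the normal extraction returned. The sketch below assumes that heuristic; the helper names and the 50-character threshold are hypothetical and not taken from the server.

```python
import shutil


def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned when it yields very little extractable text.

    The threshold is an assumed value chosen for illustration.
    """
    return len(page_text.strip()) < min_chars


def tesseract_available() -> bool:
    """Check whether a `tesseract` executable is on the PATH."""
    return shutil.which("tesseract") is not None


page_text = "   "  # e.g. what text extraction returns for an image-only page
if needs_ocr(page_text):
    if tesseract_available():
        print("Page looks scanned: run OCR")
    else:
        print("Page looks scanned, but Tesseract is not installed")
```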
4. Test the Server
Configuration
Database Location
The server stores its ChromaDB database in a configurable location. You can specify the database path using the --db-path command line argument:
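As a rough sketch of how such a flag could be handled, the following uses `argparse` from the standard library. The default path `./chroma_db` and the function name `parse_args` are assumptions for illustration; the server's real default may differ.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --db-path flag; the default location is a hypothetical value."""
    parser = argparse.ArgumentParser(description="PDF RAG MCP server (sketch)")
    parser.add_argument(
        "--db-path",
        type=Path,
        default=Path("./chroma_db"),  # assumed default, not from the source
        help="Directory where ChromaDB persists its data",
    )
    return parser.parse_args(argv)


args = parse_args(["--db-path", "/data/pdf_rag_db"])
print(args.db_path)
```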
Chunking Parameters
Default chunking settings:
Chunk Size: 3 sentences per chunk
Overlap: 1 sentence overlap between chunks
These can be customized when adding PDFs:
Character Limit
Responses are limited to 25,000 characters by default. If exceeded, results are automatically truncated with a warning message.
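The truncation behavior can be pictured as follows. This is a sketch under assumptions: the constant name, the function name, and the wording of the warning are hypothetical; only the 25,000-character limit comes from the text above.

```python
CHARACTER_LIMIT = 25_000  # limit stated above; the constant name is assumed


def truncate_response(text: str, limit: int = CHARACTER_LIMIT) -> str:
    """Cut the body at `limit` characters and append a warning if it was longer."""
    if len(text) <= limit:
        return text
    warning = f"\n\n[Warning: response truncated to {limit} characters]"
    return text[:limit] + warning


short = truncate_response("a short result")
long = truncate_response("x" * 30_000)
print(len(short), len(long))
```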
MCP Tools
1. pdf_add
Add a PDF document to the RAG database.
Input:
Output:
Example Use Cases:
Adding research papers for reference
Indexing documentation
Building a searchable knowledge base
2. pdf_remove
Remove a PDF document from the database.
Input:
Output:
3. pdf_list
List all PDF documents in the database.
Input:
Output (Markdown):
Output (JSON):
4. pdf_search_similarity
Search using semantic similarity (vector search).
Input:
Output (Markdown):
Use Cases:
Finding relevant information without exact keywords
Discovering related concepts
Question answering over documents
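Conceptually, similarity search embeds the query and ranks stored chunks by how close their vectors are to it. The sketch below shows that ranking with cosine similarity in plain Python on toy vectors. In the server this work is done by ChromaDB and the embedding model; the function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the `top_k` chunk ids with the highest cosine similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings"; real vectors have hundreds of dimensions.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for chunk_id, score in rank_by_similarity([1.0, 0.0], chunks, top_k=2):
    print(chunk_id, round(score, 3))
```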
5. pdf_search_keywords
Search using keyword matching.
Input:
Output:
Similar to pdf_search_similarity, but ranked by keyword occurrence count.
Use Cases:
Finding specific technical terms
Locating exact phrases or terminology
Verifying presence of keywords in documents
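Ranking by keyword occurrence count, as described for this tool, might look like the sketch below. It assumes case-insensitive substring counting, which may not match the server's exact matching rules, and the function names are hypothetical.

```python
from typing import List, Tuple


def keyword_score(text: str, keywords: List[str]) -> int:
    """Count case-insensitive occurrences of every keyword in the text."""
    lowered = text.lower()
    return sum(lowered.count(k.lower()) for k in keywords if k)


def rank_by_keywords(chunks: List[str], keywords: List[str], top_k: int = 5) -> List[Tuple[str, int]]:
    """Return chunks with at least one match, highest occurrence count first."""
    scored = [(chunk, keyword_score(chunk, keywords)) for chunk in chunks]
    matches = [pair for pair in scored if pair[1] > 0]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_k]


chunks = [
    "Chroma stores vectors. Chroma persists them to disk.",
    "This chunk does not mention the database.",
    "Install chroma first.",
]
for chunk, count in rank_by_keywords(chunks, ["chroma"]):
    print(count, chunk)
```

Because every chunk has to be scanned, this approach scales with collection size, which fits the note later in this document that keyword search is slower for large collections.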
Usage with Claude Desktop
1. Add to Claude Desktop Configuration
Edit your Claude Desktop config file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Add the server:
Custom Database Location:
To use a custom database path with Claude Desktop:
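Claude Desktop reads MCP servers from the `mcpServers` object in its config file, where each entry has a `command` and an `args` list. The sketch below builds such an entry in Python and prints it as JSON. The server name `pdf-rag`, the `python` command, and both paths are placeholders to replace with your own values; only the `--db-path` flag comes from this document.

```python
import json
from typing import Optional


def build_server_entry(script_path: str, db_path: Optional[str] = None) -> dict:
    """Build one mcpServers entry; appends --db-path only when a path is given."""
    args = [script_path]
    if db_path is not None:
        args += ["--db-path", db_path]
    return {"command": "python", "args": args}  # "python" is an assumed launcher


config = {
    "mcpServers": {
        # "pdf-rag" and both paths are hypothetical placeholders
        "pdf-rag": build_server_entry(
            "/absolute/path/to/server.py",
            db_path="/absolute/path/to/database",
        )
    }
}
print(json.dumps(config, indent=2))
```

Omitting `db_path` yields the basic entry without the custom database location.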
2. Restart Claude Desktop
After adding the configuration, restart Claude Desktop to load the MCP server.
3. Test the Connection
In Claude Desktop, try:
Claude will use the pdf_list tool to show available documents.
Example Workflows
Building a Research Database
Document Q&A
Knowledge Base Management
Advanced Configuration
Custom Chunk Sizes
For different document types:
Technical Documents (code, APIs):
Smaller chunks (2-3 sentences)
Minimal overlap (0-1 sentences)
Preserves code structure
Narrative Documents (articles, books):
Larger chunks (5-7 sentences)
More overlap (2-3 sentences)
Maintains context flow
Scientific Papers:
Medium chunks (3-5 sentences)
Moderate overlap (1-2 sentences)
Balances detail and context
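The guidance above can be captured as a small preset table. This is only a sketch: the preset names and the specific values (picked from within each recommended range) are illustrative, and `overlap` is an assumed parameter name alongside `chunk_size`.

```python
# Values are chosen from within the ranges recommended above.
CHUNK_PRESETS = {
    "technical": {"chunk_size": 2, "overlap": 0},
    "narrative": {"chunk_size": 6, "overlap": 2},
    "scientific": {"chunk_size": 4, "overlap": 1},
}

DEFAULT_SETTINGS = {"chunk_size": 3, "overlap": 1}  # the documented defaults


def chunk_settings(doc_type: str) -> dict:
    """Return a copy of the preset for `doc_type`, or the defaults if unknown."""
    return dict(CHUNK_PRESETS.get(doc_type, DEFAULT_SETTINGS))


print(chunk_settings("narrative"))
print(chunk_settings("unknown-type"))
```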
Document Filtering
Search within specific documents:
Output Format Selection
Choose format based on use case:
Markdown: Best for human reading, Claude's analysis
JSON: Best for programmatic processing, data extraction
Troubleshooting
"File not found" Error
Ensure you're using absolute paths:
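If you are scripting against the server, a relative path can be normalized before it is passed along. A minimal sketch using `pathlib`; the helper name is hypothetical and the file does not need to exist for the conversion to work.

```python
from pathlib import Path


def to_absolute_pdf_path(raw: str) -> Path:
    """Expand `~`, resolve relative segments, and require a .pdf extension."""
    path = Path(raw).expanduser().resolve()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {path}")
    return path


print(to_absolute_pdf_path("papers/attention.pdf"))
```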
Empty PDF Results / Scanned PDFs
The server automatically detects and processes scanned PDFs using OCR. If you get an error about no text being extracted:
Install Tesseract (if not already installed):
macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
Retry adding the PDF - the server will automatically use OCR for pages with minimal text
The error message will indicate if OCR is needed: "ensure tesseract is installed for scanned PDFs"
Out of Memory
If processing large PDFs causes memory issues:
Reduce chunk_size to create more, smaller chunks
Process documents one at a time
Increase system swap space
ChromaDB Errors
If ChromaDB complains about existing collections:
Performance Considerations
Embedding Generation
The first time you add a document, the model will be downloaded (~400MB). Subsequent operations are faster.
Typical Times:
10-page PDF: ~5-10 seconds
100-page PDF: ~30-60 seconds
1000-page PDF: ~5-10 minutes
Search Performance
Similarity Search: Fast (< 1 second for most queries)
Keyword Search: Slower for large collections (scales with document count)
Storage
Embeddings: ~3KB per chunk (768-dimensional float32 vectors, 768 × 4 bytes)
Text Storage: Depends on chunk size
Example: 1000 chunks ≈ 3MB of embeddings in ChromaDB, before text and index overhead
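The raw size of an embedding is its dimension count times the bytes used per value. The sketch below makes that arithmetic explicit. It ignores index and metadata overhead, and `bytes_per_value` is an assumed parameter (4 for float32, 2 for float16).

```python
def embedding_bytes(dimensions: int = 768, bytes_per_value: int = 4) -> int:
    """Raw size of one embedding vector in bytes."""
    return dimensions * bytes_per_value


def estimate_embedding_storage(num_chunks: int, dimensions: int = 768, bytes_per_value: int = 4) -> int:
    """Raw size of all embeddings in bytes, excluding text and index overhead."""
    return num_chunks * embedding_bytes(dimensions, bytes_per_value)


print(embedding_bytes())                 # 768 values at 4 bytes each
print(embedding_bytes(768, 2))           # 768 values at 2 bytes each
print(estimate_embedding_storage(1000))  # 1000 float32 vectors
```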
Best Practices
1. Organize Documents
Use descriptive filenames:
2. Test Chunk Sizes
Different documents benefit from different chunking:
3. Use Document Filters
When searching specific documents:
4. Combine Search Types
Use both search methods for comprehensive results:
Semantic search for concepts
Keyword search for exact terms
Security Notes
File Access: Server can read any PDF the Python process can access
Storage: Embeddings and text stored unencrypted in ChromaDB
No Authentication: MCP servers trust the client (Claude Desktop)
For production use:
Restrict file system permissions
Use dedicated database directories
Consider encryption for sensitive documents
Contributing
To extend this server:
Add New Tools: Follow the @mcp.tool() decorator pattern
Custom Chunking: Implement in semantic_chunking() function
Additional Embeddings: Swap models in initialization
Metadata: Extend metadatas dict in pdf_add()
License
MIT License - See LICENSE file for details
Acknowledgments
Anthropic: MCP Protocol and SDK
ChromaDB: Vector database
Sentence Transformers: Embedding models
PyMuPDF: PDF text extraction and OCR support
Support
For issues or questions:
Check the troubleshooting section
Review MCP documentation: https://modelcontextprotocol.io
Check ChromaDB docs: https://docs.trychroma.com
Built with ❤️ using the Model Context Protocol