# PDF RAG MCP Server

A Model Context Protocol (MCP) server that provides powerful RAG (Retrieval-Augmented Generation) capabilities for PDF documents. This server uses ChromaDB for vector storage, sentence-transformers for embeddings, and semantic chunking for intelligent text segmentation.

## Features

- ✅ **Semantic Chunking**: Intelligently groups sentences together instead of splitting at arbitrary character limits
- ✅ **Vector Search**: Find semantically similar content using embeddings
- ✅ **Keyword Search**: Traditional keyword-based search for exact terms
- ✅ **OCR Support**: Automatic detection and OCR processing for scanned/image-based PDFs
- ✅ **Source Tracking**: Maintains document names and page numbers for all chunks
- ✅ **Add/Remove PDFs**: Easily manage your document collection
- ✅ **Persistent Storage**: ChromaDB persists your embeddings to disk
- ✅ **Multiple Output Formats**: Get results in Markdown or JSON format
- ✅ **Progress Reporting**: Real-time feedback during long operations

## Architecture

- **Embedding Model**: `multi-qa-mpnet-base-dot-v1` (optimized for question-answering)
- **Vector Database**: ChromaDB with cosine similarity
- **Chunking Strategy**: Semantic chunking with configurable sentence grouping and overlap
- **PDF Extraction**: PyMuPDF for text extraction with OCR fallback for scanned PDFs

## Installation

### 1. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 2. Download NLTK Data (Automatic)

The server automatically downloads required NLTK punkt tokenizer data on first run.

### 3. Install Tesseract (Optional - for OCR)

For scanned PDF support, install Tesseract:

- **macOS:** `brew install tesseract`
- **Ubuntu/Debian:** `sudo apt-get install tesseract-ocr`
- **Windows:** Download from https://github.com/UB-Mannheim/tesseract/wiki

The server automatically detects scanned pages and uses OCR when Tesseract is available.

### 4. Test the Server

```bash
python pdf_rag_mcp.py --help
```

## Configuration

### Database Location

The server stores its ChromaDB database in a configurable location. You can specify the database path using the `--db-path` command line argument:

```bash
# Use default location (~/.dotfiles/files/mcps/pdfrag/chroma_db)
python pdf_rag_mcp.py

# Use custom database location
python pdf_rag_mcp.py --db-path /path/to/your/database
```

### Chunking Parameters

Default chunking settings:

- **Chunk Size**: 3 sentences per chunk
- **Overlap**: 1 sentence overlap between chunks

These can be customized when adding PDFs:

```python
{
    "pdf_path": "/path/to/document.pdf",
    "chunk_size": 5,  # Use 5 sentences per chunk
    "overlap": 2      # 2 sentences overlap
}
```

### Character Limit

Responses are limited to 25,000 characters by default. If exceeded, results are automatically truncated with a warning message.

## MCP Tools

### 1. pdf_add

Add a PDF document to the RAG database.

**Input:**

```json
{
  "pdf_path": "/absolute/path/to/document.pdf",
  "chunk_size": 3,  // optional, default: 3
  "overlap": 1      // optional, default: 1
}
```

**Output:**

```json
{
  "status": "success",
  "message": "Successfully added 'document.pdf' to the database",
  "document_id": "a1b2c3d4...",
  "filename": "document.pdf",
  "pages": 15,
  "chunks": 127,
  "chunk_size": 3,
  "overlap": 1
}
```

**Example Use Cases:**

- Adding research papers for reference
- Indexing documentation
- Building a searchable knowledge base

### 2. pdf_remove

Remove a PDF document from the database.

**Input:**

```json
{
  "document_id": "a1b2c3d4..."  // Get from pdf_list
}
```

**Output:**

```json
{
  "status": "success",
  "message": "Successfully removed 'document.pdf' from the database",
  "document_id": "a1b2c3d4...",
  "removed_chunks": 127
}
```

### 3. pdf_list

List all PDF documents in the database.

**Input:**

```json
{
  "response_format": "markdown"  // or "json"
}
```

**Output (Markdown):**

```markdown
# PDF Documents (2 total)

## research_paper.pdf
**Document ID:** a1b2c3d4...
**Chunks:** 127
**Added:** N/A

## documentation.pdf
**Document ID:** e5f6g7h8...
**Chunks:** 89
**Added:** N/A
```

**Output (JSON):**

```json
{
  "count": 2,
  "documents": [
    {
      "document_id": "a1b2c3d4...",
      "filename": "research_paper.pdf",
      "chunk_count": 127
    },
    {
      "document_id": "e5f6g7h8...",
      "filename": "documentation.pdf",
      "chunk_count": 89
    }
  ]
}
```

### 4. pdf_search_similarity

Search using semantic similarity (vector search).

**Input:**

```json
{
  "query": "machine learning techniques for text classification",
  "top_k": 5,                    // optional, default: 5
  "document_filter": null,       // optional, search specific doc
  "response_format": "markdown"  // optional, default: markdown
}
```

**Output (Markdown):**

```markdown
# Search Results for: 'machine learning techniques for text classification'

Found 5 relevant chunks:

## Result 1
**Document:** research_paper.pdf
**Page:** 7
**Similarity Score:** 0.8754

**Content:**
Machine learning approaches to text classification have evolved significantly...

---
```

**Use Cases:**

- Finding relevant information without exact keywords
- Discovering related concepts
- Question answering over documents

### 5. pdf_search_keywords

Search using keyword matching.

**Input:**

```json
{
  "keywords": "neural network backpropagation",
  "top_k": 5,                    // optional, default: 5
  "document_filter": null,       // optional
  "response_format": "markdown"  // optional, default: markdown
}
```

**Output:** Similar to `pdf_search_similarity`, but ranked by keyword occurrence count.

**Use Cases:**

- Finding specific technical terms
- Locating exact phrases or terminology
- Verifying presence of keywords in documents

## Usage with Claude Desktop

### 1. Add to Claude Desktop Configuration

Edit your Claude Desktop config file:

- **macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows:** `%APPDATA%\Claude\claude_desktop_config.json`

Add the server:

```json
{
  "mcpServers": {
    "pdf-rag": {
      "command": "python",
      "args": [
        "/absolute/path/to/pdf_rag_mcp.py"
      ]
    }
  }
}
```

**Custom Database Location:**

To use a custom database path with Claude Desktop:

```json
{
  "mcpServers": {
    "pdf-rag": {
      "command": "python",
      "args": [
        "/absolute/path/to/pdf_rag_mcp.py",
        "--db-path",
        "/custom/path/to/database"
      ]
    }
  }
}
```

### 2. Restart Claude Desktop

After adding the configuration, restart Claude Desktop to load the MCP server.

### 3. Test the Connection

In Claude Desktop, try:

```
Can you list the PDFs in the RAG database?
```

Claude will use the `pdf_list` tool to show available documents.

## Example Workflows

### Building a Research Database

```
1. Add documents:
   "Add these PDFs to the database: /research/paper1.pdf, /research/paper2.pdf"

2. Search for concepts:
   "Search for information about 'gradient descent optimization' in the database"

3. Find specific terms:
   "Search for the keyword 'convolutional neural network' and show me the pages"
```

### Document Q&A

```
1. Add documentation:
   "Add this user manual: /docs/product_manual.pdf"

2. Ask questions:
   "How do I configure the network settings according to the manual?"

3. Find references:
   "Which page discusses troubleshooting connection errors?"
```

### Knowledge Base Management

```
1. List documents:
   "Show me all documents in the RAG database"

2. Remove outdated docs:
   "Remove the document with ID a1b2c3d4..."

3. Search across all:
   "Search all documents for information about API authentication"
```

## Advanced Configuration

### Custom Chunk Sizes

For different document types:

**Technical Documents** (code, APIs):
- Smaller chunks (2-3 sentences)
- Minimal overlap (0-1 sentences)
- Preserves code structure

**Narrative Documents** (articles, books):
- Larger chunks (5-7 sentences)
- More overlap (2-3 sentences)
- Maintains context flow

**Scientific Papers**:
- Medium chunks (3-5 sentences)
- Moderate overlap (1-2 sentences)
- Balances detail and context

### Document Filtering

Search within specific documents:

```json
{
  "query": "data preprocessing",
  "document_filter": "a1b2c3d4..."  // Only search this doc
}
```

### Output Format Selection

Choose format based on use case:

- **Markdown**: Best for human reading, Claude's analysis
- **JSON**: Best for programmatic processing, data extraction

## Troubleshooting

### "File not found" Error

Ensure you're using absolute paths:

```
"/home/user/documents/paper.pdf"  ✅
"~/documents/paper.pdf"           ❌ (needs expansion)
"./paper.pdf"                     ❌ (relative path)
```

### Empty PDF Results / Scanned PDFs

The server automatically detects and processes scanned PDFs using OCR. If you get an error about no text being extracted:

1. **Install Tesseract** (if not already installed):
   - macOS: `brew install tesseract`
   - Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
   - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki

2. **Retry adding the PDF** - the server will automatically use OCR for pages with minimal text

The error message will indicate if OCR is needed: "ensure tesseract is installed for scanned PDFs"

### Out of Memory

If processing large PDFs causes memory issues:

1. Reduce `chunk_size` to create more, smaller chunks
2. Process documents one at a time
3. Increase system swap space

### ChromaDB Errors

If ChromaDB complains about existing collections:

```bash
# Remove the database directory (default location shown; use your --db-path value if you set one)
# Note: this deletes every indexed document, so PDFs must be re-added afterwards
rm -rf ~/.dotfiles/files/mcps/pdfrag/chroma_db

# Restart the server
```

## Performance Considerations

### Embedding Generation

The embedding model (~400MB) is downloaded the first time you add a document. Subsequent operations skip the download and are faster.

**Typical Times:**

- 10-page PDF: ~5-10 seconds
- 100-page PDF: ~30-60 seconds
- 1000-page PDF: ~5-10 minutes

### Search Performance

- **Similarity Search**: Fast (< 1 second for most queries)
- **Keyword Search**: Slower for large collections (scales with document count)

### Storage

- **Embeddings**: ~3KB of raw vector data per chunk (768 dimensions × 4 bytes, assuming float32 storage)
- **Text Storage**: Depends on chunk size
- **Example**: 1000 chunks ≈ 3MB of vectors in ChromaDB, plus text, metadata, and index overhead

## Best Practices

### 1. Organize Documents

Use descriptive filenames:

```
research_ml_2024.pdf  ✅
document (1).pdf      ❌
```

### 2. Test Chunk Sizes

Different documents benefit from different chunking:

```python
# Try multiple chunk sizes for the same document
pdf_add(pdf_path="/absolute/path/to/doc.pdf", chunk_size=3, overlap=1)  # Test 1
pdf_remove(document_id="...")                                           # Remove
pdf_add(pdf_path="/absolute/path/to/doc.pdf", chunk_size=5, overlap=2)  # Test 2
```

### 3. Use Document Filters

When searching specific documents:

```python
# More focused, faster results
pdf_search_similarity(
    query="...",
    document_filter="specific_doc_id"
)
```

### 4. Combine Search Types

Use both search methods for comprehensive results:

1. Semantic search for concepts
2. Keyword search for exact terms

## Security Notes

- **File Access**: Server can read any PDF the Python process can access
- **Storage**: Embeddings and text stored unencrypted in ChromaDB
- **No Authentication**: MCP servers trust the client (Claude Desktop)

For production use:

- Restrict file system permissions
- Use dedicated database directories
- Consider encryption for sensitive documents

## Contributing

To extend this server:

1. **Add New Tools**: Follow the `@mcp.tool()` decorator pattern
2. **Custom Chunking**: Implement in `semantic_chunking()` function
3. **Additional Embeddings**: Swap models in initialization
4. **Metadata**: Extend `metadatas` dict in `pdf_add()`

## License

MIT License - See LICENSE file for details

## Acknowledgments

- **Anthropic**: MCP Protocol and SDK
- **ChromaDB**: Vector database
- **Sentence Transformers**: Embedding models
- **PyMuPDF**: PDF text extraction and OCR support

## Support

For issues or questions:

1. Check the troubleshooting section
2. Review MCP documentation: https://modelcontextprotocol.io
3. Check ChromaDB docs: https://docs.trychroma.com

---

**Built with ❤️ using the Model Context Protocol**
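
## Appendix: Chunking Sketch

The Chunking Parameters section describes chunks as groups of `chunk_size` sentences, with `overlap` sentences shared between neighboring chunks. The sketch below illustrates that windowing on an already-split list of sentences. It is a minimal illustration under stated assumptions, not the server's actual `semantic_chunking()` code: the function name `chunk_sentences` and its validation rules are hypothetical, and sentence splitting (which the server does with NLTK) is left out.

```python
from typing import List


def chunk_sentences(
    sentences: List[str], chunk_size: int = 3, overlap: int = 1
) -> List[str]:
    """Group sentences into overlapping chunks.

    Hypothetical helper for illustration only; the real server may
    handle edge cases differently.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(sentences):
            break  # this window already reached the last sentence
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6.", "S7."]
for chunk in chunk_sentences(sentences, chunk_size=3, overlap=1):
    print(chunk)
```

With the defaults (3 sentences, 1 overlapping), each chunk starts on the last sentence of the previous one, so a statement that straddles a chunk boundary still appears intact in at least one chunk. Larger `overlap` values increase that redundancy at the cost of more chunks to embed and store.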
