PDF RAG MCP Server

MIT License

INDEX.md•8.37 kB

# 📦 PDF RAG MCP Server - Complete Project

## 🎯 Overview

A production-ready MCP server for PDF-based Retrieval-Augmented Generation (RAG). Built with Python, ChromaDB, and sentence-transformers, following Anthropic's MCP best practices.

**Version:** 1.0.0
**License:** MIT
**Python:** 3.8+

## 📁 Project Files

### Essential Files (Start Here!)

| File | Size | Purpose | Start Here? |
|------|------|---------|-------------|
| **GETTING_STARTED.md** | 7.3 KB | Quick start guide with first steps | ⭐ YES |
| **pdf_rag_mcp.py** | 29 KB | Main MCP server implementation | After setup |
| **requirements.txt** | 267 B | Python dependencies | For installation |

### Documentation

| File | Size | Purpose | Read When? |
|------|------|---------|------------|
| **QUICKSTART.md** | 5.2 KB | 5-minute setup guide | Setting up |
| **README.md** | 11 KB | Complete documentation | After basics |
| **PROJECT_OVERVIEW.md** | 12 KB | Architecture & design | For deep dive |

### Configuration & Testing

| File | Size | Purpose | Use When? |
|------|------|---------|-----------|
| **claude_desktop_config.json** | 199 B | Config example for Claude Desktop | Configuring |
| **test_pdf_rag.py** | 4.5 KB | Local testing script | Testing locally |

### Project Files

| File | Size | Purpose |
|------|------|---------|
| **LICENSE** | 1.1 KB | MIT License |
| **.gitignore** | 407 B | Git ignore rules |

**Total Project Size:** ~70 KB (excluding dependencies)

## 🚀 Quick Start Path

```
1. Read GETTING_STARTED.md      (5 minutes)
2. Install dependencies         (2 minutes)
3. Configure Claude Desktop     (2 minutes)
4. Test with your first PDF     (1 minute)
                                ─────────
                                10 minutes total
```

## 📚 Reading Order (Optional Deep Dive)

### For Beginners

1. GETTING_STARTED.md - Start here!
2. QUICKSTART.md - Detailed setup
3. README.md - Full documentation

### For Developers

1. GETTING_STARTED.md - Quick overview
2. PROJECT_OVERVIEW.md - Architecture
3. pdf_rag_mcp.py - Source code
4. test_pdf_rag.py - Test examples

### For Advanced Users

1. PROJECT_OVERVIEW.md - Design decisions
2. pdf_rag_mcp.py - Implementation details
3. README.md - API reference

## 🎯 File Purposes Explained

### GETTING_STARTED.md (START HERE!)

**Your entry point.** Everything you need to get up and running in 10 minutes. Includes:

- Quick setup steps
- First PDF workflow
- Example use cases
- Troubleshooting basics

### QUICKSTART.md

**Detailed setup guide.** Step-by-step instructions with more detail than GETTING_STARTED. Includes:

- Installation walkthrough
- Configuration examples
- Testing procedures
- Common commands

### README.md

**Complete reference.** Comprehensive documentation covering:

- All features and capabilities
- Tool API documentation
- Usage examples and workflows
- Troubleshooting guide
- Best practices
- Performance considerations

### PROJECT_OVERVIEW.md

**Technical deep dive.** For understanding the architecture:

- System architecture diagrams
- Technology stack details
- Design decisions and rationale
- Performance characteristics
- Extension ideas
- Security considerations

### pdf_rag_mcp.py

**The MCP server.** Main implementation featuring:

- 5 comprehensive tools
- Semantic chunking engine
- ChromaDB integration
- Progress reporting
- Error handling
- 850+ lines of production code

### requirements.txt

**Dependencies list.** Install everything with:

```bash
pip install -r requirements.txt
```

Includes:

- MCP SDK
- ChromaDB
- sentence-transformers
- pypdf
- NLTK
- Pydantic

### test_pdf_rag.py

**Testing script.** Test the server locally without MCP:

```bash
python test_pdf_rag.py /path/to/test.pdf
```

Features:

- Tests semantic chunking
- Validates PDF extraction
- No MCP server needed
- Useful for debugging

### claude_desktop_config.json

**Configuration example.** Copy this to your Claude Desktop config:

```json
{
  "mcpServers": {
    "pdf-rag": {
      "command": "python",
      "args": ["/absolute/path/to/pdf_rag_mcp.py"]
    }
  }
}
```

### .gitignore

**Git configuration.** Excludes:

- Python cache files
- ChromaDB database
- Virtual environments
- IDE files
- Test PDFs

### LICENSE

**MIT License.** Free to use, modify, and distribute.

## 🛠️ Technology Stack

| Component | Technology | Purpose |
|-----------|-----------|---------|
| **MCP Framework** | FastMCP | Server implementation |
| **Vector Database** | ChromaDB | Persistent storage |
| **Embeddings** | sentence-transformers | Semantic search |
| **PDF Processing** | pypdf | Text extraction |
| **Text Processing** | NLTK | Sentence tokenization |
| **Validation** | Pydantic v2 | Input validation |

## 🎯 Use Cases

### Research & Academia

- Index research papers
- Answer questions about papers
- Find related work
- Track citations

### Documentation

- Searchable user manuals
- API documentation lookup
- Troubleshooting guides
- How-to references

### Business & Legal

- Contract search
- Policy documents
- Compliance references
- Report analysis

### Personal Knowledge

- Book notes
- Article collection
- Study materials
- Reference library

## 🌟 Key Features

- ✅ **Semantic Chunking** - Intelligent sentence-based splitting
- ✅ **AI Embeddings** - 768-dimensional vectors for similarity
- ✅ **Dual Search** - Both semantic and keyword search
- ✅ **Source Tracking** - Document name + page numbers
- ✅ **Progress Reports** - Real-time operation status
- ✅ **Error Handling** - Smart, educational error messages
- ✅ **Multiple Formats** - Markdown or JSON output
- ✅ **Character Limits** - Automatic truncation
- ✅ **Document Management** - Add, remove, list PDFs
- ✅ **Persistent Storage** - Data survives restarts

## 📊 Quick Stats

- **Lines of Code:** 850+ (main server)
- **Tools Provided:** 5
- **Documentation Pages:** 4 (40+ KB)
- **Test Coverage:** Comprehensive test script included
- **Dependencies:** 7 main packages
- **Setup Time:** ~10 minutes
- **First PDF Time:** ~5-10 seconds

## 🎓 Learning Path

### Level 1: User (10 minutes)

1. Read GETTING_STARTED.md
2. Install and configure
3. Add first PDF
4. Try searching

### Level 2: Power User (1 hour)

1. Read QUICKSTART.md
2. Read README.md
3. Experiment with chunk sizes
4. Try both search methods
5. Build knowledge base

### Level 3: Developer (3 hours)

1. Read PROJECT_OVERVIEW.md
2. Study pdf_rag_mcp.py code
3. Run test_pdf_rag.py
4. Understand architecture
5. Plan extensions

## 🔍 Finding What You Need

**Need to...**

- Get started? → GETTING_STARTED.md
- Set up the server? → QUICKSTART.md
- Understand a tool? → README.md
- Learn architecture? → PROJECT_OVERVIEW.md
- Test locally? → test_pdf_rag.py
- Configure Claude? → claude_desktop_config.json
- Understand code? → pdf_rag_mcp.py

## 🚨 Important Notes

1. **Absolute Paths Required** - Use full paths, not relative or ~/
2. **Python 3.8+** - Minimum Python version required
3. **First Run Slow** - Downloads ~400MB model on first use
4. **Storage Location** - ChromaDB creates ./chroma_db directory
5. **Claude Desktop** - Requires restart after config changes

## 📞 Support Resources

- **MCP Documentation:** https://modelcontextprotocol.io
- **ChromaDB Docs:** https://docs.trychroma.com
- **Sentence Transformers:** https://www.sbert.net
- **NLTK:** https://www.nltk.org
- **PyPDF:** https://pypdf.readthedocs.io

## 🎉 Ready to Start?

You have everything you need! Begin with **GETTING_STARTED.md** and you'll be searching PDFs in minutes.

## ✨ What Makes This Special?

This isn't just a basic MCP server - it's a **production-ready** implementation that:

- ✅ Follows Anthropic's MCP best practices
- ✅ Uses semantic chunking (better than character splitting)
- ✅ Provides comprehensive documentation
- ✅ Includes testing capabilities
- ✅ Has smart error handling
- ✅ Reports progress in real-time
- ✅ Tracks sources accurately
- ✅ Offers multiple output formats
- ✅ Handles large documents efficiently
- ✅ Is fully extensible

## 🎊 Let's Get Started!

Open **GETTING_STARTED.md** and let's build your PDF knowledge base!
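
### Sketch: sentence-based chunking

The "Semantic Chunking" feature is described above only as sentence-based splitting, so a small illustration may help. The following is a minimal sketch of that idea, not the code in pdf_rag_mcp.py: the names `split_sentences` and `chunk_sentences`, the `max_chars` parameter, and the regex splitter are all assumptions made for illustration (the project lists NLTK for sentence tokenization, which would replace the naive regex here).

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Illustrative stand-in for a real tokenizer such as NLTK's sent_tokenize.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own chunk
    instead of being cut mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences whole means each chunk that gets embedded is a readable unit, which is the usual argument for sentence-based splitting over fixed character windows.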
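
### Sketch: generating the config entry with an absolute path

The Important Notes ask for absolute paths in the Claude Desktop config, not relative or `~/` paths. One way to avoid a mistake is to build the entry from Python and paste the printed JSON into the config. This is only a sketch: the helper name `build_config` and the example path in the `__main__` block are hypothetical, and the JSON shape simply mirrors the claude_desktop_config.json example above.

```python
import json
import os


def build_config(server_script: str, python_cmd: str = "python") -> dict:
    """Return a Claude Desktop config dict whose script path is absolute."""
    # expanduser() resolves a leading "~"; abspath() anchors relative
    # paths to the current working directory.
    script = os.path.abspath(os.path.expanduser(server_script))
    return {
        "mcpServers": {
            "pdf-rag": {
                "command": python_cmd,
                "args": [script],
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical location; substitute wherever pdf_rag_mcp.py actually lives.
    print(json.dumps(build_config("~/pdfrag/pdf_rag_mcp.py"), indent=2))
```

As the notes above say, Claude Desktop needs a restart after the config file changes.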
---

**Project built with ❤️ following MCP best practices**

**Last Updated:** October 2024
**Status:** Production Ready
**Support:** See documentation files
