# 📦 PDF RAG MCP Server - Complete Project
## 🎯 Overview
A production-ready MCP server for PDF-based Retrieval-Augmented Generation (RAG). Built with Python, ChromaDB, and sentence-transformers, following Anthropic's MCP best practices.
**Version:** 1.0.0
**License:** MIT
**Python:** 3.8+
## 📁 Project Files
### Essential Files (Start Here!)
| File | Size | Purpose | Start Here? |
|------|------|---------|-------------|
| **GETTING_STARTED.md** | 7.3 KB | Quick start guide with first steps | ⭐ YES |
| **pdf_rag_mcp.py** | 29 KB | Main MCP server implementation | After setup |
| **requirements.txt** | 267 B | Python dependencies | For installation |
### Documentation
| File | Size | Purpose | Read When? |
|------|------|---------|------------|
| **QUICKSTART.md** | 5.2 KB | 5-minute setup guide | Setting up |
| **README.md** | 11 KB | Complete documentation | After basics |
| **PROJECT_OVERVIEW.md** | 12 KB | Architecture & design | For deep dive |
### Configuration & Testing
| File | Size | Purpose | Use When? |
|------|------|---------|-----------|
| **claude_desktop_config.json** | 199 B | Config example for Claude Desktop | Configuring |
| **test_pdf_rag.py** | 4.5 KB | Local testing script | Testing locally |
### Repository Files
| File | Size | Purpose |
|------|------|---------|
| **LICENSE** | 1.1 KB | MIT License |
| **.gitignore** | 407 B | Git ignore rules |
**Total Project Size:** ~70 KB (excluding dependencies)
## 🚀 Quick Start Path
```
1. Read GETTING_STARTED.md (5 minutes)
2. Install dependencies (2 minutes)
3. Configure Claude Desktop (2 minutes)
4. Test with your first PDF (1 minute)
─────────
10 minutes total
```
## 📚 Reading Order (Optional Deep Dive)
### For Beginners
1. GETTING_STARTED.md - Start here!
2. QUICKSTART.md - Detailed setup
3. README.md - Full documentation
### For Developers
1. GETTING_STARTED.md - Quick overview
2. PROJECT_OVERVIEW.md - Architecture
3. pdf_rag_mcp.py - Source code
4. test_pdf_rag.py - Test examples
### For Advanced Users
1. PROJECT_OVERVIEW.md - Design decisions
2. pdf_rag_mcp.py - Implementation details
3. README.md - API reference
## 🎯 File Purposes Explained
### GETTING_STARTED.md (START HERE!)
**Your entry point.** Everything you need to get up and running in 10 minutes. Includes:
- Quick setup steps
- First PDF workflow
- Example use cases
- Troubleshooting basics
### QUICKSTART.md
**Detailed setup guide.** Step-by-step instructions that go deeper than GETTING_STARTED.md. Includes:
- Installation walkthrough
- Configuration examples
- Testing procedures
- Common commands
### README.md
**Complete reference.** Comprehensive documentation covering:
- All features and capabilities
- Tool API documentation
- Usage examples and workflows
- Troubleshooting guide
- Best practices
- Performance considerations
### PROJECT_OVERVIEW.md
**Technical deep dive.** For understanding the architecture:
- System architecture diagrams
- Technology stack details
- Design decisions and rationale
- Performance characteristics
- Extension ideas
- Security considerations
### pdf_rag_mcp.py
**The MCP server.** Main implementation featuring:
- 5 comprehensive tools
- Semantic chunking engine
- ChromaDB integration
- Progress reporting
- Error handling
- 850+ lines of production code
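To make the "semantic chunking" idea concrete, here is a minimal sketch of sentence-based chunking. It is not the code from `pdf_rag_mcp.py`: the function name `chunk_sentences`, the `max_chars` parameter, and the regex splitter (a dependency-free stand-in for NLTK's sentence tokenizer) are all assumptions made for illustration.

```python
import re
from typing import List

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# Assumption: this regex stands in for NLTK's tokenizer so the sketch has no dependencies.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(text: str, max_chars: int = 500) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk
    instead of being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "PDFs hold text. Chunks keep sentences whole! Does this help retrieval? It should."
    for index, chunk in enumerate(chunk_sentences(sample, max_chars=40)):
        print(index, repr(chunk))
```

Because chunks always end on a sentence boundary, a retrieved chunk reads as complete text, which is the usual argument for this approach over fixed-width character splitting.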
### requirements.txt
**Dependencies list.** Install everything with:
```bash
pip install -r requirements.txt
```
Includes:
- MCP SDK
- ChromaDB
- sentence-transformers
- pypdf
- NLTK
- Pydantic
### test_pdf_rag.py
**Testing script.** Exercise chunking and PDF extraction locally, without starting the MCP server:
```bash
python test_pdf_rag.py /path/to/test.pdf
```
Features:
- Tests semantic chunking
- Validates PDF extraction
- No MCP server needed
- Useful for debugging
### claude_desktop_config.json
**Configuration example.** Copy this into your Claude Desktop config file and replace the placeholder with the absolute path to `pdf_rag_mcp.py`:
```json
{
  "mcpServers": {
    "pdf-rag": {
      "command": "python",
      "args": ["/absolute/path/to/pdf_rag_mcp.py"]
    }
  }
}
```
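If you would rather generate this entry than edit it by hand, the sketch below builds the same structure with an absolute script path. The helper name `build_config` is hypothetical. Using `sys.executable` instead of the literal `"python"` is an assumption, chosen so the server starts in the environment where the dependencies are installed.

```python
import json
import sys
from pathlib import Path


def build_config(server_script: str, server_name: str = "pdf-rag") -> dict:
    """Return a Claude Desktop config dict whose script path is absolute."""
    # expanduser() handles "~/"; resolve() turns a relative path into an absolute one.
    script_path = Path(server_script).expanduser().resolve()
    return {
        "mcpServers": {
            server_name: {
                # Assumption: reuse the interpreter that runs this helper.
                "command": sys.executable,
                "args": [str(script_path)],
            }
        }
    }


if __name__ == "__main__":
    # Run from the project directory, then paste the output into the config file.
    print(json.dumps(build_config("pdf_rag_mcp.py"), indent=2))
```

Merge the printed `mcpServers` entry into any existing config rather than overwriting the file, since other servers may already be registered there.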
### .gitignore
**Git configuration.** Excludes:
- Python cache files
- ChromaDB database
- Virtual environments
- IDE files
- Test PDFs
### LICENSE
**MIT License.** Free to use, modify, and distribute.
## 🛠️ Technology Stack
| Component | Technology | Purpose |
|-----------|-----------|---------|
| **MCP Framework** | FastMCP | Server implementation |
| **Vector Database** | ChromaDB | Persistent storage |
| **Embeddings** | sentence-transformers | Semantic search |
| **PDF Processing** | pypdf | Text extraction |
| **Text Processing** | NLTK | Sentence tokenization |
| **Validation** | Pydantic v2 | Input validation |
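The embeddings row is the heart of semantic search: text is mapped to vectors, and nearby vectors indicate related meaning. The sketch below ranks chunks by cosine similarity, one common similarity measure. It uses hand-written 3-dimensional toy vectors in place of real sentence-transformers output, and the names `cosine_similarity` and `rank_chunks` are hypothetical. In the actual server, ChromaDB performs the search, and its configured distance metric is not stated in this index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (chunk_id, similarity) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    chunks = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.5, 0.5, 0.0]}
    for chunk_id, score in rank_chunks([1.0, 0.0, 0.0], chunks):
        print(chunk_id, round(score, 3))
```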
## 🎯 Use Cases
### Research & Academia
- Index research papers
- Answer questions about papers
- Find related work
- Track citations
### Documentation
- Searchable user manuals
- API documentation lookup
- Troubleshooting guides
- How-to references
### Business & Legal
- Contract search
- Policy documents
- Compliance references
- Report analysis
### Personal Knowledge
- Book notes
- Article collection
- Study materials
- Reference library
## 🌟 Key Features
- ✅ **Semantic Chunking** - Intelligent sentence-based splitting
- ✅ **AI Embeddings** - 768-dimensional vectors for similarity
- ✅ **Dual Search** - Both semantic and keyword search
- ✅ **Source Tracking** - Document name + page numbers
- ✅ **Progress Reports** - Real-time operation status
- ✅ **Error Handling** - Smart, educational error messages
- ✅ **Multiple Formats** - Markdown or JSON output
- ✅ **Character Limits** - Automatic truncation
- ✅ **Document Management** - Add, remove, list PDFs
- ✅ **Persistent Storage** - Data survives restarts
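Three of these features (source tracking, multiple formats, character limits) interact when results are rendered. The sketch below shows one way they could fit together. The result fields (`document`, `page`, `text`), the `format_results` helper, and the 200-character default are assumptions made for illustration, not the server's actual schema or limit. The sketch drops whole results from the end rather than cutting text mid-string, so JSON output stays valid.

```python
import json
from typing import Dict, List

CHARACTER_LIMIT = 200  # hypothetical default; the server's real limit is not stated here


def _render(results: List[Dict], fmt: str, truncated: bool) -> str:
    """Render results as JSON or Markdown, flagging truncation in either format."""
    if fmt == "json":
        return json.dumps({"results": results, "truncated": truncated}, indent=2)
    blocks = [f"**{r['document']}** (page {r['page']})\n{r['text']}" for r in results]
    if truncated:
        blocks.append("_Results truncated to fit the character limit._")
    return "\n\n".join(blocks)


def format_results(results: List[Dict], fmt: str = "markdown", limit: int = CHARACTER_LIMIT) -> str:
    """Format search hits, dropping trailing hits until the output fits within limit."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"Unknown format {fmt!r}; use 'markdown' or 'json'.")
    kept = list(results)
    # Remove the lowest-ranked hit until the rendered text fits (or nothing is left).
    while kept and len(_render(kept, fmt, len(kept) < len(results))) > limit:
        kept.pop()
    return _render(kept, fmt, len(kept) < len(results))


if __name__ == "__main__":
    hits = [
        {"document": "paper.pdf", "page": 3, "text": "A" * 50},
        {"document": "manual.pdf", "page": 12, "text": "B" * 50},
    ]
    print(format_results(hits, "markdown", limit=140))
```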
## 📊 Quick Stats
- **Lines of Code:** 850+ (main server)
- **Tools Provided:** 5
- **Documentation Pages:** 4 (about 36 KB in total)
- **Test Coverage:** Local test script included (`test_pdf_rag.py`, covers chunking and PDF extraction)
- **Dependencies:** 7 main packages
- **Setup Time:** ~10 minutes
- **First PDF Time:** ~5-10 seconds (after the one-time model download)
## 🎓 Learning Path
### Level 1: User (10 minutes)
1. Read GETTING_STARTED.md
2. Install and configure
3. Add first PDF
4. Try searching
### Level 2: Power User (1 hour)
1. Read QUICKSTART.md
2. Read README.md
3. Experiment with chunk sizes
4. Try both search methods
5. Build knowledge base
### Level 3: Developer (3 hours)
1. Read PROJECT_OVERVIEW.md
2. Study pdf_rag_mcp.py code
3. Run test_pdf_rag.py
4. Understand architecture
5. Plan extensions
## 🔍 Finding What You Need
**Need to...**
- Get started? → GETTING_STARTED.md
- Set up the server? → QUICKSTART.md
- Understand a tool? → README.md
- Learn architecture? → PROJECT_OVERVIEW.md
- Test locally? → test_pdf_rag.py
- Configure Claude? → claude_desktop_config.json
- Understand code? → pdf_rag_mcp.py
## 🚨 Important Notes
1. **Absolute Paths Required** - Use full paths, not relative paths or `~/` shortcuts
2. **Python 3.8+** - Minimum Python version required
3. **First Run Slow** - Downloads a ~400 MB embedding model on first use
4. **Storage Location** - ChromaDB creates a `./chroma_db` directory
5. **Claude Desktop** - Requires restart after config changes
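The absolute-path rule is the one most likely to trip people up, so here is a minimal sketch of a check that rejects relative and `~/` paths with a message suggesting the fix. The function `validate_pdf_path` and its wording are hypothetical. The server's own validation (done with Pydantic, per the technology stack) may look quite different.

```python
from pathlib import Path


def validate_pdf_path(raw_path: str) -> Path:
    """Return raw_path as a Path if it is an absolute .pdf path, else raise ValueError with a hint."""
    if raw_path.startswith("~"):
        # Show the user what the expanded path would be instead of silently fixing it.
        suggestion = Path(raw_path).expanduser()
        raise ValueError(f"'~' is not expanded here. Try the full path instead: {suggestion}")
    path = Path(raw_path)
    if not path.is_absolute():
        raise ValueError(
            f"Relative path {raw_path!r} given. Try the full path instead: {path.resolve()}"
        )
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got {path.name!r}.")
    return path


if __name__ == "__main__":
    for candidate in ("/tmp/report.pdf", "~/report.pdf", "docs/report.pdf"):
        try:
            print("ok:", validate_pdf_path(candidate))
        except ValueError as error:
            print("rejected:", error)
```

Raising an error with a suggested replacement, rather than expanding the path silently, keeps the behavior predictable when the server's working directory differs from the user's.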
## 📞 Support Resources
- **MCP Documentation:** https://modelcontextprotocol.io
- **ChromaDB Docs:** https://docs.trychroma.com
- **Sentence Transformers:** https://www.sbert.net
- **NLTK:** https://www.nltk.org
- **pypdf:** https://pypdf.readthedocs.io
## 🎉 Ready to Start?
You have everything you need! Begin with **GETTING_STARTED.md** and you'll be searching PDFs in minutes.
## ✨ What Makes This Special?
This isn't just a basic MCP server - it's a **production-ready** implementation that:
- ✅ Follows Anthropic's MCP best practices
- ✅ Uses semantic chunking (better than character splitting)
- ✅ Provides comprehensive documentation
- ✅ Includes testing capabilities
- ✅ Has smart error handling
- ✅ Reports progress in real-time
- ✅ Tracks sources accurately
- ✅ Offers multiple output formats
- ✅ Handles large documents efficiently
- ✅ Is fully extensible
## 🎊 Let's Get Started!
Open **GETTING_STARTED.md** and let's build your PDF knowledge base!
---
**Project built with ❤️ following MCP best practices**
**Last Updated:** October 2024
**Status:** Production Ready
**Support:** See documentation files