README.mdโข8.2 kB
# PDF Retrieval MCP Server
A **completely free** Model Context Protocol (MCP) server for retrieving relevant chunks from PDF documents using hybrid search (BM25 + Vector Search).
## ๐ Features
- **PDF Document Processing**: Automatic parsing and indexing of PDF files using Docling
- **Hybrid Retrieval**: Combines BM25 (keyword) and vector search (semantic) for accurate retrieval
- **Free Embeddings**: Uses ChromaDB's default sentence-transformers (no API costs!)
- **Pure Retrieval Mode**: Returns raw document chunks for agent processing (no LLM answer generation)
- **Fresh Start**: Clears vector database on each startup for clean indexing
- **MCP Integration**: Exposes `retrieve_pdf_chunks` tool via FastMCP for seamless agent integration
## ๐ Prerequisites
- Python 3.11 or later
- PDF documents to index
- **No API keys required!** โจ
## ๐ ๏ธ Installation
### 1. Clone the Repository (if not already done)
```bash
git clone <repository-url>
cd pdf_mcpserver
```
### 2. Install Dependencies with uv
```bash
uv sync
```
This will automatically:
- Create a virtual environment (`.venv`)
- Install all dependencies from `pyproject.toml`
- Set up the project
### 3. Add PDF Documents
Create a `documents` directory and add your PDF files:
```bash
mkdir documents
# Copy your PDF files to the documents/ directory
```
That's it! No API keys or additional configuration needed.
## ๐ฏ Usage
### Running the Server
```bash
uv run python main.py
```
Or activate the virtual environment first:
```bash
source .venv/bin/activate # On Windows: .venv\Scripts\activate
python main.py
```
The server will:
1. Start immediately (lazy initialization)
2. Load and index PDFs on first query
3. Be ready to retrieve document chunks via MCP
### Using the `retrieve_pdf_chunks` Tool
The server exposes a single MCP tool: `retrieve_pdf_chunks(query: str, max_chunks: int = 5) -> str`
**Example Query:**
```python
retrieve_pdf_chunks("machine learning algorithms", max_chunks=3)
```
**Example Response:**
```json
{
"query": "machine learning algorithms",
"chunks": [
{
"content": "Machine learning algorithms can be categorized into supervised, unsupervised, and reinforcement learning...",
"document_name": "ml_guide.pdf",
"page_number": 12,
"metadata": {"source": "ml_guide.pdf"}
},
{
"content": "Common supervised learning algorithms include linear regression, decision trees, and neural networks...",
"document_name": "ml_guide.pdf",
"page_number": 15,
"metadata": {"source": "ml_guide.pdf"}
}
],
"total_chunks": 2
}
```
### Response Structure
| Field | Type | Description |
|-------|------|-------------|
| `query` | string | The original search query |
| `chunks` | array | List of relevant document chunks |
| `chunks[].content` | string | The text content of the chunk |
| `chunks[].document_name` | string | Source PDF filename |
| `chunks[].page_number` | int | Page number (if available) |
| `chunks[].metadata` | object | Additional metadata |
| `total_chunks` | int | Number of chunks returned |
### How Agents Use This
When an agent (like Claude) calls this tool:
1. Agent sends a search query
2. Server returns relevant document chunks
3. Agent uses chunks in its context to answer questions
**Example Agent Flow:**
```
User: "What are the main ML algorithms discussed?"
โ
Agent calls: retrieve_pdf_chunks("machine learning algorithms")
โ
Server returns: 3 relevant chunks from PDFs
โ
Agent reads chunks and generates answer for user
```
## ๐ Testing with MCP Inspector
The MCP Inspector is a web-based tool for testing and debugging MCP servers interactively.
### Running the Inspector
```bash
npx @modelcontextprotocol/inspector uv run python main.py
```
This command will:
1. Start the MCP Inspector proxy server
2. Launch your PDF Retrieval Server
3. Open a web browser with the Inspector UI
### What You'll See
The Inspector provides:
- **Tool Discovery**: View available tools (`retrieve_pdf_chunks`)
- **Interactive Testing**: Test queries with custom parameters
- **Real-time Responses**: See JSON responses in real-time
- **Request/Response Logs**: Debug the MCP protocol communication
### Example Inspector Workflow
1. **Open the Inspector** - Browser opens automatically at `http://localhost:6274`
2. **Wait for Initialization** - Server loads and indexes PDFs on first query (~1-2 minutes)
3. **Select Tool** - Click on `retrieve_pdf_chunks` in the tools list
4. **Enter Query** - Type your search query (e.g., "machine learning")
5. **Set Parameters** - Optionally adjust `max_chunks` (default: 5)
6. **Execute** - Click "Run" to see the results
7. **View Response** - Inspect the returned chunks and metadata
### Inspector Tips
- **First query is slow**: PDF indexing happens on first query (87 seconds for typical PDFs)
- **Subsequent queries are fast**: Embeddings are cached in ChromaDB
- **Fresh start**: Server clears ChromaDB on each restart for clean indexing
- **Check logs**: Terminal shows detailed logging of the indexing process
## ๐๏ธ Architecture
```
pdf_mcpserver/
โโโ src/
โ โโโ config.py # Configuration management
โ โโโ constants.py # Configuration constants
โ โโโ models.py # Pydantic response models
โ โโโ pdf_processor.py # PDF loading and hybrid retrieval
โ โโโ retrieval_handler.py # Document chunk retrieval
โโโ main.py # MCP server entry point
โโโ pyproject.toml # Project metadata
โโโ .env # Environment configuration
```
### Key Components
- **PDFProcessor**: Singleton class that loads PDFs, converts to Markdown using Docling, and builds hybrid retriever (BM25 + Vector Search)
- **RetrievalHandler**: Retrieves relevant chunks for queries - no LLM answer## ๐ง Configuration
Configuration is managed through environment variables. Create a `.env` file in the project root:
```bash
# Optional: PDF Documents Directory (defaults to ./documents)
PDF_DOCUMENTS_DIR=./documents
# Optional: ChromaDB Directory (defaults to ./chroma_db)
CHROMA_DB_DIR=./chroma_db
# Optional: Log Level (defaults to INFO)
LOG_LEVEL=INFO
```
### Configuration Options
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PDF_DOCUMENTS_DIR` | No | `./documents` | Directory containing PDF files to index |
| `CHROMA_DB_DIR` | No | `./chroma_db` | Directory for ChromaDB vector storage |
| `LOG_LEVEL` | No | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |
**Note**: No API keys required! ChromaDB uses free local embeddings (sentence-transformers).
## ๐งช Testing
Run unit tests:
```bash
uv run pytest tests/
```
## ๐ Troubleshooting
### No PDF files found
**Error**: `No PDF files found in ./documents`
**Solution**: Add PDF files to the `documents/` directory or update `PDF_DOCUMENTS_DIR` in `.env`
### Import errors
**Error**: `ModuleNotFoundError: No module named 'docling'`
**Solution**: Ensure all dependencies are installed: `uv sync`
### CUDA out of memory
**Error**: `CUDA out of memory`
**Solution**: The server is configured to use CPU-only mode. If you still see this error, check that `CUDA_VISIBLE_DEVICES=""` is set in `src/pdf_processor.py`
## ๐ Dependencies
- **fastmcp**: MCP server framework
- **docling**: Document processing and parsing
- **chromadb**: Vector database with free sentence-transformers embeddings
- **langchain**: RAG framework and retrievers
- **loguru**: Logging
**No paid APIs required!** All embeddings are generated locally using ChromaDB's default model (all-MiniLM-L6-v2).
## ๐ค Contributing
This is a Proof of Concept (PoC) implementation. For production use, consider:
- Adding caching for processed documents
- Implementing multi-agent workflow with fact verification
- Supporting additional document formats (DOCX, TXT, etc.)
- Adding authentication and rate limiting
## ๐ License
[Your License Here]
## ๐ Acknowledgments
Based on the [docchat-docling](https://github.com/HaileyTQuach/docchat-docling) architecture.