PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.
Key features:
- Utilizes LangChain for document chunking and processing, with configurable parameters for chunk size and overlap
- Converts PDF content to Markdown format for better processing and stores it in a parsing cache for efficient retrieval
- Uses OpenAI embeddings for semantic search, allowing intelligent document search and retrieval across PDF collections
Table of Contents
- 🚀 Quick Start
- 🏗️ Architecture Overview
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
🚀 Quick Start
Step 1: Install the Server
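Assuming the package is published on PyPI under the name used by the extras syntax later in this document (pdfkb-mcp), installation is a single command:

```shell
pip install pdfkb-mcp
```

Use a virtual environment or pipx if you prefer to keep the server isolated from other Python tooling.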
Step 2: Configure Your MCP Client
Claude Desktop (Most Common):
Configuration file locations:
- macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
- Windows:
%APPDATA%\Claude\claude_desktop_config.json
- Linux:
~/.config/Claude/claude_desktop_config.json
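A minimal entry for that file might look like the following sketch. The pdfkb server name matches the @pdfkb handle used with Continue below; the pdfkb-mcp command assumes the package installs a console script of that name.

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "pdfkb-mcp",
      "env": {
        "OPENAI_API_KEY": "sk-your-key",
        "KNOWLEDGEBASE_PATH": "/path/to/your/pdfs"
      }
    }
  }
}
```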
VS Code (Native MCP) - Create .vscode/mcp.json
in workspace:
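A sketch of that file, assuming VS Code's native MCP schema (a top-level servers map with stdio transport) and a pdfkb-mcp console script:

```json
{
  "servers": {
    "pdfkb": {
      "type": "stdio",
      "command": "pdfkb-mcp",
      "env": {
        "OPENAI_API_KEY": "sk-your-key",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      }
    }
  }
}
```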
Step 3: Verify Installation
- Restart your MCP client completely
- Check for PDF KB tools: Look for add_document, search_documents, list_documents, remove_document
- Test functionality: Try adding a PDF and searching for content
🏗️ Architecture Overview
MCP Integration
Available Tools & Resources
Tools (Actions your client can perform):
- add_document(path, metadata?) - Add PDF to knowledgebase
- search_documents(query, limit=5, metadata_filter?) - Semantic search across PDFs
- list_documents(metadata_filter?) - List all documents with metadata
- remove_document(document_id) - Remove document from knowledgebase
Resources (Data your client can access):
- pdf://{document_id} - Full document content as JSON
- pdf://{document_id}/page/{page_number} - Specific page content
- pdf://list - List of all documents with metadata
🎯 Parser Selection Guide
Decision Tree
Tier 1: General Purpose (most users)
Speed Optimized: pymupdf4llm - fastest processing and lowest memory use
Memory Efficient: pymupdf4llm - lowest RAM footprint of the available parsers
Tier 2: Use Case Specific (15% of users)
Academic Papers: mineru - excellent formula and table extraction
Business Documents: docling - excellent tables and images; pair with DOCLING_TABLE_MODE=ACCURATE
Multi-language Documents: marker (the default parser) is a reasonable starting point
Maximum Quality: mineru (GPU) or docling (CPU); the llm parser offers excellent fidelity at API cost
⚙️ Configuration
Essential Environment Variables
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
| CACHE_DIR | ./.cache | Cache directory for processing |
| PDF_PARSER | marker | Parser: marker, pymupdf4llm, mineru, docling, llm |
| CHUNK_SIZE | 1000 | Target chunk size for LangChain chunker |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model |
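The essentials above can be set as environment variables before launching the server; the values below are the documented defaults (the API key is a placeholder):

```shell
export OPENAI_API_KEY="sk-your-key"              # required; placeholder value
export KNOWLEDGEBASE_PATH="./pdfs"               # default PDF directory
export CACHE_DIR="./.cache"                      # default cache directory
export PDF_PARSER="marker"                       # default parser
export CHUNK_SIZE="1000"                         # default chunk size
export EMBEDDING_MODEL="text-embedding-3-large"  # default embedding model
```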
🖥️ MCP Client Setup
Claude Desktop
Configuration File Location:
- macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
- Windows:
%APPDATA%\Claude\claude_desktop_config.json
- Linux:
~/.config/Claude/claude_desktop_config.json
Configuration:
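A sketch of the mcpServers entry, assuming a pdfkb-mcp console script; adjust the paths and the placeholder key to your setup:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "pdfkb-mcp",
      "env": {
        "OPENAI_API_KEY": "sk-your-key",
        "KNOWLEDGEBASE_PATH": "/path/to/your/pdfs",
        "CACHE_DIR": "/path/to/your/.cache",
        "PDF_PARSER": "marker"
      }
    }
  }
}
```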
Verification:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
VS Code with Native MCP Support
Configuration (.vscode/mcp.json
in workspace):
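Assuming VS Code's native MCP schema (a servers map with stdio transport) and a pdfkb-mcp console script, the file might look like:

```json
{
  "servers": {
    "pdfkb": {
      "type": "stdio",
      "command": "pdfkb-mcp",
      "env": {
        "OPENAI_API_KEY": "sk-your-key",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      }
    }
  }
}
```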
Verification:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
VS Code with Continue Extension
Configuration (.continue/config.json
):
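Continue has exposed MCP servers under the experimental section of its config; the exact schema varies by Continue version, so treat this as a sketch (the pdfkb-mcp command is an assumed console script):

```json
{
  "experimental": {
    "modelContextProtocolServers": [
      {
        "transport": {
          "type": "stdio",
          "command": "pdfkb-mcp",
          "args": [],
          "env": {
            "OPENAI_API_KEY": "sk-your-key",
            "KNOWLEDGEBASE_PATH": "/path/to/pdfs"
          }
        }
      }
    ]
  }
}
```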
Verification:
- Reload VS Code window
- Check Continue panel for server connection
- Use
@pdfkb
in Continue chat
Generic MCP Client
Standard Configuration Template:
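Most stdio-based MCP clients accept some variation of the following shape (command, args, env); the pdfkb-mcp command is an assumed console script:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "pdfkb-mcp",
      "args": [],
      "env": {
        "OPENAI_API_KEY": "sk-your-key",
        "KNOWLEDGEBASE_PATH": "/path/to/pdfs",
        "CACHE_DIR": "/path/to/.cache"
      }
    }
  }
}
```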
📊 Performance & Troubleshooting
Common Issues
Server not appearing in MCP client: Restart the client completely, confirm the configuration file path and JSON syntax, and make sure the server command is on your PATH.
Processing too slow: Switch to the pymupdf4llm parser (PDF_PARSER=pymupdf4llm).
Memory issues: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE.
Poor table extraction: Use the docling parser with DOCLING_TABLE_MODE=ACCURATE.
Resource Requirements
| Configuration | RAM Usage | Processing Speed | Best For |
|---|---|---|---|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |
🔧 Advanced Configuration
Parser-Specific Options
MinerU Configuration:
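Only the parser selection itself is documented here; MinerU is installed separately (see the appendix). A minimal sketch:

```shell
# Install the parser dependency first: pip install "mineru[all]"
export PDF_PARSER="mineru"
```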
LLM Parser Configuration:
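The LLM parser requires an OpenRouter key in addition to the parser selection (the key below is a placeholder):

```shell
export PDF_PARSER="llm"
export OPENROUTER_API_KEY="sk-or-your-key"   # required by the llm parser; placeholder value
```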
Performance Tuning
High-Performance Setup:
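An illustrative combination based on the resource table above; the batch size is an assumption to tune against available RAM, not a documented recommendation:

```shell
export PDF_PARSER="mineru"           # Very fast with a GPU; 8-16 GB RAM per the table above
export EMBEDDING_BATCH_SIZE="200"    # above the default of 100; raise only if RAM allows
```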
Intelligent Caching
The server uses multi-stage caching:
- Parsing Cache: Stores converted markdown (src/pdfkb/intelligent_cache.py:139)
- Chunking Cache: Stores processed chunks
- Vector Cache: ChromaDB embeddings storage
Cache Invalidation Rules:
- Changing PDF_PARSER → Full reset (parsing + chunking + embeddings)
- Changing PDF_CHUNKER → Partial reset (chunking + embeddings)
- Changing EMBEDDING_MODEL → Minimal reset (embeddings only)
📚 Appendix
Installation Options
Primary (Recommended):
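Assuming a standard PyPI release under the pdfkb-mcp name (the name used by the docling-complete extra in the troubleshooting guide):

```shell
pip install pdfkb-mcp
```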
With Specific Parser Dependencies:
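Only the docling-complete extra is confirmed elsewhere in this document; extras for other parsers, if any, follow the same bracket syntax:

```shell
pip install "pdfkb-mcp[docling-complete]"   # Docling with all features
```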
Development Installation:
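The repository URL is not given here, so substitute your clone source; the directory name and editable install are the usual pattern:

```shell
git clone <repository-url>   # substitute the project's repository URL
cd pdfkb-mcp
pip install -e .             # editable install from the checkout
```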
Complete Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| OPENROUTER_API_KEY | (optional) | Required for the LLM parser |
| KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
| CACHE_DIR | ./.cache | Cache directory |
| PDF_PARSER | marker | PDF parser selection |
| PDF_CHUNKER | unstructured | Chunking strategy |
| CHUNK_SIZE | 1000 | LangChain chunk size |
| CHUNK_OVERLAP | 200 | LangChain chunk overlap |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
| EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
| VECTOR_SEARCH_K | 5 | Default number of search results |
| FILE_SCAN_INTERVAL | 60 | File monitoring interval |
| LOG_LEVEL | INFO | Logging level |
Parser Comparison Details
| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---|---|---|---|---|---|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |
Chunking Strategies
LangChain (PDF_CHUNKER=langchain):
- Header-aware splitting with MarkdownHeaderTextSplitter
- Configurable via CHUNK_SIZE and CHUNK_OVERLAP
- Best for customizable chunking
Unstructured (PDF_CHUNKER=unstructured):
- Intelligent semantic chunking with the unstructured library
- Zero configuration required
- Best for document structure awareness
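The header-aware strategy can be sketched in plain Python. This is a simplified illustration of the idea (split on Markdown headers first, then into overlapping fixed-size chunks), not the actual LangChain implementation:

```python
def chunk_markdown(text, chunk_size=1000, chunk_overlap=200):
    """Split markdown into header-delimited sections, then overlapping chunks."""
    # Pass 1: break the document at markdown header lines.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Pass 2: slice each section into chunks of chunk_size characters,
    # stepping by (chunk_size - chunk_overlap) so neighbors share context.
    chunks, step = [], chunk_size - chunk_overlap
    for section in sections:
        for start in range(0, len(section), step):
            chunks.append(section[start:start + chunk_size])
            if start + chunk_size >= len(section):
                break
    return chunks
```

Because chunks never cross a header boundary, each one stays topically coherent, which is what makes header-aware splitting attractive for retrieval.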
Troubleshooting Guide
API Key Issues:
- Verify the key format starts with sk-
- Check that the account has sufficient credits
- Test connectivity: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
Parser Installation Issues:
- MinerU: pip install mineru[all], then verify with mineru --version
- Docling: pip install docling for basic features, pip install pdfkb-mcp[docling-complete] for all features
- LLM: Requires the OPENROUTER_API_KEY environment variable
Performance Optimization:
- Speed: Use the pymupdf4llm parser
- Memory: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE
- Quality: Use mineru (GPU) or docling (CPU)
- Tables: Use docling with DOCLING_TABLE_MODE=ACCURATE
For additional support, see the implementation in src/pdfkb/main.py and src/pdfkb/config.py.