PDF Knowledgebase MCP Server
A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides advanced search capabilities powered by local, OpenAI, or HuggingFace embeddings and ChromaDB vector storage.
🆕 NEW Features:
- Reranking Support: Advanced result reranking using Qwen3-Reranker models (standard and GGUF) for improved search relevance
- GGUF Quantized Models: Memory-optimized local embeddings and rerankers with 50-70% smaller models using GGUF quantization
- Qwen3-Embedding Exclusive Support: Optimized support for the advanced Qwen3-Embedding model family only
- HuggingFace Inference Embeddings: Use HuggingFace Inference API with support for custom providers like Nebius
- Custom OpenAI Endpoints: Support for OpenAI-compatible APIs with custom base URLs
- Minimum Chunk Filtering: Automatically filter out short, low-information chunks below configurable character threshold
- Markdown Document Support: Native support for .md files with frontmatter parsing and page boundary detection
- Page-Based Chunking: Preserve document structure with intelligent page-level chunk boundaries
- Semantic Chunking: Advanced content-aware chunking using embedding similarity for better context preservation
- Local Embeddings: Run embeddings locally with HuggingFace models - no API costs, full privacy
- Hybrid Search: Combines semantic similarity with keyword matching (BM25) for superior search quality
- Web Interface: Modern web UI for document management and search alongside the traditional MCP protocol
Table of Contents
- 🚀 Quick Start
- 🌐 Web Interface
- 🏗️ Architecture Overview
- 🤖 Embedding Options
- 📝 Markdown Document Support
- 🔄 Reranking
- 🔍 Hybrid Search
- 🔽 Minimum Chunk Filtering
- 🧩 Semantic Chunking
- 🎯 Parser Selection Guide
- ⚙️ Configuration
- 🖥️ MCP Client Setup
- 📊 Performance & Troubleshooting
- 🔧 Advanced Configuration
- 📚 Appendix
🚀 Quick Start
Step 1: Configure Your MCP Client
Option A: Local Embeddings w/ Hybrid Search (No API Key Required)
🆕 Option A2: Local GGUF Embeddings (Memory Optimized, No API Key Required)
🆕 Option A3: Local Embeddings with Reranking (Best Search Quality, No API Key Required)
Option B: OpenAI Embeddings w/ Hybrid Search
🆕 Option C: HuggingFace w/ Custom Provider
🆕 Option D: Custom OpenAI-Compatible API
Step 3: Verify Installation
- Restart your MCP client completely
- Check for PDF KB tools: Look for `add_document`, `search_documents`, `list_documents`, `remove_document`
- Test functionality: Try adding a PDF and searching for content
🌐 Web Interface
The PDF Knowledgebase includes a modern web interface for easy document management and search. The web interface is disabled by default and must be explicitly enabled.
Server Modes
1. MCP Only Mode (Default):
- Runs only the MCP server for integration with Claude Desktop, VS Code, etc.
- Most resource-efficient option
- Best for pure MCP integration
2. Integrated Mode (MCP + Web):
- Runs both MCP server AND web interface concurrently
- Web interface available at http://localhost:8080
- Best of both worlds: API integration + web UI
Web Interface Features
Modern web interface showing document collection with search, filtering, and management capabilities
- 📄 Document Upload: Drag & drop PDF files or upload via file picker
- 🔍 Semantic Search: Powerful vector-based search with real-time results
- 📊 Document Management: List, preview, and manage your PDF collection
- 📈 Real-time Status: Live processing updates via WebSocket connections
- 🎯 Chunk Explorer: View and navigate document chunks for detailed analysis
- ⚙️ System Metrics: Monitor server performance and resource usage
Detailed document view showing metadata, chunk analysis, and content preview
Quick Web Setup
- Install and run:
- Open your browser: http://localhost:8080
- Configure environment (create `.env` file):
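A minimal sketch of such a `.env` file, using the web settings documented below; the `pdfkb-mcp` launch command assumes the console script shares the package name.

```bash
# .env: enable the web UI alongside the MCP server
PDFKB_WEB_ENABLE=true
PDFKB_WEB_PORT=8080
PDFKB_WEB_HOST=localhost
PDFKB_KNOWLEDGEBASE_PATH=./pdfs

# then start the server (assumed console-script name)
pdfkb-mcp
```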
Web Configuration Options
Environment Variable | Default | Description |
---|---|---|
PDFKB_WEB_ENABLE | false | Enable/disable web interface |
PDFKB_WEB_PORT | 8080 | Web server port |
PDFKB_WEB_HOST | localhost | Web server host |
PDFKB_WEB_CORS_ORIGINS | http://localhost:3000,http://127.0.0.1:3000 | CORS allowed origins |
Command Line Options
The server supports command line arguments:
API Documentation
When running with web interface enabled, comprehensive API documentation is available at:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
🏗️ Architecture Overview
MCP Integration
Internal Architecture
Available Tools & Resources
Tools (Actions your client can perform):
- `add_document(path, metadata?)` - Add PDF to knowledgebase
- `search_documents(query, limit=5, metadata_filter?, search_type?)` - Hybrid search across PDFs (semantic + keyword matching)
- `list_documents(metadata_filter?)` - List all documents with metadata
- `remove_document(document_id)` - Remove document from knowledgebase
Resources (Data your client can access):
- `pdf://{document_id}` - Full document content as JSON
- `pdf://{document_id}/page/{page_number}` - Specific page content
- `pdf://list` - List of all documents with metadata
🤖 Embedding Options
The server supports three embedding providers, each with different trade-offs:
1. Local Embeddings (Default)
Run embeddings locally using HuggingFace models, eliminating API costs and keeping your data completely private.
Features:
- Zero API Costs: No external API charges
- Complete Privacy: Documents never leave your machine
- Hardware Acceleration: Automatic detection of Metal (macOS), CUDA (NVIDIA), or CPU
- Smart Caching: LRU cache for frequently embedded texts
- Multiple Model Sizes: Choose based on your hardware capabilities
Local embeddings are enabled by default. No configuration needed for basic usage:
Supported Models
🆕 Qwen3-Embedding Series Only: The server now exclusively supports the Qwen3-Embedding model family, including both standard and quantized GGUF variants for optimized performance.
Standard Models
Model | Size | Dimensions | Max Context | Best For |
---|---|---|---|---|
Qwen/Qwen3-Embedding-0.6B (default) | 1.2GB | 1024 | 32K tokens | Best overall - long docs, fast |
Qwen/Qwen3-Embedding-4B | 8.0GB | 2560 | 32K tokens | High quality, long context |
Qwen/Qwen3-Embedding-8B | 16.0GB | 3584 | 32K tokens | Maximum quality, long context |
🆕 GGUF Quantized Models (Reduced Memory Usage)
Model | Size | Dimensions | Max Context | Best For |
---|---|---|---|---|
Qwen/Qwen3-Embedding-0.6B-GGUF | 0.6GB | 1024 | 32K tokens | Quantized lightweight, 32K context |
Qwen/Qwen3-Embedding-4B-GGUF | 2.4GB | 2560 | 32K tokens | Quantized high quality, 32K context |
Qwen/Qwen3-Embedding-8B-GGUF | 4.8GB | 3584 | 32K tokens | Quantized maximum quality, 32K context |
Configure your preferred model:
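For example, via the documented `PDFKB_LOCAL_EMBEDDING_MODEL` variable (model names from the tables above):

```bash
# choose one of the supported Qwen3-Embedding models
export PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"          # default, 1.2GB
# export PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-4B"          # higher quality, 8GB
# export PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B-GGUF"   # quantized, 0.6GB
```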
🆕 GGUF Quantization Options
When using GGUF models, you can configure the quantization level to balance between model size and quality:
Quantization Recommendations:
- Q6_K (default): Best balance of quality and size
- Q8_0: Near-original quality with moderate compression
- F16: Original quality, minimal compression
- Q4_K_M: Maximum compression, acceptable quality loss
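A sketch combining a GGUF model with the documented `PDFKB_GGUF_QUANTIZATION` setting:

```bash
export PDFKB_LOCAL_EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B-GGUF"
export PDFKB_GGUF_QUANTIZATION="Q6_K"   # or Q8_0, F16, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S
```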
Hardware Optimization
The server automatically detects and uses the best available hardware:
- Apple Silicon (M1/M2/M3): Uses Metal Performance Shaders (MPS)
- NVIDIA GPUs: Uses CUDA acceleration
- CPU Fallback: Optimized for multi-core processing
Force a specific device if needed:
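For example, with the documented `PDFKB_EMBEDDING_DEVICE` variable:

```bash
export PDFKB_EMBEDDING_DEVICE=cpu   # one of: auto (default), mps, cuda, cpu
```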
Configuration Options
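The local-embedding settings from the environment variable reference, collected here as a sketch (values shown are the documented defaults):

```bash
PDFKB_LOCAL_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B   # Qwen3-Embedding series only
PDFKB_EMBEDDING_DEVICE=auto                             # auto, mps, cuda, cpu
PDFKB_EMBEDDING_BATCH_SIZE=100                          # larger = faster, more memory
PDFKB_EMBEDDING_CACHE_SIZE=10000                        # LRU cache entries
PDFKB_USE_MODEL_OPTIMIZATION=true                       # torch.compile optimization
PDFKB_MODEL_CACHE_DIR=~/.cache/huggingface              # where models are downloaded
```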
2. OpenAI Embeddings
Use OpenAI's embedding API or any OpenAI-compatible endpoint for high-quality embeddings with minimal setup.
Features:
- High Quality: State-of-the-art embedding models
- No Local Resources: Runs entirely in the cloud
- Fast: Optimized API with batching support
- 🆕 Custom Endpoints: Support for OpenAI-compatible APIs like Together, Nebius, etc.
Standard OpenAI:
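A sketch of the documented OpenAI variables; how the active embedding provider is switched from local to OpenAI is not covered by the variable tables below, so only the OpenAI-specific settings are shown.

```bash
export PDFKB_OPENAI_API_KEY="sk-..."
export PDFKB_EMBEDDING_MODEL="text-embedding-3-large"   # or text-embedding-3-small for faster processing
```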
🆕 Custom OpenAI-Compatible Endpoints:
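A sketch pointing the OpenAI client at a compatible endpoint via the documented `PDFKB_OPENAI_API_BASE` variable (the Nebius URL is the example given in the reference table):

```bash
export PDFKB_OPENAI_API_KEY="your-provider-api-key"
export PDFKB_OPENAI_API_BASE="https://api.studio.nebius.com/v1/"
export PDFKB_EMBEDDING_MODEL="text-embedding-3-large"   # or the model name your provider expects
```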
3. HuggingFace Embeddings
🆕 ENHANCED: Use HuggingFace's Inference API with support for custom providers and thousands of embedding models.
Features:
- 🆕 Multiple Providers: Use HuggingFace directly or third-party providers like Nebius
- Wide Model Selection: Access to thousands of embedding models
- Cost-Effective: Many free or low-cost options available
- 🆕 Provider Support: Seamlessly switch between HuggingFace and custom inference providers
Configuration:
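A sketch of the documented HuggingFace variables; as with OpenAI above, the provider-selection switch and the API token variable are not listed in the reference tables, so they are not shown here.

```bash
export PDFKB_HUGGINGFACE_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
export PDFKB_HUGGINGFACE_PROVIDER="nebius"   # optional; leave unset to use HuggingFace directly
```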
Advanced Configuration:
Performance Tips
- Batch Size: Larger batches are faster but use more memory
- Apple Silicon: 32-64 recommended
- NVIDIA GPUs: 64-128 recommended
- CPU: 16-32 recommended
- Model Selection: Choose based on your needs
- Default (Qwen3-0.6B): Best for most users - 32K context, fast, 1.2GB
- GGUF (Qwen3-0.6B-GGUF): Memory-optimized version - 32K context, fast, 0.6GB
- High Quality (Qwen3-4B): Better accuracy - 32K context, 8GB
- GGUF High Quality (Qwen3-4B-GGUF): Memory-optimized high quality - 32K context, 2.4GB
- Maximum Quality (Qwen3-8B): Best accuracy - 32K context, 16GB
- GGUF Maximum Quality (Qwen3-8B-GGUF): Memory-optimized maximum quality - 32K context, 4.8GB
- GGUF Quantization: Choose based on memory constraints
- Q6_K (default): Best balance of quality and size
- Q8_0: Higher quality, larger size
- F16: Near-original quality, largest size
- Q4_K_M: Smallest size, acceptable quality
- Memory Management: The server automatically handles OOM errors by reducing batch size
📝 Markdown Document Support
The server now supports Markdown documents (.md, .markdown) alongside PDFs, perfect for:
- Pre-processed documents where you've already extracted clean markdown
- Technical documentation and notes
- Avoiding complex PDF parsing for better quality content
- Faster processing with no conversion overhead
Features
- Native Processing: Markdown files are read directly without conversion
- Page Boundary Detection: Automatically splits documents on page markers like `--[PAGE: 142]--`
- Frontmatter Support: Automatically extracts YAML/TOML frontmatter metadata
- Title Extraction: Intelligently extracts titles from H1 headers or frontmatter
- Same Pipeline: Uses the same chunking, embedding, and search infrastructure as PDFs
- Mixed Collections: Search across both PDFs and Markdown documents seamlessly
Usage
Simply add Markdown files the same way you add PDFs:
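For example (paths are illustrative):

```bash
# Drop Markdown files into the knowledgebase directory and let the file monitor pick them up,
cp ./notes/architecture-notes.md ./pdfs/
# or call the same MCP tool used for PDFs: add_document(path="./pdfs/architecture-notes.md")
```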
Configuration
🔄 Reranking
🆕 NEW: The server now supports advanced reranking using multiple providers to significantly improve search result relevance and quality. Reranking is a post-processing step that re-orders initial search results based on deeper semantic understanding.
Supported Providers
- Local Models: Qwen3-Reranker models (both standard and GGUF quantized variants)
- DeepInfra API: Qwen3-Reranker-8B via DeepInfra's native API
How It Works
- Initial Search: Retrieves `limit + reranker_sample_additional` candidates using hybrid/vector/text search
- Reranking: Uses Qwen3-Reranker to deeply analyze query-document relevance and re-score results
- Final Results: Returns the top `limit` results based on reranker scores
Supported Models
Local Models (Qwen3-Reranker Series)
Standard Models
Model | Size | Best For |
---|---|---|
Qwen/Qwen3-Reranker-0.6B (default) | 1.2GB | Lightweight, fast reranking |
Qwen/Qwen3-Reranker-4B | 8.0GB | High quality reranking |
Qwen/Qwen3-Reranker-8B | 16.0GB | Maximum quality reranking |
🆕 GGUF Quantized Models (Reduced Memory Usage)
Model | Size | Best For |
---|---|---|
Mungert/Qwen3-Reranker-0.6B-GGUF | 0.3GB | Quantized lightweight, very fast |
Mungert/Qwen3-Reranker-4B-GGUF | 2.0GB | Quantized high quality |
Mungert/Qwen3-Reranker-8B-GGUF | 4.0GB | Quantized maximum quality |
🆕 DeepInfra Model
Model | Best For |
---|---|
Qwen/Qwen3-Reranker-8B | High-quality cross-encoder reranking via DeepInfra API |
Configuration
Option 1: Local Reranking (Standard Models)
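A sketch using the documented reranker variables (model names from the table above):

```bash
export PDFKB_ENABLE_RERANKER=true
export PDFKB_RERANKER_PROVIDER=local
export PDFKB_RERANKER_MODEL="Qwen/Qwen3-Reranker-0.6B"   # or Qwen/Qwen3-Reranker-4B / -8B
export PDFKB_RERANKER_SAMPLE_ADDITIONAL=5                # extra candidates fetched before reranking
```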
Option 2: GGUF Quantized Local Reranking (Memory Optimized)
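A sketch for the GGUF variants, combining the model names above with the documented `PDFKB_RERANKER_GGUF_QUANTIZATION` setting:

```bash
export PDFKB_ENABLE_RERANKER=true
export PDFKB_RERANKER_PROVIDER=local
export PDFKB_RERANKER_MODEL="Mungert/Qwen3-Reranker-0.6B-GGUF"
export PDFKB_RERANKER_GGUF_QUANTIZATION="Q6_K"   # see the quantization recommendations below
```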
🆕 Option 3: DeepInfra Reranking (API-based)
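A sketch using the documented DeepInfra variables:

```bash
export PDFKB_ENABLE_RERANKER=true
export PDFKB_RERANKER_PROVIDER=deepinfra
export PDFKB_DEEPINFRA_API_KEY="your-deepinfra-api-key"
export PDFKB_DEEPINFRA_RERANKER_MODEL="Qwen/Qwen3-Reranker-8B"   # 0.6B and 4B variants also supported
```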
About DeepInfra Reranker:
- Supports three Qwen3-Reranker models:
- 0.6B: Lightweight model, fastest inference
- 4B: Balanced model with good quality and speed
- 8B: Maximum quality model (default)
- Optimized for high-quality cross-encoder relevance scoring
- Pay-per-use pricing model
- Get your API key at https://deepinfra.com
- Note: The API requires equal-length query and document arrays, so the query is duplicated for each document internally
Complete Examples
Local Reranking with GGUF Models
🆕 DeepInfra Reranking with Local Embeddings
Performance Impact
Search Quality: Reranking typically improves search relevance by 15-30% by better understanding query intent and document relevance.
Memory Usage:
- Local standard models: 1.2GB - 16GB depending on model size
- GGUF quantized: 0.3GB - 4GB depending on model and quantization
- DeepInfra: No local memory usage (API-based)
Speed:
- Local models: Adds ~100-500ms per search
- GGUF models: Slightly slower initial load, similar inference
- DeepInfra: Adds ~200-800ms depending on API latency
Cost:
- Local models: Free after initial download
- DeepInfra: Pay-per-use based on token usage
When to Use Reranking
✅ Recommended for:
- High-stakes searches where quality matters most
- Complex queries requiring nuanced understanding
- Large document collections with diverse content
- When you have adequate hardware resources
❌ Skip reranking for:
- Simple keyword-based searches
- Real-time applications requiring sub-100ms responses
- Limited memory/compute environments
- Very small document collections (<100 documents)
GGUF Quantization Recommendations
For GGUF reranker models, choose quantization based on your needs:
- Q6_K (recommended): Best balance of quality and size
- Q8_0: Near-original quality with moderate compression
- F16: Original quality, minimal compression
- Q4_K_M: Maximum compression, acceptable quality loss
- Q4_K_S: Small size, lower quality
- Q5_K_M: Medium compression and quality
- Q5_K_S: Smaller variant of Q5
🔍 Hybrid Search
The server now supports Hybrid Search, which combines the strengths of semantic similarity search (vector embeddings) with traditional keyword matching (BM25) for improved search quality.
How It Works
- Dual Indexing: Documents are indexed in both a vector database (ChromaDB) and a full-text search index (Whoosh)
- Parallel Search: Queries execute both semantic and keyword searches simultaneously
- Reciprocal Rank Fusion (RRF): Results are intelligently merged using RRF algorithm for optimal ranking
Benefits
- Better Recall: Finds documents that match exact keywords even if semantically different
- Improved Precision: Combines conceptual understanding with keyword relevance
- Technical Terms: Excellent for technical documentation, code references, and domain-specific terminology
- Balanced Results: Configurable weights let you adjust the balance between semantic and keyword matching
Configuration
Enable hybrid search by setting:
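For example, using the documented hybrid-search variables (the two weights must sum to 1):

```bash
export PDFKB_ENABLE_HYBRID_SEARCH=true
export PDFKB_HYBRID_VECTOR_WEIGHT=0.6   # semantic share
export PDFKB_HYBRID_TEXT_WEIGHT=0.4     # BM25/keyword share
export PDFKB_RRF_K=60                   # higher = less emphasis on rank differences
```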
Installation
To use hybrid search, install with the optional dependency:
Or if using uvx, it's included by default when hybrid search is enabled.
🔽 Minimum Chunk Filtering
NEW: The server now supports Minimum Chunk Filtering, which automatically filters out short, low-information chunks that don't contain enough content to be useful for search and retrieval.
How It Works
Documents are processed normally through parsing and chunking, then chunks below the configured character threshold are automatically filtered out before indexing and embedding.
Benefits
- Improved Search Quality: Eliminates noise from short, uninformative chunks
- Reduced Storage: Less vector storage and faster search by removing low-value content
- Better Context: Search results focus on chunks with substantial, meaningful content
- Configurable: Set custom thresholds based on your document types and use case
Configuration
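For example, using the documented `PDFKB_MIN_CHUNK_SIZE` variable:

```bash
export PDFKB_MIN_CHUNK_SIZE=120   # drop chunks shorter than 120 characters (0 = keep everything)
```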
Or in your MCP client configuration:
Usage Guidelines
- Default (0): No filtering - keeps all chunks for maximum recall
- Conservative (100-150): Good balance - removes very short chunks while preserving content
- Aggressive (200+): Strict filtering - only keeps substantial chunks with rich content
🧩 Semantic Chunking
NEW: The server now supports advanced Semantic Chunking, which uses embedding similarity to identify natural content boundaries, creating more coherent and contextually complete chunks than traditional methods.
How It Works
- Sentence Embedding: Each sentence in the document is embedded using your configured embedding model
- Similarity Analysis: Distances between consecutive sentence embeddings are calculated
- Breakpoint Detection: Natural content boundaries are identified where similarity drops significantly
- Intelligent Grouping: Related sentences are kept together in the same chunk
Benefits
- 40% Better Coherence: Chunks contain semantically related content
- Context Preservation: Important context stays together, reducing information loss
- Improved Retrieval: Better search results due to more meaningful chunks
- Flexible Configuration: Four different breakpoint detection methods for different document types
Quick Start
Enable semantic chunking by setting:
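For example, via the documented chunker selector:

```bash
export PDFKB_PDF_CHUNKER=semantic
```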
Or in your MCP client configuration:
Breakpoint Detection Methods
Method | Best For | Threshold Range | Description |
---|---|---|---|
percentile (default) | General documents | 90-99 | Split at top N% largest semantic gaps |
standard_deviation | Consistent style docs | 2.0-4.0 | Split at mean + N×σ distance |
interquartile | Noisy documents | 1.0-2.0 | Split at mean + N×IQR, robust to outliers |
gradient | Technical/legal docs | 90-99 | Analyze rate of change in similarity |
Configuration Options
Tuning Guidelines
- For General Documents (default): Use `percentile` with a `95.0` threshold - Good balance between chunk size and coherence
- For Technical Documentation: Use `gradient` with a `90.0` threshold - Better at detecting technical section boundaries
- For Academic Papers: Use `standard_deviation` with a `3.0` threshold - Maintains paragraph and section integrity
- For Mixed Content: Use `interquartile` with a `1.5` threshold - Robust against varying content styles
Installation
Install with the semantic chunking dependency:
Or if using uvx:
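A sketch of both forms; the `[semantic]` extra is documented in the appendix, while the uvx invocation and the `pdfkb-mcp` entry-point name are assumptions.

```bash
pip install "pdfkb-mcp[semantic]"

# with uvx (assumed invocation and entry-point name)
uvx --from "pdfkb-mcp[semantic]" pdfkb-mcp
```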
Compatibility
- Works with both local and OpenAI embeddings
- Compatible with all PDF parsers
- Integrates with intelligent caching system
- Falls back to the LangChain chunker if dependencies are missing
🎯 Parser Selection Guide
Decision Tree
Performance Comparison
Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
---|---|---|---|---|---|
PyMuPDF4LLM | Fastest | Low | Good | Basic-Moderate | RAG pipelines, bulk ingestion |
MinerU | Fast with GPU¹ | ~4GB VRAM² | Excellent | Excellent | Scientific/technical PDFs |
Docling | 0.9-2.5 pages/s³ | 2.5-6GB⁴ | Excellent | Excellent | Structured documents, tables |
Marker | ~25 p/s batch⁵ | ~4GB VRAM⁶ | Excellent | Good-Excellent⁷ | Scientific papers, multilingual |
LLM | Slow⁸ | Variable⁹ | Excellent¹⁰ | Excellent | Complex layouts, high-value docs |
Notes:
- ¹ >10,000 tokens/s on RTX 4090 with sglang
- ² Reported for <1B parameter model
- ³ CPU benchmarks: 0.92-1.34 p/s (native), 1.57-2.45 p/s (pypdfium)
- ⁴ 2.42-2.56GB (pypdfium), 6.16-6.20GB (native backend)
- ⁵ Projected on H100 GPU in batch mode
- ⁶ Benchmark configuration on NVIDIA A6000
- ⁷ Enhanced with optional LLM mode for table merging
- ⁸ Order of magnitude slower than traditional parsers
- ⁹ Depends on token usage and model size
- ¹⁰ 98.7-100% accuracy when given clean text
⚙️ Configuration
Tier 1: Basic Configurations (80% of users)
Default (Recommended):
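A `.env` sketch of the defaults, taken from the environment variable table below:

```bash
# .env: default configuration (all values are the documented defaults)
PDFKB_KNOWLEDGEBASE_PATH=./pdfs
PDFKB_CACHE_DIR=./.cache
PDFKB_PDF_PARSER=pymupdf4llm
PDFKB_PDF_CHUNKER=langchain
PDFKB_CHUNK_SIZE=1000
PDFKB_ENABLE_HYBRID_SEARCH=true
```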
Speed Optimized:
Memory Efficient:
Tier 2: Use Case Specific (15% of users)
Academic Papers:
Business Documents:
Multi-language Documents:
Hybrid Search (NEW - Improved Search Quality):
Semantic Chunking (NEW - Context-Aware Chunking):
Maximum Quality:
Essential Environment Variables
Variable | Default | Description |
---|---|---|
PDFKB_OPENAI_API_KEY | required (OpenAI provider only) | OpenAI API key for embeddings; not needed for local embeddings |
PDFKB_KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
PDFKB_CACHE_DIR | ./.cache | Cache directory for processing |
PDFKB_PDF_PARSER | pymupdf4llm | Parser: pymupdf4llm (default), marker, mineru, docling, llm |
PDFKB_PDF_CHUNKER | langchain | Chunking strategy: langchain (default), page, unstructured, semantic |
PDFKB_CHUNK_SIZE | 1000 | Target chunk size for LangChain chunker |
PDFKB_WEB_ENABLE | false | Enable/disable web interface |
PDFKB_WEB_PORT | 8080 | Web server port |
PDFKB_WEB_HOST | localhost | Web server host |
PDFKB_WEB_CORS_ORIGINS | http://localhost:3000,http://127.0.0.1:3000 | CORS allowed origins (comma-separated) |
PDFKB_EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model (use text-embedding-3-small for faster processing) |
PDFKB_MIN_CHUNK_SIZE | 0 | Minimum chunk size in characters (0 = disabled, filters out chunks smaller than this size) |
PDFKB_OPENAI_API_BASE | optional | Custom base URL for OpenAI-compatible APIs (e.g., https://api.studio.nebius.com/v1/) |
PDFKB_HUGGINGFACE_EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | HuggingFace model for embeddings when using huggingface provider |
PDFKB_HUGGINGFACE_PROVIDER | optional | HuggingFace provider (e.g., "nebius"), leave empty for default |
PDFKB_ENABLE_HYBRID_SEARCH | true | Enable hybrid search combining semantic and keyword matching |
PDFKB_HYBRID_VECTOR_WEIGHT | 0.6 | Weight for semantic search (0-1, must sum to 1 with text weight) |
PDFKB_HYBRID_TEXT_WEIGHT | 0.4 | Weight for keyword/BM25 search (0-1, must sum to 1 with vector weight) |
PDFKB_RRF_K | 60 | Reciprocal Rank Fusion constant (higher = less emphasis on rank differences) |
PDFKB_LOCAL_EMBEDDING_MODEL | Qwen/Qwen3-Embedding-0.6B | Local embedding model (Qwen3-Embedding series only) |
PDFKB_GGUF_QUANTIZATION | Q6_K | GGUF quantization level (Q8_0, F16, Q6_K, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S) |
PDFKB_ENABLE_RERANKER | false | Enable/disable result reranking for improved search quality |
PDFKB_RERANKER_PROVIDER | local | Reranker provider: 'local' or 'deepinfra' |
PDFKB_RERANKER_MODEL | Qwen/Qwen3-Reranker-0.6B | Reranker model for local provider |
PDFKB_RERANKER_SAMPLE_ADDITIONAL | 5 | Additional results to sample for reranking |
PDFKB_RERANKER_GGUF_QUANTIZATION | optional | GGUF quantization level (Q6_K, Q8_0, etc.) |
PDFKB_DEEPINFRA_API_KEY | required (deepinfra provider only) | DeepInfra API key for reranking |
PDFKB_DEEPINFRA_RERANKER_MODEL | Qwen/Qwen3-Reranker-8B | DeepInfra model: 0.6B, 4B, or 8B |
🖥️ MCP Client Setup
Claude Desktop
Configuration File Location:
- macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
- Windows:
%APPDATA%\Claude\claude_desktop_config.json
- Linux:
~/.config/Claude/claude_desktop_config.json
Configuration:
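A minimal sketch of the JSON to merge into `claude_desktop_config.json`; the server name, the `uvx pdfkb-mcp` command, and the env values are assumptions to adapt to your installation.

```bash
# Sketch only: merge the printed JSON into claude_desktop_config.json (paths listed above)
cat <<'EOF'
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "PDFKB_KNOWLEDGEBASE_PATH": "/path/to/your/pdfs"
      }
    }
  }
}
EOF
```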
Verification:
- Restart Claude Desktop completely
- Look for PDF KB tools in the interface
- Test with "Add a document" or "Search documents"
VS Code with Native MCP Support
Configuration (`.vscode/mcp.json` in workspace):
Verification:
- Reload VS Code window
- Check VS Code's MCP server status in Command Palette
- Use MCP tools in Copilot Chat
VS Code with Continue Extension
Configuration (`.continue/config.json`):
Verification:
- Reload VS Code window
- Check Continue panel for server connection
- Use `@pdfkb` in Continue chat
Generic MCP Client
Standard Configuration Template:
📊 Performance & Troubleshooting
Common Issues
Server not appearing in MCP client:
System overload when processing multiple PDFs:
Processing too slow:
Memory issues:
Poor table extraction:
Resource Requirements
Configuration | RAM Usage | Processing Speed | Best For |
---|---|---|---|
Speed | 2-4 GB | Fastest | Large collections |
Balanced | 4-6 GB | Medium | Most users |
Quality | 6-12 GB | Medium-Fast | Accuracy priority |
GPU | 8-16 GB | Very Fast | High-volume processing |
🔧 Advanced Configuration
Parser-Specific Options
MinerU Configuration:
LLM Parser Configuration:
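A sketch using the documented parser selector and OpenRouter key:

```bash
export PDFKB_PDF_PARSER=llm
export PDFKB_OPENROUTER_API_KEY="your-openrouter-api-key"   # required by the LLM parser
```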
Performance Tuning
Parallel Processing Configuration:
Control the number of concurrent operations to optimize performance and prevent system overload:
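The documented concurrency settings, shown together (values are the documented defaults):

```bash
export PDFKB_MAX_PARALLEL_PARSING=1      # concurrent PDF parsing operations
export PDFKB_MAX_PARALLEL_EMBEDDING=1    # concurrent embedding operations
export PDFKB_BACKGROUND_QUEUE_WORKERS=2  # background processing workers
export PDFKB_THREAD_POOL_SIZE=1          # threads for CPU-intensive tasks
```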
Resource-Optimized Setup (for low-powered systems):
High-Performance Setup (for powerful machines):
Complete High-Performance Setup:
Intelligent Caching
The server uses multi-stage caching:
- Parsing Cache: Stores converted markdown (`src/pdfkb/intelligent_cache.py:139`)
- Chunking Cache: Stores processed chunks
- Vector Cache: ChromaDB embeddings storage
Cache Invalidation Rules:
- Changing `PDFKB_PDF_PARSER` → Full reset (parsing + chunking + embeddings)
- Changing `PDFKB_PDF_CHUNKER` → Partial reset (chunking + embeddings)
- Changing `PDFKB_EMBEDDING_MODEL` → Minimal reset (embeddings only)
📚 Appendix
Installation Options
Primary (Recommended):
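A sketch of the basic install; the `pdfkb-mcp` package name appears in the extras commands below, while the uvx form and the console-script name are assumptions.

```bash
# run directly with uvx (assumed invocation)
uvx pdfkb-mcp

# or install with pip
pip install pdfkb-mcp
```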
With Specific Parser Dependencies:
pip install "pdfkb-mcp[web]" # Enhanced web features Or via pip/pipx:
Development Installation:
Complete Environment Variables Reference
Variable | Default | Description |
---|---|---|
PDFKB_OPENAI_API_KEY | required (OpenAI provider only) | OpenAI API key for embeddings; not needed for local embeddings |
PDFKB_OPENROUTER_API_KEY | optional | Required for LLM parser |
PDFKB_KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
PDFKB_CACHE_DIR | ./.cache | Cache directory |
PDFKB_PDF_PARSER | pymupdf4llm | PDF parser selection |
PDFKB_PDF_CHUNKER | langchain | Chunking strategy: langchain, page, unstructured, semantic |
PDFKB_CHUNK_SIZE | 1000 | LangChain chunk size |
PDFKB_CHUNK_OVERLAP | 200 | LangChain chunk overlap |
PDFKB_MIN_CHUNK_SIZE | 0 | Minimum chunk size in characters (0 = disabled, filters out chunks smaller than this size) |
PDFKB_EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
PDFKB_OPENAI_API_BASE | optional | Custom base URL for OpenAI-compatible APIs |
PDFKB_HUGGINGFACE_EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | HuggingFace model |
PDFKB_HUGGINGFACE_PROVIDER | optional | HuggingFace provider (e.g., "nebius") |
PDFKB_LOCAL_EMBEDDING_MODEL | Qwen/Qwen3-Embedding-0.6B | Local embedding model (Qwen3-Embedding series only) |
PDFKB_GGUF_QUANTIZATION | Q6_K | GGUF quantization level (Q8_0, F16, Q6_K, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S) |
PDFKB_EMBEDDING_DEVICE | auto | Hardware device (auto, mps, cuda, cpu) |
PDFKB_USE_MODEL_OPTIMIZATION | true | Enable torch.compile optimization |
PDFKB_EMBEDDING_CACHE_SIZE | 10000 | Number of cached embeddings in LRU cache |
PDFKB_MODEL_CACHE_DIR | ~/.cache/huggingface | Local model cache directory |
PDFKB_ENABLE_RERANKER | false | Enable/disable result reranking |
PDFKB_RERANKER_PROVIDER | local | Reranker provider: 'local' or 'deepinfra' |
PDFKB_RERANKER_MODEL | Qwen/Qwen3-Reranker-0.6B | Reranker model for local provider |
PDFKB_RERANKER_SAMPLE_ADDITIONAL | 5 | Additional results to sample for reranking |
PDFKB_RERANKER_DEVICE | auto | Hardware device for local reranker (auto, mps, cuda, cpu) |
PDFKB_RERANKER_MODEL_CACHE_DIR | ~/.cache/pdfkb-mcp/reranker | Cache directory for local reranker models |
PDFKB_RERANKER_GGUF_QUANTIZATION | optional | GGUF quantization level (Q6_K, Q8_0, etc.) |
PDFKB_DEEPINFRA_API_KEY | required (deepinfra provider only) | DeepInfra API key for reranking |
PDFKB_DEEPINFRA_RERANKER_MODEL | Qwen/Qwen3-Reranker-8B | Model: Qwen/Qwen3-Reranker-0.6B, 4B, or 8B |
PDFKB_EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
PDFKB_MAX_PARALLEL_PARSING | 1 | Max concurrent PDF parsing operations |
PDFKB_MAX_PARALLEL_EMBEDDING | 1 | Max concurrent embedding operations |
PDFKB_BACKGROUND_QUEUE_WORKERS | 2 | Number of background processing workers |
PDFKB_THREAD_POOL_SIZE | 1 | Thread pool size for CPU-intensive tasks |
PDFKB_VECTOR_SEARCH_K | 5 | Default search results |
PDFKB_FILE_SCAN_INTERVAL | 60 | File monitoring interval |
PDFKB_LOG_LEVEL | INFO | Logging level |
PDFKB_WEB_ENABLE | false | Enable/disable web interface |
PDFKB_WEB_PORT | 8080 | Web server port |
PDFKB_WEB_HOST | localhost | Web server host |
PDFKB_WEB_CORS_ORIGINS | http://localhost:3000,http://127.0.0.1:3000 | CORS allowed origins (comma-separated) |
Parser Comparison Details
Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
---|---|---|---|---|---|
Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
Memory | Lowest | Medium | High | Medium | Lowest |
Tables | Basic | Good | Excellent | Excellent | Excellent |
Formulas | Basic | Good | Excellent | Good | Excellent |
Images | Basic | Good | Good | Excellent | Excellent |
Setup | Simple | Simple | Moderate | Simple | Simple |
Cost | Free | Free | Free | Free | API costs |
Chunking Strategies
LangChain (`PDFKB_PDF_CHUNKER=langchain`):
- Header-aware splitting with `MarkdownHeaderTextSplitter`
- Configurable via `PDFKB_CHUNK_SIZE` and `PDFKB_CHUNK_OVERLAP`
- Best for customizable chunking
- Default and installed with base package

Page (`PDFKB_PDF_CHUNKER=page`) 🆕 NEW:
- Page-based chunking that preserves document page boundaries
- Works with page-aware parsers that output individual pages
- Supports merging small pages and splitting large ones
- Configurable via `PDFKB_PAGE_CHUNKER_MIN_CHUNK_SIZE` and `PDFKB_PAGE_CHUNKER_MAX_CHUNK_SIZE`
- Best for preserving original document structure and page-level metadata

Semantic (`PDFKB_PDF_CHUNKER=semantic`):
- Advanced semantic chunking using LangChain's `SemanticChunker`
- Groups semantically related content together using embedding similarity
- Four breakpoint detection methods: percentile, standard_deviation, interquartile, gradient
- Preserves context and improves retrieval quality by 40%
- Install extra: `pip install "pdfkb-mcp[semantic]"` to enable
- Configurable via environment variables (see Semantic Chunking section)
- Best for documents requiring high context preservation

Unstructured (`PDFKB_PDF_CHUNKER=unstructured`):
- Intelligent semantic chunking with the `unstructured` library
- Zero configuration required
- Install extra: `pip install "pdfkb-mcp[unstructured_chunker]"` to enable
- Best for document structure awareness
First-run notes
- On the first run, the server initializes caches and vector store and logs selected components:
- Parser: PyMuPDF4LLM (default)
- Chunker: LangChain (default)
- Embedding Model: text-embedding-3-large (default)
- If you select a parser/chunker that isn’t installed, the server logs a warning with the exact install command and falls back to the default components instead of exiting.
Troubleshooting Guide
API Key Issues:
- Verify key format starts with `sk-`
- Check account has sufficient credits
- Test connectivity: `curl -H "Authorization: Bearer $PDFKB_OPENAI_API_KEY" https://api.openai.com/v1/models`
Parser Installation Issues:
- MinerU: `pip install mineru[all]` and verify `mineru --version`
- Docling: `pip install docling` for basic, `pip install pdfkb-mcp[docling-complete]` for all features
- LLM: Requires the `PDFKB_OPENROUTER_API_KEY` environment variable
Performance Optimization:
- Speed: Use the `pymupdf4llm` parser (fastest, low memory footprint)
- Memory: Reduce `PDFKB_EMBEDDING_BATCH_SIZE` and `PDFKB_CHUNK_SIZE`; use the pypdfium backend for Docling
- Quality: Use `mineru` with GPU (>10K tokens/s on RTX 4090) or `marker` for balanced quality
- Tables: Use `docling` with `PDFKB_DOCLING_TABLE_MODE=ACCURATE` or `marker` with LLM mode
- Batch Processing: Use `marker` on H100 (~25 pages/s) or `mineru` with sglang acceleration
For additional support, see implementation details in `src/pdfkb/main.py` and `src/pdfkb/config.py`.