Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@MCP Server Knowledge Engine Search for 'data privacy' near 'encryption' in the documentation"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
MCP Server Knowledge Engine
A powerful Model Context Protocol (MCP) server that transforms any PDF document collection into an intelligent, searchable knowledge base accessible through Claude Desktop. This server features advanced search capabilities using TF-IDF scoring, proximity matching, and domain-specific optimization.
🌟 Key Features
🔍 Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results
📄 Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more
⚡ High Performance: Cached search index, incremental processing, and background initialization
🎯 Domain Optimization: Configure domain-specific keywords for enhanced search accuracy
⚙️ Fully Configurable: JSON-based configuration with environment variable support
🛠️ Comprehensive CLI: Complete server management through intuitive commands
🔗 Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
📊 Smart Caching: MD5 hash-based change detection for efficient updates
📋 Quick Start
Prerequisites
Python 3.8 or higher
pip (Python package manager)
Claude Desktop app (for MCP integration)
1. Installation
2. Create Your Server
3. Add PDF Documents
4. Process Documents
5. Generate MCP Configuration
6. Start Using with Claude
Restart Claude Desktop and your server will appear in the MCP tools menu!
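For orientation, Claude Desktop reads its MCP servers from a JSON file with a top-level mcpServers object, where each entry names a command and its arguments. The sketch below shows roughly what merging one entry into that structure involves. It is not the code in generate_mcp_config.py: the helper merge_server_entry, the server name knowledge-engine, and the paths are all hypothetical.

```python
import json
from copy import deepcopy


def merge_server_entry(existing: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of a Claude Desktop config with one MCP server entry added.

    Hypothetical helper for illustration; not part of this project.
    """
    merged = deepcopy(existing)  # leave the caller's config untouched
    servers = merged.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}
    return merged


# An existing config that already lists another (made-up) server.
current = {"mcpServers": {"other-server": {"command": "node", "args": ["other.js"]}}}

# The server name and script path are placeholders.
updated = merge_server_entry(current, "knowledge-engine", "python", ["/path/to/server.py"])
print(json.dumps(updated, indent=2))
```

Merging into a copy rather than overwriting the file keeps any servers that were already configured.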
💬 Using with Claude Desktop
Once configured, you can interact with your PDFs naturally:
Example prompts:
"Search for information about [topic] in the documentation"
"What does the documentation say about [specific feature]?"
"Find all references to [keyword] across all PDFs"
"Show me the content of [document name]"
"List all available documents"
Advanced usage:
"Search for [term1] near [term2]" - Leverages proximity matching
"Get page 15 of [document]" - Retrieves specific pages
"Find the top 10 results for [query]" - Adjusts result count
📁 Project Structure
⚙️ Configuration
The server is configured via server_config.json:
🛠️ Management Commands
Server Management
PDF Management
MCP Configuration
💡 Usage Examples
Legal Documents Server
Technical Documentation Server
Research Papers Server
🔧 Available MCP Tools
Each server provides three configurable tools:
Search Tool (default: search_docs)
Intelligent search through all documents
TF-IDF scoring with proximity matching
Returns relevant excerpts with context
List Tool (default: list_docs)
Lists all available documents
Shows document metadata and page counts
Content Tool (default: get_document_content)
Retrieves full document content
Can fetch specific pages
Includes complete markdown formatting
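MCP clients invoke tools like these with a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a request for the default search tool, mainly to show the message shape. The argument names query and max_results are assumptions; the real argument names are defined by the server's tool schema.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" and "max_results" are assumed argument names, not confirmed by this project.
request = build_tool_call(
    1, "search_docs", {"query": "data privacy encryption", "max_results": 10}
)
print(json.dumps(request, indent=2))
```

In normal use Claude Desktop constructs these messages itself; building one by hand is only useful when testing a server directly.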
🎯 Domain Customization
The server adapts to your domain through:
Domain Keywords: Configure terms important to your field
Tool Names: Customize tool names (e.g., search_legal_docs)
Descriptions: Tailor descriptions for your use case
Context Size: Adjust how much context to return in search results
🔍 How the Search Engine Works
Inverted Index Architecture
The server uses an advanced inverted index for lightning-fast searches:
Document Processing: PDFs are converted to markdown and tokenized
Index Building: Words are mapped to their locations (document, page, position)
TF-IDF Scoring:
TF (Term Frequency): How often a word appears in a document
IDF (Inverse Document Frequency): How rare a word is across all documents
Combined score ensures relevant, unique results rank higher
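The steps above can be sketched in a few lines. This is a toy illustration of an inverted index with TF-IDF ranking, not the project's SearchIndex class: the class name TinyIndex, the tokenizer, and the smoothed IDF formula are assumptions chosen for brevity, and the sketch indexes whole documents rather than pages.

```python
import math
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with TF-IDF ranking (illustrative only)."""

    def __init__(self):
        # word -> {doc_id: [token positions]}
        self.postings = defaultdict(lambda: defaultdict(list))
        self.doc_lengths = {}

    def add(self, doc_id, text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        self.doc_lengths[doc_id] = len(tokens)
        for pos, tok in enumerate(tokens):
            self.postings[tok][doc_id].append(pos)

    def search(self, query):
        scores = defaultdict(float)
        n_docs = len(self.doc_lengths)
        for term in re.findall(r"[a-z0-9]+", query.lower()):
            docs = self.postings.get(term)
            if not docs:
                continue
            # Smoothed IDF: rarer terms get a larger weight.
            idf = math.log((1 + n_docs) / (1 + len(docs))) + 1.0
            for doc_id, positions in docs.items():
                tf = len(positions) / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = TinyIndex()
index.add("a.pdf", "encryption protects data privacy")
index.add("b.pdf", "data pipelines move data between systems")
index.add("c.pdf", "the weather is nice")

results = index.search("encryption data")
print([doc for doc, _ in results])  # ['a.pdf', 'b.pdf']
```

Here a.pdf ranks first because it matches both terms, and "encryption" is rarer across the collection than "data", so it carries more weight.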
Search Features
Proximity Boosting: Multi-word queries score higher when terms appear close together
Context Extraction: Returns relevant snippets with search terms highlighted
Domain Keyword Recognition: Configured keywords get special treatment
Page-Level Precision: Results include specific page numbers
Smart Caching: Search index persists between server restarts
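As a rough illustration of proximity boosting, the sketch below computes a score multiplier from the smallest distance between two query terms. The linear formula, the 10-token window, and the function names are assumptions made for the example; the server's actual ranking may differ.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(tokens, term_a, term_b):
    """Smallest gap in token positions between two terms, or None if either is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def proximity_boost(tokens, term_a, term_b, window=10):
    """Multiplier between 1.0 and 2.0: the closer the terms, the larger the boost."""
    dist = min_distance(tokens, term_a, term_b)
    if dist is None or dist > window:
        return 1.0  # terms missing or too far apart: no boost
    return 1.0 + (window - dist) / window


near = tokenize("data privacy relies on encryption")
far = tokenize("privacy " + "filler " * 20 + "encryption")

near_boost = proximity_boost(near, "privacy", "encryption")
far_boost = proximity_boost(far, "privacy", "encryption")
print(near_boost, far_boost)
```

A multiplier like this would be applied on top of the base TF-IDF score, so documents where the terms sit in the same sentence outrank those where they merely co-occur.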
📊 Performance Optimizations
Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed
Persistent Search Index: Pickled index is loaded from disk on server restart instead of being rebuilt
Background Initialization: Server accepts connections while building index
Memory Efficiency: Streaming PDF processing and markdown storage
Configurable Limits: Control file size limits and processing parameters
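The incremental-processing idea can be sketched as follows: hash each PDF, compare against the hashes saved from the previous run, and reprocess only the files whose hash differs. The function names and the cache layout (a JSON map of file name to MD5 digest) are assumptions; the project's own .pdf_cache.json format may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are never read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed(pdf_dir, cache_file):
    """Return (new or modified PDFs, fresh name -> hash map)."""
    cache_path = Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    new, changed = {}, []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        new[pdf.name] = file_md5(pdf)
        if old.get(pdf.name) != new[pdf.name]:
            changed.append(pdf)
    return changed, new


# Demo in a throwaway directory: first run, unchanged run, run after an edit.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "a.pdf"
    pdf.write_bytes(b"%PDF-1.4 first version")
    cache = Path(tmp) / ".pdf_cache.json"

    first, hashes = find_changed(tmp, cache)
    cache.write_text(json.dumps(hashes))
    second, _ = find_changed(tmp, cache)
    pdf.write_bytes(b"%PDF-1.4 second version")
    third, _ = find_changed(tmp, cache)
    summary = (len(first), len(second), len(third))

print(summary)  # (1, 0, 1)
```

MD5 is used here only to detect changes, not for security, so its speed matters more than its collision resistance.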
🐛 Troubleshooting
Common Issues & Solutions
Server not appearing in Claude Desktop:
Ensure MCP configuration was merged: python generate_mcp_config.py --merge
Check Python path: which python or where python (Windows)
Verify server_config.json exists and is valid JSON
Restart Claude Desktop after configuration changes
PDFs not processing:
Check folder permissions: ls -la /path/to/pdf/folder
Verify PDF files aren't corrupted: file document.pdf
Look for errors in stderr: python server.py 2>error.log
Ensure sufficient disk space for markdown cache
Search returns no/poor results:
Initial indexing may take time - check stderr for progress
Verify markdown files exist: ls markdown/*.md
Check search index exists: ls markdown/.search_index.pkl
Try single-word queries first, then expand
Review domain keywords in configuration
Server crashes or hangs:
Check Python version (3.8+ required): python --version
Verify all dependencies installed: pip install -r requirements.txt
Clear cache and reprocess: rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl
Check for file locking issues on Windows
Debug Mode
Validation Commands
🚀 Advanced Usage
Multiple Servers
You can run multiple specialized servers:
Batch Processing
Custom Keywords
Configure domain-specific keywords for better search relevance:
🏗️ Architecture Overview
Core Components
SearchIndex Class (server.py:27-140)
Implements inverted index with TF-IDF scoring
Handles word tokenization and document indexing
Provides proximity-based ranking for multi-word queries
GenericPDFServer Class (server.py:142-661)
Main server implementation with MCP protocol handling
Manages PDF processing pipeline
Handles async operations and background initialization
Configuration System (config.py)
Dataclass-based type-safe configuration
JSON schema validation
Environment variable support
Management CLI (manage_server.py)
Interactive configuration creation
PDF management operations
Server testing and validation
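To make the configuration component more concrete, here is a minimal sketch of a dataclass-based config loaded from JSON with environment variable overrides. It is not the contents of config.py: every field name, the KE_ prefix, and the load_config function are hypothetical, and a plain unknown-key check stands in for the jsonschema validation the project uses.

```python
import json
import os
from dataclasses import dataclass, field, fields


@dataclass
class ServerConfig:
    # All field names and defaults are hypothetical.
    name: str = "knowledge-engine"
    pdf_folder: str = "./pdfs"
    max_results: int = 5
    domain_keywords: list = field(default_factory=list)


def load_config(raw_json, env=None, prefix="KE_"):
    """Parse JSON into ServerConfig, then apply PREFIX + FIELD_NAME overrides."""
    env = os.environ if env is None else env
    data = json.loads(raw_json)
    unknown = set(data) - {f.name for f in fields(ServerConfig)}
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    cfg = ServerConfig(**data)
    for f in fields(cfg):
        override = env.get(prefix + f.name.upper())
        if override is None:
            continue
        current = getattr(cfg, f.name)
        if isinstance(current, int):
            value = int(override)  # numeric fields arrive as strings
        elif isinstance(current, list):
            value = [s.strip() for s in override.split(",") if s.strip()]
        else:
            value = override
        setattr(cfg, f.name, value)
    return cfg


cfg = load_config(
    '{"name": "legal", "max_results": 3}',
    env={"KE_MAX_RESULTS": "10", "KE_DOMAIN_KEYWORDS": "contract, liability"},
)
print(cfg)
```

Passing env explicitly keeps the function easy to test; when it is omitted, the real process environment is used.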
Data Flow
🔄 Current Server Configuration
The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:
📚 Example Use Cases
Legal Firms: Search through contracts, case files, and legal documents
Research Labs: Query scientific papers and technical reports
Software Teams: Access API documentation and technical specs
Medical Practices: Search patient records and medical literature
Educational Institutions: Browse course materials and textbooks
🤝 Contributing
We welcome contributions! Here are some ways to help:
Enhancement Ideas
Document Format Support: Add support for Word, HTML, or other formats
Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking
Performance: Add database backend, parallel processing, or distributed indexing
Tools: Create specialized MCP tools for specific domains
UI: Build a web interface for configuration management
Development Guidelines
Follow existing code style and patterns
Add tests for new functionality
Update documentation for new features
Submit PRs with clear descriptions
🔐 Security Considerations
The server only has read access to specified PDF folders
No external network calls are made during operation
Sensitive data remains local - nothing is sent to external services
Configure appropriate file permissions for your PDF folders
📄 License
This project is open source. See LICENSE file for details.
🙏 Acknowledgments
Built with the Model Context Protocol by Anthropic.
Ready to transform your PDFs into a searchable knowledge base?
Run python manage_server.py create-config to get started! 🚀
📦 Dependencies
mcp: Model Context Protocol SDK for building MCP servers
PyPDF2: PDF parsing and text extraction
asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, no separate install needed)
jsonschema: JSON validation for configuration files
All dependencies are lightweight and have minimal system requirements.