# MCP Server Knowledge Engine
A Model Context Protocol (MCP) server that turns a collection of PDF documents into a searchable knowledge base for Claude Desktop and other MCP clients. Search is built on TF-IDF scoring, proximity matching, and configurable domain-specific keywords.
## Key Features
- **Advanced Search Engine**: TF-IDF-based inverted index with proximity matching for highly relevant results
- **Universal PDF Support**: Process any PDF collection - technical docs, legal papers, research, and more
- **High Performance**: Cached search index, incremental processing, and background initialization
- **Domain Optimization**: Configure domain-specific keywords for enhanced search accuracy
- **Fully Configurable**: JSON-based configuration with environment variable support
- **Comprehensive CLI**: Complete server management through intuitive commands
- **Seamless MCP Integration**: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
- **Smart Caching**: MD5 hash-based change detection for efficient updates
## Quick Start
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Claude Desktop app (for MCP integration)
### 1. Installation
```bash
# Clone the repository
git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
cd mcp_server_knowledge_engine
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Create Your Server
```bash
# Interactive setup
python manage_server.py create-config
# This will ask you for:
# - Server name (e.g., 'legal-docs-server')
# - Display name (e.g., 'Legal Documents Server')
# - PDF folder location
# - Domain-specific keywords
```
### 3. Add PDF Documents
```bash
# Add individual PDFs
python manage_server.py add-pdf /path/to/document.pdf
python manage_server.py add-pdf /path/to/another-doc.pdf
# Or copy PDFs directly to your configured folder
```
### 4. Process Documents
```bash
# Convert PDFs to searchable format
python manage_server.py process-pdfs
```
### 5. Generate MCP Configuration
```bash
# Generate configuration for Claude Desktop
python generate_mcp_config.py --merge
# Or get the config to copy manually
python generate_mcp_config.py
```
### 6. Start Using with Claude
Restart Claude Desktop and your server will appear in the MCP tools menu!
## Using with Claude Desktop
Once configured, you can interact with your PDFs naturally:
**Example prompts:**
- "Search for information about [topic] in the documentation"
- "What does the documentation say about [specific feature]?"
- "Find all references to [keyword] across all PDFs"
- "Show me the content of [document name]"
- "List all available documents"
**Advanced usage:**
- "Search for [term1] near [term2]" - Leverages proximity matching
- "Get page 15 of [document]" - Retrieves specific pages
- "Find the top 10 results for [query]" - Adjusts result count
## Project Structure
```
mcp_server_knowledge_engine/
├── server.py                  # Main MCP server with search engine
├── config.py                  # Configuration management & validation
├── manage_server.py           # CLI for server management
├── generate_mcp_config.py     # MCP configuration generator
├── convert_pdfs.py            # Standalone PDF conversion utility
├── server_config.json         # Active server configuration
├── requirements.txt           # Python dependencies
├── examples/                  # Example configurations
│   ├── legal_docs_config.json
│   ├── medical_docs_config.json
│   ├── research_papers_config.json
│   └── tech_docs_config.json
└── your-pdfs/                 # Your PDF folder (configurable)
    ├── document1.pdf
    ├── document2.pdf
    └── markdown/              # Auto-generated cache
        ├── .pdf_cache.json    # Processing metadata
        ├── .search_index.pkl  # Cached search index
        ├── document1.md       # Converted documents
        └── document2.md
```
## Configuration
The server is configured via `server_config.json`:
```json
{
"server": {
"name": "my-docs-server",
"display_name": "My Documents Server",
"description": "Search through my PDF collection",
"version": "1.0.0"
},
"storage": {
"pdf_folder": "./docs",
"markdown_folder": "./docs/markdown",
"domain_keywords": ["keyword1", "keyword2", "domain-term"]
},
"tools": {
"search": {
"name": "search_docs",
"description": "Search through PDF documentation"
},
"list": {
"name": "list_docs",
"description": "List all available documents"
},
"content": {
"name": "get_document_content",
"description": "Get full content from documents"
},
"max_results_default": 5
},
"processing": {
"cache_enabled": true,
"parallel_processing": true,
"max_file_size_mb": 50,
"context_size": 500
}
}
```
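As a rough illustration of how such a file could be mapped onto typed objects, the sketch below loads the `storage` and `processing` sections into dataclasses. The class names, the `load_sections` helper, and the `PDF_FOLDER` environment variable are all hypothetical and chosen for this example; the real `config.py` may be organized differently.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StorageConfig:
    pdf_folder: str = "./docs"
    markdown_folder: str = "./docs/markdown"
    domain_keywords: List[str] = field(default_factory=list)


@dataclass
class ProcessingConfig:
    cache_enabled: bool = True
    parallel_processing: bool = True
    max_file_size_mb: int = 50
    context_size: int = 500


def load_sections(raw: dict, env: Optional[dict] = None):
    """Build typed sections from a parsed config dict.

    PDF_FOLDER is a hypothetical override variable, used for illustration only.
    """
    env = dict(os.environ) if env is None else env
    # Missing keys fall back to the dataclass defaults; unknown keys raise TypeError.
    storage = StorageConfig(**raw.get("storage", {}))
    processing = ProcessingConfig(**raw.get("processing", {}))
    if "PDF_FOLDER" in env:
        storage.pdf_folder = env["PDF_FOLDER"]
    return storage, processing


raw = json.loads("""
{
  "storage": {"pdf_folder": "./docs", "domain_keywords": ["contract"]},
  "processing": {"context_size": 300}
}
""")
storage, processing = load_sections(raw, env={})
print(storage.pdf_folder, processing.context_size, processing.max_file_size_mb)
```

Keeping defaults in the dataclasses means a partial config file still produces a complete configuration, and a misspelled key fails loudly instead of being silently ignored.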
## Management Commands
### Server Management
```bash
# Create new configuration
python manage_server.py create-config
# Test configuration
python manage_server.py test
# Generate MCP config
python manage_server.py generate-mcp-config
```
### PDF Management
```bash
# List all PDFs
python manage_server.py list-pdfs
# Add PDF
python manage_server.py add-pdf document.pdf
# Remove PDF
python manage_server.py remove-pdf document.pdf
# Process all PDFs
python manage_server.py process-pdfs
```
### MCP Configuration
```bash
# Print MCP config
python generate_mcp_config.py
# Automatically merge with Claude Desktop config
python generate_mcp_config.py --merge
# Save to file
python generate_mcp_config.py --output my_mcp_config.json
```
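For orientation, Claude Desktop reads MCP servers from an `mcpServers` object in its configuration file, where each entry names a command and its arguments. The sketch below shows one way a merge step might add an entry without disturbing existing ones. The `build_entry` and `merge_entry` helpers and the `MCP_SERVER_CONFIG` variable name are assumptions made for this example, not the actual behavior of `generate_mcp_config.py`.

```python
import json
import sys


def build_entry(server_script: str, config_path: str) -> dict:
    """Describe how an MCP client should launch the server."""
    return {
        "command": sys.executable,  # the interpreter running this script
        "args": [server_script],
        # MCP_SERVER_CONFIG is a hypothetical variable name for this sketch.
        "env": {"MCP_SERVER_CONFIG": config_path},
    }


def merge_entry(existing: dict, name: str, entry: dict) -> dict:
    """Return a copy of the client config with one server added or replaced."""
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "node", "args": ["index.js"]}}}
entry = build_entry("/abs/path/server.py", "/abs/path/server_config.json")
merged = merge_entry(existing, "my-docs-server", entry)
print(json.dumps(merged, indent=2))
```

Using absolute paths and the current interpreter avoids the common failure where the client launches a different Python than the one with the dependencies installed.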
## Usage Examples
### Legal Documents Server
```json
{
"server": {
"name": "legal-docs-server",
"display_name": "Legal Documents Server"
},
"storage": {
"domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
}
}
```
### Technical Documentation Server
```json
{
"server": {
"name": "tech-docs-server",
"display_name": "Technical Documentation Server"
},
"storage": {
"domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
}
}
```
### Research Papers Server
```json
{
"server": {
"name": "research-server",
"display_name": "Research Papers Server"
},
"storage": {
"domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
}
}
```
## Available MCP Tools
Each server provides three configurable tools:
1. **Search Tool** (default: `search_docs`)
- Intelligent search through all documents
- TF-IDF scoring with proximity matching
- Returns relevant excerpts with context
2. **List Tool** (default: `list_docs`)
- Lists all available documents
- Shows document metadata and page counts
3. **Content Tool** (default: `get_document_content`)
- Retrieves full document content
- Can fetch specific pages
- Includes complete markdown formatting
## Domain Customization
The server adapts to your domain through:
- **Domain Keywords**: Configure terms important to your field
- **Tool Names**: Customize tool names (e.g., `search_legal_docs`)
- **Descriptions**: Tailor descriptions for your use case
- **Context Size**: Adjust how much context to return in search results
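To make the context size setting concrete, here is a minimal sketch of snippet extraction around a match. It assumes `context_size` is measured in characters and split evenly on both sides of the match; the `extract_context` function is hypothetical, and the server may measure or trim context differently.

```python
def extract_context(text: str, term: str, context_size: int = 500) -> str:
    """Return the first match of term with context_size // 2 characters on each side."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""
    half = context_size // 2
    start = max(0, pos - half)
    end = min(len(text), pos + len(term) + half)
    # Mark truncation so the reader knows the snippet is partial.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


page = "a" * 100 + "needle" + "b" * 100
print(extract_context(page, "needle", context_size=20))
```

A smaller value keeps tool responses short, while a larger one gives the model more surrounding text to reason about.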
## How the Search Engine Works
### Inverted Index Architecture
The server answers queries from an inverted index instead of scanning every document:
1. **Document Processing**: PDFs are converted to markdown and tokenized
2. **Index Building**: Words are mapped to their locations (document, page, position)
3. **TF-IDF Scoring**:
- **TF (Term Frequency)**: How often a word appears in a document
- **IDF (Inverse Document Frequency)**: How rare a word is across all documents
- The combined score ranks a document higher when it uses a query term often and that term is rare across the rest of the collection
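The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the `SearchIndex` class from `server.py`: the `TinyIndex` name, the tokenizer, and the smoothed IDF formula are assumptions, and page numbers are left out to keep it short.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self):
        # word -> {doc_id: [token positions]}
        self.postings = defaultdict(dict)
        self.doc_lengths = {}

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for pos, word in enumerate(tokens):
            self.postings[word].setdefault(doc_id, []).append(pos)

    def search(self, query):
        n_docs = len(self.doc_lengths)
        scores = defaultdict(float)
        for word in tokenize(query):
            docs = self.postings.get(word, {})
            if not docs:
                continue
            # Rare words get a larger weight; +1 keeps common words above zero.
            idf = math.log(n_docs / len(docs)) + 1.0
            for doc_id, positions in docs.items():
                tf = len(positions) / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = TinyIndex()
index.add("contracts.md", "liability clause liability limits")
index.add("manual.md", "install the server and configure the server limits")
print(index.search("liability"))
print(index.search("limits"))
```

Because positions are stored alongside each posting, the same structure can later support proximity checks without re-reading the documents.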
### Search Features
- **Proximity Boosting**: Multi-word queries score higher when terms appear close together
- **Context Extraction**: Returns relevant snippets with search terms highlighted
- **Domain Keyword Recognition**: Configured keywords get special treatment
- **Page-Level Precision**: Results include specific page numbers
- **Smart Caching**: Search index persists between server restarts
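Proximity boosting can be illustrated with token positions for two query terms in the same document. The sketch below is one possible scheme under stated assumptions: the `window` of 10 tokens, the linear boost formula, and both function names are invented for this example and are not taken from the server's code.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position in a and any in b (both lists sorted)."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance the pointer that is behind to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_boost(base_score, positions_a, positions_b, window=10):
    """Scale the score up when the two terms occur within `window` tokens."""
    gap = min_distance(positions_a, positions_b)
    if gap > window:
        return base_score
    return base_score * (1.0 + (window - gap + 1) / window)


print(min_distance([3, 40], [5, 90]))
print(proximity_boost(1.0, [3], [4]))
print(proximity_boost(1.0, [0], [50]))
```

The two-pointer scan keeps the distance check linear in the number of occurrences, which matters for frequent terms in long documents.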
## Performance Optimizations
- **Incremental Processing**: MD5 hash-based change detection - only new/modified PDFs are processed
- **Persistent Search Index**: The pickled index is loaded from disk on restart instead of being rebuilt
- **Background Initialization**: Server accepts connections while building index
- **Memory Efficiency**: Streaming PDF processing and markdown storage
- **Configurable Limits**: Control file size limits and processing parameters
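Hash-based change detection, as described in the first item above, might look roughly like this. The cache layout (a JSON object mapping file names to MD5 digests) and the `find_changed` helper are assumptions for illustration; the actual contents of `.pdf_cache.json` are not documented here.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large PDFs are never loaded fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed(pdf_folder: Path, cache_file: Path):
    """Return (changed_pdf_names, new_cache) by comparing hashes with the stored cache."""
    old = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    new = {p.name: file_md5(p) for p in sorted(pdf_folder.glob("*.pdf"))}
    # A file is "changed" if it is new or its digest differs from the cached one.
    changed = [name for name, digest in new.items() if old.get(name) != digest]
    return changed, new
```

A caller would convert only the files in `changed` and then write `new` back to the cache file, so an unchanged collection costs one hashing pass and no conversions.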
## Troubleshooting
### Common Issues & Solutions
**Server not appearing in Claude Desktop:**
- Ensure MCP configuration was merged: `python generate_mcp_config.py --merge`
- Check Python path: `which python` or `where python` (Windows)
- Verify server_config.json exists and is valid JSON
- Restart Claude Desktop after configuration changes
**PDFs not processing:**
- Check folder permissions: `ls -la /path/to/pdf/folder`
- Verify PDF files aren't corrupted: `file document.pdf`
- Look for errors in stderr: `python server.py 2>error.log`
- Ensure sufficient disk space for markdown cache
**Search returns no/poor results:**
- Initial indexing may take time - check stderr for progress
- Verify markdown files exist: `ls markdown/*.md`
- Check search index exists: `ls markdown/.search_index.pkl`
- Try single-word queries first, then expand
- Review domain keywords in configuration
**Server crashes or hangs:**
- Check Python version (3.8+ required): `python --version`
- Verify all dependencies installed: `pip install -r requirements.txt`
- Clear cache and reprocess: `rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl`
- Check for file locking issues on Windows
### Debug Mode
```bash
# Run with full debug output
python server.py 2>&1 | tee debug.log
# Check server initialization
grep "initialization" debug.log
# Monitor PDF processing
grep "Processing\|Error" debug.log
```
### Validation Commands
```bash
# Test configuration validity
python manage_server.py test
# Verify configuration loading
python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'Config loaded: {c.server.name}')"
# Check MCP integration
python generate_mcp_config.py # Should output valid JSON
```
## Advanced Usage
### Multiple Servers
You can run multiple specialized servers:
```bash
# Legal documents server
python manage_server.py --config legal_config.json create-config
# Technical docs server
python manage_server.py --config tech_config.json create-config
# Research papers server
python manage_server.py --config research_config.json create-config
```
### Batch Processing
```bash
# Process multiple PDF folders
for folder in docs legal_docs tech_docs; do
python convert_pdfs.py "$folder" "$folder/markdown"
done
```
### Custom Keywords
Configure domain-specific keywords for better search relevance:
```json
{
"storage": {
"domain_keywords": [
"algorithm", "data structure", "complexity",
"optimization", "performance", "scalability"
]
}
}
```
## Architecture Overview
### Core Components
1. **SearchIndex Class** (`server.py:27-140`)
- Implements inverted index with TF-IDF scoring
- Handles word tokenization and document indexing
- Provides proximity-based ranking for multi-word queries
2. **GenericPDFServer Class** (`server.py:142-661`)
- Main server implementation with MCP protocol handling
- Manages PDF processing pipeline
- Handles async operations and background initialization
3. **Configuration System** (`config.py`)
- Dataclass-based type-safe configuration
- JSON schema validation
- Environment variable support
4. **Management CLI** (`manage_server.py`)
- Interactive configuration creation
- PDF management operations
- Server testing and validation
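The background initialization mentioned for `GenericPDFServer` can be pictured with a small `asyncio` sketch. The `BackgroundIndexServer` class below is a hypothetical stand-in, not the real implementation: it schedules index building as a task and answers early requests with a status message until the index is ready.

```python
import asyncio


class BackgroundIndexServer:
    def __init__(self):
        self.ready = asyncio.Event()
        self.index = {}
        self._task = None

    async def start(self):
        # Schedule the build and return immediately so requests can be served.
        self._task = asyncio.create_task(self._build_index())

    async def _build_index(self):
        await asyncio.sleep(0.05)  # stand-in for PDF conversion and indexing
        self.index = {"example": ["doc1.md"]}
        self.ready.set()

    async def search(self, word):
        if not self.ready.is_set():
            return "Index is still being built, try again shortly."
        return self.index.get(word, [])


async def main():
    server = BackgroundIndexServer()
    await server.start()
    early = await server.search("example")  # arrives before the index exists
    await server.ready.wait()
    late = await server.search("example")
    return early, late


early, late = asyncio.run(main())
print(early)
print(late)
```

Holding a reference to the task in `self._task` prevents it from being garbage-collected before it finishes.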
### Data Flow
```
PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
  ↓                       ↓                   ↓
[.pdf files]      [.md cache files]    [.search_index.pkl]
```
## Current Server Configuration
The repository currently includes a configuration for QuantConnect documentation (`server_config.json`). To create your own server:
```bash
# Option 1: Interactive setup
python manage_server.py create-config
# Option 2: Copy and modify an example
cp examples/tech_docs_config.json server_config.json
# Edit server_config.json with your settings
```
## Example Use Cases
- **Legal Firms**: Search through contracts, case files, and legal documents
- **Research Labs**: Query scientific papers and technical reports
- **Software Teams**: Access API documentation and technical specs
- **Medical Practices**: Search patient records and medical literature
- **Educational Institutions**: Browse course materials and textbooks
## Contributing
We welcome contributions! Here are some ways to help:
### Enhancement Ideas
1. **Document Format Support**: Add support for Word, HTML, or other formats
2. **Search Improvements**: Implement semantic search, fuzzy matching, or ML-based ranking
3. **Performance**: Add database backend, parallel processing, or distributed indexing
4. **Tools**: Create specialized MCP tools for specific domains
5. **UI**: Build a web interface for configuration management
### Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for new features
- Submit PRs with clear descriptions
## Security Considerations
- The server reads only from the configured PDF folder and writes only to its markdown cache folder
- No external network calls are made during operation
- Sensitive data remains local - nothing is sent to external services
- Configure appropriate file permissions for your PDF folders
## License
This project is open source. See LICENSE file for details.
## Acknowledgments
Built with the [Model Context Protocol](https://modelcontextprotocol.io/) by Anthropic.
---
**Ready to transform your PDFs into a searchable knowledge base?**
Run `python manage_server.py create-config` to get started!
## Dependencies
- **mcp**: Model Context Protocol SDK for building MCP servers
- **PyPDF2**: PDF parsing and text extraction
- **asyncio**: Asynchronous I/O for concurrent operations (part of the Python standard library, so no separate install is needed)
- **jsonschema**: JSON validation for configuration files
All dependencies are lightweight and have minimal system requirements.