MCP Server Knowledge Engine

by lhstorm

A powerful Model Context Protocol (MCP) server that transforms any PDF document collection into an intelligent, searchable knowledge base accessible through Claude Desktop. This server features advanced search capabilities using TF-IDF scoring, proximity matching, and domain-specific optimization.

🌟 Key Features

  • 🔍 Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results

  • 📄 Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more

  • ⚡ High Performance: Cached search index, incremental processing, and background initialization

  • 🎯 Domain Optimization: Configure domain-specific keywords for enhanced search accuracy

  • ⚙️ Fully Configurable: JSON-based configuration with environment variable support

  • 🛠️ Comprehensive CLI: Complete server management through intuitive commands

  • 🔗 Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients

  • 📊 Smart Caching: MD5 hash-based change detection for efficient updates

📋 Quick Start

Prerequisites

  • Python 3.8 or higher

  • pip (Python package manager)

  • Claude Desktop app (for MCP integration)

1. Installation

    # Clone the repository
    git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
    cd mcp_server_knowledge_engine

    # Create virtual environment (recommended)
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

    # Install dependencies
    pip install -r requirements.txt

2. Create Your Server

    # Interactive setup
    python manage_server.py create-config

    # This will ask you for:
    # - Server name (e.g., 'legal-docs-server')
    # - Display name (e.g., 'Legal Documents Server')
    # - PDF folder location
    # - Domain-specific keywords

3. Add PDF Documents

    # Add individual PDFs
    python manage_server.py add-pdf /path/to/document.pdf
    python manage_server.py add-pdf /path/to/another-doc.pdf

    # Or copy PDFs directly to your configured folder

4. Process Documents

    # Convert PDFs to searchable format
    python manage_server.py process-pdfs

5. Generate MCP Configuration

    # Generate configuration for Claude Desktop
    python generate_mcp_config.py --merge

    # Or get the config to copy manually
    python generate_mcp_config.py

6. Start Using with Claude

Restart Claude Desktop and your server will appear in the MCP tools menu!

💬 Using with Claude Desktop

Once configured, you can interact with your PDFs naturally:

Example prompts:

  • "Search for information about [topic] in the documentation"

  • "What does the documentation say about [specific feature]?"

  • "Find all references to [keyword] across all PDFs"

  • "Show me the content of [document name]"

  • "List all available documents"

Advanced usage:

  • "Search for [term1] near [term2]" - Leverages proximity matching

  • "Get page 15 of [document]" - Retrieves specific pages

  • "Find the top 10 results for [query]" - Adjusts result count

📁 Project Structure

    mcp_server_knowledge_engine/
    ├── server.py                 # Main MCP server with search engine
    ├── config.py                 # Configuration management & validation
    ├── manage_server.py          # CLI for server management
    ├── generate_mcp_config.py    # MCP configuration generator
    ├── convert_pdfs.py           # Standalone PDF conversion utility
    ├── server_config.json        # Active server configuration
    ├── requirements.txt          # Python dependencies
    ├── examples/                 # Example configurations
    │   ├── legal_docs_config.json
    │   ├── medical_docs_config.json
    │   ├── research_papers_config.json
    │   └── tech_docs_config.json
    └── your-pdfs/                # Your PDF folder (configurable)
        ├── document1.pdf
        ├── document2.pdf
        └── markdown/             # Auto-generated cache
            ├── .pdf_cache.json   # Processing metadata
            ├── .search_index.pkl # Cached search index
            ├── document1.md      # Converted documents
            └── document2.md

⚙️ Configuration

The server is configured via server_config.json:

    {
      "server": {
        "name": "my-docs-server",
        "display_name": "My Documents Server",
        "description": "Search through my PDF collection",
        "version": "1.0.0"
      },
      "storage": {
        "pdf_folder": "./docs",
        "markdown_folder": "./docs/markdown",
        "domain_keywords": ["keyword1", "keyword2", "domain-term"]
      },
      "tools": {
        "search": {
          "name": "search_docs",
          "description": "Search through PDF documentation"
        },
        "list": {
          "name": "list_docs",
          "description": "List all available documents"
        },
        "content": {
          "name": "get_document_content",
          "description": "Get full content from documents"
        },
        "max_results_default": 5
      },
      "processing": {
        "cache_enabled": true,
        "parallel_processing": true,
        "max_file_size_mb": 50,
        "context_size": 500
      }
    }
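As a rough illustration of how a file like this could be read into typed objects, here is a minimal sketch using only the standard library. It is not the code in config.py: the class names `ServerInfo` and `StorageConfig`, the function `load_config`, and the `PDF_FOLDER` environment variable are all hypothetical, chosen only to show the idea of dataclass-based loading with an environment override.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerInfo:
    name: str
    display_name: str = ""
    description: str = ""
    version: str = "1.0.0"


@dataclass
class StorageConfig:
    pdf_folder: str
    markdown_folder: str
    domain_keywords: List[str] = field(default_factory=list)


def load_config(text, env=os.environ):
    """Parse the JSON text and apply an optional environment override."""
    raw = json.loads(text)
    server = ServerInfo(**raw["server"])
    storage = StorageConfig(**raw["storage"])
    # Hypothetical override: the real variable names live in config.py.
    if "PDF_FOLDER" in env:
        storage.pdf_folder = env["PDF_FOLDER"]
    return server, storage


EXAMPLE = """
{"server": {"name": "my-docs-server", "display_name": "My Documents Server"},
 "storage": {"pdf_folder": "./docs", "markdown_folder": "./docs/markdown",
             "domain_keywords": ["keyword1"]}}
"""

server, storage = load_config(EXAMPLE, env={"PDF_FOLDER": "/data/pdfs"})
print(server.name, storage.pdf_folder)
```

Unknown keys in the `server` or `storage` sections raise a `TypeError` in this sketch, which gives a crude form of validation; the project itself lists jsonschema for that purpose.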

🛠️ Management Commands

Server Management

    # Create new configuration
    python manage_server.py create-config

    # Test configuration
    python manage_server.py test

    # Generate MCP config
    python manage_server.py generate-mcp-config

PDF Management

    # List all PDFs
    python manage_server.py list-pdfs

    # Add PDF
    python manage_server.py add-pdf document.pdf

    # Remove PDF
    python manage_server.py remove-pdf document.pdf

    # Process all PDFs
    python manage_server.py process-pdfs

MCP Configuration

    # Print MCP config
    python generate_mcp_config.py

    # Automatically merge with Claude Desktop config
    python generate_mcp_config.py --merge

    # Save to file
    python generate_mcp_config.py --output my_mcp_config.json
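To make the merge step less opaque, the sketch below shows roughly what merging a server entry into a Claude Desktop configuration involves. Claude Desktop keeps its servers under the `mcpServers` key; everything else here is an assumption for illustration: the function name `merge_mcp_entry`, the example paths, and the `SERVER_CONFIG` environment variable are hypothetical and may not match what generate_mcp_config.py actually writes.

```python
import json
import sys


def merge_mcp_entry(existing, name, server_script, config_path):
    """Return a copy of the client config with one server entry added or replaced."""
    merged = dict(existing)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {
        "command": sys.executable,   # interpreter of the active virtual environment
        "args": [server_script],
        # Hypothetical variable name for pointing the server at its config file.
        "env": {"SERVER_CONFIG": config_path},
    }
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "node", "args": ["other.js"]}}}
merged = merge_mcp_entry(
    existing,
    "my-docs-server",
    "/abs/path/server.py",
    "/abs/path/server_config.json",
)
print(json.dumps(merged, indent=2))
```

The sketch copies the existing dictionaries rather than mutating them, so entries for other servers survive the merge and the original configuration can still be written back if something goes wrong.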

💡 Usage Examples

Legal Documents Server

    {
      "server": {
        "name": "legal-docs-server",
        "display_name": "Legal Documents Server"
      },
      "storage": {
        "domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
      }
    }

Technical Documentation Server

    {
      "server": {
        "name": "tech-docs-server",
        "display_name": "Technical Documentation Server"
      },
      "storage": {
        "domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
      }
    }

Research Papers Server

    {
      "server": {
        "name": "research-server",
        "display_name": "Research Papers Server"
      },
      "storage": {
        "domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
      }
    }

🔧 Available MCP Tools

Each server provides three configurable tools:

  1. Search Tool (default: search_docs)

    • Intelligent search through all documents

    • TF-IDF scoring with proximity matching

    • Returns relevant excerpts with context

  2. List Tool (default: list_docs)

    • Lists all available documents

    • Shows document metadata and page counts

  3. Content Tool (default: get_document_content)

    • Retrieves full document content

    • Can fetch specific pages

    • Includes complete markdown formatting
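One way to picture the "configurable tools" idea is a dispatch table keyed by the names from the `tools` section of server_config.json. The sketch below is an assumption-heavy illustration, not the server's code: the in-memory `DOCS` store, the three handler functions, and `build_dispatch` are hypothetical, and the real server exposes its tools through the MCP SDK rather than a plain dictionary.

```python
from typing import Callable, Dict, Optional

TOOLS_CONFIG = {
    "search": {"name": "search_docs", "description": "Search through PDF documentation"},
    "list": {"name": "list_docs", "description": "List all available documents"},
    "content": {"name": "get_document_content", "description": "Get full content from documents"},
    "max_results_default": 5,
}

# Hypothetical in-memory store: one string per page of each converted document.
DOCS = {"guide.md": ["page one text", "page two text"]}


def search(query: str, max_results: int = TOOLS_CONFIG["max_results_default"]):
    """Naive substring search returning (document, page number) pairs."""
    hits = [
        (doc, number)
        for doc, pages in DOCS.items()
        for number, text in enumerate(pages, start=1)
        if query.lower() in text.lower()
    ]
    return hits[:max_results]


def list_docs():
    return [{"name": doc, "pages": len(pages)} for doc, pages in DOCS.items()]


def get_content(document: str, page: Optional[int] = None):
    """Return one page (1-based) or the whole document."""
    pages = DOCS[document]
    return pages[page - 1] if page is not None else "\n\n".join(pages)


def build_dispatch(tools_config) -> Dict[str, Callable]:
    """Map each configured tool name to the handler for its role."""
    handlers = {"search": search, "list": list_docs, "content": get_content}
    return {tools_config[role]["name"]: fn for role, fn in handlers.items()}


dispatch = build_dispatch(TOOLS_CONFIG)
print(dispatch["get_document_content"]("guide.md", page=2))
```

Renaming a tool in the configuration (for example to search_legal_docs) changes only the key in the table; the handler behind it stays the same.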

🎯 Domain Customization

The server adapts to your domain through:

  • Domain Keywords: Configure terms important to your field

  • Tool Names: Customize tool names (e.g., search_legal_docs)

  • Descriptions: Tailor descriptions for your use case

  • Context Size: Adjust how much context to return in search results

🔍 How the Search Engine Works

Inverted Index Architecture

The server uses an inverted index, so a query looks up only the words it contains instead of scanning every document:

  1. Document Processing: PDFs are converted to markdown and tokenized

  2. Index Building: Words are mapped to their locations (document, page, position)

  3. TF-IDF Scoring:

    • TF (Term Frequency): How often a word appears in a document

    • IDF (Inverse Document Frequency): How rare a word is across all documents

    • Combined score: a document ranks higher when a query term is frequent in it and rare across the rest of the collection
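The three steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the SearchIndex class from server.py: it tracks (document, position) but not pages, uses the textbook weighting tf × log(N / df), and the function names `tokenize`, `build_index`, and `tf_idf_scores` are hypothetical.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: word -> {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def tf_idf_scores(index, docs, query):
    """Rank documents by the sum of tf * idf over the query words."""
    n_docs = len(docs)
    doc_len = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}
    scores = defaultdict(float)
    for word in tokenize(query):
        postings = index.get(word)
        if not postings:
            continue  # word does not occur anywhere
        idf = math.log(n_docs / len(postings))  # rarer words weigh more
        for doc_id, positions in postings.items():
            tf = len(positions) / doc_len[doc_id]  # share of the doc's words
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a.md": "contract liability clause",
    "b.md": "contract terms and contract renewal",
    "c.md": "weather report",
}
index = build_index(docs)
ranked = tf_idf_scores(index, docs, "liability contract")
print(ranked)
```

In this example a.md ranks above b.md even though b.md mentions "contract" twice, because "liability" occurs in only one document and therefore carries a higher IDF. Note that with this plain formula a word present in every document gets an IDF of zero; smoothed variants avoid that.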

Search Features

  • Proximity Boosting: Multi-word queries score higher when terms appear close together

  • Context Extraction: Returns relevant snippets with search terms highlighted

  • Domain Keyword Recognition: Configured keywords get special treatment

  • Page-Level Precision: Results include specific page numbers

  • Smart Caching: Search index persists between server restarts
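Proximity boosting can be illustrated with a small sketch that multiplies a base score by a factor depending on how close two query terms appear. The formula, the 10-word `window`, and the function names are assumptions made for this example; the server's actual boosting rule may differ.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(positions_a, positions_b):
    """Smallest gap between two sorted position lists (two-pointer scan)."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_boost(text, term_a, term_b, window=10):
    """Return a multiplier between 1.0 and 2.0; closer terms give a larger value."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 1.0  # one term is missing, so there is nothing to boost
    gap = min_distance(pos_a, pos_b)
    return 1.0 + max(0.0, (window - gap) / window)


near = proximity_boost("the liability clause limits contract damages", "liability", "contract")
far = proximity_boost("liability " + "filler " * 20 + "contract", "liability", "contract")
print(near, far)
```

The position lists needed here are exactly what the inverted index already stores, so a boost of this kind adds little work per candidate document.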

📊 Performance Optimizations

  • Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed

  • Persistent Search Index: The pickled index is loaded on server restart, so the index does not have to be rebuilt from scratch

  • Background Initialization: Server accepts connections while building index

  • Memory Efficiency: Streaming PDF processing and markdown storage

  • Configurable Limits: Control file size limits and processing parameters
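Hash-based change detection, as used for incremental processing, can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the cache layout (file name mapped to MD5 digest), the function names, and keeping the cache file next to the PDFs rather than in the markdown folder are all simplifications made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of(path):
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_pdfs(pdf_folder, cache_file):
    """Return names of PDFs that are new or modified since the last run."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    to_process = []
    for pdf in sorted(pdf_folder.glob("*.pdf")):
        digest = md5_of(pdf)
        if cache.get(pdf.name) != digest:
            to_process.append(pdf.name)
            cache[pdf.name] = digest
    cache_file.write_text(json.dumps(cache, indent=2))
    return to_process


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    cache_file = folder / ".pdf_cache.json"
    (folder / "a.pdf").write_bytes(b"first version")
    first = changed_pdfs(folder, cache_file)    # new file
    second = changed_pdfs(folder, cache_file)   # unchanged
    (folder / "a.pdf").write_bytes(b"second version")
    third = changed_pdfs(folder, cache_file)    # modified
    print(first, second, third)
```

A production version would likely record the new digest only after the PDF has been converted successfully, so that a failed conversion is retried on the next run.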

🐛 Troubleshooting

Common Issues & Solutions

Server not appearing in Claude Desktop:

  • Ensure MCP configuration was merged: python generate_mcp_config.py --merge

  • Check Python path: which python or where python (Windows)

  • Verify server_config.json exists and is valid JSON

  • Restart Claude Desktop after configuration changes

PDFs not processing:

  • Check folder permissions: ls -la /path/to/pdf/folder

  • Verify PDF files aren't corrupted: file document.pdf

  • Look for errors in stderr: python server.py 2>error.log

  • Ensure sufficient disk space for markdown cache

Search returns no/poor results:

  • Initial indexing may take time - check stderr for progress

  • Verify markdown files exist: ls markdown/*.md

  • Check search index exists: ls markdown/.search_index.pkl

  • Try single-word queries first, then expand

  • Review domain keywords in configuration

Server crashes or hangs:

  • Check Python version (3.8+ required): python --version

  • Verify all dependencies installed: pip install -r requirements.txt

  • Clear cache and reprocess: rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl

  • Check for file locking issues on Windows

Debug Mode

    # Run with full debug output
    python server.py 2>&1 | tee debug.log

    # Check server initialization
    grep "initialization" debug.log

    # Monitor PDF processing
    grep "Processing\|Error" debug.log

Validation Commands

    # Test configuration validity
    python manage_server.py test

    # Verify configuration loading
    python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'✓ Config loaded: {c.server.name}')"

    # Check MCP integration
    python generate_mcp_config.py  # Should output valid JSON

🚀 Advanced Usage

Multiple Servers

You can run multiple specialized servers:

    # Legal documents server
    python manage_server.py --config legal_config.json create-config

    # Technical docs server
    python manage_server.py --config tech_config.json create-config

    # Research papers server
    python manage_server.py --config research_config.json create-config

Batch Processing

    # Process multiple PDF folders
    for folder in docs legal_docs tech_docs; do
        python convert_pdfs.py "$folder" "$folder/markdown"
    done

Custom Keywords

Configure domain-specific keywords for better search relevance:

    {
      "storage": {
        "domain_keywords": [
          "algorithm",
          "data structure",
          "complexity",
          "optimization",
          "performance",
          "scalability"
        ]
      }
    }

🏗️ Architecture Overview

Core Components

  1. SearchIndex Class (server.py:27-140)

    • Implements inverted index with TF-IDF scoring

    • Handles word tokenization and document indexing

    • Provides proximity-based ranking for multi-word queries

  2. GenericPDFServer Class (server.py:142-661)

    • Main server implementation with MCP protocol handling

    • Manages PDF processing pipeline

    • Handles async operations and background initialization

  3. Configuration System (config.py)

    • Dataclass-based type-safe configuration

    • JSON schema validation

    • Environment variable support

  4. Management CLI (manage_server.py)

    • Interactive configuration creation

    • PDF management operations

    • Server testing and validation

Data Flow

    PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
      ↓                        ↓                  ↓
    [.pdf files]        [.md cache files]   [.search_index.pkl]

🔄 Current Server Configuration

The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:

    # Option 1: Interactive setup
    python manage_server.py create-config

    # Option 2: Copy and modify an example
    cp examples/tech_docs_config.json server_config.json
    # Edit server_config.json with your settings

📚 Example Use Cases

  • Legal Firms: Search through contracts, case files, and legal documents

  • Research Labs: Query scientific papers and technical reports

  • Software Teams: Access API documentation and technical specs

  • Medical Practices: Search patient records and medical literature

  • Educational Institutions: Browse course materials and textbooks

🤝 Contributing

We welcome contributions! Here are some ways to help:

Enhancement Ideas

  1. Document Format Support: Add support for Word, HTML, or other formats

  2. Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking

  3. Performance: Add database backend, parallel processing, or distributed indexing

  4. Tools: Create specialized MCP tools for specific domains

  5. UI: Build a web interface for configuration management

Development Guidelines

  • Follow existing code style and patterns

  • Add tests for new functionality

  • Update documentation for new features

  • Submit PRs with clear descriptions

🔐 Security Considerations

  • The server reads only from the configured PDF folder and writes its markdown cache and search index to the configured markdown folder

  • No external network calls are made during operation

  • Sensitive data remains local - nothing is sent to external services

  • Configure appropriate file permissions for your PDF folders

📄 License

This project is open source. See LICENSE file for details.

🙏 Acknowledgments

Built with the Model Context Protocol by Anthropic.


Ready to transform your PDFs into a searchable knowledge base?

Run python manage_server.py create-config to get started! 🚀

📦 Dependencies

  • mcp: Model Context Protocol SDK for building MCP servers

  • PyPDF2: PDF parsing and text extraction

  • asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, so no separate install is needed)

  • jsonschema: JSON validation for configuration files

All dependencies are lightweight and have minimal system requirements.

