# MCP Document Indexer

A Python-based MCP (Model Context Protocol) server for local document indexing and search, using the LanceDB vector database and local LLMs.
## Features

- **Real-time Document Monitoring**: Automatically indexes new and modified documents in configured folders
- **Multi-format Support**: Handles PDF, Word (`.docx`/`.doc`), plain text, Markdown, and RTF files
- **Local LLM Integration**: Uses Ollama for document summarization and keyword extraction; nothing leaves your machine
- **Vector Search**: Semantic search using LanceDB and sentence-transformer embeddings
- **MCP Integration**: Exposes search and catalog tools via the Model Context Protocol
- **Incremental Indexing**: Only processes changed files to save resources
- **Performance Optimized**: Designed to run well on a standard laptop (e.g., an M1/M2 MacBook)
## Installation

### Prerequisites

- Python 3.9+
- The `uv` package manager
- Ollama (for running the local LLM)
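The prerequisites can be installed roughly as follows; the official install scripts are shown, but the model name is only an example, so consult the uv and Ollama documentation for your platform:

```shell
# Install uv (official standalone installer)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Ollama (on macOS, `brew install ollama` also works)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model to use for summarization (model choice is an example)
ollama pull llama3.2
```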
### Install MCP Document Indexer
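A typical source install might look like the following; the repository URL is a placeholder, so substitute the project's actual location:

```shell
# Clone the repository (URL is hypothetical)
git clone https://github.com/your-org/mcp-doc-indexer.git
cd mcp-doc-indexer

# Install dependencies with uv
uv sync
```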
## Configuration

Configure the indexer using environment variables or a `.env` file:
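A sketch of a possible `.env` file. `CHUNK_SIZE` and `MAX_FILE_SIZE_MB` are referenced in the Troubleshooting section below; the other variable names here are illustrative, so check the project's documented settings:

```
# Folders to monitor (variable name is illustrative)
WATCH_FOLDERS=~/Documents,~/Notes

# Chunking and file-size limits (referenced under Troubleshooting)
CHUNK_SIZE=1000
MAX_FILE_SIZE_MB=50

# Ollama model used for summarization (illustrative)
OLLAMA_MODEL=llama3.2
```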
## Usage

### Run as Standalone Service
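Assuming a console entry point is defined in the project (the command name here is hypothetical), the service could be started with uv:

```shell
# Run the indexer as a standalone service (entry-point name is hypothetical)
uv run mcp-doc-indexer
```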
### Integrate with Claude Desktop

Add the server to your Claude Desktop configuration (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
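A sketch of the relevant `mcpServers` entry; the server name, command, and directory path are placeholders to adapt to your install:

```json
{
  "mcpServers": {
    "doc-indexer": {
      "command": "uv",
      "args": ["--directory", "/path/to/mcp-doc-indexer", "run", "mcp-doc-indexer"]
    }
  }
}
```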
## MCP Tools

The indexer exposes the following tools via MCP:
### search_documents

Search for documents using natural language queries.

Parameters:
- `query`: Search query text
- `limit`: Maximum number of results (default: 10)
- `search_type`: `"documents"` or `"chunks"`
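For example, a call to `search_documents` might carry arguments shaped like this (values shown for illustration):

```json
{
  "query": "quarterly revenue projections",
  "limit": 5,
  "search_type": "chunks"
}
```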
### get_catalog

List all indexed documents with summaries.

Parameters:
- `skip`: Number of documents to skip (default: 0)
- `limit`: Maximum documents to return (default: 100)
### get_document_info

Get detailed information about a specific document.

Parameters:
- `file_path`: Path to the document
### reindex_document

Force reindexing of a specific document.

Parameters:
- `file_path`: Path to the document to reindex
### get_indexing_stats

Get current indexing statistics.
## Example Usage in Claude

Once configured, you can use the indexer in Claude:
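For instance, you might ask Claude (prompts are illustrative):

```
Search my documents for notes about quarterly planning.
Show me the catalog of indexed documents.
Get details about the report I indexed yesterday.
```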
## Architecture

### File Processing Pipeline

1. **File Detection**: Watchdog monitors configured folders for changes
2. **Document Parsing**: Extracts text from PDF, Word, and text files
3. **Text Chunking**: Splits documents into overlapping chunks for better retrieval
4. **LLM Processing**: Generates summaries and extracts keywords using Ollama
5. **Embedding Generation**: Creates vector embeddings using sentence transformers
6. **Vector Storage**: Stores documents and chunks in LanceDB
7. **MCP Exposure**: Makes search and catalog tools available via MCP
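The text-chunking step can be sketched as follows; the chunk and overlap sizes are illustrative defaults, not the project's actual values:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so context spans chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks
```

The overlap means a sentence cut at one chunk boundary is still fully present in the next chunk, which tends to improve retrieval quality.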
## Performance Considerations

- **Incremental Indexing**: Only changed files are reprocessed
- **Async Processing**: Parallel processing of multiple documents
- **Batch Operations**: Efficient batch indexing for multiple files
- **Debouncing**: Prevents duplicate processing of rapidly changing files
- **Size Limits**: Configurable maximum file size to prevent memory issues
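The debouncing idea can be sketched with a simple quiet-window check per file path (the window length is illustrative):

```python
import time
from typing import Optional


class Debouncer:
    """Suppress repeat events for the same path within a quiet window."""

    def __init__(self, window_seconds: float = 2.0):
        self.window = window_seconds
        self._last_seen: dict = {}

    def should_process(self, path: str, now: Optional[float] = None) -> bool:
        """Return True if enough time has passed since this path was last seen."""
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(path)
        self._last_seen[path] = now
        return last is None or (now - last) >= self.window
```

A save operation that fires several filesystem events in quick succession then triggers only one indexing pass.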
## Troubleshooting

### Ollama Not Available

If Ollama is not running or the configured model is unavailable, the indexer falls back to plain text extraction without summarization or keyword extraction.
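That fallback pattern can be sketched like this: try Ollama's HTTP API (it listens on port 11434 by default and exposes `/api/generate`), and fall back to a plain-text excerpt if the call fails. The model name, prompt, and excerpt length are illustrative:

```python
import json
import urllib.error
import urllib.request


def summarize(text: str, model: str = "llama3.2",
              base_url: str = "http://localhost:11434") -> str:
    """Summarize via Ollama; fall back to a simple excerpt if unavailable."""
    payload = json.dumps({
        "model": model,
        "prompt": f"Summarize this document:\n\n{text}",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())["response"]
    except (urllib.error.URLError, OSError, KeyError):
        # Fallback: first 200 characters serve as a crude "summary"
        return text[:200]
```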
### Permission Issues

Ensure the indexer has read access to the monitored folders:
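For example (the path is a placeholder for your watched folder):

```shell
# Grant the current user recursive read access (example path)
chmod -R u+rX /path/to/watched/folder
```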
### Memory Usage

For large document collections, consider:

- Reducing `CHUNK_SIZE` to create smaller chunks
- Limiting `MAX_FILE_SIZE_MB` to skip very large files
- Using a smaller embedding model
## Development

### Running Tests

### Code Formatting

### Building Package
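Assuming standard Python tooling (these specific commands are guesses; check the project's `pyproject.toml` for the actual test runner and formatter):

```shell
# Run tests
uv run pytest

# Format code
uv run black .

# Build the package
uv build
```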
## License

MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
## Support

For issues or questions:

- Open an issue on GitHub
- Check the troubleshooting section
- Review logs in the console output