# 📚 Local Documents MCP Server

A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents, with OCR support for scanned PDFs.

## ✨ Features

- **📁 Document Discovery**: List all documents in a specified directory
- **⚡ Document Processing**: Convert various document formats to markdown
- **🔍 OCR Support**: Extract text from scanned PDFs using Tesseract OCR
- **🎯 Token Management**: Automatic content truncation based on token limits
- **📄 Multi-format Support**: Handle Word docs, PDFs, PowerPoint, Excel, and more

## 🛠️ Tools Available

- `list_documents`: Find documents by path, name, and extension
- `load_documents`: Extract document content as markdown
- `load_scanned_document`: Extract text from scanned PDFs using OCR

## 💻 System Requirements

- **Operating System**: Windows 10/11
- **Python**: 3.13 or higher
- **Package Manager**: [uv](https://docs.astral.sh/uv/) (recommended)

## 📋 Prerequisites Installation

### 1. 🐍 Python 3.13

Download and install Python 3.13 from [python.org](https://www.python.org/downloads/)

### 2. ⚡ UV Package Manager

Install uv using pip:

```powershell
pip install uv
```

### 3. 📖 Poppler for Windows

**Purpose**: Required for PDF processing and conversion to images for OCR.

1. Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
2. Extract the ZIP file to:
   ```
   D:\Program Files\poppler-24.08.0
   ```
3. The Poppler binaries should be located at:
   ```
   D:\Program Files\poppler-24.08.0\Library\bin
   ```

**Alternative locations**: You can install Poppler in any directory; just update the `.env` file with the correct path.

### 4. 👁️ Tesseract OCR

**Purpose**: Required for extracting text from scanned documents and images.

1. Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
2. Install Tesseract following the installer instructions
3. Make sure Tesseract is added to your system PATH, or note the installation directory

## 🚀 Project Installation

### 1. 📥 Clone or Download the Project

```powershell
git clone <your-repo-url>
cd LocalDocs
```

### 2. 📦 Install Python Dependencies

```powershell
uv sync
```

This will install all required dependencies from `pyproject.toml`:

- `markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2` - Document conversion
- `mcp[cli]>=1.10.1` - MCP server framework
- `opencv-python>=4.11.0.86` - Image processing
- `pdf2image>=1.17.0` - PDF to image conversion
- `pytesseract>=0.3.13` - Tesseract OCR wrapper
- `python-dotenv>=1.1.1` - Environment variable management
- `tiktoken>=0.9.0` - Token counting

### 3. ⚙️ Configure Environment Variables

Create or update the `.env` file in the project root:

```env
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
```

**Note**: Update the path to match your Poppler installation location.

## 🔧 Configuration for MCP Clients

### 🤖 Claude Desktop Configuration

Add a `local-documents` entry to your Claude Desktop `claude_desktop_config.json` file, as in the example below. The server takes two positional arguments after `server.py`:

- **First argument**: Path to your documents directory
  - Example: `"C:\\Users\\YourUsername\\Documents\\MyDocuments"`
  - Use double backslashes for Windows paths in JSON
- **Second argument**: Maximum tokens per document
  - Example: `"30000"`
  - Adjust based on your needs and Claude's token limits

### 📝 Example Configurations

**For different document locations**:

```json
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
```

## 🎯 Usage

### 🚀 Starting the Server

Claude Desktop starts the server automatically when it loads the configured settings.

### 🔄 Available Operations

1. **📋 List Documents**: Discover all documents in your configured directory
2. **📄 Load Standard Documents**: Process Word docs, PDFs, PowerPoint, Excel files
3. **🔍 Load Scanned Documents**: Use OCR to extract text from scanned PDFs

### 📊 Response Format

The server returns structured responses with:

- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format

## 🛠️ Troubleshooting

### ⚠️ Common Issues

1. **🔍 Poppler not found**
   - Verify Poppler installation path
   - Check `.env` file configuration
   - Ensure the path uses double backslashes on Windows
2. **👁️ Tesseract not found**
   - Verify Tesseract installation
   - Add Tesseract to system PATH
   - Restart command prompt/PowerShell
3. **🔐 Permission denied errors**
   - Ensure the document directory is accessible
   - Check file permissions
   - Run as administrator if necessary
4. **❌ Import errors**
   - Verify all dependencies are installed: `uv sync`
   - Check Python version: `python --version`
   - Ensure you're using Python 3.13 or higher
5. **⏳ Large document processing**
   - Reduce token limit for better performance
   - Consider splitting large documents
   - Monitor memory usage during OCR operations

### 🐛 Debug Information

To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.

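
The first checks in the troubleshooting list (Python version, Poppler location, Tesseract on PATH) can be scripted. The following is a minimal sketch, not part of this project: `check_prerequisites` is a hypothetical helper, and it assumes that `POPPLER_PATH` is available in the current shell (values in `.env` are only visible if you export or load them yourself). It relies on the fact that `pdf2image` calls Poppler's `pdftoppm` and `pytesseract` calls the `tesseract` executable.

```python
import os
import shutil
import sys


def check_prerequisites(poppler_path, which=shutil.which, python_version=None):
    """Report which documented prerequisites look usable (hypothetical helper)."""
    if python_version is None:
        python_version = sys.version_info[:2]
    return {
        # The README asks for Python 3.13 or higher.
        "python": tuple(python_version) >= (3, 13),
        # pdf2image shells out to pdftoppm, so look for it inside POPPLER_PATH.
        "poppler": bool(poppler_path)
        and which("pdftoppm", path=poppler_path) is not None,
        # pytesseract shells out to the tesseract executable found on PATH.
        "tesseract": which("tesseract") is not None,
    }


if __name__ == "__main__":
    # Assumption: POPPLER_PATH has been exported into this shell session.
    report = check_prerequisites(os.environ.get("POPPLER_PATH", ""))
    for name, ok in report.items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it from the same PowerShell window you would use to start the server manually, so that it sees the same PATH.
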
## 📁 File Structure

```
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
```

## 📄 Supported Document Formats

- **📊 Microsoft Office**: .docx, .xlsx, .pptx
- **📖 PDF**: Regular PDFs and scanned PDFs (via OCR)

## ⚡ Performance Considerations

- **🔍 OCR Processing**: Scanned documents take significantly longer to process
- **🎯 Token Limits**: Adjust based on your document sizes and Claude's context window
- **💾 Memory Usage**: Large documents and OCR operations can be memory-intensive

## 🤝 Contributing

When contributing to this project:

1. Ensure compatibility with Windows and Python 3.13
2. Test with various document formats
3. Verify OCR functionality with scanned documents
4. Update documentation for any new features

## 📚 Related Documentation

- [MCP Documentation](https://modelcontextprotocol.io/)
- [Claude Desktop MCP Guide](https://claude.ai/download)
- [PDF2Image](https://github.com/Belval/pdf2image)
- [Poppler PDF Processing](https://github.com/oschwartz10612/poppler-windows/releases/)
- [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki)
- [MarkItDown](https://github.com/microsoft/markitdown)

## 🗺️ Roadmap and Future Enhancements

### 🔮 Planned Features

- **🧠 Vector Storage and RAG Integration**: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- **🔍 Enhanced OCR Validation**: OCR for scanned books has not yet been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages

### 💡 Current Recommendations

#### 🚀 For Large Context Models

- **🤖 Gemini Models**: With 1M+ token context windows, you can process very long documents without truncation
- **🎯 Token Management**: The current implementation supports up to 128K tokens by default, but can be adjusted for larger context models
- **📖 Document Processing**: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis

#### ⚠️ Limitations to Consider

- **🔍 OCR Reliability**: Scanned document processing is experimental and may require manual validation
- **⏳ Processing Time**: Large documents and OCR operations can be time-intensive
- **💾 Memory Usage**: High-resolution scanned documents may require significant system resources
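
To make the effect of the token limit concrete, here is a minimal sketch of truncating content to a token budget. It is an illustration only and not the code in `utils/max_tokens.py`, which is not shown in this README. `truncate_to_token_limit` is a hypothetical name, and the default whitespace tokenizer is a stand-in so the example runs without extra dependencies.

```python
def truncate_to_token_limit(text, max_tokens, encode=str.split, decode=" ".join):
    """Cut text to at most max_tokens tokens (hypothetical helper).

    Returns (text, tokens_kept, was_truncated). The default encode/decode pair
    splits on whitespace, which only approximates a real tokenizer.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        # Under budget: return the original text unchanged.
        return text, len(tokens), False
    kept = tokens[:max_tokens]
    return decode(kept), len(kept), True


content, used, truncated = truncate_to_token_limit("alpha beta gamma delta", 3)
print(content, used, truncated)  # alpha beta gamma 3 True
```

For counts closer to what a model sees, one option is to pass the `encode` and `decode` methods of a `tiktoken` encoding (for example `tiktoken.get_encoding("cl100k_base")`) in place of the defaults. Whether the server uses that particular encoding is an assumption.
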