Local Documents MCP Server

by Baronco
# Local Documents MCP Server

A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.

## Features

- **Document Discovery**: List all documents in a specified directory
- **Document Processing**: Convert various document formats to markdown
- **OCR Support**: Extract text from scanned PDFs using Tesseract OCR
- **Token Management**: Automatic content truncation based on token limits
- **Multi-format Support**: Handle Word docs, PDFs, PowerPoint, Excel, and more

## Tools Available

- `list_documents`: Find documents by path, name, and extension
- `load_documents`: Extract document content as markdown
- `load_scanned_document`: Extract text from scanned PDFs using OCR

## System Requirements

- **Operating System**: Windows 10/11
- **Python**: 3.13 or higher
- **Package Manager**: [uv](https://docs.astral.sh/uv/) (recommended)

## Prerequisites Installation

### 1. Python 3.13

Download and install Python 3.13 from [python.org](https://www.python.org/downloads/)

### 2. UV Package Manager

Install uv using pip:

```powershell
pip install uv
```

### 3. Poppler for Windows

**Purpose**: Required for PDF processing and conversion to images for OCR.

1. Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
2. Extract the ZIP file to:
   ```
   D:\Program Files\poppler-24.08.0
   ```
3. The Poppler binaries should be located at:
   ```
   D:\Program Files\poppler-24.08.0\Library\bin
   ```

**Alternative locations**: You can install Poppler in any directory; just make sure to update the `.env` file with the correct path.

### 4. Tesseract OCR

**Purpose**: Required for extracting text from scanned documents and images.

1. Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
2. Install Tesseract following the installer instructions
3. Make sure Tesseract is added to your system PATH, or note the installation directory

## Project Installation

### 1. Clone or Download the Project

```powershell
git clone <your-repo-url>
cd LocalDocs
```

### 2. Install Python Dependencies

```powershell
uv sync
```

This will install all required dependencies from `pyproject.toml`:

- `markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2` - Document conversion
- `mcp[cli]>=1.10.1` - MCP server framework
- `opencv-python>=4.11.0.86` - Image processing
- `pdf2image>=1.17.0` - PDF to image conversion
- `pytesseract>=0.3.13` - Tesseract OCR wrapper
- `python-dotenv>=1.1.1` - Environment variable management
- `tiktoken>=0.9.0` - Token counting

### 3. Configure Environment Variables

Create or update the `.env` file in the project root:

```env
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
```

**Note**: Update the path to match your Poppler installation location.

## Configuration for MCP Clients

### Claude Desktop Configuration

Add the server to your Claude Desktop `claude_desktop_config.json` file. The last two entries in `args` are passed to `server.py`:

- **First argument**: Path to your documents directory
  - Example: `"C:\\Users\\YourUsername\\Documents\\MyDocuments"`
  - Use double backslashes for Windows paths in JSON
- **Second argument**: Maximum tokens per document
  - Example: `"30000"`
  - Adjust based on your needs and Claude's token limits

### Example Configurations

**For different document locations**:

```json
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
```

## Usage

### Starting the Server

The server is automatically started when Claude Desktop loads with the configured settings.

### Available Operations

1. **List Documents**: Discover all documents in your configured directory
2. **Load Standard Documents**: Process Word docs, PDFs, PowerPoint, Excel files
3. **Load Scanned Documents**: Use OCR to extract text from scanned PDFs

### Response Format

The server returns structured responses with:

- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format

## Troubleshooting

### Common Issues

1. **Poppler not found**
   - Verify Poppler installation path
   - Check `.env` file configuration
   - Ensure the path uses double backslashes on Windows
2. **Tesseract not found**
   - Verify Tesseract installation
   - Add Tesseract to system PATH
   - Restart command prompt/PowerShell
3. **Permission denied errors**
   - Ensure the document directory is accessible
   - Check file permissions
   - Run as administrator if necessary
4. **Import errors**
   - Verify all dependencies are installed: `uv sync`
   - Check Python version: `python --version`
   - Ensure you're using Python 3.13
5. **Large document processing**
   - Reduce token limit for better performance
   - Consider splitting large documents
   - Monitor memory usage during OCR operations

### Debug Information

To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
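Before digging through logs, it can help to rule out the first few issues listed under Common Issues. The script below is a minimal sketch of such a check, not part of this project: the function names are hypothetical, it uses only the standard library, and it assumes `POPPLER_PATH` is already present in the process environment (for example exported in the shell, or loaded from `.env` with python-dotenv's `load_dotenv()` before calling `report()`). It looks for `pdftoppm`, the Poppler binary that pdf2image invokes.

```python
import os
import shutil
import sys
from pathlib import Path

# Hypothetical helper script for illustration; it is not shipped with the server.


def python_ok(current=None, minimum=(3, 13)):
    """Return True if the interpreter meets the README's Python requirement."""
    version = tuple(current) if current is not None else tuple(sys.version_info[:2])
    return version >= minimum


def poppler_ok(poppler_path):
    """Return True if poppler_path is a folder that contains the pdftoppm binary."""
    if not poppler_path:
        return False
    folder = Path(poppler_path)
    return any((folder / name).is_file() for name in ("pdftoppm.exe", "pdftoppm"))


def tesseract_ok():
    """Return True if a tesseract executable can be found on PATH."""
    return shutil.which("tesseract") is not None


def report():
    """Print one line per prerequisite and return the results as a dict."""
    checks = {
        "Python 3.13+": python_ok(),
        # Assumption: POPPLER_PATH is set in the environment of this process.
        "Poppler (POPPLER_PATH)": poppler_ok(os.environ.get("POPPLER_PATH", "")),
        "Tesseract on PATH": tesseract_ok(),
    }
    for name, passed in checks.items():
        print(f"{'OK     ' if passed else 'MISSING'}  {name}")
    return checks


if __name__ == "__main__":
    report()
```

Running it from the same PowerShell session used to start the server shows whether that session sees the same PATH and environment the server would.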
## File Structure

```
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
```

## Supported Document Formats

- **Microsoft Office**: .docx, .xlsx, .pptx
- **PDF**: Regular PDFs and scanned PDFs (via OCR)

## Performance Considerations

- **OCR Processing**: Scanned documents take significantly longer to process
- **Token Limits**: Adjust based on your document sizes and Claude's context window
- **Memory Usage**: Large documents and OCR operations can be memory-intensive

## Contributing

When contributing to this project:

1. Ensure compatibility with Windows and Python 3.13
2. Test with various document formats
3. Verify OCR functionality with scanned documents
4. Update documentation for any new features

## Related Documentation

- [MCP Documentation](https://modelcontextprotocol.io/)
- [Claude Desktop MCP Guide](https://claude.ai/download)
- [PDF2Image](https://github.com/Belval/pdf2image)
- [Poppler PDF Processing](https://github.com/oschwartz10612/poppler-windows/releases/)
- [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki)
- [MarkItDown](https://github.com/microsoft/markitdown)

## Roadmap and Future Enhancements

### Planned Features

- **Vector Storage and RAG Integration**: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- **Enhanced OCR Validation**: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages

### Current Recommendations

#### For Large Context Models

- **Gemini Models**: With 1M+ token context windows, you can process very long documents without truncation
- **Token Management**: Current implementation supports up to 128K tokens by default, but can be adjusted for larger context models
- **Document Processing**: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis

#### Limitations to Consider

- **OCR Reliability**: Scanned document processing is experimental and may require manual validation
- **Processing Time**: Large documents and OCR operations can be time-intensive
- **Memory Usage**: High-resolution scanned documents may require significant system resources
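
To make the effect of the per-document token limit concrete, here is a minimal sketch of token-based truncation. It is an illustration under stated assumptions, not the code in `utils/max_tokens.py`: the function name and the `WordEncoding` stand-in are hypothetical, and the stand-in counts one token per whitespace-separated word so the example runs without extra packages.

```python
def truncate_to_token_limit(text, max_tokens, encoding):
    """Return (text, token_count, truncated), keeping at most max_tokens tokens.

    `encoding` is any object with encode(text) -> list and decode(tokens) -> str.
    """
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens), False
    # Keep only the first max_tokens tokens and turn them back into text.
    return encoding.decode(tokens[:max_tokens]), max_tokens, True


class WordEncoding:
    """Stand-in tokenizer for the demo: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


if __name__ == "__main__":
    content, used, truncated = truncate_to_token_limit(
        "one two three four five", max_tokens=3, encoding=WordEncoding()
    )
    print(content, used, truncated)  # prints: one two three 3 True
```

With tiktoken installed, an encoding object such as `tiktoken.get_encoding("cl100k_base")` offers the same `encode`/`decode` methods and could be passed instead; which encoding the server actually uses is not stated here, so treat that choice as an assumption.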
