by Baronco
# Local Documents MCP Server

A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.

## Features

- **Document Discovery**: List all documents in a specified directory
- **Document Processing**: Convert various document formats to markdown
- **OCR Support**: Extract text from scanned PDFs using Tesseract OCR
- **Token Management**: Automatic content truncation based on token limits
- **Multi-format Support**: Handle Word docs, PDFs, PowerPoint, Excel, and more

## Tools Available

- `list_documents`: Find documents by path, name, and extension
- `load_documents`: Extract document content as markdown
- `load_scanned_document`: Extract text from scanned PDFs using OCR

## System Requirements

- **Operating System**: Windows 10/11
- **Python**: 3.13 or higher
- **Package Manager**: [uv](https://docs.astral.sh/uv/) (recommended)

## Prerequisites Installation

### 1. Python 3.13

Download and install Python 3.13 from [python.org](https://www.python.org/downloads/)

### 2. UV Package Manager

Install uv using pip:

```powershell
pip install uv
```

### 3. Poppler for Windows

**Purpose**: Required for PDF processing and conversion to images for OCR.

1. Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/
2. Extract the ZIP file to:
   ```
   D:\Program Files\poppler-24.08.0
   ```
3. The Poppler binaries should be located at:
   ```
   D:\Program Files\poppler-24.08.0\Library\bin
   ```

**Alternative locations**: You can install Poppler in any directory, just make sure to update the `.env` file with the correct path.

### 4. Tesseract OCR

**Purpose**: Required for extracting text from scanned documents and images.

1. Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki
2. Install Tesseract following the installer instructions
3. Make sure Tesseract is added to your system PATH, or note the installation directory

## Project Installation

### 1. Clone or Download the Project

```powershell
git clone <your-repo-url>
cd LocalDocs
```

### 2. Install Python Dependencies

```powershell
uv sync
```

This will install all required dependencies from `pyproject.toml`:

- `markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2` - Document conversion
- `mcp[cli]>=1.10.1` - MCP server framework
- `opencv-python>=4.11.0.86` - Image processing
- `pdf2image>=1.17.0` - PDF to image conversion
- `pytesseract>=0.3.13` - Tesseract OCR wrapper
- `python-dotenv>=1.1.1` - Environment variable management
- `tiktoken>=0.9.0` - Token counting

### 3. Configure Environment Variables

Create or update the `.env` file in the project root:

```env
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
```

**Note**: Update the path to match your Poppler installation location.

## Configuration for MCP Clients

### Claude Desktop Configuration

Add the following configuration to your Claude Desktop `config.json` file:

- **First argument**: Path to your documents directory
  - Example: `"C:\\Users\\YourUsername\\Documents\\MyDocuments"`
  - Use double backslashes for Windows paths in JSON
- **Second argument**: Maximum tokens per document
  - Example: `"30000"`
  - Adjust based on your needs and Claude's token limits

### Example Configurations

**For different document locations**:

```json
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
```

## Usage

### Starting the Server

The server is automatically started when Claude Desktop loads with the configured settings.

### Available Operations

1. **List Documents**: Discover all documents in your configured directory
2. **Load Standard Documents**: Process Word docs, PDFs, PowerPoint, Excel files
3. **Load Scanned Documents**: Use OCR to extract text from scanned PDFs

### Response Format

The server returns structured responses with:

- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format

## Troubleshooting

### Common Issues

1. **Poppler not found**
   - Verify Poppler installation path
   - Check `.env` file configuration
   - Ensure path uses double backslashes in Windows
2. **Tesseract not found**
   - Verify Tesseract installation
   - Add Tesseract to system PATH
   - Restart command prompt/PowerShell
3. **Permission denied errors**
   - Ensure the document directory is accessible
   - Check file permissions
   - Run as administrator if necessary
4. **Import errors**
   - Verify all dependencies are installed: `uv sync`
   - Check Python version: `python --version`
   - Ensure you're using Python 3.13
5. **Large document processing**
   - Reduce token limit for better performance
   - Consider splitting large documents
   - Monitor memory usage during OCR operations

### Debug Information

To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
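The "Poppler not found" and "Tesseract not found" checks above can also be run as a small script before starting the server. The following is a minimal sketch, not part of the project: the function name `check_ocr_prerequisites` is hypothetical, and it assumes that `POPPLER_PATH` has already been loaded into the environment (for example from `.env`) and that `pdftoppm` is the Poppler binary to look for.

```python
import os
import shutil
from pathlib import Path


def check_ocr_prerequisites(env=None):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only. `env` is a mapping of
    environment variables and defaults to the current process environment.
    """
    env = os.environ if env is None else env
    problems = []

    # POPPLER_PATH should point at the folder holding the Poppler binaries.
    poppler_dir = env.get("POPPLER_PATH", "").strip().strip('"')
    if not poppler_dir:
        problems.append("POPPLER_PATH is not set")
    elif not Path(poppler_dir).is_dir():
        problems.append(f"POPPLER_PATH is not a directory: {poppler_dir}")
    elif shutil.which("pdftoppm", path=poppler_dir) is None:
        # Assumption: pdftoppm is the binary used for PDF-to-image conversion.
        problems.append(f"pdftoppm not found in {poppler_dir}")

    # Tesseract is expected to be reachable through PATH.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_ocr_prerequisites():
        print("Problem:", problem)
```

An empty result only shows that the executables can be located. It does not confirm that OCR itself works on a given document.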
## File Structure

```
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
```

## Supported Document Formats

- **Microsoft Office**: .docx, .xlsx, .pptx
- **PDF**: Regular PDFs and scanned PDFs (via OCR)

## Performance Considerations

- **OCR Processing**: Scanned documents take significantly longer to process
- **Token Limits**: Adjust based on your document sizes and Claude's context window
- **Memory Usage**: Large documents and OCR operations can be memory-intensive

## Contributing

When contributing to this project:

1. Ensure compatibility with Windows and Python 3.13
2. Test with various document formats
3. Verify OCR functionality with scanned documents
4. Update documentation for any new features

## Related Documentation

- [MCP Documentation](https://modelcontextprotocol.io/)
- [Claude Desktop MCP Guide](https://claude.ai/download)
- [PDF2Image](https://github.com/Belval/pdf2image)
- [Poppler PDF Processing](https://github.com/oschwartz10612/poppler-windows/releases/)
- [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki)
- [MarkItDown](https://github.com/microsoft/markitdown)

## Roadmap and Future Enhancements

### Planned Features

- **Vector Storage and RAG Integration**: Future versions will include vectorial document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support for persistent document indexing
- **Enhanced OCR Validation**: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages

### Current Recommendations

#### For Large Context Models

- **Gemini Models**: With 1M+ token context windows, you can process very long documents without truncation
- **Token Management**: Current implementation supports up to 128K tokens by default, but can be adjusted for larger context models
- **Document Processing**: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis

#### Limitations to Consider

- **OCR Reliability**: Scanned document processing is experimental and may require manual validation
- **Processing Time**: Large documents and OCR operations can be time-intensive
- **Memory Usage**: High-resolution scanned documents may require significant system resources
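To make the token-limit guidance above concrete, here is a minimal sketch of truncating content to a token budget. It is an illustration under stated assumptions, not the code in `utils/max_tokens.py`: the function name `truncate_to_token_limit` and its return shape are hypothetical, and the tokenizer is passed in as a pair of callables so the example runs without extra dependencies.

```python
def truncate_to_token_limit(text, max_tokens, encode, decode):
    """Cut `text` so that it holds at most `max_tokens` tokens.

    Hypothetical helper for illustration only.
    `encode` turns a string into a list of tokens; `decode` reverses it.
    Returns (content, tokens_kept, was_truncated).
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")

    tokens = encode(text)
    if len(tokens) <= max_tokens:
        # Under the limit: return the text untouched.
        return text, len(tokens), False

    # Over the limit: keep the leading tokens and rebuild the text from them.
    return decode(tokens[:max_tokens]), max_tokens, True


# Stand-in tokenizer for the demo: one token per whitespace-separated word.
content, used, truncated = truncate_to_token_limit(
    "one two three four five", 3, str.split, " ".join
)
print(content, used, truncated)
```

With the `tiktoken` dependency installed, the `encode` and `decode` methods of an encoding object such as `tiktoken.get_encoding("cl100k_base")` could be passed in place of the word-based stand-in. Which encoding the server actually uses is not stated here, so that choice is an assumption.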
