by Baronco

📚 Local Documents MCP Server

A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.

✨ Features

  • 📁 Document Discovery: List all documents in a specified directory

  • ⚡ Document Processing: Convert various document formats to markdown

  • 🔍 OCR Support: Extract text from scanned PDFs using Tesseract OCR

  • 🎯 Token Management: Automatic content truncation based on token limits

  • 📄 Multi-format Support: Handle Word docs, PDFs, PowerPoint, Excel, and more

🛠️ Tools Available

  • list_documents: Find documents by path, name, and extension

  • load_documents: Extract document content as markdown

  • load_scanned_document: Extract text from scanned PDFs using OCR
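As a rough illustration of what a discovery tool like list_documents might do, here is a minimal standard-library sketch. The function body, the recursive search, the extension set, and the returned field names are assumptions made for this example; they are not taken from the project's server.py.

```python
from pathlib import Path

# Assumed extension set, based on the formats this README lists as supported.
DOC_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".pdf"}


def list_documents(root: str, extensions: set[str] = DOC_EXTENSIONS) -> list[dict]:
    """Recursively collect documents under `root` with path, name, and extension."""
    found = []
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so "REPORT.PDF" is also matched.
        if path.is_file() and path.suffix.lower() in extensions:
            found.append(
                {
                    "path": str(path),
                    "name": path.stem,
                    "extension": path.suffix.lower(),
                }
            )
    return found
```

Calling list_documents on the configured documents directory would return one dictionary per matching file, which a client could then pass to a loading tool.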

💻 System Requirements

  • Operating System: Windows 10/11

  • Python: 3.13 or higher

  • Package Manager: uv (recommended)

📋 Prerequisites Installation

1. 🐍 Python 3.13

Download and install Python 3.13 from python.org

2. ⚡ UV Package Manager

Install uv using pip:

pip install uv

3. 📖 Poppler for Windows

Purpose: Required for PDF processing and conversion to images for OCR.

  1. Download the latest Poppler Windows release from: https://github.com/oschwartz10612/poppler-windows/releases/

  2. Extract the ZIP file to:

    D:\Program Files\poppler-24.08.0
  3. The Poppler binaries should be located at:

    D:\Program Files\poppler-24.08.0\Library\bin

Alternative locations: You can install Poppler in any directory; just update the .env file with the correct path.

4. 👁️ Tesseract OCR

Purpose: Required for extracting text from scanned documents and images.

  1. Download Tesseract for Windows from: https://github.com/UB-Mannheim/tesseract/wiki

  2. Install Tesseract following the installer instructions

  3. Make sure Tesseract is added to your system PATH, or note the installation directory
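The two prerequisites work together: Poppler renders PDF pages to images, and Tesseract reads text from those images. The sketch below shows that flow using pdf2image's convert_from_path and pytesseract's image_to_string, both of which are listed as project dependencies. The function name ocr_pdf and the injectable render and recognize parameters are assumptions added so the flow can be tried without the binaries installed; the project's own ocr.py may be organized differently.

```python
from typing import Callable, Iterable, Optional


def ocr_pdf(
    pdf_path: str,
    poppler_path: Optional[str] = None,
    render: Optional[Callable[..., Iterable]] = None,
    recognize: Optional[Callable[..., str]] = None,
) -> str:
    """Render each PDF page to an image, then run OCR on every page image."""
    if render is None:
        # Imported lazily: requires pdf2image and the Poppler binaries.
        from pdf2image import convert_from_path

        render = convert_from_path
    if recognize is None:
        # Imported lazily: requires pytesseract and the Tesseract executable.
        import pytesseract

        recognize = pytesseract.image_to_string

    pages = render(pdf_path, poppler_path=poppler_path)
    # One text chunk per page, separated by a blank line.
    return "\n\n".join(recognize(page).strip() for page in pages)
```

With both tools installed, ocr_pdf("scan.pdf", poppler_path=...) would use the real libraries; passing stand-in callables makes it possible to check the control flow on a machine without them.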

🚀 Project Installation

1. 📥 Clone or Download the Project

git clone <your-repo-url>
cd LocalDocs

2. 📦 Install Python Dependencies

uv sync

This will install all required dependencies from pyproject.toml:

  • markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2 - Document conversion

  • mcp[cli]>=1.10.1 - MCP server framework

  • opencv-python>=4.11.0.86 - Image processing

  • pdf2image>=1.17.0 - PDF to image conversion

  • pytesseract>=0.3.13 - Tesseract OCR wrapper

  • python-dotenv>=1.1.1 - Environment variable management

  • tiktoken>=0.9.0 - Token counting

3. ⚙️ Configure Environment Variables

Create or update the .env file in the project root:

POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"

Note: Update the path to match your Poppler installation location.
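A quick way to confirm the path is usable is to check that a Poppler executable actually exists in that directory. The following is a minimal sketch, assuming POPPLER_PATH has already been loaded into the process environment (for example by python-dotenv, which the project lists as a dependency). The function name resolve_poppler_dir is hypothetical, and pdftoppm is used only as a representative Poppler utility.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_poppler_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory named by POPPLER_PATH, or raise if it looks unusable."""
    env = os.environ if env is None else env
    raw = env.get("POPPLER_PATH")
    if not raw:
        raise RuntimeError("POPPLER_PATH is not set; add it to the .env file")
    bin_dir = Path(raw)
    # Accept the executable with or without the Windows .exe suffix.
    if not any((bin_dir / name).is_file() for name in ("pdftoppm.exe", "pdftoppm")):
        raise RuntimeError(f"No pdftoppm executable found in {bin_dir}")
    return bin_dir
```

Failing early with a clear message tends to be easier to diagnose than an error raised later in the middle of an OCR run.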

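🔧 Configuration for MCP Clients

🤖 Claude Desktop Configuration

Add an entry like the one under Example Configurations to your Claude Desktop configuration file (claude_desktop_config.json). The server takes two positional arguments after server.py: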

  • First argument: Path to your documents directory

    • Example: "C:\\Users\\YourUsername\\Documents\\MyDocuments"

    • Use double backslashes for Windows paths in JSON

  • Second argument: Maximum tokens per document

    • Example: "30000"

    • Adjust based on your needs and Claude's token limits
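To make the two positional arguments concrete, here is a minimal sketch of how a script could read them with argparse. The names parse_server_args, documents_dir, and max_tokens are assumptions for illustration; the README does not state how server.py parses its arguments.

```python
import argparse
from pathlib import Path


def parse_server_args(argv: list[str]) -> argparse.Namespace:
    """Parse <documents_dir> <max_tokens> as passed after server.py."""
    parser = argparse.ArgumentParser(description="Local documents MCP server (sketch)")
    parser.add_argument("documents_dir", type=Path, help="Directory containing the documents")
    parser.add_argument("max_tokens", type=int, help="Maximum tokens returned per document")
    args = parser.parse_args(argv)
    if args.max_tokens <= 0:
        parser.error("max_tokens must be a positive integer")
    return args


args = parse_server_args(["C:\\Users\\YourUsername\\Documents\\MyDocuments", "30000"])
print(args.documents_dir, args.max_tokens)
```

Note that the token limit arrives as a string in the JSON args array and is converted to an integer here.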

📝 Example Configurations

For different document locations:

{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
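If hand-escaping backslashes is error-prone, the same entry can be generated with Python's json module, which escapes Windows paths automatically. This is only a convenience sketch; the two directory values are the placeholder paths from the example above and should be replaced with your own.

```python
import json

# Raw strings hold the paths with single backslashes, as Windows shows them.
project_dir = r"C:\Users\YourUsername\Documents\LocalDocs"
documents_dir = r"C:\Users\YourUsername\Documents\MyDocuments"

config = {
    "mcpServers": {
        "local-documents": {
            "command": "uv",
            "args": ["--directory", project_dir, "run", "server.py", documents_dir, "30000"],
        }
    }
}

# json.dumps writes each backslash as a doubled backslash in the output.
config_text = json.dumps(config, indent=2)
print(config_text)
```

The printed text can be merged into the mcpServers section of the Claude Desktop configuration file.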

🎯 Usage

🚀 Starting the Server

Claude Desktop starts the server automatically when it loads the configured settings.

🔄 Available Operations

  1. 📋 List Documents: Discover all documents in your configured directory

  2. 📄 Load Standard Documents: Process Word docs, PDFs, PowerPoint, Excel files

  3. 🔍 Load Scanned Documents: Use OCR to extract text from scanned PDFs

📊 Response Format

The server returns structured responses with:

  • Document paths and metadata

  • Token usage information

  • Processing time (for OCR operations)

  • Extracted content in markdown format
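One way to picture such a response is as a small record with one field per item in the list above. The dataclass below is a hypothetical shape for illustration only; the field names are assumptions and the actual keys returned by the server may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DocumentResponse:
    """Hypothetical response shape; field names are illustrative assumptions."""

    path: str
    token_count: int
    truncated: bool
    content_markdown: str
    processing_seconds: Optional[float] = None  # only set for OCR operations


resp = DocumentResponse(
    path="reports/example.pdf",
    token_count=1200,
    truncated=False,
    content_markdown="# Example title\n\nExample body text.",
)
print(asdict(resp))
```

Keeping processing time optional reflects the note above that it is reported for OCR operations only.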

🛠️ Troubleshooting

⚠️ Common Issues

  1. 🔍 Poppler not found

    • Verify Poppler installation path

    • Check .env file configuration

    • Ensure path uses double backslashes in Windows

  2. 👁️ Tesseract not found

    • Verify Tesseract installation

    • Add Tesseract to system PATH

    • Restart command prompt/PowerShell

  3. 🔐 Permission denied errors

    • Ensure the document directory is accessible

    • Check file permissions

    • Run as administrator if necessary

  4. ❌ Import errors

    • Verify all dependencies are installed: uv sync

    • Check Python version: python --version

    • Ensure you're using Python 3.13 or higher

  5. ⏳ Large document processing

    • Reduce token limit for better performance

    • Consider splitting large documents

    • Monitor memory usage during OCR operations
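Several of the checks above can be automated. The following is a minimal standard-library sketch that reports whether the Python version meets the stated requirement and whether uv and tesseract can be found on PATH. The function name diagnose and the result keys are assumptions for this example.

```python
import shutil
import sys


def diagnose(min_python: tuple[int, int] = (3, 13)) -> dict[str, bool]:
    """Report which prerequisites from this README look satisfied."""
    return {
        "python_version_ok": sys.version_info[:2] >= min_python,
        "uv_on_path": shutil.which("uv") is not None,
        "tesseract_on_path": shutil.which("tesseract") is not None,
    }


for check, ok in diagnose().items():
    print(f"{check}: {'OK' if ok else 'MISSING'}")
```

Run it from the same terminal session you use to start the server, since PATH changes only take effect in newly opened shells.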

🐛 Debug Information

To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.

📁 File Structure

LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading

📄 Supported Document Formats

  • 📊 Microsoft Office: .docx, .xlsx, .pptx

  • 📖 PDF: Regular PDFs and scanned PDFs (via OCR)

⚡ Performance Considerations

  • 🔍 OCR Processing: Scanned documents take significantly longer to process

  • 🎯 Token Limits: Adjust based on your document sizes and Claude's context window

  • 💾 Memory Usage: Large documents and OCR operations can be memory-intensive

🤝 Contributing

When contributing to this project:

  1. Ensure compatibility with Windows and Python 3.13

  2. Test with various document formats

  3. Verify OCR functionality with scanned documents

  4. Update documentation for any new features

🗺️ Roadmap and Future Enhancements

🔮 Planned Features

  • 🧠 Vector Storage and RAG Integration: Future versions will include vector-based document storage to:

    • Reduce token consumption by avoiding repeated text extraction

    • Enable semantic search across document collections

    • Provide more efficient document retrieval and chunking

    • Support for persistent document indexing

  • 🔍 Enhanced OCR Validation: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:

    • Complex layouts and formatting

    • Multi-column documents

    • Poor quality scans

    • Non-standard fonts or languages

💡 Current Recommendations

🚀 For Large Context Models

  • 🤖 Gemini Models: With 1M+ token context windows, you can process very long documents without truncation

  • 🎯 Token Management: Current implementation supports up to 128K tokens by default, but can be adjusted for larger context models

  • 📖 Document Processing: Consider using higher token limits (e.g., 500K-1M) when working with:

    • Complete books or long reports

    • Multiple related documents

    • Comprehensive document analysis
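The effect of the token limit can be sketched as follows: encode the text, keep at most the allowed number of tokens, and decode the result. The function name truncate_to_tokens and the CharEncoding stand-in (one token per character) are assumptions so the example runs without extra packages. An encoding object from tiktoken, such as the one returned by tiktoken.get_encoding("cl100k_base"), offers the same encode and decode methods and could be passed instead; which encoding the project uses is not stated here.

```python
from typing import Protocol


class Encoding(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


def truncate_to_tokens(text: str, max_tokens: int, encoding: Encoding) -> tuple[str, int, bool]:
    """Return (possibly truncated text, original token count, whether it was truncated)."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens), False
    return encoding.decode(tokens[:max_tokens]), len(tokens), True


class CharEncoding:
    """Toy stand-in: one token per character, used only to make the sketch runnable."""

    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


print(truncate_to_tokens("hello world", 5, CharEncoding()))
```

Returning the original token count alongside the text lets a caller report how much content was cut, which is useful when deciding whether to raise the limit.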

⚠️ Limitations to Consider

  • 🔍 OCR Reliability: Scanned document processing is experimental and may require manual validation

  • ⏳ Processing Time: Large documents and OCR operations can be time-intensive

  • 💾 Memory Usage: High-resolution scanned documents may require significant system resources
