# Local Documents MCP Server
A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.
## Features
- **Document Discovery**: List all documents in a specified directory
- **Document Processing**: Convert various document formats to markdown
- **OCR Support**: Extract text from scanned PDFs using Tesseract OCR
- **Token Management**: Automatic content truncation based on token limits
- **Multi-format Support**: Handle Word docs, PDFs, PowerPoint, Excel, and more
## Tools Available
- `list_documents`: Find documents by path, name, and extension
- `load_documents`: Extract document content as markdown
- `load_scanned_document`: Extract text from scanned PDFs using OCR
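
As a rough illustration of what a discovery tool such as `list_documents` might do, the sketch below walks a directory with `pathlib` and reports path, name, and extension for each match. The function name `find_documents` and the extension set are assumptions made for this example, not the server's actual implementation.

```python
from pathlib import Path

# Hypothetical extension set; the real server may accept a different list.
DOCUMENT_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".pdf"}


def find_documents(root: str, extensions: set[str] = DOCUMENT_EXTENSIONS) -> list[dict]:
    """Return path, name, and extension for each matching file under root."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        # Compare extensions case-insensitively so "REPORT.PDF" is also found.
        if path.is_file() and path.suffix.lower() in extensions:
            results.append(
                {"path": str(path), "name": path.stem, "extension": path.suffix.lower()}
            )
    return results
```

Calling `find_documents(r"C:\Users\YourUsername\Documents\MyDocuments")` would return one dictionary per document found, including files in subdirectories.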
## System Requirements
- **Operating System**: Windows 10/11
- **Python**: 3.13 or higher
- **Package Manager**: [uv](https://docs.astral.sh/uv/) (recommended)
## Prerequisites Installation
### 1. Python 3.13
Download and install Python 3.13 from [python.org](https://www.python.org/downloads/)
### 2. UV Package Manager
Install uv using pip:
```powershell
pip install uv
```
### 3. Poppler for Windows
**Purpose**: Required for PDF processing and conversion to images for OCR.
1. Download the latest Poppler Windows release from:
https://github.com/oschwartz10612/poppler-windows/releases/
2. Extract the ZIP file to:
```
D:\Program Files\poppler-24.08.0
```
3. The Poppler binaries should be located at:
```
D:\Program Files\poppler-24.08.0\Library\bin
```
**Alternative locations**: You can install Poppler in any directory, as long as you update the `.env` file with the correct path.
### 4. Tesseract OCR
**Purpose**: Required for extracting text from scanned documents and images.
1. Download Tesseract for Windows from:
https://github.com/UB-Mannheim/tesseract/wiki
2. Install Tesseract following the installer instructions
3. Make sure Tesseract is added to your system PATH, or note the installation directory
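
To confirm step 3, a small check like the one below can report whether the `tesseract` executable is visible on PATH. This is a minimal sketch using only the standard library; the function name `find_tesseract` is made up for this example.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract was not found on PATH; note the install directory instead.")
    else:
        print(f"tesseract found at {location}")
```

If the executable is not on PATH, `pytesseract` can be pointed at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of `tesseract.exe`.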
## Project Installation
### 1. Clone or Download the Project
```powershell
git clone <your-repo-url>
cd LocalDocs
```
### 2. Install Python Dependencies
```powershell
uv sync
```
This will install all required dependencies from `pyproject.toml`:
- `markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2` - Document conversion
- `mcp[cli]>=1.10.1` - MCP server framework
- `opencv-python>=4.11.0.86` - Image processing
- `pdf2image>=1.17.0` - PDF to image conversion
- `pytesseract>=0.3.13` - Tesseract OCR wrapper
- `python-dotenv>=1.1.1` - Environment variable management
- `tiktoken>=0.9.0` - Token counting
### 3. Configure Environment Variables
Create or update the `.env` file in the project root:
```env
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
```
**Note**: Update the path to match your Poppler installation location.
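
A quick way to sanity-check the configured path is to look for the `pdftoppm` executable, one of the Poppler tools that `pdf2image` relies on. The sketch below assumes `POPPLER_PATH` has already been loaded into the process environment (for example by `python-dotenv`); the helper name `poppler_bin_is_valid` is hypothetical.

```python
import os
from pathlib import Path


def poppler_bin_is_valid(poppler_path: str) -> bool:
    """Check that the directory contains the pdftoppm executable."""
    if not poppler_path:
        return False
    bin_dir = Path(poppler_path)
    # Accept both the Windows name and the plain name used on other systems.
    return any((bin_dir / name).is_file() for name in ("pdftoppm.exe", "pdftoppm"))


if __name__ == "__main__":
    configured = os.environ.get("POPPLER_PATH", "")
    print("POPPLER_PATH looks valid:", poppler_bin_is_valid(configured))
```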
## Configuration for MCP Clients
### Claude Desktop Configuration
Add the server to your Claude Desktop `claude_desktop_config.json` file. The server takes two positional arguments after `server.py`:
- **First argument**: Path to your documents directory
  - Example: `"C:\\Users\\YourUsername\\Documents\\MyDocuments"`
  - Use double backslashes for Windows paths in JSON
- **Second argument**: Maximum tokens per document
  - Example: `"30000"`
  - Adjust based on your needs and Claude's token limits
### Example Configurations
**For different document locations**:
```json
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
```
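
The double backslashes above are JSON string escaping, which is easy to get wrong by hand. One option is to generate the entry with Python's `json` module, which escapes backslashes automatically. The sketch below is illustrative only: `build_claude_config` is a hypothetical helper, and the paths are the placeholder values from the example above.

```python
import json


def build_claude_config(project_dir: str, documents_dir: str, max_tokens: int) -> str:
    """Render an mcpServers entry as JSON text; json.dumps escapes backslashes."""
    config = {
        "mcpServers": {
            "local-documents": {
                "command": "uv",
                "args": [
                    "--directory", project_dir,
                    "run", "server.py",
                    documents_dir,
                    str(max_tokens),  # command-line arguments are passed as strings
                ],
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Raw strings keep single backslashes in Python source.
    print(build_claude_config(
        r"C:\Users\YourUsername\Documents\LocalDocs",
        r"C:\Users\YourUsername\Documents\MyDocuments",
        30000,
    ))
```

The printed text can be pasted into the configuration file, or merged into an existing `mcpServers` object if other servers are already configured.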
## Usage
### Starting the Server
The server is automatically started when Claude Desktop loads with the configured settings.
### Available Operations
1. **List Documents**: Discover all documents in your configured directory
2. **Load Standard Documents**: Process Word docs, PDFs, PowerPoint, Excel files
3. **Load Scanned Documents**: Use OCR to extract text from scanned PDFs
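
For orientation, the scanned-document path can be pictured as two stages: Poppler rasterizes each PDF page (via `pdf2image`), then Tesseract reads each image (via `pytesseract`). The sketch below is an assumption about how such a pipeline could look, not the code in `utils/ocr.py`; the function names and the per-page heading format are invented for this example.

```python
from typing import Optional


def join_pages(page_texts: list[str]) -> str:
    """Combine per-page OCR output into one markdown string with page headings."""
    sections = [
        f"## Page {number}\n\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(sections)


def ocr_pdf(pdf_path: str, poppler_path: Optional[str] = None, dpi: int = 200) -> str:
    """Rasterize each PDF page with Poppler, then run Tesseract on every image."""
    # Imported lazily so join_pages works without the OCR dependencies installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    return join_pages([pytesseract.image_to_string(image) for image in images])
```

Higher `dpi` values generally improve recognition on small print at the cost of more memory and time per page.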
### Response Format
The server returns structured responses with:
- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format
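
To make the list above concrete, a response could be modeled roughly as follows. The class and field names are hypothetical and chosen only for illustration; the server's actual response keys may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DocumentResult:
    """Hypothetical shape of one tool response; field names are illustrative."""
    path: str
    content: str                                # extracted markdown
    token_count: int                            # tokens in the returned content
    truncated: bool = False                     # True if cut at the token limit
    processing_seconds: Optional[float] = None  # set only for OCR operations


if __name__ == "__main__":
    result = DocumentResult(path="report.pdf", content="# Report", token_count=3)
    print(asdict(result))
```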
## Troubleshooting
### Common Issues
1. **Poppler not found**
   - Verify Poppler installation path
   - Check `.env` file configuration
   - Ensure path uses double backslashes in Windows
2. **Tesseract not found**
   - Verify Tesseract installation
   - Add Tesseract to system PATH
   - Restart command prompt/PowerShell
3. **Permission denied errors**
   - Ensure the document directory is accessible
   - Check file permissions
   - Run as administrator if necessary
4. **Import errors**
   - Verify all dependencies are installed: `uv sync`
   - Check Python version: `python --version`
   - Ensure you're using Python 3.13
5. **Large document processing**
   - Reduce token limit for better performance
   - Consider splitting large documents
   - Monitor memory usage during OCR operations
### Debug Information
To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
## File Structure
```
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
```
## Supported Document Formats
- **Microsoft Office**: .docx, .xlsx, .pptx
- **PDF**: Regular PDFs and scanned PDFs (via OCR)
## Performance Considerations
- **OCR Processing**: Scanned documents take significantly longer to process
- **Token Limits**: Adjust based on your document sizes and Claude's context window
- **Memory Usage**: Large documents and OCR operations can be memory-intensive
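
The token limit works by cutting content once it exceeds a token budget. The sketch below shows the general idea with a pluggable tokenizer, so it does not depend on any particular library. It is an assumption about the approach, not the code in `utils/max_tokens.py`, and the function name is hypothetical.

```python
from typing import Callable, Sequence


def truncate_to_token_limit(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> tuple[str, int, bool]:
    """Return (text, token_count, truncated) with at most max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens), False
    # Keep the leading tokens and decode them back into text.
    return decode(tokens[:max_tokens]), max_tokens, True
```

With `tiktoken`, the `encode` and `decode` arguments could be the methods of an encoding object returned by `tiktoken.get_encoding(...)`; which encoding the server uses is not stated here, so that choice is left open.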
## Contributing
When contributing to this project:
1. Ensure compatibility with Windows and Python 3.13
2. Test with various document formats
3. Verify OCR functionality with scanned documents
4. Update documentation for any new features
## Related Documentation
- [MCP Documentation](https://modelcontextprotocol.io/)
- [Claude Desktop MCP Guide](https://claude.ai/download)
- [PDF2Image](https://github.com/Belval/pdf2image)
- [Poppler PDF Processing](https://github.com/oschwartz10612/poppler-windows/releases/)
- [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki)
- [MarkItDown](https://github.com/microsoft/markitdown)
## Roadmap and Future Enhancements
### Planned Features
- **Vector Storage and RAG Integration**: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- **Enhanced OCR Validation**: Currently, OCR functionality for scanned books has not been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor quality scans
  - Non-standard fonts or languages
### Current Recommendations
#### For Large Context Models
- **Gemini Models**: With 1M+ token context windows, you can process very long documents without truncation
- **Token Management**: Current implementation supports up to 128K tokens by default, but can be adjusted for larger context models
- **Document Processing**: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis
#### Limitations to Consider
- **OCR Reliability**: Scanned document processing is experimental and may require manual validation
- **Processing Time**: Large documents and OCR operations can be time-intensive
- **Memory Usage**: High-resolution scanned documents may require significant system resources