# 📚 Local Documents MCP Server
A Model Context Protocol (MCP) server for interacting with local documents on Windows systems. This server provides tools to list, load, and process documents with support for OCR on scanned PDFs.
## ✨ Features
- **📁 Document Discovery**: List all documents in a specified directory
- **⚡ Document Processing**: Convert various document formats to markdown
- **🔍 OCR Support**: Extract text from scanned PDFs using Tesseract OCR
- **🎯 Token Management**: Automatic content truncation based on token limits
- **📄 Multi-format Support**: Handle Word docs, PDFs, PowerPoint, Excel, and more
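
The token-based truncation mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual `max_tokens.py`: the function name `truncate_to_token_limit` and the word-based stand-in tokenizer are assumptions made so the example runs without extra packages. With `tiktoken` installed, you could pass the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")` instead (the choice of encoding is also an assumption).

```python
from typing import Callable, List, Tuple


def truncate_to_token_limit(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List],
    decode: Callable[[List], str],
) -> Tuple[str, int, bool]:
    """Return (text, tokens_kept, was_truncated) for a given token budget."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        # Under budget: return the original text untouched.
        return text, len(tokens), False
    kept = tokens[:max_tokens]
    return decode(kept), len(kept), True


# Hypothetical stand-in tokenizer: one token per whitespace-separated word.
def word_encode(text: str) -> List[str]:
    return text.split()


def word_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


if __name__ == "__main__":
    content, used, truncated = truncate_to_token_limit(
        "one two three four five", 3, word_encode, word_decode
    )
    print(content, used, truncated)
```

Passing the tokenizer in as a pair of callables keeps the truncation logic independent of any particular encoding.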
## 🛠️ Tools Available
- `list_documents`: Find documents by path, name, and extension
- `load_documents`: Extract document content as markdown
- `load_scanned_document`: Extract text from scanned PDFs using OCR
## 💻 System Requirements
- **Operating System**: Windows 10/11
- **Python**: 3.13 or higher
- **Package Manager**: [uv](https://docs.astral.sh/uv/) (recommended)
## 📋 Prerequisites Installation
### 1. 🐍 Python 3.13
Download and install Python 3.13 from [python.org](https://www.python.org/downloads/)
### 2. ⚡ UV Package Manager
Install uv using pip:
```powershell
pip install uv
```
### 3. 📖 Poppler for Windows
**Purpose**: Required for PDF processing and conversion to images for OCR.
1. Download the latest Poppler Windows release from:
https://github.com/oschwartz10612/poppler-windows/releases/
2. Extract the ZIP file to:
```
D:\Program Files\poppler-24.08.0
```
3. The Poppler binaries should be located at:
```
D:\Program Files\poppler-24.08.0\Library\bin
```
**Alternative locations**: You can install Poppler in any directory; just update `POPPLER_PATH` in the `.env` file to point to its `bin` folder.
### 4. 👁️ Tesseract OCR
**Purpose**: Required for extracting text from scanned documents and images.
1. Download Tesseract for Windows from:
https://github.com/UB-Mannheim/tesseract/wiki
2. Install Tesseract following the installer instructions
3. Make sure Tesseract is added to your system PATH, or note the installation directory
## 🚀 Project Installation
### 1. 📥 Clone or Download the Project
```powershell
git clone <your-repo-url>
cd LocalDocs
```
### 2. 📦 Install Python Dependencies
```powershell
uv sync
```
This will install all required dependencies from `pyproject.toml`:
- `markitdown[docx,pdf,pptx,xls,xlsx]>=0.1.2` - Document conversion
- `mcp[cli]>=1.10.1` - MCP server framework
- `opencv-python>=4.11.0.86` - Image processing
- `pdf2image>=1.17.0` - PDF to image conversion
- `pytesseract>=0.3.13` - Tesseract OCR wrapper
- `python-dotenv>=1.1.1` - Environment variable management
- `tiktoken>=0.9.0` - Token counting
### 3. ⚙️ Configure Environment Variables
Create or update the `.env` file in the project root:
```env
POPPLER_PATH="D:\\Program Files\\poppler-24.08.0\\Library\\bin"
```
**Note**: Update the path to match your Poppler installation location.
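
To confirm that `POPPLER_PATH` points at a usable Poppler `bin` directory, a small check like the one below may help. It is a sketch under assumptions: the helper name `find_pdftoppm` is hypothetical, and it reads the variable straight from the process environment, whereas the server presumably loads `.env` through `python-dotenv` first. It looks for `pdftoppm`, the Poppler tool that `pdf2image` invokes.

```python
import os
from pathlib import Path
from typing import Optional


def find_pdftoppm(poppler_dir: Optional[str]) -> Optional[Path]:
    """Return the pdftoppm executable inside poppler_dir, or None if absent."""
    if not poppler_dir:
        return None
    # Windows builds ship pdftoppm.exe; other platforms use a bare name.
    for name in ("pdftoppm.exe", "pdftoppm"):
        candidate = Path(poppler_dir) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    configured = os.environ.get("POPPLER_PATH")
    found = find_pdftoppm(configured)
    print(f"POPPLER_PATH={configured!r} -> {found or 'pdftoppm not found'}")
```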
## 🔧 Configuration for MCP Clients
### 🤖 Claude Desktop Configuration
Add a `local-documents` entry to the `mcpServers` section of your Claude Desktop configuration file (`claude_desktop_config.json`). The server takes two positional arguments after `server.py`:
- **First argument**: Path to your documents directory
  - Example: `"C:\\Users\\YourUsername\\Documents\\MyDocuments"`
  - Use double backslashes for Windows paths in JSON
- **Second argument**: Maximum tokens per document
  - Example: `"30000"`
  - Adjust based on your needs and Claude's token limits
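
For illustration, the two positional arguments could be parsed along these lines. This is a hypothetical sketch using `argparse`; the README does not say how `server.py` actually reads its arguments, and the names `documents_dir` and `max_tokens` are assumptions.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the documents directory and the per-document token limit."""
    parser = argparse.ArgumentParser(
        description="Local Documents MCP server (argument sketch)"
    )
    parser.add_argument(
        "documents_dir", type=Path, help="Directory containing the documents"
    )
    parser.add_argument(
        "max_tokens", type=int, help="Maximum tokens returned per document"
    )
    args = parser.parse_args(argv)
    if args.max_tokens <= 0:
        # Reject zero or negative limits early with a usage message.
        parser.error("max_tokens must be a positive integer")
    return args


if __name__ == "__main__":
    parsed = parse_args()
    print(parsed.documents_dir, parsed.max_tokens)
```

Note that the token limit arrives as a string in the JSON `args` array and is converted to an integer here.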
### 📝 Example Configurations
**Example** (replace both paths with your own project and document locations):
```json
{
  "mcpServers": {
    "local-documents": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\Users\\YourUsername\\Documents\\LocalDocs",
        "run",
        "server.py",
        "C:\\Users\\YourUsername\\Documents\\MyDocuments",
        "30000"
      ]
    }
  }
}
```
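
If hand-escaping backslashes is error-prone, the same entry can be generated with Python's `json` module, which doubles the backslashes automatically. This is a convenience sketch; the helper name `build_claude_config` is hypothetical, and the output still has to be merged into your existing Claude Desktop configuration by hand.

```python
import json


def build_claude_config(project_dir: str, documents_dir: str, max_tokens: int) -> str:
    """Return the mcpServers entry for this server as formatted JSON text."""
    config = {
        "mcpServers": {
            "local-documents": {
                "command": "uv",
                "args": [
                    "--directory",
                    project_dir,
                    "run",
                    "server.py",
                    documents_dir,
                    str(max_tokens),  # arguments are passed as strings
                ],
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Raw strings keep single backslashes in Python; json.dumps escapes them.
    print(
        build_claude_config(
            r"C:\Users\YourUsername\Documents\LocalDocs",
            r"C:\Users\YourUsername\Documents\MyDocuments",
            30000,
        )
    )
```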
## 🎯 Usage
### 🚀 Starting the Server
Claude Desktop starts the server automatically on launch, using the configured settings.
### 🔄 Available Operations
1. **📋 List Documents**: Discover all documents in your configured directory
2. **📄 Load Standard Documents**: Process Word docs, PDFs, PowerPoint, Excel files
3. **🔍 Load Scanned Documents**: Use OCR to extract text from scanned PDFs
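
The scanned-document path can be pictured as "render each PDF page to an image, OCR each image, join the text". The sketch below shows that flow with `pdf2image.convert_from_path` and `pytesseract.image_to_string`. It is an assumption about the general approach, not a copy of the project's `ocr.py`: the function names, the 300 DPI default, and the per-page heading format are all illustrative.

```python
from typing import List, Optional


def pages_to_markdown(page_texts: List[str]) -> str:
    """Join per-page OCR text into one markdown string with page headings."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections)


def ocr_scanned_pdf(
    pdf_path: str, poppler_path: Optional[str] = None, dpi: int = 300
) -> str:
    """Render each page with Poppler, run Tesseract on it, return markdown."""
    # Imported lazily so pages_to_markdown works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # poppler_path=None makes pdf2image look for Poppler on the system PATH.
    images = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    return pages_to_markdown([pytesseract.image_to_string(img) for img in images])


if __name__ == "__main__":
    print(pages_to_markdown(["First page text", "Second page text"]))
```

Higher DPI values tend to improve recognition at the cost of memory and processing time, which is worth keeping in mind for long scans.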
### 📊 Response Format
The server returns structured responses with:
- Document paths and metadata
- Token usage information
- Processing time (for OCR operations)
- Extracted content in markdown format
## 🛠️ Troubleshooting
### ⚠️ Common Issues
1. **🔍 Poppler not found**
   - Verify the Poppler installation path
   - Check the `POPPLER_PATH` value in the `.env` file
   - Ensure the path uses double backslashes on Windows
2. **👁️ Tesseract not found**
   - Verify the Tesseract installation
   - Add Tesseract to the system PATH
   - Restart Command Prompt or PowerShell so the updated PATH takes effect
3. **🔐 Permission denied errors**
   - Ensure the document directory is accessible
   - Check file permissions
   - Run as administrator if necessary
4. **❌ Import errors**
   - Verify all dependencies are installed: `uv sync`
   - Check the Python version: `python --version`
   - Ensure you're using Python 3.13 or higher
5. **⏳ Large document processing**
   - Reduce the token limit for better performance
   - Consider splitting large documents
   - Monitor memory usage during OCR operations
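
Several of the checks above can be automated with a short diagnostic script. The sketch below is an assumption-level helper (the name `environment_report` is hypothetical) that reports the Python version and whether `tesseract`, `pdftoppm`, and `uv` are on the PATH. A missing `pdftoppm` is not necessarily a problem if Poppler is configured through `POPPLER_PATH` instead.

```python
import shutil
import sys
from typing import Dict, Tuple


def environment_report(min_python: Tuple[int, int] = (3, 13)) -> Dict[str, object]:
    """Collect basic facts about the local toolchain for troubleshooting."""
    return {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= min_python,
        # shutil.which returns the executable's full path, or None if not found.
        "tesseract": shutil.which("tesseract"),
        "pdftoppm": shutil.which("pdftoppm"),
        "uv": shutil.which("uv"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'NOT FOUND'}")
```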
### 🐛 Debug Information
To get more detailed error information, check the Claude Desktop logs or run the server manually in a PowerShell window.
## 📁 File Structure
```
LocalDocs/
├── server.py              # Main MCP server
├── pyproject.toml         # Project dependencies
├── .env                   # Environment configuration
├── README.md              # This documentation
├── src/
│   └── instructions.md    # Assistant instructions
└── utils/
    ├── __init__.py
    ├── markitdown.py      # Document conversion
    ├── max_tokens.py      # Token management
    ├── ocr.py             # OCR processing
    ├── path_files.py      # File discovery
    └── prompts.py         # Instruction loading
```
## 📄 Supported Document Formats
- **📊 Microsoft Office**: .docx, .xlsx, .pptx
- **📖 PDF**: Regular PDFs and scanned PDFs (via OCR)
## ⚡ Performance Considerations
- **🔍 OCR Processing**: Scanned documents take significantly longer to process
- **🎯 Token Limits**: Adjust based on your document sizes and Claude's context window
- **💾 Memory Usage**: Large documents and OCR operations can be memory-intensive
## 🤝 Contributing
When contributing to this project:
1. Ensure compatibility with Windows and Python 3.13
2. Test with various document formats
3. Verify OCR functionality with scanned documents
4. Update documentation for any new features
## 📚 Related Documentation
- [MCP Documentation](https://modelcontextprotocol.io/)
- [Claude Desktop](https://claude.ai/download)
- [PDF2Image](https://github.com/Belval/pdf2image)
- [Poppler PDF Processing](https://github.com/oschwartz10612/poppler-windows/releases/)
- [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki)
- [MarkItDown](https://github.com/microsoft/markitdown)
## 🗺️ Roadmap and Future Enhancements
### 🔮 Planned Features
- **🧠 Vector Storage and RAG Integration**: Future versions will include vector-based document storage to:
  - Reduce token consumption by avoiding repeated text extraction
  - Enable semantic search across document collections
  - Provide more efficient document retrieval and chunking
  - Support persistent document indexing
- **🔍 Enhanced OCR Validation**: OCR for scanned books has not yet been fully validated and may encounter issues with:
  - Complex layouts and formatting
  - Multi-column documents
  - Poor-quality scans
  - Non-standard fonts or languages
### 💡 Current Recommendations
#### 🚀 For Large Context Models
- **🤖 Gemini Models**: With 1M+ token context windows, you can process very long documents without truncation
- **🎯 Token Management**: The current implementation supports up to 128K tokens by default, and the limit can be raised for larger context models
- **📖 Document Processing**: Consider using higher token limits (e.g., 500K-1M) when working with:
  - Complete books or long reports
  - Multiple related documents
  - Comprehensive document analysis
#### ⚠️ Limitations to Consider
- **🔍 OCR Reliability**: Scanned document processing is experimental and may require manual validation
- **⏳ Processing Time**: Large documents and OCR operations can be time-intensive
- **💾 Memory Usage**: High-resolution scanned documents may require significant system resources