document-reader-mcp
Universal MCP server for extracting text from various document formats. Supports streaming, page/row limits, encoding detection, and simple rate limiting.
Cross-platform compatible: Works seamlessly on macOS, Linux, and Windows with identical functionality.
Supported Formats
Format | Extensions | Dependencies | Status |
PDF | .pdf | | ✅ Included (text + images) |
Excel | .xlsx, .xlsm | | ✅ Included |
Word | .docx | | ✅ Included |
CSV | .csv | Built-in | ✅ Always available |
Plain Text | .txt | Built-in | ✅ Always available |
JSON | .json | Built-in | ✅ Always available |
Markdown | .md | Built-in | ✅ Always available |
Features
✅ Cross-platform: Works on macOS, Linux, and Windows
✅ Multiple format support: PDF, Excel, CSV, TXT, JSON, Markdown, DOCX, PowerPoint, HTML
✅ Markdown conversion: Convert documents to Markdown with automatic image extraction
✅ PDF image extraction: Automatically extracts and embeds images from PDFs at appropriate page positions
✅ Streaming API: Memory-efficient processing of large files
✅ Smart encoding detection: Handles UTF-8, Latin-1 (ISO-8859-1), and Windows-1252 (CP1252)
✅ Context-aware limits: Automatic truncation to prevent AI context overflow
✅ Rate limiting: Process-wide rate limiting (configurable)
✅ Docker support: Run in isolated container with non-root user
✅ Modular design: Easy to extend with new formats
✅ Minimal dependencies: Most formats use Python stdlib only
Installation
Option 1: Install from GitHub (Recommended)
macOS/Linux
# Clone the repository
git clone https://github.com/ifmelate/document-reader-mcp.git
cd document-reader-mcp
# Create virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Windows (Command Prompt)
REM Clone the repository
git clone https://github.com/ifmelate/document-reader-mcp.git
cd document-reader-mcp
REM Create virtual environment and install dependencies
python -m venv .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt
Windows (PowerShell)
# Clone the repository
git clone https://github.com/ifmelate/document-reader-mcp.git
cd document-reader-mcp
# Create virtual environment and install dependencies
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
Note for Windows PowerShell users: If you encounter an execution policy error, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Quick Setup Scripts
For convenience, you can use the provided setup scripts:
macOS/Linux:
chmod +x dev-setup.sh
./dev-setup.sh
Windows (Command Prompt):
dev-setup.bat
Windows (PowerShell):
.\dev-setup.ps1
These scripts create the virtual environment, install dependencies, and set up the development environment automatically.
Option 2: Direct Install with pip
pip install git+https://github.com/ifmelate/document-reader-mcp.git
Option 3: Docker
# Clone the repository
git clone https://github.com/ifmelate/document-reader-mcp.git
cd document-reader-mcp
# Build the Docker image
docker build -t document-reader-mcp:latest .
See Docker Configuration below for MCP client setup.
Running the Server
After installation, start the MCP server:
python -m server.main
The server runs over stdio for integration with MCP-compatible clients.
Configuration in Cursor (or other MCP clients)
For Cursor IDE
Add this configuration to your Cursor MCP settings:
macOS/Linux:
~/.cursor/mcp.json
Windows:
%APPDATA%\Cursor\User\globalStorage\mcp.json
or via Settings → MCP
macOS/Linux Configuration
{
"mcpServers": {
"document-reader": {
"command": "python3",
"args": ["-m", "server.main"],
"cwd": "/absolute/path/to/document-reader-mcp"
}
}
}
Windows Configuration
{
"mcpServers": {
"document-reader": {
"command": "python",
"args": ["-m", "server.main"],
"cwd": "C:\\Users\\YourUsername\\document-reader-mcp"
}
}
}
Important for Windows users:
Use double backslashes (\\) in JSON paths, or use forward slashes (/), which also work on Windows
Replace YourUsername with your actual Windows username
Ensure the python command points to your Python 3.10+ installation (check with python --version)
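The escaping rule for Windows paths can be sidestepped by generating the configuration instead of typing it. The following is a minimal sketch, not part of the project: the helper name build_cursor_config is hypothetical, and the directory is the placeholder path from the example above.

```python
import json
from pathlib import PureWindowsPath


def build_cursor_config(repo_dir: str, python_cmd: str = "python") -> str:
    """Return the text of an mcp.json file for the given repository directory.

    Hypothetical helper for illustration; the keys mirror the examples above.
    """
    config = {
        "mcpServers": {
            "document-reader": {
                "command": python_cmd,
                "args": ["-m", "server.main"],
                "cwd": repo_dir,
            }
        }
    }
    # json.dumps writes every backslash as \\ so the result is valid JSON.
    return json.dumps(config, indent=2)


# Placeholder path taken from the Windows example; replace YourUsername.
windows_dir = r"C:\Users\YourUsername\document-reader-mcp"
config_text = build_cursor_config(windows_dir)
print(config_text)

# The forward-slash form needs no escaping at all.
forward_slash_dir = PureWindowsPath(windows_dir).as_posix()
print(forward_slash_dir)
```

Either form of the path should be accepted on Windows; the generated text can be pasted into the Cursor MCP settings file.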
For Claude Desktop or other MCP clients
Add similar configuration to your client's MCP settings file, adjusting the path accordingly.
Docker Configuration
To use the Docker version with MCP clients:
{
"mcpServers": {
"document-reader": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-v", "/absolute/path/to/documents:/documents:ro",
"document-reader-mcp:latest"
]
}
}
}
Important notes:
Replace /absolute/path/to/documents with the directory containing the files you want to process
The -v flag mounts your documents directory as /documents in the container (read-only)
Use -i for interactive mode (required for stdio communication)
Use --rm to automatically remove the container after it stops
File paths in MCP tool calls should use the /documents/filename.pdf format
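Because the container only sees the mounted directory, a host path has to be rewritten before it is passed to a tool. The sketch below illustrates that translation under stated assumptions: to_container_path is a hypothetical helper, the mount table is an example, and host paths are assumed to be POSIX-style.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, mounts: dict[str, str]) -> str:
    """Translate a host file path into the path the container sees.

    `mounts` maps host directories to container mount points, mirroring
    the -v flags in the Docker configuration. Hypothetical helper.
    """
    path = PurePosixPath(host_path)
    for host_dir, container_dir in mounts.items():
        try:
            relative = path.relative_to(host_dir)
        except ValueError:
            continue  # the file is not under this mount; try the next one
        return str(PurePosixPath(container_dir) / relative)
    raise ValueError(f"{host_path} is not inside any mounted directory")


# Example mount table (assumed paths, matching the style used in this README).
mounts = {
    "/Users/you/Documents": "/documents",
    "/Users/you/Downloads": "/downloads",
}
print(to_container_path("/Users/you/Documents/reports/q3.pdf", mounts))
```

A file outside every mounted directory raises ValueError here, which reflects the fact that the container cannot read it.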
Multiple volume mounts:
If you need to access files from multiple directories:
{
"mcpServers": {
"document-reader": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-v", "/Users/you/Documents:/documents:ro",
"-v", "/Users/you/Downloads:/downloads:ro",
"document-reader-mcp:latest"
]
}
}
}
Custom rate limiting:
{
"mcpServers": {
"document-reader": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-e", "DOC_READER_RATE_LIMIT_PER_MINUTE=120",
"-v", "/absolute/path/to/documents:/documents:ro",
"document-reader-mcp:latest"
]
}
}
}
Security considerations for Docker:
The container runs as non-root user (UID 1000)
Volumes are mounted read-only (:ro) for safety
No network ports are exposed
Container has minimal attack surface
Available Tools
Once configured, you can use these tools:
Tool: extract_text_from_file
Extract complete text from a document file.
Parameters:
path (string, required): Absolute or relative path to the document
max_pages (int, optional): For PDFs, parse only the first N pages (default: 50, set to 0 to disable)
max_rows (int, optional): For CSV/Excel, parse only N data rows (default: 500, set to 0 to disable)
Returns: Extracted text as string (automatically truncated at 100,000 characters by default)
Supported formats: .pdf, .xlsx, .xlsm, .csv, .txt, .json, .md, .docx
Note: For large files, use extract_text_from_file_stream instead to avoid memory issues.
Default Limits: To prevent AI context overflow, the tool applies sensible defaults:
PDFs: First 50 pages
Excel/CSV: First 500 rows
All formats: 100,000 character output limit
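How a row cap and a character cap combine can be sketched for the CSV case. This is an illustration only, not the server's code: extract_csv_text, the comma-joined output format, and the returned truncated flag are assumptions. The defaults and the "0 disables the cap" convention come from the parameter descriptions above.

```python
import csv
import io

DEFAULT_MAX_ROWS = 500       # documented default row cap
DEFAULT_MAX_CHARS = 100_000  # documented default output cap


def extract_csv_text(raw, max_rows=DEFAULT_MAX_ROWS, max_chars=DEFAULT_MAX_CHARS):
    """Return (text, truncated) for CSV content held in a string.

    Hypothetical sketch: the header is always kept, and max_rows counts
    data rows only. A max_rows of 0 disables the row cap.
    """
    reader = csv.reader(io.StringIO(raw))
    header = next(reader, None)
    lines = [] if header is None else [", ".join(header)]
    truncated = False
    for index, row in enumerate(reader):
        if max_rows and index >= max_rows:
            truncated = True  # at least one more data row was available
            break
        lines.append(", ".join(row))
    text = "\n".join(lines)
    if len(text) > max_chars:
        text = text[:max_chars]  # final cap on the overall output size
        truncated = True
    return text, truncated


sample = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(10))
text, truncated = extract_csv_text(sample, max_rows=3)
print(text)
print("truncated:", truncated)
```

Returning a flag alongside the text lets the caller warn that a limit was hit rather than silently dropping rows.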
Tool: extract_text_from_file_stream
Stream text chunks from a document (memory-efficient for large files).
Parameters:
path (string, required): Absolute or relative path to the document
max_pages (int, optional): For PDFs, page cap (default: 50, set to 0 to disable)
max_rows (int, optional): For CSV/Excel, row cap (default: 500, set to 0 to disable)
chunk_size (int, optional): Characters per chunk (default: 4096, min: 512)
Yields: Text chunks as strings
Supported formats: All formats from extract_text_from_file
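The chunking behaviour can be pictured as a generator that regroups text pieces into fixed-size chunks. The sketch below is an assumption about the general technique, not the server's implementation: iter_chunks is a hypothetical name, and clamping values below the documented minimum of 512 (rather than rejecting them) is a guess.

```python
from typing import Iterable, Iterator

MIN_CHUNK_SIZE = 512  # documented minimum for chunk_size


def iter_chunks(pieces: Iterable[str], chunk_size: int = 4096) -> Iterator[str]:
    """Regroup text pieces into chunks of at most chunk_size characters.

    Only one chunk's worth of text plus the current piece is held in
    memory at a time. Values below the minimum are clamped (assumption).
    """
    chunk_size = max(chunk_size, MIN_CHUNK_SIZE)
    buffer = ""
    for piece in pieces:
        buffer += piece
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer  # final partial chunk


pieces = ["a" * 1000, "b" * 1000]
for chunk in iter_chunks(pieces, chunk_size=512):
    print(len(chunk))
```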
Tool: convert_to_markdown
Convert various document formats to Markdown, extracting and saving images when applicable.
⚠️ Important: This tool converts the ENTIRE document and saves it to a file. It ignores the DOC_READER_DEFAULT_MAX_ROWS, DOC_READER_DEFAULT_MAX_PAGES, and DOC_READER_MAX_OUTPUT_CHARS environment variables. Only the preview returned to the AI is limited to protect context - the saved file contains the complete document.
Parameters:
path (string, required): Absolute or relative path to the file to convert
output_dir (string, optional): Directory where the markdown file and images will be saved. If not specified, saves in the same directory as the source file
output_filename (string, optional): Name for the output markdown file (without extension). If not specified, uses the source filename with .md extension
Returns: Dictionary containing:
markdown_path: Path to the saved markdown file (contains FULL content, not truncated)
images_dir: Path to the directory containing extracted images (if any)
image_count: Number of images extracted
markdown_preview: First 500 characters preview (truncated for AI context protection)
file_size_chars: Total character count of the saved markdown file
status: "success" or error status
message: Human-readable status message
Supported formats:
PDF (.pdf) - with automatic image extraction and positioning at page locations
Excel (.xlsx, .xlsm, .xltx, .xltm) - converted to markdown tables
Word (.docx) - with image extraction
CSV (.csv) - converted to markdown tables
PowerPoint (.pptx) - text and images
HTML (.html, .htm)
Plain text (.txt, .log)
Images (.jpg, .jpeg, .png) - with OCR if available
Example Usage:
# Convert a Word document with images
result = convert_to_markdown(
path="/path/to/document.docx",
output_dir="/path/to/output"
)
# Creates: /path/to/output/document.md
# /path/to/output/document_images/image_1.png
# /path/to/output/document_images/image_2.png
Important Notes:
Full file is saved: The complete markdown file is saved to disk without any truncation, regardless of size
Preview is truncated: Only the preview returned to the AI is limited to 500 characters to protect context
Images: Automatically extracted from supported formats and saved in a {filename}_images/ subdirectory, with markdown using relative paths to reference them
PDF images: Images are intelligently positioned throughout the markdown document at their corresponding page locations, making them viewable in preview
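The output layout and preview rule described above can be sketched as follows. Both helpers are hypothetical. In particular, naming the images directory after the output filename (rather than the source filename) when output_filename is given is an assumption, since the text only shows the default case.

```python
from pathlib import Path

PREVIEW_CHARS = 500  # documented preview length


def plan_output_paths(source, output_dir=None, output_filename=None):
    """Return (markdown_path, images_dir) for a conversion.

    Hypothetical helper: defaults to the source directory and the source
    file's stem, as described in the parameter list.
    """
    src = Path(source)
    target_dir = Path(output_dir) if output_dir else src.parent
    stem = output_filename or src.stem
    return target_dir / f"{stem}.md", target_dir / f"{stem}_images"


def make_preview(markdown: str, limit: int = PREVIEW_CHARS) -> str:
    """Return only the first `limit` characters; the saved file stays complete."""
    return markdown[:limit]


md_path, images_dir = plan_output_paths("/path/to/document.docx", "/path/to/output")
print(md_path.as_posix())
print(images_dir.as_posix())
print(len(make_preview("x" * 1200)))
```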
Usage Examples
In Cursor Chat:
Extract text from ~/Downloads/report.pdf and summarize the findings
Read the CSV file data.csv and show me the first 10 rows
What's in the JSON file config.json?
Convert the Word document ~/Documents/proposal.docx to Markdown and save it in ~/Documents/markdown/
Convert this Excel file to Markdown: ~/data/sales_report.xlsx
Programmatic Usage:
# Via MCP client - Extract text
result = await client.call_tool("extract_text_from_file", {
"path": "/path/to/document.pdf",
"max_pages": 5
})
# Streaming large files
async for chunk in client.stream_tool("extract_text_from_file_stream", {
"path": "/path/to/large_file.csv",
"chunk_size": 8192
}):
print(chunk)
# Convert to Markdown
result = await client.call_tool("convert_to_markdown", {
"path": "/path/to/document.docx",
"output_dir": "/path/to/output",
"output_filename": "converted_document"
})
print(f"Markdown saved to: {result['markdown_path']}")
print(f"Images extracted: {result['image_count']}")
Configuration
Environment Variables
Configure the server behavior using these environment variables:
DOC_READER_RATE_LIMIT_PER_MINUTE: Maximum tool calls per minute (default: 60)
Applies to: All tools
DOC_READER_MAX_OUTPUT_CHARS: Maximum output text size in characters (default: 100000)
Applies to: extract_text_from_file and extract_text_from_file_stream only
Does NOT apply to: convert_to_markdown (saves full file, only preview is limited)
DOC_READER_DEFAULT_MAX_ROWS: Default maximum rows for spreadsheets/CSV (default: 500, set to 0 to disable)
Applies to: extract_text_from_file and extract_text_from_file_stream only
Does NOT apply to: convert_to_markdown (converts entire document)
DOC_READER_DEFAULT_MAX_PAGES: Default maximum pages for PDFs (default: 50, set to 0 to disable)
Applies to: extract_text_from_file and extract_text_from_file_stream only
Does NOT apply to: convert_to_markdown (converts entire document)
Example:
export DOC_READER_RATE_LIMIT_PER_MINUTE=120
export DOC_READER_MAX_OUTPUT_CHARS=200000
export DOC_READER_DEFAULT_MAX_ROWS=1000
export DOC_READER_DEFAULT_MAX_PAGES=100
python -m server.main
Why these limits? Large documents can easily exceed AI model context windows (typically 200K-1M tokens). These defaults prevent context overflow while allowing flexibility for specific use cases. When limits are hit, the tool provides clear warnings with instructions on how to adjust them.
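Reading these variables with their documented defaults might look like the sketch below. The variable names and default values come from the list above; load_limits itself is hypothetical, and falling back to the default on an invalid or negative value is an assumption about reasonable behaviour, not a statement of what the server does.

```python
import os

# Variable names and defaults as documented above.
DEFAULTS = {
    "DOC_READER_RATE_LIMIT_PER_MINUTE": 60,
    "DOC_READER_MAX_OUTPUT_CHARS": 100_000,
    "DOC_READER_DEFAULT_MAX_ROWS": 500,
    "DOC_READER_DEFAULT_MAX_PAGES": 50,
}


def load_limits(environ=None):
    """Return a dict of integer limits read from the environment.

    Hypothetical helper: unset, non-numeric, or negative values fall
    back to the documented default (assumption).
    """
    env = os.environ if environ is None else environ
    limits = {}
    for name, default in DEFAULTS.items():
        try:
            value = int(env.get(name, ""))
        except ValueError:
            value = default
        limits[name] = value if value >= 0 else default
    return limits


print(load_limits({"DOC_READER_DEFAULT_MAX_ROWS": "1000"}))
```

Passing a plain dict instead of os.environ keeps the function easy to test.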
Technical Details
File Size Limits
Maximum file size: 100 MB
Files larger than this will be rejected with an error
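A size check of this kind is typically done before any parsing starts. The following is a minimal sketch with a hypothetical helper name; it assumes "100 MB" means 100 × 1024 × 1024 bytes, which the text does not specify.

```python
import os

# Assumption: 1 MB is taken as 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def check_file_size(path, limit=MAX_FILE_SIZE_BYTES):
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the limit is {limit} bytes")
    return size
```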
Encoding Detection
Text-based formats (CSV, TXT, JSON, Markdown) automatically try multiple encodings:
UTF-8
Latin-1 (ISO-8859-1)
Windows-1252 (CP1252)
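Trying encodings in order can be sketched as a loop over candidate codecs. The function below is an illustration, not the server's code; decode_with_fallback is a hypothetical name, and the default order simply follows the list above.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1", "cp1252")):
    """Return (text, encoding) using the first encoding that decodes cleanly.

    Hypothetical helper. Raises ValueError if every candidate fails.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes; try the next one
    raise ValueError(f"Failed to decode with any of: {', '.join(encodings)}")


print(decode_with_fallback("héllo".encode("utf-8")))
print(decode_with_fallback("héllo".encode("latin-1")))
```

One property of this sketch worth knowing: Python's latin-1 codec maps every byte value to a character, so it never raises, and any encoding listed after it is never reached. A caller who wants cp1252 to take precedence can pass a reordered tuple.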
Dependencies by Format
Format | Library | Type |
PDF (text) | | Included |
PDF (images) | PyMuPDF | Included |
Excel | | Included |
Word | | Included |
CSV | | Built-in |
TXT | File I/O (stdlib) | Built-in |
JSON | | Built-in |
Markdown | File I/O (stdlib) | Built-in |
Conversion | | Included |
⚠️ Important: This server reads local files from the filesystem.
Do NOT expose this server to untrusted networks
Only use in trusted MCP client environments (e.g., Cursor IDE)
Rate limiting is per-process, not per-user
No authentication is built-in
File paths are expanded with os.path.expanduser() (supports ~)
Troubleshooting
"Unsupported file type" error
Check that the file extension matches one of the supported formats
Supported: .pdf, .xlsx, .xlsm, .xltx, .xltm, .docx, .csv, .txt, .log, .json, .md, .markdown
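To check a file before calling a tool, the extension can be compared against the list above. This is a small sketch with a hypothetical helper; treating the comparison as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions copied from the supported list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".xlsx", ".xlsm", ".xltx", ".xltm", ".docx",
    ".csv", ".txt", ".log", ".json", ".md", ".markdown",
}


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("Report.PDF"))
print(is_supported("archive.tar.gz"))
```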
"Failed to decode" error
The file may use an unsupported text encoding
Try converting the file to UTF-8 encoding first
This typically affects CSV, TXT, JSON, and Markdown files
Rate limit exceeded
Increase the DOC_READER_RATE_LIMIT_PER_MINUTE environment variable
Or wait 60 seconds for the rate limit window to reset
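A process-wide limit of N calls per minute can be pictured with a sliding-window limiter. The class below is a sketch of that general technique, not the server's implementation, which may use a fixed window or another scheme; the class name and the injectable clock are assumptions made for illustration and testing.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls within any window_seconds interval."""

    def __init__(self, max_calls=60, window_seconds=60.0, clock=time.monotonic):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if over the limit."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._window:
            self._calls.popleft()
        if len(self._calls) >= self._max_calls:
            return False
        self._calls.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_calls=2, window_seconds=60.0)
print(limiter.try_acquire(), limiter.try_acquire(), limiter.try_acquire())
```

Because the state lives in one object inside one process, every client of that process shares the same budget, which matches the per-process behaviour noted under Security Considerations.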
Missing dependency errors
If you see "X is not installed" errors, reinstall dependencies:
pip install -r requirements.txt
For PDF image extraction issues, ensure PyMuPDF is installed:
pip install pymupdf
Windows-specific issues
PowerShell execution policy error
If you see the error "cannot be loaded because running scripts is disabled", run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Path length limitations (Windows)
Windows has a 260-character path limit by default. For long paths:
Enable long path support in Windows 10/11: Microsoft Docs
Or move the repository to a shorter path (e.g., C:\mcp\document-reader)
Python not found on Windows
Ensure Python 3.10+ is installed and added to PATH
Verify with:
python --version
If python doesn't work, try py or python3
Virtual environment activation issues on Windows
Command Prompt: Use .venv\Scripts\activate.bat
PowerShell: Use .venv\Scripts\Activate.ps1
Git Bash: Use source .venv/Scripts/activate
Docker-related issues
Docker not running
Ensure Docker Desktop is installed and running
On Windows, Docker Desktop requires WSL 2
Permission errors with Docker volumes
On Windows, ensure the drive is shared in Docker Desktop settings
Right-click Docker Desktop icon → Settings → Resources → File Sharing
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines on:
Setting up your development environment
Code style and commit conventions
Adding support for new file formats
Submitting pull requests
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
Issues: GitHub Issues
Discussions: GitHub Discussions
Version
Current version: 1.0.0