
OfficeReader-MCP

A Model Context Protocol (MCP) server that converts Microsoft Office documents (Word, Excel, PowerPoint) to Markdown format with intelligent image extraction and optimization.

License: MIT | Python 3.10+

Features

  • Multi-Format Support: Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt)

  • Intelligent Image Processing: Automatic extraction and optimization with WebP compression

  • Format Preservation: Maintains document structure including headings, tables, lists, and formatting

  • Metadata Extraction: Access document properties (author, title, creation date, etc.)

  • Efficient Caching: Smart caching system for quick reuse of converted documents

  • Cross-Platform: Works on Windows, macOS, and Linux

Supported Formats

| Format | Extensions | Features |
| --- | --- | --- |
| Word | .docx, .doc | Text formatting, headings, lists, tables, images |
| Excel | .xlsx, .xls | Multi-sheet support, tables, charts, embedded images |
| PowerPoint | .pptx, .ppt | Slides, text boxes, images, speaker notes, tables |

Installation

Prerequisites

  • Python 3.10 or higher

  • Claude Desktop or Claude Code

Step 1: Install the Package

    # Clone the repository
    git clone https://github.com/Asunainlove/office-reader-mcp.git
    cd office-reader-mcp

    # Install in editable mode
    pip install -e .

Step 2: Configure Claude

For Claude Desktop

Add to your Claude Desktop config file:

Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

    {
      "mcpServers": {
        "officereader": {
          "command": "python",
          "args": ["-m", "officereader_mcp.server"],
          "env": {
            "OFFICEREADER_CACHE_DIR": "/path/to/cache"
          }
        }
      }
    }

For Claude Code

Add to your Claude Code settings:

Windows: %LOCALAPPDATA%\claude-code\settings.json
macOS/Linux: ~/.config/claude-code/settings.json

    {
      "mcpServers": {
        "officereader": {
          "command": "python",
          "args": ["-m", "officereader_mcp.server"],
          "env": {
            "OFFICEREADER_CACHE_DIR": "/path/to/cache"
          }
        }
      }
    }
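If your config file already lists other MCP servers, the officereader entry has to be merged in rather than pasted over the whole file. The following is a minimal sketch of that merge for either client; the helper name `add_officereader_entry` and the pre-existing "other" server are hypothetical and only serve the illustration.

```python
import json


def add_officereader_entry(config: dict, cache_dir: str) -> dict:
    """Return a copy of `config` with the officereader server added.

    Hypothetical helper: it only builds the dictionary shown in this
    README and leaves reading/writing the actual file to the caller.
    """
    updated = dict(config)
    # Copy the server map so the caller's dictionary is not mutated.
    servers = dict(updated.get("mcpServers", {}))
    servers["officereader"] = {
        "command": "python",
        "args": ["-m", "officereader_mcp.server"],
        "env": {"OFFICEREADER_CACHE_DIR": cache_dir},
    }
    updated["mcpServers"] = servers
    return updated


# Example: an existing config that already has another (made-up) server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
merged = add_officereader_entry(existing, "/path/to/cache")
print(json.dumps(merged, indent=2))
```

Write the resulting JSON back to the config file for your client, then continue with Step 3.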

Step 3: Restart Claude

Restart Claude Desktop or Claude Code to load the MCP server.

Quick Start

After installation, you can use OfficeReader-MCP directly in your conversations with Claude:

Convert my Excel file at D:\Reports\sales_2024.xlsx to markdown
Extract text and images from D:\Presentations\keynote.pptx
Get metadata from my document at C:\Documents\report.docx

Available Tools

1. convert_document

Convert any supported Office document to Markdown format.

Parameters:

  • file_path (required): Absolute path to the document

  • extract_images (optional, default: true): Extract embedded images

  • image_format (optional, default: "file"): How to handle images

    • "file": Save images to disk (recommended)

    • "base64": Embed images as base64 in markdown

    • "both": Both save and embed

  • output_name (optional): Custom name for output files

Example:

Convert D:\Documents\report.xlsx with images
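For reference, the parameters above can be pictured as a small argument dictionary. This sketch assumes nothing about the server's internals: `build_convert_args` is a hypothetical helper that only mirrors the documented parameter names, defaults, and allowed `image_format` values.

```python
from pathlib import PurePosixPath, PureWindowsPath

ALLOWED_IMAGE_FORMATS = {"file", "base64", "both"}


def build_convert_args(file_path, extract_images=True,
                       image_format="file", output_name=None):
    """Build an argument dict for convert_document (illustrative only)."""
    # The tool expects an absolute path; accept Windows or POSIX style.
    is_absolute = (PureWindowsPath(file_path).is_absolute()
                   or PurePosixPath(file_path).is_absolute())
    if not is_absolute:
        raise ValueError(f"file_path must be absolute: {file_path!r}")
    if image_format not in ALLOWED_IMAGE_FORMATS:
        raise ValueError(f"image_format must be one of {sorted(ALLOWED_IMAGE_FORMATS)}")

    args = {
        "file_path": file_path,
        "extract_images": extract_images,
        "image_format": image_format,
    }
    # output_name is optional, so include it only when given.
    if output_name is not None:
        args["output_name"] = output_name
    return args


print(build_convert_args(r"D:\Documents\report.xlsx"))
```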

2. read_converted_markdown

Read the full content of a previously converted markdown file.

Parameters:

  • markdown_path (required): Path to the markdown file

Example:

Read the markdown at D:\cache\output\report_abc12345\report_abc12345.md

3. list_conversions

List all cached document conversions with details.

Example:

List all converted documents

4. clear_cache

Clear all cached conversions to free up disk space.

Example:

Clear the document cache

5. get_document_metadata

Extract metadata from a document without full conversion (faster).

Parameters:

  • file_path (required): Path to the document

Example:

Get metadata from D:\Documents\presentation.pptx

6. get_supported_formats

Get list of all supported file formats and extensions.

Example:

What file formats does officereader support?
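As a rough picture of what a format lookup involves, the sketch below maps file extensions to the three document families listed under Supported Formats. The `SUPPORTED` table and `detect_format` function are hypothetical names, not the project's actual code.

```python
from pathlib import PurePath

# Extensions taken from the Supported Formats table in this README.
SUPPORTED = {
    "Word": (".docx", ".doc"),
    "Excel": (".xlsx", ".xls"),
    "PowerPoint": (".pptx", ".ppt"),
}


def detect_format(path):
    """Return the format family for `path`, or None if unsupported."""
    suffix = PurePath(path).suffix.lower()  # case-insensitive match
    for family, extensions in SUPPORTED.items():
        if suffix in extensions:
            return family
    return None


for name in ("report.DOCX", "sales_2024.xlsx", "keynote.ppt", "notes.txt"):
    print(name, "->", detect_format(name))
```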

Output Structure

Converted documents are organized in the cache directory:

    cache/
    └── output/
        └── document_name_abc12345/
            ├── document_name_abc12345.md    # Converted markdown
            └── images/
                ├── image_001.webp           # Optimized images
                ├── slide2_image_002.webp
                └── excel_image_003.webp
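The folder names in this layout combine the document name with a short hash suffix. How that suffix is derived is not documented here, so the sketch below simply assumes the first 8 hex characters of a SHA-256 digest of the file contents; `output_layout` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def output_layout(cache_dir, source_name, content: bytes):
    """Compute where a converted document would live in the cache.

    Assumption: the 8-character suffix is a truncated SHA-256 of the
    file contents. The real project may derive it differently.
    """
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem = PurePosixPath(source_name).stem  # "report.docx" -> "report"
    folder = PurePosixPath(cache_dir) / "output" / f"{stem}_{digest}"
    return {
        "markdown": folder / f"{stem}_{digest}.md",
        "images": folder / "images",
    }


layout = output_layout("cache", "report.docx", b"example file bytes")
print(layout["markdown"])
print(layout["images"])
```

Deriving the suffix from content means an unchanged file maps to the same folder on every run, which is what makes cache reuse possible.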

Image Optimization

Images are automatically optimized to reduce file size while maintaining quality:

  • Max Dimensions: 1920×1080 pixels (configurable)

  • Format: WebP (preferred) or PNG/JPEG fallback

  • Quality: 80% for photos, 85% for JPEG, lossless PNG for graphics with transparency

  • Typical Compression: 50-80% size reduction

  • Smart Detection: Automatically distinguishes between photos and graphics
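The Max Dimensions rule amounts to scaling an image down until it fits inside a bounding box while keeping its aspect ratio. A minimal sketch of that arithmetic follows; `fit_within` is a hypothetical function, and the actual optimizer may round or resample differently.

```python
def fit_within(width, height, max_width=1920, max_height=1080):
    """Scale (width, height) to fit the box, never enlarging the image."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    # The tighter of the two limits decides the scale; cap at 1.0 so
    # small images are left at their original size.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3840, 2160))  # a 4K frame is halved
print(fit_within(800, 600))    # already small enough, unchanged
print(fit_within(1000, 4000))  # tall image, limited by height
```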

Technical Details

Architecture

    OfficeReader-MCP/
    ├── src/officereader_mcp/
    │   ├── server.py             # MCP server implementation
    │   ├── converter.py          # Word converter (DocxConverter, OfficeConverter)
    │   ├── excel_converter.py    # Excel to Markdown converter
    │   ├── pptx_converter.py     # PowerPoint to Markdown converter
    │   ├── image_optimizer.py    # Image compression utility
    │   └── __init__.py           # Package initialization
    ├── test/
    │   ├── test_converter.py     # Basic functionality tests
    │   └── test_all_formats.py   # Comprehensive test suite
    ├── pyproject.toml            # Project configuration
    └── README.md                 # Documentation

Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| mcp | >=1.0.0 | Model Context Protocol SDK |
| python-docx | >=1.1.0 | DOCX file parsing and manipulation |
| mammoth | >=1.6.0 | DOC/DOCX to HTML conversion (fallback) |
| Pillow | >=10.0.0 | Image processing and optimization |
| markdownify | >=0.11.0 | HTML to Markdown conversion |
| openpyxl | >=3.1.0 | Excel file parsing |
| python-pptx | >=0.6.21 | PowerPoint file parsing |

All dependencies are automatically installed when you run pip install -e .

Testing

Run Tests

    # Basic converter test
    python test/test_converter.py

    # Comprehensive test suite for all formats
    python test/test_all_formats.py

    # Test with a specific document
    python test/test_converter.py path/to/your/document.docx

Test Coverage

The test suite verifies:

  • Module imports and initialization

  • Converter functionality for all formats

  • Image extraction and optimization

  • File type detection

  • Cache management

  • Metadata extraction

Configuration

OfficeReader-MCP supports multiple configuration methods to customize cache locations and behavior.

Quick Configuration (Recommended)

  1. Copy the example config file:

    cp config.example.json config.json

  2. Edit config.json to set your cache directory:

    {
      "cache_dir": "D:/MyDocuments/OfficeReaderCache",
      "image_optimization": {
        "enabled": true,
        "max_dimension": 1920,
        "quality": 80
      }
    }

  3. The config file will be automatically loaded on startup.

For detailed configuration options, see CONFIG.md.
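A config file like the one above usually only overrides a few keys, with the rest falling back to defaults. The sketch below shows one way such a merge could work. The `DEFAULTS` values are assumptions copied from the example config, and `load_config` is a hypothetical function, not the project's loader; CONFIG.md remains the authority on actual behavior.

```python
import copy
import json

# Assumed defaults, mirroring the example config in this README.
DEFAULTS = {
    "cache_dir": None,
    "image_optimization": {"enabled": True, "max_dimension": 1920, "quality": 80},
}


def load_config(text):
    """Parse config JSON and overlay it on the assumed defaults."""
    user = json.loads(text)
    merged = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key.
            merged[key].update(value)
        else:
            merged[key] = value
    return merged


config = load_config('{"cache_dir": "D:/MyDocuments/OfficeReaderCache", '
                     '"image_optimization": {"quality": 70}}')
print(config)
```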

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| OFFICEREADER_CACHE_DIR | Directory for cached conversions | System temp directory |

Example usage:

    # Set a custom cache directory (Unix/macOS)
    export OFFICEREADER_CACHE_DIR=/path/to/custom/cache

    # Or on Windows (Command Prompt)
    set OFFICEREADER_CACHE_DIR=C:\path\to\custom\cache

Note: Environment variables take priority over config file settings.
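The precedence described in this section (environment variable first, then the config file, then the system temp directory) can be sketched as a single lookup chain. `resolve_cache_dir` is a hypothetical name used for illustration only.

```python
import os
import tempfile


def resolve_cache_dir(config, environ=None):
    """Pick the cache directory: env var > config file > system temp."""
    environ = os.environ if environ is None else environ
    return (
        environ.get("OFFICEREADER_CACHE_DIR")  # highest priority
        or config.get("cache_dir")             # from config.json
        or tempfile.gettempdir()               # documented default
    )


config = {"cache_dir": "D:/MyDocuments/OfficeReaderCache"}
print(resolve_cache_dir(config, environ={}))
print(resolve_cache_dir(config, environ={"OFFICEREADER_CACHE_DIR": "/srv/cache"}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.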

Usage Examples

Converting Excel with Multiple Sheets

User: Convert my Excel file at D:\Reports\Q4_sales.xlsx

Claude: I'll convert that Excel file. Each sheet will be converted to a separate section in the markdown with properly formatted tables...

[Output includes all sheets as markdown tables with preserved formatting]
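To make the "one section per sheet" output concrete, here is a small sketch that turns rows of cell values into Markdown tables. It works on plain Python lists rather than a real workbook, and both function names are hypothetical. It also assumes the first row of each sheet is the header and that rows are rectangular.

```python
def rows_to_table(rows):
    """Render a list of rows as a Markdown pipe table (first row = header)."""
    def cell(value):
        text = "" if value is None else str(value)
        # Escape pipes and flatten newlines so cells stay on one line.
        return text.replace("|", "\\|").replace("\n", " ")

    header, *body = rows
    lines = [
        "| " + " | ".join(cell(v) for v in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(cell(v) for v in row) + " |" for row in body]
    return "\n".join(lines)


def workbook_to_markdown(sheets):
    """Emit one '## <sheet name>' section per non-empty sheet."""
    sections = [f"## {name}\n\n{rows_to_table(rows)}"
                for name, rows in sheets.items() if rows]
    return "\n\n".join(sections)


sheets = {
    "Q4": [["Region", "Sales"], ["North", 1200], ["South", None]],
    "Notes": [["Item"], ["Uses a|b syntax"]],
}
print(workbook_to_markdown(sheets))
```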

Extracting PowerPoint Content

User: Extract all text and images from D:\Presentations\product_launch.pptx

Claude: Converting the PowerPoint presentation. I'll extract text from each slide, including speaker notes, along with all embedded images...

[Output includes slide-by-slide breakdown with images and notes]

Batch Processing

User: Convert all Office documents in D:\Documents\

Claude: I'll convert each document and cache the results for quick access...

[Processes all supported files and provides summary]

Troubleshooting

"Module not found" Error

    # Reinstall the package
    pip install -e .

Configuration Not Loading

  1. Verify the config file location is correct

  2. Check JSON syntax is valid (use a JSON validator)

  3. Restart Claude Desktop or Claude Code completely

  4. Check logs for error messages

Images Not Extracting

Possible causes:

  • Document contains linked images (not embedded)

  • Insufficient write permissions for cache directory

  • Image format not supported by the document library

Solution:

    # Verify the cache directory is writable
    ls -la /path/to/cache      # Unix/macOS
    dir C:\path\to\cache       # Windows

    # Check whether images are embedded:
    # call convert_document with extract_images=true explicitly
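A directory listing shows permissions but does not prove that the current user can actually write there. As a cross-platform alternative, the following sketch tries to create a temporary file in the directory. `is_writable` is a hypothetical helper and is not part of OfficeReader-MCP.

```python
import tempfile
from pathlib import Path


def is_writable(directory):
    """Return True if a file can be created inside `directory`."""
    path = Path(directory)
    if not path.is_dir():
        return False
    try:
        # The temporary file is removed automatically when closed.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


print(is_writable(tempfile.gettempdir()))
print(is_writable("/path/that/does/not/exist"))
```

Run it with the path you set in OFFICEREADER_CACHE_DIR or config.json to rule out permission problems before investigating the document itself.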

Encoding Issues

The converter uses UTF-8 encoding throughout. If you see garbled text:

  • Check the source document encoding

  • Ensure your terminal/console supports UTF-8

  • Try converting with different system locale settings

Changelog

v2.0.0 (2024-11)

Major Features:

  • Added Excel (.xlsx, .xls) support with multi-sheet conversion

  • Added PowerPoint (.pptx, .ppt) support with slide extraction

  • Implemented intelligent image optimization with WebP compression

  • Added unified OfficeConverter interface for all document types

  • Enhanced metadata extraction for all formats

Improvements:

  • Smart caching system with hash-based file identification

  • Lazy-loading of format-specific converters for better performance

  • Better error handling and validation

  • Comprehensive test suite for all formats

Tools:

  • Added get_supported_formats tool

  • Enhanced get_document_metadata for all formats

  • Improved list_conversions with detailed cache information

v1.0.0 (2024-09)

  • Initial release

  • Word document (.docx, .doc) conversion

  • Basic image extraction

  • MCP server implementation

Contributing

Contributions are welcome! Here's how you can help:

  1. Report Bugs: Open an issue with details and steps to reproduce

  2. Suggest Features: Describe your idea and use case

  3. Submit Pull Requests:

    • Fork the repository

    • Create a feature branch (git checkout -b feature/amazing-feature)

    • Commit your changes (git commit -m 'Add amazing feature')

    • Push to your branch (git push origin feature/amazing-feature)

    • Open a Pull Request

Development Setup

    # Clone and install with dev dependencies
    git clone https://github.com/Asunainlove/office-reader-mcp.git
    cd office-reader-mcp
    pip install -e ".[dev]"

    # Run tests
    python test/test_all_formats.py

    # Run linting (if configured)
    black src/
    ruff check src/

License

MIT License - see LICENSE file for details.

Author

Asunainlove

Acknowledgments

This project uses the open-source libraries listed in the Dependencies section.

Support

If you find this project helpful, please:

  • ⭐ Star the repository

  • 🐛 Report bugs and issues

  • 💡 Suggest new features

  • 🔀 Contribute code improvements


Happy converting! 🚀
