Word Document Reader MCP Server

# Enhanced Word Document Reader MCP Server

A powerful Word document reading MCP server with table extraction, image OCR analysis, large document optimization, and intelligent caching.

## 🚀 New Features

### 1. Table Extraction

- Automatically identify and extract tables from Word documents
- Convert to structured data format
- Preserve table row/column structure

### 2. Image OCR Analysis

- Extract images from Word documents
- Use Tesseract.js v5 for OCR text recognition
- Support mixed Chinese-English recognition
- Intelligent image preprocessing to improve recognition accuracy

### 3. Large Document Optimization

- Automatically detect large documents (>10MB or >100 pages)
- Parallel processing to improve analysis speed
- Chunked processing to avoid memory overflow
- Worker thread multi-core processing

### 4. Smart Caching System

- File system caching of parsing results
- Smart cache invalidation based on file modification time
- Support cache statistics and management
- Significantly improve repeated document processing speed

### 5. Full-text Index Search

- Inverted index for fast search
- Support Chinese-English word segmentation
- Relevance scoring and sorting
- Real-time index updates

## 📦 Install Dependencies

```bash
npm install
```

## 🛠️ Usage

### Start Server

```bash
npm start
# or
node server.js
```

### Configuration

Edit `config.json` to customize behavior:

```json
{
  "processing": {
    "maxFileSize": 10485760,
    "maxPages": 100,
    "chunkSize": 1048576,
    "parallelProcessing": true
  },
  "cache": {
    "enabled": true,
    "defaultTTL": 3600,
    "cacheDirectory": "./.cache"
  },
  "ocr": {
    "enabled": true,
    "languages": ["chi_sim", "eng"]
  }
}
```

## 🎯 API Examples

### Enhanced Document Reading

```javascript
const result = await mcp.call("read_word_document", {
  filePath: "./document.docx",
  memoryKey: "my-doc",
  documentType: "api-doc",
  extractTables: true,   // Extract tables
  extractImages: true,   // Extract and OCR images
  useCache: true,        // Use smart caching
  outputDir: "./output"  // Output directory for extracted images
});
```

### Advanced Search

```javascript
const searchResults = await mcp.call("search_documents", {
  query: "table configuration",
  documentType: "api-doc",
  limit: 10
});
```

### Cache Management

```javascript
// Get cache statistics
const stats = await mcp.call("get_cache_stats");

// Clear specific cache type
await mcp.call("clear_cache", {
  type: "document" // "all", "document", "index"
});
```

## 📊 Performance Improvements

| Feature | Basic Version | Enhanced Version | Improvement |
|---------|---------------|------------------|-------------|
| Large Document Processing | Serial | Parallel | 60%+ faster |
| Repeated Document Access | No Cache | Smart Cache | 90%+ faster |
| Table Recognition | Manual | Automatic | New feature |
| Image Analysis | Not supported | OCR with preprocessing | New feature |
| Search Speed | Linear scan | Full-text index | <100ms response |

## 🔧 Technical Details

### Table Extraction Algorithm

1. XML parsing of document structure
2. Table boundary detection
3. Cell content extraction
4. Row/column relationship mapping
5. Structured data output

### OCR Processing Pipeline

1. Image extraction from document
2. Preprocessing (noise reduction, contrast enhancement)
3. Text region detection
4. Character recognition with Tesseract.js v5
5. Post-processing and confidence scoring

### Caching Strategy

- **Document Cache**: Parsed document content with TTL
- **Index Cache**: Search index for fast retrieval
- **Image Cache**: Processed images with OCR results
- **Smart Invalidation**: File modification time based

### Parallel Processing

- Worker threads for CPU-intensive tasks
- Chunked memory management
- Concurrent table and image processing
- Resource pooling for efficiency

## 🧪 Testing Enhanced Features

```bash
# Run all tests
npm test

# Test specific features
node --test tests/integration/tools/
node --test tests/integration/cache/

# Performance benchmarks
node tests/benchmark/
```

## 📈 Monitoring and Debugging

### Enable Debug Mode

```bash
DEBUG=* node server.js
```

### Performance Metrics

```javascript
const stats = await mcp.call("get_cache_stats");
console.log("Cache hit rate:", stats.hitRate);
console.log("Average processing time:", stats.avgProcessingTime);
```

### Resource Usage

- Memory usage: `process.memoryUsage()`
- Cache statistics: Built-in monitoring
- Processing time: Automatic tracking

## 🔒 Security Enhancements

- File type validation
- Size limits enforcement
- Memory usage protection
- Temporary file cleanup
- Cache isolation

## 📝 Migration Guide

### From Basic Version

1. Install additional dependencies:

   ```bash
   npm install tesseract.js node-cache sharp jszip
   ```

2. Update server reference:

   ```json
   {
     "mcpServers": {
       "word-doc-reader": {
         "command": "node",
         "args": ["path/to/word-doc-mcp/server.js"]
       }
     }
   }
   ```

3. Optional: Configure `config.json`

### Performance Tuning

1. **For Large Documents** (20 MB file size limit, 2 MB chunks)

   ```json
   {
     "processing": {
       "maxFileSize": 20971520,
       "chunkSize": 2097152,
       "parallelProcessing": true
     }
   }
   ```

2. **For OCR Accuracy**

   ```json
   {
     "ocr": {
       "enabled": true,
       "languages": ["chi_sim", "eng"],
       "preprocessing": true
     }
   }
   ```

3. **For Cache Optimization** (2-hour TTL, 1 GB maximum cache size)

   ```json
   {
     "cache": {
       "enabled": true,
       "defaultTTL": 7200,
       "maxCacheSize": 1073741824
     }
   }
   ```

## 🚨 Troubleshooting Enhanced Features

### OCR Issues

- Ensure sufficient memory (8GB+ recommended)
- Check image format support
- Verify language pack installation

### Large Document Processing

- Increase available memory
- Enable parallel processing
- Adjust chunk size

### Cache Problems

- Check disk space
- Verify write permissions
- Clear corrupted cache

## 📚 Advanced Usage

### Custom Document Types

```javascript
const result = await mcp.call("read_word_document", {
  filePath: "./technical-spec.docx",
  memoryKey: "tech-spec",
  documentType: "technical-doc", // Custom type
  extractTables: true,
  extractImages: false           // Skip images for speed
});
```

### Batch Processing

```javascript
const documents = ["doc1.docx", "doc2.docx", "doc3.docx"];

const results = await Promise.all(
  documents.map(doc =>
    mcp.call("read_word_document", {
      filePath: doc,
      memoryKey: doc.replace('.docx', ''),
      documentType: "batch-doc"
    })
  )
);
```

## 🔮 Future Enhancements

- Support for more document formats
- Advanced image analysis (charts, diagrams)
- Machine learning-based table detection
- Distributed processing for very large documents
- Real-time collaboration features
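
## 🧩 Example: Modification-Time Cache Invalidation

The Caching Strategy section describes document entries with a TTL and "smart invalidation" based on file modification time. The sketch below illustrates that idea only; it is not the server's actual implementation. The `MtimeCache` class, its method names, and the `hits`/`misses` counters are hypothetical, and it uses an in-memory `Map` for brevity, whereas the README describes a file system cache.

```javascript
// Hypothetical sketch: none of these names come from the server's source.
const fs = require("node:fs");

class MtimeCache {
  constructor(ttlSeconds = 3600) {
    this.ttlMs = ttlSeconds * 1000;
    this.entries = new Map(); // filePath -> { mtimeMs, storedAt, value }
    this.hits = 0;
    this.misses = 0;
  }

  // Return the cached value, or undefined when the entry is missing,
  // older than the TTL, or the file changed on disk after it was cached.
  get(filePath) {
    const entry = this.entries.get(filePath);
    const mtimeMs = fs.statSync(filePath).mtimeMs;
    const fresh =
      entry !== undefined &&
      entry.mtimeMs === mtimeMs &&
      Date.now() - entry.storedAt < this.ttlMs;
    if (!fresh) {
      this.entries.delete(filePath); // drop stale entries eagerly
      this.misses += 1;
      return undefined;
    }
    this.hits += 1;
    return entry.value;
  }

  // Store a parsed result together with the file's current modification time.
  set(filePath, value) {
    const mtimeMs = fs.statSync(filePath).mtimeMs;
    this.entries.set(filePath, { mtimeMs, storedAt: Date.now(), value });
  }

  // Fraction of lookups served from the cache (0 when nothing was looked up).
  get hitRate() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```

A caller would check `cache.get(filePath)` before parsing a document and call `cache.set(filePath, parsed)` afterwards. Comparing the stored `mtimeMs` on every lookup costs one `stat` call per read, which is typically far cheaper than re-parsing a Word file.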
