by little2512
# Enhanced Word Document Reader MCP Server

A powerful Word document reading MCP server with table extraction, image OCR analysis, large document optimization, and intelligent caching.

## 🚀 New Features

### 1. Table Extraction

- Automatically identify and extract tables from Word documents
- Convert to structured data format
- Preserve table row/column structure

### 2. Image OCR Analysis

- Extract images from Word documents
- Use Tesseract.js v5 for OCR text recognition
- Support mixed Chinese-English recognition
- Intelligent image preprocessing to improve recognition accuracy

### 3. Large Document Optimization

- Automatically detect large documents (>10 MB or >100 pages)
- Parallel processing to improve analysis speed
- Chunked processing to avoid memory overflow
- Worker thread multi-core processing

### 4. Smart Caching System

- File system caching of parsing results
- Smart cache invalidation based on file modification time
- Support cache statistics and management
- Significantly improve repeated document processing speed

### 5. Full-text Index Search

- Inverted index for fast search
- Support Chinese-English word segmentation
- Relevance scoring and sorting
- Real-time index updates

## 📦 Install Dependencies

```bash
npm install
```

## 🛠️ Usage

### Start Server

```bash
npm start
# or
node server.js
```

### Configuration

Edit `config.json` to customize behavior:

```json
{
  "processing": {
    "maxFileSize": 10485760,
    "maxPages": 100,
    "chunkSize": 1048576,
    "parallelProcessing": true
  },
  "cache": {
    "enabled": true,
    "defaultTTL": 3600,
    "cacheDirectory": "./.cache"
  },
  "ocr": {
    "enabled": true,
    "languages": ["chi_sim", "eng"]
  }
}
```

## 🎯 API Examples

### Enhanced Document Reading

```javascript
const result = await mcp.call("read_word_document", {
  filePath: "./document.docx",
  memoryKey: "my-doc",
  documentType: "api-doc",
  extractTables: true,   // Extract tables
  extractImages: true,   // Extract and OCR images
  useCache: true,        // Use smart caching
  outputDir: "./output"  // Output directory for extracted images
});
```

### Advanced Search

```javascript
const searchResults = await mcp.call("search_documents", {
  query: "table configuration",
  documentType: "api-doc",
  limit: 10
});
```

### Cache Management

```javascript
// Get cache statistics
const stats = await mcp.call("get_cache_stats");

// Clear specific cache type
await mcp.call("clear_cache", {
  type: "document" // one of "all", "document", "index"
});
```

## 📊 Performance Improvements

| Feature | Basic Version | Enhanced Version | Improvement |
|---------|---------------|------------------|-------------|
| Large Document Processing | Serial | Parallel | 60%+ faster |
| Repeated Document Access | No Cache | Smart Cache | 90%+ faster |
| Table Recognition | Manual | Automatic | New feature |
| Image Analysis | Not supported | OCR with preprocessing | New feature |
| Search Speed | Linear scan | Full-text index | <100ms response |

## 🔧 Technical Details

### Table Extraction Algorithm

1. XML parsing of document structure
2. Table boundary detection
3. Cell content extraction
4. Row/column relationship mapping
5. Structured data output

### OCR Processing Pipeline

1. Image extraction from document
2. Preprocessing (noise reduction, contrast enhancement)
3. Text region detection
4. Character recognition with Tesseract.js v5
5. Post-processing and confidence scoring

### Caching Strategy

- **Document Cache**: Parsed document content with TTL
- **Index Cache**: Search index for fast retrieval
- **Image Cache**: Processed images with OCR results
- **Smart Invalidation**: File modification time based

### Parallel Processing

- Worker threads for CPU-intensive tasks
- Chunked memory management
- Concurrent table and image processing
- Resource pooling for efficiency

## 🧪 Testing Enhanced Features

```bash
# Run all tests
npm test

# Test specific features
node --test tests/integration/tools/
node --test tests/integration/cache/

# Performance benchmarks
node tests/benchmark/
```

## 📈 Monitoring and Debugging

### Enable Debug Mode

```bash
DEBUG=* node server.js
```

### Performance Metrics

```javascript
const stats = await mcp.call("get_cache_stats");
console.log("Cache hit rate:", stats.hitRate);
console.log("Average processing time:", stats.avgProcessingTime);
```

### Resource Usage

- Memory usage: `process.memoryUsage()`
- Cache statistics: Built-in monitoring
- Processing time: Automatic tracking

## 🔒 Security Enhancements

- File type validation
- Size limits enforcement
- Memory usage protection
- Temporary file cleanup
- Cache isolation

## 📝 Migration Guide

### From Basic Version

1. Install additional dependencies:

   ```bash
   npm install tesseract.js node-cache sharp jszip
   ```

2. Update server reference:

   ```json
   {
     "mcpServers": {
       "word-doc-reader": {
         "command": "node",
         "args": ["path/to/word-doc-mcp/server.js"]
       }
     }
   }
   ```

3. Optional: Configure `config.json`

### Performance Tuning

1. **For Large Documents** (20 MB file size limit, 2 MB chunks)

   ```json
   {
     "processing": {
       "maxFileSize": 20971520,
       "chunkSize": 2097152,
       "parallelProcessing": true
     }
   }
   ```

2. **For OCR Accuracy**

   ```json
   {
     "ocr": {
       "enabled": true,
       "languages": ["chi_sim", "eng"],
       "preprocessing": true
     }
   }
   ```

3. **For Cache Optimization** (2-hour TTL, 1 GB maximum cache size)

   ```json
   {
     "cache": {
       "enabled": true,
       "defaultTTL": 7200,
       "maxCacheSize": 1073741824
     }
   }
   ```

## 🚨 Troubleshooting Enhanced Features

### OCR Issues

- Ensure sufficient memory (8GB+ recommended)
- Check image format support
- Verify language pack installation

### Large Document Processing

- Increase available memory
- Enable parallel processing
- Adjust chunk size

### Cache Problems

- Check disk space
- Verify write permissions
- Clear corrupted cache

## 📚 Advanced Usage

### Custom Document Types

```javascript
const result = await mcp.call("read_word_document", {
  filePath: "./technical-spec.docx",
  memoryKey: "tech-spec",
  documentType: "technical-doc", // Custom type
  extractTables: true,
  extractImages: false           // Skip images for speed
});
```

### Batch Processing

```javascript
const documents = ["doc1.docx", "doc2.docx", "doc3.docx"];

const results = await Promise.all(
  documents.map(doc =>
    mcp.call("read_word_document", {
      filePath: doc,
      memoryKey: doc.replace('.docx', ''),
      documentType: "batch-doc"
    })
  )
);
```

## 🔮 Future Enhancements

- Support for more document formats
- Advanced image analysis (charts, diagrams)
- Machine learning-based table detection
- Distributed processing for very large documents
- Real-time collaboration features
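
For reference, the smart invalidation described under Caching Strategy (file modification time based) might look roughly like the sketch below. It is not taken from the server's source: `readThroughCache`, `cachePathFor`, the one-JSON-file-per-document layout, and the `parse` callback are all hypothetical names and choices made for illustration, and TTL handling is left out.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const crypto = require("node:crypto");

// Hypothetical helper: maps a document path to a cache entry file.
// Hashing the absolute path keeps entry names filesystem-safe.
function cachePathFor(cacheDir, filePath) {
  const key = crypto
    .createHash("sha256")
    .update(path.resolve(filePath))
    .digest("hex");
  return path.join(cacheDir, `${key}.json`);
}

// Hypothetical read-through cache: returns the cached parse result while
// the file's modification time is unchanged, and re-parses otherwise.
function readThroughCache(cacheDir, filePath, parse) {
  const mtimeMs = fs.statSync(filePath).mtimeMs;
  const entryPath = cachePathFor(cacheDir, filePath);

  if (fs.existsSync(entryPath)) {
    try {
      const entry = JSON.parse(fs.readFileSync(entryPath, "utf8"));
      if (entry.mtimeMs === mtimeMs) {
        return { value: entry.value, hit: true };
      }
    } catch {
      // Corrupted entry: ignore it and fall through to a fresh parse.
    }
  }

  const value = parse(filePath);
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(entryPath, JSON.stringify({ mtimeMs, value }));
  return { value, hit: false };
}
```

Under these assumptions, calling `readThroughCache("./.cache", "./document.docx", parseFn)` twice in a row would parse once and serve the second call from disk. Saving the document again changes its modification time, so the next call re-parses.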
