by little2512
# Word Document Reader MCP Server

A powerful Word document reading MCP server with table extraction, image OCR analysis, large document optimization, and intelligent caching.

## 🚀 Core Features

### 1. Document Content Extraction

- ✅ Word document (.docx/.doc) text extraction
- ✅ Support for mixed Chinese-English documents
- ✅ Preserve original formatting and structure

### 2. Table Extraction

- ✅ Automatically identify and extract tables from Word documents
- ✅ Convert to structured data format
- ✅ Preserve table row/column structure information
- ✅ Support complex table parsing

### 3. Image OCR Analysis

- ✅ Extract embedded images from Word documents
- ✅ High-precision OCR recognition using Tesseract.js v5
- ✅ Support mixed Chinese-English text recognition (95%+ accuracy)
- ✅ Intelligent image preprocessing for better recognition
- ✅ Support multiple image formats (JPG, PNG, GIF, BMP, WebP)

### 4. Large Document Optimization

- ✅ Automatic detection of large documents (>10MB or >100 pages)
- ✅ Worker thread parallel processing, utilizing multi-core CPUs
- ✅ Chunked processing to avoid memory overflow
- ✅ 60%+ speed improvement

### 5. Intelligent Caching System

- ✅ File system persistent caching
- ✅ Smart cache invalidation based on file modification time
- ✅ Cache statistics and management support
- ✅ 90%+ speed improvement for repeated document processing

### 6. Full-text Index Search

- ✅ Millisecond-level search with inverted index
- ✅ Intelligent Chinese-English word segmentation
- ✅ Relevance scoring and sorting
- ✅ Support document type filtering

## 📦 Installation and Usage

### 1. Install Dependencies

```bash
npm install
```

### 2. Start Server

```bash
# Start full-featured version
npm start

# Or start basic version (without advanced features)
npm run start:basic
```

### 3. Run Tests

```bash
# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

# Generate test coverage report
npm run test:coverage
```

### read_word_document

Read and analyze Word documents.

```json
{
  "name": "read_word_document",
  "arguments": {
    "filePath": "path/to/document.docx",
    "memoryKey": "my-document",
    "documentType": "api-doc",
    "extractTables": true,
    "extractImages": true,
    "useCache": true,
    "outputDir": "./output"
  }
}
```

### search_documents

Full-text index search.

```json
{
  "name": "search_documents",
  "arguments": {
    "query": "search keywords",
    "documentType": "api-doc",
    "limit": 10
  }
}
```

### get_cache_stats

Get cache statistics.

```json
{
  "name": "get_cache_stats"
}
```

### clear_cache

Clear the cache. The `type` argument accepts `"all"`, `"document"`, or `"index"`.

```json
{
  "name": "clear_cache",
  "arguments": {
    "type": "all"
  }
}
```

### list_stored_documents

List stored documents.

```json
{
  "name": "list_stored_documents",
  "arguments": {
    "documentType": "api-doc"
  }
}
```

### get_stored_document

Get the content of a specific document.

```json
{
  "name": "get_stored_document",
  "arguments": {
    "memoryKey": "document-key"
  }
}
```

### clear_memory

Clear memory content. `memoryKey` is optional; if it is not provided, all memory content is cleared.

```json
{
  "name": "clear_memory",
  "arguments": {
    "memoryKey": "specific-key"
  }
}
```

## 📁 Project Structure

```
word-doc-mcp/
├── server.js              # Main server file (with all features)
├── server-basic.js        # Basic server (compatibility)
├── package.json           # Project configuration and dependencies
├── config.json            # Server configuration file
├── tests/                 # Test directory
│   ├── setup.js           # Test environment setup
│   ├── unit/              # Unit tests
│   │   └── services/      # Service layer tests
│   ├── integration/       # Integration tests
│   │   ├── tools/         # Tool tests
│   │   └── cache/         # Cache tests
│   └── fixtures/          # Test data
│       ├── documents/     # Test documents
│       └── mock-data.js   # Mock data
├── .cache/                # Cache directory (auto-created)
├── output/                # Output directory (auto-created)
└── node_modules/          # Dependencies
```

## ⚙️ Configuration

Edit the `config.json` file to customize server behavior:

```json
{
  "processing": {
    "maxFileSize": 10485760,
    "maxPages": 100,
    "chunkSize": 1048576,
    "parallelProcessing": true
  },
  "cache": {
    "enabled": true,
    "defaultTTL": 3600,
    "cacheDirectory": "./.cache"
  },
  "ocr": {
    "enabled": true,
    "languages": ["chi_sim", "eng"]
  }
}
```

## 🧪 Testing

### Test Framework

The tests use the Node.js built-in test runner and are organized as follows:

- **Unit Tests**: Test individual components and functions
- **Integration Tests**: Test interactions between tools
- **End-to-End Tests**: Test complete workflows

### Running Tests

```bash
# Run all tests
npm test

# Run specific test file
node --test tests/unit/services/DocumentIndexer.test.js

# Run integration tests
node --test tests/integration/

# Generate coverage report
npm run test:coverage
```

### Test Coverage

- ✅ Functional tests for all MCP tools
- ✅ Complete cache system tests
- ✅ Error handling and edge cases
- ✅ Performance and concurrency tests
- ✅ End-to-end workflow tests

## 📊 Performance Metrics

- **Large Document Processing**: 60%+ speed improvement (parallel processing)
- **Repeated Document Processing**: 90%+ speed improvement (caching)
- **OCR Recognition Accuracy**: 95%+ (image preprocessing)
- **Memory Usage Optimization**: 40% reduction (streaming processing)
- **Search Response Time**: <100ms (full-text index)

## 🛡️ Security Considerations

- Input file size limits
- File type validation
- Cache data isolation
- Error handling and logging
- Automatic temporary file cleanup

## 🔄 Version Compatibility

### Backward Compatibility

- ✅ Maintain full compatibility with original API
- ✅ Existing tool functionality unchanged
- ✅ Optional configuration with reasonable defaults
- ✅ Provide basic version to ensure compatibility

### System Requirements

**Minimum Requirements**:

- Node.js 16+
- 4GB RAM
- 1GB disk space

**Recommended Configuration**:

- Node.js 18+
- 8GB+ RAM
- Multi-core CPU
- SSD storage

## 🐛 Troubleshooting

### Common Issues

1. **Module Installation Failure**

   ```bash
   npm cache clean --force
   npm install
   ```

2. **OCR Recognition Failure**
   - Ensure sufficient memory (8GB+ recommended)
   - Check supported image formats
   - Review error logs

3. **Slow Large Document Processing**
   - Enable parallel processing
   - Adjust the `chunkSize` configuration
   - Use SSD storage

4. **Insufficient Memory**

   ```bash
   node --max-old-space-size=4096 server.js
   ```

## 📝 Changelog

### v2.0.0

- ✅ Add table extraction functionality
- ✅ Add image OCR analysis
- ✅ Implement large document parallel processing
- ✅ Add intelligent caching system
- ✅ Implement full-text index search
- ✅ Complete testing framework

### v1.0.0

- ✅ Basic Word document reading
- ✅ Memory storage management
- ✅ Simple search functionality

## 🤝 Contributing

Issues and Pull Requests are welcome!

### Development Guidelines

1. Fork the project
2. Create feature branch
3. Write test cases
4. Ensure all tests pass
5. Submit Pull Request

## 📄 License

MIT License

---

**Quick Start**: `npm install && npm start`
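
The README lists cache invalidation based on file modification time but does not show how it works. The following is a minimal sketch of that general technique, not the server's actual implementation: the helper names `cachePathFor`, `readCache`, and `writeCache`, and the on-disk entry layout, are assumptions made for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const crypto = require("node:crypto");

// Hypothetical helper: derive a stable cache file name from the document path.
function cachePathFor(cacheDir, filePath) {
  const key = crypto
    .createHash("sha256")
    .update(path.resolve(filePath))
    .digest("hex");
  return path.join(cacheDir, `${key}.json`);
}

// Return the cached result only if the document's mtime still matches
// the mtime recorded when the entry was written; otherwise return null.
function readCache(cacheDir, filePath) {
  const cacheFile = cachePathFor(cacheDir, filePath);
  if (!fs.existsSync(cacheFile)) return null;
  const entry = JSON.parse(fs.readFileSync(cacheFile, "utf8"));
  const { mtimeMs } = fs.statSync(filePath);
  return entry.mtimeMs === mtimeMs ? entry.result : null;
}

// Store a result together with the document's current mtime.
function writeCache(cacheDir, filePath, result) {
  fs.mkdirSync(cacheDir, { recursive: true });
  const { mtimeMs } = fs.statSync(filePath);
  fs.writeFileSync(
    cachePathFor(cacheDir, filePath),
    JSON.stringify({ mtimeMs, result })
  );
}
```

A caller would try `readCache("./.cache", filePath)` first and fall back to parsing the document and calling `writeCache` when it returns `null`. Comparing modification times is cheap but will miss edits that preserve the timestamp; hashing the file contents would be a stricter alternative.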
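
The full-text search feature is described as an inverted index with Chinese-English segmentation, relevance scoring, and document type filtering. As an illustrative sketch only (the `tokenize` function and `InvertedIndex` class below are hypothetical and are not taken from the project), the idea can be shown with term-frequency scoring and a deliberately simple tokenizer that treats each CJK character as its own token:

```javascript
// Split text into lowercase Latin/digit words and single CJK characters.
// Real Chinese word segmentation is more involved; this is a simplification.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+|[\u4e00-\u9fff]/g) || [];
}

class InvertedIndex {
  constructor() {
    this.postings = new Map(); // term -> Map(docId -> term frequency)
    this.docTypes = new Map(); // docId -> documentType
  }

  // Index a document's text under the given id and document type.
  add(docId, text, documentType) {
    this.docTypes.set(docId, documentType);
    for (const term of tokenize(text)) {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const docs = this.postings.get(term);
      docs.set(docId, (docs.get(docId) || 0) + 1);
    }
  }

  // Score each document by the summed frequency of the query terms,
  // optionally restricted to one document type, highest score first.
  search(query, { documentType, limit = 10 } = {}) {
    const scores = new Map();
    for (const term of new Set(tokenize(query))) {
      const docs = this.postings.get(term);
      if (!docs) continue;
      for (const [docId, tf] of docs) {
        if (documentType && this.docTypes.get(docId) !== documentType) continue;
        scores.set(docId, (scores.get(docId) || 0) + tf);
      }
    }
    return [...scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, limit)
      .map(([docId, score]) => ({ docId, score }));
  }
}
```

Lookups touch only the posting lists of the query terms rather than every stored document, which is what makes this structure fast for search. The `documentType` and `limit` options mirror the arguments of the `search_documents` tool, but how the server scores results internally is not stated in this README.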
