Word Document Reader MCP Server

CLAUDE.md•16.8 KiB

# Word Document MCP Server - Project Memory

> **Last Updated**: 2025-02-10
> **Project Type**: Node.js MCP Server
> **Main Technology**: Model Context Protocol SDK, Tesseract.js, mammoth, sharp
> **Purpose**: MCP server for reading and analyzing Word documents (.docx/.doc) with OCR, table extraction, and intelligent caching

---

## Quick Reference

- [Development Commands](#development-commands)
- [Project Architecture](#project-architecture)
- [Code Conventions](#code-conventions)
- [Testing Guidelines](#testing-guidelines)
- [MCP Server Implementation](#mcp-server-implementation)
- [Configuration](#configuration)
- [Common Issues](#common-issues)

---

## Development Commands

### Essential Commands

```bash
# Install dependencies
npm install

# Start full-featured server (default)
npm start
# Equivalent: node server.js

# Start basic compatibility server
npm run start:basic
# Equivalent: node server-basic.js

# Run all tests
npm test

# Run tests in watch mode for development
npm run test:watch

# Generate test coverage report
npm run test:coverage
```

### Manual Testing Commands

```bash
# Run specific test file
node --test tests/unit/services/DocumentIndexer.test.js

# Run integration tests only
node --test tests/integration/

# Run basic functionality test
node tests/basic.test.js

# Run custom test suite
node tests/run-tests.js
```

### Memory and Performance Commands

```bash
# Increase Node.js memory for large documents
node --max-old-space-size=4096 server.js

# Clean npm cache if installation fails
npm cache clean --force
npm install
```

---

## Project Architecture

### Core Components

**server.js** (Main entry point)
- Implements MCP server using `@modelcontextprotocol/sdk`
- Exposes 7 tools: `read_word_document`, `search_documents`, `get_cache_stats`, `clear_cache`, `list_stored_documents`, `get_stored_document`, `clear_memory`
- Coordinates between document processing, OCR, caching, and indexing

**Key Classes**:
- `DocumentIndexer` - Full-text search with inverted index, supports Chinese-English word segmentation
- `LargeDocumentProcessor` - Handles files >10MB with parallel worker threads
- `DocumentAnalyzer` - Extracts tables from HTML, performs OCR on images with Tesseract.js
- `CacheManager` - File system-based persistent caching with MD5 keys based on file metadata

**server-basic.js**
- Simplified version for backward compatibility
- Fewer features, easier to maintain for legacy use cases

### Data Flow

```
Word Document (.docx)
    ↓
File Validation & Cache Check
    ↓
Mammoth (text extraction) + JSZip (image extraction)
    ↓
DocumentAnalyzer (tables + OCR with Tesseract.js)
    ↓
DocumentIndexer (full-text indexing)
    ↓
CacheManager (persist to .cache/)
    ↓
Memory Storage (NodeCache in-memory)
```

### Directory Structure

```
word-doc-mcp/
├── server.js           # Main MCP server (all features)
├── server-basic.js     # Basic server (compatibility)
├── package.json        # Dependencies: mcp-sdk, mammoth, tesseract.js, sharp
├── config.json         # Server configuration (processing, cache, OCR settings)
├── tests/
│   ├── setup.js        # Test environment configuration
│   ├── basic.test.js   # Basic functionality tests
│   ├── run-tests.js    # Custom test runner
│   ├── unit/           # Unit tests for individual classes
│   ├── integration/    # Integration tests for tools
│   └── fixtures/       # Test documents and mock data
├── .cache/             # Auto-generated cache directory
└── output/             # Auto-generated output for extracted images
```

---

## Code Conventions

### Language and Module System

- **Language**: JavaScript (ES modules)
- **Module type**: `"type": "module"` in package.json
- **Use ES imports**: `import { Server } from "@modelcontextprotocol/sdk/server/index.js";`
- **Node.js version**: 16+ minimum, 18+ recommended

### Naming Conventions

- **Files**: kebab-case (`server.js`, `document-analyzer.js`)
- **Classes**: PascalCase (`class DocumentIndexer`)
- **Variables/Functions**: camelCase (`documentCache`, `extractTables`)
- **Constants**: UPPER_SNAKE_CASE for true constants (`MAX_FILE_SIZE`)
- **Private methods**: Prefix with underscore if needed (though JS uses # for true privacy)

### Error Handling

- Always wrap async operations in try-catch blocks
- Use descriptive error messages with context
- Log errors to console.error for debugging
- Return structured error responses in MCP tools:

```javascript
return {
  content: [{
    type: "text",
    text: `错误: ${error.message}`
  }],
  isError: true
};
```

### Async/Await Patterns

- Prefer async/await over Promise chains
- Use Promise.all() for parallel operations (chunk processing, table extraction)
- Always handle promise rejections
- Clean up resources in finally blocks where appropriate (OCR worker termination)

### Code Comments

- Use Chinese comments for user-facing messages and console output
- Use English comments for code documentation and technical explanations
- Comment complex logic (Chinese word segmentation, OCR preprocessing)
- Document tool schemas with clear descriptions

---

## Testing Guidelines

### Test Framework

- **Framework**: Node.js built-in test runner (`node --test`)
- **Coverage**: Use `--experimental-test-coverage` flag
- **Organization**:
  - Unit tests in `tests/unit/` - test individual classes and functions
  - Integration tests in `tests/integration/` - test tool interactions
  - Fixtures in `tests/fixtures/` - test documents and mock data

### Test Categories

**Unit Tests**:
- Test `DocumentIndexer` word extraction and search algorithms
- Test `CacheManager` get/set/clear operations
- Test `LargeDocumentProcessor` chunking logic
- Test `DocumentAnalyzer` table parsing and OCR

**Integration Tests**:
- Test complete `read_word_document` workflow
- Test cache invalidation on file modification
- Test search across multiple documents
- Test memory storage and retrieval

**Edge Cases to Cover**:
- Non-existent file paths
- Unsupported file formats (reject .pdf, .txt)
- Empty documents
- Documents with no tables/images
- Very large documents (>10MB)
- Corrupted or invalid .docx files
- OCR failures (invalid images, unsupported formats)

### Running Tests

```bash
# Quick development loop
npm run test:watch

# Full test suite with coverage
npm run test:coverage

# Target specific test category
node --test tests/unit/
node --test tests/integration/
```

### Test Data

- Store test documents in `tests/fixtures/documents/`
- Use various document types: simple text, tables, images, mixed content
- Include Chinese and English content for OCR testing
- Mock complex objects in `tests/fixtures/mock-data.js`

---

## MCP Server Implementation

### Tool Design Patterns

**Tool Schema**: Each tool must define:
- `name`: snake_case tool identifier
- `description`: Clear explanation of functionality
- `inputSchema`: JSON Schema with type, properties, required fields
  - Use `enum` for constrained choices (document types, cache types)
  - Provide sensible `default` values

**Error Handling in Tools**:
- Validate inputs early (file existence, format checking)
- Return structured error responses with `isError: true`
- Include actionable error messages

**Response Format**:

```javascript
return {
  content: [{
    type: "text",
    text: "Human-readable result with relevant data"
  }]
};
```

### MCP Tools Reference

**read_word_document** (Primary tool)
- Required: `filePath`
- Optional: `memoryKey`, `documentType`, `extractTables`, `extractImages`, `useCache`, `outputDir`
- Extracts text, tables, images with OCR
- Caches results based on file mtime and size
- Updates full-text index

**search_documents**
- Required: `query`
- Optional: `documentType`, `limit`
- Uses inverted index for sub-100ms search
- Supports Chinese-English mixed queries
- Returns relevance-scored results

**Cache Management Tools**:
- `get_cache_stats` - Display cache and index statistics
- `clear_cache` - Clear document cache, index, or both
- `list_stored_documents` - List all in-memory documents
- `get_stored_document` - Retrieve full document by memory key
- `clear_memory` - Clear specific or all memory

### Server Lifecycle

**Initialization**:
- Import MCP SDK components
- Create Server instance with name and version
- Define capabilities (tools, resources)
- Register request handlers (ListTools, CallTool)

**Graceful Shutdown**:
- Listen for SIGINT and SIGTERM
- Clean up OCR worker (terminate Tesseract.js)
- Exit cleanly with appropriate status code

**Export for Testing**:
- Export classes for unit testing: `export { DocumentIndexer, CacheManager, ... };`

---

## Configuration

### config.json Structure

**processing** section:

```json
{
  "maxFileSize": 10485760,      // 10MB threshold for large document handling
  "maxPages": 100,              // Page limit
  "chunkSize": 1048576,         // 1MB chunks for parallel processing
  "parallelProcessing": true    // Enable worker threads
}
```

**cache** section:

```json
{
  "enabled": true,
  "defaultTTL": 3600,           // 1 hour cache lifetime
  "checkPeriod": 600,           // Check for expired cache every 10 min
  "maxCacheSize": 100,          // Maximum cache entries
  "cacheDirectory": "./.cache"
}
```

**ocr** section:

```json
{
  "enabled": true,
  "languages": ["chi_sim", "eng"],  // Chinese Simplified + English
  "imageProcessing": {
    "resizeWidth": 2000,            // Resize for better OCR
    "sharpen": true,                // Enhance text edges
    "normalize": true               // Improve contrast
  }
}
```

### Configuration Best Practices

- Keep cache enabled in production for 90%+ performance improvement
- Adjust `maxFileSize` based on available memory
- Use SSD storage for faster large document processing
- Tune `resizeWidth` for OCR accuracy vs. speed tradeoff

---

## Common Issues

### Module Installation Failure

**Problem**: npm install fails with dependency errors

**Solution**:

```bash
npm cache clean --force
npm install
```

If persisting, delete `node_modules` and `package-lock.json`:

```bash
rm -rf node_modules package-lock.json
npm install
```

### OCR Recognition Failure

**Problem**: Tesseract.js fails to recognize text or crashes

**Solutions**:
- Ensure 8GB+ RAM available
- Check image format is supported (JPG, PNG, GIF, BMP, WebP)
- Review console.error logs for specific failure
- Reduce `resizeWidth` in config if memory constrained
- Disable OCR if not needed: set `extractImages: false`

### Slow Large Document Processing

**Problem**: Documents >10MB process very slowly

**Solutions**:
- Verify `parallelProcessing: true` in config.json
- Increase `chunkSize` for fewer, larger chunks
- Use SSD storage instead of HDD
- Increase Node.js memory: `node --max-old-space-size=4096 server.js`

### Out of Memory Errors

**Problem**: Process crashes with heap out of memory

**Solutions**:

```bash
# Increase Node.js heap size
node --max-old-space-size=4096 server.js

# Or in npm scripts
node --max-old-space-size=8192 server.js
```

### Cache Not Invalidating

**Problem**: Modified documents return cached results

**Solution**: Cache key includes file mtime and size, so modifications automatically invalidate. If issues persist:

```bash
# Manually clear cache
# Use the clear_cache tool with type: "all"
# Or delete .cache directory:
rm -rf .cache
```

### Worker Thread Errors

**Problem**: Parallel processing fails with worker errors

**Solutions**:
- Disable parallel processing: `parallelProcessing: false` in config
- Ensure Node.js v16+ (worker_threads stability)
- Reduce `chunkSize` to process smaller chunks

### File Format Rejection

**Problem**: Error "不支持的文件格式" (Unsupported file format)

**Solutions**:
- Ensure file is .docx or .doc (not .pdf, .txt, .rtf)
- Check file extension is lowercase (.docx not .DOCX)
- Verify file is not corrupted by opening in Word/LibreOffice

---

## Performance Optimization

### Known Performance Metrics

- **Large document processing**: 60%+ faster with parallel worker threads
- **Repeated document processing**: 90%+ faster with caching enabled
- **OCR accuracy**: 95%+ with image preprocessing (sharpen, normalize)
- **Search response time**: <100ms with inverted index
- **Memory usage**: 40% reduction with streaming processing

### Optimization Tips

1. **Enable caching** for production use
2. **Use SSD storage** for large documents
3. **Tune chunk size** based on document characteristics
4. **Limit OCR to needed images only** (`extractImages: false` for text-only docs)
5. **Adjust image resize width** - smaller is faster but less accurate

---

## Dependencies Overview

### Core Dependencies

- `@modelcontextprotocol/sdk` (^1.0.0) - MCP server implementation
- `mammoth` (^1.7.2) - Word document text extraction
- `jszip` (^3.10.1) - Extract images from .docx (which is a ZIP file)
- `tesseract.js` (^5.0.1) - OCR text recognition
- `sharp` (^0.33.2) - Image preprocessing
- `node-cache` (^5.1.2) - In-memory caching
- `fs-extra` (^11.2.0) - Enhanced file system operations

### Dev Dependencies

- `tempy` (^3.1.0) - Temporary file/directory generation for tests

### Dependency Management

- Use caret ranges (^) for minor/patch updates
- Pin major versions to avoid breaking changes
- Review security advisories regularly
- Test thoroughly after dependency updates

---

## Development Workflow

### Making Changes

1. **Modify code** in server.js or related files
2. **Run tests**: `npm test`
3. **Test manually**: `npm start` and connect with MCP client
4. **Check coverage**: `npm run test:coverage`
5. **Update documentation** if adding features or changing behavior

### Adding New MCP Tools

1. Add tool definition to `ListToolsRequestSchema` handler
2. Add case to `CallToolRequestSchema` handler
3. Implement tool logic with error handling
4. Add integration tests in `tests/integration/tools/`
5. Update README.md with tool documentation

### Debugging Tips

- Use `console.error()` for debug output (MCP uses stderr for logging)
- Inspect .cache directory to verify cache behavior
- Test with simple documents before complex ones
- Enable Node.js debug flags: `node --inspect server.js`

---

## Git Workflow

### Branches

- `main` - Stable production code
- `wdm-dev` - Development branch (current)

### Commit Guidelines

- Write clear, descriptive commit messages
- Reference issues in commit messages when applicable
- Keep commits atomic (one logical change per commit)

### Repository Status

- Recent commits show project initialization and documentation updates
- Untracked files present in `.serena/` directory

---

## Security Considerations

### File Handling

- Validate file paths to prevent directory traversal
- Limit file sizes to prevent memory exhaustion
- Check file extensions to reject non-Word formats
- Clean up temporary files after processing

### Cache Security

- Cache files stored in .cache directory (not web-accessible)
- Cache keys based on file metadata (not content)
- Automatic cache expiration prevents stale data

### OCR Security

- Process images in memory (no persistent storage)
- Limit image dimensions to prevent DoS
- Validate image formats before processing

---

## Future Enhancement Opportunities

### Potential Improvements

- Add PDF document support
- Implement more sophisticated Chinese word segmentation (jieba-js)
- Add document summarization tool
- Support for .doc format (legacy Word)
- Add document comparison/diff tool
- Implement caching across server restarts (persistent index)
- Add web-based cache viewer

### Scalability Considerations

- Consider moving to .claude/rules/ structure if adding more tools
- Split large classes (DocumentAnalyzer) into focused modules
- Add TypeScript for better type safety
- Implement proper logging framework (winston, pino)

---

## Notes for Claude

### When Working on This Project

**DO**:
- Use Chinese for user-facing messages and console output
- Maintain backward compatibility with existing tools
- Test with both .docx and basic text documents
- Clean up temporary files in output/ after testing
- Run full test suite before committing changes

**DON'T**:
- Break existing API contracts without version bump
- Remove or modify tool schemas without updating tests
- Commit test documents with sensitive data
- Ignore console.error messages (they indicate real issues)
- Use blocking operations in async functions

### Project-Specific Patterns

- Mixed Chinese-English codebase - preserve this style
- Cache-first approach for performance
- Worker threads for CPU-intensive operations
- Graceful degradation (OCR failures don't crash document processing)

### Key File Locations

- Main server logic: `server.js` lines 1-1000
- Tool schemas: `server.js` lines 470-608
- Tool implementations: `server.js` lines 611-973
- Configuration: `config.json`
- Tests: `tests/` directory

---

**Remember**: This project emphasizes performance (caching, parallel processing) and comprehensive document analysis (text, tables, OCR). Maintain these priorities when making changes.
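
As a rough illustration of the inverted-index idea behind `DocumentIndexer` and `search_documents`, the sketch below indexes mixed Chinese-English text and returns relevance-scored matches. It is not the project's implementation: `tokenize`, `TinyIndex`, the per-character handling of Chinese text, and the count-based scoring are all assumptions made for this example, and the real segmentation and ranking in `server.js` may differ.

```javascript
// Hypothetical, simplified sketch; NOT the project's DocumentIndexer.

// Split mixed text into tokens: lowercase Latin/digit words, plus
// individual CJK characters (a naive stand-in for real word segmentation).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+|[\u4e00-\u9fff]/g) ?? [];
}

class TinyIndex {
  constructor() {
    // term -> Map(docId -> occurrence count)
    this.postings = new Map();
  }

  // Record every token of `text` under the given document id.
  add(docId, text) {
    for (const term of tokenize(text)) {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const docs = this.postings.get(term);
      docs.set(docId, (docs.get(docId) ?? 0) + 1);
    }
  }

  // Score each document by summing the counts of the query terms it contains,
  // then return the highest-scoring documents first.
  search(query, limit = 10) {
    const scores = new Map();
    for (const term of new Set(tokenize(query))) {
      const docs = this.postings.get(term);
      if (!docs) continue;
      for (const [docId, count] of docs) {
        scores.set(docId, (scores.get(docId) ?? 0) + count);
      }
    }
    return [...scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, limit)
      .map(([docId, score]) => ({ docId, score }));
  }
}

const index = new TinyIndex();
index.add("report", "季度 report: revenue table and OCR notes");
index.add("memo", "内部 memo about the 季度 budget");

// "report" matches three query tokens, "memo" matches two.
console.log(index.search("季度 report"));
```

Because lookups touch only the posting lists of the query terms, search cost grows with the number of matching documents and not with the total amount of indexed text, which is the property the sub-100ms search claim relies on.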
