# Word Document MCP Server - Project Memory
> **Last Updated**: 2025-02-10
> **Project Type**: Node.js MCP Server
> **Main Technology**: Model Context Protocol SDK, Tesseract.js, mammoth, sharp
> **Purpose**: MCP server for reading and analyzing Word documents (.docx/.doc) with OCR, table extraction, and intelligent caching
---
## Quick Reference
- [Development Commands](#development-commands)
- [Project Architecture](#project-architecture)
- [Code Conventions](#code-conventions)
- [Testing Guidelines](#testing-guidelines)
- [MCP Server Implementation](#mcp-server-implementation)
- [Configuration](#configuration)
- [Common Issues](#common-issues)
---
## Development Commands
### Essential Commands
```bash
# Install dependencies
npm install
# Start full-featured server (default)
npm start
# Equivalent: node server.js
# Start basic compatibility server
npm run start:basic
# Equivalent: node server-basic.js
# Run all tests
npm test
# Run tests in watch mode for development
npm run test:watch
# Generate test coverage report
npm run test:coverage
```
### Manual Testing Commands
```bash
# Run specific test file
node --test tests/unit/services/DocumentIndexer.test.js
# Run integration tests only
node --test tests/integration/
# Run basic functionality test
node tests/basic.test.js
# Run custom test suite
node tests/run-tests.js
```
### Memory and Performance Commands
```bash
# Increase Node.js memory for large documents
node --max-old-space-size=4096 server.js
# Clean npm cache if installation fails
npm cache clean --force
npm install
```
---
## Project Architecture
### Core Components
**server.js** (Main entry point)
- Implements MCP server using `@modelcontextprotocol/sdk`
- Exposes 7 tools: `read_word_document`, `search_documents`, `get_cache_stats`, `clear_cache`, `list_stored_documents`, `get_stored_document`, `clear_memory`
- Coordinates between document processing, OCR, caching, and indexing
**Key Classes**:
- `DocumentIndexer` - Full-text search with inverted index, supports Chinese-English word segmentation
- `LargeDocumentProcessor` - Handles files >10MB with parallel worker threads
- `DocumentAnalyzer` - Extracts tables from HTML, performs OCR on images with Tesseract.js
- `CacheManager` - File system-based persistent caching with MD5 keys based on file metadata
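
To make the `DocumentIndexer` idea concrete, here is a minimal sketch of an inverted index over mixed Chinese-English text. It is not the project's implementation: `tokenize`, `TinyInvertedIndex`, and the per-character handling of CJK text are assumptions chosen for illustration, and the scoring is a plain term-frequency sum.

```javascript
// Hypothetical sketch, not the project's DocumentIndexer.
// Tokenize mixed text: lowercase ASCII words plus individual CJK characters.
function tokenize(text) {
  const tokens = text.toLowerCase().match(/[a-z0-9]+|[\u4e00-\u9fff]/g);
  return tokens ?? [];
}

class TinyInvertedIndex {
  constructor() {
    this.postings = new Map(); // token -> Map(docId -> term frequency)
  }

  add(docId, text) {
    for (const token of tokenize(text)) {
      if (!this.postings.has(token)) this.postings.set(token, new Map());
      const docs = this.postings.get(token);
      docs.set(docId, (docs.get(docId) ?? 0) + 1);
    }
  }

  // Score each document by the summed frequency of the query tokens it contains.
  search(query, limit = 10) {
    const scores = new Map();
    for (const token of tokenize(query)) {
      for (const [docId, count] of this.postings.get(token) ?? []) {
        scores.set(docId, (scores.get(docId) ?? 0) + count);
      }
    }
    return [...scores.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, limit)
      .map(([docId, score]) => ({ docId, score }));
  }
}

const index = new TinyInvertedIndex();
index.add("a", "季度报告 quarterly report");
index.add("b", "年度报告 annual report summary report");
console.log(index.search("报告 report"));
```

Lookups only touch the postings of the query tokens, which is why an inverted index stays fast as the number of stored documents grows.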
**server-basic.js**
- Simplified version for backward compatibility
- Fewer features, easier to maintain for legacy use cases
### Data Flow
```
Word Document (.docx)
↓
File Validation & Cache Check
↓
Mammoth (text extraction) + JSZip (image extraction)
↓
DocumentAnalyzer (tables + OCR with Tesseract.js)
↓
DocumentIndexer (full-text indexing)
↓
CacheManager (persist to .cache/)
↓
Memory Storage (NodeCache in-memory)
```
### Directory Structure
```
word-doc-mcp/
├── server.js # Main MCP server (all features)
├── server-basic.js # Basic server (compatibility)
├── package.json # Dependencies: mcp-sdk, mammoth, tesseract.js, sharp
├── config.json # Server configuration (processing, cache, OCR settings)
├── tests/
│ ├── setup.js # Test environment configuration
│ ├── basic.test.js # Basic functionality tests
│ ├── run-tests.js # Custom test runner
│ ├── unit/ # Unit tests for individual classes
│ ├── integration/ # Integration tests for tools
│ └── fixtures/ # Test documents and mock data
├── .cache/ # Auto-generated cache directory
└── output/ # Auto-generated output for extracted images
```
---
## Code Conventions
### Language and Module System
- **Language**: JavaScript (ES modules)
- **Module type**: `"type": "module"` in package.json
- **Use ES imports**: `import { Server } from "@modelcontextprotocol/sdk/server/index.js";`
- **Node.js version**: 16+ minimum, 18+ recommended (note that `sharp` ^0.33 requires Node.js 18.17 or later, so 18.17+ is the practical minimum)
### Naming Conventions
- **Files**: kebab-case (`server.js`, `document-analyzer.js`)
- **Classes**: PascalCase (`class DocumentIndexer`)
- **Variables/Functions**: camelCase (`documentCache`, `extractTables`)
- **Constants**: UPPER_SNAKE_CASE for true constants (`MAX_FILE_SIZE`)
- **Private methods**: Prefix with underscore if needed (though JS uses # for true privacy)
### Error Handling
- Always wrap async operations in try-catch blocks
- Use descriptive error messages with context
- Log errors to console.error for debugging
- Return structured error responses in MCP tools:
```javascript
// User-facing text is Chinese by convention; "错误" means "Error"
return {
  content: [{ type: "text", text: `错误: ${error.message}` }],
  isError: true
};
```
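
One way to apply this response shape consistently is to wrap every tool handler once, so individual tools can simply throw. The sketch below assumes a hypothetical helper named `withToolErrorHandling`; it is not taken from `server.js`.

```javascript
// Hypothetical helper: wraps a tool handler so thrown errors become MCP error responses.
function withToolErrorHandling(toolName, handler) {
  return async (args) => {
    try {
      return await handler(args);
    } catch (error) {
      console.error(`[${toolName}] failed:`, error); // stderr, for debugging
      return {
        content: [{ type: "text", text: `错误: ${error.message}` }],
        isError: true,
      };
    }
  };
}

// Example handler body; the validation message is illustrative.
const readTool = withToolErrorHandling("read_word_document", async ({ filePath }) => {
  if (!filePath) throw new Error("filePath is required");
  return { content: [{ type: "text", text: `Read ${filePath}` }] };
});

readTool({}).then((result) => console.log(result.isError, result.content[0].text));
```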
### Async/Await Patterns
- Prefer async/await over Promise chains
- Use Promise.all() for parallel operations (chunk processing, table extraction)
- Always handle promise rejections
- Clean up resources in finally blocks where appropriate (OCR worker termination)
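
The patterns above can be combined in one function. This is a sketch under stated assumptions: `createFakeWorker` stands in for a real OCR worker (only the `recognize`/`terminate` shape matters), and returning `null` for a failed image is an illustrative choice for graceful degradation.

```javascript
// Hypothetical stand-in for an OCR worker; not the Tesseract.js API.
function createFakeWorker() {
  return {
    terminated: false,
    async recognize(image) {
      if (image === "bad") throw new Error("unreadable image");
      return `text from ${image}`;
    },
    async terminate() {
      this.terminated = true;
    },
  };
}

// Run OCR on all images in parallel; a failed image yields null instead of aborting the batch.
async function ocrAll(worker, images) {
  try {
    return await Promise.all(
      images.map((image) =>
        worker.recognize(image).catch((error) => {
          console.error(`OCR failed for ${image}: ${error.message}`);
          return null;
        })
      )
    );
  } finally {
    await worker.terminate(); // always release the worker, even on failure
  }
}

const worker = createFakeWorker();
ocrAll(worker, ["a.png", "bad", "b.png"]).then((results) => {
  console.log(results, worker.terminated);
});
```

Catching per item keeps one bad image from rejecting the whole `Promise.all`, while the `finally` block guarantees cleanup either way.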
### Code Comments
- Write user-facing messages and console output in Chinese
- Use English comments for code documentation and technical explanations
- Comment complex logic (Chinese word segmentation, OCR preprocessing)
- Document tool schemas with clear descriptions
---
## Testing Guidelines
### Test Framework
- **Framework**: Node.js built-in test runner (`node --test`)
- **Coverage**: Use `--experimental-test-coverage` flag
- **Organization**:
- Unit tests in `tests/unit/` - test individual classes and functions
- Integration tests in `tests/integration/` - test tool interactions
- Fixtures in `tests/fixtures/` - test documents and mock data
### Test Categories
**Unit Tests**:
- Test `DocumentIndexer` word extraction and search algorithms
- Test `CacheManager` get/set/clear operations
- Test `LargeDocumentProcessor` chunking logic
- Test `DocumentAnalyzer` table parsing and OCR
**Integration Tests**:
- Test complete `read_word_document` workflow
- Test cache invalidation on file modification
- Test search across multiple documents
- Test memory storage and retrieval
**Edge Cases to Cover**:
- Non-existent file paths
- Unsupported file formats (reject .pdf, .txt)
- Empty documents
- Documents with no tables/images
- Very large documents (>10MB)
- Corrupted or invalid .docx files
- OCR failures (invalid images, unsupported formats)
### Running Tests
```bash
# Quick development loop
npm run test:watch
# Full test suite with coverage
npm run test:coverage
# Target specific test category
node --test tests/unit/
node --test tests/integration/
```
### Test Data
- Store test documents in `tests/fixtures/documents/`
- Use various document types: simple text, tables, images, mixed content
- Include Chinese and English content for OCR testing
- Mock complex objects in `tests/fixtures/mock-data.js`
---
## MCP Server Implementation
### Tool Design Patterns
**Tool Schema**:
Each tool must define:
- `name`: snake_case tool identifier
- `description`: Clear explanation of functionality
- `inputSchema`: JSON Schema with type, properties, required fields
- Use `enum` for constrained choices (document types, cache types)
- Provide sensible `default` values
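
A schema following these rules might look like the sketch below. The field names `query`, `documentType`, and `limit` come from the tools reference; the description text, the `enum` values, the default values, and the `applyDefaults` helper are assumptions for illustration only.

```javascript
// Hypothetical schema for search_documents; descriptions, enum values, and defaults are assumptions.
const searchDocumentsTool = {
  name: "search_documents",
  description: "Search indexed Word documents by keyword",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search text (Chinese or English)" },
      documentType: {
        type: "string",
        enum: ["report", "contract", "general"], // placeholder values
        default: "general",
      },
      limit: { type: "number", default: 10, minimum: 1 },
    },
    required: ["query"],
  },
};

// Apply schema defaults to incoming arguments and report missing required fields.
function applyDefaults(tool, args) {
  const { properties, required = [] } = tool.inputSchema;
  const missing = required.filter((key) => args[key] === undefined);
  if (missing.length > 0) {
    throw new Error(`Missing required field(s): ${missing.join(", ")}`);
  }
  const result = { ...args };
  for (const [key, spec] of Object.entries(properties)) {
    if (result[key] === undefined && "default" in spec) result[key] = spec.default;
  }
  return result;
}

console.log(applyDefaults(searchDocumentsTool, { query: "报告" }));
```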
**Error Handling in Tools**:
- Validate inputs early (file existence, format checking)
- Return structured error responses with `isError: true`
- Include actionable error messages
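
Early validation could be sketched as follows. `validateDocumentPath` and its English messages are hypothetical (the real server reports errors in Chinese); the extension check is deliberately case-sensitive, in line with the lowercase-extension note under Common Issues.

```javascript
import { existsSync } from "node:fs";
import { extname } from "node:path";

const SUPPORTED_EXTENSIONS = [".docx", ".doc"];

// Return an MCP error response if the input is unusable, or null if it looks valid.
function validateDocumentPath(filePath) {
  const fail = (text) => ({ content: [{ type: "text", text }], isError: true });

  if (typeof filePath !== "string" || filePath.length === 0) {
    return fail("filePath must be a non-empty string");
  }
  const extension = extname(filePath);
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    return fail(`Unsupported file format "${extension}"; expected .docx or .doc`);
  }
  if (!existsSync(filePath)) {
    return fail(`File not found: ${filePath}`);
  }
  return null;
}

console.log(validateDocumentPath("notes.txt"));
console.log(validateDocumentPath("missing-file.docx"));
```

Running the cheap checks (type, extension) before touching the file system keeps error messages specific and avoids unnecessary I/O.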
**Response Format**:
```javascript
return {
  content: [{
    type: "text",
    text: "Human-readable result with relevant data"
  }]
};
```
### MCP Tools Reference
**read_word_document** (Primary tool)
- Required: `filePath`
- Optional: `memoryKey`, `documentType`, `extractTables`, `extractImages`, `useCache`, `outputDir`
- Extracts text, tables, images with OCR
- Caches results based on file mtime and size
- Updates full-text index
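
The mtime-and-size caching described above can be illustrated with a short sketch. The exact key format used by `CacheManager` is not documented here, so `cacheKeyFor` and the `path:mtime:size` layout are assumptions; only the use of MD5 over file metadata comes from the notes above.

```javascript
import { createHash } from "node:crypto";

// Hypothetical key derivation: hash path + mtime + size so any modification changes the key.
function cacheKeyFor(filePath, stats) {
  return createHash("md5")
    .update(`${filePath}:${stats.mtimeMs}:${stats.size}`)
    .digest("hex");
}

// Literal stats keep the sketch deterministic; real code would take them from fs.promises.stat().
const before = cacheKeyFor("/docs/report.docx", { mtimeMs: 1700000000000, size: 52431 });
const after = cacheKeyFor("/docs/report.docx", { mtimeMs: 1700000005000, size: 52440 });
console.log(before !== after); // a modified file no longer matches its old cache entry
```

Because the key depends only on metadata, computing it never requires reading the document itself.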
**search_documents**
- Required: `query`
- Optional: `documentType`, `limit`
- Uses inverted index for sub-100ms search
- Supports Chinese-English mixed queries
- Returns relevance-scored results
**Cache Management Tools**:
- `get_cache_stats` - Display cache and index statistics
- `clear_cache` - Clear document cache, index, or both
- `list_stored_documents` - List all in-memory documents
- `get_stored_document` - Retrieve full document by memory key
- `clear_memory` - Clear specific or all memory
### Server Lifecycle
**Initialization**:
- Import MCP SDK components
- Create Server instance with name and version
- Define capabilities (tools, resources)
- Register request handlers (ListTools, CallTool)
**Graceful Shutdown**:
- Listen for SIGINT and SIGTERM
- Clean up OCR worker (terminate Tesseract.js)
- Exit cleanly with appropriate status code
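
The shutdown steps could be wired up roughly as below. This is a sketch, not the code in `server.js`: `createShutdownHandler` is a hypothetical helper, `ocrWorker` is assumed to expose an async `terminate()`, and the `exit` parameter exists only so the logic can be exercised without ending the process.

```javascript
// Hypothetical shutdown helper; `ocrWorker` is assumed to expose an async terminate().
function createShutdownHandler({ ocrWorker, exit = (code) => process.exit(code) }) {
  let shuttingDown = false;
  return async function shutdown(signal) {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.error(`Received ${signal}, shutting down...`);
    let exitCode = 0;
    try {
      if (ocrWorker) await ocrWorker.terminate();
    } catch (error) {
      console.error("OCR worker cleanup failed:", error);
      exitCode = 1;
    }
    exit(exitCode);
  };
}

// Wiring, shown with a stub worker so the sketch runs on its own.
const shutdown = createShutdownHandler({
  ocrWorker: { terminate: async () => {} },
});
process.on("SIGINT", () => shutdown("SIGINT"));
process.on("SIGTERM", () => shutdown("SIGTERM"));
```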
**Export for Testing**:
- Export classes for unit testing: `export { DocumentIndexer, CacheManager, ... };`
---
## Configuration
### config.json Structure
**processing** section:
```jsonc
{
  "maxFileSize": 10485760,    // 10 MB threshold for large document handling
  "maxPages": 100,            // Page limit
  "chunkSize": 1048576,       // 1 MB chunks for parallel processing
  "parallelProcessing": true  // Enable worker threads
}
```
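
To show how these two sizes interact, the sketch below decides whether a file counts as large and splits a content length into chunk ranges. The function names and the idea of chunking extracted content (rather than raw `.docx` bytes, which are ZIP-compressed) are assumptions; `LargeDocumentProcessor` may work differently.

```javascript
// Values mirror the `processing` section; the planning logic itself is an assumption.
const processing = { maxFileSize: 10485760, chunkSize: 1048576 };

// A file above the threshold takes the large-document path.
function isLargeDocument(fileSizeBytes) {
  return fileSizeBytes > processing.maxFileSize;
}

// Split `contentLength` units of extracted content into [start, end) ranges of at most chunkSize.
function planChunks(contentLength, chunkSize) {
  const chunks = [];
  for (let start = 0; start < contentLength; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize, contentLength) });
  }
  return chunks;
}

const size = 25 * 1024 * 1024; // 25 MB
const plan = planChunks(size, processing.chunkSize);
console.log(isLargeDocument(size), plan.length, plan[plan.length - 1]);
```

With 1 MB chunks, a 25 MB input yields 25 ranges; raising `chunkSize` produces fewer, larger units of work per worker.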
**cache** section:
```jsonc
{
  "enabled": true,
  "defaultTTL": 3600,          // 1 hour cache lifetime
  "checkPeriod": 600,          // Check for expired cache every 10 min
  "maxCacheSize": 100,         // Maximum cache entries
  "cacheDirectory": "./.cache"
}
```
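
The meaning of `defaultTTL`, `checkPeriod`, and `maxCacheSize` can be illustrated with a small hand-rolled cache. This is only a sketch of the semantics: the project uses `node-cache` and its own `CacheManager`, and the oldest-first eviction policy and injectable clock here are assumptions.

```javascript
// Hand-rolled illustration only; the project itself relies on node-cache and CacheManager.
class TtlCache {
  constructor({ defaultTTL = 3600, maxCacheSize = 100, now = () => Date.now() } = {}) {
    this.ttlMs = defaultTTL * 1000;
    this.maxCacheSize = maxCacheSize;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value) {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxCacheSize) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey); // evict the oldest entry
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  // What a periodic sweep (every `checkPeriod` seconds) would do.
  purgeExpired() {
    for (const [key, entry] of this.entries) {
      if (this.now() >= entry.expiresAt) this.entries.delete(key);
    }
  }
}

let fakeNow = 0;
const cache = new TtlCache({ defaultTTL: 3600, maxCacheSize: 2, now: () => fakeNow });
cache.set("a", 1);
fakeNow = 3600 * 1000; // exactly one hour later
console.log(cache.get("a")); // the entry has expired by now
```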
**ocr** section:
```jsonc
{
  "enabled": true,
  "languages": ["chi_sim", "eng"],  // Chinese Simplified + English
  "imageProcessing": {
    "resizeWidth": 2000,            // Resize for better OCR
    "sharpen": true,                // Enhance text edges
    "normalize": true               // Improve contrast
  }
}
```
### Configuration Best Practices
- Keep the cache enabled in production: repeated processing of an unchanged document is reported to be 90%+ faster with caching
- Adjust `maxFileSize` based on available memory
- Use SSD storage for faster large document processing
- Tune `resizeWidth` for OCR accuracy vs. speed tradeoff
---
## Common Issues
### Module Installation Failure
**Problem**: npm install fails with dependency errors
**Solution**:
```bash
npm cache clean --force
npm install
```
If the problem persists, delete `node_modules` and `package-lock.json`, then reinstall:
```bash
rm -rf node_modules package-lock.json
npm install
```
### OCR Recognition Failure
**Problem**: Tesseract.js fails to recognize text or crashes
**Solutions**:
- Ensure 8GB+ RAM available
- Check image format is supported (JPG, PNG, GIF, BMP, WebP)
- Review console.error logs for specific failure
- Reduce `resizeWidth` in config if memory constrained
- Disable OCR if not needed: set `extractImages: false`
### Slow Large Document Processing
**Problem**: Documents >10MB process very slowly
**Solutions**:
- Verify `parallelProcessing: true` in config.json
- Increase `chunkSize` for fewer, larger chunks
- Use SSD storage instead of HDD
- Increase Node.js memory: `node --max-old-space-size=4096 server.js`
### Out of Memory Errors
**Problem**: Process crashes with heap out of memory
**Solutions**:
```bash
# Increase Node.js heap size to 4 GB
node --max-old-space-size=4096 server.js
# Or set a larger heap (8 GB here) in the package.json start script, for example:
node --max-old-space-size=8192 server.js
```
### Cache Not Invalidating
**Problem**: Modified documents return cached results
**Solution**: Cache key includes file mtime and size, so modifications automatically invalidate. If issues persist:
```bash
# Manually clear cache
# Use the clear_cache tool with type: "all"
# Or delete .cache directory:
rm -rf .cache
```
### Worker Thread Errors
**Problem**: Parallel processing fails with worker errors
**Solutions**:
- Disable parallel processing: `parallelProcessing: false` in config
- Ensure Node.js v16+ (worker_threads stability)
- Reduce `chunkSize` to process smaller chunks
### File Format Rejection
**Problem**: Error "不支持的文件格式" (Unsupported file format)
**Solutions**:
- Ensure file is .docx or .doc (not .pdf, .txt, .rtf)
- Check file extension is lowercase (.docx not .DOCX)
- Verify file is not corrupted by opening in Word/LibreOffice
---
## Performance Optimization
### Known Performance Metrics
- **Large document processing**: 60%+ faster with parallel worker threads
- **Repeated document processing**: 90%+ faster with caching enabled
- **OCR accuracy**: 95%+ with image preprocessing (sharpen, normalize)
- **Search response time**: <100ms with inverted index
- **Memory usage**: 40% reduction with streaming processing
### Optimization Tips
1. **Enable caching** for production use
2. **Use SSD storage** for large documents
3. **Tune chunk size** based on document characteristics
4. **Limit OCR to needed images only** (`extractImages: false` for text-only docs)
5. **Adjust image resize width** - smaller is faster but less accurate
---
## Dependencies Overview
### Core Dependencies
- `@modelcontextprotocol/sdk` (^1.0.0) - MCP server implementation
- `mammoth` (^1.7.2) - Word document text extraction
- `jszip` (^3.10.1) - Extract images from .docx (which is a ZIP file)
- `tesseract.js` (^5.0.1) - OCR text recognition
- `sharp` (^0.33.2) - Image preprocessing
- `node-cache` (^5.1.2) - In-memory caching
- `fs-extra` (^11.2.0) - Enhanced file system operations
### Dev Dependencies
- `tempy` (^3.1.0) - Temporary file/directory generation for tests
### Dependency Management
- Use caret ranges (^) for minor/patch updates (for 0.x packages such as `sharp`, a caret range allows patch updates only)
- Pin major versions to avoid breaking changes
- Review security advisories regularly
- Test thoroughly after dependency updates
---
## Development Workflow
### Making Changes
1. **Modify code** in server.js or related files
2. **Run tests**: `npm test`
3. **Test manually**: `npm start` and connect with MCP client
4. **Check coverage**: `npm run test:coverage`
5. **Update documentation** if adding features or changing behavior
### Adding New MCP Tools
1. Add tool definition to `ListToolsRequestSchema` handler
2. Add case to `CallToolRequestSchema` handler
3. Implement tool logic with error handling
4. Add integration tests in `tests/integration/tools/`
5. Update README.md with tool documentation
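
Steps 2 and 3 could look roughly like the sketch below. It assumes the handler receives a request whose `params` carry `name` and `arguments`, as in MCP tool calls; `count_words` is a hypothetical new tool, and the `get_cache_stats` body is a placeholder rather than the real implementation.

```javascript
const textResult = (text) => ({ content: [{ type: "text", text }] });

// Hypothetical new tool: counts whitespace-separated words in `text`.
function countWords({ text }) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  return textResult(String(text.trim().split(/\s+/).length));
}

// Body of a CallTool-style handler: one case per tool, failures become error responses.
async function handleToolCall(request) {
  const { name, arguments: args = {} } = request.params;
  try {
    switch (name) {
      case "get_cache_stats":
        return textResult("entries: 0"); // placeholder body
      case "count_words": // step 2: the new case
        return countWords(args); // step 3: tool logic with error handling
      default:
        throw new Error(`Unknown tool: ${name}`);
    }
  } catch (error) {
    console.error(`Tool ${name} failed:`, error);
    return { content: [{ type: "text", text: `错误: ${error.message}` }], isError: true };
  }
}

handleToolCall({ params: { name: "count_words", arguments: { text: "one two three" } } })
  .then((result) => console.log(result.content[0].text));
```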
### Debugging Tips
- Use `console.error()` for debug output: when the server runs over stdio, stdout carries MCP protocol messages, so logs must go to stderr
- Inspect .cache directory to verify cache behavior
- Test with simple documents before complex ones
- Enable Node.js debug flags: `node --inspect server.js`
---
## Git Workflow
### Branches
- `main` - Stable production code
- `wdm-dev` - Development branch (current)
### Commit Guidelines
- Write clear, descriptive commit messages
- Reference issues in commit messages when applicable
- Keep commits atomic (one logical change per commit)
### Repository Status
- Recent commits show project initialization and documentation updates
- Untracked files present in `.serena/` directory
---
## Security Considerations
### File Handling
- Validate file paths to prevent directory traversal
- Limit file sizes to prevent memory exhaustion
- Check file extensions to reject non-Word formats
- Clean up temporary files after processing
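
A directory-traversal check can be sketched with Node's `path` module. `resolveInside` and the idea of a single allowed base directory are assumptions for illustration; the server may apply a different policy.

```javascript
import path from "node:path";

// Resolve `requestedPath` against `baseDir` and reject anything that escapes it.
function resolveInside(baseDir, requestedPath) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, requestedPath);
  const relative = path.relative(base, target);
  const escapes =
    relative === ".." ||
    relative.startsWith(`..${path.sep}`) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Path escapes allowed directory: ${requestedPath}`);
  }
  return target;
}

console.log(resolveInside("/srv/docs", "reports/q1.docx"));
try {
  resolveInside("/srv/docs", "../../etc/passwd");
} catch (error) {
  console.error(error.message);
}
```

Comparing resolved paths, rather than searching the input string for `..`, also handles inputs such as `reports/../../secret.docx`.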
### Cache Security
- Cache files stored in .cache directory (not web-accessible)
- Cache keys based on file metadata (not content)
- Automatic cache expiration prevents stale data
### OCR Security
- Process images in memory (no persistent storage)
- Limit image dimensions to prevent DoS
- Validate image formats before processing
---
## Future Enhancement Opportunities
### Potential Improvements
- Add PDF document support
- Implement more sophisticated Chinese word segmentation (jieba-js)
- Add document summarization tool
- Improve support for the legacy .doc format (the extension check accepts .doc, but mammoth is designed for .docx)
- Add document comparison/diff tool
- Implement caching across server restarts (persistent index)
- Add web-based cache viewer
### Scalability Considerations
- Consider moving to .claude/rules/ structure if adding more tools
- Split large classes (DocumentAnalyzer) into focused modules
- Add TypeScript for better type safety
- Implement proper logging framework (winston, pino)
---
## Notes for Claude
### When Working on This Project
**DO**:
- Use Chinese for user-facing messages and console output
- Maintain backward compatibility with existing tools
- Test with both feature-rich .docx files (tables, images) and simple text-only documents
- Clean up temporary files in output/ after testing
- Run full test suite before committing changes
**DON'T**:
- Break existing API contracts without version bump
- Remove or modify tool schemas without updating tests
- Commit test documents with sensitive data
- Ignore console.error messages (they indicate real issues)
- Use blocking operations in async functions
### Project-Specific Patterns
- Mixed Chinese-English codebase - preserve this style
- Cache-first approach for performance
- Worker threads for CPU-intensive operations
- Graceful degradation (OCR failures don't crash document processing)
### Key File Locations
- Main server logic: `server.js` lines 1-1000
- Tool schemas: `server.js` lines 470-608
- Tool implementations: `server.js` lines 611-973
- Configuration: `config.json`
- Tests: `tests/` directory
---
**Remember**: This project emphasizes performance (caching, parallel processing) and comprehensive document analysis (text, tables, OCR). Maintain these priorities when making changes.