# Project Summary
## redact_mcp - PDF Redaction MCP Server
### Overview
A production-ready Model Context Protocol (MCP) server built with FastMCP 2 and PyMuPDF that provides comprehensive PDF redaction capabilities for use with AI assistants like Claude.
### Key Features
- ✅ Load and read PDF files with full text extraction
- ✅ **Batch text redaction** (search and redact multiple strings at once for performance)
- ✅ **Redaction tracking** (prevents duplicate work, tracks what's been redacted)
- ✅ Area-based redaction (redact rectangular regions by coordinates)
- ✅ Customizable redaction appearance (RGB color selection)
- ✅ **List applied redactions** (audit trail and progress tracking)
- ✅ Automatic filename generation for redacted PDFs
- ✅ Memory management (close PDFs to free resources)
- ✅ Comprehensive error handling with MCP ToolError
- ✅ Context logging for transparency
- ✅ Multiple transport options (stdio, HTTP)
### Architecture
#### Technology Stack
- **FastMCP 2.12+**: MCP server framework
- **PyMuPDF 1.24+**: PDF manipulation library
- **Python 3.13+**: Programming language
- **uv**: Package management
#### Project Structure
```
redact_mcp/
├── src/
│   └── redact_mcp/
│       ├── __init__.py              # Package exports
│       └── server.py                # Main MCP server with 7 tools
├── examples/
│   ├── create_test_pdf.py           # Generate test PDFs
│   ├── test_integration.py          # Integration test suite
│   ├── usage_example.py             # HTTP client example
│   ├── test_document.pdf            # Sample PDF
│   └── test_document_redacted.pdf   # Sample output
├── pyproject.toml                   # Package configuration
├── README.md                        # Full documentation
├── QUICKSTART.md                    # Quick start guide
└── test_server.py                   # Simple server test
```
#### Available Tools
1. **load_pdf** - Load a PDF and extract text
2. **redact_text** - Redact multiple text strings at once (batch mode)
3. **redact_area** - Redact rectangular areas by coordinates
4. **save_redacted_pdf** - Apply redactions and save
5. **list_loaded_pdfs** - List currently loaded PDFs
6. **list_applied_redactions** - Show what has been redacted (new!)
7. **close_pdf** - Close a PDF and free memory
### Implementation Details
#### Design Decisions
1. **In-Memory Storage**: PDFs are kept in memory during the session for fast access. Trade-off: memory usage vs. speed.
2. **Redaction Tracking**: The server maintains a list of all texts that have been marked for redaction for each PDF. This prevents duplicate work and allows progress monitoring.
3. **Batch Processing**: The `redact_text` tool accepts a list of texts instead of a single text, allowing multiple redactions in one call for better performance.
4. **Lazy Redaction**: Redaction annotations are added but not applied until `save_redacted_pdf` is called. This allows multiple redactions to accumulate before committing.
5. **Automatic Naming**: By default, redacted PDFs are saved with a "_redacted" suffix to prevent accidental overwrites of the original.
6. **Path Resolution**: All paths are resolved to absolute paths to avoid ambiguity.
7. **Error Handling**: Uses FastMCP's ToolError for proper MCP error propagation, with descriptive messages.
8. **Context Logging**: All operations log to MCP context for transparency to the user.
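Decisions 4 to 6 can be illustrated with a minimal sketch. It relies on PyMuPDF's `search_for`, `add_redact_annot`, `apply_redactions`, and `save` calls; the helper names (`default_output_path`, `mark_texts`, `apply_and_save`) are hypothetical and are not taken from `server.py`, so the actual implementation may differ.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Resolve to an absolute path and append "_redacted" to the file stem."""
    source = Path(pdf_path).expanduser().resolve()
    return source.with_name(f"{source.stem}_redacted{source.suffix}")


def mark_texts(doc, texts, fill=(0.0, 0.0, 0.0)) -> int:
    """Add a redaction annotation for each exact match; nothing is removed yet."""
    marked = 0
    for page in doc:
        for text in texts:
            # search_for returns the rectangles where the string occurs on the page.
            for rect in page.search_for(text):
                page.add_redact_annot(rect, fill=fill)  # fill is RGB as 0-1 floats
                marked += 1
    return marked


def apply_and_save(doc, pdf_path: str) -> Path:
    """Apply all pending annotations, then write the redacted copy."""
    output = default_output_path(pdf_path)
    for page in doc:
        page.apply_redactions()  # this is the step that removes the content
    doc.save(str(output))
    return output
```

With PyMuPDF installed, `doc` would come from `fitz.open(path)`; the helpers only need an object that iterates over pages exposing those methods. Keeping marking and applying separate is what lets several `redact_text` calls accumulate before anything is changed.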
#### Security Considerations
- PDFs are only accessible from the local filesystem
- No network access or remote PDF loading
- Redactions are permanently applied when saved
- No temporary files created during operation
#### Testing
- ✅ Unit-level checks exercised through the integration test suite
- ✅ End-to-end workflow testing
- ✅ Error handling verification
- ✅ Sample PDFs with sensitive data patterns
### Usage Patterns
#### Basic Workflow
```
1. load_pdf(path) → View content
2. redact_text(path, [sensitive_string1, sensitive_string2, ...]) → Mark multiple texts for redaction
3. list_applied_redactions(path) → Check what's been redacted (optional)
4. redact_text(path, [more_strings]) → Add more redactions (duplicates automatically skipped)
5. save_redacted_pdf(path) → Apply and save
6. close_pdf(path) → Clean up (also clears redaction tracking)
```
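The duplicate skipping in step 4 could be implemented with a per-PDF set of already marked texts. The sketch below is one possible shape, not the server's actual code; the `RedactionTracker` class and its method names are assumptions made for illustration.

```python
from collections import defaultdict


class RedactionTracker:
    """Remember which texts were already marked, keyed by PDF path."""

    def __init__(self) -> None:
        self._marked = defaultdict(set)

    def filter_new(self, pdf_path: str, texts: list[str]) -> list[str]:
        """Return the texts not yet marked for this PDF, preserving input order."""
        seen = self._marked[pdf_path]
        fresh = []
        for text in texts:
            if text not in seen:  # set membership is a hash lookup
                seen.add(text)
                fresh.append(text)
        return fresh

    def list_marked(self, pdf_path: str) -> list[str]:
        """Return everything marked so far for this PDF, sorted for stable output."""
        return sorted(self._marked.get(pdf_path, set()))

    def clear(self, pdf_path: str) -> None:
        """Forget the tracking state, as closing the PDF would."""
        self._marked.pop(pdf_path, None)


tracker = RedactionTracker()
first = tracker.filter_new("/docs/a.pdf", ["Jane Doe", "555-0100"])
second = tracker.filter_new("/docs/a.pdf", ["Jane Doe", "jane@example.com"])
# `first` holds both strings; `second` holds only the email, since the name was seen.
```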
#### Efficient Batch Workflow
```
1. load_pdf(path) → View content
2. Identify all sensitive texts → Make a list
3. redact_text(path, [text1, text2, text3, ..., textN]) → One call for all redactions
4. save_redacted_pdf(path) → Apply and save
5. close_pdf(path) → Clean up
```
#### Integration Points
**Claude Desktop**:
```json
{
  "mcpServers": {
    "pdf-redaction": {
      "command": "uv",
      "args": ["--directory", "/path/to/redact_mcp", "run", "fastmcp", "run", "redact_mcp.server:mcp"]
    }
  }
}
```
**HTTP Client**:
```python
from fastmcp import Client

# Assumes the server is already running with the HTTP transport on port 8000.
client = Client("http://localhost:8000/mcp")
```
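With a client like this, the efficient batch workflow could be driven as sketched below. The tool names come from this summary, but the argument names (`pdf_path`, `texts`) are assumptions; check the tool schemas in README.md or via `client.list_tools()` before relying on them. The helper names are hypothetical.

```python
import asyncio


def build_batch_plan(pdf_path: str, texts: list[str]) -> list[tuple[str, dict]]:
    """Return the tool calls of the batch workflow, in order."""
    return [
        ("load_pdf", {"pdf_path": pdf_path}),
        ("redact_text", {"pdf_path": pdf_path, "texts": texts}),
        ("save_redacted_pdf", {"pdf_path": pdf_path}),
        ("close_pdf", {"pdf_path": pdf_path}),
    ]


async def run_plan(client, plan) -> list:
    """Call each tool in turn; `client` must offer an awaitable call_tool(name, arguments)."""
    results = []
    for tool_name, arguments in plan:
        results.append(await client.call_tool(tool_name, arguments))
    return results


async def redact_over_http(url: str, pdf_path: str, texts: list[str]) -> list:
    """Connect with the FastMCP client and run the whole plan in one session."""
    from fastmcp import Client  # imported here so the helpers above work without fastmcp

    async with Client(url) as client:
        return await run_plan(client, build_batch_plan(pdf_path, texts))


# Example invocation (requires a running server):
# asyncio.run(redact_over_http("http://localhost:8000/mcp", "/path/to/file.pdf", ["Jane Doe"]))
```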
### Performance Characteristics
- **Load time**: O(n) where n = number of pages
- **Text redaction (batch)**: O(n*m*t) where n = pages, m = text instances per page, t = number of texts to redact
- **Text redaction (single)**: O(n*m) where n = pages, m = text instances per page
- **Redaction tracking**: O(1) per text (hash-based lookup)
- **Area redaction**: O(1) per area
- **Memory usage**: Proportional to PDF size (kept in memory) + redaction list size
- **Save time**: O(n) where n = pages with redactions
**Performance Note**: Batch redaction (passing multiple texts in one call) avoids the per-call overhead of separate requests, so it is generally faster than issuing one call per text.
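The difference between batch and single calls can be made concrete with a small model. Plain strings stand in for PDF pages here, so this only mirrors the loop structure behind the O(n*m*t) figure; it is not a measurement of the server, and the `scan` helper is hypothetical.

```python
def scan(pages: list[str], texts: list[str]) -> tuple[int, int]:
    """Return (searches, hits): one search per (page, text) pair, one hit per match."""
    searches = hits = 0
    for page in pages:                # n pages
        for text in texts:            # t texts
            searches += 1
            hits += page.count(text)  # m matches of this text on this page
    return searches, hits


pages = ["Jane Doe, SSN 123-45-6789", "Contact: Jane Doe"]
texts = ["Jane Doe", "123-45-6789"]

# Batch mode: one tool call covers every text.
batch_calls = 1
batch_searches, batch_hits = scan(pages, texts)

# Single mode: one tool call per text.
single_calls = len(texts)
single_results = [scan(pages, [text]) for text in texts]
single_searches = sum(searches for searches, _ in single_results)
single_hits = sum(hits for _, hits in single_results)
```

In this model both modes perform the same number of page searches and find the same hits; what batch mode saves is the number of calls, and with it the per-request overhead.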
### Limitations (Current Version)
1. **No image redaction**: Only text and coordinate-based area redaction are implemented; there is no dedicated support for detecting or redacting images
2. **Non-persistent storage**: PDFs must be reloaded after server restart
3. **Single session**: No multi-user support
4. **No OCR**: Can't redact text in images/scanned documents
5. **No regex patterns**: Only exact string matching
### Future Enhancements (Potential)
- [ ] Image redaction support
- [ ] Regular expression pattern matching
- [ ] OCR integration for scanned documents
- [ ] Persistent storage layer for redaction history
- [ ] Batch processing of multiple PDFs in one call
- [ ] Redaction templates/profiles
- [ ] Audit logging with timestamps
- [ ] Preview before applying redactions
- [ ] Undo/redo redaction operations
- [ ] Export redaction report (what was redacted where)
### Development Status
**Version**: 0.1.0
**Status**: Production-ready for text redaction
**Last Updated**: October 2025
**Python Version**: 3.13+
### Dependencies
**Runtime**:
- fastmcp >= 2.12.0
- PyMuPDF >= 1.24.0
**Development**:
- uv (package manager)
### Testing
Run the integration test:
```bash
uv run python examples/test_integration.py
```
Expected output:
- ✓ All 8 test steps pass
- ✓ Redacted PDF created with redactions applied
- ✓ File size verification
### Documentation
- **README.md**: Complete usage guide and API reference
- **QUICKSTART.md**: Quick start for new users
- **examples/**: Working code examples
- **Inline docstrings**: Comprehensive function documentation
### Standards Compliance
- ✅ MCP Protocol 2025-06-18 specification
- ✅ FastMCP 2 best practices
- ✅ Python type hints throughout
- ✅ PEP 8 style guide
- ✅ Comprehensive error handling
### Deployment Options
1. **Local (stdio)**: Direct integration with MCP clients
2. **HTTP**: Remote access via network
3. **FastMCP Cloud**: Hosted deployment (recommended)
4. **Self-hosted**: Docker/container deployment
### Contributing
Contributions welcome! Focus areas:
- Image redaction support
- Regex pattern matching
- OCR integration
- Performance optimizations
### Support
- Issues: GitHub Issues
- Documentation: README.md, QUICKSTART.md
- Examples: examples/ directory
### Acknowledgments
Built with:
- [FastMCP](https://gofastmcp.com/) - MCP server framework
- [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF manipulation
- [Model Context Protocol](https://modelcontextprotocol.io/) - Protocol specification