PDF Redaction MCP Server


# Project Summary

## redact_mcp - PDF Redaction MCP Server

### Overview

A production-ready Model Context Protocol (MCP) server built with FastMCP 2 and PyMuPDF that provides PDF redaction capabilities for use with AI assistants like Claude.

### Key Features

- ✅ Load and read PDF files with full text extraction
- ✅ **Batch text redaction** (search for and redact multiple strings in one call)
- ✅ **Redaction tracking** (prevents duplicate work, tracks what has been redacted)
- ✅ Area-based redaction (redact rectangular regions by coordinates)
- ✅ Customizable redaction appearance (RGB color selection)
- ✅ **List applied redactions** (audit trail and progress tracking)
- ✅ Automatic filename generation for redacted PDFs
- ✅ Memory management (close PDFs to free resources)
- ✅ Comprehensive error handling with MCP ToolError
- ✅ Context logging for transparency
- ✅ Multiple transport options (stdio, HTTP)

### Architecture

#### Technology Stack

- **FastMCP 2.12+**: MCP server framework
- **PyMuPDF 1.24+**: PDF manipulation library
- **Python 3.13+**: Programming language
- **uv**: Package management

#### Project Structure

```
redact_mcp/
├── src/
│   └── redact_mcp/
│       ├── __init__.py              # Package exports
│       └── server.py                # Main MCP server with 7 tools
├── examples/
│   ├── create_test_pdf.py           # Generate test PDFs
│   ├── test_integration.py          # Integration test suite
│   ├── usage_example.py             # HTTP client example
│   ├── test_document.pdf            # Sample PDF
│   └── test_document_redacted.pdf   # Sample output
├── pyproject.toml                   # Package configuration
├── README.md                        # Full documentation
├── QUICKSTART.md                    # Quick start guide
└── test_server.py                   # Simple server test
```

#### Available Tools

1. **load_pdf** - Load a PDF and extract text
2. **redact_text** - Redact multiple text strings at once (batch mode)
3. **redact_area** - Redact rectangular areas by coordinates
4. **save_redacted_pdf** - Apply redactions and save
5. **list_loaded_pdfs** - List currently loaded PDFs
6. **list_applied_redactions** - Show what has been redacted (new!)
7. **close_pdf** - Close a PDF and free memory

### Implementation Details

#### Design Decisions

1. **In-Memory Storage**: PDFs are kept in memory during the session for fast access. Trade-off: memory usage vs. speed.
2. **Redaction Tracking**: For each PDF, the server maintains a list of all texts that have been marked for redaction. This prevents duplicate work and allows progress monitoring.
3. **Batch Processing**: The `redact_text` tool accepts a list of texts instead of a single text, so multiple redactions can be made in one call for better performance.
4. **Lazy Redaction**: Redaction annotations are added but not applied until `save_redacted_pdf` is called. This allows multiple redactions to be queued before committing.
5. **Automatic Naming**: By default, redacted PDFs are saved with a "_redacted" suffix to prevent accidental overwrites.
6. **Path Resolution**: All paths are resolved to absolute paths to avoid ambiguity.
7. **Error Handling**: Uses FastMCP's ToolError for proper MCP error propagation, with descriptive messages.
8. **Context Logging**: All operations log to the MCP context for transparency to the user.

#### Security Considerations

- PDFs are only accessible from the local filesystem
- No network access or remote PDF loading
- Redactions are permanently applied when saved
- No temporary files created during operation

#### Testing

- ✅ Unit-level testing via integration test
- ✅ End-to-end workflow testing
- ✅ Error handling verification
- ✅ Sample PDFs with sensitive data patterns

### Usage Patterns

#### Basic Workflow

```
1. load_pdf(path) → View content
2. redact_text(path, [sensitive_string1, sensitive_string2, ...]) → Mark multiple texts for redaction
3. list_applied_redactions(path) → Check what's been redacted (optional)
4. redact_text(path, [more_strings]) → Add more redactions (duplicates automatically skipped)
5. save_redacted_pdf(path) → Apply and save
6. close_pdf(path) → Clean up (also clears redaction tracking)
```

#### Efficient Batch Workflow

```
1. load_pdf(path) → View content
2. Identify all sensitive texts → Make a list
3. redact_text(path, [text1, text2, text3, ..., textN]) → One call for all redactions
4. save_redacted_pdf(path) → Apply and save
5. close_pdf(path) → Clean up
```

#### Integration Points

**Claude Desktop**:

```json
{
  "mcpServers": {
    "pdf-redaction": {
      "command": "uv",
      "args": ["--directory", "/path/to/redact_mcp", "run", "fastmcp", "run", "redact_mcp.server:mcp"]
    }
  }
}
```

**HTTP Client**:

```python
from fastmcp import Client

# Assumes the server is running locally with the HTTP transport on port 8000.
client = Client("http://localhost:8000/mcp")
```

### Performance Characteristics

- **Load time**: O(n) where n = number of pages
- **Text redaction (batch)**: O(n*m*t) where n = pages, m = text instances per page, t = number of texts to redact
- **Text redaction (single)**: O(n*m) where n = pages, m = text instances
- **Redaction tracking**: O(1) on average per text (hash-based lookup)
- **Area redaction**: O(1) per area
- **Memory usage**: Proportional to PDF size (kept in memory) + redaction list size
- **Save time**: O(n) where n = pages with redactions

**Performance Note**: Batch redaction (passing multiple texts in one call) is faster than multiple individual calls because it avoids the per-call overhead of repeated tool invocations.

### Limitations (Current Version)

1. **No image-aware redaction**: Only text redaction and coordinate-based area redaction are implemented
2. **Non-persistent storage**: PDFs must be reloaded after server restart
3. **Single session**: No multi-user support
4. **No OCR**: Can't redact text in images/scanned documents
5. **No regex patterns**: Only exact string matching

### Future Enhancements (Potential)

- [ ] Image redaction support
- [ ] Regular expression pattern matching
- [ ] OCR integration for scanned documents
- [ ] Persistent storage layer for redaction history
- [ ] Batch processing of multiple PDFs in one call
- [ ] Redaction templates/profiles
- [ ] Audit logging with timestamps
- [ ] Preview before applying redactions
- [ ] Undo/redo redaction operations
- [ ] Export redaction report (what was redacted where)

### Development Status

- **Version**: 0.1.0
- **Status**: Production-ready for text redaction
- **Last Updated**: October 2025
- **Python Version**: 3.13+

### Dependencies

**Runtime**:

- fastmcp >= 2.12.0
- PyMuPDF >= 1.24.0

**Development**:

- uv (package manager)

### Testing

Run the integration test:

```bash
uv run python examples/test_integration.py
```

Expected output:

- ✓ All 8 test steps pass
- ✓ Redacted PDF created with redactions applied
- ✓ File size verification

### Documentation

- **README.md**: Complete usage guide and API reference
- **QUICKSTART.md**: Quick start for new users
- **examples/**: Working code examples
- **Inline docstrings**: Comprehensive function documentation

### Standards Compliance

- ✅ MCP Protocol 2025-06-18 specification
- ✅ FastMCP 2 best practices
- ✅ Python type hints throughout
- ✅ PEP 8 style guide
- ✅ Comprehensive error handling

### Deployment Options

1. **Local (stdio)**: Direct integration with MCP clients
2. **HTTP**: Remote access via network
3. **FastMCP Cloud**: Hosted deployment (recommended)
4. **Self-hosted**: Docker/container deployment

### Contributing

Contributions welcome! Focus areas:

- Image redaction support
- Regex pattern matching
- OCR integration
- Performance optimizations

### Support

- Issues: GitHub Issues
- Documentation: README.md, QUICKSTART.md
- Examples: examples/ directory

### Acknowledgments

Built with:

- [FastMCP](https://gofastmcp.com/) - MCP server framework
- [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF manipulation
- [Model Context Protocol](https://modelcontextprotocol.io/) - Protocol specification
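
### Illustrative Sketch: Redaction Tracking and Automatic Naming

The redaction tracking, path resolution, and automatic naming decisions described under Design Decisions can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code in `server.py`: the names `RedactionTracker`, `default_output_path`, `mark`, `listing`, and `close` are hypothetical, and no PyMuPDF calls are involved. It only shows one way a per-PDF set could skip duplicates with hash-based lookups and how a "_redacted" filename might be derived.

```python
from dataclasses import dataclass, field
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Return the absolute path with "_redacted" inserted before the extension."""
    source = Path(pdf_path).resolve()  # resolve to an absolute path first
    return source.with_name(f"{source.stem}_redacted{source.suffix}")


@dataclass
class RedactionTracker:
    """Hypothetical per-session record of texts already marked for redaction."""

    # Maps an absolute PDF path to the set of texts marked so far.
    marked: dict[str, set[str]] = field(default_factory=dict)

    def mark(self, pdf_path: str, texts: list[str]) -> list[str]:
        """Record a batch and return only the texts not seen before, in order."""
        seen = self.marked.setdefault(str(Path(pdf_path).resolve()), set())
        new_texts = []
        for text in texts:
            if text not in seen:  # set membership: O(1) on average
                seen.add(text)
                new_texts.append(text)
        return new_texts

    def listing(self, pdf_path: str) -> list[str]:
        """Return the tracked texts for one PDF in sorted order."""
        return sorted(self.marked.get(str(Path(pdf_path).resolve()), set()))

    def close(self, pdf_path: str) -> None:
        """Forget a PDF, clearing its tracking data as close_pdf is described to do."""
        self.marked.pop(str(Path(pdf_path).resolve()), None)


if __name__ == "__main__":
    tracker = RedactionTracker()
    tracker.mark("report.pdf", ["Jane Doe", "555-0100"])
    # The second batch repeats "Jane Doe", so only the new text is returned.
    print(tracker.mark("report.pdf", ["Jane Doe", "jane@example.com"]))
    print(default_output_path("report.pdf").name)
```

In this sketch, a caller would search the PDF only for the texts that `mark` returns, which is how duplicate work could be avoided when `redact_text` is called more than once on the same file.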
