scrub_pii
Detect and redact personally identifiable information (PII) from Word documents using Presidio and spaCy NER. Use dry-run mode to review detections before applying redaction as black rectangles.
Instructions
[EXPERIMENTAL] Detect and redact PII from the open document using Presidio + spaCy NER.
WARNING: This tool WILL miss PII. It is experimental and NOT suitable for production use or as the sole control for privileged, regulated, or legally sensitive documents. Always run with dry_run=True first and manually review every detected entity before committing a redacted file.
Known limitations (statistical NER gaps):
Names in ALL-CAPS (ledger headers, table cells) are frequently missed.
Single-token names with no surrounding context are unreliable.
Non-English names (Arabic, CJK, African) are detected with low recall by this English-only model.
Names embedded in legal boilerplate ("Borrower: Jane Doe") are often skipped.
The NER model (en_core_web_lg, ~560 MB) downloads automatically on first use.
Detects: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, IP_ADDRESS, IBAN_CODE, US_BANK_NUMBER, US_PASSPORT, and more via Presidio.
Redacted text is replaced with a solid black DrawingML rectangle. This is true XML redaction: the original text is deleted from the OOXML entirely, not merely hidden by formatting.
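Because the redaction is described as deleting text from the OOXML, one way to spot-check an output file is to search its raw XML parts for strings that should be gone. The following is a minimal sketch using only the Python standard library; the helper name `find_leftover_text` is hypothetical and not part of this tool. Word can split a phrase across several runs and XML-escapes some characters, so an empty result is a sanity check, not proof of complete redaction.

```python
import io
import zipfile


def find_leftover_text(docx_bytes: bytes, needles: list[str]) -> dict[str, list[str]]:
    """Return {needle: [part names]} for each needle still found in an XML part.

    Hypothetical helper: a .docx is a ZIP archive of XML parts, so this
    does a plain substring search over every .xml/.rels member.
    """
    leftovers: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary parts
            xml_text = archive.read(name).decode("utf-8", errors="replace")
            for needle in needles:
                if needle in xml_text:
                    leftovers.setdefault(needle, []).append(name)
    return leftovers
```

To use it, read the redacted file's bytes (for example with `pathlib.Path(path).read_bytes()`) and pass the strings flagged during the dry run. Any non-empty result names the parts that still contain the original text, including metadata parts such as `docProps/core.xml`.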
Args:
output_path: Destination path. Required when dry_run=False.
entities: Presidio entity types to redact. None = all detected types.
confidence_threshold: Presidio score floor (default 0.35).
dry_run: If True, detect only: return the entity list and write no file.
also_sanitize_metadata: Apply level-3 metadata sanitization (default True).
redact_authors_as: Replacement author string for the metadata pass.
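The two-phase workflow (dry run first, then redaction) can be sketched as a pair of argument payloads. This is an illustration only: `build_scrub_pii_args` is a hypothetical helper, the defaults are copied from the argument descriptions and schema on this page, and how the payload reaches the tool depends on the client in use, which is not shown here.

```python
from typing import Any, Optional


def build_scrub_pii_args(
    dry_run: bool,
    output_path: Optional[str] = None,
    entities: Optional[list[str]] = None,  # None = all detected types
    confidence_threshold: float = 0.35,    # documented default
    also_sanitize_metadata: bool = True,   # documented default
    redact_authors_as: str = "REDACTED",   # assumed default, per the schema table
) -> dict[str, Any]:
    """Hypothetical helper: assemble and validate a scrub_pii argument payload."""
    if not dry_run and not output_path:
        # Mirrors the documented rule: output_path is required when dry_run=False.
        raise ValueError("output_path is required when dry_run=False")
    args: dict[str, Any] = {
        "entities": entities,
        "confidence_threshold": confidence_threshold,
        "dry_run": dry_run,
        "also_sanitize_metadata": also_sanitize_metadata,
        "redact_authors_as": redact_authors_as,
    }
    if output_path is not None:
        args["output_path"] = output_path
    return args


# Phase 1: detect only, so every entity can be reviewed by hand.
review_args = build_scrub_pii_args(dry_run=True)

# Phase 2: after review, write the redacted copy (the file name is illustrative).
apply_args = build_scrub_pii_args(
    dry_run=False,
    output_path="contract_redacted.docx",
    entities=["PERSON", "EMAIL_ADDRESS"],
)
```

Validating the `output_path` rule on the caller's side keeps a forgotten path from surfacing only after detection has already run.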
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| output_path | No | | |
| entities | No | | |
| confidence_threshold | No | | 0.35 |
| dry_run | No | | |
| also_sanitize_metadata | No | | True |
| redact_authors_as | No | | REDACTED |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |