mcp-dlp
MCP DLP Prototype
A data-loss-prevention (DLP) layer for AI agents, built as a Model Context Protocol (MCP) server. It sits between a document connector and the agent: when a document is fetched, its contents are scanned for sensitive data, sensitive values are redacted (or the whole document is blocked, for credentials), and every read is recorded in an audit log — so raw sensitive data never reaches the model or the user.
This is a local prototype using a mock Google Drive-style file connector. Real Google Drive integration is out of scope by design (see Limitations).
More docs: Architecture · Write-up
The problem
AI agents are increasingly wired into business systems (Google Drive, Slack, Notion, Jira, etc.) through MCP connectors. An agent can fetch a document and pass its contents straight into a model or show them to a user — including any Social Security numbers, credit card numbers, API keys, or other secrets the document happens to contain. This prototype demonstrates one way to close that gap.
How it works
Client (MCP Inspector)
│ calls read_document("customer-contract.txt")
▼
MCP server (server.py)
│ 1. connector reads the raw file from mock_drive/
│ 2. scanner.scan() -> finds sensitive data + positions + confidence
│ 3. decide_action() -> allowed | redacted | blocked
│ 4a. redacted: scanner.redact() rebuilds text with labels
│ 4b. blocked: returns a [BLOCKED] message, no content
│ 5. log_audit_entry() appends one JSON line to logs/audit_log.jsonl
▼
Client receives ONLY the redacted text or block message, never the raw document
The key design point: the DLP layer lives between document retrieval and the tool's return value. The raw text is read into a local variable and never leaves the function; only the redacted result or a block message is returned.
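The flow above can be sketched in a few lines. This is a minimal illustration, not the project's server.py: the single EMAIL regex stands in for scanner.scan(), the drive_dir and log_path parameters are hypothetical, and the blocked path is left out for brevity. The audit fields mirror the example record shown in the Audit log section.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path

# Stand-in for scanner.scan(): one illustrative email rule (an assumption,
# not the project's actual pattern).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def log_audit_entry(log_path: Path, entry: dict) -> None:
    """Append one JSON object per line (JSON Lines)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_document(name: str, drive_dir: Path, log_path: Path) -> str:
    # The raw text lives only in this local variable.
    raw = (drive_dir / name).read_text(encoding="utf-8")

    matches = list(EMAIL.finditer(raw))
    action = "redacted" if matches else "allowed"
    safe = EMAIL.sub("[REDACTED_EMAIL]", raw)

    log_audit_entry(log_path, {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "connector": "mock_google_drive",
        "tool": "read_document",
        "document_name": name,
        "findings_count": len(matches),
        "finding_types": ["EMAIL"] if matches else [],
        "action": action,
        "original_length": len(raw),
        "redacted_length": len(safe),
    })
    # Only the cleaned text crosses the function boundary.
    return safe
```

Because the function returns safe rather than raw, a caller has no way to reach the unredacted text, which is the property the design relies on.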
Project layout
mcp-dlp/
├── server.py # MCP server: read_document tool, policy, audit logging
├── scanner.py # detection rules (RULES), redaction, labels
├── test_scanner.py # 24 unit tests (pytest)
├── mock_drive/ # sample documents (the mock connector's "files")
│ ├── customer-contract.txt
│ ├── engineering-notes.txt
│ └── support-ticket.txt
├── logs/
│ └── audit_log.jsonl # append-only audit trail (auto-created)
└── pyproject.toml
Setup
Requires Python 3.10+, uv, and Node.js (the MCP
Inspector runs via npx).
# from the project root
uv add "mcp[cli]>=1.27,<2" # pinned below v2 for stability
uv add --dev pytest
The mcp SDK is pinned to <2 deliberately: a breaking v2 is scheduled and the prior
spec revision (2025-11-25) is the stable target for this prototype.
Demo (under 5 minutes)
Start the server, which launches the MCP Inspector and prints a URL with a session token pre-filled:
uv run mcp dev server.py
Open that URL, go to the Tools tab, and select read_document. The demo walks through
three documents that exercise all three policy outcomes:
1. A user asks to read a document with sensitive data. Call read_document with
customer-contract.txt. The source file contains a name, email, phone, SSN, and credit
card.
2. The DLP layer detects and redacts. The response keeps the customer name but replaces the email, phone, SSN, and card with labels:
Customer: John Smith
Email: [REDACTED_EMAIL]
Phone: [REDACTED_PHONE]
SSN: [REDACTED_SSN]
Card on file: [REDACTED_CREDIT_CARD]
The raw values never leave the server.
3. Credentials are blocked entirely. Call read_document with
engineering-notes.txt. Because it contains live credentials, the document is withheld:
[BLOCKED] 'engineering-notes.txt' contains high-risk credentials
(API_KEY, AWS_ACCESS_KEY, BEARER_TOKEN) and was withheld by DLP policy.
4. The audit log shows what was detected and what action was taken. Every read, redacted or blocked, is recorded:
cat logs/audit_log.jsonl
Summary of the three sample documents:
Document | Expected result | Why |
customer-contract.txt | redacted | contains PII (email, phone, SSN, card) |
support-ticket.txt | redacted | contains PII + a low-confidence account number |
engineering-notes.txt | blocked | contains credentials (API key, AWS key, bearer token) |
Running the tests
uv run pytest -v
24 tests cover every detector, redaction correctness, context preservation, overlap handling, confidence levels, and, importantly, false-positive guards (e.g. the word "password" in ordinary prose must not be redacted).
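A false-positive guard of the kind described above might look like the following pytest-style sketch. The SECRET pattern and the redact_secrets helper are hypothetical stand-ins written for this example, not the contents of scanner.py or test_scanner.py. The idea is that the keyword only counts when it is followed by an assignment.

```python
import re

# Hypothetical generic-secret rule: a keyword, then "=" or ":", then a value.
# Group 1 is the value; the keyword itself is kept for context.
SECRET = re.compile(r"(?i)\b(?:password|passwd|secret)\s*[=:]\s*(\S+)")


def redact_secrets(text: str) -> str:
    def replace_value(m: re.Match) -> str:
        # Keep everything before the captured value, swap the value for a label.
        prefix = m.group(0)[: m.start(1) - m.start(0)]
        return prefix + "[REDACTED_SECRET]"

    return SECRET.sub(replace_value, text)


def test_password_in_prose_is_not_redacted():
    text = "Please reset your password before Friday."
    assert redact_secrets(text) == text


def test_password_assignment_is_redacted():
    assert redact_secrets("password = hunter2") == "password = [REDACTED_SECRET]"
```

Pairing each detector with at least one negative case like the first test helps keep a later regex change from silently widening the match.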
Detection coverage
Type | Confidence | Notes |
Email | high | standard structure |
Phone (formatted) | high | parens / dashes / dots / |
Phone (bare) | low | 10 bare digits (ambiguous) |
SSN (formatted) | high | dashed or spaced |
SSN (bare) | low | 9 bare digits (ambiguous) |
Credit card | high | issuer-prefix + length (Visa, Mastercard) |
Bearer token | high | anchored on the |
API key | high | known vendor prefixes ( |
AWS access key | high | |
Private key | high | full PEM block, header to footer |
Secret (generic) | high | |
Confidence is split deliberately: a formatted SSN or phone number is strong evidence, while bare digits could be an order ID or account number. Low-confidence findings are still redacted (fail-safe), but the distinction is recorded and is used to ensure a low-confidence guess can never trigger a full block.
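As a rough illustration of the confidence split, the sketch below tags a formatted SSN as high confidence and nine bare digits as low confidence. The two regexes and the Finding tuple are assumptions made for this example; the project's actual patterns in scanner.py may differ.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    label: str
    start: int
    end: int
    confidence: str


# Illustrative rules only: same label, different confidence.
RULES = [
    ("SSN", re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"), "high"),  # dashed or spaced
    ("SSN", re.compile(r"\b\d{9}\b"), "low"),                     # bare digits
]


def scan(text: str) -> list[Finding]:
    findings = [
        Finding(label, m.start(), m.end(), confidence)
        for label, pattern, confidence in RULES
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


findings = scan("SSN 123-45-6789, order 123456789")
for f in findings:
    print(f.label, f.confidence, f.start, f.end)
```

Both matches would still be redacted; the confidence field is what lets the policy layer treat the bare-digit match as a guess rather than as strong evidence.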
Policy: allowed / redacted / blocked
Findings | Action | Returned to agent |
none | allowed | original document |
PII / financial (email, phone, SSN, card) | redacted | cleaned document with labels |
credentials (API key, AWS key, bearer, private key) | blocked | [BLOCKED] message, no content |
The block list (BLOCK_TYPES in server.py) is fail-closed: a document containing
live credentials is withheld entirely rather than partially redacted, on the principle
that an agent should not be handling a credentials file at all. The generic SECRET
detector is intentionally redact-only (not block), because it is the fuzziest,
lowest-precision rule and shouldn't withhold a whole document on its own.
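The policy described above could be expressed roughly as follows. This is a sketch under assumptions: findings are modelled as (label, confidence) pairs, the PRIVATE_KEY label name is a guess, and the real decide_action() in server.py may take a different shape.

```python
# Assumed contents; the source names API_KEY, AWS_ACCESS_KEY and BEARER_TOKEN,
# while PRIVATE_KEY is a hypothetical label for the private-key detector.
BLOCK_TYPES = {"API_KEY", "AWS_ACCESS_KEY", "BEARER_TOKEN", "PRIVATE_KEY"}


def decide_action(findings: list[tuple[str, str]]) -> str:
    """Return 'allowed', 'redacted', or 'blocked' for (label, confidence) pairs."""
    if not findings:
        return "allowed"
    # Only a high-confidence credential finding may withhold the whole document.
    if any(label in BLOCK_TYPES and confidence == "high"
           for label, confidence in findings):
        return "blocked"
    # Everything else, including the generic SECRET rule, is redact-only.
    return "redacted"
```

For example, decide_action([("EMAIL", "high"), ("API_KEY", "high")]) yields "blocked", while decide_action([("SECRET", "high")]) yields "redacted" because SECRET is not in the block list.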
Audit log
Every read appends one JSON object to logs/audit_log.jsonl (JSON Lines: append-only,
one record per line). Example:
{"timestamp": "2026-06-26T09:34:21Z", "connector": "mock_google_drive", "tool": "read_document", "document_name": "engineering-notes.txt", "findings_count": 3, "finding_types": ["API_KEY", "AWS_ACCESS_KEY", "BEARER_TOKEN"], "action": "blocked", "original_length": 201, "redacted_length": 0}
Configuration / extensibility
Detection rules live in RULES in scanner.py as a list of (label, compiled_regex, confidence[, capture_group]) tuples. Adding a detector is one line; no changes to the scanning logic are needed.
Redaction labels live in the LABELS dict, so a label changes in one place.
Block policy is the BLOCK_TYPES set in server.py: one line to make the policy stricter or looser.
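To show how a rule table of this shape can drive scanning and redaction, here is a small sketch. The patterns, the label strings, and the scan and redact bodies are illustrative assumptions, not the project's code; only the tuple layout and the RULES and LABELS names come from the description above. The optional fourth element selects which capture group is replaced, so the surrounding keyword survives.

```python
import re

LABELS = {
    "EMAIL": "[REDACTED_EMAIL]",
    "BEARER_TOKEN": "[REDACTED_BEARER_TOKEN]",  # hypothetical label text
}

# (label, compiled_regex, confidence[, capture_group])
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "high"),
    # Group 1 is the token; the "Bearer" keyword is left in place.
    ("BEARER_TOKEN", re.compile(r"\bBearer\s+([A-Za-z0-9._~+/-]+=*)"), "high", 1),
]


def scan(text: str) -> list[tuple[int, int, str, str]]:
    findings = []
    for rule in RULES:
        label, pattern, confidence = rule[:3]
        group = rule[3] if len(rule) > 3 else 0  # default: whole match
        for m in pattern.finditer(text):
            findings.append((m.start(group), m.end(group), label, confidence))
    return sorted(findings)


def redact(text: str, findings: list[tuple[int, int, str, str]]) -> str:
    parts, cursor = [], 0
    for start, end, label, _confidence in findings:
        if start < cursor:  # overlap: the left-most finding wins
            continue
        parts.append(text[cursor:start])
        parts.append(LABELS[label])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)
```

With this layout, adding a detector means appending one tuple to RULES and one entry to LABELS; scan and redact stay unchanged.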
Limitations & what production would need
This is a prototype. Honest gaps, and the reasoning behind them:
Regex-based detection, not ML. Real DLP (Microsoft Purview, Google DLP) combines regex with named-entity recognition and ML classifiers. Regex alone misses context and unusual formats. Production would add an NER/ML layer with a human review queue.
API-key coverage is a finite prefix list. Only hard-coded vendor prefixes are caught (Stripe, GitHub, AWS, …). A vendor whose prefix isn't listed is missed. This is the same approach real secret scanners (Gitleaks, GitGuardian) use, but their lists are far larger and continuously updated.
No entropy-based secret detection. Unlabeled high-entropy strings (a random secret not next to a password = keyword) are not caught. Entropy detection was deliberately skipped because it false-positives heavily on hashes, UUIDs, and git SHAs without a review queue to absorb the noise.
Credit-card matching has no Luhn checksum. Detection is issuer-prefix + length only, so a number matching the prefix pattern but failing the Luhn check would still be flagged. For DLP this over-flagging is the safer error, but a checksum would reduce false positives.
Overlap resolution is position-based. When two findings overlap, the left-most one wins. This is fine for the current rule set but isn't a true severity ranking; a production version would resolve overlaps by a type-priority order.
Mock connector only. Documents are local files. Real Google Drive integration (OAuth, the Drive API, streaming large files) is out of scope.
Single document, full-text scan. No streaming or chunking; very large documents are read into memory whole.
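For the credit-card limitation listed above, a Luhn check is small enough to sketch here. This is the standard algorithm written as a standalone helper; the function name luhn_valid is hypothetical, and how it would be wired into the rule set is left open.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the right; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # a widely used Visa test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

One option consistent with the fail-safe stance above would be to keep redacting prefix-and-length matches that fail the check but record them as low confidence, rather than dropping them.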
Tech
Python · MCP Python SDK (FastMCP) · stdio transport · regex detection · pytest · JSON Lines audit logging.