# ⚠️ SECURITY & PRIVACY WARNING ⚠️
**PLEASE READ CAREFULLY BEFORE USE**
Using this MCP server to detect PII involves sending text data to the Presidio engine. While the processing happens locally within the container or Python process, **using this tool via an LLM Agent (like Claude, ChatGPT, etc.) implies that the text to be analyzed is being shared with that LLM.**
**RISKS:**
* **PII Leakage:** If you ask an LLM to "check this text for PII" or "anonymize this", you are sending the potentially sensitive text *to the LLM provider* first so they can construct the tool call.
* **Context Retention:** The PII may be retained in the LLM's chat history, training data, or logs.
* **Transmitted Context:** PII will be part of the prompt context transmitted over the network.
**RECOMMENDED USE:**
* **Local LLMs:** Use with locally hosted LLMs where data does not leave your infrastructure.
* **Private/Enterprise Agents:** Use in approved enterprise environments with strict data privacy agreements.
* **Non-LLM Integration:** Use the underlying libraries directly in your code without an LLM intermediary if strict privacy is required.
**ALTERNATIVE ARCHITECTURES:**
Consider using Presidio as a filter *before* the LLM. Tools like [LiteLLM](https://github.com/BerriAI/litellm) can integrate Presidio to sanitize input *before* it reaches the LLM provider, preventing PII from ever leaving your control. This MCP server is designed for *agentic* workflows where the LLM decides to check for PII, which inherently carries the risks mentioned above.
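The filter-before-LLM arrangement can be sketched as follows. This is a minimal illustration, not the LiteLLM integration: the two regular expressions are stand-ins for a real Presidio call, and `send_to_llm` is a hypothetical placeholder for whatever provider client you use.

```python
import re

# Stand-in patterns for illustration only; a real filter would call Presidio here.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def sanitize(text: str) -> str:
    """Replace every pattern match with a <ENTITY_TYPE> placeholder."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


def send_to_llm(prompt: str) -> str:
    """Hypothetical provider call; here it only returns what would leave your machine."""
    return prompt


outgoing = send_to_llm(sanitize("I'm Jane Doe, call me at 555-123-4567 or jane@example.com"))
print(outgoing)
```

The name "Jane Doe" passes through untouched, because plain regexes cannot recognize names. That gap is what Presidio's NLP-based recognizers are meant to cover in a real pre-filter.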
# MCP Presidio
A Model Context Protocol (MCP) server that provides PII (Personally Identifiable Information) detection and anonymization capabilities using [Microsoft Presidio](https://github.com/microsoft/presidio). It lets LLM agents detect and anonymize PII in text and structured data, subject to the risks described in the warning above.
## Features
### Core Capabilities
- **PII Detection**: Identify 25+ types of PII including names, emails, phone numbers, credit cards, SSNs, addresses, and more
- **Text Anonymization**: Multiple anonymization strategies (replace, redact, hash, mask, encrypt)
- **Structured Data Support**: Analyze and anonymize JSON/dictionary data recursively
- **Batch Processing**: Process multiple texts efficiently in batch operations
- **Custom Recognizers**: Add domain-specific PII patterns with regex
- **Multi-language Support**: Detect PII in multiple languages
- **Validation Tools**: Test and validate detection accuracy with metrics
### Available MCP Tools
1. **analyze_text** - Detect PII entities in text with confidence scores
2. **anonymize_text** - Anonymize PII using various operators
3. **get_supported_entities** - List all supported PII entity types
4. **add_custom_recognizer** - Add custom PII detection patterns
5. **batch_analyze** - Analyze multiple texts for PII
6. **batch_anonymize** - Anonymize multiple texts
7. **get_anonymization_operators** - List available anonymization methods
8. **analyze_structured_data** - Detect PII in JSON/structured data
9. **anonymize_structured_data** - Anonymize PII in structured data
10. **validate_detection** - Validate detection accuracy with metrics
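For orientation, this is roughly what a client sends when it invokes one of these tools. MCP uses JSON-RPC 2.0 and a `tools/call` method; the argument names `text` and `language` are taken from the examples later in this README and are otherwise an assumption, so check the schema returned by `tools/list` for the authoritative input shape.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names are assumed from this README's examples.
request = build_tool_call(
    1,
    "analyze_text",
    {"text": "My email is john@example.com", "language": "en"},
)
print(request)
```

Note that the raw text travels inside the request, which is why the warning at the top applies to every tool in this list.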
## Installation
Choose your preferred installation method:
- 🐳 **[Docker](#docker-installation-recommended-for-production)** - Self-contained, reproducible environment (recommended for production)
- 🐍 **[Python](#python-installation-quick-install)** - Direct installation with interactive setup
- 📦 **[Manual](#python-installation-manual)** - Full control over the installation process
For detailed Docker deployment instructions, see **[DOCKER.md](DOCKER.md)**.
### Prerequisites
**For Python Installation:**
- Python 3.10 or higher
- pip or uv package manager
**For Docker Installation:**
- Docker 20.10 or higher
- Docker Compose (optional, for easier management)
### Docker Installation (Recommended for Production)
Docker provides a self-contained, reproducible environment with all dependencies pre-installed.
#### Quick Start with Docker
```bash
# Clone the repository
git clone https://github.com/cmalpass/mcp-presidio.git
cd mcp-presidio
# Build the Docker image
docker build -t mcp-presidio .
# Run the container with stdio (default)
docker run -i mcp-presidio
```
#### Using Docker Compose
```bash
# Clone the repository
git clone https://github.com/cmalpass/mcp-presidio.git
cd mcp-presidio
# Build and start the container
docker-compose up -d
# View logs
docker-compose logs -f
# Stop the container
docker-compose down
```
#### Configuring Claude Desktop with Docker
To use the Docker container with Claude Desktop, update your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "presidio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "mcp-presidio:latest"
      ],
      "env": {}
    }
  }
}
```
Or if using a pre-built image from a registry:
```json
{
  "mcpServers": {
    "presidio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/cmalpass/mcp-presidio:latest"
      ],
      "env": {}
    }
  }
}
```
#### Docker Image Details
The Docker image includes:
- Python 3.11 slim base
- All required dependencies (mcp, presidio-analyzer, presidio-anonymizer, spacy)
- Pre-installed English language model (en_core_web_lg)
- Security-hardened with non-root user
- Multi-stage build for minimal image size (~500MB)
#### Advanced Docker Usage
**Interactive Shell for Debugging:**
```bash
docker run -it mcp-presidio bash
```
**Custom Language Models:**
To include additional language models, modify the Dockerfile:
```dockerfile
# Add after the English model installation
RUN python -m spacy download es_core_news_lg # Spanish
RUN python -m spacy download fr_core_news_lg # French
RUN python -m spacy download de_core_news_lg # German
```
Then rebuild the image:
```bash
docker build -t mcp-presidio:multilang .
```
**Volume Mounting for Custom Configurations:**
```bash
docker run -i -v "$(pwd)/config:/app/config:ro" mcp-presidio
```
### Python Installation (Quick Install)
Use the interactive installation script that handles dependencies and language models:
**Unix/Linux/macOS:**
```bash
# Clone the repository
git clone https://github.com/cmalpass/mcp-presidio.git
cd mcp-presidio
# Run the installation script
./install.sh
# or
python install.py
```
**Windows:**
```cmd
REM Clone the repository
git clone https://github.com/cmalpass/mcp-presidio.git
cd mcp-presidio
REM Run the installation script
install.bat
REM or
python install.py
```
The script will:
- Check Python version compatibility
- Install base dependencies (mcp, presidio-analyzer, presidio-anonymizer, spacy)
- Prompt for language model installation (English, Spanish, French, German, etc.)
- Optionally install development dependencies
- Verify the installation
- Test basic functionality
### Python Installation (Manual)
If you prefer manual installation:
```bash
# Clone the repository
git clone https://github.com/cmalpass/mcp-presidio.git
cd mcp-presidio
# Install the package
pip install -e .
# Download required spaCy language model (for English)
python -m spacy download en_core_web_lg
```
For other languages, download the appropriate spaCy model:
```bash
# Spanish
python -m spacy download es_core_news_lg
# French
python -m spacy download fr_core_news_lg
# German
python -m spacy download de_core_news_lg
```
## Usage
### Running the Server
The server runs using stdio transport, suitable for MCP clients:
```bash
mcp-presidio
```
Or run directly with Python:
```bash
python -m mcp_presidio.server
```
### Configuring with Claude Desktop
Add to your Claude Desktop configuration (`claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "presidio": {
      "command": "python",
      "args": ["-m", "mcp_presidio.server"],
      "env": {}
    }
  }
}
```
Or if installed as a script:
```json
{
  "mcpServers": {
    "presidio": {
      "command": "mcp-presidio",
      "args": [],
      "env": {}
    }
  }
}
```
### Example Usage in LLM Conversations
**Detecting PII:**
```
User: Can you check this text for PII? "My name is John Smith and my email is john@example.com"
LLM: I'll analyze that text for PII using the analyze_text tool.
[Tool calls analyze_text with the text]
Result: Found 2 PII entities:
- PERSON: "John Smith" (confidence: 0.85)
- EMAIL_ADDRESS: "john@example.com" (confidence: 1.0)
```
**Anonymizing Text:**
```
User: Can you anonymize this customer feedback? "I'm Jane Doe, call me at 555-123-4567"
LLM: I'll anonymize the PII in that text.
[Tool calls anonymize_text]
Result: "I'm <PERSON>, call me at <PHONE_NUMBER>"
```
**Working with Structured Data:**
```
User: Check this JSON for PII: {"user": "bob@email.com", "phone": "555-0100"}
LLM: I'll analyze the structured data.
[Tool calls analyze_structured_data]
Result: Found PII in 2 fields:
- .user: EMAIL_ADDRESS
- .phone: PHONE_NUMBER
```
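The recursive traversal behind field paths such as `.user` can be sketched like this. It is an illustration of the idea only: the server's actual implementation is not shown in this README, and the `[index]` notation for list items is an assumption.

```python
from typing import Any, Iterator


def iter_string_fields(data: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for every string leaf in nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_string_fields(value, f"{path}.{key}")
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from iter_string_fields(value, f"{path}[{index}]")
    elif isinstance(data, str):
        yield path, data
    # Non-string leaves (numbers, booleans, None) are skipped in this sketch.


record = {"user": "bob@email.com", "contacts": [{"phone": "555-0100"}], "age": 42}
fields = list(iter_string_fields(record))
for field_path, value in fields:
    print(field_path, value)
```

Each yielded string would then be passed to the analyzer, with the path kept alongside the result so findings can be reported per field.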
## Supported PII Entity Types
The server supports 25+ PII entity types including:
- **Personal**: PERSON, DATE_TIME
- **Contact**: EMAIL_ADDRESS, PHONE_NUMBER, URL
- **Financial**: CREDIT_CARD, IBAN_CODE, US_BANK_NUMBER, CRYPTO
- **Government IDs**: US_SSN, US_PASSPORT, US_DRIVER_LICENSE, UK_NHS
- **International IDs**: SG_NRIC_FIN, IN_PAN, IN_AADHAAR, AU_ABN, AU_TFN, AU_MEDICARE
- **Location**: LOCATION, IP_ADDRESS
- **Medical**: MEDICAL_LICENSE
- **Other**: And many more country-specific identifiers
Use the `get_supported_entities` tool to see all available types for your language.
## Anonymization Operators
The server supports multiple anonymization strategies:
1. **replace** - Replace PII with placeholder text (e.g., `<EMAIL_ADDRESS>`)
2. **redact** - Remove PII entirely from text
3. **hash** - Replace with cryptographic hash (SHA-256)
4. **mask** - Mask characters (e.g., `***-**-1234`)
5. **encrypt** - Encrypt PII with AES encryption
6. **keep** - Keep PII as-is (for selective anonymization)
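To make the differences concrete, here is a simplified sketch of what each operator does to a single detected value. It is written against the descriptions above rather than Presidio's code, it omits `encrypt` (which needs a key and a crypto library), and its `mask` keeps only the last four characters, which is an assumption made for illustration.

```python
import hashlib


def apply_operator(value: str, operator: str, entity_type: str) -> str:
    """Illustrative behavior of each operator on one detected PII value."""
    if operator == "replace":
        return f"<{entity_type}>"
    if operator == "redact":
        return ""
    if operator == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if operator == "mask":
        # Simplified: mask everything except the last four characters.
        visible = value[-4:]
        return "*" * (len(value) - len(visible)) + visible
    if operator == "keep":
        return value
    raise ValueError(f"unknown operator: {operator}")


for name in ("replace", "redact", "hash", "mask", "keep"):
    print(name, repr(apply_operator("123-45-6789", name, "US_SSN")))
```

`hash` is deterministic, so the same input always maps to the same digest. That keeps records joinable, but it also means low-entropy values such as SSNs can be recovered by brute force.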
## Advanced Features
### Custom Recognizers
Add domain-specific PII patterns:
```python
# Example: Detect custom employee IDs
add_custom_recognizer(
    name="employee_id_recognizer",
    entity_type="EMPLOYEE_ID",
    patterns=[
        {"name": "emp_pattern", "regex": "EMP-\\d{6}", "score": 0.9}
    ],
    context=["employee", "staff", "worker"]
)
```
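Before registering a pattern, it can help to try it against a few sample strings with Python's `re` module. The sketch below uses the employee-ID regex from the example above; the sample sentences and the stricter `\b` variant are illustrative additions.

```python
import re

loose = re.compile(r"EMP-\d{6}")       # same pattern as in the recognizer example
strict = re.compile(r"\bEMP-\d{6}\b")  # variant that rejects longer digit runs

samples = [
    "Employee EMP-123456 joined",  # valid ID
    "Ticket EMP-12345 is short",   # only five digits
    "Ref EMP-1234567 is long",     # seven digits
]

for sample in samples:
    print(sample, loose.findall(sample), strict.findall(sample))
```

The loose pattern matches the first six digits of the seven-digit value, while the word-boundary variant rejects it. Whether that matters depends on how your identifiers appear in real text.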
### Batch Processing
Process multiple documents efficiently:
```python
# Analyze multiple texts
batch_analyze(
    texts=["Text 1...", "Text 2...", "Text 3..."],
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.5
)
```
### Language Support
Specify different languages:
```python
analyze_text(
    text="Me llamo María García",
    language="es"
)
```
### Validation and Testing
Validate detection accuracy:
```python
validate_detection(
    text="John lives at 123 Main St",
    expected_entities=[
        {"entity_type": "PERSON", "start": 0, "end": 4},
        {"entity_type": "LOCATION", "start": 14, "end": 25}
    ]
)
# Returns precision, recall, and F1 score
```
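The three metrics can be computed from expected and detected spans as sketched below. This assumes exact matching on entity type and offsets, with `end` exclusive; the server's actual matching rules are not documented here and may be more lenient. The `detected` list is made-up sample output.

```python
def span_metrics(expected: list[dict], detected: list[dict]) -> dict[str, float]:
    """Precision, recall and F1 using exact (type, start, end) matches."""
    expected_set = {(e["entity_type"], e["start"], e["end"]) for e in expected}
    detected_set = {(d["entity_type"], d["start"], d["end"]) for d in detected}
    true_positives = len(expected_set & detected_set)
    precision = true_positives / len(detected_set) if detected_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


text = "John lives at 123 Main St"
# Derive offsets from the text instead of counting characters by hand.
start = text.index("123 Main St")
expected = [
    {"entity_type": "PERSON", "start": 0, "end": 4},
    {"entity_type": "LOCATION", "start": start, "end": start + len("123 Main St")},
]
detected = [{"entity_type": "PERSON", "start": 0, "end": 4}]  # hypothetical detector output

metrics = span_metrics(expected, detected)
print(metrics)
```

With one of two expected entities found and no false positives, precision is 1.0, recall is 0.5, and F1 is 2/3.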
## Architecture
This MCP server integrates:
- **FastMCP** (from the `mcp` package): Provides the MCP protocol implementation
- **Presidio Analyzer**: Detects PII using NLP and pattern matching
- **Presidio Anonymizer**: Anonymizes detected PII with various operators
- **spaCy**: Powers the NLP engine for accurate entity recognition
## Security Considerations
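- Presidio analysis and anonymization run locally; the server itself sends no data to external services. Text still passes through whichever LLM invokes the tools (see the warning at the top of this document)
- The server communicates with MCP clients over stdio and does not open a network listener
- Multiple anonymization strategies available for different privacy requirements
- Can serve as one component of a GDPR, HIPAA, or CCPA compliance effort; detection is confidence-based, so the server does not guarantee compliance on its own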
- Docker deployment provides additional isolation and security through containerization
- Container runs as non-root user for enhanced security
## Development
### Running Tests
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
```
### Project Structure
```
mcp-presidio/
├── src/
│   └── mcp_presidio/
│       ├── __init__.py
│       └── server.py        # Main MCP server implementation
├── tests/                   # Test suite
├── Dockerfile               # Docker container definition
├── docker-compose.yml       # Docker Compose configuration
├── docker-entrypoint.sh     # Container entrypoint script
├── .dockerignore            # Docker build exclusions
├── pyproject.toml           # Project configuration
├── README.md                # This file
├── DOCKER.md                # Detailed Docker deployment guide
└── .gitignore
```
## License
MIT License - see LICENSE file for details
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
## Acknowledgments
- [Microsoft Presidio](https://github.com/microsoft/presidio) - The underlying PII detection engine
- [Model Context Protocol](https://modelcontextprotocol.io/) - The protocol specification
- [spaCy](https://spacy.io/) - NLP library for entity recognition
## Support
For issues, questions, or contributions, please visit the [GitHub repository](https://github.com/cmalpass/mcp-presidio).