# warehouse-athena-mcp
**HTTP-based MCP Server for secure Claude access to AWS Athena via SSE protocol**
## Overview
MCP server that provides Claude with validated access to AWS Athena data warehouse. Runs in isolated Docker container with SQL validation, data sanitization, and resource limits.
**Security Model**: Defense-in-depth with 3 layers:
1. **Query Validation** - Whitelist SELECT/WITH, block destructive operations
2. **Data Sanitization** - Redact sensitive columns (password, token, api_key, etc)
3. **Template Blacklist** - Block DROP_TABLE, MSCK_REPAIR_TABLE
## Architecture
```
Client (Claude) → SSE Transport → MCP Server → Validator → Athena → Sanitizer → Client
↓
Bearer Token Auth
↓
CORS Whitelist
```
### Core Components
**Security Layer** (`validator.py`, `sanitizer.py`, `blacklist.py`):
- `validator.validate_query(sql)` - Validates SELECT/WITH only, blocks DROP/DELETE/INSERT/etc
- `sanitizer.sanitize_results(rows)` - Redacts columns matching sensitive patterns
- `blacklist.is_template_blacklisted(name)` - Blocks dangerous templates
**MCP Tools** (`athena_tools/`):
- `execute_query` - Ad-hoc SQL with Jinja2 templates (validated, sanitized, max 100K rows)
- `list_tables` - Discover database schema
- `get_table_schema` - Get DDL for table
**HTTP Server** (`server_http.py`):
- SSE transport for MCP protocol
- Bearer token authentication (MCP_AUTH_TOKEN env var)
- CORS whitelist (MCP_ALLOWED_ORIGINS env var)
- Endpoints: `/sse`, `/messages`, `/health`
**Athena Client** (`lib/aws/athena/athena_helper.py`):
- Query execution with caching
- Result streaming (lazy iteration)
- Credential management (IRSA roles + direct keys)
- Error handling (HIVECannotOpenSplit, HIVEMalformedData)
## Project Structure
```
app/
├── athena_tools/ # MCP tool implementations
│ ├── execute_query.py # Ad-hoc SQL execution
│ ├── list_tables.py
│ └── get_table_schema.py
├── lib/
│ ├── aws/athena/ # Athena client library
│ │ ├── athena_helper.py
│ │ ├── athena_config.py
│ │ └── exceptions.py
│ └── aws/
│ ├── aws_auth_helper.py
│ └── aws_profile_config.py
├── validator.py # SQL validation (whitelist)
├── sanitizer.py # Data sanitization (redaction)
├── blacklist.py # Template blacklist
├── server_http.py # HTTP MCP server
└── settings.py # Configuration
```
## Configuration
### Required Environment Variables
```bash
# AWS Athena
AWS_ATHENA_S3_TMP_LOCATION=s3://your-bucket/athena-results/
AWS_ATHENA_DB_MAIN=analytics_db
AWS_ATHENA_BRAND_NAME=alpha
# MCP Server
MCP_AUTH_TOKEN=your-secret-token-here
# Optional
AWS_ATHENA_WORKGROUP=analytics-backend # Default: analytics-backend
MCP_PORT=8000 # Default: 8000
MCP_ALLOWED_ORIGINS=http://localhost:* # Default: localhost only
```
See `.env.example` for full list.
## Security Details
### Query Validation (`validator.py`)
**Allowed patterns**:
- `SELECT ...` - Read-only queries
- `WITH ... SELECT ...` - CTEs for complex queries
**Blocked keywords** (word boundaries, case-insensitive):
- `DROP`, `DELETE`, `TRUNCATE`, `INSERT`, `UPDATE`
- `CREATE`, `ALTER`, `REPLACE`
**Resource limits**:
- Max rows: 100,000 (truncated)
- Timeout: 600 seconds
- Scan warning: 100 GB
```python
# Example
validate_query('SELECT * FROM table') # OK
validate_query('DROP TABLE users') # QueryValidationError
```
### Data Sanitization (`sanitizer.py`)
**Sensitive column patterns** (regex, case-insensitive):
- `\bpassword\b`, `\bpasswd\b`, `\bpwd\b`
- `\btoken\b`, `\bapi_key\b`, `\bsecret\b`
- `\bcredential`, `\bauth_token\b`
**Redaction**: Replaces values with `***REDACTED***`
```python
# Example
rows = [{'id': 1, 'password': 'hash', 'email': 'user@example.com'}]
sanitized = sanitize_results(rows)
# → [{'id': 1, 'password': '***REDACTED***', 'email': 'user@example.com'}]
```
### Template Blacklist (`blacklist.py`)
**Blocked templates**:
- `DROP_TABLE` - Permanently deletes table
- `DROP_TABLE_IF_EXISTS` - Conditional drop
- `MSCK_REPAIR_TABLE` - Can corrupt Glue Catalog metadata
### Authentication & CORS
**Bearer Token**:
- Required on all endpoints
- Set via `MCP_AUTH_TOKEN` environment variable
- Server fails to start if not set
**CORS Whitelist**:
- Configurable via `MCP_ALLOWED_ORIGINS` (comma-separated)
- Default: `http://localhost:*,http://127.0.0.1:*`
- Restricts headers to `Authorization`, `Content-Type`
## Code Style
- **Python 3.13** with strict type checking (`mypy --strict`)
- **Type annotations**: All functions, use `str | None` syntax
- **Single quotes** for strings
- **Line length**: 100 characters max
- **Tools**: black, isort, flake8, pre-commit hooks
```python
# Good ✅
def execute_query(sql: str, params: dict[str, Any] | None = None) -> dict[str, Any]:
...
# Bad ❌
def execute_query(sql, params=None): # Missing types
...
```
## Running Locally
```bash
# 1. Setup environment
cp .env.example .env
# Edit .env with your AWS credentials
# 2. Install dependencies
python3.13 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# 3. Run server
PYTHONPATH=app python app/server_http.py
# 4. Test
curl http://localhost:8000/health
```
## Common Issues
**Server won't start**:
- Check `MCP_AUTH_TOKEN` is set
- Check `AWS_ATHENA_S3_TMP_LOCATION` is set
**Query rejected**:
- Use only SELECT/WITH queries
- Avoid DROP/DELETE/INSERT keywords
**CORS error**:
- Add origin to `MCP_ALLOWED_ORIGINS`
**Authentication failed**:
- Verify `Authorization: Bearer <token>` header matches `MCP_AUTH_TOKEN`
## Performance
- **Query caching**: 24h default TTL
- **Result streaming**: Lazy iteration, not loaded in memory
- **Truncation**: Max 100K rows prevents OOM
- **Sanitization**: O(n×m) with precompiled regex (~50ms for 100K rows)
## Key Files Reference
| File | Purpose |
|------|---------|
| `validator.py` | SQL query validation (whitelist SELECT/WITH) |
| `sanitizer.py` | Redact sensitive data from results |
| `blacklist.py` | Block dangerous query templates |
| `server_http.py` | HTTP MCP server with SSE transport |
| `athena_tools/execute_query.py` | Main ad-hoc query execution |
| `lib/aws/athena/athena_helper.py` | Core Athena client |
| `settings.py` | Environment configuration |
---
**Tech Stack**: Python 3.13, MCP SDK, AWS Athena, boto3, Jinja2, Starlette (ASGI)
**Protocol**: MCP over SSE (Server-Sent Events)
**Security**: Bearer token auth, CORS whitelist, SQL validation, data sanitization