MCP Athena Analytics Server

CLAUDE.md•6.58 KiB

# warehouse-athena-mcp

**HTTP-based MCP Server for secure Claude access to AWS Athena via SSE protocol**

## Overview

MCP server that provides Claude with validated access to AWS Athena data warehouse. Runs in isolated Docker container with SQL validation, data sanitization, and resource limits.

**Security Model**: Defense-in-depth with 3 layers:
1. **Query Validation** - Whitelist SELECT/WITH, block destructive operations
2. **Data Sanitization** - Redact sensitive columns (password, token, api_key, etc)
3. **Template Blacklist** - Block DROP_TABLE, MSCK_REPAIR_TABLE

## Architecture

```
Client (Claude) → SSE Transport → MCP Server → Validator → Athena → Sanitizer → Client
                     ↓
              Bearer Token Auth
                     ↓
               CORS Whitelist
```

### Core Components

**Security Layer** (`validator.py`, `sanitizer.py`, `blacklist.py`):
- `validator.validate_query(sql)` - Validates SELECT/WITH only, blocks DROP/DELETE/INSERT/etc
- `sanitizer.sanitize_results(rows)` - Redacts columns matching sensitive patterns
- `blacklist.is_template_blacklisted(name)` - Blocks dangerous templates

**MCP Tools** (`athena_tools/`):
- `execute_query` - Ad-hoc SQL with Jinja2 templates (validated, sanitized, max 100K rows)
- `list_tables` - Discover database schema
- `get_table_schema` - Get DDL for table

**HTTP Server** (`server_http.py`):
- SSE transport for MCP protocol
- Bearer token authentication (MCP_AUTH_TOKEN env var)
- CORS whitelist (MCP_ALLOWED_ORIGINS env var)
- Endpoints: `/sse`, `/messages`, `/health`

**Athena Client** (`lib/aws/athena/athena_helper.py`):
- Query execution with caching
- Result streaming (lazy iteration)
- Credential management (IRSA roles + direct keys)
- Error handling (HIVECannotOpenSplit, HIVEMalformedData)

## Project Structure

```
app/
├── athena_tools/           # MCP tool implementations
│   ├── execute_query.py    # Ad-hoc SQL execution
│   ├── list_tables.py
│   └── get_table_schema.py
├── lib/
│   ├── aws/athena/         # Athena client library
│   │   ├── athena_helper.py
│   │   ├── athena_config.py
│   │   └── exceptions.py
│   └── aws/
│       ├── aws_auth_helper.py
│       └── aws_profile_config.py
├── validator.py            # SQL validation (whitelist)
├── sanitizer.py            # Data sanitization (redaction)
├── blacklist.py            # Template blacklist
├── server_http.py          # HTTP MCP server
└── settings.py             # Configuration
```

## Configuration

### Required Environment Variables

```bash
# AWS Athena
AWS_ATHENA_S3_TMP_LOCATION=s3://your-bucket/athena-results/
AWS_ATHENA_DB_MAIN=analytics_db
AWS_ATHENA_BRAND_NAME=alpha

# MCP Server
MCP_AUTH_TOKEN=your-secret-token-here

# Optional
AWS_ATHENA_WORKGROUP=analytics-backend  # Default: analytics-backend
MCP_PORT=8000                            # Default: 8000
MCP_ALLOWED_ORIGINS=http://localhost:*  # Default: localhost only
```

See `.env.example` for full list.

## Security Details

### Query Validation (`validator.py`)

**Allowed patterns**:
- `SELECT ...` - Read-only queries
- `WITH ... SELECT ...` - CTEs for complex queries

**Blocked keywords** (word boundaries, case-insensitive):
- `DROP`, `DELETE`, `TRUNCATE`, `INSERT`, `UPDATE`
- `CREATE`, `ALTER`, `REPLACE`

**Resource limits**:
- Max rows: 100,000 (truncated)
- Timeout: 600 seconds
- Scan warning: 100 GB

```python
# Example
validate_query('SELECT * FROM table')  # OK
validate_query('DROP TABLE users')     # QueryValidationError
```

### Data Sanitization (`sanitizer.py`)

**Sensitive column patterns** (regex, case-insensitive):
- `\bpassword\b`, `\bpasswd\b`, `\bpwd\b`
- `\btoken\b`, `\bapi_key\b`, `\bsecret\b`
- `\bcredential`, `\bauth_token\b`

**Redaction**: Replaces values with `***REDACTED***`

```python
# Example
rows = [{'id': 1, 'password': 'hash', 'email': 'user@example.com'}]
sanitized = sanitize_results(rows)
# → [{'id': 1, 'password': '***REDACTED***', 'email': 'user@example.com'}]
```

### Template Blacklist (`blacklist.py`)

**Blocked templates**:
- `DROP_TABLE` - Permanently deletes table
- `DROP_TABLE_IF_EXISTS` - Conditional drop
- `MSCK_REPAIR_TABLE` - Can corrupt Glue Catalog metadata

### Authentication & CORS

**Bearer Token**:
- Required on all endpoints
- Set via `MCP_AUTH_TOKEN` environment variable
- Server fails to start if not set

**CORS Whitelist**:
- Configurable via `MCP_ALLOWED_ORIGINS` (comma-separated)
- Default: `http://localhost:*,http://127.0.0.1:*`
- Restricts headers to `Authorization`, `Content-Type`

## Code Style

- **Python 3.13** with strict type checking (`mypy --strict`)
- **Type annotations**: All functions, use `str | None` syntax
- **Single quotes** for strings
- **Line length**: 100 characters max
- **Tools**: black, isort, flake8, pre-commit hooks

```python
# Good ✅
def execute_query(sql: str, params: dict[str, Any] | None = None) -> dict[str, Any]:
    ...

# Bad ❌
def execute_query(sql, params=None):  # Missing types
    ...
```

## Running Locally

```bash
# 1. Setup environment
cp .env.example .env
# Edit .env with your AWS credentials

# 2. Install dependencies
python3.13 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 3. Run server
PYTHONPATH=app python app/server_http.py

# 4. Test
curl http://localhost:8000/health
```

## Common Issues

**Server won't start**:
- Check `MCP_AUTH_TOKEN` is set
- Check `AWS_ATHENA_S3_TMP_LOCATION` is set

**Query rejected**:
- Use only SELECT/WITH queries
- Avoid DROP/DELETE/INSERT keywords

**CORS error**:
- Add origin to `MCP_ALLOWED_ORIGINS`

**Authentication failed**:
- Verify `Authorization: Bearer <token>` header matches `MCP_AUTH_TOKEN`

## Performance

- **Query caching**: 24h default TTL
- **Result streaming**: Lazy iteration, not loaded in memory
- **Truncation**: Max 100K rows prevents OOM
- **Sanitization**: O(n×m) with precompiled regex (~50ms for 100K rows)

## Key Files Reference

| File | Purpose |
|------|---------|
| `validator.py` | SQL query validation (whitelist SELECT/WITH) |
| `sanitizer.py` | Redact sensitive data from results |
| `blacklist.py` | Block dangerous query templates |
| `server_http.py` | HTTP MCP server with SSE transport |
| `athena_tools/execute_query.py` | Main ad-hoc query execution |
| `lib/aws/athena/athena_helper.py` | Core Athena client |
| `settings.py` | Environment configuration |

---

**Tech Stack**: Python 3.13, MCP SDK, AWS Athena, boto3, Jinja2, Starlette (ASGI)
**Protocol**: MCP over SSE (Server-Sent Events)
**Security**: Bearer token auth, CORS whitelist, SQL validation, data sanitization

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/elena-kuznetsova-wh/warehouse-athena-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

CLAUDE.md•6.58 KiB

# warehouse-athena-mcp

**HTTP-based MCP Server for secure Claude access to AWS Athena via SSE protocol**

## Overview

MCP server that provides Claude with validated access to AWS Athena data warehouse. Runs in isolated Docker container with SQL validation, data sanitization, and resource limits.

**Security Model**: Defense-in-depth with 3 layers:
1. **Query Validation** - Whitelist SELECT/WITH, block destructive operations
2. **Data Sanitization** - Redact sensitive columns (password, token, api_key, etc)
3. **Template Blacklist** - Block DROP_TABLE, MSCK_REPAIR_TABLE

## Architecture

```
Client (Claude) → SSE Transport → MCP Server → Validator → Athena → Sanitizer → Client
                     ↓
              Bearer Token Auth
                     ↓
               CORS Whitelist
```

### Core Components

**Security Layer** (`validator.py`, `sanitizer.py`, `blacklist.py`):
- `validator.validate_query(sql)` - Validates SELECT/WITH only, blocks DROP/DELETE/INSERT/etc
- `sanitizer.sanitize_results(rows)` - Redacts columns matching sensitive patterns
- `blacklist.is_template_blacklisted(name)` - Blocks dangerous templates

**MCP Tools** (`athena_tools/`):
- `execute_query` - Ad-hoc SQL with Jinja2 templates (validated, sanitized, max 100K rows)
- `list_tables` - Discover database schema
- `get_table_schema` - Get DDL for table

**HTTP Server** (`server_http.py`):
- SSE transport for MCP protocol
- Bearer token authentication (MCP_AUTH_TOKEN env var)
- CORS whitelist (MCP_ALLOWED_ORIGINS env var)
- Endpoints: `/sse`, `/messages`, `/health`

**Athena Client** (`lib/aws/athena/athena_helper.py`):
- Query execution with caching
- Result streaming (lazy iteration)
- Credential management (IRSA roles + direct keys)
- Error handling (HIVECannotOpenSplit, HIVEMalformedData)

## Project Structure

```
app/
├── athena_tools/           # MCP tool implementations
│   ├── execute_query.py    # Ad-hoc SQL execution
│   ├── list_tables.py
│   └── get_table_schema.py
├── lib/
│   ├── aws/athena/         # Athena client library
│   │   ├── athena_helper.py
│   │   ├── athena_config.py
│   │   └── exceptions.py
│   └── aws/
│       ├── aws_auth_helper.py
│       └── aws_profile_config.py
├── validator.py            # SQL validation (whitelist)
├── sanitizer.py            # Data sanitization (redaction)
├── blacklist.py            # Template blacklist
├── server_http.py          # HTTP MCP server
└── settings.py             # Configuration
```

## Configuration

### Required Environment Variables

```bash
# AWS Athena
AWS_ATHENA_S3_TMP_LOCATION=s3://your-bucket/athena-results/
AWS_ATHENA_DB_MAIN=analytics_db
AWS_ATHENA_BRAND_NAME=alpha

# MCP Server
MCP_AUTH_TOKEN=your-secret-token-here

# Optional
AWS_ATHENA_WORKGROUP=analytics-backend  # Default: analytics-backend
MCP_PORT=8000                            # Default: 8000
MCP_ALLOWED_ORIGINS=http://localhost:*  # Default: localhost only
```

See `.env.example` for full list.

## Security Details

### Query Validation (`validator.py`)

**Allowed patterns**:
- `SELECT ...` - Read-only queries
- `WITH ... SELECT ...` - CTEs for complex queries

**Blocked keywords** (word boundaries, case-insensitive):
- `DROP`, `DELETE`, `TRUNCATE`, `INSERT`, `UPDATE`
- `CREATE`, `ALTER`, `REPLACE`

**Resource limits**:
- Max rows: 100,000 (truncated)
- Timeout: 600 seconds
- Scan warning: 100 GB

```python
# Example
validate_query('SELECT * FROM table')  # OK
validate_query('DROP TABLE users')     # QueryValidationError
```

### Data Sanitization (`sanitizer.py`)

**Sensitive column patterns** (regex, case-insensitive):
- `\bpassword\b`, `\bpasswd\b`, `\bpwd\b`
- `\btoken\b`, `\bapi_key\b`, `\bsecret\b`
- `\bcredential`, `\bauth_token\b`

**Redaction**: Replaces values with `***REDACTED***`

```python
# Example
rows = [{'id': 1, 'password': 'hash', 'email': 'user@example.com'}]
sanitized = sanitize_results(rows)
# → [{'id': 1, 'password': '***REDACTED***', 'email': 'user@example.com'}]
```

### Template Blacklist (`blacklist.py`)

**Blocked templates**:
- `DROP_TABLE` - Permanently deletes table
- `DROP_TABLE_IF_EXISTS` - Conditional drop
- `MSCK_REPAIR_TABLE` - Can corrupt Glue Catalog metadata

### Authentication & CORS

**Bearer Token**:
- Required on all endpoints
- Set via `MCP_AUTH_TOKEN` environment variable
- Server fails to start if not set

**CORS Whitelist**:
- Configurable via `MCP_ALLOWED_ORIGINS` (comma-separated)
- Default: `http://localhost:*,http://127.0.0.1:*`
- Restricts headers to `Authorization`, `Content-Type`

## Code Style

- **Python 3.13** with strict type checking (`mypy --strict`)
- **Type annotations**: All functions, use `str | None` syntax
- **Single quotes** for strings
- **Line length**: 100 characters max
- **Tools**: black, isort, flake8, pre-commit hooks

```python
# Good ✅
def execute_query(sql: str, params: dict[str, Any] | None = None) -> dict[str, Any]:
    ...

# Bad ❌
def execute_query(sql, params=None):  # Missing types
    ...
```

## Running Locally

```bash
# 1. Setup environment
cp .env.example .env
# Edit .env with your AWS credentials

# 2. Install dependencies
python3.13 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 3. Run server
PYTHONPATH=app python app/server_http.py

# 4. Test
curl http://localhost:8000/health
```

## Common Issues

**Server won't start**:
- Check `MCP_AUTH_TOKEN` is set
- Check `AWS_ATHENA_S3_TMP_LOCATION` is set

**Query rejected**:
- Use only SELECT/WITH queries
- Avoid DROP/DELETE/INSERT keywords

**CORS error**:
- Add origin to `MCP_ALLOWED_ORIGINS`

**Authentication failed**:
- Verify `Authorization: Bearer <token>` header matches `MCP_AUTH_TOKEN`

## Performance

- **Query caching**: 24h default TTL
- **Result streaming**: Lazy iteration, not loaded in memory
- **Truncation**: Max 100K rows prevents OOM
- **Sanitization**: O(n×m) with precompiled regex (~50ms for 100K rows)

## Key Files Reference

| File | Purpose |
|------|---------|
| `validator.py` | SQL query validation (whitelist SELECT/WITH) |
| `sanitizer.py` | Redact sensitive data from results |
| `blacklist.py` | Block dangerous query templates |
| `server_http.py` | HTTP MCP server with SSE transport |
| `athena_tools/execute_query.py` | Main ad-hoc query execution |
| `lib/aws/athena/athena_helper.py` | Core Athena client |
| `settings.py` | Environment configuration |

---

**Tech Stack**: Python 3.13, MCP SDK, AWS Athena, boto3, Jinja2, Starlette (ASGI)
**Protocol**: MCP over SSE (Server-Sent Events)
**Security**: Bearer token auth, CORS whitelist, SQL validation, data sanitization