# Production Deployment Guide
## Prerequisites
- Python 3.9+
- Access to Spark History Server
- Gemini API Key
## Installation
### 1. Clone and Setup
```bash
cd /path/to/deployment
git clone <repository-url> spark_optimizer
cd spark_optimizer
```
### 2. Install Dependencies
```bash
pip install -r requirements.txt
```
### 3. Configure Environment
```bash
cp .env.example .env
# Edit .env with your settings
nano .env
```
Required settings (a minimal example follows the list):
- `GEMINI_API_KEY`: Your Gemini API key
- `SPARK_OPT_HISTORY_URL`: Spark History Server URL
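A minimal `.env` with these settings might look like the following; both values are placeholders:
```bash
GEMINI_API_KEY=your-gemini-api-key
SPARK_OPT_HISTORY_URL=http://your-history-server:18080
```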
## Running the Optimizer
### CLI Mode
```bash
set -a; source .env; set +a  # export every variable defined in .env
python3 spark_optimize.py \
  --appId application_1234567890_0001 \
  --historyUrl http://your-history-server:18080 \
  --jobCode path/to/job.py \
  --output reports/analysis.json
```
### As a Service (MCP Server)
```bash
set -a; source .env; set +a  # export every variable defined in .env
python3 -m src.server
```
The MCP server starts on its default port and exposes tools such as the following (a client sketch appears after the list):
- `get_application_summary`
- `get_jobs`
- `get_stages`
- `get_executors`
- `get_sql_executions`
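To exercise the running server from a client, the official MCP Python SDK can be used. The sketch below is an assumption-laden starting point: it presumes the server speaks MCP over stdio and that `get_application_summary` accepts an `app_id` argument; adjust both to match the actual server.
```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Spawn the server as a subprocess and talk MCP over stdio.
    server = StdioServerParameters(command="python3", args=["-m", "src.server"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [tool.name for tool in tools.tools])
            # Hypothetical call; the argument name is an assumption.
            summary = await session.call_tool(
                "get_application_summary",
                {"app_id": "application_1234567890_0001"},
            )
            print(summary)

asyncio.run(main())
```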
## Production Considerations
### 1. Resource Requirements
- **Memory**: 2GB minimum, 4GB recommended
- **CPU**: 2 cores minimum
- **Network**: Low latency to Spark History Server
### 2. Rate Limiting
The system implements exponential backoff for Gemini API rate limits:
- Initial retry delay: 5 seconds
- Max retries: 5
- Exponential backoff multiplier: 2x
Configure via:
```bash
SPARK_OPT_MAX_RETRIES=5
SPARK_OPT_RETRY_DELAY=5.0
```
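In sketch form, the retry loop looks roughly like this (a minimal illustration with hypothetical names, not the project's actual code):
```python
import os
import random
import time

MAX_RETRIES = int(os.environ.get("SPARK_OPT_MAX_RETRIES", "5"))
RETRY_DELAY = float(os.environ.get("SPARK_OPT_RETRY_DELAY", "5.0"))

def call_with_backoff(fn, *args, **kwargs):
    """Retry fn on rate-limit errors, doubling the delay after each failure."""
    delay = RETRY_DELAY
    for attempt in range(MAX_RETRIES + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:  # in practice, catch only the API's rate-limit error
            if attempt == MAX_RETRIES:
                raise
            time.sleep(delay + random.uniform(0, 1))  # jitter spreads out retries
            delay *= 2  # 2x multiplier, matching the settings above
```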
### 3. Logging
Set log level via environment:
```bash
SPARK_OPT_LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
```
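Under the hood this presumably maps onto Python's standard `logging` configuration; a minimal sketch (the format string is an assumption):
```python
import logging
import os

level_name = os.environ.get("SPARK_OPT_LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```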
Logs include:
- API request/response details
- Agent analysis steps
- Error traces
### 4. Monitoring
Monitor these metrics (a minimal capture sketch follows the list):
- API call success rate
- Analysis completion time
- LLM token usage
- Error rates by type
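A lightweight in-process way to capture the timing and error-rate metrics is sketched below; all names are hypothetical, and token usage would instead come from the LLM client's response metadata:
```python
import time
from collections import Counter

metrics = Counter()

def timed_analysis(run_analysis, app_id):
    """Wrap one analysis call, recording duration and success/error counts."""
    start = time.monotonic()
    try:
        report = run_analysis(app_id)
        metrics["analyses_succeeded"] += 1
        return report
    except Exception as exc:
        metrics[f"errors_{type(exc).__name__}"] += 1  # error rate by type
        raise
    finally:
        metrics["analysis_seconds_total"] += time.monotonic() - start
```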
### 5. Security
- **API Keys**: Store in environment variables, never commit
- **Network**: Use HTTPS for Spark History Server if possible
- **Access Control**: Restrict who can run analyses
## Troubleshooting
### Issue: "Connection refused" to Spark History Server
**Solution**: Verify that the Spark History Server is running and reachable:
```bash
curl http://localhost:18080/api/v1/applications
```
### Issue: "Quota exceeded" from Gemini API
**Solution**: The system retries automatically with exponential backoff. If the errors persist:
1. Check API quota limits
2. Increase `SPARK_OPT_RETRY_DELAY`
3. Reduce analysis frequency
### Issue: Empty or incomplete reports
**Solution**:
1. Check that the Spark History Server has complete data for the run
2. Verify the application ID is correct (a quick REST check appears below the list)
3. Enable DEBUG logging to see agent responses
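For the ID check, query the History Server's REST API directly; a 404 response means the application is unknown to the server (the ID below is the earlier example):
```bash
curl http://your-history-server:18080/api/v1/applications/application_1234567890_0001
```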
### Issue: High memory usage
**Solution**:
1. Reduce `SPARK_OPT_MAX_STAGES` (default: 5)
2. Disable code analysis: `SPARK_OPT_CODE_ANALYSIS=false`
3. Process smaller applications first
## Performance Tuning
### For Large Clusters
```bash
SPARK_OPT_MAX_STAGES=10 # Analyze more stages
SPARK_OPT_TIMEOUT=60 # Longer timeout
```
### For Rate-Limited Environments
```bash
SPARK_OPT_MAX_RETRIES=10
SPARK_OPT_RETRY_DELAY=10.0
```
## Scaling
### Horizontal Scaling
Deploy multiple instances, one per Spark History Server, optionally behind a load balancer:
```bash
# Instance 1
SPARK_OPT_HISTORY_URL=http://cluster1:18080 python3 -m src.server
# Instance 2
SPARK_OPT_HISTORY_URL=http://cluster2:18080 python3 -m src.server
```
### Batch Processing
Process multiple applications:
```bash
while read -r app_id; do
  python3 spark_optimize.py --appId "$app_id" --output "reports/${app_id}.json"
done < app_ids.txt
```
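For larger batches the loop can be parallelized; a sketch using GNU xargs (tune `-P` against your Gemini API quota):
```bash
# Run up to two analyses at a time, one application ID per input line.
xargs -P 2 -I {} python3 spark_optimize.py --appId {} --output reports/{}.json < app_ids.txt
```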
## Health Checks
### Liveness Check
```bash
curl http://localhost:8080/health
```
### Readiness Check
```bash
python3 -c "from src.client import SparkHistoryClient; c = SparkHistoryClient(); print('OK' if c.get_applications() is not None else 'FAIL')"
```