# MCP Orchestrator - Complete Deployment Guide
This guide covers end-to-end deployment of the MCP Orchestrator with SkyPilot and AI Envoy Gateway.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [SkyPilot Setup](#skypilot-setup)
3. [Cloud Provider Configuration](#cloud-provider-configuration)
4. [MCP Deployment](#mcp-deployment)
5. [AI Envoy Configuration](#ai-envoy-configuration)
6. [Verification](#verification)
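7. [First Deployment](#first-deployment)
8. [Troubleshooting](#troubleshooting)
9. [Next Steps](#next-steps)
10. [Support](#support)
11. [Security Checklist](#security-checklist)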
## Prerequisites
### System Requirements
- **OS**: Linux (Ubuntu 20.04+) or macOS
- **Docker**: 20.10+
- **Docker Compose**: 2.0+
- **RAM**: 8GB minimum, 16GB recommended
- **Disk**: 50GB free space
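The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and `version_ge` assumes a `sort` that supports `-V` (GNU coreutils and recent macOS).

```bash
#!/usr/bin/env bash
# Preflight sketch: compare the local machine against the minimums listed above.

# version_ge A B: succeeds when dotted version A is greater than or equal to B.
version_ge() {
  # sort -V orders version strings; if B comes out first, then A >= B.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_docker: succeeds when the Docker client is at least 20.10.
check_docker() {
  local v
  v="$(docker version --format '{{.Client.Version}}' 2>/dev/null)" || return 1
  version_ge "$v" "20.10"
}

# free_disk_gb [PATH]: whole gigabytes available on the filesystem holding PATH.
free_disk_gb() {
  df -Pk "${1:-.}" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}
```

For example, `check_docker && [ "$(free_disk_gb .)" -ge 50 ] && echo "preflight OK"` prints a message only when both the Docker and disk requirements are met.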
### Install Docker
```bash
# Ubuntu/Debian
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Verify
docker --version
docker-compose --version
```
## SkyPilot Setup
### 1. Install SkyPilot Locally (Optional, for Testing)
```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install SkyPilot with cloud providers
# (the runpod and vast extras are needed for the RunPod and Vast.ai checks)
pip install "skypilot[aws,gcp,azure,runpod,vast]"
# Verify installation
sky check
```
After credentials are configured (step 2 below), the output should mark each cloud as enabled:
```
✓ AWS
✓ GCP
✓ Azure
✓ RunPod
✓ Vast.ai
```
### 2. Configure Cloud Credentials
#### AWS
```bash
# Install AWS CLI
pip install awscli
# Configure credentials
aws configure
# Test
aws sts get-caller-identity
```
Set in `.env`:
```bash
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
```
#### GCP
```bash
# Install gcloud CLI
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
# Authenticate
gcloud auth application-default login
# Set project
gcloud config set project YOUR_PROJECT_ID
```
Or use service account:
```bash
# Create service account key
gcloud iam service-accounts keys create key.json \
--iam-account=YOUR_SA@YOUR_PROJECT.iam.gserviceaccount.com
# Set in .env
GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
```
#### RunPod
```bash
# Get API key from https://runpod.io/console/user/settings
# Set in .env
RUNPOD_API_KEY=your_key_here
```
Test SkyPilot with RunPod:
```bash
sky launch --cloud runpod --gpus RTX3060:1 -- echo "Hello from RunPod"
```
#### Vast.ai
```bash
# Get API key from https://cloud.vast.ai/account/
# Set in .env
VASTAI_API_KEY=your_key_here
```
### 3. Verify SkyPilot Configuration
```bash
# Check all clouds
sky check
# List available GPUs on RunPod
sky show-gpus --cloud runpod
# List available GPUs on Vast.ai
sky show-gpus --cloud vastai
# Check quotas
sky quota
```
## Cloud Provider Configuration
### RunPod Setup
1. **Create Account**: https://runpod.io
2. **Add Credit**: Billing → Add Funds
3. **Get API Key**: Settings → API Keys → Create
4. **Test Connection**:
```bash
curl -X POST https://api.runpod.io/graphql \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "query { myself { id } }"}'
```
### Vast.ai Setup
1. **Create Account**: https://cloud.vast.ai
2. **Add Credit**: Account → Billing
3. **Get API Key**: Account → API Keys
4. **Test Connection**:
```bash
curl -H "Authorization: Bearer YOUR_KEY" \
https://api.vast.ai/v0/instances
```
## MCP Deployment
### 1. Clone Repository
```bash
git clone https://github.com/your-org/mcp-orchestrator.git
cd mcp-orchestrator
```
### 2. Configure Environment
```bash
# Copy template
cp .env.example .env
# Edit configuration
nano .env
```
**Required Variables**:
```bash
# At minimum, set these:
RUNPOD_API_KEY=xxx
VASTAI_API_KEY=xxx
CLICKHOUSE_PASSWORD=strong_password_here
```
**Optional but Recommended**:
```bash
# For multi-cloud via SkyPilot
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-key.json
# For observability
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
```
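A quick check that the required variables are actually present in `.env` can catch typos before any container starts. This is a sketch under the assumption that `.env` uses plain `KEY=value` lines; the function name `missing_env_vars` is hypothetical.

```bash
#!/usr/bin/env bash
# missing_env_vars FILE VAR...: print each VAR that is absent or empty in FILE.
# Returns non-zero when at least one variable is missing.
missing_env_vars() {
  local file="$1" var status=0
  shift
  for var in "$@"; do
    # Match "VAR=<at least one character>" at the start of a line.
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "$var"
      status=1
    fi
  done
  return "$status"
}
```

Running `missing_env_vars .env RUNPOD_API_KEY VASTAI_API_KEY CLICKHOUSE_PASSWORD` prints nothing when all three are set. It does not detect placeholder values such as `xxx`.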
### 3. Initialize ClickHouse Schema
The schema is automatically initialized on first start, but you can pre-create it:
```bash
# Start only ClickHouse
docker-compose up -d clickhouse
# Wait for it to be ready
sleep 10
# Initialize schema
# -T disables TTY allocation so the SQL file can be piped in on stdin
docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql
# Verify tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
```
Should show:
```
deployments
events
gpu_rules
heartbeats
resources
```
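A fixed `sleep` can be too short on a slow machine and wastes time on a fast one. One alternative is to poll until ClickHouse answers a query. The helper below is a sketch; `wait_for` is a hypothetical name, not something shipped in the repository.

```bash
#!/usr/bin/env bash
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns non-zero if it never succeeds.
wait_for() {
  local tries="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the pause after the final failed attempt.
    if ((i < tries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `wait_for 30 2 docker-compose exec -T clickhouse clickhouse-client --query "SELECT 1"` waits up to about a minute for the server to accept queries.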
### 4. Deploy All Services
```bash
# Make deploy script executable
chmod +x scripts/deploy.sh
# Run deployment
./scripts/deploy.sh
```
The script will:
1. Build Docker images
2. Start Redis and ClickHouse
3. Initialize database schema
4. Start Envoy Gateway
5. Start MCP API
6. Start Heartbeat worker
7. Verify all health checks
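To make that ordering concrete, here is a rough sketch of the seven steps as shell commands. It is not the contents of `scripts/deploy.sh`. The service names `redis` and `heartbeat` are assumptions, while `clickhouse`, `envoy`, and `mcp-api` are the names used elsewhere in this guide. Setting `DRY_RUN=1` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the step order listed above, not the real script.

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy_in_order: stop at the first step that fails.
deploy_in_order() {
  run docker-compose build &&
  run docker-compose up -d redis clickhouse &&
  run sh -c 'docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql' &&
  run docker-compose up -d envoy &&
  run docker-compose up -d mcp-api &&
  run docker-compose up -d heartbeat &&
  run curl -fsS http://localhost:8000/health
}
```

`DRY_RUN=1 deploy_in_order` is a safe way to preview the sequence before running the real script.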
### 5. Deploy with Observability (Optional)
```bash
# Deploy with Prometheus, Grafana, OpenTelemetry
docker-compose --profile observability up -d
```
Access:
- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3000 (admin/admin)
- **OpenTelemetry**: http://localhost:8888
## AI Envoy Configuration
### 1. Verify Envoy is Running
```bash
# Check Envoy health (admin interface, same port as the stats endpoints)
curl http://localhost:9901/ready
# View Envoy config
curl http://localhost:9901/config_dump | jq .
# Check clusters
curl http://localhost:9901/clusters
```
### 2. Configure Rate Limits
Edit `envoy.yaml`:
```yaml
rate_limits:
  - actions:
      - request_headers:
          header_name: "x-provider"
          descriptor_key: "provider"
      - request_headers:
          header_name: "x-user-id"
          descriptor_key: "user"
```
Restart Envoy to apply the change:
```bash
docker-compose restart envoy
```
### 3. Configure Circuit Breakers
In `envoy.yaml`:
```yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 100
      max_pending_requests: 100
      max_requests: 100
      max_retries: 3
      max_connection_pools: 5
```
### 4. View Envoy Metrics
```bash
# Prometheus-formatted metrics
curl http://localhost:9901/stats/prometheus
# Human-readable stats
curl http://localhost:9901/stats
# Specific cluster stats (filter the full cluster listing)
curl -s http://localhost:9901/clusters | grep mcp_api_cluster
```
## Verification
### 1. Health Checks
```bash
# MCP API health
curl http://localhost:8000/health
# Expected output:
# {
#   "status": "healthy",
#   "components": {
#     "skypilot": true,
#     "envoy": true,
#     "clickhouse": true,
#     "registry": true
#   }
# }
# Envoy health (admin interface)
curl http://localhost:9901/ready
# ClickHouse health
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
```
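The health response can also be checked from a script, for example in CI. The sketch below assumes the JSON shape shown above and uses only `grep`, so it does not need `jq`. The function name `health_ok` is hypothetical.

```bash
#!/usr/bin/env bash
# health_ok: read a /health JSON body on stdin; succeed only if the status is
# "healthy" and no component is reported as false.
health_ok() {
  local body
  body="$(cat)"
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' || return 1
  # Any literal false value means a component check failed.
  if printf '%s' "$body" | grep -Eq ':[[:space:]]*false'; then
    return 1
  fi
  return 0
}
```

Usage: `curl -fsS http://localhost:8000/health | health_ok && echo "all components healthy"`.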
### 2. Verify SkyPilot Integration
```bash
# Check SkyPilot from container
docker-compose exec mcp-api sky check
# Should show configured clouds
docker-compose exec mcp-api sky status
```
### 3. Verify GPU Rules Loaded
```bash
curl http://localhost:8000/api/v1/gpu-rules | jq .
# Should show default rules:
# - RTX 3060
# - RTX 4090
# - RTX 6000 Ada
```
### 4. View Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f mcp-api
# Filter for SkyPilot
docker-compose logs mcp-api | grep -i skypilot
# Filter for Envoy
docker-compose logs envoy | grep -i rate
```
### 5. Test API
```bash
# Root endpoint
curl http://localhost:8000/
# API documentation (macOS; use xdg-open on Linux)
open http://localhost:8000/docs
# Metrics
curl http://localhost:8000/metrics
```
## First Deployment
### Simple Test Deployment
```bash
# Deploy to RunPod with defaults (ON_DEMAND + RTX 3060)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "name": "test-deployment",
      "run": "nvidia-smi && sleep 300"
    }
  }'
```
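For repeated test deployments, the request above can be wrapped in a small helper. This is a sketch: the function names are hypothetical, only the `name` and `run` fields from the example are used, and no JSON escaping is performed.

```bash
#!/usr/bin/env bash
# deploy_payload NAME RUN: print the JSON body used in the request above.
# Assumes NAME and RUN contain no double quotes, backslashes, or newlines.
deploy_payload() {
  printf '{"spec": {"name": "%s", "run": "%s"}}\n' "$1" "$2"
}

# deploy_to PROVIDER NAME RUN: POST the payload to the provider's deploy endpoint.
# Only the "runpod" provider path appears in this guide; other values are assumptions.
deploy_to() {
  deploy_payload "$2" "$3" |
    curl -fsS -X POST "http://localhost:8000/api/v1/providers/$1/deploy" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

Usage: `deploy_to runpod test-deployment "nvidia-smi && sleep 300"`.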
### Monitor Deployment
```bash
# Watch logs
docker-compose logs -f mcp-api
# Check ClickHouse for deployment record
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM mcp.deployments ORDER BY created_at DESC LIMIT 1 FORMAT Vertical"
# Check SkyPilot status
docker-compose exec mcp-api sky status
```
### Verify via Envoy
```bash
# Check if request went through Envoy
curl http://localhost:9901/stats | grep mcp_api_cluster
# Should show:
# cluster.mcp_api_cluster.upstream_rq_total: 1
# cluster.mcp_api_cluster.upstream_rq_200: 1
```
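On a gateway that has already served traffic, the counters will not be exactly 1, so it is more reliable to read a counter before and after the request and compare. A minimal sketch, assuming the `name: value` text format shown above; `envoy_stat` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# envoy_stat NAME: read Envoy /stats text on stdin and print the value of NAME.
# Prints nothing and returns non-zero when the stat is absent.
envoy_stat() {
  awk -v name="$1" -F': ' '$1 == name { print $2; found = 1 } END { exit !found }'
}
```

Usage: `curl -s http://localhost:9901/stats | envoy_stat cluster.mcp_api_cluster.upstream_rq_total`. Run it once before and once after the deployment request. The value should increase if the request passed through Envoy.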
### Cleanup Test Deployment
```bash
# List resources
curl http://localhost:8000/api/v1/providers/runpod/list
# Terminate
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/test-deployment
```
## Troubleshooting
### SkyPilot Issues
**Problem**: `sky check` shows clouds as unavailable
```bash
# Check credentials
sky check -v
# For AWS
aws sts get-caller-identity
# For GCP
gcloud auth application-default print-access-token
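# For RunPod/Vast.ai: confirm the keys are set without printing them
[ -n "$RUNPOD_API_KEY" ] && echo "RUNPOD_API_KEY is set" || echo "RUNPOD_API_KEY is missing"
[ -n "$VASTAI_API_KEY" ] && echo "VASTAI_API_KEY is set" || echo "VASTAI_API_KEY is missing"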
```
**Problem**: Deployment fails with "No resources available"
```bash
# Check available GPUs
sky show-gpus --cloud runpod
sky show-gpus --cloud vastai
# Try different GPU
curl -X POST ... -d '{"spec": {"gpu_name": "RTX 4090"}}'
```
### Envoy Issues
**Problem**: Requests not routing through Envoy
```bash
# Check Envoy listeners (admin interface)
curl http://localhost:9901/listeners
# Check if MCP API cluster is up
curl -s http://localhost:9901/clusters | grep mcp_api_cluster
# Restart Envoy
docker-compose restart envoy
```
**Problem**: Rate limit errors
```bash
# Check rate limit stats
curl http://localhost:9901/stats | grep ratelimit
# Adjust in envoy.yaml or disable temporarily
```
### ClickHouse Issues
**Problem**: Cannot connect to ClickHouse
```bash
# Check if running
docker-compose ps clickhouse
# Check logs
docker-compose logs clickhouse
# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
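# Reinitialize if needed
# WARNING: "down -v" deletes all Compose volumes, including stored ClickHouse data
docker-compose down -v
docker-compose up -d clickhouse
# Wait 30 seconds for ClickHouse to start
sleep 30
./scripts/deploy.sh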
```
### Deployment Failures
```bash
# Check deployment errors in ClickHouse
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM mcp.events WHERE event_type = 'deployment_error' ORDER BY timestamp DESC LIMIT 5 FORMAT Vertical"
# Check SkyPilot logs
docker-compose exec mcp-api cat /app/logs/skypilot/sky.log
# Check agent logs
docker-compose logs mcp-api | grep -i "runpod\|vastai"
```
## Next Steps
1. **Read the README**: Full feature documentation
2. **API Documentation**: http://localhost:8000/docs
3. **Set up Monitoring**: Deploy with `--profile observability`
4. **Production Hardening**: See README "Production Deployment" section
5. **Custom Rules**: Add your GPU rules via API
## Support
- **Documentation**: See README.md and QUICK_START.md
- **Issues**: https://github.com/your-org/mcp-orchestrator/issues
- **Slack**: [Your Slack workspace]
## Security Checklist
Before going to production:
- [ ] Change default ClickHouse password
- [ ] Enable Envoy TLS
- [ ] Rotate all API keys
- [ ] Set up secrets management (Vault)
- [ ] Enable ClickHouse authentication
- [ ] Configure firewall rules
- [ ] Set up monitoring and alerting
- [ ] Review Envoy rate limits
- [ ] Enable audit logging
- [ ] Backup ClickHouse data
---
**You're ready to orchestrate GPU workloads across clouds! 🚀**