Multi-Cloud Infrastructure MCP Server


# MCP Orchestrator - Complete Deployment Guide

This guide covers end-to-end deployment of the MCP Orchestrator with SkyPilot and AI Envoy Gateway.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [SkyPilot Setup](#skypilot-setup)
3. [Cloud Provider Configuration](#cloud-provider-configuration)
4. [MCP Deployment](#mcp-deployment)
5. [AI Envoy Configuration](#ai-envoy-configuration)
6. [Verification](#verification)
7. [First Deployment](#first-deployment)

## Prerequisites

### System Requirements

- **OS**: Linux (Ubuntu 20.04+) or macOS
- **Docker**: 20.10+
- **Docker Compose**: 2.0+
- **RAM**: 8GB minimum, 16GB recommended
- **Disk**: 50GB free space

### Install Docker

```bash
# Ubuntu/Debian
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker --version
docker-compose --version
```

## SkyPilot Setup

### 1. Install SkyPilot Locally (Optional, for Testing)

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install SkyPilot with cloud providers
pip install "skypilot[aws,gcp,azure]"

# Verify installation
sky check
```

Output should show:

```
✓ AWS
✓ GCP
✓ Azure
✓ RunPod
✓ Vast.ai
```

### 2. Configure Cloud Credentials

#### AWS

```bash
# Install AWS CLI
pip install awscli

# Configure credentials
aws configure

# Test
aws sts get-caller-identity
```

Set in `.env`:

```bash
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
```

#### GCP

```bash
# Install gcloud CLI
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Authenticate
gcloud auth application-default login

# Set project
gcloud config set project YOUR_PROJECT_ID
```

Or use service account:

```bash
# Create service account key
gcloud iam service-accounts keys create key.json \
  --iam-account=YOUR_SA@YOUR_PROJECT.iam.gserviceaccount.com

# Set in .env
GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
```

#### RunPod

```bash
# Get API key from https://runpod.io/console/user/settings
# Set in .env
RUNPOD_API_KEY=your_key_here
```

Test SkyPilot with RunPod:

```bash
sky launch --cloud runpod --gpus RTX3060:1 -- echo "Hello from RunPod"
```

#### Vast.ai

```bash
# Get API key from https://cloud.vast.ai/account/
# Set in .env
VASTAI_API_KEY=your_key_here
```

### 3. Verify SkyPilot Configuration

```bash
# Check all clouds
sky check

# List available GPUs on RunPod
sky show-gpus --cloud runpod

# List available GPUs on Vast.ai
sky show-gpus --cloud vastai

# Check quotas
sky quota
```

## Cloud Provider Configuration

### RunPod Setup

1. **Create Account**: https://runpod.io
2. **Add Credit**: Billing → Add Funds
3. **Get API Key**: Settings → API Keys → Create
4. **Test Connection**:

   ```bash
   curl -H "Authorization: Bearer YOUR_KEY" \
     https://api.runpod.io/graphql \
     -d '{"query": "query { myself { id } }"}'
   ```

### Vast.ai Setup

1. **Create Account**: https://cloud.vast.ai
2. **Add Credit**: Account → Billing
3. **Get API Key**: Account → API Keys
4. **Test Connection**:

   ```bash
   curl -H "Authorization: Bearer YOUR_KEY" \
     https://api.vast.ai/v0/instances
   ```

## MCP Deployment

### 1. Clone Repository

```bash
git clone https://github.com/your-org/mcp-orchestrator.git
cd mcp-orchestrator
```

### 2. Configure Environment

```bash
# Copy template
cp .env.example .env

# Edit configuration
nano .env
```

**Required Variables**:

```bash
# At minimum, set these:
RUNPOD_API_KEY=xxx
VASTAI_API_KEY=xxx
CLICKHOUSE_PASSWORD=strong_password_here
```

**Optional but Recommended**:

```bash
# For multi-cloud via SkyPilot
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-key.json

# For observability
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
```

### 3. Initialize ClickHouse Schema

The schema is automatically initialized on first start, but you can pre-create it:

```bash
# Start only ClickHouse
docker-compose up -d clickhouse

# Wait for it to be ready
sleep 10

# Initialize schema
docker-compose exec clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql

# Verify tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
```

Should show:

```
deployments
events
gpu_rules
heartbeats
resources
```

### 4. Deploy All Services

```bash
# Make deploy script executable
chmod +x scripts/deploy.sh

# Run deployment
./scripts/deploy.sh
```

The script will:

1. Build Docker images
2. Start Redis and ClickHouse
3. Initialize database schema
4. Start Envoy Gateway
5. Start MCP API
6. Start Heartbeat worker
7. Verify all health checks

### 5. Deploy with Observability (Optional)

```bash
# Deploy with Prometheus, Grafana, OpenTelemetry
docker-compose --profile observability up -d
```

Access:

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3000 (admin/admin)
- **OpenTelemetry**: http://localhost:8888

## AI Envoy Configuration

### 1. Verify Envoy is Running

```bash
# Check Envoy health
curl http://localhost:10000/ready

# View Envoy config
curl http://localhost:10000/config_dump | jq .

# Check clusters
curl http://localhost:10000/clusters
```

### 2. Configure Rate Limits

Edit `envoy.yaml`:

```yaml
rate_limits:
  - actions:
      - request_headers:
          header_name: "x-provider"
          descriptor_key: "provider"
      - request_headers:
          header_name: "x-user-id"
          descriptor_key: "user"
```

Reload:

```bash
docker-compose restart envoy
```

### 3. Configure Circuit Breakers

In `envoy.yaml`:

```yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 100
      max_pending_requests: 100
      max_requests: 100
      max_retries: 3
      max_connection_pools: 5
```

### 4. View Envoy Metrics

```bash
# Prometheus-formatted metrics
curl http://localhost:9901/stats/prometheus

# Human-readable stats
curl http://localhost:9901/stats

# Specific cluster stats
curl http://localhost:9901/clusters/mcp_api_cluster
```

## Verification

### 1. Health Checks

```bash
# MCP API health
curl http://localhost:8000/health

# Expected output:
{
  "status": "healthy",
  "components": {
    "skypilot": true,
    "envoy": true,
    "clickhouse": true,
    "registry": true
  }
}

# Envoy health
curl http://localhost:10000/ready

# ClickHouse health
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
```

### 2. Verify SkyPilot Integration

```bash
# Check SkyPilot from container
docker-compose exec mcp-api sky check

# Should show configured clouds
docker-compose exec mcp-api sky status
```

### 3. Verify GPU Rules Loaded

```bash
curl http://localhost:8000/api/v1/gpu-rules | jq .

# Should show default rules:
# - RTX 3060
# - RTX 4090
# - RTX 6000 Ada
```

### 4. View Logs

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f mcp-api

# Filter for SkyPilot
docker-compose logs mcp-api | grep -i skypilot

# Filter for Envoy
docker-compose logs envoy | grep -i rate
```

### 5. Test API

```bash
# Root endpoint
curl http://localhost:8000/

# API documentation
open http://localhost:8000/docs

# Metrics
curl http://localhost:8000/metrics
```

## First Deployment

### Simple Test Deployment

```bash
# Deploy to RunPod with defaults (ON_DEMAND + RTX 3060)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "name": "test-deployment",
      "run": "nvidia-smi && sleep 300"
    }
  }'
```

### Monitor Deployment

```bash
# Watch logs
docker-compose logs -f mcp-api

# Check ClickHouse for deployment record
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM deployments ORDER BY created_at DESC LIMIT 1 FORMAT Vertical"

# Check SkyPilot status
docker-compose exec mcp-api sky status
```

### Verify via Envoy

```bash
# Check if request went through Envoy
curl http://localhost:9901/stats | grep mcp_api_cluster

# Should show:
cluster.mcp_api_cluster.upstream_rq_total: 1
cluster.mcp_api_cluster.upstream_rq_200: 1
```

### Cleanup Test Deployment

```bash
# List resources
curl http://localhost:8000/api/v1/providers/runpod/list

# Terminate
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/test-deployment
```

## Troubleshooting

### SkyPilot Issues

**Problem**: `sky check` shows clouds as unavailable

```bash
# Check credentials
sky check -v

# For AWS
aws sts get-caller-identity

# For GCP
gcloud auth application-default print-access-token

# For RunPod/Vast.ai
echo $RUNPOD_API_KEY
echo $VASTAI_API_KEY
```

**Problem**: Deployment fails with "No resources available"

```bash
# Check available GPUs
sky show-gpus --cloud runpod
sky show-gpus --cloud vastai

# Try different GPU
curl -X POST ... -d '{"spec": {"gpu_name": "RTX 4090"}}'
```

### Envoy Issues

**Problem**: Requests not routing through Envoy

```bash
# Check Envoy listeners
curl http://localhost:10000/listeners

# Check if MCP API cluster is up
curl http://localhost:10000/clusters | grep mcp_api_cluster

# Restart Envoy
docker-compose restart envoy
```

**Problem**: Rate limit errors

```bash
# Check rate limit stats
curl http://localhost:9901/stats | grep ratelimit

# Adjust in envoy.yaml or disable temporarily
```

### ClickHouse Issues

**Problem**: Cannot connect to ClickHouse

```bash
# Check if running
docker-compose ps clickhouse

# Check logs
docker-compose logs clickhouse

# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"

# Reinitialize if needed
docker-compose down -v
docker-compose up -d clickhouse
# Wait 30 seconds
./scripts/deploy.sh
```

### Deployment Failures

```bash
# Check deployment errors in ClickHouse
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM events WHERE event_type = 'deployment_error' ORDER BY timestamp DESC LIMIT 5 FORMAT Vertical"

# Check SkyPilot logs
docker-compose exec mcp-api cat /app/logs/skypilot/sky.log

# Check agent logs
docker-compose logs mcp-api | grep -i "runpod\|vastai"
```

## Next Steps

1. **Read the README**: Full feature documentation
2. **API Documentation**: http://localhost:8000/docs
3. **Set up Monitoring**: Deploy with `--profile observability`
4. **Production Hardening**: See README "Production Deployment" section
5. **Custom Rules**: Add your GPU rules via API

## Support

- **Documentation**: See README.md and QUICK_START.md
- **Issues**: https://github.com/your-org/mcp-orchestrator/issues
- **Slack**: [Your Slack workspace]

## Security Checklist

Before going to production:

- [ ] Change default ClickHouse password
- [ ] Enable Envoy TLS
- [ ] Rotate all API keys
- [ ] Set up secrets management (Vault)
- [ ] Enable ClickHouse authentication
- [ ] Configure firewall rules
- [ ] Set up monitoring and alerting
- [ ] Review Envoy rate limits
- [ ] Enable audit logging
- [ ] Backup ClickHouse data

---

**You're ready to orchestrate GPU workloads across clouds! 🚀**
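
The required variables listed under "Configure Environment" can be checked before running `./scripts/deploy.sh`. The following is a minimal sketch, not part of the repository: `check_env_file` is a hypothetical helper name, and it assumes `.env` uses plain `KEY=value` lines without `export` prefixes or surrounding whitespace.

```bash
# check_env_file: report required keys that are missing or empty in an env file.
# Hypothetical helper for illustration; not shipped with mcp-orchestrator.
# Usage: check_env_file <path> <KEY>...
check_env_file() {
  env_file=$1
  shift
  missing=0
  for key in "$@"; do
    # A key counts as set only if a line starts with KEY= followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing or empty: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Example, using the required variables named in this guide:
# check_env_file .env RUNPOD_API_KEY VASTAI_API_KEY CLICKHOUSE_PASSWORD || exit 1
```

The check is deliberately simple: it does not validate the values themselves, so a placeholder such as `xxx` still passes.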

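The fixed `sleep 10` in "Initialize ClickHouse Schema" and the "Wait 30 seconds" step in Troubleshooting can be replaced by polling until the service answers. This is a sketch under assumptions: `wait_for` is a hypothetical helper, and the attempt count and delay are arbitrary values to tune for your machine.

```bash
# wait_for: retry a command until it succeeds or the attempts run out.
# Hypothetical helper for illustration; not shipped with mcp-orchestrator.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Succeed as soon as the probed command exits 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Examples, reusing the probes from this guide (-T disables TTY allocation):
# wait_for 30 2 docker-compose exec -T clickhouse clickhouse-client --query "SELECT 1"
# wait_for 30 2 curl -fsS http://localhost:8000/health
```

Polling returns as soon as the probe passes and fails with a non-zero exit code if the service never comes up, which a fixed sleep cannot signal.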