
Multi-Cloud Infrastructure MCP Server

by nomadslayer

MCP (Multi-Cloud Platform) Server

This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks that report resource lifecycle events and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
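
For orientation, here is a minimal sketch of what posting a lifecycle event to such an ingestion endpoint could look like. The endpoint URL, environment variable name, and payload fields are illustrative assumptions, not the repository's actual schema.

# Hypothetical example: POST a resource lifecycle event to a generic
# HTTP ingestion endpoint. URL, env var, and payload fields are assumed.
import json
import os
import time
import urllib.request

ENVOY_URL = os.environ.get("ENVOY_INGEST_URL", "http://localhost:9000/ingest")

event = {
    "event_type": "resource.provisioned",  # assumed event name
    "resource_id": "pod_abc123",
    "provider": "runpod",
    "timestamp": time.time(),
}

req = urllib.request.Request(
    ENVOY_URL,
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.status)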

Highlights

  • Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.

  • SkyPilot Agent builds dynamic YAML and executes the sky CLI (this pattern is sketched after the list).

  • OnPrem Agent runs on-prem deployments (currently a placeholder built on kubectl/helm).

  • Orchestrator wires agents together using Redis queues and ClickHouse telemetry.

  • Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.

  • Phoenix observability hooks and Envoy integration for telemetry events.
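
To make the SkyPilot path concrete, here is a minimal sketch of the build-dynamic-YAML-then-shell-out pattern the SkyPilot Agent uses. The shape of the plan dict and the cluster name are illustrative assumptions; the SkyPilot task keys (resources.accelerators, run) and the sky launch flags are standard.

# Hypothetical sketch: render a SkyPilot task YAML from a resource plan
# and hand it to the sky CLI. Plan fields are assumptions.
import subprocess
import tempfile
import yaml  # pip install pyyaml

plan = {"accelerators": "A100:1", "run_cmd": "python train.py"}  # assumed plan shape

task = {
    "resources": {"accelerators": plan["accelerators"]},
    "run": plan["run_cmd"],
}

with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    yaml.safe_dump(task, f)
    task_yaml = f.name

# Launch a cluster named "mcp-demo" non-interactively.
subprocess.run(["sky", "launch", "-c", "mcp-demo", task_yaml, "--yes"], check=True)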

Additional files

  • scripts/resource_heartbeat.py — example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat to the orchestrator.
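
A minimal sketch of what such a heartbeat loop might look like is below; the orchestrator endpoint path and payload fields are assumptions, so check the actual script for the real ones.

# Hypothetical heartbeat loop in the spirit of scripts/resource_heartbeat.py:
# sample GPU utilization via nvidia-smi and POST it to the orchestrator.
import json
import subprocess
import time
import urllib.request

ORCH_URL = "http://orchestrator:8000/api/v1/heartbeat"  # assumed endpoint
RESOURCE_ID = "pod_abc123"

def gpu_utilization() -> int:
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().splitlines()[0])  # first GPU only

while True:
    payload = {
        "resource_id": RESOURCE_ID,
        "gpu_util_pct": gpu_utilization(),
        "timestamp": time.time(),
    }
    req = urllib.request.Request(
        ORCH_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
    time.sleep(60)  # matches the default HEARTBEAT_INTERVAL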

Quick start (local dry-run)

  1. Install Python packages: pip install -r requirements.txt

  2. Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.

  3. Start the MCP server: python -m src.mcp.main

  4. Push a demo task into Redis (see scripts/run_demo.sh; a Python equivalent is sketched after this list)

  5. Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in .env).
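
For step 4, the following sketch shows the Python equivalent of pushing a demo task onto a Redis list for the Evaluation Agent to pick up. The queue name and task fields are assumptions; scripts/run_demo.sh has the real ones.

# Hypothetical equivalent of scripts/run_demo.sh.
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)
task = {"task_type": "inference", "spec": {"name": "demo-task"}}
r.lpush("mcp:tasks", json.dumps(task))  # "mcp:tasks" is an assumed queue name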

Notes & caveats

  • This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.

MCP Orchestrator - Quick Reference

🚀 Installation (5 minutes)

# 1. Configure environment
cp .env.example .env
nano .env   # Add your API keys

# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh

# 3. Verify
curl http://localhost:8000/health

📡 Common API Calls

Deploy with Auto GPU Selection

# Inference workload (will select cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "inference",
    "spec": {
      "name": "llm-server",
      "image": "vllm/vllm-openai:latest",
      "command": "python -m vllm.entrypoints.api_server"
    }
  }'

# Training workload (will select powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "training",
    "spec": {
      "name": "fine-tune-job",
      "image": "pytorch/pytorch:latest"
    }
  }'

Deploy with Specific GPU

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "name": "custom-pod",
      "gpu_name": "RTX 4090",
      "resources": {"accelerators": "RTX 4090:2"}
    }
  }'

Deploy to Provider (Default: ON_DEMAND + RTX 3060)

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{"spec": {"name": "simple-pod"}}'

Register Existing Infrastructure

# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_id": "12345",
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_ids": ["12345", "67890"],
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

List Resources

# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list

# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list

Terminate Resource

curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123

🎯 GPU Rules Management

View Rules

curl http://localhost:8000/api/v1/gpu-rules

Add Rule

curl -X POST http://localhost:8000/api/v1/gpu-rules \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_family": "H100",
    "type": "Enterprise",
    "min_use_case": "large-scale training",
    "optimal_use_case": "foundation models",
    "power_rating": "700W",
    "typical_cloud_instance": "RunPod",
    "priority": 0
  }'

Delete Rule

curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060

🔍 Monitoring

ClickHouse Queries

-- Active resources
SELECT provider, status, count() AS total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;

-- Recent deployments
SELECT * FROM deployments ORDER BY created_at DESC LIMIT 10;

-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;

-- Cost analysis
SELECT provider,
       sum(price_hour) AS total_hourly_cost,
       avg(price_hour) AS avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;

-- Event volume
SELECT event_type, count() AS count, toStartOfHour(timestamp) AS hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;

View Logs

# All services
docker-compose logs -f

# API only
docker-compose logs -f mcp-api

# Heartbeat monitor
docker-compose logs -f heartbeat-worker

# ClickHouse
docker-compose logs -f clickhouse

🛠️ Maintenance

Restart Services

# Restart all
docker-compose restart

# Restart API only
docker-compose restart mcp-api

# Reload with new code
docker-compose up -d --build

Backup ClickHouse

# Backup database
docker-compose exec clickhouse clickhouse-client --query \
  "BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"

# Export table
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv

Clean Up

# Stop all services
docker-compose down

# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v

# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
  "OPTIMIZE TABLE events FINAL"

🐛 Troubleshooting

Service won't start

# Check status
docker-compose ps

# Check logs
docker-compose logs mcp-api

# Verify config
grep -v '^#' .env | grep -v '^$'

ClickHouse connection issues

# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"

# Reinitialize
docker-compose exec clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql

# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"

API returns 404 for provider

# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"

# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api

Heartbeat not working

# Check heartbeat worker
docker-compose logs heartbeat-worker

# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list

📝 Environment Variables

Key variables in .env:

# Required
RUNPOD_API_KEY=xxx        # Your RunPod API key
VASTAI_API_KEY=xxx        # Your Vast.ai API key (used per-request only)

# ClickHouse
CLICKHOUSE_PASSWORD=xxx   # Set strong password

# Optional
LOG_LEVEL=INFO            # DEBUG for verbose logs
WORKERS=4                 # API worker processes
HEARTBEAT_INTERVAL=60     # Seconds between health checks

🔐 Security Checklist

  • Change default ClickHouse password

  • Store .env securely (add to .gitignore)

  • Use separate API keys for prod/staging

  • Enable ClickHouse authentication

  • Configure AI Envoy Gateway policies

  • Rotate API keys regularly

  • Review ClickHouse access logs

  • Set up alerting for unhealthy resources
