MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
Highlights
Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
SkyPilot Agent builds dynamic YAML and executes the sky CLI.
OnPrem Agent runs on-prem deployments (placeholder using kubectl/helm).
Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.
Phoenix observability hooks and Envoy integration for telemetry events.
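The agent wiring can be pictured as a short pipeline: the Evaluation Agent consumes a task, emits a resource plan, and an execution agent (SkyPilot or OnPrem) acts on it. A minimal sketch using in-process queues as stand-ins for Redis; the agent functions, queue names, and plan fields here are illustrative, not the repo's actual API:

```python
import json
import queue

# Stand-ins for the Redis queues the orchestrator wires agents through.
task_queue: "queue.Queue[str]" = queue.Queue()
plan_queue: "queue.Queue[str]" = queue.Queue()

def evaluation_agent(raw_task: str) -> str:
    """Turn a task into a resource plan (rules-based stand-in for the LLM step)."""
    task = json.loads(raw_task)
    gpu = "A100" if task.get("task_type") == "training" else "RTX 3060"
    return json.dumps({"name": task["name"], "accelerators": gpu})

def skypilot_agent(raw_plan: str) -> str:
    """Pretend to render SkyPilot YAML and launch it; returns a fake resource id."""
    plan = json.loads(raw_plan)
    return f"resource-{plan['name']}"

# One orchestrator iteration: task -> plan -> execution.
task_queue.put(json.dumps({"name": "demo", "task_type": "inference"}))
plan_queue.put(evaluation_agent(task_queue.get()))
resource_id = skypilot_agent(plan_queue.get())
print(resource_id)  # resource-demo
```

In the real orchestrator each hop is a Redis queue, so agents can run as separate processes and telemetry for each hop lands in ClickHouse.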
Additional files
scripts/resource_heartbeat.py: an example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat data to the orchestrator.
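A heartbeat post would typically carry a resource id, a timestamp, and utilization figures. A minimal sketch of assembling such a payload; the field names are assumptions rather than the script's actual schema, and the HTTP POST itself is left out:

```python
import json
import time

def build_heartbeat(resource_id: str, gpu_util: float, mem_util: float) -> str:
    """Assemble a heartbeat payload; the real script would POST this to the orchestrator."""
    payload = {
        "resource_id": resource_id,
        "timestamp": int(time.time()),
        "status": "running",
        "gpu_utilization": gpu_util,
        "memory_utilization": mem_util,
    }
    return json.dumps(payload)

hb = json.loads(build_heartbeat("pod_abc123", 0.87, 0.42))
```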
Quick start (local dry-run)
1. Install Python packages: pip install -r requirements.txt
2. Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.
3. Start the MCP server: python -m src.mcp.main
4. Push a demo task into Redis (see scripts/run_demo.sh).
5. Verify telemetry is forwarded to the Phoenix and Envoy endpoints (configurable in .env).
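The contents of the demo task are defined in scripts/run_demo.sh; as an illustration of the shape such a task might take, here is a sketch of building one. The queue name and field names are assumptions, and the actual Redis push (an LPUSH via redis-py) is shown only as a comment:

```python
import json

# Illustrative only: the real queue name and task schema live in the repo's config.
TASK_QUEUE = "mcp:tasks"

def make_demo_task() -> str:
    """Build a task entry like the one scripts/run_demo.sh might push."""
    return json.dumps({
        "task_type": "inference",
        "spec": {"name": "demo-llm", "image": "vllm/vllm-openai:latest"},
    })

# With redis-py this would be: redis.Redis().lpush(TASK_QUEUE, make_demo_task())
task = json.loads(make_demo_task())
```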
Notes & caveats
This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.
MCP Orchestrator - Quick Reference
Installation (5 minutes)
# 1. Configure environment
cp .env.example .env
nano .env # Add your API keys
# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh
# 3. Verify
curl http://localhost:8000/health
Common API Calls
Deploy with Auto GPU Selection
# Inference workload (will select cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "inference",
"spec": {
"name": "llm-server",
"image": "vllm/vllm-openai:latest",
"command": "python -m vllm.entrypoints.api_server"
}
}'
# Training workload (will select powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "training",
"spec": {
"name": "fine-tune-job",
"image": "pytorch/pytorch:latest"
}
  }'
Deploy with Specific GPU
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"spec": {
"name": "custom-pod",
"gpu_name": "RTX 4090",
"resources": {
"accelerators": "RTX 4090:2"
}
}
  }'
Deploy to Provider (Default: ON_DEMAND + RTX 3060)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
  -d '{"spec": {"name": "simple-pod"}}'
Register Existing Infrastructure
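As the examples below show, the register endpoint accepts either a single resource_id or a resource_ids list. A small sketch of a helper that builds either payload shape; the field names are taken from the examples, the helper itself is hypothetical:

```python
import json

def registration_payload(provider, ids, api_key):
    """Build a /api/v1/register body; one id uses resource_id, several use resource_ids."""
    body = {"provider": provider, "credentials": {"api_key": api_key}}
    if len(ids) == 1:
        body["resource_id"] = ids[0]
    else:
        body["resource_ids"] = ids
    return json.dumps(body)
```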
# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_id": "12345",
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_ids": ["12345", "67890"],
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'
List Resources
# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list
# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list
Terminate Resource
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123
GPU Rules Management
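Rules like the ones managed below are what drive the auto GPU selection used in the deploy examples: each rule names a GPU family, its use cases, and a priority (lower wins). A minimal matching sketch; the field names follow the Add Rule payload below, but the selection logic itself is an assumption:

```python
from typing import Optional

def pick_gpu(task_type: str, rules: list) -> Optional[str]:
    """Return the gpu_family of the best-priority rule whose use cases mention task_type."""
    matches = [
        r for r in rules
        if task_type in r.get("optimal_use_case", "")
        or task_type in r.get("min_use_case", "")
    ]
    if not matches:
        return None
    # Lower priority number = preferred rule.
    return min(matches, key=lambda r: r["priority"])["gpu_family"]

rules = [
    {"gpu_family": "H100", "min_use_case": "large-scale training",
     "optimal_use_case": "foundation models", "priority": 0},
    {"gpu_family": "RTX 3060", "min_use_case": "inference",
     "optimal_use_case": "small inference", "priority": 10},
]
print(pick_gpu("inference", rules))  # RTX 3060
```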
View Rules
curl http://localhost:8000/api/v1/gpu-rules
Add Rule
curl -X POST http://localhost:8000/api/v1/gpu-rules \
-H "Content-Type: application/json" \
-d '{
"gpu_family": "H100",
"type": "Enterprise",
"min_use_case": "large-scale training",
"optimal_use_case": "foundation models",
"power_rating": "700W",
"typical_cloud_instance": "RunPod",
"priority": 0
  }'
Delete Rule
curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060
Monitoring
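The queries below read the telemetry tables (resources, deployments, heartbeats, events). As a sanity check on what the cost-analysis query computes, the same aggregation can be reproduced in plain Python; the sample rows here are made up, with only the column names taken from the query:

```python
from collections import defaultdict

# Sample rows shaped like the resources table used by the cost-analysis query.
resources = [
    {"provider": "runpod", "status": "running", "price_hour": 0.44},
    {"provider": "runpod", "status": "running", "price_hour": 0.79},
    {"provider": "vastai", "status": "running", "price_hour": 0.31},
    {"provider": "vastai", "status": "stopped", "price_hour": 0.31},
]

# WHERE status = 'running' ... GROUP BY provider
costs = defaultdict(list)
for row in resources:
    if row["status"] == "running":
        costs[row["provider"]].append(row["price_hour"])

# sum(price_hour) and avg(price_hour) per provider
summary = {
    p: {"total_hourly_cost": round(sum(v), 2), "avg_cost": round(sum(v) / len(v), 3)}
    for p, v in costs.items()
}
print(summary)
```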
ClickHouse Queries
-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;
-- Recent deployments
SELECT *
FROM deployments
ORDER BY created_at DESC
LIMIT 10;
-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;
-- Cost analysis
SELECT
provider,
sum(price_hour) as total_hourly_cost,
avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;
-- Event volume
SELECT
event_type,
count() as count,
toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
View Logs
# All services
docker-compose logs -f
# API only
docker-compose logs -f mcp-api
# Heartbeat monitor
docker-compose logs -f heartbeat-worker
# ClickHouse
docker-compose logs -f clickhouse
Maintenance
Restart Services
# Restart all
docker-compose restart
# Restart API only
docker-compose restart mcp-api
# Reload with new code
docker-compose up -d --build
Backup ClickHouse
# Backup database
docker-compose exec clickhouse clickhouse-client --query \
"BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"
# Export table
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv
Clean Up
# Stop all services
docker-compose down
# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v
# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
  "OPTIMIZE TABLE events FINAL"
Troubleshooting
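Several of the checks below boil down to "hit an endpoint and see if it answers". After a restart, a bounded retry loop is more reliable than a single check. A small sketch of such a poller, written with the check injected as a callable so it can be exercised offline; in practice the check would GET http://localhost:8000/health and test for a 200:

```python
import time

def wait_healthy(check, attempts=5, delay=0.05):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

# Fake check that succeeds on the third call, standing in for the /health request.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_healthy(fake_check))  # True
```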
Service won't start
# Check status
docker-compose ps
# Check logs
docker-compose logs mcp-api
# Verify config
cat .env | grep -v '^#' | grep -v '^$'
ClickHouse connection issues
# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
# Reinitialize
docker-compose exec clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql
# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
API returns 404 for provider
# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"
# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api
Heartbeat not working
# Check heartbeat worker
docker-compose logs heartbeat-worker
# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list
Environment Variables
Key variables in .env:
# Required
RUNPOD_API_KEY=xxx # Your RunPod API key
VASTAI_API_KEY=xxx # Your Vast.ai API key (used per-request only)
# ClickHouse
CLICKHOUSE_PASSWORD=xxx # Set strong password
# Optional
LOG_LEVEL=INFO # DEBUG for verbose logs
WORKERS=4 # API worker processes
HEARTBEAT_INTERVAL=60 # Seconds between health checks
Security Checklist
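Missing credentials are easier to catch at startup than at the first failed request, which complements the checklist below. A minimal sketch that validates the required variables listed above; the variable names come from the .env section, the check itself is an assumption about how the service could fail fast:

```python
import os

# Required keys from the .env section above.
REQUIRED = ["RUNPOD_API_KEY", "VASTAI_API_KEY", "CLICKHOUSE_PASSWORD"]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example against a stand-in environment dict:
missing = missing_vars({"RUNPOD_API_KEY": "xxx"})
print(missing)  # ['VASTAI_API_KEY', 'CLICKHOUSE_PASSWORD']
```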
Change default ClickHouse password
Store .env securely (add it to .gitignore)
Use separate API keys for prod/staging
Enable ClickHouse authentication
Configure AI Envoy Gateway policies
Rotate API keys regularly
Review ClickHouse access logs
Set up alerting for unhealthy resources
Resources
API Docs: http://localhost:8000/docs
ClickHouse UI: http://localhost:8124 (with --profile debug)
Health Check: http://localhost:8000/health
Full README: See README.md