Stores and queries resource lifecycle telemetry, deployment events, heartbeats, and cost analysis data for multi-cloud GPU infrastructure management.
Manages task queues for orchestrating evaluation, deployment, and on-premises agents in the multi-cloud platform workflow.
MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
Highlights
Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
SkyPilot Agent builds dynamic YAML and executes the sky CLI.
OnPrem Agent runs on-prem deployments (placeholder using kubectl/helm).
Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.
Phoenix observability hooks and Envoy integration for telemetry events.
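The evaluation-and-routing step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the rule table, the `ResourcePlan` fields, the task keys (`id`, `gpu_memory_gb`, `on_prem`) and the queue names are all assumptions made for the example, and the prompt/LLM half of the Evaluation Agent is left out.

```python
import json
from dataclasses import dataclass, asdict
from typing import Tuple

# Hypothetical rule table: (GPU memory in GB, GPU name), sorted ascending.
# Illustrative values only, not the repository's actual GPU rules.
GPU_RULES = [
    (12, "RTX3060"),
    (24, "A10G"),
    (80, "A100-80GB"),
]


@dataclass
class ResourcePlan:
    task_id: str
    gpu: str
    target: str  # "skypilot" or "onprem"; hypothetical routing labels


def evaluate(task: dict) -> ResourcePlan:
    """Pick the smallest GPU whose memory covers the task's requirement."""
    needed = task.get("gpu_memory_gb", 0)
    for capacity, gpu in GPU_RULES:
        if needed <= capacity:
            break
    else:
        # No rule matched: the loop finished without hitting `break`.
        raise ValueError(f"no GPU rule covers {needed} GB")
    target = "onprem" if task.get("on_prem") else "skypilot"
    return ResourcePlan(task_id=task["id"], gpu=gpu, target=target)


def route(plan: ResourcePlan) -> Tuple[str, str]:
    """Return the (queue name, JSON payload) an orchestrator might enqueue."""
    return f"queue:{plan.target}", json.dumps(asdict(plan))
```

For example, `route(evaluate({"id": "t1", "gpu_memory_gb": 20}))` selects the 24 GB entry and targets the hypothetical `queue:skypilot` queue. Keeping the rules in a plain sorted table makes the "smallest GPU that fits" behaviour easy to audit.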
Additional files
scripts/resource_heartbeat.py: example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat data to the orchestrator.
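A heartbeat loop of this kind might look like the sketch below. It is not the contents of scripts/resource_heartbeat.py: the payload field names, the default interval, and the `read_gpu_util` and `send` callables are assumptions, injected so the loop can run without a GPU or a live orchestrator.

```python
import json
import time
from typing import Callable, Optional


def build_heartbeat(resource_id: str, gpu_util_percent: float,
                    now: Optional[float] = None) -> dict:
    """Build one heartbeat payload (field names are hypothetical)."""
    if not 0.0 <= gpu_util_percent <= 100.0:
        raise ValueError("gpu_util_percent must be within 0..100")
    return {
        "resource_id": resource_id,
        "gpu_util_percent": gpu_util_percent,
        "timestamp": time.time() if now is None else now,
    }


def run_heartbeat(resource_id: str,
                  read_gpu_util: Callable[[], float],
                  send: Callable[[str], None],
                  interval_s: float = 30.0,
                  max_beats: Optional[int] = None,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Send heartbeats until max_beats is reached (forever if None)."""
    sent = 0
    while max_beats is None or sent < max_beats:
        send(json.dumps(build_heartbeat(resource_id, read_gpu_util())))
        sent += 1
        # Skip the final sleep so a bounded run returns promptly.
        if max_beats is None or sent < max_beats:
            sleep(interval_s)
    return sent
```

In a real deployment `send` would wrap an HTTP POST to the orchestrator's heartbeat endpoint and `read_gpu_util` would query the GPU driver; both are left abstract here because the source does not specify them.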
Quick start (local dry-run)
Install Python packages: pip install -r requirements.txt
Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.
Start the MCP server: python -m src.mcp.main
Push a demo task into Redis (see scripts/run_demo.sh).
Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in .env).
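For readers who want to see the shape of the "push a demo task" step without running the shell script, here is a rough sketch. The queue name `mcp:tasks` and the task fields are hypothetical and may differ from what scripts/run_demo.sh actually sends. `InMemoryQueue` is a stand-in that mimics the `rpush`/`lpop` methods of a Redis client so the example runs without a server.

```python
import json
import uuid
from collections import deque


class InMemoryQueue:
    """Stand-in for a Redis list client, so the sketch needs no server."""

    def __init__(self):
        self._lists = {}

    def rpush(self, name, value):
        # Append to the tail of the named list and return its new length.
        self._lists.setdefault(name, deque()).append(value)
        return len(self._lists[name])

    def lpop(self, name):
        # Pop from the head of the named list, or return None if empty.
        items = self._lists.get(name)
        return items.popleft() if items else None


def push_demo_task(queue, queue_name="mcp:tasks"):
    """Serialize a demo task as JSON and enqueue it (fields are assumed)."""
    task = {"id": str(uuid.uuid4()), "type": "deploy", "gpu_memory_gb": 12}
    queue.rpush(queue_name, json.dumps(task))
    return task


if __name__ == "__main__":
    q = InMemoryQueue()
    task = push_demo_task(q)
    print("enqueued", task["id"])
```

Against a real Redis instance, a redis-py client could be passed in place of `InMemoryQueue`, since it exposes methods with the same names.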
Notes & caveats
This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.
MCP Orchestrator - Quick Reference
🚀 Installation (5 minutes)
📡 Common API Calls
Deploy with Auto GPU Selection
Deploy with Specific GPU
Deploy to Provider (Default: ON_DEMAND + RTX 3060)
Register Existing Infrastructure
List Resources
Terminate Resource
🎯 GPU Rules Management
View Rules
Add Rule
Delete Rule
🔍 Monitoring
ClickHouse Queries
View Logs
🛠️ Maintenance
Restart Services
Backup ClickHouse
Clean Up
🐛 Troubleshooting
Service won't start
ClickHouse connection issues
API returns 404 for provider
Heartbeat not working
📝 Environment Variables
Key variables in .env:
🔐 Security Checklist
Change default ClickHouse password
Store .env securely (add to .gitignore)
Use separate API keys for prod/staging
Enable ClickHouse authentication
Configure AI Envoy Gateway policies
Rotate API keys regularly
Review ClickHouse access logs
Set up alerting for unhealthy resources
📚 Resources
API Docs: http://localhost:8000/docs
ClickHouse UI: http://localhost:8124 (with --profile debug)
Health Check: http://localhost:8000/health
Full README: See README.md