Stores and queries resource lifecycle telemetry, deployment events, heartbeats, and cost-analysis data for multi-cloud GPU infrastructure management, and manages the task queues that orchestrate evaluation, deployment, and on-premises agents in the multi-cloud platform workflow.
MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
Highlights
- Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
- SkyPilot Agent builds dynamic YAML and executes the sky CLI.
- OnPrem Agent runs on-prem deployments (placeholder using kubectl/helm).
- Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
- Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.
- Phoenix observability hooks and Envoy integration for telemetry events.
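To make the SkyPilot Agent's "dynamic YAML" step concrete, here is a minimal sketch of rendering a SkyPilot task spec from a resource plan. The plan values (`demo-eval`, `A100`, the run command) are illustrative, not the agent's actual schema:

```python
def build_sky_task_yaml(name: str, accelerator: str, count: int, run_cmd: str) -> str:
    """Render a minimal SkyPilot task spec as YAML text.

    The structure (name, resources.accelerators, run) follows SkyPilot's task
    format; the concrete values come from a hypothetical resource plan.
    """
    return (
        f"name: {name}\n"
        f"resources:\n"
        f"  accelerators: {accelerator}:{count}\n"
        f"run: |\n"
        f"  {run_cmd}\n"
    )

spec = build_sky_task_yaml("demo-eval", "A100", 1, "python train.py")
```

The agent would write this spec to a temp file and hand it to the sky CLI (e.g. `sky launch`).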
Additional files
scripts/resource_heartbeat.py - example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat data to the orchestrator.
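The payload a heartbeat POST might carry can be sketched as below. The field names are an assumption about what `scripts/resource_heartbeat.py` sends, not its actual schema:

```python
import json
import time

def build_heartbeat(resource_id: str, gpu_util: float, mem_util: float) -> str:
    """Assemble the JSON body a provisioned resource could POST to the
    orchestrator. Field names here are illustrative assumptions."""
    return json.dumps({
        "resource_id": resource_id,
        "gpu_utilization": gpu_util,       # 0.0 - 1.0
        "memory_utilization": mem_util,    # 0.0 - 1.0
        "timestamp": int(time.time()),     # Unix seconds
    })

payload = json.loads(build_heartbeat("gpu-node-1", 0.82, 0.47))
```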
Quick start (local dry-run)
1. Install Python packages: pip install -r requirements.txt
2. Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.
3. Start the MCP server: python -m src.mcp.main
4. Push a demo task into Redis (see scripts/run_demo.sh).
5. Verify telemetry is forwarded to the Phoenix and Envoy endpoints (configurable in .env).
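Step 4 can also be done from Python. The queue name and task fields below are assumptions for illustration; check `scripts/run_demo.sh` for what the demo actually enqueues:

```python
import json

TASK_QUEUE = "mcp:tasks:evaluation"  # assumed queue name, not the real one

def make_demo_task(workload: str, gpus: int) -> str:
    """Build a demo task record; the schema here is illustrative only."""
    return json.dumps({"workload": workload, "gpu_count": gpus, "provider": "auto"})

def push_demo_task(redis_client, task_json: str) -> None:
    """LPUSH the task so the Evaluation Agent can pick it up.

    Requires a live Redis, e.g. redis.Redis(host="localhost", port=6379).
    """
    redis_client.lpush(TASK_QUEUE, task_json)

task = make_demo_task("llm-eval", 1)
```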
Notes & caveats
This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.
MCP Orchestrator - Quick Reference
Installation (5 minutes)
Common API Calls
Deploy with Auto GPU Selection
Deploy with Specific GPU
Deploy to Provider (Default: ON_DEMAND + RTX 3060)
Register Existing Infrastructure
List Resources
Terminate Resource
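The shape of these deploy calls can be sketched as request bodies. The field names (`gpu`, `provider`) and the idea that omitting `gpu` triggers auto selection are assumptions for illustration; the real routes are documented at http://localhost:8000/docs:

```python
from typing import Optional

def deploy_request(gpu: Optional[str] = None, provider: Optional[str] = None) -> dict:
    """Build a deploy request body; omitting 'gpu' asks for auto GPU selection.

    Field names are hypothetical, not the server's actual API schema.
    """
    body = {"task": "inference"}
    if gpu:
        body["gpu"] = gpu            # e.g. "RTX 3060"
    if provider:
        body["provider"] = provider  # e.g. "runpod" or "vastai"
    return body

auto = deploy_request()                                  # auto GPU selection
pinned = deploy_request(gpu="RTX 3060", provider="runpod")  # pinned GPU + provider
```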
GPU Rules Management
View Rules
Add Rule
Delete Rule
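The view/add/delete operations above can be illustrated with a tiny in-memory stand-in for the rules table. The rule shape (`min_vram_gb` mapping to a GPU) is an assumption, not the server's actual rule schema:

```python
class GpuRules:
    """Minimal in-memory sketch of the rules the Evaluation Agent might consult."""

    def __init__(self):
        self._rules = {}  # rule name -> rule dict

    def add(self, name: str, min_vram_gb: int, gpu: str) -> None:
        self._rules[name] = {"min_vram_gb": min_vram_gb, "gpu": gpu}

    def delete(self, name: str) -> None:
        self._rules.pop(name, None)  # deleting a missing rule is a no-op

    def view(self) -> dict:
        return dict(self._rules)

rules = GpuRules()
rules.add("small-eval", 12, "RTX 3060")
rules.add("large-train", 80, "A100")
rules.delete("small-eval")
```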
Monitoring
ClickHouse Queries
View Logs
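As a sketch of the kind of ClickHouse queries meant here, two examples follow. The table and column names (`heartbeats`, `cost_events`, `cost_usd`) are assumptions about the telemetry schema, not the project's actual tables:

```python
# Query text only; run these via clickhouse-client or the HTTP interface.
RECENT_HEARTBEATS = """
SELECT resource_id, max(timestamp) AS last_seen
FROM heartbeats
GROUP BY resource_id
ORDER BY last_seen DESC
LIMIT 20
"""

DAILY_COST = """
SELECT toDate(timestamp) AS day, sum(cost_usd) AS spend
FROM cost_events
GROUP BY day
ORDER BY day
"""
```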
Maintenance
Restart Services
Backup ClickHouse
Clean Up
Troubleshooting
Service won't start
ClickHouse connection issues
API returns 404 for provider
Heartbeat not working
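A quick first triage step for most of the issues above is checking whether the relevant ports accept TCP connections (ports shown are the defaults used in this README; ClickHouse's native HTTP port 8123 is an assumption):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

status = {name: port_open("localhost", port)
          for name, port in [("api", 8000), ("redis", 6379), ("clickhouse", 8123)]}
```

If a port is closed, check the corresponding container/service before digging into application logs.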
Environment Variables
Key variables in .env:
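The variable list itself is not reproduced here; as an illustrative sketch only (these names are assumptions, not the project's actual keys):

```ini
# Illustrative values only - verify names against the repository's .env template
REDIS_URL=redis://localhost:6379/0
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PASSWORD=change-me
LITELLM_BASE_URL=http://localhost:4000
ENVOY_INGEST_URL=http://localhost:9000/ingest
```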
Security Checklist
Change default ClickHouse password
Store .env securely (add it to .gitignore)
Use separate API keys for prod/staging
Enable ClickHouse authentication
Configure AI Envoy Gateway policies
Rotate API keys regularly
Review ClickHouse access logs
Set up alerting for unhealthy resources
Resources
API Docs: http://localhost:8000/docs
ClickHouse UI: http://localhost:8124 (with --profile debug)
Health Check: http://localhost:8000/health
Full README: See README.md
The server is able to function both locally and remotely, depending on the configuration or use case.
Enables deployment and management of GPU workloads across multiple cloud providers (RunPod, Vast.ai) with intelligent GPU selection, resource monitoring, and telemetry tracking through Redis, ClickHouse, and SkyPilot integration.
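The "intelligent GPU selection" idea can be sketched as matching a workload's VRAM requirement against provider offers and taking the cheapest fit. The offer data and selection criterion here are purely illustrative; real catalog and pricing data would come from the provider APIs:

```python
def select_gpu(offers, min_vram_gb):
    """Pick the cheapest offer with enough VRAM.

    offers: list of (gpu_name, vram_gb, usd_per_hour) tuples (illustrative data).
    Returns the winning tuple, or None if nothing qualifies.
    """
    eligible = [o for o in offers if o[1] >= min_vram_gb]
    return min(eligible, key=lambda o: o[2]) if eligible else None

offers = [("RTX 3060", 12, 0.12), ("RTX 4090", 24, 0.40), ("A100", 80, 1.60)]
choice = select_gpu(offers, 24)  # cheapest GPU with at least 24 GB VRAM
```

A production selector would also weigh provider reliability, region, and spot vs. on-demand pricing.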