
Multi-Cloud Infrastructure MCP Server

by nomadslayer

MCP (Multi-Cloud Platform) Server

This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
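
For illustration, reporting a lifecycle event to such an ingestion endpoint could look like the sketch below; the URL and event schema are assumptions (the real endpoint is configured in .env).

import requests

# Hypothetical ingestion URL and event shape; the "AI Envoy" endpoint is a
# generic HTTP sink, so configure the real URL in .env.
ENVOY_INGEST_URL = "http://localhost:9000/ingest"

requests.post(
    ENVOY_INGEST_URL,
    json={
        "event_type": "resource_created",
        "provider": "runpod",
        "resource_id": "pod_abc123",
    },
    timeout=10,
)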

Highlights

  • Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.

  • SkyPilot Agent builds dynamic YAML and executes the sky CLI.

  • OnPrem Agent runs on-prem deployments (currently a placeholder that shells out to kubectl/helm).

  • Orchestrator wires agents together using Redis queues and ClickHouse telemetry.

  • Pluggable LLM client, configured by default to call a local LiteLLM gateway serving minimax-m1 (see the sketch after this list).

  • Phoenix observability hooks and Envoy integration for telemetry events.
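
As a rough illustration of the pluggable-LLM-client path, the sketch below posts a chat completion to a local LiteLLM gateway over its OpenAI-compatible REST API. The gateway URL, port, and model alias are assumptions; match them to your LiteLLM configuration.

import requests

# Assumed local LiteLLM gateway address and model alias; LiteLLM exposes an
# OpenAI-compatible /v1/chat/completions route.
GATEWAY_URL = "http://localhost:4000/v1/chat/completions"

resp = requests.post(
    GATEWAY_URL,
    json={
        "model": "minimax-m1",
        "messages": [{"role": "user", "content": "Draft a resource plan for an inference task."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])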

Additional files

  • scripts/resource_heartbeat.py: example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat data to the orchestrator.
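
A minimal sketch of such a heartbeat loop, assuming the orchestrator exposes an HTTP ingestion route; the URL, payload fields, and resource ID below are illustrative, not the script's actual contract:

import subprocess
import time

import requests

# Assumed orchestrator route and resource ID; the real script's endpoint
# and payload may differ.
ORCHESTRATOR_URL = "http://localhost:8000/api/v1/heartbeat"
RESOURCE_ID = "pod_abc123"
INTERVAL_SECONDS = 60  # matches the HEARTBEAT_INTERVAL default

def gpu_utilization() -> float:
    """Read current utilization (%) of GPU 0 via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return float(out.decode().splitlines()[0])

while True:
    requests.post(
        ORCHESTRATOR_URL,
        json={
            "resource_id": RESOURCE_ID,
            "status": "running",
            "gpu_utilization": gpu_utilization(),
        },
        timeout=10,
    )
    time.sleep(INTERVAL_SECONDS)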

Quick start (local dry-run)

  1. Install Python packages: pip install -r requirements.txt

  2. Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.

  3. Start the MCP server: python -m src.mcp.main

  4. Push a demo task into Redis (see scripts/run_demo.sh, or the Python sketch below).

  5. Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in .env).
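
For step 4, a task can also be pushed from Python. The queue name and task schema below are assumptions; mirror whatever scripts/run_demo.sh actually pushes.

import json

import redis

# Assumed queue name and task shape, for illustration only.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
task = {
    "task_type": "inference",
    "spec": {"name": "demo-task", "image": "vllm/vllm-openai:latest"},
}
r.lpush("mcp:tasks", json.dumps(task))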

Notes & caveats

  • This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.

MCP Orchestrator - Quick Reference

🚀 Installation (5 minutes)

# 1. Configure environment
cp .env.example .env
nano .env    # Add your API keys

# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh

# 3. Verify
curl http://localhost:8000/health

📡 Common API Calls

Deploy with Auto GPU Selection

# Inference workload (will select cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "inference",
    "spec": {
      "name": "llm-server",
      "image": "vllm/vllm-openai:latest",
      "command": "python -m vllm.entrypoints.api_server"
    }
  }'

# Training workload (will select powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "training",
    "spec": {
      "name": "fine-tune-job",
      "image": "pytorch/pytorch:latest"
    }
  }'

Deploy with Specific GPU

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "name": "custom-pod",
      "gpu_name": "RTX 4090",
      "resources": {"accelerators": "RTX 4090:2"}
    }
  }'

Deploy to Provider (Default: ON_DEMAND + RTX 3060)

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{"spec": {"name": "simple-pod"}}'

Register Existing Infrastructure

# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_id": "12345",
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_ids": ["12345", "67890"],
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

List Resources

# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list

# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list

Terminate Resource

curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123

🎯 GPU Rules Management

View Rules

curl http://localhost:8000/api/v1/gpu-rules

Add Rule

curl -X POST http://localhost:8000/api/v1/gpu-rules \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_family": "H100",
    "type": "Enterprise",
    "min_use_case": "large-scale training",
    "optimal_use_case": "foundation models",
    "power_rating": "700W",
    "typical_cloud_instance": "RunPod",
    "priority": 0
  }'

Delete Rule

curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060

๐Ÿ” Monitoring

ClickHouse Queries

-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;

-- Recent deployments
SELECT * FROM deployments ORDER BY created_at DESC LIMIT 10;

-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;

-- Cost analysis
SELECT provider, sum(price_hour) as total_hourly_cost, avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;

-- Event volume
SELECT event_type, count() as count, toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
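
These queries can also be issued from Python. Below is a minimal sketch using the clickhouse-connect client, assuming ClickHouse's default HTTP port 8123 and the mcp database:

import clickhouse_connect

# Connection details assume the docker-compose defaults; the password is
# CLICKHOUSE_PASSWORD from .env.
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="xxx",
    database="mcp",
)

# Run the "active resources" query and print the rows.
rows = client.query(
    "SELECT provider, status, count() AS total "
    "FROM resources "
    "WHERE status IN ('running', 'active') "
    "GROUP BY provider, status"
).result_rows

for provider, status, total in rows:
    print(f"{provider} {status}: {total}")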

View Logs

# All services
docker-compose logs -f

# API only
docker-compose logs -f mcp-api

# Heartbeat monitor
docker-compose logs -f heartbeat-worker

# ClickHouse
docker-compose logs -f clickhouse

๐Ÿ› ๏ธ Maintenance

Restart Services

# Restart all
docker-compose restart

# Restart API only
docker-compose restart mcp-api

# Reload with new code
docker-compose up -d --build

Backup ClickHouse

# Backup database
docker-compose exec clickhouse clickhouse-client --query \
  "BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"

# Export table
docker-compose exec clickhouse clickhouse-client --query \
  "SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv

Clean Up

# Stop all services
docker-compose down

# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v

# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
  "OPTIMIZE TABLE events FINAL"

๐Ÿ› Troubleshooting

Service won't start

# Check status
docker-compose ps

# Check logs
docker-compose logs mcp-api

# Verify config
cat .env | grep -v '^#' | grep -v '^$'

ClickHouse connection issues

# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"

# Reinitialize
docker-compose exec clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql

# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"

API returns 404 for provider

# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"

# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api

Heartbeat not working

# Check heartbeat worker
docker-compose logs heartbeat-worker

# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list

๐Ÿ“ Environment Variables

Key variables in .env:

# Required
RUNPOD_API_KEY=xxx        # Your RunPod API key
VASTAI_API_KEY=xxx        # Your Vast.ai API key (used per-request only)

# ClickHouse
CLICKHOUSE_PASSWORD=xxx   # Set strong password

# Optional
LOG_LEVEL=INFO            # DEBUG for verbose logs
WORKERS=4                 # API worker processes
HEARTBEAT_INTERVAL=60     # Seconds between health checks

๐Ÿ” Security Checklist

  • Change default ClickHouse password

  • Store .env securely (add to .gitignore)

  • Use separate API keys for prod/staging

  • Enable ClickHouse authentication

  • Configure AI Envoy Gateway policies

  • Rotate API keys regularly

  • Review ClickHouse access logs

  • Set up alerting for unhealthy resources

📚 Resources


This is a hybrid server: it can function both locally and remotely, depending on the configuration or use case. It enables deployment and management of GPU workloads across multiple cloud providers (RunPod, Vast.ai) with intelligent GPU selection, resource monitoring, and telemetry tracking through Redis, ClickHouse, and SkyPilot integration.

