# MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server
with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now
includes integration hooks that report resource lifecycle events and telemetry to an "AI Envoy"
endpoint (a generic HTTP ingestion endpoint).
## Highlights
- Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
- SkyPilot Agent builds dynamic YAML and executes the `sky` CLI.
- OnPrem Agent runs on-prem deployments (currently a placeholder using kubectl/helm).
- Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
- Pluggable LLM client; the default configuration calls a local LiteLLM gateway for minimax-m1.
- Phoenix observability hooks and Envoy integration for telemetry events.
## Additional files
- `scripts/resource_heartbeat.py` — example script that runs inside a provisioned resource
and posts periodic GPU utilization/heartbeat to the orchestrator.
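The gist of that script, as a minimal shell sketch (the `/api/v1/heartbeat` path, the payload fields, and the `RESOURCE_ID` variable are illustrative assumptions; `scripts/resource_heartbeat.py` defines the real contract):
```bash
# Post GPU utilization to the orchestrator every HEARTBEAT_INTERVAL seconds.
# Endpoint path and payload shape below are assumptions, not the script's exact API.
while true; do
  UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  curl -s -X POST "http://orchestrator:8000/api/v1/heartbeat" \
    -H "Content-Type: application/json" \
    -d "{\"resource_id\": \"${RESOURCE_ID}\", \"gpu_utilization\": ${UTIL}}"
  sleep "${HEARTBEAT_INTERVAL:-60}"
done
```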
## Quick start (local dry-run)
1. Install Python packages: `pip install -r requirements.txt`
2. Start Redis (e.g. `docker run -p 6379:6379 -d redis`) and optionally ClickHouse.
3. Start the MCP server: `python -m src.mcp.main`
4. Push a demo task into Redis (see `scripts/run_demo.sh`, or the sketch below).
5. Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in `.env`).
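For step 4, a minimal sketch using `redis-cli`, assuming the orchestrator consumes a Redis list named `mcp:tasks` (the actual queue name and task schema are set by `scripts/run_demo.sh` and the orchestrator config):
```bash
redis-cli LPUSH mcp:tasks \
  '{"task_type": "inference", "spec": {"name": "demo-task"}}'
```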
## Notes & caveats
- This is a reference implementation. You will need to install and configure real
services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint)
to get a fully working pipeline.
# MCP Orchestrator - Quick Reference
## 🚀 Installation (5 minutes)
```bash
# 1. Configure environment
cp .env.example .env
nano .env # Add your API keys
# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh
# 3. Verify
curl http://localhost:8000/health
```
## 📡 Common API Calls
### Deploy with Auto GPU Selection
```bash
# Inference workload (selects a cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "inference",
"spec": {
"name": "llm-server",
"image": "vllm/vllm-openai:latest",
"command": "python -m vllm.entrypoints.api_server"
}
}'
# Training workload (selects a powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "training",
"spec": {
"name": "fine-tune-job",
"image": "pytorch/pytorch:latest"
}
}'
```
### Deploy with Specific GPU
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"spec": {
"name": "custom-pod",
"gpu_name": "RTX 4090",
"resources": {
"accelerators": "RTX 4090:2"
}
}
}'
```
### Deploy to Provider (Default: ON_DEMAND + RTX 3060)
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{"spec": {"name": "simple-pod"}}'
```
### Register Existing Infrastructure
```bash
# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_id": "12345",
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_ids": ["12345", "67890"],
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
```
### List Resources
```bash
# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list
# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list
```
### Terminate Resource
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123
```
## 🎯 GPU Rules Management
### View Rules
```bash
curl http://localhost:8000/api/v1/gpu-rules
```
### Add Rule
```bash
curl -X POST http://localhost:8000/api/v1/gpu-rules \
-H "Content-Type: application/json" \
-d '{
"gpu_family": "H100",
"type": "Enterprise",
"min_use_case": "large-scale training",
"optimal_use_case": "foundation models",
"power_rating": "700W",
"typical_cloud_instance": "RunPod",
"priority": 0
}'
```
### Delete Rule
```bash
curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060
```
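Rule names containing spaces must be URL-encoded in the path (space becomes `%20`), as in the `RTX%203060` example above.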
## 🔍 Monitoring
### ClickHouse Queries
```sql
-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;
-- Recent deployments
SELECT *
FROM deployments
ORDER BY created_at DESC
LIMIT 10;
-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;
-- Cost analysis
SELECT
provider,
sum(price_hour) as total_hourly_cost,
avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;
-- Event volume
SELECT
event_type,
count() as count,
toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
```
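These queries can be run from the host through the bundled client; a usage sketch, assuming the tables live in the `mcp` database created by `scripts/init_clickhouse.sql`:
```bash
# Example: the active-resources summary from above
docker-compose exec clickhouse clickhouse-client --database mcp --query \
  "SELECT provider, status, count() AS total
   FROM resources
   WHERE status IN ('running', 'active')
   GROUP BY provider, status"
```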
### View Logs
```bash
# All services
docker-compose logs -f
# API only
docker-compose logs -f mcp-api
# Heartbeat monitor
docker-compose logs -f heartbeat-worker
# ClickHouse
docker-compose logs -f clickhouse
```
## 🛠️ Maintenance
### Restart Services
```bash
# Restart all
docker-compose restart
# Restart API only
docker-compose restart mcp-api
# Reload with new code
docker-compose up -d --build
```
### Backup ClickHouse
```bash
# Backup database
docker-compose exec clickhouse clickhouse-client --query \
"BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"
# Export table
docker-compose exec -T clickhouse clickhouse-client --query \
"SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv
```
### Clean Up
```bash
# Stop all services
docker-compose down
# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v
# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
"OPTIMIZE TABLE events FINAL"
```
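ClickHouse applies TTL deletes during background merges, so `OPTIMIZE TABLE events FINAL` forces a merge that drops expired rows immediately instead of waiting for the next scheduled merge.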
## 🐛 Troubleshooting
### Service won't start
```bash
# Check status
docker-compose ps
# Check logs
docker-compose logs mcp-api
# Verify config
cat .env | grep -v '^#' | grep -v '^$'
```
### ClickHouse connection issues
```bash
# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
# Reinitialize
docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql
# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
```
### API returns 404 for provider
```bash
# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"
# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api
```
### Heartbeat not working
```bash
# Check heartbeat worker
docker-compose logs heartbeat-worker
# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list
```
## 📝 Environment Variables
Key variables in `.env`:
```bash
# Required
RUNPOD_API_KEY=xxx # Your RunPod API key
VASTAI_API_KEY=xxx # Your Vast.ai API key (used per-request only)
# ClickHouse
CLICKHOUSE_PASSWORD=xxx # Set strong password
# Optional
LOG_LEVEL=INFO # DEBUG for verbose logs
WORKERS=4 # API worker processes
HEARTBEAT_INTERVAL=60 # Seconds between health checks
```
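A quick sanity check that the required keys are set and not left at the placeholder value (variable names from the list above):
```bash
# Lists only keys that are present and changed from the "xxx" placeholder
grep -E '^(RUNPOD_API_KEY|VASTAI_API_KEY|CLICKHOUSE_PASSWORD)=' .env | grep -v '=xxx$'
```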
## 🔐 Security Checklist
- [ ] Change default ClickHouse password
- [ ] Store `.env` securely (add to `.gitignore`)
- [ ] Use separate API keys for prod/staging
- [ ] Enable ClickHouse authentication
- [ ] Configure AI Envoy Gateway policies
- [ ] Rotate API keys regularly
- [ ] Review ClickHouse access logs
- [ ] Set up alerting for unhealthy resources
## 📚 Resources
- **API Docs**: http://localhost:8000/docs
- **ClickHouse UI**: http://localhost:8124 (with `--profile debug`)
- **Health Check**: http://localhost:8000/health
- **Full README**: See README.md