# MCP (Multi-Cloud Platform) Server
This repository provides a working, extensible reference implementation of an MCP server
with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now
includes integration hooks that report resource lifecycle events and telemetry to an "AI Envoy"
endpoint (a generic HTTP ingestion endpoint).
## Highlights
- Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
- SkyPilot Agent builds dynamic YAML and executes the `sky` CLI.
- OnPrem Agent runs on-prem deployments (currently a placeholder using kubectl/helm).
- Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
- Pluggable LLM client; the default configuration calls a local LiteLLM gateway for minimax-m1.
- Phoenix observability hooks and Envoy integration for telemetry events.
## Additional files
- `scripts/resource_heartbeat.py` — example script that runs inside a provisioned resource
and posts periodic GPU utilization/heartbeat to the orchestrator.
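The gist of that script, as a minimal shell sketch (the `/api/v1/heartbeat` path, the payload fields, and the `RESOURCE_ID` variable are illustrative assumptions; `scripts/resource_heartbeat.py` defines the real contract):
```bash
# Post GPU utilization to the orchestrator every HEARTBEAT_INTERVAL seconds.
# Endpoint path and payload shape below are assumptions, not the script's exact API.
while true; do
  UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  curl -s -X POST "http://orchestrator:8000/api/v1/heartbeat" \
    -H "Content-Type: application/json" \
    -d "{\"resource_id\": \"${RESOURCE_ID}\", \"gpu_utilization\": ${UTIL}}"
  sleep "${HEARTBEAT_INTERVAL:-60}"
done
```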
## Quick start (local dry-run)
1. Install Python packages: `pip install -r requirements.txt`
2. Start Redis (e.g. `docker run -p 6379:6379 -d redis`) and optionally ClickHouse.
3. Start the MCP server: `python -m src.mcp.main`
4. Push a demo task into Redis (see `scripts/run_demo.sh`, or the sketch below).
5. Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in `.env`).
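For step 4, a minimal sketch using `redis-cli`, assuming the orchestrator consumes a Redis list named `mcp:tasks` (the actual queue name and task schema are set by `scripts/run_demo.sh` and the orchestrator config):
```bash
redis-cli LPUSH mcp:tasks \
  '{"task_type": "inference", "spec": {"name": "demo-task"}}'
```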
## Notes & caveats
- This is a reference implementation. You will need to install and configure real
services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint)
to get a fully working pipeline.
# MCP Orchestrator - Quick Reference
## 🚀 Installation (5 minutes)
```bash
# 1. Configure environment
cp .env.example .env
nano .env # Add your API keys
# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh
# 3. Verify
curl http://localhost:8000/health
```
## 📡 Common API Calls
### Deploy with Auto GPU Selection
```bash
# Inference workload (selects a cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "inference",
"spec": {
"name": "llm-server",
"image": "vllm/vllm-openai:latest",
"command": "python -m vllm.entrypoints.api_server"
}
}'
# Training workload (selects a powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
-H "Content-Type: application/json" \
-d '{
"task_type": "training",
"spec": {
"name": "fine-tune-job",
"image": "pytorch/pytorch:latest"
}
}'
```
### Deploy with Specific GPU
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{
"spec": {
"name": "custom-pod",
"gpu_name": "RTX 4090",
"resources": {
"accelerators": "RTX 4090:2"
}
}
}'
```
### Deploy to Provider (Default: ON_DEMAND + RTX 3060)
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
-H "Content-Type: application/json" \
-d '{"spec": {"name": "simple-pod"}}'
```
### Register Existing Infrastructure
```bash
# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_id": "12345",
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
-H "Content-Type: application/json" \
-d '{
"provider": "vastai",
"resource_ids": ["12345", "67890"],
"credentials": {"api_key": "YOUR_VASTAI_KEY"}
}'
```
### List Resources
```bash
# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list
# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list
```
### Terminate Resource
```bash
curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123
```
## 🎯 GPU Rules Management
### View Rules
```bash
curl http://localhost:8000/api/v1/gpu-rules
```
### Add Rule
```bash
curl -X POST http://localhost:8000/api/v1/gpu-rules \
-H "Content-Type: application/json" \
-d '{
"gpu_family": "H100",
"type": "Enterprise",
"min_use_case": "large-scale training",
"optimal_use_case": "foundation models",
"power_rating": "700W",
"typical_cloud_instance": "RunPod",
"priority": 0
}'
```
### Delete Rule
```bash
curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060
```
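Rule names containing spaces must be URL-encoded in the path (space becomes `%20`), as in the `RTX%203060` example above.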
## 🔍 Monitoring
### ClickHouse Queries
```sql
-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;
-- Recent deployments
SELECT *
FROM deployments
ORDER BY created_at DESC
LIMIT 10;
-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;
-- Cost analysis
SELECT
provider,
sum(price_hour) as total_hourly_cost,
avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;
-- Event volume
SELECT
event_type,
count() as count,
toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
```
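These queries can be run from the host through the bundled client; a usage sketch, assuming the tables live in the `mcp` database created by `scripts/init_clickhouse.sql`:
```bash
# Example: the active-resources summary from above
docker-compose exec clickhouse clickhouse-client --database mcp --query \
  "SELECT provider, status, count() AS total
   FROM resources
   WHERE status IN ('running', 'active')
   GROUP BY provider, status"
```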
### View Logs
```bash
# All services
docker-compose logs -f
# API only
docker-compose logs -f mcp-api
# Heartbeat monitor
docker-compose logs -f heartbeat-worker
# ClickHouse
docker-compose logs -f clickhouse
```
## 🛠️ Maintenance
### Restart Services
```bash
# Restart all
docker-compose restart
# Restart API only
docker-compose restart mcp-api
# Reload with new code
docker-compose up -d --build
```
### Backup ClickHouse
```bash
# Backup database
docker-compose exec clickhouse clickhouse-client --query \
"BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"
# Export table
docker-compose exec -T clickhouse clickhouse-client --query \
"SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv
```
### Clean Up
```bash
# Stop all services
docker-compose down
# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v
# Prune old data from ClickHouse (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
"OPTIMIZE TABLE events FINAL"
```
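ClickHouse applies TTL deletes during background merges, so `OPTIMIZE TABLE events FINAL` forces a merge that drops expired rows immediately instead of waiting for the next scheduled merge.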
## 🐛 Troubleshooting
### Service won't start
```bash
# Check status
docker-compose ps
# Check logs
docker-compose logs mcp-api
# Verify config
cat .env | grep -v '^#' | grep -v '^$'
```
### ClickHouse connection issues
```bash
# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"
# Reinitialize
docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql
# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"
```
### API returns 404 for provider
```bash
# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"
# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api
```
### Heartbeat not working
```bash
# Check heartbeat worker
docker-compose logs heartbeat-worker
# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list
```
## 📝 Environment Variables
Key variables in `.env`:
```bash
# Required
RUNPOD_API_KEY=xxx # Your RunPod API key
VASTAI_API_KEY=xxx # Your Vast.ai API key (used per-request only)
# ClickHouse
CLICKHOUSE_PASSWORD=xxx # Set strong password
# Optional
LOG_LEVEL=INFO # DEBUG for verbose logs
WORKERS=4 # API worker processes
HEARTBEAT_INTERVAL=60 # Seconds between health checks
```
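A quick sanity check that the required keys are set and not left at the placeholder value (variable names from the list above):
```bash
# Lists only keys that are present and changed from the "xxx" placeholder
grep -E '^(RUNPOD_API_KEY|VASTAI_API_KEY|CLICKHOUSE_PASSWORD)=' .env | grep -v '=xxx$'
```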
## 🔐 Security Checklist
- [ ] Change default ClickHouse password
- [ ] Store `.env` securely (add to `.gitignore`)
- [ ] Use separate API keys for prod/staging
- [ ] Enable ClickHouse authentication
- [ ] Configure AI Envoy Gateway policies
- [ ] Rotate API keys regularly
- [ ] Review ClickHouse access logs
- [ ] Set up alerting for unhealthy resources
## 📚 Resources
- **API Docs**: http://localhost:8000/docs
- **ClickHouse UI**: http://localhost:8124 (with `--profile debug`)
- **Health Check**: http://localhost:8000/health
- **Full README**: See README.md