list_alerts
Retrieve GPU infrastructure alerts for thermal throttling, OOM kills, low utilization, and hardware errors to investigate performance degradation, unexpected restarts, or inform scaling decisions.
Instructions
List GPU infrastructure alerts (thermal throttling, OOM kills, low utilisation, hardware errors).
Call this when investigating performance degradation, unexpected restarts, or before making scaling decisions. Open alerts (resolved=False) indicate active issues.
Args:
  severity: Filter by severity (warning or critical).
  resolved: False for active alerts, True for resolved alerts. Omit to return both.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
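| resolved | No | False for active alerts, True for resolved alerts. Omit to return both. | |
| severity | No | Filter by severity: warning or critical. | |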
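
As an illustration of the schema above, here is a minimal sketch of assembling the arguments for a list_alerts call. The helper name `build_list_alerts_args` is hypothetical and not part of the tool. How the resulting dictionary is sent depends on your client, which is not shown here. The sketch assumes that leaving out a parameter means "no filter", as the argument descriptions suggest.

```python
from typing import Any, Dict, Optional

# Severity values named in the tool's argument description.
ALLOWED_SEVERITIES = {"warning", "critical"}


def build_list_alerts_args(
    severity: Optional[str] = None,
    resolved: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the arguments object for list_alerts.

    Parameters left as None are omitted from the result, which is
    assumed to mean "do not filter on this field".
    """
    args: Dict[str, Any] = {}
    if severity is not None:
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(
                f"severity must be one of {sorted(ALLOWED_SEVERITIES)}"
            )
        args["severity"] = severity
    if resolved is not None:
        if not isinstance(resolved, bool):
            raise TypeError("resolved must be a bool")
        args["resolved"] = resolved
    return args


# Active (unresolved) critical alerts, e.g. before a scaling decision.
open_critical = build_list_alerts_args(severity="critical", resolved=False)

# No filters: all alerts, regardless of severity or resolution state.
everything = build_list_alerts_args()
```

Validating the severity value on the client side is a design choice in this sketch, made so that a typo fails early. The tool itself may accept or reject other values.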