check_gpu_health
Detect GPU hardware failures and performance issues by checking for XID errors and throttling events across OpenShift clusters.
Instructions
Check for GPU hardware errors (XID) and throttling events across the cluster.
Why:
- Detect Hardware Failures: XID errors often indicate physical GPU faults.
- Explain Performance Issues: Thermal or power throttling can explain why a model runs slowly.
Returns:
Markdown report of GPU health issues.
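To make the shape of such a report concrete, here is a minimal sketch of scanning log lines for XID errors and rendering the findings as Markdown. It is not this tool's implementation: the helper names `find_xid_events` and `render_report` are hypothetical, and the `NVRM: Xid (PCI:...): <code>` log shape is an assumption about typical NVIDIA kernel messages that may differ across driver versions. Throttling detection is left out.

```python
import re

# Assumed shape of an NVIDIA Xid kernel message; real lines vary by driver version.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_lines):
    """Return (pci_address, xid_code) pairs for every line that matches the pattern."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("code"))))
    return events


def render_report(events):
    """Render the events as a small Markdown report (hypothetical layout)."""
    if not events:
        return "## GPU Health\n\nNo XID errors found."
    rows = ["## GPU Health", "", "| PCI device | XID |", "|---|---|"]
    rows += [f"| {pci} | {code} |" for pci, code in events]
    return "\n".join(rows)
```

Calling `render_report(find_xid_events(lines))` on a list of kernel log lines yields either a table of affected devices or a single "No XID errors found." line.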
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |