gpu_watch
Sample GPU status over a fixed interval to capture utilization, temperature, power, and VRAM usage, returning per-card statistics to assess training stability.
Instructions
Take N snapshots of gpu_status at a fixed interval and return both the raw frames and per-card min/max/avg statistics for utilization, temperature, power, and VRAM usage. Useful for answering “is this training run stable?”. Default: 5 samples at 1000ms intervals.
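One way a caller might turn the returned statistics into a stability verdict is sketched below. The `per_card_stats` field names follow the tool's output, but the sample numbers, the `LIMITS` thresholds, and the `checkCard` helper are all assumptions made for illustration; none of them are part of gpu_watch.

```javascript
// Hypothetical per_card_stats payload; field names follow the tool's output,
// the numbers are invented for illustration.
const perCardStats = {
  card0: {
    utilization_percent: { min: 96, max: 99, avg: 97.8, samples: 5 },
    temp_edge_c: { min: 68, max: 71, avg: 69.6, samples: 5 },
    power_avg_w: { min: 240, max: 251, avg: 246.2, samples: 5 },
    vram_used_bytes: null, // stats are null when no sample had a reading
  },
};

// Illustrative thresholds chosen by the caller; they are not part of gpu_watch.
const LIMITS = { minUtilization: 90, maxTempC: 85, maxTempSpreadC: 5 };

// Return a list of human-readable warnings for one card (empty list = no flags).
function checkCard(stats, limits = LIMITS) {
  const warnings = [];
  const util = stats.utilization_percent;
  const temp = stats.temp_edge_c;
  if (util && util.min < limits.minUtilization) {
    warnings.push(`utilization dipped to ${util.min}%`);
  }
  if (temp && temp.max > limits.maxTempC) {
    warnings.push(`edge temperature peaked at ${temp.max} C`);
  }
  if (temp && temp.max - temp.min > limits.maxTempSpreadC) {
    warnings.push(`edge temperature swung ${temp.max - temp.min} C`);
  }
  return warnings;
}

for (const [card, stats] of Object.entries(perCardStats)) {
  const warnings = checkCard(stats);
  console.log(card, warnings.length ? warnings : 'no flags raised');
}
```

Sensible thresholds depend on the card and the workload, so treat the values above as placeholders rather than recommendations.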
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| samples | No | Number of samples to take (2–60). | 5 |
| interval_ms | No | Milliseconds between samples (100–10000). | 1000 |
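Judging by the handler source quoted under Implementation Reference, out-of-range values are clamped into the allowed range rather than rejected, and fractional values are floored. The sketch below reproduces that arithmetic in isolation; `clampInt` and `resolveArgs` are hypothetical helper names, not functions in server.js.

```javascript
// Clamp an optional integer argument the way the gpu_watch handler does:
// apply the default when the value is null/undefined, floor it, then bound it.
// `clampInt` is a hypothetical helper name, not a function in server.js.
function clampInt(value, min, max, fallback) {
  return Math.max(min, Math.min(max, Math.floor(value ?? fallback)));
}

// Resolve both arguments and the nominal run length.
function resolveArgs(args = {}) {
  const samples = clampInt(args.samples, 2, 60, 5);
  const intervalMs = clampInt(args.interval_ms, 100, 10000, 1000);
  // There are (samples - 1) waits between samples.
  return { samples, intervalMs, totalDurationMs: intervalMs * (samples - 1) };
}

console.log(resolveArgs({}));
// { samples: 5, intervalMs: 1000, totalDurationMs: 4000 }
console.log(resolveArgs({ samples: 500, interval_ms: 10 }));
// { samples: 60, intervalMs: 100, totalDurationMs: 5900 }
console.log(resolveArgs({ samples: 3.9 }));
// { samples: 3, intervalMs: 1000, totalDurationMs: 2000 }
```

The `total_duration_ms` field in the result is this nominal figure: it counts only the waits between samples, so time spent inside each rocm-smi call is not included.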
Implementation Reference
- server.js:244-313 (handler): Main handler function for the gpu_watch tool. Takes N snapshots of GPU status at fixed intervals, computes per-card min/max/avg statistics for utilization, temperature, power, and VRAM usage, and returns the raw frames plus aggregated stats.

      async function gpuWatch(args) {
        const missing = requireRocmSmi();
        if (missing) return errorResult(missing);
        const samples = Math.max(2, Math.min(60, Math.floor(args.samples ?? 5)));
        const intervalMs = Math.max(100, Math.min(10000, Math.floor(args.interval_ms ?? 1000)));
        const snapshots = [];
        for (let i = 0; i < samples; i++) {
          if (i > 0) await new Promise((r) => setTimeout(r, intervalMs));
          const r = await run(BIN.rocmSmi, ['-a', '--json']);
          const data = parseRocmJson(r.stdout);
          const vram = parseRocmJson((await run(BIN.rocmSmi, ['--showmeminfo', 'vram', '--json'])).stdout) || {};
          const frame = { timestamp: new Date().toISOString(), cards: [] };
          if (data) {
            for (const k of cardKeys(data)) {
              const c = data[k];
              const v = vram[k] || {};
              frame.cards.push({
                card: k,
                utilization_percent: numOrNull(c['GPU use (%)']),
                vram_used_bytes: numOrNull(v['VRAM Total Used Memory (B)']),
                temp_edge_c: numOrNull(c['Temperature (Sensor edge) (C)']),
                power_avg_w: numOrNull(c['Average Graphics Package Power (W)']),
                fan_rpm: numOrNull(c['Fan RPM']),
              });
            }
          }
          snapshots.push(frame);
        }

        // Compute per-card deltas (min/max/avg of utilization and temp) as a
        // convenience so the caller doesn't have to aggregate themselves.
        const summary = {};
        for (const frame of snapshots) {
          for (const c of frame.cards) {
            if (!summary[c.card]) {
              summary[c.card] = { utilization: [], temp: [], power: [], vram: [] };
            }
            if (c.utilization_percent !== null) summary[c.card].utilization.push(c.utilization_percent);
            if (c.temp_edge_c !== null) summary[c.card].temp.push(c.temp_edge_c);
            if (c.power_avg_w !== null) summary[c.card].power.push(c.power_avg_w);
            if (c.vram_used_bytes !== null) summary[c.card].vram.push(c.vram_used_bytes);
          }
        }

        function stats(arr) {
          if (!arr.length) return null;
          const min = Math.min(...arr);
          const max = Math.max(...arr);
          const avg = arr.reduce((a, b) => a + b, 0) / arr.length;
          return { min, max, avg: Math.round(avg * 100) / 100, samples: arr.length };
        }

        const perCard = {};
        for (const [card, s] of Object.entries(summary)) {
          perCard[card] = {
            utilization_percent: stats(s.utilization),
            temp_edge_c: stats(s.temp),
            power_avg_w: stats(s.power),
            vram_used_bytes: stats(s.vram),
          };
        }

        return textResult({
          samples,
          interval_ms: intervalMs,
          total_duration_ms: intervalMs * (samples - 1),
          snapshots,
          per_card_stats: perCard,
        });
      }

- server.js:374-386 (schema): Tool registration entry for gpu_watch with name, description, annotations, and inputSchema (samples: min 2 max 60 default 5, interval_ms: min 100 max 10000 default 1000).

      {
        name: 'gpu_watch',
        description: 'Take N snapshots of gpu_status at a fixed interval and return both the raw frames and per-card min/max/avg statistics for utilization, temperature, power, and VRAM usage. Useful for answering “is this training run stable?”. Default: 5 samples at 1000ms intervals.',
        annotations: { title: 'Watch GPU over time', readOnlyHint: true, destructiveHint: false, openWorldHint: false },
        inputSchema: {
          type: 'object',
          properties: {
            samples: { type: 'integer', minimum: 2, maximum: 60, description: 'Number of samples to take (2–60). Default: 5.' },
            interval_ms: { type: 'integer', minimum: 100, maximum: 10000, description: 'Milliseconds between samples (100–10000). Default: 1000.' },
          },
          additionalProperties: false,
        },
      },

- server.js:395-401 (registration): HANDLERS mapping that routes the 'gpu_watch' tool name to the gpuWatch function.

      const HANDLERS = {
        gpu_status: gpuStatus,
        gpu_metrics: gpuMetrics,
        gpu_processes: gpuProcesses,
        gpu_watch: gpuWatch,
        rocm_info: rocmInfo,
      };
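The code that consumes the HANDLERS mapping is not shown in this reference. As a minimal sketch, assuming the server looks up the requested tool name in the map and awaits the matching function, dispatch could look like the following. The stub handlers and the `callTool` name are hypothetical and stand in for whatever server.js actually does.

```javascript
// Stub handlers standing in for the real ones in server.js (assumption: each
// handler is an async function that takes the tool arguments object).
const HANDLERS = {
  gpu_status: async () => ({ tool: 'gpu_status' }),
  gpu_watch: async (args) => ({ tool: 'gpu_watch', args }),
};

// Hypothetical dispatcher: look up the handler by tool name and call it.
// Object.hasOwn avoids matching inherited keys such as 'toString'.
async function callTool(name, args = {}) {
  if (!Object.hasOwn(HANDLERS, name)) {
    throw new Error(`Unknown tool: ${name}`);
  }
  return HANDLERS[name](args);
}

callTool('gpu_watch', { samples: 3 }).then((result) => console.log(result));
// { tool: 'gpu_watch', args: { samples: 3 } }
```

Checking for an own property before the lookup keeps a request for a name like `toString` from resolving to a built-in method on the object prototype.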