tm_compare_runs
Compares two load-test runs and highlights metric deltas, including regression detection for error rates and latencies.
Instructions
Synthetic side-by-side diff of two runs.
No equivalent /api/v1 endpoint exists — this tool fetches both
runs and computes the delta client-side. Note: the response
shape here is the MCP server's own — distinct from the native
RunComparisonService's RunComparisonResponse. The MCP
shape is leaner (no full run summaries embedded) because the AI
rarely needs every field, and a separate tm_get_run call
surfaces the full per-run detail if needed.
Return shape:
.. code-block:: python
{
"run_id": <int>, # the newer run's id (post-swap)
"baseline_run_id": <int>, # the older run's id (post-swap)
"verdict_change": {
"run": <str|None>, # e.g. "FAIL"
"baseline": <str|None>, # e.g. "PASS"
},
"deltas": {
"<metric_name>": {
"a": <number>, # value from `run`
"b": <number>, # value from `baseline`
"delta_pct": <float|None>, # None when baseline is 0
"delta_abs": <number>, # only when delta_pct is None
"regression": True, # only when above +10% in a "bad" direction
},
...
},
}Metrics covered: totalRequests, totalErrors, avgRps,
peakRps, successRate, durationSeconds (top-level),
plus latency_<quantile> rows for every quantile present in
both runs (latency_p50, latency_p95, etc.).
Regression flag fires only for metrics with a known
"bad direction": totalErrors UP, successRate DOWN, any
latency quantile UP. Throughput / duration deltas are reported
without regression flags — the AI host interprets direction
based on context.
Threshold is intentionally generous (10%) — the server's auto-comparison uses tighter thresholds for the official verdict; this tool is a "draw attention to these metrics" surface for the AI to narrate alongside the verdict.
Invariants enforced — mirror the native
RunComparisonService.compare(...) server method
(RunComparisonService.java line ~61):
run_id != baseline_run_id(comparing a run to itself produces a useless all-zero diff that would mislead the AI).Both runs belong to the same profile (cross-profile compares mix incomparable workloads — e.g. checkout-flow vs login-flow latency — and produce % deltas that look valid but answer no real question).
Most-recent run is returned as
runwhencreatedAtis present and parseable on both runs. The tool compares parsed datetimes (UTC-aware, normalized from naive input) and swaps the caller's arguments if it detects "older first" — the result narrative then reads "newer run regressed vs older baseline" consistently. Order detection uses parsed-datetime comparison, not raw string compare, so ISO-8601 fractional-seconds and offset variants behave correctly.Fallback: when either run is missing
createdAtor the value can't be parsed, the tool keeps the caller- supplied order rather than guessing. That's an intentional safety net — better to surface the diff in the order the AI asked for than to silently rearrange based on weak signals — but it means the "newer = run" invariant only holds when timestamps are usable. Mirrors the native service's own defensive null-handling.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| run_id | Yes | ||
| baseline_run_id | Yes |