Skip to main content
Glama

tm_compare_runs

Compares two load-test runs and highlights metric deltas, including regression detection for error rates and latencies.

Instructions

Synthetic side-by-side diff of two runs.

No equivalent /api/v1 endpoint exists — this tool fetches both runs and computes the delta client-side. Note: the response shape here is the MCP server's own — distinct from the native RunComparisonService's RunComparisonResponse. The MCP shape is leaner (no full run summaries embedded) because the AI rarely needs every field, and a separate tm_get_run call surfaces the full per-run detail if needed.

Return shape:

.. code-block:: python

{
    "run_id": <int>,                    # the newer run's id (post-swap)
    "baseline_run_id": <int>,           # the older run's id (post-swap)
    "verdict_change": {
        "run": <str|None>,              # e.g. "FAIL"
        "baseline": <str|None>,         # e.g. "PASS"
    },
    "deltas": {
        "<metric_name>": {
            "a": <number>,              # value from `run`
            "b": <number>,              # value from `baseline`
            "delta_pct": <float|None>,  # None when baseline is 0
            "delta_abs": <number>,      # only when delta_pct is None
            "regression": True,         # only when above +10% in a "bad" direction
        },
        ...
    },
}

Metrics covered: totalRequests, totalErrors, avgRps, peakRps, successRate, durationSeconds (top-level), plus latency_<quantile> rows for every quantile present in both runs (latency_p50, latency_p95, etc.).

Regression flag fires only for metrics with a known "bad direction": totalErrors UP, successRate DOWN, any latency quantile UP. Throughput / duration deltas are reported without regression flags — the AI host interprets direction based on context.

Threshold is intentionally generous (10%) — the server's auto-comparison uses tighter thresholds for the official verdict; this tool is a "draw attention to these metrics" surface for the AI to narrate alongside the verdict.

Invariants enforced — mirror the native RunComparisonService.compare(...) server method (RunComparisonService.java line ~61):

  1. run_id != baseline_run_id (comparing a run to itself produces a useless all-zero diff that would mislead the AI).

  2. Both runs belong to the same profile (cross-profile compares mix incomparable workloads — e.g. checkout-flow vs login-flow latency — and produce % deltas that look valid but answer no real question).

  3. Most-recent run is returned as run when createdAt is present and parseable on both runs. The tool compares parsed datetimes (UTC-aware, normalized from naive input) and swaps the caller's arguments if it detects "older first" — the result narrative then reads "newer run regressed vs older baseline" consistently. Order detection uses parsed-datetime comparison, not raw string compare, so ISO-8601 fractional-seconds and offset variants behave correctly.

    Fallback: when either run is missing createdAt or the value can't be parsed, the tool keeps the caller- supplied order rather than guessing. That's an intentional safety net — better to surface the diff in the order the AI asked for than to silently rearrange based on weak signals — but it means the "newer = run" invariant only holds when timestamps are usable. Mirrors the native service's own defensive null-handling.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
run_idYes
baseline_run_idYes
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries full burden and delivers extensive behavioral details: client-side computation, response shape, regression logic with thresholds, timestamp-based ordering with fallback, and invariants. It fully discloses how the tool behaves and what the AI should expect.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long and includes a large code block, but it is structured with clear sections (purpose, return shape, invariants). Not every sentence is essential, but the detailed explanation adds value despite lacking conciseness.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description is very complete given the complexity: it covers the return shape, invariants, regression logic, and fallback behavior. However, it does not specify error handling for cases like invalid run IDs or cross-profile violations, leaving minor gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema provides only integer types with no descriptions. The description compensates by explaining the role of run_id and baseline_run_id in the return shape and the ordering logic, but it does not explicitly define the parameters in the input context. Still, the meaning is sufficiently inferable from the tool name and description.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as a 'Synthetic side-by-side diff of two runs'. It distinguishes itself from sibling tools by its unique comparison functionality, and the detailed explanation of what it does (compute delta client-side, return leaner response) makes the purpose unmistakable.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides implicit guidance on when to use this tool versus alternatives (e.g., 'a separate tm_get_run call surfaces the full per-run detail if needed'). It also enforces invariants (same profile, different runs) that constrain usage. However, it does not explicitly state when not to use it or list alternative tools for comparison.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/trafficmorph-gif/tm-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server