Fetch the report for a completed run. ONE tool, THREE report kinds — the response's top-level `kind` field discriminates which kind it is (rerecord / sandbox_run / test_suite_run) and which question the report answers (see core glossary's "three reports"). Read `kind` first, then pick the matching reading rules below; do NOT assume the kind from how you got here.
Call this as the final step of the playbook, AFTER you have read the terminal NDJSON event (phase=done) and confirmed data.ok=true. Pass app_id and test_run_id — extract test_run_id from data.test_run_id on the phase=done line of the progress_file returned by record_sandbox_test or replay_sandbox_test (for replay_test_suite, the CLI prints test_run_id to stdout instead).
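For example (IDs here are hypothetical, and real progress_file lines may carry more fields than shown), the terminal NDJSON line looks like:

```
{"phase": "done", "data": {"ok": true, "test_run_id": "tr_01HXYZ"}}
```

Given that line, call this tool with the app's app_id plus test_run_id="tr_01HXYZ".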
===== OUTPUT SHAPE =====
(Conditional verbosity so the dev isn't drowned in noise on a green run; an illustrative report sketch follows this list.)
* Always includes totals at the SUITE level only (total_suites / passed_suites / failed_suites) and a per_suite array where each entry carries suite_id, suite_name, total_steps, passed_steps, failed_steps. Aggregate step counts across suites are intentionally omitted — they hide where damage actually is.
* PER-KIND READING of passed_steps / failed_steps — same column names, different meaning per kind:
- RERECORD (kind=rerecord): passed_steps = steps whose auto-replay byte-comparison matched the live capture. failed_steps = steps that diverged on auto-replay. EVEN IF every suite shows passed_steps == total_steps, the rerecord is only successful when every suite is also linked=true (a sandbox test got produced). Always check `linked`; the step counts alone do not indicate "did the rerecord work".
- SANDBOX_RUN (kind=sandbox_run): passed_steps = steps whose assertions held under captured-mock replay. failed_steps = assertion failures or response diffs against the captured baseline.
- TEST_SUITE_RUN (kind=test_suite_run): passed_steps = steps whose assertions held against the live app. failed_steps = same against live, no mocks involved. No linkage to report.
* Top-level `kind` discriminates the report: `"rerecord"` for record_sandbox_test runs (rerecord report — answers "did the sandbox test get created and linked?"), `"sandbox_run"` for replay_sandbox_test runs (sandbox run report — answers "does the suite still hold up against its captured baseline?"), `"test_suite_run"` for replay_test_suite runs (test suite report — live execution, no mocks; answers "does the suite hold up against the actual current system?"). Use kind to pick the right reading; do NOT mix them in one response.
* RERECORD runs (kind="rerecord") carry a `linked` bool + `test_set_id` string on every per_suite[] entry. linked=true means the rerecord produced a sandbox test for the suite (replay-ready). linked=false means rerecord did NOT produce a sandbox test for the suite — it cannot be replayed until rerecord succeeds. ALWAYS surface this on rerecord output — even when every step's capture passed at the wire level, a suite without a sandbox test is a real failure. For the per-suite table, add a "Linked" column (yes/no from per_suite[].linked). For the one-line all-green reply, report "N/N suites passed, L/N have a sandbox test (test_run_id=<id>)".
* When any suite has failures (or verbose=true), also includes failed_steps[] with per-step diagnostics (suite, step name, method+url, diff excerpt, error, mock_mismatches, assertion_failures, mock_mismatch_failure, authored_assertions, authored_response_body) PLUS mock_mismatch_failed_steps (count) and mock_mismatch_dominant (bool — true when the majority of failed steps have unconsumed recorded mocks, which points at a keploy-side egress-hook issue rather than dev app breakage). On RERECORD, failed_steps[] also carries `linked` (whether the owning suite has a sandbox test after this rerecord) and the mock_mismatch_* fields are suppressed (irrelevant in rerecord context).
* authored_assertions / authored_response_body — the SUITE's authored contract for the failing step (the assert array and response.body as defined when the suite was created/updated). Surfaced inline so route B vs route C can be decided without a second getTestSuite round-trip. KEY DECISION POINT: if any authored_assertions entry is pinned to the value the diff shows as "expected" (e.g. assert {path: "$.order.status", expected: "created order"} and the diff says "expected 'created order', got 'created'"), route C is MANDATORY — re-record alone leaves that assertion stuck on the old contract and the next rerecord/replay will gate-1-fail on the same step. If authored_assertions is empty/absent (suite asserts nothing structural on that field), route B or route-C-without-assertion-edit may suffice.
* When everything passes and verbose is false, failed_steps is omitted.
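A sketch of a has_failures sandbox_run report, assembled from the fields named above. Treat it as illustrative only: every ID, name, and value is hypothetical, the key spellings are rendered from the bullet list rather than a schema, and real reports may carry the other per-step diagnostics (error, mock_mismatches, assertion_failures) plus extra fields:

```
{
  "kind": "sandbox_run",
  "status": "has_failures",
  "total_suites": 2, "passed_suites": 1, "failed_suites": 1,
  "per_suite": [
    {"suite_id": "s_01", "suite_name": "checkout", "total_steps": 5, "passed_steps": 4, "failed_steps": 1},
    {"suite_id": "s_02", "suite_name": "login",    "total_steps": 3, "passed_steps": 3, "failed_steps": 0}
  ],
  "failed_steps": [
    {
      "suite": "checkout",
      "step_name": "create order",
      "method_url": "POST /orders",
      "diff_excerpt": "expected 'created order', got 'created'",
      "mock_mismatch_failure": false,
      "authored_assertions": [{"path": "$.order.status", "expected": "created order"}],
      "authored_response_body": {"order": {"status": "created order"}}
    }
  ],
  "mock_mismatch_failed_steps": 0,
  "mock_mismatch_dominant": false
}
```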
===== HOW TO RESPOND TO THE DEV =====
* status == "all_passed" AND kind == "sandbox_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed (test_run_id=<id>)". Do not dump the JSON, do not list per-suite rows unless asked.
* status == "all_passed" AND kind == "test_suite_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed live (test_run_id=<id>)". No mocks involved, no linkage to report.
* status == "all_passed" AND kind == "rerecord" → ONE-LINER including linkage: "<passed_suites>/<total_suites> suites passed, <linked>/<total> linked (test_run_id=<id>)" where <linked> = count of per_suite[] entries with linked=true. If linked < total, ALSO list the unlinked suite names so the dev knows which ones are silently broken (skip sandbox replay on them, or investigate the linking failure). Never drop linkage reporting on rerecord even when it's all green.
* status == "has_failures" → response MUST contain (in order, no collapsing rows even when failures look homogeneous — the dev needs the full inventory):
1. per-suite table — one row per suite in per_suite (passing suites included), columns = Suite name | passed/total steps (plus the Linked column when kind=rerecord, per OUTPUT SHAPE); see the example layout after this list.
2. failed-steps table — ONE ROW per entry in failed_steps[], columns = Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch y/n.
3. Diagnosis + Recommendation (rules below).
Do NOT print aggregate step totals across suites.
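Filled-in examples of the three all-green one-liners, and a layout sketch for a has_failures reply. Every suite name, step, count, and ID below is hypothetical:

```
all-green one-liners
  kind=sandbox_run:    "4/4 suites passed (test_run_id=tr_01HXYZ)"
  kind=test_suite_run: "4/4 suites passed live (test_run_id=tr_01HXYZ)"
  kind=rerecord:       "4/4 suites passed, 3/4 have a sandbox test (test_run_id=tr_01HXYZ)"
                       (3 < 4 here, so also name the unlinked suite)

has_failures reply layout
  Per-suite:
  | Suite    | Steps |
  | checkout | 4/5   |
  | login    | 3/3   |

  Failed steps:
  | Suite    | Step name    | Method + URL | Expected → Actual status | mock_mismatch |
  | checkout | create order | POST /orders | 201 → 500                | n             |

  Diagnosis: <which commit/diff hunk caused each failure, and whether it looks intended>
  Recommendation: <route A / B / C prescription per failing step, per the rules below>
```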
Frame the diagnosis from the glossary: a mock mismatch IS the signal that the sandbox test has drifted from current app behavior. The three routes below (SKIP / FIX-CODE / UPDATE-THEN-RECORD) are not separate buckets — each maps to a possible SOURCE of that drift:
* keploy proxy didn't replay correctly → drift is artificial, no real change → route A (SKIP).
* app regressed → drift is unintended, fix the code → route B.
* contract changed on purpose → drift is intentional, refresh the sandbox test → route C.
Your repo inspection picks which source applies; the routes are the prescription for that source.
DIAGNOSE WITH THE REPO, NOT THE DEV. Before recommending anything on a failing run, inspect the source tree yourself (git log / git diff against the last green run or main, read the failing handler + its downstream call sites; example invocations follow these bullets). DO NOT ask the dev "did you change X since the last green run" — you have the repo, find the answer. Only come back with a concrete conclusion.
* mock_mismatch_dominant == true → failure signature is "keploy didn't intercept the app's egress traffic". Use git to check whether the failing endpoints or their dependency wiring have been modified recently:
(a) NO relevant changes → tell the dev this is almost certainly a KEPLOY-SIDE issue and ask them to file a keploy issue with test_run_id. Do NOT ask them to re-record.
(b) Relevant changes EXIST → name them (file:line or commit hash), explain how each plausibly caused the failure, say whether the change looks intended or accidental, and tell the dev exactly what to fix.
* status == "has_failures" AND mock_mismatch_dominant == false → same discipline: identify the commit(s) / diff hunks that most likely caused each failure, state whether they look intended, and prescribe a fix (rerecord, revert, patch the handler). Don't hand the investigation back to the dev.
===== HANDLING "FIX IT" FOLLOW-UPS =====
(After the dev has seen the analysis and asks you to fix.)
═══════════════════════════════════════════════════════════════════
DO NOT JUMP TO RECORD — diagnose FIRST.
═══════════════════════════════════════════════════════════════════
A sandbox-replay failure is NOT a signal to rerecord. Re-recording without diagnosis silently captures the broken behavior as the new "expected" — masking a real app regression and erasing the evidence the dev needs.
When sandbox replay fails, your FIRST move is ALWAYS the diagnosis below (B vs C vs SKIP). You only call record_sandbox_test as part of route C, AND only AFTER update_test_suite has updated the suite to match the new intentional contract. If the contract hasn't changed (route B), DO NOT record — the captured mocks are still valid; only the app needs fixing.
If you find yourself thinking "let me just rerecord to fix this", STOP. Read failed_steps, inspect the repo for what changed, decide which route applies. Re-recording is a tool for capturing a NEW intentional contract, not a remedy for a failed run.
You have exactly THREE options for each failing step. Pick one per step based on your repo inspection; do not ask the dev which branch to take, decide:
A. SKIP — do nothing code-side. Pick this when mock_mismatch_dominant=true AND your repo inspection found no relevant changes in the failing handler or its dependencies. Rationale: this is a keploy egress-hook / proxy issue; editing the app or the test won't help. Tell the dev "flagged for keploy support, no app or test change needed" and move on to the next step (if any) or close.
B. FIX THE CODE — edit the handler / dependency wiring. Pick this when your repo inspection shows a recent change that broke the endpoint's contract AND the ORIGINAL test intent still matches what the endpoint SHOULD do (the test is correct, the code regressed). Make the minimal edit to restore expected behavior, tell the dev exactly which file:line you changed and why, then re-run: call replay_sandbox_test for the suite(s) whose steps you just un-broke. DO NOT record — the captured mocks are still valid if the contract hasn't changed intentionally.
C. UPDATE-FIRST, THEN RECORD — order matters: (1) update_test_suite first, (2) record_sandbox_test second, (3) replay_sandbox_test to verify. Calling record before update means you'd capture mocks against the OLD suite shape — defeats the purpose. Pick this when the endpoint's contract LEGITIMATELY changed (a deliberate new field, renamed response key, different status code, new required header) AND your repo inspection confirms the change is intended (commit message, surrounding diff, or obvious product direction). The update_test_suite call should edit the step's body / expected response / assertions / extract to match the new contract. Tell the dev which assertions you updated and why the contract change is considered intentional.
╔═══ ROUTE C — DECISION + RECOMMENDATION TEMPLATE (use verbatim) ═══╗
Decision input: read failed_steps[].authored_assertions and authored_response_body INLINE in this report. Do NOT call getTestSuite again unless those fields are absent (older runs).
* If an authored assertion's expected value matches the diff's "expected" side → route C is MANDATORY. The suite's contract pins the old value; you MUST update_test_suite before record_sandbox_test, otherwise the next rerecord gate-1-fails on the same assertion and the suite comes back unlinked.
* If authored_response_body has the old value but no assert is pinned to it → route C is still recommended (the captured response baseline drifts), but record_sandbox_test alone CAN succeed; choosing update_test_suite first keeps the suite source-of-truth aligned with the new contract.
* If neither pins the diverging value → route C without assertion edits is sufficient (or route B if the change is unintentional).
Mandatory recommendation phrasing for the dev (one bullet per failing step that routes to C):
"(1) update_test_suite for suite '<suite_name>' (id=<suite_id>) — change step '<step_name>' (id=<step_id>): set <field_path> from '<old>' to '<new>' and update assertion <assert_index> on the same path; (2) record_sandbox_test on that suite to refresh the captured baseline; (3) replay_sandbox_test to verify."
BANNED wording — never write any of these on a route-C recommendation:
× "re-record the sandbox tests so the baseline picks up the new value"
× "just rerecord to refresh the captured response"
× "re-record and the new value will become the expected"
× "re-record OR update assertions" (or any phrasing that joins update_test_suite and record_sandbox_test with "or" / "either … or" / "one of these two")
× "you can either update the assertions or re-record"
× "options: (a) update assertions, (b) re-record the suite"
All six drop step (1) or present the two steps as interchangeable. They are NOT alternatives — they are sequential steps in a single route-C flow: (1) update_test_suite, (2) record_sandbox_test, (3) replay_sandbox_test. Skipping (1) leaves the suite's authored assertion pinned on the old value; the next replay gate-1-fails on the same diff. If you catch yourself reaching for "or" between these two tools on a route-C recommendation, restate using the mandatory template.
╚════════════════════════════════════════════════════════════════════╝
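A filled-in instance of the mandatory template, continuing the running example (suite, step, IDs, and values hypothetical):

```
"(1) update_test_suite for suite 'checkout' (id=s_01) — change step 'create order' (id=st_03): set $.order.status from 'created order' to 'created' and update assertion 0 on the same path; (2) record_sandbox_test on that suite to refresh the captured baseline; (3) replay_sandbox_test to verify."
```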
Multiple failing steps can land in DIFFERENT branches — e.g. one step is a real app regression (B), another is a contract change (C). In that case, explain the split up-front, apply each fix, and run sandbox replay once at the end covering every affected suite.
After any B or C branch completes, the final message uses the same 3-subsection format (per-suite table → failed-steps table → diagnosis + recommendation) on the follow-up sandbox replay, PLUS a short "Fix applied" preamble naming the file:line edits (for B) or update_test_suite calls (for C). For A-only responses (all failures route to keploy), no follow-up run is needed — just restate the keploy-issue recommendation.
===== REPLAY / "EXPLAIN MY LATEST SANDBOX REPORT" =====
When the dev asks "explain my latest sandbox report" / "analyse the last run" / "why did it fail" — call this tool again with the SAME app_id + test_run_id and verbose=true so the full diagnostics come back even if nothing failed. Use that detail to answer their question. If you don't have the test_run_id to hand, list the app's most recent runs OF THE RIGHT KIND via /client/v1/apps/{app_id}/test-runs?kind=<rerecord|sandbox_run|test_suite_run> and pick the top one. NEVER list /test-runs without the kind filter and pick the latest blindly — different kinds are co-mingled in that collection, and an unfiltered list will surface a rerecord run when the dev asked for the latest sandbox replay (or vice versa). Match the kind to what the dev asked: "explain my latest record" → kind=rerecord; "explain my latest sandbox replay" / "integration test report" → kind=sandbox_run; "explain my latest live run" → kind=test_suite_run. If the dev's verb is ambiguous, ASK which kind first (per the verb-routing's explain-branch rule).