run_check
Compare test results against a golden baseline to detect regressions. Returns diff summaries and flags score drops as blocking failures, with warnings for behavioral shifts.
Instructions
Check for regressions against the golden baseline. Returns a diff summary for each test: PASSED, OUTPUT_CHANGED, TOOLS_CHANGED, or REGRESSION. REGRESSION means the score dropped significantly; treat this as a blocking failure. TOOLS_CHANGED and OUTPUT_CHANGED are warnings: the agent's behavior shifted, but the change may be intentional.

Also returns observability signals: behavioral anomalies (tool loops, stalls), trust scores (benchmark gaming detection), and coherence issues (multi-turn context loss).

Use this after any code change (prompt, model, tools) to confirm nothing broke. If you see a regression, show the diff to the user and offer to fix it before moving on. Use heal=true to auto-retry flaky failures and distinguish non-determinism from real drift.

IMPORTANT: Automatically detect test_path by looking for a 'tests/evalview/' directory in the current project. If it exists, pass it as test_path. If the project has a custom test location, use that instead.
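
The test_path auto-detection described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the tool itself: the function name `detect_test_path` and the `custom_path` parameter are hypothetical, and the fallback to 'tests' is taken from the test_path row of the Input Schema.

```python
from pathlib import Path
from typing import Optional


def detect_test_path(project_root: str, custom_path: Optional[str] = None) -> str:
    """Pick a value for test_path (hypothetical helper, not part of the tool).

    Order of preference:
      1. a custom test location, if the project has one
      2. 'tests/evalview/' if that directory exists under project_root
      3. 'tests' as the fallback
    """
    if custom_path:
        return custom_path
    if (Path(project_root) / "tests" / "evalview").is_dir():
        return "tests/evalview/"
    return "tests"
```

A caller would run this once against the project root and pass the result as the `test_path` argument.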
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| test | No | Check only this specific test by name. | all tests |
| test_path | No | Path to the test directory. Auto-detect: use 'tests/evalview/' if it exists, otherwise 'tests'. | |
| heal | No | Auto-retry flaky failures, propose candidate variants, distinguish non-determinism from real regressions. | false |
| strict | No | Fail on any change (REGRESSION, TOOLS_CHANGED, OUTPUT_CHANGED). | false |
| ai_root_cause | No | Use AI to explain low-confidence regressions with root-cause analysis. | false |
| statistical | No | Run each test N times for variance analysis (e.g. 5). Omit for single run. | |
| auto_variant | No | Auto-discover distinct execution paths as golden variants. Requires statistical. | false |
| budget | No | Maximum total budget in dollars (e.g. 0.50). Remaining tests are skipped when the limit is hit. | |
| dry_run | No | Preview test plan and estimate cost without executing. | false |
| tag | No | Check only tests tagged with these behaviors (OR match). E.g. ['tool_use', 'retrieval']. | |
| fail_on | No | Comma-separated statuses to fail on. E.g. 'REGRESSION,TOOLS_CHANGED'. | REGRESSION |
| timeout | No | Timeout per test in seconds. | 120 |
| report | No | Generate HTML report at this path (auto-opens in browser). | |
| judge | No | Judge model for scoring (e.g. 'gpt-5', 'sonnet'). | |
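
As a rough illustration of how the `strict` and `fail_on` options relate to the four statuses, the sketch below computes which tests would count as blocking. It is an assumption-laden model, not the tool's implementation: the helper names `failing_statuses` and `blocking_tests` are hypothetical, the shape of `results` (test name mapped to status) is invented for the example, and letting `strict` take precedence over `fail_on` is a guess, since the schema does not say how the two interact.

```python
from typing import Dict, List, Optional, Set

# The three non-PASSED statuses named in the tool description.
ALL_CHANGE_STATUSES = {"REGRESSION", "TOOLS_CHANGED", "OUTPUT_CHANGED"}


def failing_statuses(strict: bool = False, fail_on: Optional[str] = None) -> Set[str]:
    """Return the statuses treated as failures (hypothetical helper)."""
    if strict:
        # Assumption: strict overrides fail_on and fails on any change.
        return set(ALL_CHANGE_STATUSES)
    if fail_on:
        # fail_on is a comma-separated string such as 'REGRESSION,TOOLS_CHANGED'.
        return {s.strip().upper() for s in fail_on.split(",") if s.strip()}
    # Documented default: only REGRESSION fails.
    return {"REGRESSION"}


def blocking_tests(
    results: Dict[str, str],
    strict: bool = False,
    fail_on: Optional[str] = None,
) -> List[str]:
    """List test names whose status is in the failing set, sorted by name."""
    failing = failing_statuses(strict, fail_on)
    return sorted(name for name, status in results.items() if status in failing)


# Example arguments a caller might pass to run_check (values are illustrative).
example_arguments = {
    "test_path": "tests/evalview/",
    "heal": True,
    "fail_on": "REGRESSION,TOOLS_CHANGED",
    "timeout": 120,
}
```

With the defaults, only tests reported as REGRESSION block; widening `fail_on` or setting `strict` promotes the warning statuses to failures.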