compare_runs
Compare two evaluation runs to detect regressions by flagging metrics that worsened by more than a configurable tolerance.
Instructions
Compare two evaluation runs and detect regressions. Flags each metric that worsened by more than a configurable tolerance.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| baseline_path | Yes | Path to baseline evaluation summary JSON. | |
| current_path | Yes | Path to current evaluation summary JSON. | |
| tolerance | No | Per-metric regression tolerance. | |
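
As a rough illustration of the comparison described above, the following Python sketch flags metrics whose value dropped by more than the allowed tolerance. It is not the tool's actual implementation. It assumes each summary JSON is a flat object mapping metric names to numbers, that higher values are better, and that `tolerance` is either a single number or a per-metric mapping. The helper names `load_summary` and `find_regressions` are hypothetical.

```python
import json
from pathlib import Path


def load_summary(path):
    """Read a summary file, assumed here to be a flat JSON object of metric -> number."""
    return json.loads(Path(path).read_text())


def find_regressions(baseline, current, tolerance=0.0):
    """Return {metric: drop} for metrics that fell by more than their tolerance.

    Assumes higher values are better. `tolerance` may be one number applied to
    every metric, or a dict of per-metric values (unlisted metrics get 0.0).
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric absent from the current run; not compared in this sketch
        allowed = tolerance.get(name, 0.0) if isinstance(tolerance, dict) else tolerance
        drop = base_value - current[name]
        if drop > allowed:
            regressions[name] = drop
    return regressions


# Illustrative values only; real summaries would come from load_summary(...).
baseline = {"accuracy": 0.75, "f1": 0.5}
current = {"accuracy": 0.5, "f1": 0.5}
print(find_regressions(baseline, current, tolerance={"accuracy": 0.125}))
# prints {'accuracy': 0.25}
```

With real files, the two dictionaries would be loaded with `load_summary(baseline_path)` and `load_summary(current_path)`. Metrics where lower is better (for example, a loss) would need the sign of the comparison flipped.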