eval_compare_runs
Compare two report JSONs to calculate pass-rate and score deltas, list regressions and improvements, and provide a McNemar p-value for statistical significance.
Instructions
Compare two multivon-eval report JSONs and return a structured diff.
Loads both reports from disk (the JSON produced by EvalReport.to_json()), pairs cases by case_input, and returns pass-rate / average-score deltas plus the per-case regressions and improvements lists. Includes a McNemar p-value so the agent can tell a real shift from small-sample noise.
Use this when you've made a prompt / retrieval / model change and want to know whether the new run actually improved over the baseline, not just in aggregate but case by case.
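The pairing and significance steps described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the per-case keys `passed` and the list-of-dicts layout are assumptions about the report JSON, the helper names `pair_cases` and `mcnemar_exact` are hypothetical, and the exact (binomial) form of McNemar's test is used here because it behaves well on small samples. The tool may use a different variant.

```python
from math import comb


def pair_cases(baseline_cases, new_cases):
    """Pair cases from two runs by their case_input value.

    Assumes each case is a dict with a "case_input" key (hypothetical layout).
    Returns (pairs, added_count, removed_count).
    """
    base = {c["case_input"]: c for c in baseline_cases}
    new = {c["case_input"]: c for c in new_cases}
    shared = [key for key in base if key in new]
    pairs = [(base[key], new[key]) for key in shared]
    added = len(new) - len(shared)      # only in the new run
    removed = len(base) - len(shared)   # only in the baseline run
    return pairs, added, removed


def mcnemar_exact(pairs):
    """Two-sided exact McNemar p-value over paired pass/fail outcomes.

    Only discordant pairs carry information: b counts pass -> fail
    (regressions), c counts fail -> pass (improvements). Returns None
    when there are no discordant pairs (an assumption about when the
    tool reports null).
    """
    b = sum(1 for old, cur in pairs if old["passed"] and not cur["passed"])
    c = sum(1 for old, cur in pairs if not old["passed"] and cur["passed"])
    n = b + c
    if n == 0:
        return None
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With one regression and one improvement the test has nothing to distinguish, so the p-value is 1.0; six regressions and no improvements give 2/64 = 0.03125.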
Args:
baseline_json_path: Filesystem path to the baseline report JSON (e.g. "runs/baseline.json").
new_json_path: Filesystem path to the new / proposal report JSON to compare against the baseline.
Returns:
A dict with:
- pass_rate_delta: float, new pass rate minus baseline pass rate
- avg_score_delta: float, new average score minus baseline average score
- regressions: list of dicts with input, baseline_status, proposal_status, baseline_score, proposal_score
- improvements: same shape as regressions
- mcnemar_p_value: float or null; paired-test p-value
- baseline / proposal: summary blocks with name, pass_rate, avg_score, errors, flaky
- paired_count / added_count / removed_count: pairing stats so the caller can see how many cases lined up vs. drifted between runs
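One way a caller might act on the returned dict is sketched below. The helper name `verdict` and the 0.05 threshold are assumptions made for illustration, and the sample dict holds made-up values rather than real tool output; only the key names come from the Returns description above.

```python
def verdict(diff, alpha=0.05):
    """Classify a diff dict as 'improved', 'regressed', or 'inconclusive'.

    alpha is a conventional significance threshold chosen for this sketch,
    not a value defined by the tool.
    """
    p = diff.get("mcnemar_p_value")
    if p is None or p >= alpha:
        # Either no paired test was possible or the shift may be noise.
        return "inconclusive"
    delta = diff["pass_rate_delta"]
    if delta > 0:
        return "improved"
    if delta < 0:
        return "regressed"
    return "inconclusive"


# Illustrative values only.
example_diff = {
    "pass_rate_delta": 0.12,
    "avg_score_delta": 0.08,
    "regressions": [],
    "improvements": [],
    "mcnemar_p_value": 0.02,
    "paired_count": 50,
    "added_count": 0,
    "removed_count": 0,
}
print(verdict(example_diff))
```

Checking added_count and removed_count before trusting the deltas is also worthwhile: if many cases failed to pair, the two runs are not measuring the same test set.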
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
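| baseline_json_path | Yes | Filesystem path to the baseline report JSON (e.g. "runs/baseline.json"). | |
| new_json_path | Yes | Filesystem path to the new / proposal report JSON to compare against the baseline. | |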
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |