compare_agents
Compare two AI agent endpoints side-by-side on the same test suite to A/B test model, prompt, or architecture changes. Results include per-test score deltas, tool diffs, and an optional HTML report.
Instructions
Compare two agent endpoints side-by-side on the same test suite. This is useful for A/B testing a new model, prompt change, or architecture swap. The tool returns per-test score deltas, tool diffs, and an optional HTML report.
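As a rough illustration of what a per-test score delta means, the sketch below subtracts the baseline score from the candidate score for each test that both runs share. The function name `score_deltas`, the test names, the score values, and the sign convention (candidate minus baseline) are all assumptions made for illustration; the tool's actual result format is not documented here.

```python
def score_deltas(baseline, candidate):
    """Return {test_name: candidate_score - baseline_score} for tests in both runs.

    Hypothetical helper: the real tool may report deltas differently.
    """
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


# Made-up scores, purely illustrative (not real results).
baseline_scores = {"refund_flow": 0.80, "order_lookup": 0.90}
candidate_scores = {"refund_flow": 0.85, "order_lookup": 0.70}

deltas = score_deltas(baseline_scores, candidate_scores)
for name, delta in deltas.items():
    # A positive delta means the candidate scored higher than the baseline.
    print(f"{name}: {delta:+.2f}")
```

Under this convention, a negative delta flags a test where the candidate regressed relative to the baseline.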
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| v1 | Yes | Baseline agent endpoint URL (e.g. 'http://localhost:8000/invoke') | |
| v2 | Yes | Candidate agent endpoint URL (e.g. 'http://localhost:8001/invoke') | |
| tests | No | Test directory | 'tests/' |
| adapter | No | Adapter type (http, langgraph, crewai, etc.) | From config, or http |
| label_v1 | No | Label for v1 in the report | 'baseline' |
| label_v2 | No | Label for v2 in the report | 'candidate' |
| no_judge | No | Skip LLM-as-judge and use deterministic scoring only | false |
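The schema above can be sketched as an arguments payload. This is a minimal example, assuming the tool accepts its parameters as a flat JSON object; `build_compare_args` is a hypothetical helper, not part of the tool, and how the payload is sent depends on the client you use.

```python
import json

# Defaults taken from the Input Schema table. The adapter default is
# "from config or http", so it is left unset here for the tool to resolve.
DEFAULTS = {
    "tests": "tests/",
    "label_v1": "baseline",
    "label_v2": "candidate",
    "no_judge": False,
}
REQUIRED = ("v1", "v2")
ALLOWED = set(REQUIRED) | set(DEFAULTS) | {"adapter"}


def build_compare_args(**overrides):
    """Hypothetical helper: validate compare_agents arguments and fill in defaults."""
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    missing = [name for name in REQUIRED if not overrides.get(name)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return {**DEFAULTS, **overrides}


args = build_compare_args(
    v1="http://localhost:8000/invoke",  # baseline endpoint (example URL from the table)
    v2="http://localhost:8001/invoke",  # candidate endpoint (example URL from the table)
    no_judge=True,                      # deterministic scoring only
)
print(json.dumps(args, indent=2))
```

Only `v1` and `v2` must be supplied; everything else falls back to the documented default.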