benchmark_run
Records a tool-measured benchmark run. Setting arm to 'baseline' establishes the bar that challenger runs must surpass.
Instructions
Record a tool-measured run of an arm through the frozen benchmark. arm:"baseline" sets the bar challengers must beat. Requires a measurementRef → a recorded raw artifact; model self-report never sets the bar.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| arm | Yes | "baseline" or a hypothesis id | |
| runId | Yes | ||
| measurementRef | Yes |