test_hypothesis
Records a full test of a hypothesis by running 3-5 frontier agents end-to-end, comparing aggregated measurements against a frozen baseline to determine improvement or log a failure.
Instructions
Record ONE full test of a hypothesis = 3–5 frontier agents that actually ran the loop end-to-end. Every agent run must carry a measurementRef (tool-measured). Aggregates vs the frozen baseline bar; a no-improvement run is NO_IMPROVEMENT, never "perfect", and bumps the failure counter.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| runId | Yes | ||
| fullTest | Yes | ||
| hypothesisId | Yes |