iterate
Run an agent repeatedly, using evaluator feedback from each attempt to correct outputs until a success threshold is reached or time/budget constraints trigger a halt.
Instructions
Iteratively refine an agent's output until an evaluator is satisfied.
Runs the agent up to max_iterations times. Each iteration sees the
prior attempts' outputs, scores, and issues injected into its prompt, so
the agent can correct the specific shortcomings the evaluator called out
on the last pass. Halts when any of these become true:
Evaluator score >=
success_threshold(success).patienceiterations pass with no improvement to the best score.Total agent cost exceeds
max_budget(best-so-far).Wall-time exceeds
max_wall_timeseconds (best-so-far).max_iterationsreached (best-so-far).
Evaluator forms (pass exactly one of target_type or evaluator):
target_type="some-type"— shorthand forevaluator="validate:some-type". LLM validator against a registered type. Best for text artifacts whose correctness is a matter of shape and content.evaluator="validate:<type>"— explicit form of the above.evaluator="exec:<shell cmd>"— ground-truth executor. The agent's output is written to a tempfile and the command runs with$ARTIFACTset to the path (and{artifact}substituted in the template). Exit 0 scores 1.0; non-zero scores 0.0 with stderr parsed into issues. Use for artifacts with compile/build/test semantics (docker build,pytest, etc.) — this is the only way to get ground-truth feedback.evaluator="score:<criterion>"— ad-hoc LLM rubric scoring via haiku.
Args:
prompt: Base task description sent to the agent on iteration 1; on
subsequent iterations it is augmented with a "Prior attempts"
section summarising previous outputs, scores, and issues.
target_type: Shorthand for evaluator="validate:<target_type>". Also
injects the type as output_type context on the agent prompt.
evaluator: Full evaluator directive (validate:, exec:, or
score:). Takes precedence over target_type if both given.
max_iterations: Hard cap on iteration count (default: 10).
success_threshold: Score at or above which we declare success
(default: 0.9). Must be in [0.0, 1.0].
patience: Halt if patience consecutive iterations fail to improve
on the best score so far (default: 3).
max_budget: Optional USD cap on total agent (proposer) cost. Evaluator
cost is not counted. Halts best-so-far on breach.
max_wall_time: Optional wall-time cap in seconds. Halts best-so-far
on breach.
sandbox: Named sandbox spec or inline JSON for the agent.
model: Agent (proposer) model (default: sonnet).
timeout: Per-iteration agent timeout in seconds.
mcps: JSON array of MCP server names attached to the agent.
Returns:
JSON with run_id, iterations, halted_because,
best_iteration, best_ref (validated-stamped), best_score,
total_cost, and the full attempts trace with per-iteration
ref / score / issues / cost.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| prompt | Yes | ||
| target_type | No | ||
| evaluator | No | ||
| max_iterations | No | ||
| success_threshold | No | ||
| patience | No | ||
| max_budget | No | ||
| max_wall_time | No | ||
| sandbox | No | ||
| model | No | sonnet | |
| timeout | No | ||
| mcps | No |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |