aa_evaluate_critpt
Submit a complete batch of CritPt submissions for official benchmark evaluation and receive accuracy, timeout, and error metrics.
Instructions
Submit a complete CritPt benchmark batch for official evaluation.
The upstream endpoint requires submissions for all public CritPt problems in
one request. Each submission should include problem_id, generated_code,
model, and generation_config. This tool can take substantial time because
the upstream grading service runs benchmark evaluation jobs.
Args:
submissions: Complete list of CritPt submission objects.
batch_metadata: Optional metadata for the batch.
Returns:
JSON response with accuracy, timeout rate, and judge/server error counts.
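Because the upstream endpoint rejects partial batches and grading is slow, it can help to check a batch locally before calling the tool. The following is a minimal sketch, not the tool's own validation: the helper name `check_batch`, the example problem IDs, the model name, and the shape of `generation_config` and `batch_metadata` are all hypothetical. Only the four field names and the two argument names come from the description above.

```python
import json

# Field names taken from the tool description; their value types are assumed.
REQUIRED_FIELDS = ("problem_id", "generated_code", "model", "generation_config")


def check_batch(submissions, expected_problem_ids):
    """Return a list of problems found; an empty list means the batch looks complete."""
    problems = []
    seen = set()
    for index, item in enumerate(submissions):
        # Every submission should carry all four documented fields.
        missing = [name for name in REQUIRED_FIELDS if name not in item]
        if missing:
            problems.append(f"submission {index} is missing: {', '.join(missing)}")
        if "problem_id" in item:
            seen.add(item["problem_id"])
    # The endpoint expects every public problem in a single request.
    for problem_id in sorted(set(expected_problem_ids) - seen):
        problems.append(f"no submission for problem_id: {problem_id}")
    return problems


# Hypothetical IDs; the real list of public CritPt problems must come from the benchmark.
expected = ["example-1", "example-2"]

submissions = [
    {
        "problem_id": "example-1",
        "generated_code": "def answer():\n    return 1\n",
        "model": "example-model",  # hypothetical model name
        "generation_config": {"temperature": 0.0},  # assumed to be a JSON object
    },
]

# Arguments as they would be passed to the tool; batch_metadata is optional.
arguments = {"submissions": submissions, "batch_metadata": {"note": "dry run"}}

issues = check_batch(submissions, expected)
print(issues)
print(json.dumps(arguments, indent=2))
```

Here the check reports that `example-2` has no submission, so the batch would not be sent until that gap is filled.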
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| submissions | Yes | Complete list of CritPt submission objects. | |
| batch_metadata | No | Optional metadata for the batch. | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | JSON response with accuracy, timeout rate, and judge/server error counts. | |