benchmark_series
Track an AI model's performance evolution on a specific benchmark (e.g., swe_bench, mmlu_pro) over a custom date range. Get score changes to evaluate progress or regression. Costs 1 credit per query.
Instructions
Returns score evolution for a single benchmark on one AI model. Costs 1 credit. Valid benchmark keys: swe_bench, mmlu_pro, gpqa_diamond, math, human_eval.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Model id or display name | |
| benchmark | Yes | Benchmark key (e.g. swe_bench, mmlu_pro, gpqa_diamond, math, human_eval) | |
| from | No | Start date, YYYY-MM-DD (UTC) | 30 days ago |
| to | No | End date, YYYY-MM-DD (UTC) | today |
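The schema above can be sketched as a small client-side helper. This is a minimal sketch, assuming the query is a flat JSON-style object with the field names from the table; `build_benchmark_series_query` is a hypothetical helper name, not part of the tool itself, and the date defaults mirror the schema (30 days ago and today).

```python
from datetime import date, timedelta

# Benchmark keys listed in the schema above.
VALID_BENCHMARKS = {"swe_bench", "mmlu_pro", "gpqa_diamond", "math", "human_eval"}

def build_benchmark_series_query(model, benchmark, from_date=None, to_date=None):
    """Build a benchmark_series query dict, applying the documented defaults.

    from_date/to_date are datetime.date objects; when omitted, they default
    to 30 days ago and today, per the input schema.
    """
    if benchmark not in VALID_BENCHMARKS:
        raise ValueError(f"unknown benchmark key: {benchmark}")
    today = date.today()
    return {
        "model": model,
        "benchmark": benchmark,
        "from": (from_date or today - timedelta(days=30)).isoformat(),
        "to": (to_date or today).isoformat(),
    }
```

For example, `build_benchmark_series_query("gpt-4o", "swe_bench")` yields a query covering the default 30-day window, while passing explicit `from_date`/`to_date` values selects a custom range.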