Plan Reward Hacking Guardrails
plan_reward_hacking_guardrailsDetect reward-hacking patterns in agent outputs, including sycophancy, benchmark overfitting, and proxy-only metrics, to prevent evaluation manipulation.
Instructions
Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | No | Candidate response, claim, summary, or verifier output to inspect. | |
| metrics | No | Proxy metrics or reward scores used by the workflow. | |
| evidence | No | Evidence artifacts attached to the claim. | |
| workflow | No | Agent workflow or release lane being evaluated. | |
| hasHoldout | No | Whether holdout, regression, or real-workflow evidence exists. | |
| multimodal | No | Whether claims depend on screenshots, PDFs, charts, images, or video. | |
| hasVerifierTrace | No | Whether verifier trace, run log, or proof artifact exists. | |
| hasHumanObjective | No | Whether proxy metrics are mapped to a user objective. | |
| optimizedForScore | No | Whether an eval, benchmark, or reward score is being optimized. |