Plan Reward Hacking Guardrails
plan_reward_hacking_guardrails
Inspect agent responses and metrics to detect reward-hacking patterns including sycophancy, benchmark overfitting, and unsupported claims.
Instructions
Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| workflow | No | Agent workflow or release lane being evaluated. | |
| text | No | Candidate response, claim, summary, or verifier output to inspect. | |
| evidence | No | Evidence artifacts attached to the claim. | |
| metrics | No | Proxy metrics or reward scores used by the workflow. | |
| hasHoldout | No | Whether holdout, regression, or real-workflow evidence exists. | |
| hasHumanObjective | No | Whether proxy metrics are mapped to a user objective. | |
| hasVerifierTrace | No | Whether verifier trace, run log, or proof artifact exists. | |
| optimizedForScore | No | Whether an eval, benchmark, or reward score is being optimized. | |
| multimodal | No | Whether claims depend on screenshots, PDFs, charts, images, or video. | |
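
As a rough illustration of the schema above, the sketch below assembles an argument payload for plan_reward_hacking_guardrails. The field names come from the table; the value types (string, list, dict, boolean), the helper `build_guardrail_args`, and the sample values are assumptions made for this example, since the schema does not state types or how the tool is invoked.

```python
import json

# Field names are taken from the input schema table. The value types and
# this helper are assumptions for illustration, not part of the tool.
KNOWN_FIELDS = {
    "workflow", "text", "evidence", "metrics",
    "hasHoldout", "hasHumanObjective", "hasVerifierTrace",
    "optimizedForScore", "multimodal",
}


def build_guardrail_args(**fields):
    """Return a dict of tool arguments, rejecting names not in the schema."""
    unknown = set(fields) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # Every field is optional, so omit anything left as None.
    return {name: value for name, value in fields.items() if value is not None}


# Hypothetical sample: a completion claim with a high proxy score
# but no supporting evidence, holdout results, or verifier trace.
args = build_guardrail_args(
    workflow="release-candidate-review",
    text="All tests pass and the migration is complete.",
    evidence=[],
    metrics={"benchmark_score": 0.97},
    hasHoldout=False,
    hasHumanObjective=False,
    hasVerifierTrace=False,
    optimizedForScore=True,
    multimodal=False,
)

print(json.dumps(args, indent=2))
```

The sample payload pairs an optimized score with no holdout, objective mapping, or verifier trace, which is the kind of input the tool's description suggests it is meant to inspect. How the payload is actually submitted depends on the client and is not shown here.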