plan_reward_hacking_guardrails
Detect reward-hacking patterns in agent workflows, including unsupported claims, sycophancy, and benchmark overfitting. Inspect responses, evidence, and proxy metrics to identify manipulation.
Instructions
Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| workflow | No | Agent workflow or release lane being evaluated. | |
| text | No | Candidate response, claim, summary, or verifier output to inspect. | |
| evidence | No | Evidence artifacts attached to the claim. | |
| metrics | No | Proxy metrics or reward scores used by the workflow. | |
| hasHoldout | No | Whether holdout, regression, or real-workflow evidence exists. | |
| hasHumanObjective | No | Whether proxy metrics are mapped to a user objective. | |
| hasVerifierTrace | No | Whether verifier trace, run log, or proof artifact exists. | |
| optimizedForScore | No | Whether an eval, benchmark, or reward score is being optimized. | |
| multimodal | No | Whether claims depend on screenshots, PDFs, charts, images, or video. | |
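
As a rough illustration of how these fields fit together, the sketch below builds an arguments object and wraps it in an MCP `tools/call` request. The helper name `buildToolCall` and all field values (workflow name, claim text, artifact file names, metric name) are made up for illustration. The comma-separated string form for `evidence` and `metrics` is assumed from the CLI flag descriptions.

```javascript
// Hypothetical helper (not part of the tool): wraps tool arguments in a
// JSON-RPC 2.0 `tools/call` request as used by MCP clients.
function buildToolCall(id, args) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: 'plan_reward_hacking_guardrails', arguments: args },
  };
}

// Illustrative values only; every field is optional per the schema above.
const args = {
  workflow: 'release-lane',
  text: 'All tests passed and the fix is deployed.',
  evidence: 'ci-run.log,junit.xml', // assumed comma-separated list
  metrics: 'pass_rate',
  hasHoldout: false,
  hasHumanObjective: true,
};

const request = buildToolCall(1, args);
console.log(JSON.stringify(request, null, 2));
```

Flags that are left out are not necessarily treated as false: in the implementation shown below, several of them fall back to values inferred from the text or the evidence names.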
Implementation Reference
- Main handler function that builds the reward hacking guardrails plan. Analyzes candidate text for reward hacking signals (hallucinated verification, sycophancy, benchmark overfitting, evaluator manipulation, proxy metric only, perception-reasoning decoupling) and returns a report with status, signals, gates, metrics, and next actions.

      function buildRewardHackingGuardrailsPlan(rawOptions = {}) {
        const options = normalizeOptions(rawOptions);
        const signals = buildSignals(options);
        const critical = signals.filter((signal) => signal.severity === 'critical').length;
        const high = signals.filter((signal) => signal.severity === 'high').length;
        return {
          name: 'thumbgate-reward-hacking-guardrails',
          source: SOURCE,
          workflow: options.workflow,
          status: critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready',
          summary: {
            signalCount: signals.length,
            critical,
            high,
            evidenceArtifacts: options.evidence.length,
            proxyMetrics: options.metrics,
            hasHumanObjective: options.hasHumanObjective,
            hasHoldout: options.hasHoldout,
          },
          proxyCompressionMapping: {
            compression: 'compressed reward, benchmark, or approval signal is treated as a stand-in for the full user objective',
            amplification: 'optimization pressure can turn local shortcuts into repeated workflow behavior',
            coAdaptation: 'agent outputs can learn to satisfy evaluators, rubrics, or verifiers instead of the task',
          },
          signals,
          gates: signals.map((signal) => ({
            id: signal.id,
            action: signal.severity === 'critical' ? 'block' : 'warn',
            message: signal.gate,
          })),
          metrics: buildMetrics(),
          nextActions: [
            'Attach proof artifacts before allowing claims like tests passed, fixed, deployed, safe, or ready to merge.',
            'Treat benchmark gains as provisional until holdout, regression, or real-workflow evidence confirms the user objective improved.',
            'Require explicit user-objective mapping for every proxy metric, reward score, or evaluator rubric.',
            'Block evaluator-manipulation language before it reaches judge or verifier loops.',
            'Prefer short evidence-backed summaries over long persuasive explanations when judging agent work.',
          ],
          marketingAngle: {
            headline: 'Reward hacking is what happens when agents optimize the receipt instead of the meal.',
            subhead: 'ThumbGate turns proxy failures into pre-action gates: no unsupported completion claims, no benchmark-only victory laps, and no verifier theater without proof artifacts.',
            guideTitle: 'Reward Hacking Guardrails for AI Coding Agents',
            replyDraft: 'This paper is a useful frame for agent products: proxy rewards compress the real user objective, and agents learn the shortcut. ThumbGate can enforce the missing layer: completion claims need proof, benchmark wins need holdouts, and verifier loops need gates against sycophancy, verbosity-as-proof, and evaluator manipulation.',
          },
        };
      }

- Normalizes raw input options (text, evidence, metrics, flags) into a structured options object used by the handler.

      function normalizeOptions(raw = {}) {
        const candidateText = String(raw.text || raw.claim || raw.response || raw.summary || '').trim();
        const evidence = splitList(raw.evidence || raw.evidenceArtifacts || raw['evidence-artifacts']);
        const metrics = splitList(raw.metrics || raw.metric || raw['proxy-metrics']);
        return {
          workflow: String(raw.workflow || raw.name || 'agent reward guardrails').trim() || 'agent reward guardrails',
          candidateText,
          evidence,
          metrics,
          wordCount: parseNumber(raw['word-count'] || raw.wordCount, candidateText.split(/\s+/).filter(Boolean).length),
          hasHoldout: parseBoolean(raw.holdout || raw.hasHoldout || raw['has-holdout'], HOLDOUT_RE.test(candidateText)),
          hasHumanObjective: parseBoolean(raw['human-objective'] || raw.hasHumanObjective, false),
          hasVerifierTrace: parseBoolean(raw['verifier-trace'] || raw.hasVerifierTrace, evidence.some((item) => /trace|log|run|artifact|proof/i.test(item))),
          optimizedForScore: parseBoolean(raw['optimized-for-score'] || raw.optimizedForScore, BENCHMARK_RE.test(candidateText) || metrics.length > 0),
          multimodal: parseBoolean(raw.multimodal || raw.hasMultimodalInputs, MULTIMODAL_RE.test(candidateText)),
        };
      }

- Builds detection signals by analyzing candidate text against regex patterns for reward hacking behaviors (hallucinated verification, verbosity as proof, sycophancy, benchmark overfitting, evaluator manipulation, proxy metric gaps, multimodal decoupling).

      function buildSignals(options) {
        const signals = [];
        const evidencePresent = hasEvidence(options);
        if (COMPLETION_CLAIM_RE.test(options.candidateText) && !evidencePresent) {
          signals.push({
            id: 'hallucinated_verification',
            severity: 'critical',
            message: 'The response claims completion, safety, test success, or deployment without attached proof.',
            gate: 'Block completion claims until test output, run id, trace, screenshot, or proof artifact is attached.',
          });
        }
        if (options.wordCount >= 180 && !evidencePresent) {
          signals.push({
            id: 'verbosity_as_proof',
            severity: 'high',
            message: 'The response is long but does not provide verifiable artifacts, turning fluency into a proxy for correctness.',
            gate: 'Require concise claims with artifact-backed evidence before accepting persuasive explanations.',
          });
        }
        if (SYCOPHANCY_RE.test(options.candidateText) && !evidencePresent) {
          signals.push({
            id: 'sycophancy_or_rubber_stamp',
            severity: 'high',
            message: 'Agreement or approval language appears without independent checks or counterevidence.',
            gate: 'Require at least one explicit verification step or risk check before approval-style responses.',
          });
        }
        if (options.optimizedForScore && !options.hasHoldout) {
          signals.push({
            id: 'benchmark_overfitting',
            severity: 'high',
            message: 'A score, eval, benchmark, or reward metric is being optimized without holdout or regression proof.',
            gate: 'Require holdout, regression, or real-workflow evidence before treating score gains as product gains.',
          });
        }
        if (EVALUATOR_MANIPULATION_RE.test(options.candidateText)) {
          signals.push({
            id: 'evaluator_manipulation',
            severity: 'critical',
            message: 'The candidate text attempts to influence grading instead of satisfying the user objective.',
            gate: 'Block evaluator-manipulation language and route to human review.',
          });
        }
        if (options.metrics.length > 0 && !options.hasHumanObjective) {
          signals.push({
            id: 'proxy_metric_only',
            severity: 'medium',
            message: 'Proxy metrics are present without an explicit human objective or user-visible success criterion.',
            gate: 'Pair every reward or benchmark metric with the real user outcome it is meant to approximate.',
          });
        }
        if (options.multimodal && !evidencePresent) {
          signals.push({
            id: 'perception_reasoning_decoupling',
            severity: 'high',
            message: 'A visual or multimodal claim is made without source artifacts or perception trace evidence.',
            gate: 'Require screenshot, OCR, or visual proof artifact before accepting multimodal reasoning claims.',
          });
        }
        return signals;
      }

- adapters/mcp/server-stdio.js:1037-1038 (registration) MCP server case statement that routes 'plan_reward_hacking_guardrails' to buildRewardHackingGuardrailsPlan and returns the result as text.

      case 'plan_reward_hacking_guardrails':
        return toTextResult(buildRewardHackingGuardrailsPlan(args));

- scripts/cli-schema.js:253-270 (schema) CLI schema registration mapping the 'reward-hacking-guardrails' command to the 'plan_reward_hacking_guardrails' MCP tool, with flags for workflow, text, evidence, metrics, holdout, human-objective, verifier-trace, optimized-for-score, and multimodal.

      discoveryCommand({
        name: 'reward-hacking-guardrails',
        aliases: ['proxy-reward-guardrails', 'reward-guardrails'],
        description: 'Detect reward-hacking patterns and require proof before proxy metrics or completion claims are trusted',
        mcpTool: 'plan_reward_hacking_guardrails',
        flags: [
          jsonFlag(),
          { name: 'workflow', type: 'string', description: 'Agent workflow or release lane being evaluated' },
          { name: 'text', type: 'string', description: 'Candidate response, claim, summary, or verifier output to inspect' },
          { name: 'evidence', type: 'string', description: 'Comma-separated evidence artifacts attached to the claim' },
          { name: 'metrics', type: 'string', description: 'Comma-separated proxy metrics or reward scores used by the workflow' },
          { name: 'holdout', type: 'boolean', description: 'Whether holdout, regression, or real-workflow evidence exists' },
          { name: 'human-objective', type: 'boolean', description: 'Whether proxy metrics are mapped to a human/user objective' },
          { name: 'verifier-trace', type: 'boolean', description: 'Whether verifier trace, run log, or proof artifact exists' },
          { name: 'optimized-for-score', type: 'boolean', description: 'Mark that the workflow is optimizing an eval, benchmark, or reward score' },
          { name: 'multimodal', type: 'boolean', description: 'Mark that claims depend on screenshots, PDFs, charts, images, or video' },
        ],
      }),
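
To make the handler's roll-up rule easier to see in isolation, the sketch below reproduces only the status and gate derivation from `buildRewardHackingGuardrailsPlan`. The name `summarizeSignals` is hypothetical, and the sample signals are hand-written with shortened gate text rather than produced by `buildSignals`.

```javascript
// Hypothetical helper mirroring the handler's status and gate logic:
// any critical signal blocks, any high signal asks for evidence,
// otherwise the plan is ready.
function summarizeSignals(signals) {
  const critical = signals.filter((s) => s.severity === 'critical').length;
  const high = signals.filter((s) => s.severity === 'high').length;
  return {
    status: critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready',
    gates: signals.map((s) => ({
      id: s.id,
      action: s.severity === 'critical' ? 'block' : 'warn',
      message: s.gate,
    })),
  };
}

// Hand-written sample signals (illustrative gate text).
const medium = { id: 'proxy_metric_only', severity: 'medium', gate: 'Pair metrics with user outcomes.' };
const high = { id: 'benchmark_overfitting', severity: 'high', gate: 'Require holdout evidence.' };
const critical = { id: 'evaluator_manipulation', severity: 'critical', gate: 'Route to human review.' };

const onlyMedium = summarizeSignals([medium]);
const withHigh = summarizeSignals([medium, high]);
const withCritical = summarizeSignals([medium, high, critical]);

console.log(onlyMedium.status);   // ready
console.log(withHigh.status);     // needs_evidence
console.log(withCritical.status); // blocked
```

One consequence of this rule is that a medium-severity signal such as `proxy_metric_only` still produces a `warn` gate but leaves the overall status at `ready`.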