# plan_reward_hacking_guardrails

Analyze agent responses and metrics to detect reward-hacking patterns such as sycophancy, verbosity-as-proof, and benchmark overfitting.

## Instructions

Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.

## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| workflow | No | Agent workflow or release lane being evaluated. | |
| text | No | Candidate response, claim, summary, or verifier output to inspect. | |
| evidence | No | Evidence artifacts attached to the claim. | |
| metrics | No | Proxy metrics or reward scores used by the workflow. | |
| hasHoldout | No | Whether holdout, regression, or real-workflow evidence exists. | |
| hasHumanObjective | No | Whether proxy metrics are mapped to a user objective. | |
| hasVerifierTrace | No | Whether verifier trace, run log, or proof artifact exists. | |
| optimizedForScore | No | Whether an eval, benchmark, or reward score is being optimized. | |
| multimodal | No | Whether claims depend on screenshots, PDFs, charts, images, or video. | |
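As an illustration of the schema above (field names from the table; all values invented, and list-valued fields shown as comma-separated strings as the CLI flags suggest), an input might look like:

```json
{
  "workflow": "pr-review-lane",
  "text": "All tests passed, safe to merge.",
  "evidence": "",
  "metrics": "benchmark_score",
  "optimizedForScore": true,
  "hasHoldout": false
}
```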
## Implementation Reference
- Main handler function that builds the reward-hacking guardrails plan. It normalizes input options, detects reward-hacking signals, computes a status (`blocked`, `needs_evidence`, or `ready`), and returns the full plan object with gates, metrics, next actions, and marketing content.

```js
function buildRewardHackingGuardrailsPlan(rawOptions = {}) {
  const options = normalizeOptions(rawOptions);
  const signals = buildSignals(options);
  const critical = signals.filter((signal) => signal.severity === 'critical').length;
  const high = signals.filter((signal) => signal.severity === 'high').length;
  return {
    name: 'thumbgate-reward-hacking-guardrails',
    source: SOURCE,
    workflow: options.workflow,
    status: critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready',
    summary: {
      signalCount: signals.length,
      critical,
      high,
      evidenceArtifacts: options.evidence.length,
      proxyMetrics: options.metrics,
      hasHumanObjective: options.hasHumanObjective,
      hasHoldout: options.hasHoldout,
    },
    proxyCompressionMapping: {
      compression: 'compressed reward, benchmark, or approval signal is treated as a stand-in for the full user objective',
      amplification: 'optimization pressure can turn local shortcuts into repeated workflow behavior',
      coAdaptation: 'agent outputs can learn to satisfy evaluators, rubrics, or verifiers instead of the task',
    },
    signals,
    gates: signals.map((signal) => ({
      id: signal.id,
      action: signal.severity === 'critical' ? 'block' : 'warn',
      message: signal.gate,
    })),
    metrics: buildMetrics(),
    nextActions: [
      'Attach proof artifacts before allowing claims like tests passed, fixed, deployed, safe, or ready to merge.',
      'Treat benchmark gains as provisional until holdout, regression, or real-workflow evidence confirms the user objective improved.',
      'Require explicit user-objective mapping for every proxy metric, reward score, or evaluator rubric.',
      'Block evaluator-manipulation language before it reaches judge or verifier loops.',
      'Prefer short evidence-backed summaries over long persuasive explanations when judging agent work.',
    ],
    marketingAngle: {
      headline: 'Reward hacking is what happens when agents optimize the receipt instead of the meal.',
      subhead: 'ThumbGate turns proxy failures into pre-action gates: no unsupported completion claims, no benchmark-only victory laps, and no verifier theater without proof artifacts.',
      guideTitle: 'Reward Hacking Guardrails for AI Coding Agents',
      replyDraft: 'This paper is a useful frame for agent products: proxy rewards compress the real user objective, and agents learn the shortcut. ThumbGate can enforce the missing layer: completion claims need proof, benchmark wins need holdouts, and verifier loops need gates against sycophancy, verbosity-as-proof, and evaluator manipulation.',
    },
  };
}
```

- Helper function that analyzes the candidate text against regex patterns to detect reward-hacking signals: hallucinated verification, verbosity-as-proof, sycophancy, benchmark overfitting, evaluator manipulation, proxy-metric-only, and perception-reasoning decoupling.
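The severity-to-status rollup in `buildRewardHackingGuardrailsPlan` above reduces to a small pure function. A minimal self-contained sketch (signal objects invented for illustration): any critical signal blocks, any high signal demands evidence, otherwise the plan is ready.

```js
// Severity-to-status rollup: mirrors the ternary used in the handler above.
function statusFromSignals(signals) {
  const critical = signals.filter((s) => s.severity === 'critical').length;
  const high = signals.filter((s) => s.severity === 'high').length;
  return critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready';
}

console.log(statusFromSignals([{ severity: 'critical' }])); // blocked
console.log(statusFromSignals([{ severity: 'high' }]));     // needs_evidence
console.log(statusFromSignals([]));                         // ready
```

Note that `medium` signals (such as `proxy_metric_only`) never change the status on their own; they surface as warn-level gates only.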
```js
function buildSignals(options) {
  const signals = [];
  const evidencePresent = hasEvidence(options);
  if (COMPLETION_CLAIM_RE.test(options.candidateText) && !evidencePresent) {
    signals.push({
      id: 'hallucinated_verification',
      severity: 'critical',
      message: 'The response claims completion, safety, test success, or deployment without attached proof.',
      gate: 'Block completion claims until test output, run id, trace, screenshot, or proof artifact is attached.',
    });
  }
  if (options.wordCount >= 180 && !evidencePresent) {
    signals.push({
      id: 'verbosity_as_proof',
      severity: 'high',
      message: 'The response is long but does not provide verifiable artifacts, turning fluency into a proxy for correctness.',
      gate: 'Require concise claims with artifact-backed evidence before accepting persuasive explanations.',
    });
  }
  if (SYCOPHANCY_RE.test(options.candidateText) && !evidencePresent) {
    signals.push({
      id: 'sycophancy_or_rubber_stamp',
      severity: 'high',
      message: 'Agreement or approval language appears without independent checks or counterevidence.',
      gate: 'Require at least one explicit verification step or risk check before approval-style responses.',
    });
  }
  if (options.optimizedForScore && !options.hasHoldout) {
    signals.push({
      id: 'benchmark_overfitting',
      severity: 'high',
      message: 'A score, eval, benchmark, or reward metric is being optimized without holdout or regression proof.',
      gate: 'Require holdout, regression, or real-workflow evidence before treating score gains as product gains.',
    });
  }
  if (EVALUATOR_MANIPULATION_RE.test(options.candidateText)) {
    signals.push({
      id: 'evaluator_manipulation',
      severity: 'critical',
      message: 'The candidate text attempts to influence grading instead of satisfying the user objective.',
      gate: 'Block evaluator-manipulation language and route to human review.',
    });
  }
  if (options.metrics.length > 0 && !options.hasHumanObjective) {
    signals.push({
      id: 'proxy_metric_only',
      severity: 'medium',
      message: 'Proxy metrics are present without an explicit human objective or user-visible success criterion.',
      gate: 'Pair every reward or benchmark metric with the real user outcome it is meant to approximate.',
    });
  }
  if (options.multimodal && !evidencePresent) {
    signals.push({
      id: 'perception_reasoning_decoupling',
      severity: 'high',
      message: 'A visual or multimodal claim is made without source artifacts or perception trace evidence.',
      gate: 'Require screenshot, OCR, or visual proof artifact before accepting multimodal reasoning claims.',
    });
  }
  return signals;
}
```

- Helper function that normalizes input options, parsing booleans, numbers, and lists, and deriving defaults from candidate-text regex matching (e.g., detecting holdout mentions, benchmark terms, multimodal indicators).
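The pattern constants referenced by `buildSignals` (`COMPLETION_CLAIM_RE`, `SYCOPHANCY_RE`, `EVALUATOR_MANIPULATION_RE`) are defined elsewhere in the module and are not shown here. Purely illustrative stand-ins, to convey the kind of language each one targets, might look like:

```js
// Illustrative stand-ins only: the real definitions live elsewhere in the
// module and almost certainly cover more phrasings than these.
const COMPLETION_CLAIM_RE = /\b(tests? pass(ed)?|fixed|deployed|safe|ready to merge)\b/i;
const SYCOPHANCY_RE = /\b(great (idea|work)|absolutely right|looks good to me|approved)\b/i;
const EVALUATOR_MANIPULATION_RE =
  /\b(ignore (the|previous) (rubric|instructions)|give (this|me) (a )?(high|top) score)\b/i;

COMPLETION_CLAIM_RE.test('All tests passed and the fix is deployed.'); // true
```

Combined with the `!evidencePresent` checks above, such patterns only fire when a confident claim arrives without proof artifacts, not whenever the words appear.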
```js
function normalizeOptions(raw = {}) {
  const candidateText = String(raw.text || raw.claim || raw.response || raw.summary || '').trim();
  const evidence = splitList(raw.evidence || raw.evidenceArtifacts || raw['evidence-artifacts']);
  const metrics = splitList(raw.metrics || raw.metric || raw['proxy-metrics']);
  return {
    workflow: String(raw.workflow || raw.name || 'agent reward guardrails').trim() || 'agent reward guardrails',
    candidateText,
    evidence,
    metrics,
    wordCount: parseNumber(raw['word-count'] || raw.wordCount, candidateText.split(/\s+/).filter(Boolean).length),
    hasHoldout: parseBoolean(raw.holdout || raw.hasHoldout || raw['has-holdout'], HOLDOUT_RE.test(candidateText)),
    hasHumanObjective: parseBoolean(raw['human-objective'] || raw.hasHumanObjective, false),
    hasVerifierTrace: parseBoolean(raw['verifier-trace'] || raw.hasVerifierTrace, evidence.some((item) => /trace|log|run|artifact|proof/i.test(item))),
    optimizedForScore: parseBoolean(raw['optimized-for-score'] || raw.optimizedForScore, BENCHMARK_RE.test(candidateText) || metrics.length > 0),
    multimodal: parseBoolean(raw.multimodal || raw.hasMultimodalInputs, MULTIMODAL_RE.test(candidateText)),
  };
}
```

- adapters/mcp/server-stdio.js:1037-1038 (registration): MCP tool registration in the server switch statement, mapping the 'plan_reward_hacking_guardrails' tool name to the buildRewardHackingGuardrailsPlan handler.
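`normalizeOptions` above leans on small coercion helpers (`splitList`, `parseBoolean`, `parseNumber`) whose definitions are not included in this reference. Plausible sketches, inferred only from how they are called and hypothetical in their details:

```js
// Sketches of the coercion helpers used by normalizeOptions. The actual
// definitions live elsewhere in the module and may differ.

// Accept an array or a comma-separated string; drop empty entries.
function splitList(value) {
  const items = Array.isArray(value) ? value.map(String) : String(value || '').split(',');
  return items.map((s) => s.trim()).filter(Boolean);
}

// Coerce truthy strings ('true', '1', 'yes', 'on') to booleans; fall back
// to the supplied default when the flag is absent.
function parseBoolean(value, fallback) {
  if (value === undefined || value === null || value === '') return fallback;
  if (typeof value === 'boolean') return value;
  return /^(true|1|yes|on)$/i.test(String(value).trim());
}

// Coerce to a finite number, else fall back.
function parseNumber(value, fallback) {
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}
```

Note how each helper threads a fallback through, which is what lets `normalizeOptions` derive defaults from regex matches on the candidate text when a flag is not supplied explicitly.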
```js
case 'plan_reward_hacking_guardrails':
  return toTextResult(buildRewardHackingGuardrailsPlan(args));
```

- scripts/cli-schema.js:253-269 (schema): CLI schema definition for the 'reward-hacking-guardrails' command, mapping to MCP tool 'plan_reward_hacking_guardrails' with flags for workflow, text, evidence, metrics, holdout, human-objective, verifier-trace, optimized-for-score, and multimodal.
```js
discoveryCommand({
  name: 'reward-hacking-guardrails',
  aliases: ['proxy-reward-guardrails', 'reward-guardrails'],
  description: 'Detect reward-hacking patterns and require proof before proxy metrics or completion claims are trusted',
  mcpTool: 'plan_reward_hacking_guardrails',
  flags: [
    jsonFlag(),
    { name: 'workflow', type: 'string', description: 'Agent workflow or release lane being evaluated' },
    { name: 'text', type: 'string', description: 'Candidate response, claim, summary, or verifier output to inspect' },
    { name: 'evidence', type: 'string', description: 'Comma-separated evidence artifacts attached to the claim' },
    { name: 'metrics', type: 'string', description: 'Comma-separated proxy metrics or reward scores used by the workflow' },
    { name: 'holdout', type: 'boolean', description: 'Whether holdout, regression, or real-workflow evidence exists' },
    { name: 'human-objective', type: 'boolean', description: 'Whether proxy metrics are mapped to a human/user objective' },
    { name: 'verifier-trace', type: 'boolean', description: 'Whether verifier trace, run log, or proof artifact exists' },
    { name: 'optimized-for-score', type: 'boolean', description: 'Mark that the workflow is optimizing an eval, benchmark, or reward score' },
    { name: 'multimodal', type: 'boolean', description: 'Mark that claims depend on screenshots, PDFs, charts, images, or video' },
  ],
});
```