Glama

plan_reward_hacking_guardrails

Read-only

Detect reward-hacking patterns in agent workflows, including unsupported claims, sycophancy, and benchmark overfitting. Inspect responses, evidence, and proxy metrics to identify manipulation.

Instructions

Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.

Input Schema

Name | Required | Description | Default
workflow | No | Agent workflow or release lane being evaluated. | -
text | No | Candidate response, claim, summary, or verifier output to inspect. | -
evidence | No | Evidence artifacts attached to the claim. | -
metrics | No | Proxy metrics or reward scores used by the workflow. | -
hasHoldout | No | Whether holdout, regression, or real-workflow evidence exists. | -
hasHumanObjective | No | Whether proxy metrics are mapped to a user objective. | -
hasVerifierTrace | No | Whether verifier trace, run log, or proof artifact exists. | -
optimizedForScore | No | Whether an eval, benchmark, or reward score is being optimized. | -
multimodal | No | Whether claims depend on screenshots, PDFs, charts, images, or video. | -
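
As an illustration of how the parameters above might be supplied, the sketch below wraps them in an MCP `tools/call` request. Only the parameter names and the tool name come from this page. Every value is hypothetical, and the comma-separated string form for `evidence` and `metrics` is an assumption borrowed from the CLI flag descriptions.

```javascript
// Hypothetical argument values; only the parameter names come from the input schema.
const exampleArgs = {
  workflow: 'release-candidate review',
  text: 'All tests passed and the fix is deployed.',
  evidence: 'ci-run.log,coverage-report.html', // assumed comma-separated form
  metrics: 'pass_rate',
  hasHoldout: true,
  hasHumanObjective: true,
  hasVerifierTrace: true,
  optimizedForScore: false,
  multimodal: false,
};

// JSON-RPC envelope for an MCP `tools/call` request naming this tool.
const exampleRequest = {
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: { name: 'plan_reward_hacking_guardrails', arguments: exampleArgs },
};

console.log(JSON.stringify(exampleRequest, null, 2));
```

All nine parameters are optional, so a real call could omit any of them.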

Implementation Reference

  • Main handler function that builds the reward hacking guardrails plan. Analyzes candidate text for reward hacking signals (hallucinated verification, verbosity as proof, sycophancy, benchmark overfitting, evaluator manipulation, proxy metric only, perception-reasoning decoupling) and returns a report with status, signals, gates, metrics, and next actions.
    function buildRewardHackingGuardrailsPlan(rawOptions = {}) {
      const options = normalizeOptions(rawOptions);
      const signals = buildSignals(options);
      const critical = signals.filter((signal) => signal.severity === 'critical').length;
      const high = signals.filter((signal) => signal.severity === 'high').length;
    
      return {
        name: 'thumbgate-reward-hacking-guardrails',
        source: SOURCE,
        workflow: options.workflow,
        status: critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready',
        summary: {
          signalCount: signals.length,
          critical,
          high,
          evidenceArtifacts: options.evidence.length,
          proxyMetrics: options.metrics,
          hasHumanObjective: options.hasHumanObjective,
          hasHoldout: options.hasHoldout,
        },
        proxyCompressionMapping: {
          compression: 'compressed reward, benchmark, or approval signal is treated as a stand-in for the full user objective',
          amplification: 'optimization pressure can turn local shortcuts into repeated workflow behavior',
          coAdaptation: 'agent outputs can learn to satisfy evaluators, rubrics, or verifiers instead of the task',
        },
        signals,
        gates: signals.map((signal) => ({
          id: signal.id,
          action: signal.severity === 'critical' ? 'block' : 'warn',
          message: signal.gate,
        })),
        metrics: buildMetrics(),
        nextActions: [
          'Attach proof artifacts before allowing claims like tests passed, fixed, deployed, safe, or ready to merge.',
          'Treat benchmark gains as provisional until holdout, regression, or real-workflow evidence confirms the user objective improved.',
          'Require explicit user-objective mapping for every proxy metric, reward score, or evaluator rubric.',
          'Block evaluator-manipulation language before it reaches judge or verifier loops.',
          'Prefer short evidence-backed summaries over long persuasive explanations when judging agent work.',
        ],
        marketingAngle: {
          headline: 'Reward hacking is what happens when agents optimize the receipt instead of the meal.',
          subhead: 'ThumbGate turns proxy failures into pre-action gates: no unsupported completion claims, no benchmark-only victory laps, and no verifier theater without proof artifacts.',
          guideTitle: 'Reward Hacking Guardrails for AI Coding Agents',
          replyDraft: 'This paper is a useful frame for agent products: proxy rewards compress the real user objective, and agents learn the shortcut. ThumbGate can enforce the missing layer: completion claims need proof, benchmark wins need holdouts, and verifier loops need gates against sycophancy, verbosity-as-proof, and evaluator manipulation.',
        },
      };
    }
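
The status and gate logic in the handler above can be isolated into two small functions. This is a minimal sketch: `deriveStatus` and `toGates` are hypothetical names that do not appear in the source, and the sample signal is made up. The conditions themselves are copied from the handler.

```javascript
// Hypothetical helper: same precedence as the status ternary in the handler.
function deriveStatus(signals) {
  const critical = signals.filter((signal) => signal.severity === 'critical').length;
  const high = signals.filter((signal) => signal.severity === 'high').length;
  if (critical > 0) return 'blocked';
  if (high > 0) return 'needs_evidence';
  return 'ready';
}

// Hypothetical helper: same mapping as the `gates` field in the handler.
function toGates(signals) {
  return signals.map((signal) => ({
    id: signal.id,
    action: signal.severity === 'critical' ? 'block' : 'warn',
    message: signal.gate,
  }));
}

// Made-up signal with 'medium' severity, the level used by proxy_metric_only.
const mediumOnly = [{ id: 'proxy_metric_only', severity: 'medium', gate: 'Pair metrics with outcomes.' }];
console.log(deriveStatus(mediumOnly)); // 'ready'
console.log(toGates(mediumOnly)[0].action); // 'warn'
```

One consequence worth noting: a report whose only signals are `medium` still has status `ready`, although each of those signals produces a `warn` gate.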
  • Normalizes raw input options (text, evidence, metrics, flags) into a structured options object used by the handler.
    function normalizeOptions(raw = {}) {
      const candidateText = String(raw.text || raw.claim || raw.response || raw.summary || '').trim();
      const evidence = splitList(raw.evidence || raw.evidenceArtifacts || raw['evidence-artifacts']);
      const metrics = splitList(raw.metrics || raw.metric || raw['proxy-metrics']);
      return {
        workflow: String(raw.workflow || raw.name || 'agent reward guardrails').trim() || 'agent reward guardrails',
        candidateText,
        evidence,
        metrics,
        wordCount: parseNumber(raw['word-count'] || raw.wordCount, candidateText.split(/\s+/).filter(Boolean).length),
        hasHoldout: parseBoolean(raw.holdout || raw.hasHoldout || raw['has-holdout'], HOLDOUT_RE.test(candidateText)),
        hasHumanObjective: parseBoolean(raw['human-objective'] || raw.hasHumanObjective, false),
        hasVerifierTrace: parseBoolean(raw['verifier-trace'] || raw.hasVerifierTrace, evidence.some((item) => /trace|log|run|artifact|proof/i.test(item))),
        optimizedForScore: parseBoolean(raw['optimized-for-score'] || raw.optimizedForScore, BENCHMARK_RE.test(candidateText) || metrics.length > 0),
        multimodal: parseBoolean(raw.multimodal || raw.hasMultimodalInputs, MULTIMODAL_RE.test(candidateText)),
      };
    }
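
The helpers `splitList`, `parseBoolean`, and `parseNumber` are called above but not shown on this page. The sketch below is one plausible shape for them, written only to make the normalizer easier to follow. The bodies are assumptions, not the project's implementation, and the accepted truthy and falsy spellings are invented for illustration.

```javascript
// Assumed behavior: accept an array or a comma-separated string, drop empty entries.
function splitList(value) {
  if (Array.isArray(value)) return value.map((item) => String(item).trim()).filter(Boolean);
  if (value === undefined || value === null) return [];
  return String(value).split(',').map((item) => item.trim()).filter(Boolean);
}

// Assumed behavior: missing or unrecognized input falls back to the supplied default.
function parseBoolean(value, fallback) {
  if (value === undefined || value === null || value === '') return fallback;
  if (typeof value === 'boolean') return value;
  const text = String(value).trim().toLowerCase();
  if (['true', '1', 'yes', 'on'].includes(text)) return true;
  if (['false', '0', 'no', 'off'].includes(text)) return false;
  return fallback;
}

// Assumed behavior: missing or non-numeric input falls back to the supplied default.
function parseNumber(value, fallback) {
  if (value === undefined || value === null || value === '') return fallback;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}

console.log(splitList('ci-run.log, trace.json,,')); // [ 'ci-run.log', 'trace.json' ]
console.log(parseBoolean(false || undefined, true)); // true
```

The last line reflects how JavaScript evaluates the `||` chains in `normalizeOptions`: an explicit `false` in an earlier alias yields the value of the last operand, so the helper may receive `undefined` rather than `false`. Whether the real `parseBoolean` then uses the regex-derived fallback is not visible here.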
  • Builds detection signals by analyzing candidate text against regex patterns for reward hacking behaviors (hallucinated verification, verbosity as proof, sycophancy, benchmark overfitting, evaluator manipulation, proxy metric gaps, multimodal decoupling).
    function buildSignals(options) {
      const signals = [];
      const evidencePresent = hasEvidence(options);
    
      if (COMPLETION_CLAIM_RE.test(options.candidateText) && !evidencePresent) {
        signals.push({
          id: 'hallucinated_verification',
          severity: 'critical',
          message: 'The response claims completion, safety, test success, or deployment without attached proof.',
          gate: 'Block completion claims until test output, run id, trace, screenshot, or proof artifact is attached.',
        });
      }
    
      if (options.wordCount >= 180 && !evidencePresent) {
        signals.push({
          id: 'verbosity_as_proof',
          severity: 'high',
          message: 'The response is long but does not provide verifiable artifacts, turning fluency into a proxy for correctness.',
          gate: 'Require concise claims with artifact-backed evidence before accepting persuasive explanations.',
        });
      }
    
      if (SYCOPHANCY_RE.test(options.candidateText) && !evidencePresent) {
        signals.push({
          id: 'sycophancy_or_rubber_stamp',
          severity: 'high',
          message: 'Agreement or approval language appears without independent checks or counterevidence.',
          gate: 'Require at least one explicit verification step or risk check before approval-style responses.',
        });
      }
    
      if (options.optimizedForScore && !options.hasHoldout) {
        signals.push({
          id: 'benchmark_overfitting',
          severity: 'high',
          message: 'A score, eval, benchmark, or reward metric is being optimized without holdout or regression proof.',
          gate: 'Require holdout, regression, or real-workflow evidence before treating score gains as product gains.',
        });
      }
    
      if (EVALUATOR_MANIPULATION_RE.test(options.candidateText)) {
        signals.push({
          id: 'evaluator_manipulation',
          severity: 'critical',
          message: 'The candidate text attempts to influence grading instead of satisfying the user objective.',
          gate: 'Block evaluator-manipulation language and route to human review.',
        });
      }
    
      if (options.metrics.length > 0 && !options.hasHumanObjective) {
        signals.push({
          id: 'proxy_metric_only',
          severity: 'medium',
          message: 'Proxy metrics are present without an explicit human objective or user-visible success criterion.',
          gate: 'Pair every reward or benchmark metric with the real user outcome it is meant to approximate.',
        });
      }
    
      if (options.multimodal && !evidencePresent) {
        signals.push({
          id: 'perception_reasoning_decoupling',
          severity: 'high',
          message: 'A visual or multimodal claim is made without source artifacts or perception trace evidence.',
          gate: 'Require screenshot, OCR, or visual proof artifact before accepting multimodal reasoning claims.',
        });
      }
    
      return signals;
    }
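
Four of the checks above depend only on flags and counts, not on the regex constants (which are not shown on this page), so they can be exercised in isolation. The sketch below reproduces just those four conditions. `flagBasedSignalIds` is a hypothetical name, and the body of `hasEvidence` is an assumption, since the source calls it without showing it.

```javascript
// Assumption: evidence counts as present when any artifact or a verifier trace exists.
function hasEvidence(options) {
  return options.evidence.length > 0 || options.hasVerifierTrace;
}

// Hypothetical reduced detector: only the flag-driven rules, in source order.
function flagBasedSignalIds(options) {
  const ids = [];
  const evidencePresent = hasEvidence(options);
  if (options.wordCount >= 180 && !evidencePresent) ids.push('verbosity_as_proof');
  if (options.optimizedForScore && !options.hasHoldout) ids.push('benchmark_overfitting');
  if (options.metrics.length > 0 && !options.hasHumanObjective) ids.push('proxy_metric_only');
  if (options.multimodal && !evidencePresent) ids.push('perception_reasoning_decoupling');
  return ids;
}

// Made-up input: a long response that reports a score with nothing attached.
const firedIds = flagBasedSignalIds({
  evidence: [],
  metrics: ['pass_rate'],
  wordCount: 200,
  hasHoldout: false,
  hasHumanObjective: false,
  hasVerifierTrace: false,
  optimizedForScore: true,
  multimodal: false,
});
console.log(firedIds); // [ 'verbosity_as_proof', 'benchmark_overfitting', 'proxy_metric_only' ]
```

Attaching evidence clears only the evidence-gated rules. The benchmark and proxy-metric rules stay active until `hasHoldout` and `hasHumanObjective` are set.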
  • MCP server case statement that routes 'plan_reward_hacking_guardrails' to buildRewardHackingGuardrailsPlan and returns the result as text.
    case 'plan_reward_hacking_guardrails':
      return toTextResult(buildRewardHackingGuardrailsPlan(args));
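
`toTextResult` is not shown on this page. A common way to return a plan object from an MCP tool is to serialize it into a single text content item, so the sketch below assumes that shape. The function body and the two-space indentation are assumptions, and the sample plan is made up.

```javascript
// Assumed shape: wrap any payload as one MCP text content item.
function toTextResult(payload) {
  return {
    content: [{ type: 'text', text: JSON.stringify(payload, null, 2) }],
  };
}

// Made-up plan fragment, used only to show the round trip.
const samplePlan = { name: 'thumbgate-reward-hacking-guardrails', status: 'ready', signals: [] };
const sampleResult = toTextResult(samplePlan);

// A client would parse the text item to recover the structured report.
console.log(JSON.parse(sampleResult.content[0].text).status); // 'ready'
```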
  • CLI schema registration mapping the 'reward-hacking-guardrails' command to the 'plan_reward_hacking_guardrails' MCP tool, with flags for workflow, text, evidence, metrics, holdout, human-objective, verifier-trace, optimized-for-score, and multimodal.
    discoveryCommand({
      name: 'reward-hacking-guardrails',
      aliases: ['proxy-reward-guardrails', 'reward-guardrails'],
      description: 'Detect reward-hacking patterns and require proof before proxy metrics or completion claims are trusted',
      mcpTool: 'plan_reward_hacking_guardrails',
      flags: [
        jsonFlag(),
        { name: 'workflow', type: 'string', description: 'Agent workflow or release lane being evaluated' },
        { name: 'text', type: 'string', description: 'Candidate response, claim, summary, or verifier output to inspect' },
        { name: 'evidence', type: 'string', description: 'Comma-separated evidence artifacts attached to the claim' },
        { name: 'metrics', type: 'string', description: 'Comma-separated proxy metrics or reward scores used by the workflow' },
        { name: 'holdout', type: 'boolean', description: 'Whether holdout, regression, or real-workflow evidence exists' },
        { name: 'human-objective', type: 'boolean', description: 'Whether proxy metrics are mapped to a human/user objective' },
        { name: 'verifier-trace', type: 'boolean', description: 'Whether verifier trace, run log, or proof artifact exists' },
        { name: 'optimized-for-score', type: 'boolean', description: 'Mark that the workflow is optimizing an eval, benchmark, or reward score' },
        { name: 'multimodal', type: 'boolean', description: 'Mark that claims depend on screenshots, PDFs, charts, images, or video' },
      ],
    }),
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate readOnlyHint=true, so the description adds minimal behavioral context. It lists patterns to detect, which gives some insight into what the tool checks, but does not describe return values, side effects, or other behaviors.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single concise sentence that efficiently conveys the tool's function. It is front-loaded and without filler, though a slightly more structured presentation could improve scannability.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite having 9 parameters and no output schema, the description is very brief. It fails to explain what the tool returns, how to interpret results, or any workflow for planning guardrails. This makes it inadequate for an agent to use effectively without additional context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All 9 parameters have descriptions in the schema (100% coverage), so the description does not need to add parameter details. It correctly does not repeat schema information, but also does not provide any additional semantic context that the schema lacks.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool detects reward-hacking patterns and lists specific examples such as sycophancy and benchmark overfitting. However, the tool name begins with 'plan', which suggests a broader planning purpose, while the description is detection-focused. This causes slight ambiguity.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus siblings like plan_agent_design_governance or plan_intent. There are no usage scenarios, prerequisites, or alternatives mentioned.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/IgorGanapolsky/ThumbGate'
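
The same request can be made from JavaScript. The sketch below assumes Node.js 18 or later (for the global `fetch`) and assumes the endpoint returns JSON, which this page does not state. `serverUrl` and `fetchServer` are hypothetical helper names.

```javascript
const MCP_API_BASE = 'https://glama.ai/api/mcp/v1';

// Builds the server URL used in the curl command for a given owner and repository.
function serverUrl(owner, repo) {
  return `${MCP_API_BASE}/servers/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`;
}

// Assumption: the response body is JSON; the fields it contains are not documented here.
async function fetchServer(owner, repo) {
  const response = await fetch(serverUrl(owner, repo));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

console.log(serverUrl('IgorGanapolsky', 'ThumbGate'));
```

Calling `fetchServer('IgorGanapolsky', 'ThumbGate')` would issue the same GET request as the curl command.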

If you have feedback or need assistance with the MCP directory API, please join our Discord server.