Skip to main content
Glama

plan_reward_hacking_guardrails

Read-only

Analyze agent responses and metrics to detect reward-hacking patterns such as sycophancy, verbosity-as-proof, and benchmark overfitting.

Instructions

Detect reward-hacking patterns such as unsupported completion claims, sycophancy, verbosity-as-proof, benchmark overfitting, evaluator manipulation, and proxy-only metrics.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
workflowNoAgent workflow or release lane being evaluated.
textNoCandidate response, claim, summary, or verifier output to inspect.
evidenceNoEvidence artifacts attached to the claim.
metricsNoProxy metrics or reward scores used by the workflow.
hasHoldoutNoWhether holdout, regression, or real-workflow evidence exists.
hasHumanObjectiveNoWhether proxy metrics are mapped to a user objective.
hasVerifierTraceNoWhether verifier trace, run log, or proof artifact exists.
optimizedForScoreNoWhether an eval, benchmark, or reward score is being optimized.
multimodalNoWhether claims depend on screenshots, PDFs, charts, images, or video.

Implementation Reference

  • Main handler function that builds the reward hacking guardrails plan. Normalizes input options, detects reward-hacking signals, computes status (blocked/needs_evidence/ready), and returns the full plan object with gates, metrics, next actions, and marketing content.
    function buildRewardHackingGuardrailsPlan(rawOptions = {}) {
      const options = normalizeOptions(rawOptions);
      const signals = buildSignals(options);
      const critical = signals.filter((signal) => signal.severity === 'critical').length;
      const high = signals.filter((signal) => signal.severity === 'high').length;
    
      return {
        name: 'thumbgate-reward-hacking-guardrails',
        source: SOURCE,
        workflow: options.workflow,
        status: critical > 0 ? 'blocked' : high > 0 ? 'needs_evidence' : 'ready',
        summary: {
          signalCount: signals.length,
          critical,
          high,
          evidenceArtifacts: options.evidence.length,
          proxyMetrics: options.metrics,
          hasHumanObjective: options.hasHumanObjective,
          hasHoldout: options.hasHoldout,
        },
        proxyCompressionMapping: {
          compression: 'compressed reward, benchmark, or approval signal is treated as a stand-in for the full user objective',
          amplification: 'optimization pressure can turn local shortcuts into repeated workflow behavior',
          coAdaptation: 'agent outputs can learn to satisfy evaluators, rubrics, or verifiers instead of the task',
        },
        signals,
        gates: signals.map((signal) => ({
          id: signal.id,
          action: signal.severity === 'critical' ? 'block' : 'warn',
          message: signal.gate,
        })),
        metrics: buildMetrics(),
        nextActions: [
          'Attach proof artifacts before allowing claims like tests passed, fixed, deployed, safe, or ready to merge.',
          'Treat benchmark gains as provisional until holdout, regression, or real-workflow evidence confirms the user objective improved.',
          'Require explicit user-objective mapping for every proxy metric, reward score, or evaluator rubric.',
          'Block evaluator-manipulation language before it reaches judge or verifier loops.',
          'Prefer short evidence-backed summaries over long persuasive explanations when judging agent work.',
        ],
        marketingAngle: {
          headline: 'Reward hacking is what happens when agents optimize the receipt instead of the meal.',
          subhead: 'ThumbGate turns proxy failures into pre-action gates: no unsupported completion claims, no benchmark-only victory laps, and no verifier theater without proof artifacts.',
          guideTitle: 'Reward Hacking Guardrails for AI Coding Agents',
          replyDraft: 'This paper is a useful frame for agent products: proxy rewards compress the real user objective, and agents learn the shortcut. ThumbGate can enforce the missing layer: completion claims need proof, benchmark wins need holdouts, and verifier loops need gates against sycophancy, verbosity-as-proof, and evaluator manipulation.',
        },
      };
    }
  • Helper function that analyzes candidate text against regex patterns to detect reward-hacking signals: hallucinated verification, verbosity-as-proof, sycophancy, benchmark overfitting, evaluator manipulation, proxy-metric-only, and perception-reasoning decoupling.
    function buildSignals(options) {
      const signals = [];
      const evidencePresent = hasEvidence(options);
    
      if (COMPLETION_CLAIM_RE.test(options.candidateText) && !evidencePresent) {
        signals.push({
          id: 'hallucinated_verification',
          severity: 'critical',
          message: 'The response claims completion, safety, test success, or deployment without attached proof.',
          gate: 'Block completion claims until test output, run id, trace, screenshot, or proof artifact is attached.',
        });
      }
    
      if (options.wordCount >= 180 && !evidencePresent) {
        signals.push({
          id: 'verbosity_as_proof',
          severity: 'high',
          message: 'The response is long but does not provide verifiable artifacts, turning fluency into a proxy for correctness.',
          gate: 'Require concise claims with artifact-backed evidence before accepting persuasive explanations.',
        });
      }
    
      if (SYCOPHANCY_RE.test(options.candidateText) && !evidencePresent) {
        signals.push({
          id: 'sycophancy_or_rubber_stamp',
          severity: 'high',
          message: 'Agreement or approval language appears without independent checks or counterevidence.',
          gate: 'Require at least one explicit verification step or risk check before approval-style responses.',
        });
      }
    
      if (options.optimizedForScore && !options.hasHoldout) {
        signals.push({
          id: 'benchmark_overfitting',
          severity: 'high',
          message: 'A score, eval, benchmark, or reward metric is being optimized without holdout or regression proof.',
          gate: 'Require holdout, regression, or real-workflow evidence before treating score gains as product gains.',
        });
      }
    
      if (EVALUATOR_MANIPULATION_RE.test(options.candidateText)) {
        signals.push({
          id: 'evaluator_manipulation',
          severity: 'critical',
          message: 'The candidate text attempts to influence grading instead of satisfying the user objective.',
          gate: 'Block evaluator-manipulation language and route to human review.',
        });
      }
    
      if (options.metrics.length > 0 && !options.hasHumanObjective) {
        signals.push({
          id: 'proxy_metric_only',
          severity: 'medium',
          message: 'Proxy metrics are present without an explicit human objective or user-visible success criterion.',
          gate: 'Pair every reward or benchmark metric with the real user outcome it is meant to approximate.',
        });
      }
    
      if (options.multimodal && !evidencePresent) {
        signals.push({
          id: 'perception_reasoning_decoupling',
          severity: 'high',
          message: 'A visual or multimodal claim is made without source artifacts or perception trace evidence.',
          gate: 'Require screenshot, OCR, or visual proof artifact before accepting multimodal reasoning claims.',
        });
      }
    
      return signals;
    }
  • Helper function that normalizes input options, parsing booleans, numbers, and lists, and deriving defaults from candidate text regex matching (e.g., detecting holdout mentions, benchmark terms, multimodal indicators).
    function normalizeOptions(raw = {}) {
      const candidateText = String(raw.text || raw.claim || raw.response || raw.summary || '').trim();
      const evidence = splitList(raw.evidence || raw.evidenceArtifacts || raw['evidence-artifacts']);
      const metrics = splitList(raw.metrics || raw.metric || raw['proxy-metrics']);
      return {
        workflow: String(raw.workflow || raw.name || 'agent reward guardrails').trim() || 'agent reward guardrails',
        candidateText,
        evidence,
        metrics,
        wordCount: parseNumber(raw['word-count'] || raw.wordCount, candidateText.split(/\s+/).filter(Boolean).length),
        hasHoldout: parseBoolean(raw.holdout || raw.hasHoldout || raw['has-holdout'], HOLDOUT_RE.test(candidateText)),
        hasHumanObjective: parseBoolean(raw['human-objective'] || raw.hasHumanObjective, false),
        hasVerifierTrace: parseBoolean(raw['verifier-trace'] || raw.hasVerifierTrace, evidence.some((item) => /trace|log|run|artifact|proof/i.test(item))),
        optimizedForScore: parseBoolean(raw['optimized-for-score'] || raw.optimizedForScore, BENCHMARK_RE.test(candidateText) || metrics.length > 0),
        multimodal: parseBoolean(raw.multimodal || raw.hasMultimodalInputs, MULTIMODAL_RE.test(candidateText)),
      };
    }
  • MCP tool registration in the server switch statement: maps the 'plan_reward_hacking_guardrails' tool name to the buildRewardHackingGuardrailsPlan handler.
    case 'plan_reward_hacking_guardrails':
      return toTextResult(buildRewardHackingGuardrailsPlan(args));
  • CLI schema definition for the 'reward-hacking-guardrails' command, mapping to MCP tool 'plan_reward_hacking_guardrails' with flags for workflow, text, evidence, metrics, holdout, human-objective, verifier-trace, optimized-for-score, and multimodal.
    discoveryCommand({
      name: 'reward-hacking-guardrails',
      aliases: ['proxy-reward-guardrails', 'reward-guardrails'],
      description: 'Detect reward-hacking patterns and require proof before proxy metrics or completion claims are trusted',
      mcpTool: 'plan_reward_hacking_guardrails',
      flags: [
        jsonFlag(),
        { name: 'workflow', type: 'string', description: 'Agent workflow or release lane being evaluated' },
        { name: 'text', type: 'string', description: 'Candidate response, claim, summary, or verifier output to inspect' },
        { name: 'evidence', type: 'string', description: 'Comma-separated evidence artifacts attached to the claim' },
        { name: 'metrics', type: 'string', description: 'Comma-separated proxy metrics or reward scores used by the workflow' },
        { name: 'holdout', type: 'boolean', description: 'Whether holdout, regression, or real-workflow evidence exists' },
        { name: 'human-objective', type: 'boolean', description: 'Whether proxy metrics are mapped to a human/user objective' },
        { name: 'verifier-trace', type: 'boolean', description: 'Whether verifier trace, run log, or proof artifact exists' },
        { name: 'optimized-for-score', type: 'boolean', description: 'Mark that the workflow is optimizing an eval, benchmark, or reward score' },
        { name: 'multimodal', type: 'boolean', description: 'Mark that claims depend on screenshots, PDFs, charts, images, or video' },
      ],
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The annotations already indicate readOnlyHint=true, and the description adds value by listing the specific reward-hacking patterns it analyzes, providing behavioral context beyond the annotation.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single sentence that is extremely concise, front-loads the main action, and includes key examples without any extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

While the core purpose is clear, the description lacks output specification and usage guidance, which are important given the tool's complexity, many parameters, and numerous sibling tools.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has 100% coverage with descriptions for all 9 parameters, so the description adds no additional meaning beyond what the schema already provides; it reaches the baseline.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool detects reward-hacking patterns with concrete examples such as sycophancy and benchmark overfitting, making its purpose specific. However, it does not explicitly differentiate from sibling tools like plan_proactive_agent_eval_guardrails, which may overlap.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is provided on when to use this tool versus alternatives. There are no conditions, prerequisites, or exclusions mentioned, leaving the agent without usage context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/IgorGanapolsky/ThumbGate'

If you have feedback or need assistance with the MCP directory API, please join our Discord server