review_output

Review any AI-generated output for errors using an independent adversarial checker. Get a PASS/FAIL/CONDITIONAL_PASS verdict, score, and categorized issues with severity. Works for code, content, summaries, translations, and more.

Instructions

Adversarial quality review of any AI-generated output. An independent reviewer assumes the author made mistakes and actively looks for problems. Returns structured verdict (PASS/FAIL/CONDITIONAL_PASS), score (0-100), categorized issues with severity, and evidence-based checklist. Works for any output type: code, content, summaries, translations, data extraction, etc.

Input Schema

  • output (required): The AI-generated output to review (max 100K chars)
  • criteria (optional): Custom review criteria — what specifically to check for
  • review_type (optional): Review category label (e.g., "code", "content", "factual", "translation")
  • model (optional): Reviewer model ID (default: claude-sonnet-4-6)
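To make the schema concrete, here is a hypothetical set of arguments for a single call; the output text, criteria string, and review_type value are invented for illustration, and only output is required:

```typescript
// Hypothetical example arguments for the review_output tool.
// Only `output` is required; the other fields are optional.
const exampleArgs = {
  output: 'function add(a, b) { return a - b }', // artifact under review
  criteria: 'Check that the arithmetic matches the function name',
  review_type: 'code',
  // `model` omitted: the reviewer falls back to its default model
}
```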

Implementation Reference

  • The actual handler that executes the review logic: it calls the Anthropic API with a review prompt, parses the result, validates the checklist, and returns a ReviewResult.
    export async function reviewOutput(options: ReviewOptions): Promise<ReviewResult> {
      const { output, criteria, reviewType, model } = options
      const client = getClient()
    
      const reviewPrompt = buildReviewPrompt(output, criteria, reviewType)
    
      const startTime = Date.now()
      const response = await client.messages.create({
        model: model || DEFAULT_MODEL,
        max_tokens: MAX_REVIEW_TOKENS,
        messages: [{ role: 'user', content: reviewPrompt }],
      })
    
      const rawText = response.content
        .filter((block): block is Anthropic.TextBlock => block.type === 'text')
        .map(block => block.text)
        .join('')
    
      const result = parseReviewResult(rawText)
      result.reviewer_model = model || DEFAULT_MODEL
    
      validateChecklist(result)
    
      console.error(`[REVIEW] Completed in ${Date.now() - startTime}ms — verdict: ${result.verdict}, score: ${result.score}`)
      return result
    }
  • Internal ReviewOptions interface defining input parameters: output, criteria, reviewType, and model.
    interface ReviewOptions {
      output: string
      criteria?: string
      reviewType?: string
      model?: string
    }
  • Type definitions for ReviewResult, ReviewIssue, and ChecklistItem — the return types of the review tool.
    export interface ReviewResult {
      verdict: 'PASS' | 'FAIL' | 'CONDITIONAL_PASS'
      score: number
      issues: ReviewIssue[]
      checklist: ChecklistItem[]
      summary: string
      reviewer_model: string
    }
    
    export interface ReviewIssue {
      severity: 'critical' | 'high' | 'medium' | 'low'
      category: string
      description: string
      suggestion: string
    }
    
    export interface ChecklistItem {
      item: string
      status: 'pass' | 'fail'
      evidence: string
    }
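The review prompt (shown below in buildReviewPrompt) ties these types together: a single critical issue forces a FAIL verdict, a critical issue caps the score at 30, and a high issue caps it at 60. As a hedged sketch, not part of the source, those ceilings could be enforced client-side over ReviewIssue severities like this:

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low'

interface Issue {
  severity: Severity
}

// Clamp a raw score to the prompt's stated ceilings:
// any critical issue caps the score at 30, any high issue at 60.
function capScore(raw: number, issues: Issue[]): number {
  if (issues.some(i => i.severity === 'critical')) return Math.min(raw, 30)
  if (issues.some(i => i.severity === 'high')) return Math.min(raw, 60)
  return raw
}
```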
  • src/index.ts:44-64 (registration)
    Registers the 'review_output' MCP tool with name, description, Zod input schema, and safeAsyncTool handler that delegates to reviewOutput().
    server.tool(
      'review_output',
      'Adversarial quality review of any AI-generated output. An independent reviewer assumes the author made mistakes and actively looks for problems. Returns structured verdict (PASS/FAIL/CONDITIONAL_PASS), score (0-100), categorized issues with severity, and evidence-based checklist. Works for any output type: code, content, summaries, translations, data extraction, etc.',
      {
        output: z.string().max(100000).describe('The AI-generated output to review (max 100K chars)'),
        criteria: z.string().optional().describe('Custom review criteria — what specifically to check for'),
        review_type: z.string().optional().describe('Review category label (e.g., "code", "content", "factual", "translation")'),
        model: z.string().optional().describe('Reviewer model ID (default: claude-sonnet-4-6)'),
      },
      safeAsyncTool(async ({ output, criteria, review_type, model }) => {
        if (!process.env.ANTHROPIC_API_KEY) {
          throw new Error('ANTHROPIC_API_KEY environment variable is required. Set it in your MCP server config.')
        }
        return await reviewOutput({
          output,
          criteria: criteria || undefined,
          reviewType: review_type || undefined,
          model: model || undefined,
        })
      })
    )
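The safeAsyncTool wrapper is referenced above but not shown. A plausible sketch, assuming it wraps an async handler so thrown errors become error results rather than crashing the server (the real implementation may differ):

```typescript
// Hypothetical sketch of safeAsyncTool: wrap an async handler so that any
// thrown error is returned as a text result flagged with isError, instead
// of propagating and tearing down the MCP server.
type ToolResult = { content: { type: 'text'; text: string }[]; isError?: boolean }

function safeAsyncTool<A>(handler: (args: A) => Promise<unknown>) {
  return async (args: A): Promise<ToolResult> => {
    try {
      const result = await handler(args)
      return { content: [{ type: 'text', text: JSON.stringify(result) }] }
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      return { content: [{ type: 'text', text: `Error: ${message}` }], isError: true }
    }
  }
}
```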
  • buildReviewPrompt constructs the adversarial review prompt sent to the LLM. Also: extractJson, parseReviewResult, validateChecklist, sanitizeIssue, sanitizeChecklistItem — all helpers supporting reviewOutput.
    export function buildReviewPrompt(output: string, criteria?: string, reviewType?: string): string {
      let prompt = `You are an independent, adversarial quality reviewer. Your job is to find problems.
    Assume the author may have made mistakes, taken shortcuts, or missed edge cases.
    Do NOT give the benefit of the doubt. Be thorough and critical.
    
    IMPORTANT RULES:
    1. Every checklist item MUST have specific evidence (a quote or concrete observation).
    2. If you cannot find evidence for a PASS item, mark it as FAIL.
    3. A single critical issue means the overall verdict MUST be FAIL.
    4. Score must reflect the issues found: critical = max 30, high = max 60.
    5. Do not be impressed by length or formatting — judge substance.
    
    `
    
      if (criteria) {
        prompt += `REVIEW CRITERIA:\n${criteria}\n\n`
      }
    
      if (reviewType) {
        prompt += `REVIEW TYPE: ${reviewType}\n\n`
      }
    
      prompt += `OUTPUT TO REVIEW:
    ---
    ${output}
    ---
    
    Respond in this exact JSON format (no other text):
    ${REVIEW_JSON_TEMPLATE}`
    
      return prompt
    }
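The text above names extractJson and parseReviewResult without showing them. A minimal sketch of what extractJson might look like, under the assumption (not confirmed by the source) that the model can wrap its JSON verdict in surrounding prose or code fences:

```typescript
// Hypothetical sketch: pull the outermost JSON object out of a model reply
// that may surround it with prose or markdown fences.
function extractJson(raw: string): string {
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in review response')
  }
  return raw.slice(start, end + 1)
}
```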
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral transparency. It describes the output structure (verdict, score, issues) but discloses no potential side effects, destructive actions, authentication needs, or rate limits. The 'adversarial' nature is mentioned but not elaborated.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise at two sentences, front-loading the core purpose and then detailing the output. Every sentence adds value; no redundant or verbose phrasing.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Since no output schema is provided, the description effectively explains the return values (verdict, score, issues, checklist). It covers the tool's broad applicability and key inputs. Minor omissions: it does not note that 'criteria' is optional or describe defaults for 'review_type' and 'model'.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%—all four parameters have descriptions in the schema. The tool description adds no meaning beyond the schema; it only summarizes the output format. The baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states it performs an adversarial quality review of AI-generated output, using specific verbs ('review') and a resource type ('output'). It does not differentiate from the sibling tool 'review_dual', suggesting both may perform reviews, so it misses the top score for sibling distinction.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for quality checking of any AI output, but provides no explicit guidance on when to use this tool versus alternatives (e.g., 'review_dual') or when not to use it. It lacks clear context for exclusion or alternative selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Rih0z/agentdesk-mcp'
