Glama

judge_receipt

Initiate auditable judgment of receipt outputs by creating Ed25519-signed pending receipts with structured evaluation prompts. Enables weighted rubric assessment beyond pass/fail with partial verdicts and confidence scoring.

Instructions

Start an AI judgment evaluation for a receipt by creating a pending judgment receipt and returning a structured evaluation prompt. The host model (you) evaluates the receipt's output against the provided rubric criteria and then calls complete_judgment with the results. Use to assess output quality beyond simple pass/fail constraints — supports weighted criteria, partial verdicts, and confidence scores. Judgment receipts are themselves Ed25519-signed for auditability.
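The weighted-criteria, partial-verdict semantics described above suggest an aggregation along the following lines. This is a hypothetical sketch, not the server's documented scoring logic: the criterion fields and the 0.7 defaults come from the input schema below, while the aggregation rule (weighted mean vs. the overall threshold, with require_all demanding every criterion pass individually) is an assumption.

```python
# Hypothetical sketch of how a judge might aggregate per-criterion scores
# into the verdict reported to complete_judgment. Assumes each criterion is
# scored 0.0-1.0; per-criterion passing_threshold defaults to 0.7; the
# overall verdict compares the weighted mean against the rubric-level
# passing_threshold; require_all additionally demands every criterion pass.

def aggregate(scores, rubric):
    criteria = rubric["criteria"]
    overall_threshold = rubric.get("passing_threshold", 0.7)
    require_all = rubric.get("require_all", False)

    total_weight = sum(c["weight"] for c in criteria)
    weighted = sum(scores[c["name"]] * c["weight"] for c in criteria) / total_weight

    each_passes = all(
        scores[c["name"]] >= c.get("passing_threshold", 0.7) for c in criteria
    )
    passed = weighted >= overall_threshold and (each_passes or not require_all)

    # A "partial" verdict: the weighted score clears the bar, but some
    # individually required criterion does not.
    if passed:
        verdict = "pass"
    elif weighted >= overall_threshold:
        verdict = "partial"
    else:
        verdict = "fail"
    return {"score": round(weighted, 3), "verdict": verdict}

rubric = {
    "criteria": [
        {"name": "accuracy", "description": "Output is factually correct", "weight": 0.6},
        {"name": "clarity", "description": "Output is readable", "weight": 0.4,
         "passing_threshold": 0.5},
    ],
    "passing_threshold": 0.7,
    "require_all": True,
}
result = aggregate({"accuracy": 0.9, "clarity": 0.6}, rubric)
# weighted mean = (0.9 * 0.6 + 0.6 * 0.4) / 1.0 = 0.78, both criteria pass
```

Under these assumptions, confidence scoring would be reported separately by the judge alongside the verdict; nothing here should be read as the server's actual formula.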

Input Schema

Name | Required | Description
receipt_id | Yes | The receipt ID to evaluate (the original action receipt).
rubric | Yes | Evaluation rubric with a criteria array. Each criterion needs: name (string), description (string), weight (0.0-1.0), and an optional passing_threshold (0.0-1.0, default 0.7). The rubric also accepts a top-level passing_threshold (default 0.7) and require_all (boolean, default false).
output_summary_for_review | No | The actual output content to evaluate; provide it if output_summary on the receipt is insufficient for evaluation.
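Putting those constraints together, a rubric argument might look like the following. This is an illustrative payload built from the schema notes above, not an example taken from the server's documentation; the small validator simply enforces the documented ranges and defaults.

```python
# Illustrative judge_receipt rubric payload: weights and thresholds in
# 0.0-1.0, per-criterion passing_threshold optional (default 0.7), plus a
# top-level passing_threshold and require_all flag.
rubric = {
    "criteria": [
        {"name": "correctness", "description": "Output matches the requested action",
         "weight": 0.7},
        {"name": "completeness", "description": "All requested items are covered",
         "weight": 0.3, "passing_threshold": 0.5},
    ],
    "passing_threshold": 0.7,
    "require_all": False,
}

def validate_rubric(rubric):
    """Check the documented range constraints; raise ValueError on violation."""
    for c in rubric["criteria"]:
        if not (0.0 <= c["weight"] <= 1.0):
            raise ValueError(f"weight out of range for {c['name']}")
        if not (0.0 <= c.get("passing_threshold", 0.7) <= 1.0):
            raise ValueError(f"passing_threshold out of range for {c['name']}")
    if not (0.0 <= rubric.get("passing_threshold", 0.7) <= 1.0):
        raise ValueError("overall passing_threshold out of range")
    return True

validate_rubric(rubric)
```

Checking the payload client-side before calling the tool avoids a round trip when a weight or threshold falls outside the documented range.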
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full disclosure burden. It effectively describes the state change (creates 'pending judgment receipt'), return value ('structured evaluation prompt'), security properties ('Ed25519-signed for auditability'), and the critical behavioral expectation that 'the host model (you) evaluates.' However, it lacks details on error conditions, idempotency, or the specific structure of the returned prompt.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description comprises four sentences with zero waste: sentence 1 defines the core action, sentence 2 establishes the workflow involving the host model and complete_judgment, sentence 3 specifies usage context (beyond pass/fail), and sentence 4 notes security properties. Information is front-loaded and every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complex nested schema and multi-step workflow, the description provides excellent context by explaining the two-phase process (start then complete) and auditability features. However, without an output schema, it could further clarify what the 'structured evaluation prompt' contains to fully prepare the agent for handling the response.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All three parameters (receipt_id, rubric, output_summary_for_review) carry schema descriptions, so the schema itself documents them adequately. The tool description references them in context ('provided rubric criteria,' 'receipt's output') but adds little semantic meaning beyond what the schema already provides, warranting the baseline score of 3.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states the tool 'Start[s] an AI judgment evaluation' by creating a 'pending judgment receipt,' using specific verbs and resources. It clearly distinguishes itself from sibling tools by explicitly mentioning the subsequent step requires calling 'complete_judgment with the results,' establishing its role in a multi-step workflow.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit workflow guidance: 'Use to assess output quality beyond simple pass/fail constraints' (when to use vs. simple verification alternatives), and explicitly names the required follow-up action 'then calls complete_judgment.' This clearly maps the tool's place in the evaluation pipeline relative to siblings like verify_receipt or get_receipt.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/webaesbyamin/agent-receipts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.