Skip to main content
Glama

Evaluate Output

evaluate_output
Idempotent

Score agent output against configurable eval rules and return a 0..1 score with per-rule breakdown.

Instructions

Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.

Sibling tools — evaluate_with_llm_judge runs semantic LLM-based scoring (slower, costs money; this tool is heuristic, free, deterministic), verify_citations checks citation grounding specifically, log_trace records executions, get_traces queries them, list_rules / deploy_rule / delete_rule manage the custom-rule lifecycle. evaluate_output is the FAST PATH for length / keyword / PII / injection / cost-threshold checks where rules are sufficient.

Behavior. Deterministic, in-process scoring — same inputs always produce the same result. Writes one eval_result row to Iris storage (linked to trace_id if provided; unlinked otherwise). No external network calls in heuristic mode (v0.4 adds an llm_as_judge eval_type that DOES call LLM APIs; see the separate evaluate_with_llm_judge tool for that). Rate-limited to 20 req/min on HTTP MCP, unlimited on stdio. Runs in ~5-50ms for rule-based evaluation.

Output shape. Returns JSON: { "id": "<uuid>", "score": 0..1, "passed": boolean, "rule_results": [{ "ruleName", "passed", "score", "message", "skipped?" }], "suggestions": string[], "rules_evaluated": number, "rules_skipped": number, "insufficient_data": boolean }. insufficient_data=true means no applicable rules fired (e.g., safety eval with only cost data).

Use when you want a quality score on a specific output — typically after log_trace records the execution. Pass eval_type to route to the right rule bundle: completeness (length, sentence count, relevance to input), relevance (keyword overlap, topic consistency), safety (PII leak, prompt injection, hallucination markers, stub-output detection), cost (budget threshold), or custom (bring your own rules via custom_rules).

Don't use when the output is empty or has no applicable rules — the eval_type decides which rules apply, and invalid combinations return score=0 + insufficient_data=true (not an error, but not actionable). Don't use to VALIDATE JSON schemas directly (use your language's JSON Schema validator — Iris's json_schema custom rule type is for output-shape assertions, not arbitrary validation).

Parameters. expected is REQUIRED when eval_type="relevance" (used as the comparison target for keyword overlap + topic consistency); ignored for other eval_types. cost_usd + token_usage are ONLY consulted when eval_type="cost" (ignored otherwise). custom_rules ALWAYS fires regardless of eval_type — pass eval_type="custom" if you want ONLY your rules to run (otherwise both your rules AND the eval_type bundle run together). trace_id is optional but recommended (linking the eval to its trace surfaces it in the dashboard's drill-through). input adds context to keyword-overlap relevance checks; ignored otherwise. Defaults: eval_type="completeness".

Error modes. Throws on malformed custom_rules (Zod rejects). Returns 400 on regex patterns that fail safe-regex2 ReDoS check or exceed 1000-char limit. Returns 429 when HTTP rate limit exceeded. Storage failures propagate as 500. The eval itself never throws — failing rules report passed: false with a message, they don't bubble exceptions.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
outputYesThe output text to evaluate (the agent's response that gets scored against rules)
eval_typeNoRule bundle to apply: completeness | relevance | safety | cost | custom — picks which built-in rules firecompleteness
expectedNoExpected output for comparison — REQUIRED when eval_type="relevance" (used as keyword-overlap target)
inputNoOriginal input for context — improves relevance scoring (keyword overlap vs input)
trace_idNoLink evaluation to a trace — surfaces this eval in the dashboard's trace drill-through
custom_rulesNoCustom evaluation rules — fires REGARDLESS of eval_type; pass eval_type="custom" if you want ONLY these
cost_usdNoCost in USD — only consulted when eval_type="cost" (compared against cost_threshold rules)
token_usageNoToken usage breakdown — only consulted when eval_type="cost" (used for token-budget rules)
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses deterministic in-process scoring, writing to Iris storage, rate limits (20 req/min HTTP, unlimited stdio), typical runtime (5-50ms), and error modes (400, 429, 500). This goes well beyond the idempotentHint annotation, adding concrete behavioral details.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long (~600 words) but well-structured with clear sections (Behavior, Output shape, Use when, Don't use, Parameters, Error modes). Front-loaded with main purpose and fast-path guidance. Minor conciseness penalty for detailed error mode descriptions that could be summarized.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (8 parameters, 1 required, nested objects, no output schema), the description is extremely complete. It describes the output shape in detail, all error modes, rate limits, performance, and sibling interactions. No gaps for an agent to misuse the tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema coverage, baseline is 3, but the description adds significant meaning: explains when 'expected' is required, that 'cost_usd' and 'token_usage' are only consulted for cost eval_type, 'custom_rules' always fires, and 'trace_id' recommended for dashboard linking. Clarifies default eval_type and parameter interactions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool scores agent output against configurable eval rules and returns a 0..1 score with per-rule breakdown. It distinguishes itself from siblings like evaluate_with_llm_judge (semantic, slower, costs) and verify_citations (citation grounding), making the purpose unambiguous.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly states when to use (fast path for length/keyword/PII/injection/cost-threshold checks), when not to use (empty output, no applicable rules, direct JSON schema validation), and references alternatives (evaluate_with_llm_judge, verify_citations). Provides clear context for each eval_type.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/iris-eval/mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server