Evaluate Output

evaluate_output
Idempotent

Score agent outputs against configurable evaluation rules to obtain a quality score and per-rule results. Choose from completeness, relevance, safety, cost, or custom rule bundles.

Instructions

Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.

Sibling tools — evaluate_with_llm_judge runs semantic LLM-based scoring (slower, costs money; this tool is heuristic, free, deterministic), verify_citations checks citation grounding specifically, log_trace records executions, get_traces queries them, list_rules / deploy_rule / delete_rule manage the custom-rule lifecycle. evaluate_output is the FAST PATH for length / keyword / PII / injection / cost-threshold checks where rules are sufficient.
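
For orientation, a minimal fast-path call might look like the sketch below. It assumes the TypeScript MCP SDK (@modelcontextprotocol/sdk) and a stdio-spawned server; the iris-eval command name and the sample output string are placeholders, not taken from this page.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Connect to the eval server over stdio ("iris-eval" is a placeholder command).
const client = new Client({ name: "example-agent", version: "1.0.0" });
await client.connect(new StdioClientTransport({ command: "iris-eval" }));

// Fast path: heuristic, deterministic, no LLM call.
const result = await client.callTool({
  name: "evaluate_output",
  arguments: {
    output: "Paris is the capital of France and has roughly 2.1 million residents.",
    eval_type: "completeness", // the default bundle, shown explicitly
  },
});
```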

Behavior. Deterministic, in-process scoring — same inputs always produce the same result. Writes one eval_result row to Iris storage (linked to trace_id if provided; unlinked otherwise). No external network calls in heuristic mode (v0.4 adds an llm_as_judge eval_type that DOES call LLM APIs; see the separate evaluate_with_llm_judge tool for that). Rate-limited to 20 req/min on HTTP MCP, unlimited on stdio. Runs in ~5-50ms for rule-based evaluation.

Output shape. Returns JSON: { "id": "<uuid>", "score": 0..1, "passed": boolean, "rule_results": [{ "ruleName", "passed", "score", "message", "skipped?" }], "suggestions": string[], "rules_evaluated": number, "rules_skipped": number, "insufficient_data": boolean }. insufficient_data=true means no applicable rules fired (e.g., safety eval with only cost data).
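
For illustration only, a parsed result might look like the object below; every value is invented to show the shape, including the hypothetical rule names.

```typescript
// Hypothetical eval_result payload matching the shape described above.
const exampleResult = {
  id: "3f2b9c1e-8d4a-4c6b-9e1f-0a2b3c4d5e6f", // invented UUID
  score: 0.75,
  passed: true,
  rule_results: [
    { ruleName: "min_length", passed: true, score: 1.0, message: "Length OK" },
    { ruleName: "sentence_count", passed: false, score: 0.5, message: "Only one sentence" },
  ],
  suggestions: ["Expand the answer to cover the input fully"],
  rules_evaluated: 2,
  rules_skipped: 0,
  insufficient_data: false,
};
```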

Use when you want a quality score on a specific output — typically after log_trace records the execution. Pass eval_type to route to the right rule bundle: completeness (length, sentence count, relevance to input), relevance (keyword overlap, topic consistency), safety (PII leak, prompt injection, hallucination markers, stub-output detection), cost (budget threshold), or custom (bring your own rules via custom_rules).
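
A relevance-mode sketch, reusing the client from the earlier example (all strings are made up); note that expected is required here and input sharpens the keyword-overlap check:

```typescript
const relevance = await client.callTool({
  name: "evaluate_output",
  arguments: {
    output: "Revenue grew 12% year over year in the 2024 report.",
    eval_type: "relevance",
    expected: "A summary of 2024 revenue growth from the annual report.", // REQUIRED for relevance
    input: "What does the 2024 annual report say about revenue?",         // optional context
    trace_id: "trace-123", // hypothetical id from a prior log_trace call
  },
});
```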

Don't use when the output is empty or has no applicable rules — the eval_type decides which rules apply, and invalid combinations return score=0 + insufficient_data=true (not an error, but not actionable). Don't use to VALIDATE JSON schemas directly (use your language's JSON Schema validator — Iris's json_schema custom rule type is for output-shape assertions, not arbitrary validation).

Parameters. expected is REQUIRED when eval_type="relevance" (used as the comparison target for keyword overlap + topic consistency); ignored for other eval_types. cost_usd + token_usage are ONLY consulted when eval_type="cost" (ignored otherwise). custom_rules ALWAYS fires regardless of eval_type — pass eval_type="custom" if you want ONLY your rules to run (otherwise both your rules AND the eval_type bundle run together). trace_id is optional but recommended (linking the eval to its trace surfaces it in the dashboard's drill-through). input adds context to keyword-overlap relevance checks; ignored otherwise. Defaults: eval_type="completeness".
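
The custom_rules schema is not reproduced on this page, so the rule objects in the sketch below are an assumed shape (the type, pattern, and shouldMatch fields are guesses, not documented API); consult list_rules or the server's Zod schema for the real field names. Passing eval_type: "custom" ensures only these rules run:

```typescript
// WARNING: the rule shape here is an assumption, not documented API.
const custom = await client.callTool({
  name: "evaluate_output",
  arguments: {
    output: '{"status":"ok","items":[]}',
    eval_type: "custom", // run ONLY custom_rules, skip the built-in bundles
    custom_rules: [
      { name: "no-todo-markers", type: "regex", pattern: "TODO|FIXME", shouldMatch: false },
      { name: "mentions-status", type: "contains", value: "status" },
    ],
  },
});
```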

Error modes. Throws on malformed custom_rules (Zod rejects). Returns 400 on regex patterns that fail the safe-regex2 ReDoS check or exceed the 1000-char limit. Returns 429 when the HTTP rate limit is exceeded. Storage failures propagate as 500. The eval itself never throws; failing rules report passed: false with a message rather than bubbling exceptions.
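
One way to keep the two failure surfaces apart in client code: transport-level errors (malformed custom_rules, 429, 500) arrive as thrown exceptions, while rule failures arrive in-band as passed: false. A hedged sketch, assuming the JSON body described above comes back as a single text content item:

```typescript
const answerText = "Sample agent answer to score.";
try {
  const res = await client.callTool({
    name: "evaluate_output",
    arguments: { output: answerText, eval_type: "safety" },
  });
  // Exact framing depends on the server; this assumes one text item
  // containing the JSON body from the "Output shape" section.
  const body = JSON.parse((res.content as any)[0].text);
  if (body.insufficient_data) {
    console.log("No applicable rules fired; nothing actionable.");
  } else if (!body.passed) {
    for (const r of body.rule_results) {
      if (!r.passed) console.warn(`${r.ruleName}: ${r.message}`);
    }
  }
} catch (err) {
  // Malformed custom_rules, rate limiting (429), and storage failures (500)
  // surface here, not as rule results.
  console.error("evaluate_output call failed:", err);
}
```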

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| output | Yes | The output text to evaluate (the agent's response that gets scored against rules) | |
| eval_type | No | Rule bundle to apply: completeness \| relevance \| safety \| cost \| custom; picks which built-in rules fire | completeness |
| expected | No | Expected output for comparison; REQUIRED when eval_type="relevance" (used as keyword-overlap target) | |
| input | No | Original input for context; improves relevance scoring (keyword overlap vs input) | |
| trace_id | No | Link evaluation to a trace; surfaces this eval in the dashboard's trace drill-through | |
| custom_rules | No | Custom evaluation rules; fires REGARDLESS of eval_type. Pass eval_type="custom" if you want ONLY these | |
| cost_usd | No | Cost in USD; only consulted when eval_type="cost" (compared against cost_threshold rules) | |
| token_usage | No | Token usage breakdown; only consulted when eval_type="cost" (used for token-budget rules) | |
Behavior 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses many behavioral traits beyond annotations: deterministic and idempotent (consistent with idempotentHint=true), writes one eval_result row to Iris storage, no external network calls in heuristic mode, rate-limited to 20 req/min on HTTP MCP, typical runtime of 5-50ms, and detailed error modes. No contradiction with annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

While the description is lengthy, it is well-structured with clear sections: summary, sibling context, behavior, output shape, usage guidelines, parameter details, and error modes. Every sentence adds value given the tool's complexity, but some detail could be condensed slightly without losing clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (8 parameters, 5 eval types, custom rules, multiple error modes) and the lack of an output schema, the description fully covers the output shape with field explanations, error modes, and edge cases (e.g., insufficient_data). The agent has all needed information to use the tool correctly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Adds significant meaning beyond the already comprehensive input schema: explains when parameters are ignored (expected ignored except for relevance, cost_usd and token_usage only for cost), that custom_rules always fire regardless of eval_type, and clarifies defaults (eval_type='completeness'). Also details special cases like eval_type='custom' to run only custom rules.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.' It also distinguishes itself from sibling tools by naming evaluate_with_llm_judge, verify_citations, etc., and explicitly calls itself the 'FAST PATH for length/keyword/PII/injection/cost-threshold checks where rules are sufficient.'

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit guidance on when to use ('Use when you want a quality score on a specific output') and when not to ('Don't use when the output is empty or has no applicable rules' and 'Don't use to VALIDATE JSON schemas directly'). Also explains alternative tools like evaluate_with_llm_judge for semantic scoring and verify_citations for citation grounding.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
