Glama

Evaluate Output

evaluate_output
Idempotent

Score agent output against configurable evaluation rules for quality, compliance, and correctness. Returns a score and per-rule breakdown for completeness, relevance, safety, cost, or custom checks.

Instructions

Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.

Sibling tools — evaluate_with_llm_judge runs semantic LLM-based scoring (slower, costs money; this tool is heuristic, free, deterministic), verify_citations checks citation grounding specifically, log_trace records executions, get_traces queries them, list_rules / deploy_rule / delete_rule manage the custom-rule lifecycle. evaluate_output is the FAST PATH for length / keyword / PII / injection / cost-threshold checks where rules are sufficient.

Behavior. Deterministic, in-process scoring — same inputs always produce the same result. Writes one eval_result row to Iris storage (linked to trace_id if provided; unlinked otherwise). No external network calls in heuristic mode (v0.4 adds an llm_as_judge eval_type that DOES call LLM APIs; see the separate evaluate_with_llm_judge tool for that). Rate-limited to 20 req/min on HTTP MCP, unlimited on stdio. Runs in ~5-50ms for rule-based evaluation.
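A caller on the HTTP transport may want to stay under the stated 20 req/min limit rather than wait for a rejection. The following is a minimal client-side sketch. The class name `SlidingWindowLimiter` and the sliding-window approach are assumptions made for illustration; the page does not say how the server measures its window.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Client-side guard: allow at most max_calls per window_s seconds.

    Hypothetical helper, not part of Iris. The defaults mirror the
    documented HTTP limit of 20 requests per minute.
    """

    def __init__(
        self,
        max_calls: int = 20,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Return True and record the call if a slot is free, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            return False
        self._stamps.append(now)
        return True
```

Call `try_acquire()` before each HTTP request and delay or queue the request when it returns False. On stdio, where the page says calls are unlimited, the guard is unnecessary.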

Output shape. Returns JSON: { "id": "<uuid>", "score": 0..1, "passed": boolean, "rule_results": [{ "ruleName", "passed", "score", "message", "skipped?" }], "suggestions": string[], "rules_evaluated": number, "rules_skipped": number, "insufficient_data": boolean }. insufficient_data=true means no applicable rules fired (e.g., safety eval with only cost data).
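Because insufficient_data=true arrives together with score=0, a consumer probably wants to check that flag before treating the result as a failure. The sketch below shows one way to triage a result. The helper names, the sample payload, and the rule names `min_length` and `no_pii` are invented for illustration; only the field names come from the output shape above.

```python
import json
from typing import Any, Dict, List


def triage(result: Dict[str, Any]) -> str:
    """Classify a result as 'insufficient_data', 'pass', or 'fail'."""
    # Check insufficient_data first: it comes with score=0 but is not a real failure.
    if result.get("insufficient_data", False):
        return "insufficient_data"
    return "pass" if result.get("passed", False) else "fail"


def failed_rules(result: Dict[str, Any]) -> List[str]:
    """Names of rules that ran and did not pass (skipped rules are excluded)."""
    return [
        r["ruleName"]
        for r in result.get("rule_results", [])
        if not r.get("skipped", False) and not r.get("passed", False)
    ]


# Hypothetical payload; the rule names and messages are made up.
SAMPLE = json.loads("""
{
  "id": "00000000-0000-0000-0000-000000000000",
  "score": 0.5,
  "passed": false,
  "rule_results": [
    {"ruleName": "min_length", "passed": true, "score": 1.0, "message": "ok"},
    {"ruleName": "no_pii", "passed": false, "score": 0.0, "message": "email address found"}
  ],
  "suggestions": ["Remove personal data from the output"],
  "rules_evaluated": 2,
  "rules_skipped": 0,
  "insufficient_data": false
}
""")

print(triage(SAMPLE), failed_rules(SAMPLE))
```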

Use when you want a quality score on a specific output — typically after log_trace records the execution. Pass eval_type to route to the right rule bundle: completeness (length, sentence count, relevance to input), relevance (keyword overlap, topic consistency), safety (PII leak, prompt injection, hallucination markers, stub-output detection), cost (budget threshold), or custom (bring your own rules via custom_rules).

Don't use when the output is empty or has no applicable rules — the eval_type decides which rules apply, and invalid combinations return score=0 + insufficient_data=true (not an error, but not actionable). Don't use to VALIDATE JSON schemas directly (use your language's JSON Schema validator — Iris's json_schema custom rule type is for output-shape assertions, not arbitrary validation).

Parameters. expected is REQUIRED when eval_type="relevance" (used as the comparison target for keyword overlap + topic consistency); ignored for other eval_types. cost_usd + token_usage are ONLY consulted when eval_type="cost" (ignored otherwise). custom_rules ALWAYS fires regardless of eval_type — pass eval_type="custom" if you want ONLY your rules to run (otherwise both your rules AND the eval_type bundle run together). trace_id is optional but recommended (linking the eval to its trace surfaces it in the dashboard's drill-through). input adds context to keyword-overlap relevance checks; ignored otherwise. Defaults: eval_type="completeness".
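The parameter interactions above can be checked on the client before a call is made. The following is a rough sketch. `build_arguments` and `tools_call` are hypothetical helpers; the envelope follows the generic MCP `tools/call` JSON-RPC request, and dropping the ignored fields is a design choice of this sketch rather than something the tool requires.

```python
from typing import Any, Dict, List, Optional

EVAL_TYPES = ("completeness", "relevance", "safety", "cost", "custom")


def build_arguments(
    output: str,
    eval_type: str = "completeness",
    *,
    expected: Optional[str] = None,
    input_text: Optional[str] = None,
    trace_id: Optional[str] = None,
    custom_rules: Optional[List[Dict[str, Any]]] = None,
    cost_usd: Optional[float] = None,
    token_usage: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble evaluate_output arguments, enforcing the documented interactions."""
    if eval_type not in EVAL_TYPES:
        raise ValueError(f"unknown eval_type: {eval_type!r}")
    if eval_type == "relevance" and not expected:
        raise ValueError('expected is required when eval_type="relevance"')

    args: Dict[str, Any] = {"output": output, "eval_type": eval_type}
    if eval_type == "relevance":
        args["expected"] = expected
    if eval_type == "cost":
        # cost_usd and token_usage are only consulted for cost evals.
        if cost_usd is not None:
            args["cost_usd"] = cost_usd
        if token_usage is not None:
            args["token_usage"] = token_usage
    if input_text is not None:
        args["input"] = input_text
    if trace_id is not None:
        args["trace_id"] = trace_id
    if custom_rules:
        # Custom rules fire regardless of eval_type.
        args["custom_rules"] = custom_rules
    return args


def tools_call(arguments: Dict[str, Any], request_id: int = 1) -> Dict[str, Any]:
    """Wrap the arguments in an MCP tools/call JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "evaluate_output", "arguments": arguments},
    }
```

For example, `tools_call(build_arguments("The capital of France is Paris.", "relevance", expected="Paris"))` yields a request body that can be sent over either transport.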

Error modes. Throws on malformed custom_rules (Zod rejects). Returns 400 on regex patterns that fail safe-regex2 ReDoS check or exceed 1000-char limit. Returns 429 when HTTP rate limit exceeded. Storage failures propagate as 500. The eval itself never throws — failing rules report passed: false with a message, they don't bubble exceptions.
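On the HTTP transport, only the 429 case is worth retrying: a 400 means the rules need fixing, and a 500 points at storage. A minimal retry sketch under those assumptions follows. `call_with_retry`, `EvalRequestError`, the `send` callable, and the use of 200 as the success status are all hypothetical; the page does not specify the success status or a recommended backoff.

```python
import time
from typing import Any, Callable, Tuple


class EvalRequestError(Exception):
    """Raised for non-retryable statuses or when retries are exhausted."""

    def __init__(self, status: int, body: Any) -> None:
        super().__init__(f"evaluate_output request failed with status {status}")
        self.status = status
        self.body = body


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 3,
    base_delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() and retry only on 429, with exponential backoff.

    send is a caller-supplied function returning (status, body).
    Treating 200 as success is an assumption of this sketch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status == 429 and attempt < max_attempts - 1:
            # Rate limited: wait, then try again.
            sleep(base_delay_s * (2 ** attempt))
            continue
        # 400 (rejected regex), 500 (storage failure), or retries exhausted.
        raise EvalRequestError(status, body)
    raise AssertionError("unreachable")
```

A returned body can still contain rule failures, because failing rules report passed: false instead of raising.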

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| output | Yes | The output text to evaluate (the agent's response that gets scored against rules) | |
| eval_type | No | Rule bundle to apply: completeness, relevance, safety, cost, or custom; picks which built-in rules fire | completeness |
| expected | No | Expected output for comparison; REQUIRED when eval_type="relevance" (used as keyword-overlap target) | |
| input | No | Original input for context; improves relevance scoring (keyword overlap vs input) | |
| trace_id | No | Link evaluation to a trace; surfaces this eval in the dashboard's trace drill-through | |
| custom_rules | No | Custom evaluation rules; fires REGARDLESS of eval_type; pass eval_type="custom" if you want ONLY these | |
| cost_usd | No | Cost in USD; only consulted when eval_type="cost" (compared against cost_threshold rules) | |
| token_usage | No | Token usage breakdown; only consulted when eval_type="cost" (used for token-budget rules) | |

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Disclosures beyond annotations: deterministic and in-process scoring, writes to Iris storage, no external network calls in heuristic mode (llm_as_judge type is separate), rate limits (20 req/min HTTP, unlimited stdio), typical latency (~5-50ms), error modes (400 on regex issues, 429 rate limit, 500 storage failures, throws on malformed rules). The description does not contradict annotations (idempotentHint is consistent).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (sibling tools, behavior, output shape, usage, parameters, error modes) and is front-loaded with the purpose. It is somewhat lengthy, but every sentence provides value. Minor trimming could improve conciseness without losing information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity (8 parameters, nested objects, multiple eval types, error conditions), the description covers all necessary aspects: explains what each parameter does, interactions between parameters and eval types, output shape fields like insufficient_data, error modes, and even provides an example of when insufficient_data occurs. No output schema exists, so the description fully compensates by detailing the return values.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, but the description adds significant context beyond schema: e.g., 'expected is REQUIRED when eval_type="relevance"', 'custom_rules ALWAYS fires regardless of eval_type', 'trace_id is optional but recommended', and explains default behavior. This adds meaning that the schema alone does not convey.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.' It uses a specific verb ('score') and resource ('agent output'), and distinguishes from siblings like evaluate_with_llm_judge (semantic scoring) and verify_citations (citation grounding).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit guidance on when to use (fast path for length/keyword/PII/injection/cost-threshold checks after log_trace), when not to use (empty output, invalid combinations, direct JSON schema validation), and alternatives (evaluate_with_llm_judge, verify_citations). Also explains parameter prerequisites (expected required for relevance, etc.).

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/iris-eval/mcp-server'
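
The same request can be made from Python with only the standard library. This is a sketch: `build_request` and `fetch_server_info` are hypothetical helper names, and parsing the response as JSON is an assumption, since the page does not document the response format.

```python
import json
import urllib.request
from typing import Any

SERVER_URL = "https://glama.ai/api/mcp/v1/servers/iris-eval/mcp-server"


def build_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build the GET request that the curl command above sends."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url: str = SERVER_URL, timeout_s: float = 10.0) -> Any:
    """Send the request and parse the body, assuming it is JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout_s) as resp:
        return json.load(resp)
```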

If you have feedback or need assistance with the MCP directory API, please join our Discord server.