Evaluate Output
evaluate_output

Score agent outputs against configurable evaluation rules to obtain a quality score and per-rule results. Choose from the completeness, relevance, safety, cost, or custom rule bundles.
Instructions
Score agent output against configurable eval rules and return a 0..1 score + per-rule breakdown.
Sibling tools:
- evaluate_with_llm_judge: semantic LLM-based scoring (slower, costs money; this tool is heuristic, free, and deterministic).
- verify_citations: checks citation grounding specifically.
- log_trace: records executions; get_traces queries them.
- list_rules / deploy_rule / delete_rule: manage the custom-rule lifecycle.

evaluate_output is the FAST PATH for length / keyword / PII / injection / cost-threshold checks where heuristic rules are sufficient.
Behavior. Deterministic, in-process scoring — same inputs always produce the same result. Writes one eval_result row to Iris storage (linked to trace_id if provided; unlinked otherwise). No external network calls in heuristic mode (v0.4 adds an llm_as_judge eval_type that DOES call LLM APIs; see the separate evaluate_with_llm_judge tool for that). Rate-limited to 20 req/min on HTTP MCP, unlimited on stdio. Runs in ~5-50ms for rule-based evaluation.
Output shape. Returns JSON: { "id": "<uuid>", "score": 0..1, "passed": boolean, "rule_results": [{ "ruleName", "passed", "score", "message", "skipped?" }], "suggestions": string[], "rules_evaluated": number, "rules_skipped": number, "insufficient_data": boolean }. insufficient_data=true means no applicable rules fired (e.g., safety eval with only cost data).
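For reference, the same contract as a TypeScript sketch. The type names (EvalResult, RuleResult) are ours; the field names follow the JSON above:

```typescript
// Sketch of the evaluate_output result contract described above.
interface RuleResult {
  ruleName: string;
  passed: boolean;
  score: number;          // this rule's contribution, 0..1
  message: string;        // human-readable pass/fail explanation
  skipped?: boolean;      // present when the rule did not apply
}

interface EvalResult {
  id: string;             // uuid of the stored eval_result row
  score: number;          // aggregate 0..1
  passed: boolean;
  rule_results: RuleResult[];
  suggestions: string[];
  rules_evaluated: number;
  rules_skipped: number;
  insufficient_data: boolean; // true when no applicable rules fired
}
```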
Use when you want a quality score on a specific output — typically after log_trace records the execution. Pass eval_type to route to the right rule bundle: completeness (length, sentence count, relevance to input), relevance (keyword overlap, topic consistency), safety (PII leak, prompt injection, hallucination markers, stub-output detection), cost (budget threshold), or custom (bring your own rules via custom_rules).
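A minimal sketch of that flow in an async ESM context, assuming the official TypeScript MCP SDK client; the server command, output text, and trace id are illustrative, and your client wrapper may differ:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Connect over stdio (no rate limit); "iris-mcp" is an assumed server command.
const client = new Client(
  { name: "example-agent", version: "1.0.0" },
  { capabilities: {} }
);
await client.connect(new StdioClientTransport({ command: "iris-mcp" }));

// Route to the safety bundle: PII leak, prompt injection,
// hallucination markers, stub-output detection.
const result = await client.callTool({
  name: "evaluate_output",
  arguments: {
    output: "Here is the customer record you asked for.", // response being scored
    eval_type: "safety",
    trace_id: "trace-123", // illustrative; from a prior log_trace call
  },
});
```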
Don't use when the output is empty or has no applicable rules — the eval_type decides which rules apply, and invalid combinations return score=0 + insufficient_data=true (not an error, but not actionable). Don't use to VALIDATE JSON schemas directly (use your language's JSON Schema validator — Iris's json_schema custom rule type is for output-shape assertions, not arbitrary validation).
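Because an inapplicable bundle returns a zero score rather than an error, callers should branch on insufficient_data before treating the score as a verdict; a small sketch reusing the EvalResult type above:

```typescript
function interpret(evaluation: EvalResult): void {
  if (evaluation.insufficient_data) {
    // score=0 with no applicable rules is a non-result, not a failure;
    // rerun with an eval_type that matches the data you actually have.
    console.warn("no applicable rules fired for this eval_type");
    return;
  }
  for (const rule of evaluation.rule_results) {
    if (!rule.passed) console.warn(`${rule.ruleName}: ${rule.message}`);
  }
}
```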
Parameters.
- expected is REQUIRED when eval_type="relevance" (used as the comparison target for keyword overlap and topic consistency); ignored for other eval_types.
- cost_usd and token_usage are ONLY consulted when eval_type="cost" (ignored otherwise).
- custom_rules ALWAYS fires regardless of eval_type; pass eval_type="custom" if you want ONLY your rules to run (otherwise both your rules AND the eval_type bundle run together); see the sketches after this list.
- trace_id is optional but recommended: linking the eval to its trace surfaces it in the dashboard's drill-through.
- input adds context to keyword-overlap relevance checks; ignored otherwise.
- Default: eval_type="completeness".
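Two hedged argument sketches for the trickier combinations above. The relevance fields are documented; the custom rule's field names (name, type, pattern) are our assumptions, so check list_rules or the server's Zod schema for the real shape:

```typescript
// relevance: `expected` is REQUIRED; `input` feeds the keyword-overlap check.
const relevanceArgs = {
  output: "Paris is the capital of France.",
  expected: "The capital of France is Paris.",
  input: "What is the capital of France?",
  eval_type: "relevance" as const,
};

// custom: eval_type="custom" runs ONLY these rules (any other eval_type
// would run its bundle AND these rules together).
const customArgs = {
  output: "No internal hostnames here.",
  eval_type: "custom" as const,
  custom_rules: [
    {
      name: "no-internal-hostnames", // ASSUMED field names; the real schema
      type: "regex",                 // is enforced server-side by Zod
      pattern: "\\binternal\\.corp\\b", // must pass safe-regex2, <= 1000 chars
    },
  ],
};
```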
Error modes. Throws on malformed custom_rules (Zod rejects). Returns 400 on regex patterns that fail safe-regex2 ReDoS check or exceed 1000-char limit. Returns 429 when HTTP rate limit exceeded. Storage failures propagate as 500. The eval itself never throws — failing rules report passed: false with a message, they don't bubble exceptions.
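The resulting control flow, sketched with the client from the earlier example: rule failures are data, not exceptions, so only transport and validation problems need a try/catch (error shapes vary by client):

```typescript
try {
  const evalResult = await client.callTool({
    name: "evaluate_output",
    arguments: { output: "candidate text", eval_type: "completeness" },
  });
  // Rule failures arrive here as rule_results entries with passed:false,
  // never as thrown errors.
} catch (err) {
  // Thrown/HTTP-level failures only: malformed custom_rules (Zod), unsafe
  // or oversized regex (400), HTTP MCP rate limit (429), storage (500).
  console.error("evaluate_output failed:", err);
}
```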
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| output | Yes | The output text to evaluate (the agent's response that gets scored against rules) | |
| eval_type | No | Rule bundle to apply (completeness, relevance, safety, cost, or custom); picks which built-in rules fire | completeness |
| expected | No | Expected output for comparison — REQUIRED when eval_type="relevance" (used as keyword-overlap target) | |
| input | No | Original input for context — improves relevance scoring (keyword overlap vs input) | |
| trace_id | No | Link evaluation to a trace — surfaces this eval in the dashboard's trace drill-through | |
| custom_rules | No | Custom evaluation rules — fires REGARDLESS of eval_type; pass eval_type="custom" if you want ONLY these | |
| cost_usd | No | Cost in USD — only consulted when eval_type="cost" (compared against cost_threshold rules) | |
| token_usage | No | Token usage breakdown — only consulted when eval_type="cost" (used for token-budget rules) |
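Putting the table together, a hedged sketch of a cost evaluation reusing the client above; the token_usage field names are an assumption, not a documented shape:

```typescript
const costArgs = {
  output: "Quarterly summary: revenue grew 12% year over year.",
  eval_type: "cost" as const,
  cost_usd: 0.042, // compared against cost_threshold rules
  token_usage: { input_tokens: 1200, output_tokens: 350 }, // ASSUMED field names
  trace_id: "trace-123", // optional: dashboard drill-through linkage
};

const costResult = await client.callTool({
  name: "evaluate_output",
  arguments: costArgs,
});
```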