Glama

Evaluate With LLM Judge

evaluate_with_llm_judge

Score agent output with an LLM judge for semantic quality. Returns a 0–1 score, rationale, and per-dimension breakdown for accuracy, helpfulness, safety, correctness, or faithfulness.

Instructions

Score agent output using an LLM as the judge (Anthropic or OpenAI). Returns a calibrated 0..1 score with rationale, per-dimension breakdown, and exact cost.

Sibling tools — evaluate_output runs heuristic rules (free, deterministic, ~ms latency, no API key needed); this tool runs LLM-based semantic scoring (paid, 1-10s latency, requires API key). verify_citations is a SPECIALIZED form of LLM judging that focuses on citation grounding only. log_trace / get_traces handle trace I/O; list_rules / deploy_rule / delete_rule manage heuristic-rule lifecycle. evaluate_with_llm_judge is the GENERAL semantic-scoring path.

Behavior. Calls an external LLM API (Anthropic or OpenAI) — costs money per call, takes 1-10 seconds, respects an IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL cap. Non-deterministic at temperature > 0; default temperature=0 gives near-deterministic scores. Writes one eval_result row to Iris storage (linked to trace_id if provided) plus captures provider response id + latency + token counts + cost in the rule_results payload. Rate-limited to 20 req/min on HTTP MCP; your LLM provider also enforces its own rate limits (we transparently retry once on 429).

Output shape. Returns JSON: { "id": "<uuid>", "score": 0..1, "passed": boolean, "rationale": string, "dimensions": {...}, "model": string, "provider": "anthropic"|"openai", "template": string, "input_tokens": number, "output_tokens": number, "cost_usd": number, "latency_ms": number }. dimensions has per-dimension sub-scores (e.g., the accuracy template returns {factual_claims, citations, internal_consistency}). The handler in the implementation reference also emits raw_response_id, the provider's response ID.
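As a rough illustration, the documented return shape could be typed and sanity-checked on the caller side as sketched below. The names `JudgeResult` and `parseJudgeResult` are hypothetical and not part of Iris, the numeric type of the `dimensions` sub-scores is an assumption, and the sample payload is made up.

```typescript
// Hypothetical typing of the documented return JSON. Field names come from
// the output shape above; the interface and helper names are illustrative.
interface JudgeResult {
  id: string;
  score: number; // documented range: 0..1
  passed: boolean;
  rationale: string;
  dimensions: Record<string, number>; // assumption: sub-scores are numeric
  model: string;
  provider: "anthropic" | "openai";
  template: string;
  input_tokens: number;
  output_tokens: number;
  cost_usd: number;
  latency_ms: number;
}

// Parse the tool's text payload and check the two fields callers branch on.
function parseJudgeResult(text: string): JudgeResult {
  const data = JSON.parse(text) as Partial<JudgeResult>;
  if (typeof data.score !== "number" || data.score < 0 || data.score > 1) {
    throw new Error("score must be a number in [0, 1]");
  }
  if (typeof data.passed !== "boolean") {
    throw new Error("passed must be a boolean");
  }
  return data as JudgeResult;
}

// Made-up payload for illustration only; not real judge output.
const samplePayload = JSON.stringify({
  id: "00000000-0000-0000-0000-000000000000",
  score: 0.82,
  passed: true,
  rationale: "Claims are consistent with each other.",
  dimensions: { factual_claims: 0.9, citations: 0.7, internal_consistency: 0.85 },
  model: "gpt-4o-mini",
  provider: "openai",
  template: "accuracy",
  input_tokens: 412,
  output_tokens: 96,
  cost_usd: 0.0001,
  latency_ms: 1800,
});

const judged = parseJudgeResult(samplePayload);
console.log(judged.passed, judged.score, Object.keys(judged.dimensions));
```

Validating only `score` and `passed` keeps the sketch short; a stricter caller might validate every field.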

Use when heuristic rules (via evaluate_output) are too coarse for the quality signal you need — semantic correctness, factual accuracy vs a reference, RAG faithfulness to sources, nuanced safety/helpfulness. Pick the template that matches: accuracy (hallucination detection), helpfulness (does it address the ask), safety (harm potential beyond regex PII), correctness (vs reference answer — pass expected), faithfulness (RAG grounding — pass source_material).
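The template-specific requirements above can be checked before spending money on a call. The sketch below is a hypothetical caller-side pre-flight check, not code from Iris; `JudgeArgs` and `missingContext` are illustrative names, and the example values are invented.

```typescript
type TemplateName =
  | "accuracy"
  | "helpfulness"
  | "safety"
  | "correctness"
  | "faithfulness";

// Subset of the tool arguments relevant to template selection.
interface JudgeArgs {
  output: string;
  template: TemplateName;
  model: string;
  input?: string;
  expected?: string;
  source_material?: string;
}

// Hypothetical pre-flight check: list the context a template needs but lacks.
function missingContext(args: JudgeArgs): string[] {
  const problems: string[] = [];
  if (args.template === "correctness" && !args.expected) {
    problems.push("correctness requires `expected`");
  }
  if (args.template === "faithfulness" && !args.source_material) {
    problems.push("faithfulness requires `source_material`");
  }
  return problems;
}

// Example arguments for a RAG grounding check (values are invented).
const faithfulnessCall: JudgeArgs = {
  output: "The warranty lasts two years.",
  template: "faithfulness",
  model: "claude-haiku-4-5",
  input: "How long is the warranty?",
  source_material: "Warranty: 24 months from date of purchase.",
};

console.log(missingContext(faithfulnessCall)); // prints []
```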

Don't use for simple regex/length/keyword checks (use evaluate_output with heuristic rules — they're free, deterministic, 1000x faster). Don't use without an API key set (IRIS_ANTHROPIC_API_KEY or IRIS_OPENAI_API_KEY). Don't use on very large outputs (>8K tokens) without raising max_cost_usd — the pre-check will refuse the call.
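A minimal sketch of the API key precondition follows. The real `resolveApiKey` helper is not shown in this listing, so `resolveApiKeySketch` and the provider-to-variable mapping are assumptions inferred from the variable names above. The environment is passed in as a plain object to keep the example self-contained.

```typescript
type LLMProvider = "anthropic" | "openai";

// Assumed mapping, inferred from the env var names in the description.
const KEY_ENV_VARS: Record<LLMProvider, string> = {
  anthropic: "IRIS_ANTHROPIC_API_KEY",
  openai: "IRIS_OPENAI_API_KEY",
};

// Hypothetical stand-in for the real resolveApiKey helper.
function resolveApiKeySketch(
  provider: LLMProvider,
  env: Record<string, string | undefined>,
): string {
  const name = KEY_ENV_VARS[provider];
  const key = env[name];
  if (!key) {
    throw new Error(`Missing ${name}; set it before calling evaluate_with_llm_judge`);
  }
  return key;
}

// Usage with a fake environment; in Node, process.env would be passed instead.
const fakeEnv = { IRIS_OPENAI_API_KEY: "sk-placeholder" };
console.log(resolveApiKeySketch("openai", fakeEnv).length > 0);
```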

Parameters. model is required (no default; pick consciously, since cost varies 100x across models). provider is auto-detected from the model name; override only for ambiguous IDs. expected is REQUIRED when template="correctness" (the reference answer to compare against) and ignored for other templates. source_material is REQUIRED when template="faithfulness" (the RAG sources to ground against) and ignored otherwise. input is optional but improves scoring on the helpfulness and safety templates by giving the judge the user prompt that produced the output. max_cost_usd defaults to the env var IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL, or $0.25 if unset. The worst-case cost is computed BEFORE the call as input_tokens × prompt_price + max_output_tokens × completion_price, and the call is refused upfront if that estimate exceeds the cap. max_output_tokens caps the judge response (default 512, max 4096); higher values allow more rationale detail at more cost. temperature defaults to 0 (near-deterministic). timeout_ms defaults to 60000. trace_id is optional but recommended, since it links the eval to its trace in the dashboard.
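The worst-case cost pre-check can be worked through with a small sketch. The formula is the one stated above, and the characters-divided-by-4 token estimate mirrors the handler in the implementation reference. The `Pricing` shape, the per-million-token units, and the prices used are placeholders chosen for the arithmetic, not real provider pricing.

```typescript
// Assumed pricing shape; units (USD per million tokens) are an assumption.
interface Pricing {
  promptUsdPerMTok: number;
  completionUsdPerMTok: number;
}

// Rough token estimate: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// input_tokens × prompt_price + max_output_tokens × completion_price
function worstCaseCostUsd(
  inputTokens: number,
  maxOutputTokens: number,
  pricing: Pricing,
): number {
  return (
    (inputTokens * pricing.promptUsdPerMTok +
      maxOutputTokens * pricing.completionUsdPerMTok) /
    1_000_000
  );
}

// Refuse before any network call if the pessimistic estimate exceeds the cap.
function checkCostCap(
  prompt: string,
  maxOutputTokens: number,
  pricing: Pricing,
  capUsd: number = 0.25,
): number {
  const cost = worstCaseCostUsd(estimateTokens(prompt), maxOutputTokens, pricing);
  if (cost > capUsd) {
    throw new Error(
      `Estimated max cost ${cost.toFixed(4)} USD exceeds cap ${capUsd.toFixed(4)} USD`,
    );
  }
  return cost;
}

// Placeholder prices, not real provider pricing.
const placeholderPricing: Pricing = { promptUsdPerMTok: 10, completionUsdPerMTok: 30 };

// 4000 characters is about 1000 tokens: (1000 × 10 + 512 × 30) / 1e6 USD.
const examplePrompt = "x".repeat(4000);
console.log(checkCostCap(examplePrompt, 512, placeholderPricing).toFixed(5));
```

Because the estimate bills the full max_output_tokens, lowering that parameter is often the cheapest way to get a large prompt under the cap.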

Error modes. Throws when the required API key env var is missing. Throws when the estimated worst-case cost exceeds max_cost_usd (raise the cap or trim prompts). Throws LLMJudgeError on provider errors — kind=auth on 401/403, rate_limit on 429 (auto-retried once), server_error on 5xx, timeout on abort, malformed_response when the judge fails to emit valid JSON on both attempts. Throws "Unknown model" for unsupported model IDs — update src/eval/llm-judge/pricing.ts first.
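The error kinds listed above could be modeled and triaged roughly as follows. `JudgeErrorSketch` is a hypothetical stand-in for `LLMJudgeError`, whose real definition is not shown in this listing; the status-to-kind mapping follows the description, while `isWorthRetrying` is an assumed caller-side policy rather than documented behavior.

```typescript
type JudgeErrorKind =
  | "auth"
  | "rate_limit"
  | "server_error"
  | "timeout"
  | "malformed_response";

// Hypothetical stand-in for LLMJudgeError; the real class is not shown here.
class JudgeErrorSketch extends Error {
  readonly kind: JudgeErrorKind;
  readonly retryAfterSeconds: number | undefined;

  constructor(message: string, kind: JudgeErrorKind, retryAfterSeconds?: number) {
    super(message);
    this.name = "JudgeErrorSketch";
    this.kind = kind;
    this.retryAfterSeconds = retryAfterSeconds;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Map an HTTP status to the documented error kind, or null if none applies.
function kindFromStatus(status: number): JudgeErrorKind | null {
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "rate_limit";
  if (status >= 500 && status <= 599) return "server_error";
  return null;
}

// Assumed caller policy: transient failures may be retried later; auth and
// malformed output (already retried once by the tool) are surfaced as-is.
function isWorthRetrying(err: unknown): boolean {
  if (!(err instanceof JudgeErrorSketch)) return false;
  return (
    err.kind === "rate_limit" ||
    err.kind === "server_error" ||
    err.kind === "timeout"
  );
}

console.log(kindFromStatus(429), isWorthRetrying(new JudgeErrorSketch("slow", "timeout")));
```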

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| output | Yes | The agent output text to evaluate | |
| template | Yes | Judge dimension: accuracy (factual correctness), helpfulness (does it address the ask), safety (harm potential), correctness (vs reference answer; requires `expected`), faithfulness (RAG grounding; requires `source_material`). | |
| model | Yes | Model ID. Supported: anthropic = claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5, claude-haiku-4-5-20251001; openai = gpt-4o, gpt-4o-mini, o1-mini. | |
| provider | No | Auto-detected from model when omitted | |
| input | No | User question / prompt that produced the output (improves accuracy for helpfulness/safety) | |
| expected | No | Reference answer (required for correctness template) | |
| source_material | No | Provided RAG sources (required for faithfulness template) | |
| trace_id | No | Link this evaluation to a trace | |
| max_cost_usd | No | Cost cap in USD | IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL or 0.25 |
| max_output_tokens | No | Judge output token cap | 512 |
| temperature | No | Sampling temperature (0 is deterministic) | 0 |
| timeout_ms | No | Per-request timeout | 60000 |
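Provider auto-detection could plausibly work on model ID prefixes, as sketched below. The real `inferProvider` helper is not shown in this listing, so `inferProviderSketch` and its prefix rules are assumptions based only on the supported model IDs in the table.

```typescript
type LLMProvider = "anthropic" | "openai";

// Hypothetical prefix-based detection; the real rules may differ.
function inferProviderSketch(model: string): LLMProvider {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gpt-") || /^o\d/.test(model)) return "openai";
  throw new Error(
    `Cannot infer provider for model "${model}"; pass provider explicitly`,
  );
}

// The supported IDs listed in the schema resolve as expected.
const supportedModels = [
  "claude-opus-4-7",
  "claude-sonnet-4-6",
  "claude-haiku-4-5",
  "gpt-4o",
  "gpt-4o-mini",
  "o1-mini",
];
for (const model of supportedModels) {
  console.log(model, "->", inferProviderSketch(model));
}
```

Throwing on an unrecognized ID, rather than guessing, matches the advice to pass provider explicitly for ambiguous IDs.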

Implementation Reference

  • Core handler function that executes the LLM judge evaluation: builds prompts from templates, estimates/checks cost, calls the LLM provider (Anthropic/OpenAI) with retry on malformed JSON, and returns the scored result with metadata.
    export async function evaluateWithLLMJudge(
      params: LLMJudgeEvaluateParams,
    ): Promise<LLMJudgeEvaluationResult> {
      const template = getTemplate(params.template);
      const maxOutputTokens = params.maxOutputTokens ?? 512;
      const temperature = params.temperature ?? 0;
      const maxCost = params.maxCostUsdPerEval ?? 0.25;
    
      // Pre-check pricing exists — if the model is unknown we can't enforce
      // the cap, so refuse upfront rather than silently skip cost control.
      if (!findPricing(params.model)) {
        throw new Error(
          `Unknown model "${params.model}" for provider "${params.provider}". Add its pricing to src/eval/llm-judge/pricing.ts before use, or pick a supported model.`,
        );
      }
    
      const systemPrompt = template.buildSystem();
      const userPrompt = template.buildUser({
        output: params.output,
        expected: params.expected,
        input: params.input,
        sourceMaterial: params.sourceMaterial,
      });
    
      // Estimate worst-case cost (treat all output as billable at full
      // maxOutputTokens) and reject before the network call if it would
      // exceed the cap. This is intentionally pessimistic — real usage is
      // usually half, but we want the cap to be a hard ceiling, not a soft
      // hope.
      const estimatedCost = estimateCostUsd(
        params.model,
        Math.ceil((systemPrompt.length + userPrompt.length) / 4),
        maxOutputTokens,
      );
      if (estimatedCost !== null && estimatedCost > maxCost) {
        throw new Error(
          `Estimated max cost ${estimatedCost.toFixed(4)} USD exceeds cap ${maxCost.toFixed(4)} USD — refusing to call. Raise IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL or trim prompts/maxOutputTokens.`,
        );
      }
    
      // First attempt. Provider errors propagate to the caller unchanged.
      let raw = await callLLMJudge({
        provider: params.provider,
        model: params.model,
        systemPrompt,
        userPrompt,
        maxOutputTokens,
        temperature,
        apiKey: params.apiKey,
        timeoutMs: params.timeoutMs,
        maxInputTokensEstimate: params.maxInputTokensEstimate,
      });
    
      let parsed;
      try {
        parsed = parseJudgeResponse(raw.content);
      } catch (err) {
        if (!(err instanceof LLMJudgeError) || err.kind !== 'malformed_response') throw err;
        // Retry once with a stricter prompt. The retry uses a smaller
        // maxOutputTokens to limit the extra spend.
        const strictSystem = systemPrompt + '\n\nIMPORTANT: your previous response was not valid JSON. Respond with ONLY the JSON object, no prefatory text, no code fences.';
        raw = await callLLMJudge({
          provider: params.provider,
          model: params.model,
          systemPrompt: strictSystem,
          userPrompt,
          maxOutputTokens: Math.min(maxOutputTokens, 256),
          temperature,
          apiKey: params.apiKey,
          timeoutMs: params.timeoutMs,
          maxInputTokensEstimate: params.maxInputTokensEstimate,
        });
        parsed = parseJudgeResponse(raw.content);
      }
    
      const passed = parsed.passed ?? parsed.score >= template.passThreshold;
      const costUsd = estimateCostUsd(params.model, raw.inputTokens, raw.outputTokens);
    
      return {
        passed,
        score: parsed.score,
        rationale: parsed.rationale,
        dimensions: parsed.dimensions,
        model: params.model,
        provider: params.provider,
        template: params.template,
        inputTokens: raw.inputTokens,
        outputTokens: raw.outputTokens,
        costUsd,
        latencyMs: raw.latencyMs,
        rawResponseId: raw.rawProviderResponseId,
      };
    }
  • MCP tool registration and handler: registers 'evaluate_with_llm_judge' tool, resolves API key/provider/cost, calls the core evaluator, persists the eval result to storage, and formats the response.
    export function registerEvaluateWithLLMJudgeTool(
      server: McpServer,
      storage: IStorageAdapter,
    ): void {
      server.registerTool(
        'evaluate_with_llm_judge',
        {
          title: 'Evaluate With LLM Judge',
          description: [
            'Score agent output using an LLM as the judge (Anthropic or OpenAI). Returns a calibrated 0..1 score with rationale, per-dimension breakdown, and exact cost.',
            '',
            'Sibling tools — evaluate_output runs heuristic rules (free, deterministic, ~ms latency, no API key needed); this tool runs LLM-based semantic scoring (paid, 1-10s latency, requires API key). verify_citations is a SPECIALIZED form of LLM judging that focuses on citation grounding only. log_trace / get_traces handle trace I/O; list_rules / deploy_rule / delete_rule manage heuristic-rule lifecycle. evaluate_with_llm_judge is the GENERAL semantic-scoring path.',
            '',
            'Behavior. Calls an external LLM API (Anthropic or OpenAI) — costs money per call, takes 1-10 seconds, respects an IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL cap. Non-deterministic at temperature > 0; default temperature=0 gives near-deterministic scores. Writes one eval_result row to Iris storage (linked to trace_id if provided) plus captures provider response id + latency + token counts + cost in the rule_results payload. Rate-limited to 20 req/min on HTTP MCP; your LLM provider also enforces its own rate limits (we transparently retry once on 429).',
            '',
            'Output shape. Returns JSON: `{ "id": "<uuid>", "score": 0..1, "passed": boolean, "rationale": string, "dimensions": {...}, "model": string, "provider": "anthropic"|"openai", "template": string, "input_tokens": number, "output_tokens": number, "cost_usd": number, "latency_ms": number }`. `dimensions` has per-dimension sub-scores (e.g., accuracy template returns `{factual_claims, citations, internal_consistency}`).',
            '',
            'Use when heuristic rules (via evaluate_output) are too coarse for the quality signal you need — semantic correctness, factual accuracy vs a reference, RAG faithfulness to sources, nuanced safety/helpfulness. Pick the template that matches: `accuracy` (hallucination detection), `helpfulness` (does it address the ask), `safety` (harm potential beyond regex PII), `correctness` (vs reference answer — pass `expected`), `faithfulness` (RAG grounding — pass `source_material`).',
            '',
            "Don't use for simple regex/length/keyword checks (use evaluate_output with heuristic rules — they're free, deterministic, 1000x faster). Don't use without an API key set (IRIS_ANTHROPIC_API_KEY or IRIS_OPENAI_API_KEY). Don't use on very large outputs (>8K tokens) without raising max_cost_usd — the pre-check will refuse the call.",
            '',
            'Parameters. model is required (no default — pick consciously since cost varies 100x across models). provider is auto-detected from the model name; override only for ambiguous IDs. expected is REQUIRED when template="correctness" (the reference answer to compare against); ignored for other templates. source_material is REQUIRED when template="faithfulness" (the RAG sources to ground against); ignored otherwise. input is optional but improves scoring on helpfulness/safety templates (gives the judge the user prompt that produced the output). max_cost_usd defaults to env var IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL or $0.25 — the worst-case cost is computed BEFORE the call (input_tokens × prompt_price + max_output_tokens × completion_price); call refused upfront if it would exceed. max_output_tokens caps the judge response (default 512, max 4096); higher = more rationale detail + more cost. temperature default 0 (deterministic). timeout_ms default 60000. trace_id optional but recommended (links eval to trace in dashboard). Defaults: temperature=0, max_output_tokens=512, max_cost_usd=$0.25, timeout_ms=60000.',
            '',
            'Error modes. Throws when the required API key env var is missing. Throws when the estimated worst-case cost exceeds max_cost_usd (raise the cap or trim prompts). Throws LLMJudgeError on provider errors — kind=`auth` on 401/403, `rate_limit` on 429 (auto-retried once), `server_error` on 5xx, `timeout` on abort, `malformed_response` when the judge fails to emit valid JSON on both attempts. Throws "Unknown model" for unsupported model IDs — update src/eval/llm-judge/pricing.ts first.',
          ].join('\n'),
          inputSchema,
          annotations: {
            readOnlyHint: false,      // Writes eval_result; also spends money (external API cost)
            destructiveHint: false,   // Creates data; doesn't overwrite or delete
            idempotentHint: false,    // Temperature > 0 may vary; even at T=0 provider non-determinism is possible; cost also varies per call
            openWorldHint: true,      // Calls external APIs (Anthropic / OpenAI) — touches the world beyond local process
          },
        },
        async (args) => {
          const provider = (args.provider as LLMProvider | undefined) ?? inferProvider(args.model);
          const apiKey = resolveApiKey(provider);
          const maxCostUsd = resolveMaxCost(args.max_cost_usd);
    
          const result = await evaluateWithLLMJudge({
            output: args.output,
            template: args.template as TemplateName,
            provider,
            model: args.model,
            apiKey,
            input: args.input,
            expected: args.expected,
            sourceMaterial: args.source_material,
            maxCostUsdPerEval: maxCostUsd,
            maxOutputTokens: args.max_output_tokens,
            temperature: args.temperature,
            timeoutMs: args.timeout_ms,
          });
    
          const evalId = generateEvalId();
    
          // Persist as a normal eval_result so the dashboard picks it up
          // alongside heuristic scores. eval_type is 'custom' because LLM
          // judge doesn't fit completeness/relevance/safety/cost taxonomy
          // cleanly — it spans all four. The rule_results payload carries
          // the full judge provenance.
          await storage.insertEvalResult(LOCAL_TENANT, {
            id: evalId,
            trace_id: args.trace_id,
            eval_type: 'custom',
            output_text: args.output,
            expected_text: args.expected,
            score: result.score,
            passed: result.passed,
            rule_results: [
              {
                ruleName: `llm_judge:${result.template}:${result.provider}/${result.model}`,
                passed: result.passed,
                score: result.score,
                message: result.rationale || 'LLM judge evaluation',
              },
            ],
            suggestions: result.passed ? [] : [result.rationale],
            rules_evaluated: 1,
            rules_skipped: 0,
            insufficient_data: false,
          });
    
          return {
            content: [
              {
                type: 'text' as const,
                text: JSON.stringify({
                  id: evalId,
                  score: result.score,
                  passed: result.passed,
                  rationale: result.rationale,
                  dimensions: result.dimensions,
                  model: result.model,
                  provider: result.provider,
                  template: result.template,
                  input_tokens: result.inputTokens,
                  output_tokens: result.outputTokens,
                  cost_usd: result.costUsd,
                  latency_ms: result.latencyMs,
                  raw_response_id: result.rawResponseId,
                }),
              },
            ],
          };
        },
      );
    }
  • Input schema using Zod: defines all 12 parameters for the tool including output, template (5 dimensions), model, provider, optional context fields (input, expected, source_material), cost/latency controls.
    const inputSchema = {
      output: z.string().min(1).describe('The agent output text to evaluate'),
      template: z
        .enum(['accuracy', 'helpfulness', 'safety', 'correctness', 'faithfulness'])
        .describe(
          'Judge dimension: accuracy (factual correctness), helpfulness (does it address the ask), safety (harm potential), correctness (vs reference answer — requires `expected`), faithfulness (RAG grounding — requires `source_material`).',
        ),
      model: z
        .string()
        .describe(
          'Model ID. Supported: anthropic = claude-opus-4-7 | claude-sonnet-4-6 | claude-haiku-4-5 | claude-haiku-4-5-20251001; openai = gpt-4o | gpt-4o-mini | o1-mini.',
        ),
      provider: z.enum(['anthropic', 'openai']).optional().describe('Auto-detected from model when omitted'),
      input: z.string().optional().describe('User question / prompt that produced the output (improves accuracy for helpfulness/safety)'),
      expected: z.string().optional().describe('Reference answer (required for correctness template)'),
      source_material: z.string().optional().describe('Provided RAG sources (required for faithfulness template)'),
      trace_id: z.string().optional().describe('Link this evaluation to a trace'),
      max_cost_usd: z.number().positive().optional().describe('Cost cap in USD; defaults to IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL or 0.25'),
      max_output_tokens: z.number().int().positive().max(4096).optional().describe('Judge output token cap; default 512'),
      temperature: z.number().min(0).max(2).optional().describe('Sampling temperature; default 0 (deterministic)'),
      timeout_ms: z.number().int().positive().optional().describe('Per-request timeout; default 60_000'),
    };
  • Tool registration entry point: imports and calls registerEvaluateWithLLMJudgeTool from the central registerAllTools function.
    import { registerEvaluateWithLLMJudgeTool } from './evaluate-with-llm-judge.js';
    import { registerVerifyCitationsTool } from './verify-citations.js';
    
    export function registerAllTools(
      server: McpServer,
      storage: IStorageAdapter,
      evalEngine: EvalEngine,
      customRuleStore: CustomRuleStore,
    ): void {
      registerLogTraceTool(server, storage);
      registerEvaluateOutputTool(server, storage, evalEngine);
      registerGetTracesTool(server, storage);
      registerListRulesTool(server, customRuleStore);
      registerDeployRuleTool(server, customRuleStore);
      registerDeleteRuleTool(server, customRuleStore);
      registerDeleteTraceTool(server, storage);
      registerEvaluateWithLLMJudgeTool(server, storage);
      registerVerifyCitationsTool(server, storage);
    }
  • LLM client helper: dispatches to Anthropic or OpenAI, handles rate-limit retry (once), and returns parsed response with token usage and latency.
    export async function callLLMJudge(req: LLMJudgeRequest): Promise<LLMJudgeResponse> {
      const estimatedInput = estimateInputTokens(req.systemPrompt, req.userPrompt);
      if (req.maxInputTokensEstimate && estimatedInput > req.maxInputTokensEstimate) {
        throw new LLMJudgeError(
          `Estimated input tokens (${estimatedInput}) exceed cap (${req.maxInputTokensEstimate}) — refusing to call`,
          'bad_request',
        );
      }
    
      const call = req.provider === 'anthropic' ? callAnthropic : callOpenAI;
    
      try {
        return await call(req);
      } catch (err) {
        if (err instanceof LLMJudgeError && err.kind === 'rate_limit') {
          const waitSeconds = err.retryAfterSeconds ?? 2;
          await new Promise((r) => setTimeout(r, waitSeconds * 1000));
          return await call(req);
        }
        throw err;
      }
    }
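The handler above relies on `parseJudgeResponse` and a template pass threshold, neither of which is shown in this listing. The sketch below is one way such parsing and the `passed` fallback might look; `parseJudgeResponseSketch`, `MalformedResponseSketch`, the field handling, and the 0.7 threshold are all assumptions for illustration.

```typescript
interface ParsedJudgement {
  score: number;
  rationale: string;
  dimensions: Record<string, number>;
  passed: boolean | undefined; // undefined when the judge omits a verdict
}

// Hypothetical error type standing in for kind = malformed_response.
class MalformedResponseSketch extends Error {}

// Strict parse: the whole response must be one JSON object with a valid score.
function parseJudgeResponseSketch(content: string): ParsedJudgement {
  let data: unknown;
  try {
    data = JSON.parse(content.trim());
  } catch {
    throw new MalformedResponseSketch("judge did not return valid JSON");
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new MalformedResponseSketch("judge response is not a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const score = obj.score;
  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new MalformedResponseSketch("score must be a number in [0, 1]");
  }
  const dims = obj.dimensions;
  return {
    score,
    rationale: typeof obj.rationale === "string" ? obj.rationale : "",
    dimensions:
      typeof dims === "object" && dims !== null
        ? (dims as Record<string, number>)
        : {},
    passed: typeof obj.passed === "boolean" ? obj.passed : undefined,
  };
}

// Same fallback as the handler: an explicit verdict wins, else use the threshold.
function decidePassed(parsed: ParsedJudgement, passThreshold: number): boolean {
  return parsed.passed ?? parsed.score >= passThreshold;
}

const exampleThreshold = 0.7; // placeholder; real thresholds live in the templates
const parsedExample = parseJudgeResponseSketch(
  '{"score": 0.6, "rationale": "Partly grounded.", "dimensions": {}}',
);
console.log(decidePassed(parsedExample, exampleThreshold)); // prints false
```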
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses all relevant behavioral traits beyond minimal annotations: calls external API, costs money, non-deterministic at temperature>0, writes to Iris storage, rate-limited, auto-retries on 429, error modes. Annotations are minimal, so description carries full burden and does so comprehensively. No contradiction.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long but well-structured with clear sections (purpose, siblings, behavior, output shape, usage, parameters, error modes). Every sentence adds value; no redundancy. Slightly verbose but earns its length given tool complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite no output schema, the description fully documents the return JSON structure, cost computation, error modes, and environmental dependencies (API keys, env vars). For a complex tool with 12 parameters, this leaves no gaps for the agent to infer incorrectly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Adds substantial meaning beyond schema descriptions: explains when parameters are required/ignored (e.g., expected only for correctness template), default values, cost cap behavior, and provider auto-detection. Schema coverage is 100% but description layers crucial behavioral context.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool scores agent output using an LLM judge, with specific verb 'Score' and resource 'agent output'. It distinguishes from siblings like evaluate_output (heuristic) and verify_citations (specialized citation checking), making its unique role explicit.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit when-to-use and when-not-to-use guidance: use when heuristic rules are too coarse, don't use for simple checks, don't use without API key, and mentions alternatives (evaluate_output, verify_citations). Also specifies suitable templates and preconditions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/iris-eval/mcp-server'
