eval_g_eval
Score LLM responses on holistic qualities like creativity or tone using a plain-English criterion. Returns a numeric score and short reason, averaged over multiple runs to improve reliability.
Instructions
G-Eval-style holistic scoring against a plain-English criterion.
The judge reads the criterion and the output, then returns a numeric score from 0.0 to 1.0 plus a short reason. To reduce single-sample variance, the prompt is run twice by default and the scores are averaged (position/framing bias mitigation per the original G-Eval paper).
Best for fuzzy or holistic qualities: creativity, tone, style,
helpfulness, conciseness. For criteria with multiple discrete
aspects, prefer eval_custom_rubric.
Args:
input: The prompt the LLM was responding to.
output: The LLM-generated response to score.
criteria: A plain-English description of what to score on, e.g. "Is the response concise, polite, and free of jargon?".
name: Optional label for the evaluator instance (appears in the result dict's evaluator field).
runs: How many independent judgements to average. Default 2.
judge_model: Provider:model for the scoring judge.
Returns:
{"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": <name>}.
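The averaging step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `average_judgements`, the `Judge` callable, the `0.5` threshold, the `>=` pass rule, and the way the reasons are joined are all assumptions made for the example. Only the argument names and the keys of the result dict come from the description above.

```python
from statistics import mean
from typing import Callable, Dict, Tuple, Union

# Hypothetical judge signature: (input, output, criteria) -> (score, reason).
# In practice this would wrap a call to the configured judge_model.
Judge = Callable[[str, str, str], Tuple[float, str]]


def average_judgements(
    input: str,
    output: str,
    criteria: str,
    judge: Judge,
    name: str = "g_eval",
    runs: int = 2,
    threshold: float = 0.5,  # assumed value; the real threshold is not documented here
) -> Dict[str, Union[float, bool, str]]:
    """Run the judge `runs` times and average the resulting scores."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    results = [judge(input, output, criteria) for _ in range(runs)]

    # Clamp each score into the documented 0.0-1.0 range before averaging.
    scores = [min(1.0, max(0.0, score)) for score, _ in results]
    score = mean(scores)

    return {
        "score": score,
        "passed": score >= threshold,  # assumed comparison rule
        "reason": " | ".join(reason for _, reason in results),  # assumed merge of reasons
        "threshold": threshold,
        "evaluator": name,
    }


if __name__ == "__main__":
    # Stand-in judge that returns a fixed verdict; a real judge would call an LLM.
    def stub_judge(input: str, output: str, criteria: str) -> Tuple[float, str]:
        return 0.8, "Concise and polite."

    print(average_judgements("Say hi", "Hello!", "Is the response polite?", stub_judge))
```

Passing the judge in as a callable keeps the averaging logic testable without any network access; raising `runs` trades extra judge calls for a steadier score.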
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
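| input | Yes | The prompt the LLM was responding to. | |
| output | Yes | The LLM-generated response to score. | |
| criteria | Yes | A plain-English description of what to score on. | |
| name | No | Optional label for the evaluator instance. | g_eval |
| runs | No | How many independent judgements to average. | 2 |
| judge_model | No | Provider:model for the scoring judge. | anthropic:claude-haiku-4-5 |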
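As an illustration of the input schema, the sketch below assembles an arguments payload for `eval_g_eval`. The helper `build_arguments` is hypothetical and exists only for this example; how the payload is actually sent to the tool depends on the client in use. It checks the three required fields and leaves out any optional field that was not set, so the tool's own defaults apply.

```python
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("input", "output", "criteria")


def build_arguments(
    input: str,
    output: str,
    criteria: str,
    name: Optional[str] = None,
    runs: Optional[int] = None,
    judge_model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an arguments dict matching the eval_g_eval input schema (hypothetical helper)."""
    args: Dict[str, Any] = {"input": input, "output": output, "criteria": criteria}

    # All three required fields must be non-empty strings.
    for field in REQUIRED_FIELDS:
        value = args[field]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field}' is required and must be a non-empty string")

    if runs is not None and runs < 1:
        raise ValueError("'runs' must be at least 1")

    # Omit unset optional fields so the tool falls back to its defaults.
    optional = {"name": name, "runs": runs, "judge_model": judge_model}
    args.update({key: value for key, value in optional.items() if value is not None})
    return args


if __name__ == "__main__":
    print(
        build_arguments(
            input="Summarize the release notes for a customer.",
            output="We fixed two login bugs and sped up search.",
            criteria="Is the response concise, polite, and free of jargon?",
            runs=3,
        )
    )
```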
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |