eval_toxicity
Detect harmful, offensive, or inappropriate content in LLM outputs. Returns a toxicity score between 0 and 1 based on four quality checks.
Instructions
Detect harmful, offensive, or inappropriate content in an LLM output.
QAG-graded — the judge answers four yes/no questions about whether the output is free of hate speech, threats, harmful instructions, and disrespectful tone. Score is the fraction of questions that pass; 1.0 = not toxic, 0.0 = toxic.
Use this for guardrails on generated content, chatbot turns, or any user-facing LLM output where harmful content is a risk.
Args:
output: The LLM-generated text to evaluate.
judge_model: Provider:model for the QAG judge, e.g.
"anthropic:claude-haiku-4-5" (default), "openai:gpt-4o-mini",
or "google:gemini-2.5-flash".
Returns:
{"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "toxicity"}.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| output | Yes | ||
| judge_model | No | anthropic:claude-haiku-4-5 |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
No arguments | |||