{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Benchmarking Hallucination Evals</h1>\n",
"\n",
"The purpose of this notebook is:\n",
"\n",
"- to benchmark the performance of LLM-assisted approaches to detecting hallucinations,\n",
"- to leverage Phoenix experiments to iterate and improve on the evaluation approach.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!uv pip install arize-phoenix openinference-instrumentation openinference-instrumentation-anthropic openinference-instrumentation-openai nest-asyncio openai pandas anthropic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"import os\n",
"from typing import Any\n",
"\n",
"import pandas as pd\n",
"\n",
"from phoenix.evals import (\n",
" HALLUCINATION_PROMPT_RAILS_MAP,\n",
" HALLUCINATION_PROMPT_TEMPLATE,\n",
" AnthropicModel,\n",
" OpenAIModel,\n",
" download_benchmark_dataset,\n",
" llm_classify,\n",
")\n",
"\n",
"pd.set_option(\"display.max_colwidth\", None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.\n",
"\n",
"Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up tracing to log runs to your Phoenix instance. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"# PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY are pulled from the environment\n",
"tracer_provider = register(project_name=\"hallucination_benchmark\", auto_instrument=True, batch=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Benchmark Dataset\n",
"\n",
"We'll evaluate the evaluation system consisting of an LLM model and settings in addition to an evaluation prompt template against benchmark datasets of queries and retrieved documents with ground-truth relevance labels. Currently supported datasets include \"halueval_qa_data\" from the HaluEval benchmark:\n",
"\n",
"- https://arxiv.org/abs/2305.11747\n",
"- https://github.com/RUCAIBox/HaluEval"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = download_benchmark_dataset(\n",
" task=\"binary-hallucination-classification\", dataset_name=\"halueval_qa_data\"\n",
")\n",
"print(\"`halueval_qa_data` dataset has\", df.shape[0], \"rows\")\n",
"# rename columns\n",
"df.rename(columns={\"reference\": \"context\", \"is_hallucination\": \"expected\"}, inplace=True)\n",
"df[\"expected\"] = (1 - df[\"expected\"]).astype(int) # no hallucination = 1 (bc higher is better)\n",
"df_subset = df.sample(10, random_state=42) # increase size for larger experiments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import Client\n",
"\n",
"phoenix_client = Client()\n",
"dataset_name = \"halueval_qa_data_subset_xs\"\n",
"try:\n",
" dataset = phoenix_client.datasets.create_dataset(\n",
" name=dataset_name,\n",
" dataframe=df_subset,\n",
" input_keys=[\"context\", \"query\", \"response\"],\n",
" output_keys=[\"expected\"],\n",
" timeout=30, # large dataset takes a while to upload\n",
" )\n",
" print(f\"Dataset {dataset_name} created.\")\n",
"except Exception:\n",
" print(f\"Dataset {dataset_name} already exists. Getting existing dataset.\")\n",
" dataset = phoenix_client.datasets.get_dataset(dataset=dataset_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Baseline: Arize's Built-In Binary Hallucination Classification Template\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from getpass import getpass\n",
"\n",
"import openai\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"openai.api_key = openai_api_key\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
"\n",
"if not (anthropic_api_key := os.getenv(\"ANTHROPIC_API_KEY\")):\n",
" anthropic_api_key = getpass(\"🔑 Enter your Anthropic API key: \")\n",
"\n",
"os.environ[\"ANTHROPIC_API_KEY\"] = anthropic_api_key"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the default template used to classify hallucinations. You can tweak this template and evaluate its performance relative to the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(HALLUCINATION_PROMPT_TEMPLATE.explanation_template[0].template)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define the experiment task (hallucination classification)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sync_classify(input, model_name: str) -> dict[str, Any]:\n",
" \"\"\"\n",
" Runs the llm_classify function on a single input in a sync context using the specified model.\n",
" \"\"\"\n",
" if \"claude\" in model_name:\n",
" model = AnthropicModel(\n",
" model=model_name,\n",
" temperature=0.0,\n",
" initial_rate_limit=100, # change depending on your rate limit\n",
" )\n",
" else:\n",
" model = OpenAIModel(\n",
" model=model_name,\n",
" temperature=0.0,\n",
" initial_rate_limit=100, # change depending on your rate limit\n",
" )\n",
" rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())\n",
" label_to_score = {\"hallucinated\": 0, \"factual\": 1} # explicit mapping\n",
" single_df = pd.DataFrame(\n",
" [{\"reference\": input[\"context\"], \"input\": input[\"query\"], \"output\": input[\"response\"]}]\n",
" )\n",
" result = llm_classify(\n",
" data=single_df,\n",
" template=HALLUCINATION_PROMPT_TEMPLATE,\n",
" provide_explanation=True,\n",
" use_function_calling_if_available=True,\n",
" model=model,\n",
" rails=rails,\n",
" run_sync=True,\n",
" max_retries=3,\n",
" )\n",
" score = label_to_score[result[\"label\"].iloc[0]] # map label to 0 or 1\n",
" return {\"hallucination_score\": score, \"explanation\": result[\"explanation\"].iloc[0]}\n",
"\n",
"\n",
"async def async_classify(input, model_name: str) -> dict[str, Any]:\n",
" \"\"\"\n",
" Runs the sync_classify function on a single input in an async context.\n",
" \"\"\"\n",
" loop = asyncio.get_event_loop()\n",
" return await loop.run_in_executor(None, sync_classify, input, model_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define evaluators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy: does the expected label match the eval? \n",
"\n",
"Note: The accuracy evaluator rounds the output/expected to the nearest integer, so if we move to continuous score [0,1], then the accuracy evaluator still works. We could also add a mean-squared-error (MSE) evaluator for continuous scores down the line.\n",
"\n",
"We should also calculate F1, precision, recall at the dataset level for better understanding of the eval performance. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def accuracy(output: dict[str, Any], expected: dict[str, Any]) -> bool:\n",
" # rounds to 0 or 1 if score is continuous (e.g. 0.7 -> 1, 0.3 -> 0)\n",
" return round(output[\"hallucination_score\"]) == round(float(expected[\"expected\"]))"
]
},
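{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of the two extensions mentioned above: a per-example squared-error evaluator (averaging it over the dataset gives MSE) and a small helper that computes dataset-level precision, recall, and F1 from lists of predicted and expected labels. The names `squared_error` and `dataset_metrics` are illustrative, not part of the Phoenix API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def squared_error(output: dict[str, Any], expected: dict[str, Any]) -> float:\n",
"    # per-example squared error; averaging across the dataset gives MSE\n",
"    return (float(output[\"hallucination_score\"]) - float(expected[\"expected\"])) ** 2\n",
"\n",
"\n",
"def dataset_metrics(predicted: list[int], expected: list[int]) -> dict[str, float]:\n",
"    # precision, recall, and F1 with the \"factual\" label (1) as the positive class\n",
"    tp = sum(1 for p, e in zip(predicted, expected) if p == 1 and e == 1)\n",
"    fp = sum(1 for p, e in zip(predicted, expected) if p == 1 and e == 0)\n",
"    fn = sum(1 for p, e in zip(predicted, expected) if p == 0 and e == 1)\n",
"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
"    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0\n",
"    return {\"precision\": precision, \"recall\": recall, \"f1\": f1}"
]
},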
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run Experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List of models to benchmark as judges\n",
"judge_models = [\n",
" # OpenAI models\n",
" \"gpt-4o\", # GPT-4 Omni (May 2024)\n",
" \"gpt-4o-mini\", # Smaller version of GPT-4o\n",
" \"gpt-3.5-turbo\", # GPT-3.5 Turbo (March 2023)\n",
" \"o3\", # Successor to o1 with improved reasoning\n",
" \"o3-mini\", # Smaller version of o3\n",
" \"o4-mini\", # Latest mini reasoning model (April 2025)\n",
" # Claude (Anthropic) models\n",
" \"claude-3-opus-20240229\", # Claude 3 Opus\n",
" \"claude-3-sonnet-20240229\", # Claude 3 Sonnet\n",
" \"claude-3-haiku-20240307\", # Claude 3 Haiku\n",
" \"claude-opus-4-20250514\", # Claude Opus 4\n",
" \"claude-sonnet-4-20250514\", # Claude Sonnet 4\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from functools import partial\n",
"\n",
"from phoenix.client import AsyncClient\n",
"\n",
"async_client = AsyncClient()\n",
"model_name = \"gpt-4o-mini\" # loop through judge_models if you want or just try one for now\n",
"\n",
"experiment = await async_client.experiments.run_experiment(\n",
" dataset=dataset,\n",
" task=partial(async_classify, model_name=model_name),\n",
" evaluators=[accuracy],\n",
" experiment_name=f\"baseline-{model_name}\",\n",
" experiment_description=\"Built-in hallucination eval with python SDK\",\n",
" experiment_metadata={\"sdk\": \"phoenix\", \"sdk_type\": \"python\", \"model\": model_name},\n",
" concurrency=10,\n",
" # dry_run=10,\n",
")"
]
},
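{
"cell_type": "markdown",
"metadata": {},
"source": [
"To benchmark every judge in `judge_models`, a sketch like the one below reuses the same task and accuracy evaluator and runs one experiment per model. Adjust `concurrency` and your provider rate limits before running the full sweep; tighter limits may also call for lowering `initial_rate_limit` in `sync_classify`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: run one baseline experiment per judge model using the same task and evaluator.\n",
"for judge in judge_models:\n",
"    await async_client.experiments.run_experiment(\n",
"        dataset=dataset,\n",
"        task=partial(async_classify, model_name=judge),\n",
"        evaluators=[accuracy],\n",
"        experiment_name=f\"baseline-{judge}\",\n",
"        experiment_description=\"Built-in hallucination eval with python SDK\",\n",
"        experiment_metadata={\"sdk\": \"phoenix\", \"sdk_type\": \"python\", \"model\": judge},\n",
"        concurrency=10,\n",
"    )"
]
},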
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 4
}