@arizeai/phoenix-mcp

Official

Overview Schema Related Servers Score Discussions

hallucination_eval.ipynb•16.8 KiB

{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "7cc25f103eaf89fb", "metadata": {}, "outputs": [], "source": [ "%pip install -Uqqq arize-phoenix-client arize-phoenix-otel arize-phoenix anthropic openai google-generativeai groq openinference-instrumentation-anthropic openinference-instrumentation-openai openinference-instrumentation-groq openinference-instrumentation-vertexai" ] }, { "cell_type": "code", "execution_count": null, "id": "b53451a7a3929ae", "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "from secrets import token_hex\n", "\n", "import anthropic\n", "import google.generativeai as genai\n", "import groq\n", "import nest_asyncio\n", "import openai\n", "import pandas as pd\n", "from google.generativeai.generative_models import GenerativeModel\n", "from openinference.instrumentation.anthropic import AnthropicInstrumentor\n", "from openinference.instrumentation.groq import GroqInstrumentor\n", "from openinference.instrumentation.openai import OpenAIInstrumentor\n", "from openinference.instrumentation.vertexai import VertexAIInstrumentor\n", "from sklearn.metrics import accuracy_score\n", "\n", "import phoenix as px\n", "from phoenix.client import Client\n", "from phoenix.client.types import PromptVersion\n", "from phoenix.experiments import run_experiment\n", "from phoenix.otel import register\n", "\n", "nest_asyncio.apply()" ] }, { "cell_type": "markdown", "id": "95e284cd72cf50be", "metadata": {}, "source": [ "# Launch Phoenix" ] }, { "cell_type": "code", "execution_count": null, "id": "eeb7a40f6a9857e7", "metadata": {}, "outputs": [], "source": [ "px.launch_app()" ] }, { "cell_type": "markdown", "id": "f1c2f73f0ed08b38", "metadata": {}, "source": [ "# LLM Vendor API Keys" ] }, { "cell_type": "markdown", "id": "f25999aaf117ceab", "metadata": {}, "source": [ "Enter a blank value if not available." ] }, { "cell_type": "code", "execution_count": null, "id": "b2bb6e2bf69f1c23", "metadata": {}, "outputs": [], "source": [ "if not os.getenv(\"OPENAI_API_KEY\"):\n", " os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API key: \")\n", "if not os.getenv(\"ANTHROPIC_API_KEY\"):\n", " os.environ[\"ANTHROPIC_API_KEY\"] = getpass(\"Anthropic API key: \")\n", "if not os.getenv(\"GOOGLE_API_KEY\"):\n", " os.environ[\"GOOGLE_API_KEY\"] = getpass(\"Google API key: \")\n", "if not os.getenv(\"GROQ_API_KEY\"):\n", " os.environ[\"GROQ_API_KEY\"] = getpass(\"Groq API key: \")" ] }, { "cell_type": "markdown", "id": "1740a1f2be6a4122", "metadata": {}, "source": [ "# Instrumentation" ] }, { "cell_type": "code", "execution_count": null, "id": "3054e33971e462a0", "metadata": {}, "outputs": [], "source": [ "tracer_provider = register()\n", "OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)\n", "AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)\n", "GroqInstrumentor().instrument(tracer_provider=tracer_provider)\n", "VertexAIInstrumentor().instrument(tracer_provider=tracer_provider)" ] }, { "cell_type": "code", "execution_count": null, "id": "358e6af302f62421", "metadata": {}, "outputs": [], "source": [ "url = \"https://raw.githubusercontent.com/RUCAIBox/HaluEval/refs/heads/main/data/qa_data.json\"\n", "qa = pd.read_json(url, lines=True)\n", "qa.sample(5).iloc[:, ::-1]" ] }, { "cell_type": "code", "execution_count": null, "id": "45240701e82b3c16", "metadata": {}, "outputs": [], "source": [ "SAMPLE_SIZE = 10\n", "\n", "k = qa.iloc[:, :2]\n", "df = pd.concat(\n", " [\n", " pd.concat([k, qa.iloc[:, 2].rename(\"answer\")], axis=1).assign(true_label=\"factual\"),\n", " pd.concat([k, qa.iloc[:, 3].rename(\"answer\")], axis=1).assign(true_label=\"hallucinated\"),\n", " ]\n", ")\n", "df = df.sample(SAMPLE_SIZE, random_state=42).reset_index(drop=True).iloc[:, ::-1]\n", "df" ] }, { "cell_type": "markdown", "id": "adaa17ab05027868", "metadata": {}, "source": [ "# Upload Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "914e224df0ed812e", "metadata": {}, "outputs": [], "source": [ "dataset_name = f\"hallu-eval-{token_hex(4)}\" # adding a random suffix for demo purposes\n", "\n", "ds = px.Client().upload_dataset(\n", " dataframe=df,\n", " dataset_name=dataset_name,\n", " input_keys=[\"question\", \"knowledge\", \"answer\"],\n", " output_keys=[\"true_label\"],\n", ")" ] }, { "cell_type": "markdown", "id": "a84b657d82b93280", "metadata": {}, "source": [ "# Create Prompt\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c14d2428768bfd1", "metadata": {}, "outputs": [], "source": [ "prompt_name = f\"hallu-eval-{token_hex(4)}\" # adding a random suffix for demo purposes" ] }, { "cell_type": "markdown", "id": "2e236a4cec095006", "metadata": {}, "source": [ "Send this [prompt](https://github.com/Arize-ai/phoenix/blob/390cfaa42c5b2c28d3f9f83fbf7c694b8c2beeff/packages/phoenix-evals/src/phoenix/evals/default_templates.py#L56) to Phoenix." ] }, { "cell_type": "code", "execution_count": null, "id": "7c45f7eb9449c47d", "metadata": {}, "outputs": [], "source": [ "content = \"\"\"\\\n", "In this task, you will be presented with a query, a reference text and an answer. The answer is\n", "generated to the question based on the reference text. The answer may contain false information. You\n", "must use the reference text to determine if the answer to the question contains false information,\n", "if the answer is a hallucination of facts. Your objective is to determine whether the answer text\n", "contains factual information and is not a hallucination. A 'hallucination' refers to\n", "an answer that is not based on the reference text or assumes information that is not available in\n", "the reference text. Your response should be a single word: either \"factual\" or \"hallucinated\", and\n", "it should not include any other text or characters. \"hallucinated\" indicates that the answer\n", "provides factually inaccurate information to the query based on the reference text. \"factual\"\n", "indicates that the answer to the question is correct relative to the reference text, and does not\n", "contain made up information. Please read the query and reference text carefully before determining\n", "your response.\n", "\n", "[BEGIN DATA]\n", "************\n", "[Query]: {{ question }}\n", "************\n", "[Reference text]: {{ knowledge }}\n", "************\n", "[Answer]: {{ answer }}\n", "************\n", "[END DATA]\n", "\n", "Is the answer above factual or hallucinated based on the query and reference text?\n", "\"\"\"\n", "_ = Client().prompts.create(\n", " name=prompt_name,\n", " prompt_description=\"Determining if an answer is factual or hallucinated based on a query and reference text\",\n", " version=PromptVersion(\n", " [{\"role\": \"user\", \"content\": content}],\n", " model_name=\"gpt-4o-mini\",\n", " ),\n", ")" ] }, { "cell_type": "markdown", "id": "f1a0dd3016df78d", "metadata": {}, "source": [ "# Get Prompt" ] }, { "cell_type": "code", "execution_count": null, "id": "3d192e418ca18794", "metadata": {}, "outputs": [], "source": [ "prompt = Client().prompts.get(prompt_identifier=prompt_name)" ] }, { "cell_type": "markdown", "id": "8ccdcbbee7959c4a", "metadata": {}, "source": [ "# OpenAI" ] }, { "cell_type": "code", "execution_count": null, "id": "4454b01bd239d2c0", "metadata": {}, "outputs": [], "source": [ "variables = dict(ds.as_dataframe().input.iloc[0])\n", "formatted_prompt = prompt.format(variables=variables)\n", "response = openai.OpenAI().chat.completions.create(**formatted_prompt)\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "56df6e085a044186", "metadata": {}, "source": [ "### Run Experiment" ] }, { "cell_type": "code", "execution_count": null, "id": "a627ca4d35934320", "metadata": {}, "outputs": [], "source": [ "def openai_eval(input):\n", " formatted_prompt = prompt.format(variables=dict(input))\n", " response = openai.OpenAI().chat.completions.create(**formatted_prompt)\n", " return {\"label\": response.choices[0].message.content}" ] }, { "cell_type": "code", "execution_count": null, "id": "b2cf52f88a0aa675", "metadata": {}, "outputs": [], "source": [ "exp_openai = run_experiment(ds, openai_eval)" ] }, { "cell_type": "markdown", "id": "7f017851cdbf4a45", "metadata": {}, "source": [ "### Calculate Accuracy" ] }, { "cell_type": "code", "execution_count": null, "id": "6ffae1704b6a3639", "metadata": {}, "outputs": [], "source": [ "result_openai = pd.concat(\n", " [df.true_label, pd.json_normalize(exp_openai.as_dataframe().output)], axis=1\n", ")\n", "print(f\"Accuracy: {accuracy_score(result_openai.true_label, result_openai.label) * 100:.0f}%\")" ] }, { "cell_type": "markdown", "id": "2f0325c15d05eeae", "metadata": {}, "source": [ "# Anthropic" ] }, { "cell_type": "code", "execution_count": null, "id": "816f54c07cfe21b", "metadata": {}, "outputs": [], "source": [ "anthropic_model = \"claude-3-5-haiku-latest\" # @param {type: \"string\"}" ] }, { "cell_type": "code", "execution_count": null, "id": "55f979c5f1e184b3", "metadata": {}, "outputs": [], "source": [ "variables = dict(ds.as_dataframe().input.iloc[0])\n", "formatted_prompt = prompt.format(variables=variables, sdk=\"anthropic\")\n", "response = anthropic.Anthropic().messages.create(**{**formatted_prompt, \"model\": anthropic_model})\n", "print(response.content[0].text)" ] }, { "cell_type": "markdown", "id": "3cf000a7500f7089", "metadata": {}, "source": [ "### Run Experiment" ] }, { "cell_type": "code", "execution_count": null, "id": "867e54c538f5f48c", "metadata": {}, "outputs": [], "source": [ "def anthropic_eval(input):\n", " formatted_prompt = prompt.format(variables=dict(input), sdk=\"anthropic\")\n", " response = anthropic.Anthropic().messages.create(\n", " **{**formatted_prompt, \"model\": anthropic_model}\n", " )\n", " return {\"label\": response.content[0].text}" ] }, { "cell_type": "code", "execution_count": null, "id": "3682243f5d5fd90b", "metadata": {}, "outputs": [], "source": [ "exp_anthropic = run_experiment(ds, anthropic_eval)" ] }, { "cell_type": "markdown", "id": "f5412ce1beabadd1", "metadata": {}, "source": [ "### Calculate Accuracy" ] }, { "cell_type": "code", "execution_count": null, "id": "35f0e49abb7dfbf9", "metadata": {}, "outputs": [], "source": [ "result_anthropic = pd.concat(\n", " [df.true_label, pd.json_normalize(exp_anthropic.as_dataframe().output)], axis=1\n", ")\n", "print(f\"Accuracy: {accuracy_score(result_anthropic.true_label, result_anthropic.label) * 100:.0f}%\")" ] }, { "cell_type": "markdown", "id": "b1da7ee9a70b1c44", "metadata": {}, "source": [ "# Gemini" ] }, { "cell_type": "code", "execution_count": null, "id": "31eda9b5f1c9aa19", "metadata": {}, "outputs": [], "source": [ "genai.configure(api_key=os.getenv(\"GOOGLE_API_KEY\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "2d0452fe79b40711", "metadata": {}, "outputs": [], "source": [ "gemini_model = \"gemini-1.5-flash\"" ] }, { "cell_type": "code", "execution_count": null, "id": "fbd7170fe0919a9e", "metadata": {}, "outputs": [], "source": [ "variables = dict(ds.as_dataframe().input.iloc[0])\n", "formatted_prompt = prompt.format(variables=variables, sdk=\"google_generativeai\")\n", "response = (\n", " GenerativeModel(**{**formatted_prompt.kwargs, \"model_name\": gemini_model})\n", " .start_chat(history=formatted_prompt.messages[:-1])\n", " .send_message(formatted_prompt.messages[-1])\n", ")\n", "print(response.candidates[0].content.parts[0].text)" ] }, { "cell_type": "markdown", "id": "7561c6bbe7dc625c", "metadata": {}, "source": [ "### Run Experiment" ] }, { "cell_type": "code", "execution_count": null, "id": "881f2b1010a8e6a7", "metadata": {}, "outputs": [], "source": [ "def gemini_eval(input):\n", " formatted_prompt = prompt.format(variables=variables, sdk=\"google_generativeai\")\n", " response = (\n", " GenerativeModel(**{**formatted_prompt.kwargs, \"model_name\": gemini_model})\n", " .start_chat(history=formatted_prompt.messages[:-1])\n", " .send_message(formatted_prompt.messages[-1])\n", " )\n", " return {\"label\": response.candidates[0].content.parts[0].text}" ] }, { "cell_type": "code", "execution_count": null, "id": "1891cdf062f03a27", "metadata": {}, "outputs": [], "source": [ "exp_gemini = run_experiment(ds, gemini_eval)" ] }, { "cell_type": "markdown", "id": "698fe9b13a78fcca", "metadata": {}, "source": [ "### Calculate Accuracy" ] }, { "cell_type": "code", "execution_count": null, "id": "81acf2e36a4369b5", "metadata": {}, "outputs": [], "source": [ "result_anthropic = pd.concat(\n", " [df.true_label, pd.json_normalize(exp_anthropic.as_dataframe().output)], axis=1\n", ")\n", "print(f\"Accuracy: {accuracy_score(result_anthropic.true_label, result_anthropic.label) * 100:.0f}%\")" ] }, { "cell_type": "markdown", "id": "6fb58722a3c21c9b", "metadata": {}, "source": [ "# Groq" ] }, { "cell_type": "code", "execution_count": null, "id": "16747c283ceb8579", "metadata": {}, "outputs": [], "source": [ "groq_model = \"deepseek-r1-distill-llama-70b\"" ] }, { "cell_type": "code", "execution_count": null, "id": "8ed68c082d377e59", "metadata": {}, "outputs": [], "source": [ "variables = dict(ds.as_dataframe().input.iloc[0])\n", "formatted_prompt = prompt.format(variables=variables)\n", "response = await groq.AsyncGroq().chat.completions.create(\n", " **{**formatted_prompt, \"model\": groq_model}\n", ")\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "ea5231e37bb3f1d1", "metadata": {}, "source": [ "### Run Experiment" ] }, { "cell_type": "code", "execution_count": null, "id": "696a9547d7e0428", "metadata": {}, "outputs": [], "source": [ "async def groq_eval(input):\n", " formatted_prompt = prompt.format(variables=dict(input))\n", " response = await groq.AsyncGroq().chat.completions.create(\n", " **{**formatted_prompt, \"model\": groq_model}\n", " )\n", " return {\"label\": response.choices[0].message.content}" ] }, { "cell_type": "code", "execution_count": null, "id": "450604d8583b7df1", "metadata": {}, "outputs": [], "source": [ "exp_groq = run_experiment(ds, groq_eval)" ] }, { "cell_type": "markdown", "id": "51dc7adf3ff93a1f", "metadata": {}, "source": [ "### Extract the Last Word to Calculate Accuracy" ] }, { "cell_type": "code", "execution_count": null, "id": "5b7067601f6f230", "metadata": {}, "outputs": [], "source": [ "labels = pd.json_normalize(exp_groq.as_dataframe().output).label.str.split(\"\\n\").str[-1]\n", "result = pd.concat([labels, df.true_label], axis=1)\n", "print(f\"Accuracy: {accuracy_score(result.true_label, result.label) * 100:.0f}%\")\n", "result" ] }, { "cell_type": "markdown", "id": "85efc6596f236bfb", "metadata": {}, "source": [ "Compare answers between GPT and DeepSeek" ] }, { "cell_type": "code", "execution_count": null, "id": "a73ce581c83940ec", "metadata": {}, "outputs": [], "source": [ "pd.concat([result_openai.label.rename(\"gpt\"), result.rename({\"label\": \"deepseek\"}, axis=1)], axis=1)" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Arize-ai/phoenix'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

hallucination_eval.ipynb•16.8 KiB