{
"cells": [
{
"cell_type": "markdown",
"id": "d27840ab",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
" <br>\n",
" <br>\n",
" <a href=\"https://arize-phoenix.readthedocs.io/projects/evals/en/latest/\">Evals Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
" </p>\n",
"</center>\n",
"<h1 align=\"center\">Arize Phoenix Evals 2.0</h1>\n",
"\n",
"Arize Phoenix is a fully open-source AI observability platform. It's designed for experimentation, evaluation, and troubleshooting.\n",
"\n",
"**In this notebook, you will learn how to do the following things using Evals 2.0:**\n",
"\n",
"1. How to evaluate Phoenix project traces.\n",
"2. How to improve your custom evaluators using experiments.\n",
"3. How to iterate on your application and evals on a realistic example.\n",
"\n",
"<center>\n",
" <h3 align=\"left\">The Evaluation Driven Development Lifecycle</h3>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"eval lifecycle\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/evals_lifecycle.png\" width=\"1000\"/>\n",
" </p>\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"id": "fd532b3c",
"metadata": {},
"source": [
"### Requirements\n",
"\n",
"1. Kaggle API key\n",
"2. OpenAI API key\n",
"3. A Phoenix instance (cloud or local)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0bbf2f80",
"metadata": {},
"outputs": [],
"source": [
"! uv pip install \"arize-phoenix-evals>=2.0.0\" \"arize-phoenix-client>=1.19.0\" arize-phoenix-otel kagglehub openinference-instrumentation-llama_index llama-index numpy pandas --quiet"
]
},
{
"cell_type": "markdown",
"id": "910ede0e",
"metadata": {},
"source": [
"# Dataset Preparation and Setup\n",
"\n",
"We are using a public RAG evaluation dataset. It has two components:\n",
"\n",
"1. A knowledge base of 20 documents of various lengths and sources.\n",
"2. 4 question-answer pairs per document.\n",
" - 2 which are not answerable by the document\n",
" - 2 which require a single passage to answer\n",
"\n",
"First, we need to do some data preparation.\n"
]
},
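{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"kagglehub typically picks up credentials from `~/.kaggle/kaggle.json` or from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The optional cell below is a minimal sketch that prompts for them if they are not already set; skip it if your credentials are already configured.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"# Optional: set Kaggle credentials if they are not already in the environment.\n",
"# kagglehub also reads ~/.kaggle/kaggle.json, in which case this cell can be skipped.\n",
"import os\n",
"from getpass import getpass\n",
"\n",
"if not os.getenv(\"KAGGLE_USERNAME\"):\n",
"    os.environ[\"KAGGLE_USERNAME\"] = input(\"Enter your Kaggle username: \")\n",
"if not os.getenv(\"KAGGLE_KEY\"):\n",
"    os.environ[\"KAGGLE_KEY\"] = getpass(\"Enter your Kaggle API key: \")"
]
},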
{
"cell_type": "code",
"execution_count": null,
"id": "8d391ad6",
"metadata": {},
"outputs": [],
"source": [
"# Download dataset\n",
"# Requires a Kaggle API key and username in your environment\n",
"\n",
"import os\n",
"\n",
"import kagglehub\n",
"\n",
"path = kagglehub.dataset_download(\"samuelmatsuoharris/single-topic-rag-evaluation-dataset\")\n",
"\n",
"print(\"Path to dataset files:\", path)\n",
"print(os.listdir(path))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f5edd54",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def prepare_query_data(path: str) -> pd.DataFrame:\n",
" single_passage_df = pd.read_csv(os.path.join(path, \"single_passage_answer_questions.csv\"))\n",
" no_answer_df = pd.read_csv(os.path.join(path, \"no_answer_questions.csv\"))\n",
"\n",
" # Single-passage questions\n",
" single_passage_processed = pd.DataFrame(\n",
" {\n",
" \"document_index\": single_passage_df[\"document_index\"],\n",
" \"query\": single_passage_df[\"question\"],\n",
" \"answer\": single_passage_df[\"answer\"],\n",
" \"query_type\": \"single_passage\",\n",
" }\n",
" )\n",
"\n",
" # No-answer questions\n",
" no_answer_processed = pd.DataFrame(\n",
" {\n",
" \"document_index\": no_answer_df[\"document_index\"],\n",
" \"query\": no_answer_df[\"question\"],\n",
" \"answer\": \"N/A\",\n",
" \"query_type\": \"no_answer\",\n",
" }\n",
" )\n",
"\n",
" # Combine all dataframes\n",
" combined_df = pd.concat([single_passage_processed, no_answer_processed], ignore_index=True)\n",
"\n",
" return combined_df\n",
"\n",
"\n",
"query_df = prepare_query_data(path)\n",
"query_df.sample(5).head()"
]
},
{
"cell_type": "markdown",
"id": "408ffb5d",
"metadata": {},
"source": [
"### Split data into train/test\n",
"\n",
"Split documents into a 60/40 train/test split. We will iterate and experiment on our train set only, leaving the test set for any final comparisons.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1cdac376",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"unique_docs = query_df[\"document_index\"].unique()\n",
"print(f\"Total unique documents: {len(unique_docs)}\")\n",
"\n",
"np.random.seed(42)\n",
"sample_size = int(len(unique_docs) * 0.6)\n",
"train_docs = np.random.choice(unique_docs, size=sample_size, replace=False)\n",
"print(f\"Sampled {len(train_docs)} documents ({len(train_docs) / len(unique_docs) * 100:.1f}%)\")\n",
"\n",
"# Split queries based on sampled document indices\n",
"all_queries = query_df.copy()\n",
"train_queries = query_df[query_df[\"document_index\"].isin(train_docs)]\n",
"test_queries = query_df[~query_df[\"document_index\"].isin(train_docs)]\n",
"print(f\"Train queries: {len(train_queries)}, Test queries: {len(test_queries)}\")"
]
},
{
"cell_type": "markdown",
"id": "eac5accc",
"metadata": {},
"source": [
"### Inspect the knowledge base documents\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbda19f7",
"metadata": {},
"outputs": [],
"source": [
"documents = pd.read_csv(os.path.join(path, \"documents.csv\"))\n",
"documents.head()"
]
},
{
"cell_type": "markdown",
"id": "c814af43",
"metadata": {},
"source": [
"### Set Up Phoenix Tracing\n",
"\n",
"This allows us to capture traces not only of our application, but also any evaluations and experiments we do.\n",
"\n",
"You can use either a locally hosted instance of Phoenix or Phoenix Cloud.\n"
]
},
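{
"cell_type": "markdown",
"id": "c9d0e1f2",
"metadata": {},
"source": [
"If you are using Phoenix Cloud (or another remote Phoenix instance), point the tracer at it before calling `register()`. The optional cell below is a sketch using the standard `PHOENIX_COLLECTOR_ENDPOINT` and `PHOENIX_API_KEY` environment variables; skip it if you are running Phoenix locally with default settings.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3b4c5d6",
"metadata": {},
"outputs": [],
"source": [
"# Optional: configure a remote Phoenix instance (e.g., Phoenix Cloud).\n",
"# Skip this cell if you are using a local Phoenix instance with default settings.\n",
"import os\n",
"from getpass import getpass\n",
"\n",
"if not os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\"):\n",
"    # Example endpoint for Phoenix Cloud; replace with your own instance's URL\n",
"    os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.getenv(\"PHOENIX_API_KEY\"):\n",
"    os.environ[\"PHOENIX_API_KEY\"] = getpass(\"Enter your Phoenix API key: \")"
]
},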
{
"cell_type": "code",
"execution_count": null,
"id": "7ad26ded",
"metadata": {},
"outputs": [],
"source": [
"# Set up Phoenix Tracing\n",
"from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n",
"\n",
"from phoenix.otel import register\n",
"\n",
"project_name = \"rag-demo\" # project for our application traces\n",
"tracer_provider = register(project_name=project_name, verbose=False)\n",
"LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)"
]
},
{
"cell_type": "markdown",
"id": "37c7fdc1",
"metadata": {},
"source": [
"Add your LLM API credentials. Here, we are using OpenAI.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8db1a1a2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"\ud83d\udd11 Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
]
},
{
"cell_type": "markdown",
"id": "14fae561",
"metadata": {},
"source": [
"# Set Up a RAG App using Llama Index\n",
"\n",
"For this demo application, we are building a simple RAG pipeline that has two components:\n",
"\n",
"1. Vector index to retrieve documents\n",
"2. LLM to generate responses\n",
"\n",
"For this initial application, let's keep it simple and use the default configuration and prompts from Llama Index.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14469a14",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from llama_index.core import (\n",
" Document,\n",
" Settings,\n",
" StorageContext,\n",
" VectorStoreIndex,\n",
" load_index_from_storage,\n",
")\n",
"from llama_index.embeddings.openai import OpenAIEmbedding\n",
"from llama_index.llms.openai import OpenAI\n",
"\n",
"index_dir = \"llamaindex_store\"\n",
"\n",
"# --- Ingest documents ---\n",
"if os.path.exists(index_dir):\n",
" storage_context = StorageContext.from_defaults(persist_dir=index_dir)\n",
" index = load_index_from_storage(storage_context)\n",
"else:\n",
" # --- Set up the LLM and embedding model ---\n",
" Settings.llm = OpenAI(model=\"gpt-4o-mini\", temperature=0) # generator\n",
" Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\") # retriever\n",
"\n",
" kb_docs = []\n",
" for _, row in documents.iterrows():\n",
" doc = Document(\n",
" text=str(row[\"text\"]),\n",
" metadata={\"source_url\": row[\"source_url\"], \"document_index\": row[\"index\"]},\n",
" id_=str(row[\"index\"]),\n",
" )\n",
" kb_docs.append(doc)\n",
"\n",
" index = VectorStoreIndex.from_documents(kb_docs)\n",
"\n",
"# Optional: persist to disk so you can reuse later\n",
"index.storage_context.persist(persist_dir=index_dir)\n",
"\n",
"# Create the query engine\n",
"query_engine = index.as_query_engine()"
]
},
{
"cell_type": "markdown",
"id": "c7b0091d",
"metadata": {},
"source": [
"Let's test to make sure our RAG system is working:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a70d108c",
"metadata": {},
"outputs": [],
"source": [
"query_engine.query(\"What is data science?\")"
]
},
{
"cell_type": "markdown",
"id": "ee8eb042",
"metadata": {},
"source": [
"### Run RAG on Train Set\n"
]
},
{
"cell_type": "markdown",
"id": "6b61a90b",
"metadata": {},
"source": [
"Let's wrap our query engine so it's easier to run on our dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3be71df7",
"metadata": {},
"outputs": [],
"source": [
"from openinference.instrumentation import using_metadata\n",
"\n",
"\n",
"async def run_rag_with_metadata(example, rag_engine):\n",
" \"\"\"Ask a question of the knowledge base.\"\"\"\n",
" metadata = {\n",
" \"expected_answer\": example[\"answer\"],\n",
" \"query_type\": example[\"query_type\"],\n",
" \"expected_document_index\": example[\"document_index\"],\n",
" \"split\": \"test\" if example[\"document_index\"] not in train_docs else \"train\",\n",
" }\n",
" with using_metadata(metadata):\n",
" rag_engine.query(example[\"query\"])"
]
},
{
"cell_type": "markdown",
"id": "fd90ae23",
"metadata": {},
"source": [
"We use the `AsyncExecutor` to run our RAG app on the training dataset with optimal speed.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94a1acdd",
"metadata": {},
"outputs": [],
"source": [
"# Run application on the train set to get a baseline\n",
"from functools import partial\n",
"\n",
"from phoenix.evals.executors import AsyncExecutor\n",
"from phoenix.evals.utils import get_tqdm_progress_bar_formatter\n",
"\n",
"executor = AsyncExecutor(\n",
" generation_fn=partial(run_rag_with_metadata, rag_engine=query_engine),\n",
" concurrency=10, # adjust this as needed\n",
" exit_on_error=True,\n",
" tqdm_bar_format=get_tqdm_progress_bar_formatter(\"Run RAG\"),\n",
")\n",
"\n",
"results, execution_details = await executor.execute(\n",
" [row.to_dict() for _, row in train_queries.iterrows()],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6097e6da",
"metadata": {},
"source": [
"# Evaluate the Traces\n",
"\n",
"First, let's go to Phoenix and look at our application traces. Do we observe any issues?\n",
"\n",
"- Is the RAG agent correctly refusing to answer the unanswerable queries?\n",
"- Is it retrieving the correct documents?\n",
"- Is it hallucinating?\n",
"\n",
"These are common questions we can turn into repeatable evaluations. So let's create a few evaluators for our RAG app and run them on our traces.\n",
"\n",
"**Steps:**\n",
"\n",
"1. Export traces from Phoenix\n",
"2. Define evaluators\n",
"3. Run evaluators on the trace data\n",
"4. Log the evaluation results back up to Phoenix\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14616534",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import Client\n",
"from phoenix.client.helpers.spans import get_input_output_context, get_retrieved_documents\n",
"\n",
"# fetch the retrieved document information\n",
"docs_df = get_retrieved_documents(Client(), project_name=project_name)\n",
"docs_df[\"document_index\"] = docs_df[\"document_metadata\"].apply(lambda x: x[\"document_index\"])\n",
"retrieved_doc_indices = docs_df.groupby(\"context.trace_id\")[\"document_index\"].apply(list)\n",
"\n",
"# fetch the full input, output, context, and metadata for the traces\n",
"qa_df = get_input_output_context(Client(), project_name=project_name)\n",
"qa_df = qa_df.dropna(subset=[\"metadata\"])\n",
"qa_df[\"expected_answer\"] = qa_df[\"metadata\"].apply(lambda x: x[\"expected_answer\"])\n",
"qa_df[\"split\"] = qa_df[\"metadata\"].apply(lambda x: x[\"split\"])\n",
"qa_df[\"expected_document_index\"] = qa_df[\"metadata\"].apply(lambda x: x[\"expected_document_index\"])\n",
"\n",
"\n",
"# combine them\n",
"spans_df = (\n",
" qa_df.reset_index()\n",
" .merge(\n",
" retrieved_doc_indices.rename(\"retrieved_document_indices\"),\n",
" left_on=\"context.trace_id\",\n",
" right_index=True, # retrieved_doc_indices is indexed by context.trace_id\n",
" how=\"left\",\n",
" )\n",
" .set_index(\"context.span_id\")\n",
")\n",
"print(spans_df.shape)\n",
"spans_df.head()"
]
},
{
"cell_type": "markdown",
"id": "10027700",
"metadata": {},
"source": [
"## Define Evaluators\n"
]
},
{
"cell_type": "markdown",
"id": "fd266d6d",
"metadata": {},
"source": [
"### 1. Refusal\n",
"\n",
"Since we know some questions in our dataset are unanswerable, we expect the LLM to respond accordingly. So let's create an LLM judge to check for refusals in the output.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1be11e03",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import LLM, create_classifier\n",
"\n",
"prompt = \"\"\"\n",
"Given a question and response, determine if the response is a refusal to answer the question.\n",
"\n",
"Question: {input}\n",
"Response: {output}\n",
"\n",
"Is the response a refusal or an informative answer to the question?\n",
"\"\"\"\n",
"\n",
"llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n",
"refusal_evaluator = create_classifier(\n",
" llm=llm,\n",
" name=\"llm_refusal\",\n",
" prompt_template=prompt,\n",
" choices={\"refusal\": 0, \"answer\": 1},\n",
")\n",
"\n",
"# test the evaluator on a single example\n",
"refusal_evaluator.evaluate(spans_df.iloc[0].to_dict())"
]
},
{
"cell_type": "markdown",
"id": "1cd978fd",
"metadata": {},
"source": [
"### 2. Hallucination\n",
"\n",
"Let's also check to see if our RAG pipeline is producing hallucinations. Phoenix evals has a built-in `HallucinationEvaluator` so we'll use that. First, let's inspect the `input_schema` so we know what it needs to run.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e663e735",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import LLM\n",
"from phoenix.evals.metrics import HallucinationEvaluator\n",
"\n",
"llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n",
"hallucination_evaluator = HallucinationEvaluator(llm=llm)\n",
"hallucination_evaluator.describe()"
]
},
{
"cell_type": "markdown",
"id": "d0b06175",
"metadata": {},
"source": [
"Luckily, our data already has columns that match the `input_schema` for this evaluator. If that were not the case, we could provide an `input_mapping` so it works on our data and bind it to the evaluator so we can reuse it.\n"
]
},
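{
"cell_type": "markdown",
"id": "e7f8a9b0",
"metadata": {},
"source": [
"For illustration, here is roughly what that binding could look like. This cell is not needed for this notebook; the mapping keys should match the field names reported by `describe()` above, and the column names on the right are hypothetical.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1d2e3f4",
"metadata": {},
"outputs": [],
"source": [
"# Not needed here -- a sketch of adapting differently named columns with an input_mapping.\n",
"# The mapping keys should match the evaluator's input_schema fields shown by describe();\n",
"# the column names on the right are hypothetical.\n",
"from phoenix.evals import bind_evaluator\n",
"\n",
"mapped_hallucination_evaluator = bind_evaluator(\n",
"    hallucination_evaluator,\n",
"    {\n",
"        \"input\": \"question\",  # hypothetical column holding the user query\n",
"        \"output\": \"model_response\",  # hypothetical column holding the RAG answer\n",
"        \"context\": \"retrieved_text\",  # hypothetical column holding the retrieved passages\n",
"    },\n",
")\n",
"# mapped_hallucination_evaluator.evaluate(row_with_those_columns)"
]
},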
{
"cell_type": "code",
"execution_count": null,
"id": "cab58273",
"metadata": {},
"outputs": [],
"source": [
"# test the evaluator on a single example\n",
"hallucination_evaluator.evaluate(spans_df.iloc[0].to_dict())"
]
},
{
"cell_type": "markdown",
"id": "595c7dbe",
"metadata": {},
"source": [
"### 3. Retrieval Precision\n",
"\n",
"We also want to measure how well the information retrieval component of our system is working. Let's add a precision metric which checks to see how often the target document appeared in the retrieved results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d7918b2",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import bind_evaluator, create_evaluator\n",
"\n",
"\n",
"@create_evaluator(name=\"precision\")\n",
"def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:\n",
" relevant_set = set(relevant_documents)\n",
" hits = sum(1 for doc in retrieved_documents if doc in relevant_set)\n",
" return hits / len(retrieved_documents)\n",
"\n",
"\n",
"# our precision evaluator expects a list of relevant documents,\n",
"# but our dataset only has one relevant document per query, so we\n",
"# wrap the expected document index in a list inside our mapping using a lambda function\n",
"precision_mapping = {\n",
" \"relevant_documents\": lambda x: [x[\"expected_document_index\"]],\n",
" \"retrieved_documents\": \"retrieved_document_indices\",\n",
"}\n",
"\n",
"precision_evaluator = bind_evaluator(precision, precision_mapping)\n",
"\n",
"# test the evaluator on a single example\n",
"precision_evaluator.evaluate(spans_df.iloc[0].to_dict())"
]
},
{
"cell_type": "markdown",
"id": "efa203b9",
"metadata": {},
"source": [
"### Putting it all together\n",
"\n",
"Let's run our 3 evaluators on all of our project traces.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff5e4324",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import async_evaluate_dataframe\n",
"\n",
"train_spans = spans_df[spans_df[\"split\"] == \"train\"]\n",
"results = await async_evaluate_dataframe(\n",
" dataframe=train_spans,\n",
" evaluators=[precision_evaluator, hallucination_evaluator, refusal_evaluator],\n",
" concurrency=10,\n",
" exit_on_error=True,\n",
")\n",
"results.head()"
]
},
{
"cell_type": "markdown",
"id": "db062d38",
"metadata": {},
"source": [
"### Log trace evaluations back to Phoenix\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7215147",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import AsyncClient\n",
"from phoenix.evals.utils import to_annotation_dataframe\n",
"\n",
"client = AsyncClient()\n",
"\n",
"annotations = to_annotation_dataframe(\n",
" dataframe=results\n",
") # can also specify score_names to log only certain scores\n",
"await client.spans.log_span_annotations_dataframe(dataframe=annotations)"
]
},
{
"cell_type": "markdown",
"id": "8bb95d5f",
"metadata": {},
"source": [
"# Improve Evaluators\n",
"\n",
"Go into Phoenix and look at your project traces now that you've added some eval metrics. Pay attention to the \"llm_refusal\" metric - is it catching all the refusals?\n",
"No, it looks like it is not performing as expected.\n",
"\n",
"Let's see if we can improve our LLM Judge so it is better aligned.\n",
"\n",
"**Steps:**\n",
"\n",
"1. Manually annotate some traces as \"refused\" or \"responded\" inside Phoenix.\n",
"2. Export those annotated traces and use to create a dataset for experimentation.\n",
"3. Define an LLM judge (refusal) and use as the experiment \"task\".\n",
"4. Create a simple heuristic experiment evaluator that checks for an exact match between the judge score and our annotation\n",
"5. Iterate on the judge prompt until we are happy with the results.\n",
"\n",
"<center>\n",
" <h3 align=\"left\">Phoenix Experiments</h3>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"eval lifecycle\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment.png\" width=\"1000\"/>\n",
" </p>\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"id": "8b2ea926",
"metadata": {},
"source": [
"After manual annotation, pull down those traces:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b556f20",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client import Client\n",
"from phoenix.client.types.spans import SpanQuery\n",
"\n",
"# Export all the top level spans\n",
"query = SpanQuery().where(\"name == 'RetrieverQueryEngine.query'\")\n",
"spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)\n",
"\n",
"# Shape the spans dataframe\n",
"spans_df[\"query\"] = spans_df[\"attributes.input.value\"]\n",
"spans_df[\"response\"] = spans_df[\"attributes.output.value\"]\n",
"spans_df.dropna(subset=[\"attributes.metadata\"], inplace=True)\n",
"spans_df[\"expected_answer\"] = spans_df[\"attributes.metadata\"].apply(lambda x: x[\"expected_answer\"])\n",
"\n",
"# Export annotations and add to the spans from earlier\n",
"annotations_df = Client().spans.get_span_annotations_dataframe(\n",
" spans_dataframe=spans_df, project_identifier=project_name\n",
")\n",
"refusal_ground_truth = annotations_df[\n",
" (annotations_df[\"annotator_kind\"] == \"HUMAN\") & (annotations_df[\"annotation_name\"] == \"refusal\")\n",
"]\n",
"refusal_ground_truth = refusal_ground_truth.rename_axis(index={\"span_id\": \"context.span_id\"})\n",
"refusal_ground_truth = refusal_ground_truth.rename(columns={\"result.score\": \"refusal_score\"})\n",
"labeled_spans_df = spans_df.merge(\n",
" refusal_ground_truth[[\"refusal_score\"]], left_index=True, right_index=True, how=\"left\"\n",
")\n",
"labeled_spans_df = labeled_spans_df[\n",
" [\"context.span_id\", \"query\", \"response\", \"refusal_score\", \"expected_answer\"]\n",
"]\n",
"labeled_spans = labeled_spans_df.dropna(subset=[\"refusal_score\"])\n",
"labeled_spans[\"refusal_score\"].value_counts()\n",
"print(labeled_spans[\"refusal_score\"].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f34a1f3d",
"metadata": {},
"outputs": [],
"source": [
"labeled_spans.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43a6cff1",
"metadata": {},
"outputs": [],
"source": [
"dataset_name = \"annotated-queries-train\"\n",
"ds = await AsyncClient().datasets.create_dataset(\n",
" name=dataset_name,\n",
" dataframe=labeled_spans,\n",
" input_keys=[\"query\", \"response\"],\n",
" output_keys=[\"refusal_score\", \"expected_answer\"],\n",
")\n",
"\n",
"# get the dataset if already created\n",
"# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)"
]
},
{
"cell_type": "markdown",
"id": "976909ff",
"metadata": {},
"source": [
"Next, we define:\n",
"\n",
"1. The experiment task (here, our LLM evaluator is the task).\n",
"2. The experiment evaluator (here, a simple check for alignment between the judge and human).\n",
"\n",
"**Note: 2.0 evaluators are now drop-in compatible as experiment evaluators!**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40e6cf8e",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import create_evaluator\n",
"\n",
"\n",
"# experiment task\n",
"def run_refusal_eval(input, evaluator):\n",
" result = evaluator.evaluate(input)\n",
" return result[0]\n",
"\n",
"\n",
"# experiment evaluator\n",
"@create_evaluator(name=\"exact_match\")\n",
"def exact_match(output, expected):\n",
" return float(output[\"score\"]) == float(expected[\"refusal_score\"])"
]
},
{
"cell_type": "markdown",
"id": "b2b18ede",
"metadata": {},
"source": [
"First, let's get a baseline by testing out the initial refusal evaluator against our annotations.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "264393e0",
"metadata": {},
"outputs": [],
"source": [
"prompt = \"\"\"\n",
"Given a question and response, determine if the response is a refusal to answer the question.\n",
"\n",
"Question: {query}\n",
"Response: {response}\n",
"\n",
"Is the response a refusal or an informative response to the question?\n",
"\"\"\"\n",
"\n",
"llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n",
"baseline_refusal = create_classifier(\n",
" llm=llm,\n",
" name=\"refusal\",\n",
" prompt_template=prompt,\n",
" choices={\"refusal\": 0, \"answer\": 1},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a79c414",
"metadata": {},
"outputs": [],
"source": [
"from functools import partial\n",
"\n",
"from phoenix.client import AsyncClient\n",
"\n",
"async_client = AsyncClient()\n",
"\n",
"experiment = await async_client.experiments.run_experiment(\n",
" dataset=ds,\n",
" task=partial(run_refusal_eval, evaluator=baseline_refusal),\n",
" experiment_name=\"baseline\",\n",
" evaluators=[exact_match],\n",
" concurrency=10,\n",
" # dry_run=3,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f06048dc",
"metadata": {},
"source": [
"Go to the experiment and filter by \"exact_match==0\" -- these are the examples where our refusal judge disagreed with the human annotations. Look at the RAG responses and the judge explanations. I notice that the LLM judge considers uncertain responses as valid answers rather than refusals.\n",
"\n",
"Based on these observations, let's tweak our prompt to make the evaluation criteria more clear to the LLM judge. Describe exactly what a \"refusal\" looks like.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "118d1368",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import LLM, create_classifier\n",
"\n",
"prompt = \"\"\"\n",
"Given a question and response, determine if the response is a refusal to answer the question.\n",
"Refusals often contain phrases of uncertainty like 'I don't know' and 'I don't have that information'.\n",
"They also often mention that the answer is not provided in the information or context.\n",
"\n",
"If the response contains these phrases, it is a refusal. Even if the response contains other\n",
"text indicating an attempt to answer the question, it is still a refusal.\n",
"\n",
"If the response does not contain these \"hedging\" phrases, it is an informative response. Do not\n",
"consider the correctness of the response, only whether it is a refusal or not.\n",
"\n",
"Question: {query}\n",
"Response: {response}\n",
"\n",
"Is the response a refusal or an informative answer to the question?\n",
"\"\"\"\n",
"\n",
"refusal_v2 = create_classifier(\n",
" llm=llm,\n",
" name=\"refusal\",\n",
" prompt_template=prompt,\n",
" choices={\"refusal\": 0, \"answer\": 1},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24b64bba",
"metadata": {},
"outputs": [],
"source": [
"experiment = await async_client.experiments.run_experiment(\n",
" dataset=ds,\n",
" task=partial(run_refusal_eval, evaluator=refusal_v2),\n",
" experiment_name=\"prompt-v2\",\n",
" evaluators=[exact_match],\n",
" concurrency=10,\n",
" # dry_run=3,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a1ca557d",
"metadata": {},
"source": [
"Looking at this experiment in Phoenix, I see that we now have \"exact_match == 1.0\" indicating 100% agreement between our new judge and the annotations!\n",
"\n",
"Through experimentation we were able to improve the evaluation metric itself, much in the same way we would improve any process.\n"
]
},
{
"cell_type": "markdown",
"id": "88adddfe",
"metadata": {},
"source": [
"# Improve the Application\n",
"\n",
"Now that we feel good about our refusal metric, let's see if we can improve our RAG system.\n",
"\n",
"Exactly 50% of the queries in our dataset are unanswerable, so ideally we would like to see the \"llm_refusal\" score close to 0.5. We don't want the RAG system attempting to answer questions that are not answerable from the context because this increases the chances of hallucination - not good!\n",
"\n",
"**Steps:**\n",
"\n",
"1. Create a dataset using the train set queries.\n",
"2. Define our experiment task (running RAG on our dataset).\n",
"3. Use our new and improved refusal classifier as the experiment evaluator.\n",
"4. Iterate on the RAG agent's prompt until we are happy.\n"
]
},
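{
"cell_type": "markdown",
"id": "b5c6d7e8",
"metadata": {},
"source": [
"As a quick sanity check of that 50/50 ratio, we can look at the query types in the `train_queries` dataframe from earlier:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9a0b1c2",
"metadata": {},
"outputs": [],
"source": [
"# Roughly half of the train queries should have query_type == \"no_answer\"\n",
"train_queries[\"query_type\"].value_counts(normalize=True)"
]
},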
{
"cell_type": "code",
"execution_count": null,
"id": "61899d7c",
"metadata": {},
"outputs": [],
"source": [
"dataset_name = \"train-queries\"\n",
"ds = await AsyncClient().datasets.create_dataset(\n",
" name=dataset_name,\n",
" dataframe=train_queries,\n",
" input_keys=[\"query\"],\n",
")\n",
"\n",
"# if already created\n",
"# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "736688a4",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.evals import bind_evaluator\n",
"\n",
"\n",
"# define experiment task (running the RAG engine)\n",
"async def run_rag_task(input, rag_engine):\n",
" \"\"\"Ask a question of the knowledge base.\"\"\"\n",
" response = rag_engine.query(input[\"query\"])\n",
" return response\n",
"\n",
"\n",
"# use an input mapping to fit our dataset to the evaluator we created earlier\n",
"refusal_evaluator = bind_evaluator(refusal_v2, {\"query\": \"input.query\", \"response\": \"output\"})"
]
},
{
"cell_type": "markdown",
"id": "ca222103",
"metadata": {},
"source": [
"### Experiment 1: Baseline RAG System\n",
"\n",
"Let's rerun our initial RAG system to get a baseline. How do the \"out-of-the-box\" defaults work?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cddfdd87",
"metadata": {},
"outputs": [],
"source": [
"query_engine_baseline = index.as_query_engine()\n",
"baseline_experiment = await AsyncClient().experiments.run_experiment(\n",
" dataset=ds,\n",
" task=partial(run_rag_task, rag_engine=query_engine_baseline),\n",
" experiment_name=\"baseline\",\n",
" evaluators=[refusal_evaluator],\n",
" concurrency=10,\n",
" # dry_run=3,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "82378af0",
"metadata": {},
"source": [
"### Experiment 2: RAG with Custom Prompt\n",
"\n",
"Go into Phoenix to see the results of our experiment.\n",
"\n",
"The refusal score is a little high - we want to get it down closer to 0.5 since we know 50% of our queries are unanswerable. Let's see if modifying the system prompt used for the LLM generation component of our RAG system helps.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "904d96cc",
"metadata": {},
"outputs": [],
"source": [
"from textwrap import dedent\n",
"\n",
"custom_system_prompt = \"\"\"You are an expert at answering questions about a given context.\n",
"\\nAlways answer the query using the provided context information, and not prior knowledge.\n",
"\\nSome rules to follow:\n",
"\\n1. Never directly reference the given context in your answer.\n",
"\\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...'\n",
"or anything along those lines.\n",
"\\n3. Do NOT use prior knowledge to answer the question. Only use the context provided.\n",
"\\n4. If you cannot find the answer in the context, say 'I cannot find that information.' When in\n",
"doubt, default to responding 'I cannot find that information.'\n",
"\"\"\"\n",
"custom_query_engine = index.as_query_engine(system_prompt=dedent(custom_system_prompt))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "400f09fd",
"metadata": {},
"outputs": [],
"source": [
"experiment = await AsyncClient().experiments.run_experiment(\n",
" dataset=ds,\n",
" task=partial(run_rag_task, rag_engine=custom_query_engine),\n",
" experiment_name=\"custom-prompt\",\n",
" evaluators=[refusal_evaluator],\n",
" concurrency=10,\n",
" # dry_run=3,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d3c1b209",
"metadata": {},
"source": [
"Check out the results of this experiment in Phoenix.\n",
"\n",
"Nice, we are heading in the right direction! Our refusal score went down a bit closer to 0.5, indicating that our RAG system is correctly refusing to answer more queries.\n"
]
},
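{
"cell_type": "markdown",
"id": "d3e4f5a6",
"metadata": {},
"source": [
"As an optional final step, you could run the same comparison on the held-out test queries we set aside at the beginning. The sketch below reuses `run_rag_task`, `custom_query_engine`, and `refusal_evaluator` from above; the dataset name `test-queries` is arbitrary.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7c8d9e0",
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: evaluate the custom-prompt RAG engine on the held-out test split\n",
"test_ds = await AsyncClient().datasets.create_dataset(\n",
"    name=\"test-queries\",\n",
"    dataframe=test_queries,\n",
"    input_keys=[\"query\"],\n",
")\n",
"\n",
"test_experiment = await AsyncClient().experiments.run_experiment(\n",
"    dataset=test_ds,\n",
"    task=partial(run_rag_task, rag_engine=custom_query_engine),\n",
"    experiment_name=\"custom-prompt-test\",\n",
"    evaluators=[refusal_evaluator],\n",
"    concurrency=10,\n",
")"
]
},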
{
"cell_type": "markdown",
"id": "dbdb7104",
"metadata": {},
"source": [
"# Conclusion\n"
]
},
{
"cell_type": "markdown",
"id": "be17c5a8",
"metadata": {},
"source": [
"In this notebook, we have covered a lot! Now you know:\n",
"\n",
"1. How to evaluate traces using different types of evaluators:\n",
" - custom LLM classifiers\n",
" - built-in metrics\n",
" - heuristic functions using the `create_evaluator` decorator\n",
"2. How to build and iterate on an LLM Evaluator using experiments\n",
"3. How to iterate on an application using experiments and evaluators\n",
"\n",
"For more information, check out our [Documentation!](https://arize-phoenix.readthedocs.io/projects/evals/en/latest/)\n"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.23"
},
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}