evals_2.0_rag_demo.ipynb (94.8 kB)
{ "cells": [ { "cell_type": "markdown", "id": "d27840ab", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n", " <br>\n", " <br>\n", " <a href=\"https://arize-phoenix.readthedocs.io/projects/evals/en/latest/\">Evals Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Arize Phoenix Evals 2.0</h1>\n", "\n", "Arize Phoenix is a fully open-source AI observability platform. It's designed for experimentation, evaluation, and troubleshooting.\n", "\n", "**In this notebook, you will learn how to do the following things using Evals 2.0:**\n", "\n", "1. How to evaluate Phoenix project traces.\n", "2. How to improve your custom evaluators using experiments.\n", "3. How to iterate on your application and evals on a realistic example.\n", "\n", "<center>\n", " <h3 align=\"left\">The Evaluation Driven Development Lifecycle</h3>\n", " <p style=\"text-align:center\">\n", " <img alt=\"eval lifecycle\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/evals_lifecycle.png\" width=\"1000\"/>\n", " </p>\n", "</center>\n" ] }, { "cell_type": "markdown", "id": "fd532b3c", "metadata": {}, "source": [ "### Requirements\n", "\n", "1. Kaggle API key\n", "2. OpenAI API key\n", "3. A Phoenix instance (cloud or local)\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "0bbf2f80", "metadata": {}, "outputs": [], "source": [ "! uv pip install \"arize-phoenix-evals>=2.0.0\" \"arize-phoenix-client>=1.19.0\" arize-phoenix-otel kagglehub openinference-instrumentation-llama_index llama-index numpy pandas --quiet" ] }, { "cell_type": "code", "execution_count": 2, "id": "f94c20e9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2mUsing Python 3.9.23 environment at: /Users/elizabethhutton/Projects/phoenix/.venv\u001b[0m\n", "\u001b[2K\u001b[2mResolved \u001b[1m94 packages\u001b[0m \u001b[2min 140ms\u001b[0m\u001b[0m \u001b[0m\n", "\u001b[2K\u001b[2mInstalled \u001b[1m2 packages\u001b[0m \u001b[2min 6ms\u001b[0m\u001b[0mmentation==0.58b0 \u001b[0m\n", " \u001b[32m+\u001b[39m \u001b[1mopeninference-instrumentation-llama-index\u001b[0m\u001b[2m==4.3.5\u001b[0m\n", " \u001b[32m+\u001b[39m \u001b[1mopentelemetry-instrumentation\u001b[0m\u001b[2m==0.58b0\u001b[0m\n" ] } ], "source": [ "! uv pip install openinference-instrumentation-llama_index llama-index" ] }, { "cell_type": "markdown", "id": "910ede0e", "metadata": {}, "source": [ "# Dataset Preparation and Setup\n", "\n", "We are using a public RAG evaluation dataset. It has two components:\n", "\n", "1. A knowledge base of 20 documents of various lengths and sources.\n", "2. 
4 question-answer pairs per document.\n", " - 2 which are not answerable by the document\n", " - 2 which require a single passage to answer\n", "\n", "First, we need to do some data preparation.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "8d391ad6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/elizabethhutton/Projects/phoenix/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020\n", " warnings.warn(\n", "/Users/elizabethhutton/Projects/phoenix/.venv/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Path to dataset files: /Users/elizabethhutton/.cache/kagglehub/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset/versions/4\n", "['multi_passage_answer_questions.csv', 'documents.csv', 'single_passage_answer_questions.csv', 'no_answer_questions.csv']\n" ] } ], "source": [ "# Download dataset\n", "# Requires a Kaggle API key and username in your environment\n", "\n", "import os\n", "\n", "import kagglehub\n", "\n", "path = kagglehub.dataset_download(\"samuelmatsuoharris/single-topic-rag-evaluation-dataset\")\n", "\n", "print(\"Path to dataset files:\", path)\n", "print(os.listdir(path))" ] }, { "cell_type": "code", "execution_count": 3, "id": "6f5edd54", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>document_index</th>\n", " <th>query</th>\n", " <th>answer</th>\n", " <th>query_type</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>72</th>\n", " <td>16</td>\n", " <td>In what version was the velociraptor introduced?</td>\n", " <td>N/A</td>\n", " <td>no_answer</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>3</td>\n", " <td>How do the data storage options compare?</td>\n", " <td>For fast start: use SQLite3 and ChromaDB (File...</td>\n", " <td>single_passage</td>\n", " </tr>\n", " <tr>\n", " <th>73</th>\n", " <td>16</td>\n", " <td>What was added in version 1.5.7?</td>\n", " <td>N/A</td>\n", " <td>no_answer</td>\n", " </tr>\n", " <tr>\n", " <th>46</th>\n", " <td>3</td>\n", " <td>When was GPT 4 made available</td>\n", " <td>N/A</td>\n", " <td>no_answer</td>\n", " </tr>\n", " <tr>\n", " <th>74</th>\n", " <td>17</td>\n", " <td>Why did Saga stab Scratch?</td>\n", " <td>N/A</td>\n", " <td>no_answer</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " document_index query \\\n", "72 16 In what version was the velociraptor introduced? \n", "6 3 How do the data storage options compare? \n", "73 16 What was added in version 1.5.7? \n", "46 3 When was GPT 4 made available \n", "74 17 Why did Saga stab Scratch? \n", "\n", " answer query_type \n", "72 N/A no_answer \n", "6 For fast start: use SQLite3 and ChromaDB (File... 
single_passage \n", "73 N/A no_answer \n", "46 N/A no_answer \n", "74 N/A no_answer " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "\n", "def prepare_query_data(path: str) -> pd.DataFrame:\n", " single_passage_df = pd.read_csv(os.path.join(path, \"single_passage_answer_questions.csv\"))\n", " no_answer_df = pd.read_csv(os.path.join(path, \"no_answer_questions.csv\"))\n", "\n", " # Single-passage questions\n", " single_passage_processed = pd.DataFrame(\n", " {\n", " \"document_index\": single_passage_df[\"document_index\"],\n", " \"query\": single_passage_df[\"question\"],\n", " \"answer\": single_passage_df[\"answer\"],\n", " \"query_type\": \"single_passage\",\n", " }\n", " )\n", "\n", " # No-answer questions\n", " no_answer_processed = pd.DataFrame(\n", " {\n", " \"document_index\": no_answer_df[\"document_index\"],\n", " \"query\": no_answer_df[\"question\"],\n", " \"answer\": \"N/A\",\n", " \"query_type\": \"no_answer\",\n", " }\n", " )\n", "\n", " # Combine all dataframes\n", " combined_df = pd.concat([single_passage_processed, no_answer_processed], ignore_index=True)\n", "\n", " return combined_df\n", "\n", "\n", "query_df = prepare_query_data(path)\n", "query_df.sample(5).head()" ] }, { "cell_type": "markdown", "id": "408ffb5d", "metadata": {}, "source": [ "### Split data into train/test\n", "\n", "Split documents into a 60/40 train/test split. We will iterate and experiment on our train set only, leaving the test set for any final comparisons.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "1cdac376", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total unique documents: 20\n", "Sampled 12 documents (60.0%)\n", "Train queries: 48, Test queries: 32\n" ] } ], "source": [ "import numpy as np\n", "\n", "unique_docs = query_df[\"document_index\"].unique()\n", "print(f\"Total unique documents: {len(unique_docs)}\")\n", "\n", "np.random.seed(42)\n", "sample_size = int(len(unique_docs) * 0.6)\n", "train_docs = np.random.choice(unique_docs, size=sample_size, replace=False)\n", "print(f\"Sampled {len(train_docs)} documents ({len(train_docs) / len(unique_docs) * 100:.1f}%)\")\n", "\n", "# Split queries based on sampled document indices\n", "all_queries = query_df.copy()\n", "train_queries = query_df[query_df[\"document_index\"].isin(train_docs)]\n", "test_queries = query_df[~query_df[\"document_index\"].isin(train_docs)]\n", "print(f\"Train queries: {len(train_queries)}, Test queries: {len(test_queries)}\")" ] }, { "cell_type": "markdown", "id": "eac5accc", "metadata": {}, "source": [ "### Inspect the knowledge base documents\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "dbda19f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>index</th>\n", " <th>source_url</th>\n", " <th>text</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>https://enterthegungeon.fandom.com/wiki/Bullet...</td>\n", " <td>Bullet Kin\\nBullet Kin are one of the most com...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " 
<td>https://www.dropbox.com/scl/fi/ljtdg6eaucrbf1a...</td>\n", " <td>---The Paths through the Underground/Underdark...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>https://bytes-and-nibbles.web.app/bytes/stici-...</td>\n", " <td>Semantic and Textual Inference Chatbot Interfa...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>https://github.com/llmware-ai/llmware</td>\n", " <td>llmware\\n\\nBuilding Enterprise RAG Pipelines w...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>https://docs.marimo.io/recipes.html</td>\n", " <td>Recipes\\nThis page includes code snippets or โ€œ...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " index source_url \\\n", "0 0 https://enterthegungeon.fandom.com/wiki/Bullet... \n", "1 1 https://www.dropbox.com/scl/fi/ljtdg6eaucrbf1a... \n", "2 2 https://bytes-and-nibbles.web.app/bytes/stici-... \n", "3 3 https://github.com/llmware-ai/llmware \n", "4 4 https://docs.marimo.io/recipes.html \n", "\n", " text \n", "0 Bullet Kin\\nBullet Kin are one of the most com... \n", "1 ---The Paths through the Underground/Underdark... \n", "2 Semantic and Textual Inference Chatbot Interfa... \n", "3 llmware\\n\\nBuilding Enterprise RAG Pipelines w... \n", "4 Recipes\\nThis page includes code snippets or โ€œ... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "documents = pd.read_csv(os.path.join(path, \"documents.csv\"))\n", "documents.head()" ] }, { "cell_type": "markdown", "id": "c814af43", "metadata": {}, "source": [ "### Set Up Phoenix Tracing\n", "\n", "This allows us to capture traces not only of our application, but also any evaluations and experiments we do.\n", "\n", "You can use either a locally hosted instance of Phoenix or Phoenix Cloud.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "7ad26ded", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/elizabethhutton/Projects/phoenix/.venv/lib/python3.9/site-packages/pydantic/_internal/_config.py:373: UserWarning: Valid config keys have changed in V2:\n", "* 'fields' has been removed\n", " warnings.warn(message, UserWarning)\n", "/Users/elizabethhutton/Projects/phoenix/src/phoenix/otel/otel.py:434: UserWarning: Could not infer collector endpoint protocol, defaulting to HTTP.\n", " warnings.warn(\"Could not infer collector endpoint protocol, defaulting to HTTP.\")\n" ] } ], "source": [ "# Set up Phoenix Tracing\n", "from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n", "\n", "from phoenix.otel import register\n", "\n", "project_name = \"rag-demo\" # project for our application traces\n", "tracer_provider = register(project_name=project_name, verbose=False)\n", "LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)" ] }, { "cell_type": "markdown", "id": "37c7fdc1", "metadata": {}, "source": [ "Add your LLM API credentials. 
Here, we are using OpenAI.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "8db1a1a2", "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"๐Ÿ”‘ Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "id": "14fae561", "metadata": {}, "source": [ "# Set Up a RAG App using Llama Index\n", "\n", "For this demo application, we are building a simple RAG pipeline that has two components:\n", "\n", "1. Vector index to retrieve documents\n", "2. LLM to generate responses\n", "\n", "For this initial application, let's keep it simple and use the default configuration and prompts from Llama Index.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "14469a14", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from llama_index.core import (\n", " Document,\n", " Settings,\n", " StorageContext,\n", " VectorStoreIndex,\n", " load_index_from_storage,\n", ")\n", "from llama_index.embeddings.openai import OpenAIEmbedding\n", "from llama_index.llms.openai import OpenAI\n", "\n", "index_dir = \"llamaindex_store\"\n", "\n", "# --- Ingest documents ---\n", "if os.path.exists(index_dir):\n", " storage_context = StorageContext.from_defaults(persist_dir=index_dir)\n", " index = load_index_from_storage(storage_context)\n", "else:\n", " # --- Set up the LLM and embedding model ---\n", " Settings.llm = OpenAI(model=\"gpt-4o-mini\", temperature=0) # generator\n", " Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\") # retriever\n", "\n", " kb_docs = []\n", " for _, row in documents.iterrows():\n", " doc = Document(\n", " text=str(row[\"text\"]),\n", " metadata={\"source_url\": row[\"source_url\"], \"document_index\": row[\"index\"]},\n", " id_=str(row[\"index\"]),\n", " )\n", " kb_docs.append(doc)\n", "\n", " index = VectorStoreIndex.from_documents(kb_docs)\n", "\n", "# Optional: persist to disk so you can reuse later\n", "index.storage_context.persist(persist_dir=index_dir)\n", "\n", "# Create the query engine\n", "query_engine = index.as_query_engine()" ] }, { "cell_type": "markdown", "id": "c7b0091d", "metadata": {}, "source": [ "Let's test to make sure our RAG system is working:\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "a70d108c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Response(response='Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.', source_nodes=[NodeWithScore(node=TextNode(id_='9069eda0-f363-47bc-9f89-e0888c6aeffb', embedding=None, metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='13', node_type='4', metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, hash='2187a4efee001b52656775153aab90cdc0d580feb56478d77d92d778b432b832'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='7d21cbc0-db4a-43f6-adf9-3f5d922207a5', node_type='1', metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, 
hash='28fde0d3d4b93baab75a2bfd7f7a9829a03827931c26b8c35abd053e7b8da4d8'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='a63e96f4-356b-4247-a451-b62171151193', node_type='1', metadata={}, hash='859bd2c50b1655679651478def43f6b31c098e94e6e76c1dc735688cb5b900a8')}, metadata_template='{key}: {value}', metadata_separator='\\n', text='In the naive implementation, we had separate attention objects each with their own key, query, and value tensors but now we have them all in one tensor, therefore we need a dimension for the heads. We define the new shape we want in mha_shape . Then we use mx.as_strided() to reshape each tensor to have the head dimension. This function is equivalent to view from pytorch and tells mlx to treat this array as a different shape. But we still have a problem. Notice that we if try to multiply Q @ K_t (where K_t is K transposed over itโ€™s last 2 dims) to compute attention weights as we did before, we will be multiplying the following shapes:\\n\\n(B, T, n_heads, head_size//n_heads) @ (B, T, head_size//n_heads, n_heads)\\nResult shape: (B, T, n_heads, n_heads)\\nThis would result in a tensor of shape (B, T, n_heads, n_heads) which is incorrect. With one head our attention weights were shape (B, T, T) which makes sense because it gives us the interaction between each pair of tokens. So now our shape should be the same but with a heads dimension: (B, n_heads, T, T) . We achieve this by transposing the dimensions of keys, queries, and values after we reshape them to make n_heads dimension 1 instead of 2.\\n\\nhead_size = 64 # put at top of file\\nn_heads = 8 # put at top of file\\nclass MultiHeadAttention(nn.Module):\\n def __init__(self):\\n super().__init__()\\n self.k_proj = nn.Linear(n_emb, head_size, bias=False)\\n self.q_proj = nn.Linear(n_emb, head_size, bias=False)\\n self.v_proj = nn.Linear(n_emb, head_size, bias=False)\\n indices = mx.arange(ctx_len)\\n mask = indices[:, None] < indices[None] # broadcasting trick\\n self._causal_mask = mask * -1e9\\n self.c_proj = nn.Linear(head_size, n_emb) # output projection\\n self.attn_dropout = nn.Dropout(dropout)\\n self.resid_dropout = nn.Dropout(dropout)\\n def __call__(self, x):\\n B, T, C = x.shape # (batch_size, ctx_len, n_emb)\\n K = self.k_proj(x) # (B, T, head_size)\\n Q = self.q_proj(x) # (B, T, head_size)\\n V = self.v_proj(x) # (B, T, head_size)\\n mha_shape = (B, T, n_heads, head_size//n_heads)\\n K = mx.as_strided(K, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)\\n Q = mx.as_strided(Q, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)\\n V = mx.as_strided(V, (mha_shape)).transpose([0, 2, 1, 3]) # (B, n_heads, T, head_size//n_heads)\\n attn_weights = (Q @ K.transpose([0, 1, 3, 2])) / math.sqrt(Q.shape[-1]) # (B, n_heads, T, T)\\n attn_weights = attn_weights + self._causal_mask[:T, :T]\\n attn_weights = mx.softmax(attn_weights, axis=-1)\\n attn_weights = self.attn_dropout(attn_weights)\\n o = (attn_weights @ V) # (B, n_heads, T, head_size//n_heads)\\n\\nNow we can calculate the correction attention weights. Notice that we scale the attention weights by the size of an individual attention head rather than head_size which would be the size after concatenation. 
We also apply dropout to the attention weights.\\n\\nFinally, we perform the concatenation and apply the output projection and dropout.', mimetype='text/plain', start_char_idx=26861, end_char_idx=30002, metadata_seperator='\\n', text_template='{metadata_str}\\n\\n{content}'), score=0.04702400773572889), NodeWithScore(node=TextNode(id_='ad545ae1-7067-4e8c-8162-3578b9cca697', embedding=None, metadata={'source_url': 'https://stardewvalleywiki.com/Version_History', 'document_index': 16}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='16', node_type='4', metadata={'source_url': 'https://stardewvalleywiki.com/Version_History', 'document_index': 16}, hash='2f861d850d5462c14c252ae17b534c985d14c2a0bc1077a77463ca3d20f78102'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='3a5b3c2d-bbfb-4021-af7a-da9e63a32e2c', node_type='1', metadata={'source_url': 'https://stardewvalleywiki.com/Version_History', 'document_index': 16}, hash='17935dce0019a821feb4f52d5cebfce0a691d8fcf0dbed14102dba78ee65c62f'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='9553a3a8-55b2-43c7-a0f8-b606f8a19027', node_type='1', metadata={}, hash='87badfa4bee32006ea2846b8a800b749ef5462877ecc257a935c70050a08c805')}, metadata_template='{key}: {value}', metadata_separator='\\n', text='If he\\'s already visited you, check his shrine for a new opportunity...\\nRestored a \"lost\" Shane event.\\nChanged earthquake to Summer 3rd... to make it clear that it\\'s the season change that kills crops.\\nIncreased opportunities for iridium. The chance to find iridium in the Skull Cavern increases significantly every ten levels.\\nAdded a zoom in/out feature to the options tab.\\nAdded volume sliders for ambient sounds and footstep sounds.\\nAdded snow transparency slider.\\nAdded option to turn off flash effects.\\nAdded lighting quality option.\\nAdded quest (Rat Problem) to make it clearer that you have to investigate the Community Center.\\nBug fixes\\nLeah\\'s schedule has been fixed.\\nSpouses who have jobs won\\'t get stuck in the bus area anymore.\\nUpgrading a house with crafted flooring should no longer cause a mess.\\nRestored more advanced NPC end-point behavior.\\n\"Secret\" NPC\\'s should no longer show up on calendar until you meet them.\\nEscargot, chowder, etc. should now properly give fishing buff.\\nYou now truly cannot pass the bouncer.\\nYou can no longer get stuck trying to board the bus.\\nFixed issue with invisible trees preventing interaction with tiles.\\nDead flowers no longer affect honey.\\nYou can now dance with your spouse at the Flower Dance.\\nGame should now properly pause when steam overlay is active.\\nFixed issue where inactive window was still responding to input.\\nFixed fertilizer prices in Pierre\\'s shop.\\nFixed Fector\\'s Challenge.\\nYou can now press the toolbar shortcut keys (1, 2, 3, etc. 
by default) to change the active slot while the inventory menu is up.\\nIron ore nodes can no longer be removed, only destroyed.\\nThe dog or cat should no longer sit on chests...\\nSpouses less likely to run away into the dark abyss.\\nNaming your child after an NPC should no longer cause issues.\\nFixed issue where recipes would sometimes consume more ingredients than they should.\\nFixed crashes in certain cutscenes, when certain dialogue options were chosen.\\nMany small bug and typo fixes.\\n1.04\\nStardew Valley 1.04 was released 1 March 2016.\\n\\nGameplay changes\\nAdded a randomize character button to the character creation screen.\\nRobin now sells crafting recipes for wood floor, stone floor, and stepping stone path.\\nAdded a secret new way to modify a rare item.\\nIncreased grass growth rate.\\nIncreased forage spawn possibilities, and made it much less likely for forage to spawn behind trees.\\nReduced value of honey from 200g to 100g.\\nRaised Clint\\'s ore prices.\\nInventory menus now indicate which slot is the \"active slot\".\\nMade the meteorite look snazzier.\\nBug fixes\\nFixed problem with swinging sword while riding a horse.\\nFixed strange lighting behavior when holding torches.\\nFixed problem where stone fence was spawning debris.\\nSpouse should no longer get stuck on their way to town.\\nWild seeds now produce the proper produce when in the greenhouse.\\nSecret gift exchange should now work properly.\\nAll scarecrows now give reports on their crow-scaring activity.\\nBouncer is now truly impassable.\\nTrees no longer grow directly in front of warp statues.\\nWilly\\'s shop no longer counts as water.\\nThe meteorite should no longer appear in the pond or buildings.\\nIf an object is ever directly underneath you, preventing you from moving, right click to remove it.\\nMariner and Luremaster professions should now work properly.\\nTappers are now properly destroyed by bombs.\\nFixed bathing hairstyle inconsistency.\\nFixed various item duplication and stacking issues.\\nPoppyseed muffin now actually looks like a muffin.\\nQuest items should no longer disappear when you die.\\nYou can no longer give quest items to the wrong person.\\nThe Skull Cavern quest can no longer be completed before receiving the actual journal entry.\\n1.03\\nStardew Valley 1.03 was released 28 February 2016.\\n\\nGameplay changes\\nThe cooking menu now looks for items in your refrigerator as well as your inventory.\\nScarecrow range reduced to an 8 tiles radius.\\nThe price of mayonnaise and other artisan animal products now increased by the rancher profession.\\nOnce you befriend someone to 2 hearts, their room is permanently unlocked, even if you go below 2 hearts again.\\nThe \\'auto run\\' option is now enabled by default.\\nBug fixes\\nFixed duplicate item issue in the mines.\\nLadders should no longer spawn underneath the player, locking them in place.\\nFixed problems with the Community Center menu. 
You can now throw items down and delete them (Delete key) in the Community Center menu.\\nFixed item quality exploit.\\nYou can now throw items down while in the crafting menu.\\nIf you destroy the stable, you can now rebuild it.\\nSpa won\\'t recharge you while the game is paused (e.g., steam overlay up).\\nFixed problems with the Stardew Valley Fair fishing game.\\nVarious stability fixes.', mimetype='text/plain', start_char_idx=205007, end_char_idx=209729, metadata_seperator='\\n', text_template='{metadata_str}\\n\\n{content}'), score=0.04295661131779984)], metadata={'9069eda0-f363-47bc-9f89-e0888c6aeffb': {'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, 'ad545ae1-7067-4e8c-8162-3578b9cca697': {'source_url': 'https://stardewvalleywiki.com/Version_History', 'document_index': 16}})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_engine.query(\"What is data science?\")" ] }, { "cell_type": "markdown", "id": "ee8eb042", "metadata": {}, "source": [ "### Run RAG on Train Set\n" ] }, { "cell_type": "markdown", "id": "6b61a90b", "metadata": {}, "source": [ "Let's wrap our query engine so it's easier to run on our dataset.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "3be71df7", "metadata": {}, "outputs": [], "source": [ "from openinference.instrumentation import using_metadata\n", "\n", "\n", "async def run_rag_with_metadata(example, rag_engine):\n", " \"\"\"Ask a question of the knowledge base.\"\"\"\n", " metadata = {\n", " \"expected_answer\": example[\"answer\"],\n", " \"query_type\": example[\"query_type\"],\n", " \"expected_document_index\": example[\"document_index\"],\n", " \"split\": \"test\" if example[\"document_index\"] not in train_docs else \"train\",\n", " }\n", " with using_metadata(metadata):\n", " rag_engine.query(example[\"query\"])" ] }, { "cell_type": "markdown", "id": "fd90ae23", "metadata": {}, "source": [ "We use the `AsyncExecutor` to run our RAG app on the training dataset with optimal speed.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "94a1acdd", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Run RAG |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 48/48 (100.0%) | โณ 01:27<00:00 | 1.83s/it\n" ] } ], "source": [ "# Run application on the train set to get a baseline\n", "from functools import partial\n", "\n", "from phoenix.evals.executors import AsyncExecutor\n", "from phoenix.evals.utils import get_tqdm_progress_bar_formatter\n", "\n", "executor = AsyncExecutor(\n", " generation_fn=partial(run_rag_with_metadata, rag_engine=query_engine),\n", " concurrency=10, # adjust this as needed\n", " exit_on_error=True,\n", " tqdm_bar_format=get_tqdm_progress_bar_formatter(\"Run RAG\"),\n", ")\n", "\n", "results, execution_details = await executor.execute(\n", " [row.to_dict() for _, row in train_queries.iterrows()],\n", ")" ] }, { "cell_type": "markdown", "id": "6097e6da", "metadata": {}, "source": [ "# Evaluate the Traces\n", "\n", "First, let's go to Phoenix and look at our application traces. Do we observe any issues?\n", "\n", "- Is the RAG agent correctly refusing to answer the unanswerable queries?\n", "- Is it retrieving the correct documents?\n", "- Is it hallucinating?\n", "\n", "These are common questions we can turn into repeatable evaluations. So let's create a few evaluators for our RAG app and run them on our traces.\n", "\n", "**Steps:**\n", "\n", "1. Export traces from Phoenix\n", "2. 
Define evaluators\n", "3. Run evaluators on the trace data\n", "4. Log the evaluation results back up to Phoenix\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "03546c1c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(48, 21)\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>name</th>\n", " <th>span_kind</th>\n", " <th>parent_id</th>\n", " <th>start_time</th>\n", " <th>end_time</th>\n", " <th>status_code</th>\n", " <th>status_message</th>\n", " <th>events</th>\n", " <th>context.span_id</th>\n", " <th>context.trace_id</th>\n", " <th>...</th>\n", " <th>attributes.openinference.span.kind</th>\n", " <th>attributes.input.value</th>\n", " <th>attributes.metadata</th>\n", " <th>query</th>\n", " <th>response</th>\n", " <th>split</th>\n", " <th>expected_document_index</th>\n", " <th>expected_answer</th>\n", " <th>document_content</th>\n", " <th>retrieved_documents</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:58.645931+00:00</td>\n", " <td>2025-09-18 21:45:00.263051+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>124677f2c156e8f5</td>\n", " <td>3357a7c847550d6e0b0e0831f5455ed4</td>\n", " <td>...</td>\n", " <td>CHAIN</td>\n", " <td>Which book is the best?</td>\n", " <td>{'split': 'train', 'query_type': 'no_answer', ...</td>\n", " <td>Which book is the best?</td>\n", " <td>I'm unable to provide an answer to that query ...</td>\n", " <td>train</td>\n", " <td>18</td>\n", " <td>N/A</td>\n", " <td>In the naive implementation, we had separate a...</td>\n", " <td>[13, 16]</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:56.894284+00:00</td>\n", " <td>2025-09-18 21:44:58.576838+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>e7fc3976b381d1bb</td>\n", " <td>ae0a44836ad5a4abd288c98234621ff1</td>\n", " <td>...</td>\n", " <td>CHAIN</td>\n", " <td>In 'he who drowned the world', why did Gong Li...</td>\n", " <td>{'split': 'train', 'query_type': 'no_answer', ...</td>\n", " <td>In 'he who drowned the world', why did Gong Li...</td>\n", " <td>Gong Li sacrificed her brother in 'he who drow...</td>\n", " <td>train</td>\n", " <td>18</td>\n", " <td>N/A</td>\n", " <td>In the naive implementation, we had separate a...</td>\n", " <td>[13, 13]</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:55.186142+00:00</td>\n", " <td>2025-09-18 21:44:56.819187+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>1863d59e8310f155</td>\n", " <td>388332cccc824afe407ec2974271237d</td>\n", " <td>...</td>\n", " <td>CHAIN</td>\n", " <td>What caliber is the bullet of light?</td>\n", " <td>{'split': 'train', 'query_type': 'no_answer', ...</td>\n", " <td>What caliber is the bullet of light?</td>\n", " <td>The bullet of light does not have a specified ...</td>\n", " <td>train</td>\n", " <td>17</td>\n", " 
<td>N/A</td>\n", " <td>Fixed farmhand crash while fishing in rare cas...</td>\n", " <td>[16, 13]</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:53.432784+00:00</td>\n", " <td>2025-09-18 21:44:55.119930+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>363c1325fff25453</td>\n", " <td>bedf65b9a9f17baaeaef14a7e3484a7d</td>\n", " <td>...</td>\n", " <td>CHAIN</td>\n", " <td>Why did Saga stab Scratch?</td>\n", " <td>{'split': 'train', 'query_type': 'no_answer', ...</td>\n", " <td>Why did Saga stab Scratch?</td>\n", " <td>The reason Saga stabbed Scratch was to ensure ...</td>\n", " <td>train</td>\n", " <td>17</td>\n", " <td>N/A</td>\n", " <td>Then we perform row-wise softmax to get the fi...</td>\n", " <td>[13, 5]</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:51.394301+00:00</td>\n", " <td>2025-09-18 21:44:53.365811+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>f3738a9a735840b7</td>\n", " <td>36a20133ae9b0d602c66d6b664d18ec2</td>\n", " <td>...</td>\n", " <td>CHAIN</td>\n", " <td>What was added in version 1.5.7?</td>\n", " <td>{'split': 'train', 'query_type': 'no_answer', ...</td>\n", " <td>What was added in version 1.5.7?</td>\n", " <td>In version 1.5.7, the following features were ...</td>\n", " <td>train</td>\n", " <td>16</td>\n", " <td>N/A</td>\n", " <td>I knew that limiting it to running on my M1 Ma...</td>\n", " <td>[2, 16]</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows ร— 21 columns</p>\n", "</div>" ], "text/plain": [ " name span_kind parent_id \\\n", "0 RetrieverQueryEngine.query CHAIN None \n", "1 RetrieverQueryEngine.query CHAIN None \n", "2 RetrieverQueryEngine.query CHAIN None \n", "3 RetrieverQueryEngine.query CHAIN None \n", "4 RetrieverQueryEngine.query CHAIN None \n", "\n", " start_time end_time \\\n", "0 2025-09-18 21:44:58.645931+00:00 2025-09-18 21:45:00.263051+00:00 \n", "1 2025-09-18 21:44:56.894284+00:00 2025-09-18 21:44:58.576838+00:00 \n", "2 2025-09-18 21:44:55.186142+00:00 2025-09-18 21:44:56.819187+00:00 \n", "3 2025-09-18 21:44:53.432784+00:00 2025-09-18 21:44:55.119930+00:00 \n", "4 2025-09-18 21:44:51.394301+00:00 2025-09-18 21:44:53.365811+00:00 \n", "\n", " status_code status_message events context.span_id \\\n", "0 OK [] 124677f2c156e8f5 \n", "1 OK [] e7fc3976b381d1bb \n", "2 OK [] 1863d59e8310f155 \n", "3 OK [] 363c1325fff25453 \n", "4 OK [] f3738a9a735840b7 \n", "\n", " context.trace_id ... attributes.openinference.span.kind \\\n", "0 3357a7c847550d6e0b0e0831f5455ed4 ... CHAIN \n", "1 ae0a44836ad5a4abd288c98234621ff1 ... CHAIN \n", "2 388332cccc824afe407ec2974271237d ... CHAIN \n", "3 bedf65b9a9f17baaeaef14a7e3484a7d ... CHAIN \n", "4 36a20133ae9b0d602c66d6b664d18ec2 ... CHAIN \n", "\n", " attributes.input.value \\\n", "0 Which book is the best? \n", "1 In 'he who drowned the world', why did Gong Li... \n", "2 What caliber is the bullet of light? \n", "3 Why did Saga stab Scratch? \n", "4 What was added in version 1.5.7? \n", "\n", " attributes.metadata \\\n", "0 {'split': 'train', 'query_type': 'no_answer', ... \n", "1 {'split': 'train', 'query_type': 'no_answer', ... \n", "2 {'split': 'train', 'query_type': 'no_answer', ... \n", "3 {'split': 'train', 'query_type': 'no_answer', ... \n", "4 {'split': 'train', 'query_type': 'no_answer', ... 
\n", "\n", " query \\\n", "0 Which book is the best? \n", "1 In 'he who drowned the world', why did Gong Li... \n", "2 What caliber is the bullet of light? \n", "3 Why did Saga stab Scratch? \n", "4 What was added in version 1.5.7? \n", "\n", " response split \\\n", "0 I'm unable to provide an answer to that query ... train \n", "1 Gong Li sacrificed her brother in 'he who drow... train \n", "2 The bullet of light does not have a specified ... train \n", "3 The reason Saga stabbed Scratch was to ensure ... train \n", "4 In version 1.5.7, the following features were ... train \n", "\n", " expected_document_index expected_answer \\\n", "0 18 N/A \n", "1 18 N/A \n", "2 17 N/A \n", "3 17 N/A \n", "4 16 N/A \n", "\n", " document_content retrieved_documents \n", "0 In the naive implementation, we had separate a... [13, 16] \n", "1 In the naive implementation, we had separate a... [13, 13] \n", "2 Fixed farmhand crash while fishing in rare cas... [16, 13] \n", "3 Then we perform row-wise softmax to get the fi... [13, 5] \n", "4 I knew that limiting it to running on my M1 Ma... [2, 16] \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.client import Client\n", "from phoenix.client.types.spans import SpanQuery\n", "\n", "# Export all the top level spans\n", "query = SpanQuery().where(\"name == 'RetrieverQueryEngine.query'\")\n", "spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)\n", "spans_df.dropna(\n", " subset=[\"attributes.metadata\"], inplace=True\n", ") # drop any traces not from our dataset\n", "\n", "# Shape the spans dataframe\n", "spans_df[\"query\"] = spans_df[\"attributes.input.value\"]\n", "spans_df[\"response\"] = spans_df[\"attributes.output.value\"]\n", "spans_df[\"split\"] = spans_df[\"attributes.metadata\"].apply(lambda x: x[\"split\"])\n", "spans_df[\"expected_document_index\"] = spans_df[\"attributes.metadata\"].apply(\n", " lambda x: x[\"expected_document_index\"]\n", ")\n", "spans_df[\"expected_answer\"] = spans_df[\"attributes.metadata\"].apply(lambda x: x[\"expected_answer\"])\n", "\n", "# Export and process the retrieval spans to get the retrieved documents\n", "query = SpanQuery().where(\"name == 'VectorIndexRetriever.retrieve'\")\n", "retrieval_spans_df = Client().spans.get_spans_dataframe(\n", " query=query, project_identifier=project_name\n", ")\n", "retrieval_spans_df[\"document_content\"] = retrieval_spans_df[\"attributes.retrieval.documents\"].apply(\n", " lambda x: \"\\n----------------------------------\\n\".join([doc[\"document.content\"] for doc in x])\n", ")\n", "retrieval_spans_df[\"retrieved_documents\"] = retrieval_spans_df[\n", " \"attributes.retrieval.documents\"\n", "].apply(lambda x: [doc[\"document.metadata\"][\"document_index\"] for doc in x])\n", "\n", "# Combine the spans with the retrieval spans\n", "spans_df = spans_df.merge(\n", " retrieval_spans_df[[\"context.trace_id\", \"document_content\", \"retrieved_documents\"]],\n", " on=\"context.trace_id\",\n", " how=\"left\",\n", ")\n", "\n", "print(spans_df.shape)\n", "spans_df.head()" ] }, { "cell_type": "markdown", "id": "10027700", "metadata": {}, "source": [ "## Define Evaluators\n" ] }, { "cell_type": "markdown", "id": "fd266d6d", "metadata": {}, "source": [ "### 1. Refusal\n", "\n", "Since we know some questions in our dataset are unanswerable, we expect the LLM to respond accordingly. 
So let's create an LLM judge to check for refusals in the output.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "1be11e03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Score(name='llm_refusal', score=0, label='refusal', explanation='The response explicitly states that an answer is not possible due to the lack of relevant information, indicating a refusal.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import create_classifier\n", "from phoenix.evals.llm import LLM\n", "\n", "prompt = \"\"\"\n", "Given a question and response, determine if the response is a refusal to answer the question.\n", "\n", "Question: {query}\n", "Response: {response}\n", "\n", "Is the response a refusal or an informative answer to the question?\n", "\"\"\"\n", "\n", "llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n", "refusal_evaluator = create_classifier(\n", " llm=llm,\n", " name=\"llm_refusal\",\n", " prompt_template=prompt,\n", " choices={\"refusal\": 0, \"answer\": 1},\n", ")\n", "\n", "# test the evaluator on a single example\n", "refusal_evaluator.evaluate(spans_df.iloc[0].to_dict())" ] }, { "cell_type": "markdown", "id": "1cd978fd", "metadata": {}, "source": [ "### 2. Hallucination\n", "\n", "Let's also check to see if our RAG pipeline is producing hallucinations. Phoenix evals has a built-in `HallucinationEvaluator` so we'll use that. First, let's inspect the `input_schema` so we know what it needs to run.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "e663e735", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'hallucination',\n", " 'source': 'llm',\n", " 'direction': 'maximize',\n", " 'input_schema': {'properties': {'input': {'description': 'The input query.',\n", " 'title': 'Input',\n", " 'type': 'string'},\n", " 'output': {'description': 'The response to the query.',\n", " 'title': 'Output',\n", " 'type': 'string'},\n", " 'context': {'description': 'The context or reference text.',\n", " 'title': 'Context',\n", " 'type': 'string'}},\n", " 'required': ['input', 'output', 'context'],\n", " 'title': 'HallucinationInputSchema',\n", " 'type': 'object'}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals.llm import LLM\n", "from phoenix.evals.metrics import HallucinationEvaluator\n", "\n", "llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n", "hallucination_evaluator = HallucinationEvaluator(llm=llm)\n", "hallucination_evaluator.describe()" ] }, { "cell_type": "markdown", "id": "d0b06175", "metadata": {}, "source": [ "Okay, we need to provide an `input_mapping` so it works on our data. 
Let's bind it to the evaluator so we can reuse it.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "cab58273", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Score(name='hallucination', score=1.0, label='factual', explanation='The response correctly identifies that the query about \"which book is the best\" is not related to the given context, which discusses technical details of multi-head attention implementation and patch notes of a video game.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hallucination_mapping = {\n", " \"input\": \"query\",\n", " \"output\": \"response\",\n", " \"context\": \"document_content\",\n", "}\n", "hallucination_evaluator.bind(hallucination_mapping)\n", "\n", "# test the evaluator on a single example\n", "hallucination_evaluator.evaluate(spans_df.iloc[0].to_dict())" ] }, { "cell_type": "markdown", "id": "595c7dbe", "metadata": {}, "source": [ "### 3. Retrieval Precision\n", "\n", "We also want to measure how well the information retrieval component of our system is working. Let's add a precision metric which checks to see how often the target document appeared in the retrieved results.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "9d7918b2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Score(name='precision', score=0.0, label=None, explanation=None, metadata={}, source='heuristic', direction='maximize')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import bind_evaluator, create_evaluator\n", "\n", "\n", "@create_evaluator(name=\"precision\")\n", "def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:\n", " relevant_set = set(relevant_documents)\n", " hits = sum(1 for doc in retrieved_documents if doc in relevant_set)\n", " return hits / len(retrieved_documents)\n", "\n", "\n", "# our precision evaluator expects a list of relevant documents,\n", "# but our dataset only has one relevant document per query, so we\n", "# wrap the expected document index in a list inside our mapping using a lambda function\n", "precision_mapping = {\n", " \"relevant_documents\": lambda x: [x[\"expected_document_index\"]],\n", "}\n", "\n", "precision_evaluator = bind_evaluator(precision, precision_mapping)\n", "\n", "# test the evaluator on a single example\n", "precision_evaluator.evaluate(spans_df.iloc[0].to_dict())" ] }, { "cell_type": "markdown", "id": "efa203b9", "metadata": {}, "source": [ "### Putting it all together\n", "\n", "Let's run our 3 evaluators on all of our project traces.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ff5e4324", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>name</th>\n", " <th>span_kind</th>\n", " <th>parent_id</th>\n", " <th>start_time</th>\n", " <th>end_time</th>\n", " <th>status_code</th>\n", " <th>status_message</th>\n", " <th>events</th>\n", " <th>context.span_id</th>\n", " <th>context.trace_id</th>\n", " <th>...</th>\n", " <th>expected_document_index</th>\n", " 
<th>expected_answer</th>\n", " <th>document_content</th>\n", " <th>retrieved_documents</th>\n", " <th>precision_execution_details</th>\n", " <th>hallucination_execution_details</th>\n", " <th>llm_refusal_execution_details</th>\n", " <th>precision_score</th>\n", " <th>hallucination_score</th>\n", " <th>llm_refusal_score</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:58.645931+00:00</td>\n", " <td>2025-09-18 21:45:00.263051+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>124677f2c156e8f5</td>\n", " <td>3357a7c847550d6e0b0e0831f5455ed4</td>\n", " <td>...</td>\n", " <td>18</td>\n", " <td>N/A</td>\n", " <td>In the naive implementation, we had separate a...</td>\n", " <td>[13, 16]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.0, \"metadata\"...</td>\n", " <td>{\"name\": \"hallucination\", \"score\": 1.0, \"label...</td>\n", " <td>{\"name\": \"llm_refusal\", \"score\": 0, \"label\": \"...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:56.894284+00:00</td>\n", " <td>2025-09-18 21:44:58.576838+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>e7fc3976b381d1bb</td>\n", " <td>ae0a44836ad5a4abd288c98234621ff1</td>\n", " <td>...</td>\n", " <td>18</td>\n", " <td>N/A</td>\n", " <td>In the naive implementation, we had separate a...</td>\n", " <td>[13, 13]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.0, \"metadata\"...</td>\n", " <td>{\"name\": \"hallucination\", \"score\": 0.0, \"label...</td>\n", " <td>{\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:55.186142+00:00</td>\n", " <td>2025-09-18 21:44:56.819187+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>1863d59e8310f155</td>\n", " <td>388332cccc824afe407ec2974271237d</td>\n", " <td>...</td>\n", " <td>17</td>\n", " <td>N/A</td>\n", " <td>Fixed farmhand crash while fishing in rare cas...</td>\n", " <td>[16, 13]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.0, \"metadata\"...</td>\n", " <td>{\"name\": \"hallucination\", \"score\": 1.0, \"label...</td>\n", " <td>{\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:53.432784+00:00</td>\n", " <td>2025-09-18 21:44:55.119930+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>363c1325fff25453</td>\n", " <td>bedf65b9a9f17baaeaef14a7e3484a7d</td>\n", " <td>...</td>\n", " <td>17</td>\n", " 
<td>N/A</td>\n", " <td>Then we perform row-wise softmax to get the fi...</td>\n", " <td>[13, 5]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.0, \"metadata\"...</td>\n", " <td>{\"name\": \"hallucination\", \"score\": 0.0, \"label...</td>\n", " <td>{\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>RetrieverQueryEngine.query</td>\n", " <td>CHAIN</td>\n", " <td>None</td>\n", " <td>2025-09-18 21:44:51.394301+00:00</td>\n", " <td>2025-09-18 21:44:53.365811+00:00</td>\n", " <td>OK</td>\n", " <td></td>\n", " <td>[]</td>\n", " <td>f3738a9a735840b7</td>\n", " <td>36a20133ae9b0d602c66d6b664d18ec2</td>\n", " <td>...</td>\n", " <td>16</td>\n", " <td>N/A</td>\n", " <td>I knew that limiting it to running on my M1 Ma...</td>\n", " <td>[2, 16]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.5, \"metadata\"...</td>\n", " <td>{\"name\": \"hallucination\", \"score\": 1.0, \"label...</td>\n", " <td>{\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows ร— 27 columns</p>\n", "</div>" ], "text/plain": [ " name span_kind parent_id \\\n", "0 RetrieverQueryEngine.query CHAIN None \n", "1 RetrieverQueryEngine.query CHAIN None \n", "2 RetrieverQueryEngine.query CHAIN None \n", "3 RetrieverQueryEngine.query CHAIN None \n", "4 RetrieverQueryEngine.query CHAIN None \n", "\n", " start_time end_time \\\n", "0 2025-09-18 21:44:58.645931+00:00 2025-09-18 21:45:00.263051+00:00 \n", "1 2025-09-18 21:44:56.894284+00:00 2025-09-18 21:44:58.576838+00:00 \n", "2 2025-09-18 21:44:55.186142+00:00 2025-09-18 21:44:56.819187+00:00 \n", "3 2025-09-18 21:44:53.432784+00:00 2025-09-18 21:44:55.119930+00:00 \n", "4 2025-09-18 21:44:51.394301+00:00 2025-09-18 21:44:53.365811+00:00 \n", "\n", " status_code status_message events context.span_id \\\n", "0 OK [] 124677f2c156e8f5 \n", "1 OK [] e7fc3976b381d1bb \n", "2 OK [] 1863d59e8310f155 \n", "3 OK [] 363c1325fff25453 \n", "4 OK [] f3738a9a735840b7 \n", "\n", " context.trace_id ... expected_document_index \\\n", "0 3357a7c847550d6e0b0e0831f5455ed4 ... 18 \n", "1 ae0a44836ad5a4abd288c98234621ff1 ... 18 \n", "2 388332cccc824afe407ec2974271237d ... 17 \n", "3 bedf65b9a9f17baaeaef14a7e3484a7d ... 17 \n", "4 36a20133ae9b0d602c66d6b664d18ec2 ... 16 \n", "\n", " expected_answer document_content \\\n", "0 N/A In the naive implementation, we had separate a... \n", "1 N/A In the naive implementation, we had separate a... \n", "2 N/A Fixed farmhand crash while fishing in rare cas... \n", "3 N/A Then we perform row-wise softmax to get the fi... \n", "4 N/A I knew that limiting it to running on my M1 Ma... \n", "\n", " retrieved_documents precision_execution_details \\\n", "0 [13, 16] {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "1 [13, 13] {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "2 [16, 13] {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "3 [13, 5] {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "4 [2, 16] {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... 
\n", "\n", " hallucination_execution_details \\\n", "0 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "1 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "2 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "3 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "4 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "\n", " llm_refusal_execution_details \\\n", "0 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "1 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "2 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "3 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "4 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "\n", " precision_score \\\n", "0 {\"name\": \"precision\", \"score\": 0.0, \"metadata\"... \n", "1 {\"name\": \"precision\", \"score\": 0.0, \"metadata\"... \n", "2 {\"name\": \"precision\", \"score\": 0.0, \"metadata\"... \n", "3 {\"name\": \"precision\", \"score\": 0.0, \"metadata\"... \n", "4 {\"name\": \"precision\", \"score\": 0.5, \"metadata\"... \n", "\n", " hallucination_score \\\n", "0 {\"name\": \"hallucination\", \"score\": 1.0, \"label... \n", "1 {\"name\": \"hallucination\", \"score\": 0.0, \"label... \n", "2 {\"name\": \"hallucination\", \"score\": 1.0, \"label... \n", "3 {\"name\": \"hallucination\", \"score\": 0.0, \"label... \n", "4 {\"name\": \"hallucination\", \"score\": 1.0, \"label... \n", "\n", " llm_refusal_score \n", "0 {\"name\": \"llm_refusal\", \"score\": 0, \"label\": \"... \n", "1 {\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"... \n", "2 {\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"... \n", "3 {\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"... \n", "4 {\"name\": \"llm_refusal\", \"score\": 1, \"label\": \"... \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import async_evaluate_dataframe\n", "\n", "train_spans = spans_df[spans_df[\"split\"] == \"train\"]\n", "results = await async_evaluate_dataframe(\n", " train_spans,\n", " [precision_evaluator, hallucination_evaluator, refusal_evaluator],\n", " concurrency=10,\n", " tqdm_bar_format=get_tqdm_progress_bar_formatter(\"Run Evaluation\"),\n", " exit_on_error=True,\n", ")\n", "results.head()" ] }, { "cell_type": "markdown", "id": "db062d38", "metadata": {}, "source": [ "### Log trace evaluations back to Phoenix\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f7215147", "metadata": {}, "outputs": [], "source": [ "from phoenix.client import AsyncClient\n", "from phoenix.evals.utils import to_annotation_dataframe\n", "\n", "client = AsyncClient()\n", "\n", "annotations = to_annotation_dataframe(\n", " results\n", ") # can also specify score_names to log only certain scores\n", "await client.spans.log_span_annotations_dataframe(dataframe=annotations)" ] }, { "cell_type": "markdown", "id": "8bb95d5f", "metadata": {}, "source": [ "# Improve Evaluators\n", "\n", "Go into Phoenix and look at your project traces now that you've added some eval metrics. Pay attention to the \"llm_refusal\" metric - is it catching all the refusals?\n", "No, it looks like it is not performing as expected.\n", "\n", "Let's see if we can improve our LLM Judge so it is better aligned.\n", "\n", "**Steps:**\n", "\n", "1. Manually annotate some traces as \"refused\" or \"responded\" inside Phoenix.\n", "2. 
Export those annotated traces and use to create a dataset for experimentation.\n", "3. Define an LLM judge (refusal) and use as the experiment \"task\".\n", "4. Create a simple heuristic experiment evaluator that checks for an exact match between the judge score and our annotation\n", "5. Iterate on the judge prompt until we are happy with the results.\n", "\n", "<center>\n", " <h3 align=\"left\">Phoenix Experiments</h3>\n", " <p style=\"text-align:center\">\n", " <img alt=\"eval lifecycle\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment.png\" width=\"1000\"/>\n", " </p>\n", "</center>\n" ] }, { "cell_type": "markdown", "id": "8b2ea926", "metadata": {}, "source": [ "After manual annotation, pull down those traces:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5b556f20", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "refusal_score\n", "1.0 20\n", "0.0 8\n", "Name: count, dtype: int64\n" ] } ], "source": [ "from phoenix.client import Client\n", "from phoenix.client.types.spans import SpanQuery\n", "\n", "# Export all the top level spans\n", "query = SpanQuery().where(\"name == 'RetrieverQueryEngine.query'\")\n", "spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)\n", "\n", "# Shape the spans dataframe\n", "spans_df[\"query\"] = spans_df[\"attributes.input.value\"]\n", "spans_df[\"response\"] = spans_df[\"attributes.output.value\"]\n", "spans_df.dropna(subset=[\"attributes.metadata\"], inplace=True)\n", "spans_df[\"expected_answer\"] = spans_df[\"attributes.metadata\"].apply(lambda x: x[\"expected_answer\"])\n", "\n", "# Export annotations and add to the spans from earlier\n", "annotations_df = Client().spans.get_span_annotations_dataframe(\n", " spans_dataframe=spans_df, project_identifier=project_name\n", ")\n", "refusal_ground_truth = annotations_df[\n", " (annotations_df[\"annotator_kind\"] == \"HUMAN\") & (annotations_df[\"annotation_name\"] == \"refusal\")\n", "]\n", "refusal_ground_truth = refusal_ground_truth.rename_axis(index={\"span_id\": \"context.span_id\"})\n", "refusal_ground_truth = refusal_ground_truth.rename(columns={\"result.score\": \"refusal_score\"})\n", "labeled_spans_df = spans_df.merge(\n", " refusal_ground_truth[[\"refusal_score\"]], left_index=True, right_index=True, how=\"left\"\n", ")\n", "labeled_spans_df = labeled_spans_df[\n", " [\"context.span_id\", \"query\", \"response\", \"refusal_score\", \"expected_answer\"]\n", "]\n", "labeled_spans = labeled_spans_df.dropna(subset=[\"refusal_score\"])\n", "labeled_spans[\"refusal_score\"].value_counts()\n", "print(labeled_spans[\"refusal_score\"].value_counts())" ] }, { "cell_type": "code", "execution_count": 28, "id": "f34a1f3d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>context.span_id</th>\n", " <th>query</th>\n", " <th>response</th>\n", " <th>refusal_score</th>\n", " <th>expected_answer</th>\n", " </tr>\n", " <tr>\n", " <th>context.span_id</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " 
<th>4cdfbaf375a2316b</th>\n", " <td>4cdfbaf375a2316b</td>\n", " <td>Which book is the best?</td>\n", " <td>I'm unable to provide an answer to that query ...</td>\n", " <td>0.0</td>\n", " <td>N/A</td>\n", " </tr>\n", " <tr>\n", " <th>7a2bba89818d79b0</th>\n", " <td>7a2bba89818d79b0</td>\n", " <td>In 'he who drowned the world', why did Gong Li...</td>\n", " <td>Gong Li sacrificed her brother in 'he who drow...</td>\n", " <td>1.0</td>\n", " <td>N/A</td>\n", " </tr>\n", " <tr>\n", " <th>2ae8849b57062841</th>\n", " <td>2ae8849b57062841</td>\n", " <td>What caliber is the bullet of light?</td>\n", " <td>The bullet of light does not have a specified ...</td>\n", " <td>0.0</td>\n", " <td>N/A</td>\n", " </tr>\n", " <tr>\n", " <th>06c4934f0a9bf16f</th>\n", " <td>06c4934f0a9bf16f</td>\n", " <td>Why did Saga stab Scratch?</td>\n", " <td>The reason Saga stabbed Scratch was to ensure ...</td>\n", " <td>1.0</td>\n", " <td>N/A</td>\n", " </tr>\n", " <tr>\n", " <th>e7897fdccc20b7b4</th>\n", " <td>e7897fdccc20b7b4</td>\n", " <td>What was added in version 1.5.7?</td>\n", " <td>In version 1.5.7, the following features were ...</td>\n", " <td>1.0</td>\n", " <td>N/A</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " context.span_id \\\n", "context.span_id \n", "4cdfbaf375a2316b 4cdfbaf375a2316b \n", "7a2bba89818d79b0 7a2bba89818d79b0 \n", "2ae8849b57062841 2ae8849b57062841 \n", "06c4934f0a9bf16f 06c4934f0a9bf16f \n", "e7897fdccc20b7b4 e7897fdccc20b7b4 \n", "\n", " query \\\n", "context.span_id \n", "4cdfbaf375a2316b Which book is the best? \n", "7a2bba89818d79b0 In 'he who drowned the world', why did Gong Li... \n", "2ae8849b57062841 What caliber is the bullet of light? \n", "06c4934f0a9bf16f Why did Saga stab Scratch? \n", "e7897fdccc20b7b4 What was added in version 1.5.7? \n", "\n", " response \\\n", "context.span_id \n", "4cdfbaf375a2316b I'm unable to provide an answer to that query ... \n", "7a2bba89818d79b0 Gong Li sacrificed her brother in 'he who drow... \n", "2ae8849b57062841 The bullet of light does not have a specified ... \n", "06c4934f0a9bf16f The reason Saga stabbed Scratch was to ensure ... \n", "e7897fdccc20b7b4 In version 1.5.7, the following features were ... \n", "\n", " refusal_score expected_answer \n", "context.span_id \n", "4cdfbaf375a2316b 0.0 N/A \n", "7a2bba89818d79b0 1.0 N/A \n", "2ae8849b57062841 0.0 N/A \n", "06c4934f0a9bf16f 1.0 N/A \n", "e7897fdccc20b7b4 1.0 N/A " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled_spans.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "43a6cff1", "metadata": {}, "outputs": [], "source": [ "dataset_name = \"annotated-queries-train\"\n", "ds = await AsyncClient().datasets.create_dataset(\n", " name=dataset_name,\n", " dataframe=labeled_spans,\n", " input_keys=[\"query\", \"response\"],\n", " output_keys=[\"refusal_score\", \"expected_answer\"],\n", ")\n", "\n", "# get the dataset if already created\n", "# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)" ] }, { "cell_type": "markdown", "id": "976909ff", "metadata": {}, "source": [ "Next, we define:\n", "\n", "1. The experiment task (here, our LLM evaluator is the task).\n", "2. 
The experiment evaluator (here, a simple check for alignment between the judge and human).\n", "\n", "**Note: 2.0 evaluators are now drop-in compatible as experiment evaluators!**\n" ] }, { "cell_type": "code", "execution_count": 31, "id": "40e6cf8e", "metadata": {}, "outputs": [], "source": [ "from phoenix.evals import create_evaluator\n", "\n", "\n", "# experiment task\n", "def run_refusal_eval(input, evaluator):\n", " result = evaluator.evaluate(input)\n", " return result[0]\n", "\n", "\n", "# experiment evaluator\n", "@create_evaluator(name=\"exact_match\")\n", "def exact_match(output, expected):\n", " return float(output[\"score\"]) == float(expected[\"refusal_score\"])" ] }, { "cell_type": "markdown", "id": "b2b18ede", "metadata": {}, "source": [ "First, let's get a baseline by testing out the initial refusal evaluator against our annotations.\n" ] }, { "cell_type": "code", "execution_count": 32, "id": "264393e0", "metadata": {}, "outputs": [], "source": [ "prompt = \"\"\"\n", "Given a question and response, determine if the response is a refusal to answer the question.\n", "\n", "Question: {query}\n", "Response: {response}\n", "\n", "Is the response a refusal or an informative response to the question?\n", "\"\"\"\n", "\n", "llm = LLM(model=\"gpt-4o\", provider=\"openai\")\n", "baseline_refusal = create_classifier(\n", " llm=llm,\n", " name=\"refusal\",\n", " prompt_template=prompt,\n", " choices={\"refusal\": 0, \"answer\": 1},\n", ")" ] }, { "cell_type": "code", "execution_count": 33, "id": "8a79c414", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿงช Experiment started.\n", "๐Ÿ“บ View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/experiments\n", "๐Ÿ”— View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/compare?experimentId=RXhwZXJpbWVudDo3OQ==\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running tasks |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 28/28 (100.0%) | โณ 00:43<00:00 | 1.55s/it\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "โœ… Task runs completed.\n", "๐Ÿง  Evaluation started.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running experiment evaluations |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 28/28 (100.0%) | โณ 00:03<00:00 | 8.14it/s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment completed: 28 task runs, 1 evaluator runs, 28 evaluations\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from functools import partial\n", "\n", "from phoenix.client import AsyncClient\n", "\n", "async_client = AsyncClient()\n", "\n", "experiment = await async_client.experiments.run_experiment(\n", " dataset=ds,\n", " task=partial(run_refusal_eval, evaluator=baseline_refusal),\n", " experiment_name=\"baseline\",\n", " evaluators=[exact_match],\n", " concurrency=10,\n", " # dry_run=3,\n", ")" ] }, { "cell_type": "markdown", "id": "f06048dc", "metadata": {}, "source": [ "Go to the experiment and filter by \"exact_match==0\" -- these are the examples where our refusal judge disagreed with the human annotations. Look at the RAG responses and the judge explanations. I notice that the LLM judge considers uncertain responses as valid answers rather than refusals.\n", "\n", "Based on these observations, let's tweak our prompt to make the evaluation criteria more clear to the LLM judge. 
Describe exactly what a \"refusal\" looks like.\n" ] }, { "cell_type": "code", "execution_count": 34, "id": "118d1368", "metadata": {}, "outputs": [], "source": [ "from phoenix.evals import create_classifier\n", "from phoenix.evals.llm import LLM\n", "\n", "prompt = \"\"\"\n", "Given a question and response, determine if the response is a refusal to answer the question.\n", "Refusals often contain phrases of uncertainty like 'I don't know' and 'I don't have that information'.\n", "They also often mention that the answer is not provided in the information or context.\n", "\n", "If the response contains these phrases, it is a refusal. Even if the response contains other\n", "text indicating an attempt to answer the question, it is still a refusal.\n", "\n", "If the response does not contain these \"hedging\" phrases, it is an informative response. Do not\n", "consider the correctness of the response, only whether it is a refusal or not.\n", "\n", "Question: {query}\n", "Response: {response}\n", "\n", "Is the response a refusal or an informative answer to the question?\n", "\"\"\"\n", "\n", "refusal_v2 = create_classifier(\n", " llm=llm,\n", " name=\"refusal\",\n", " prompt_template=prompt,\n", " choices={\"refusal\": 0, \"answer\": 1},\n", ")" ] }, { "cell_type": "code", "execution_count": 35, "id": "24b64bba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿงช Experiment started.\n", "๐Ÿ“บ View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/experiments\n", "๐Ÿ”— View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/compare?experimentId=RXhwZXJpbWVudDo4MA==\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running tasks |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 28/28 (100.0%) | โณ 00:42<00:00 | 1.52s/it\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "โœ… Task runs completed.\n", "๐Ÿง  Evaluation started.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running experiment evaluations |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 28/28 (100.0%) | โณ 00:03<00:00 | 8.15it/s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment completed: 28 task runs, 1 evaluator runs, 28 evaluations\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "experiment = await async_client.experiments.run_experiment(\n", " dataset=ds,\n", " task=partial(run_refusal_eval, evaluator=refusal_v2),\n", " experiment_name=\"prompt-v2\",\n", " evaluators=[exact_match],\n", " concurrency=10,\n", " # dry_run=3,\n", ")" ] }, { "cell_type": "markdown", "id": "a1ca557d", "metadata": {}, "source": [ "Looking at this experiment in Phoenix, I see that we now have \"exact_match == 1.0\" indicating 100% agreement between our new judge and the annotations!\n", "\n", "Through experimentation we were able to improve the evaluation metric itself, much in the same way we would improve any process.\n" ] }, { "cell_type": "markdown", "id": "88adddfe", "metadata": {}, "source": [ "# Improve the Application\n", "\n", "Now that we feel good about our refusal metric, let's see if we can improve our RAG system.\n", "\n", "Exactly 50% of the queries in our dataset are unanswerable, so ideally we would like to see the \"llm_refusal\" score close to 0.5. We don't want the RAG system attempting to answer questions that are not answerable from the context because this increases the chances of hallucination - not good!\n", "\n", "**Steps:**\n", "\n", "1. 
Create a dataset using the train set queries.\n", "2. Define our experiment task (running RAG on our dataset).\n", "3. Use our new and improved refusal classifier as the experiment evaluator.\n", "4. Iterate on the RAG agent's prompt until we are happy.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "61899d7c", "metadata": {}, "outputs": [], "source": [ "dataset_name = \"train-queries\"\n", "ds = await AsyncClient().datasets.create_dataset(\n", " name=dataset_name,\n", " dataframe=train_queries,\n", " input_keys=[\"query\"],\n", ")\n", "\n", "# if already created\n", "# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)" ] }, { "cell_type": "code", "execution_count": 39, "id": "736688a4", "metadata": {}, "outputs": [], "source": [ "from phoenix.evals import bind_evaluator\n", "\n", "\n", "# define experiment task (running the RAG engine)\n", "async def run_rag_task(input, rag_engine):\n", " \"\"\"Ask a question of the knowledge base.\"\"\"\n", " response = rag_engine.query(input[\"query\"])\n", " return response\n", "\n", "\n", "# use an input mapping to fit our dataset to the evaluator we created earlier\n", "refusal_evaluator = bind_evaluator(refusal_v2, {\"query\": \"input.query\", \"response\": \"output\"})" ] }, { "cell_type": "markdown", "id": "ca222103", "metadata": {}, "source": [ "### Experiment 1: Baseline RAG System\n", "\n", "Let's rerun our initial RAG system to get a baseline. How do the \"out-of-the-box\" defaults work?\n" ] }, { "cell_type": "code", "execution_count": 40, "id": "cddfdd87", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿงช Experiment started.\n", "๐Ÿ“บ View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/experiments\n", "๐Ÿ”— View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/compare?experimentId=RXhwZXJpbWVudDo4MQ==\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running tasks |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 48/48 (100.0%) | โณ 01:36<00:00 | 2.01s/it\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "โœ… Task runs completed.\n", "๐Ÿง  Evaluation started.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running experiment evaluations |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 48/48 (100.0%) | โณ 00:39<00:00 | 1.21it/s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment completed: 48 task runs, 1 evaluator runs, 48 evaluations\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "query_engine_baseline = index.as_query_engine()\n", "baseline_experiment = await AsyncClient().experiments.run_experiment(\n", " dataset=ds,\n", " task=partial(run_rag_task, rag_engine=query_engine_baseline),\n", " experiment_name=\"baseline\",\n", " evaluators=[refusal_evaluator],\n", " concurrency=10,\n", " # dry_run=3,\n", ")" ] }, { "cell_type": "markdown", "id": "82378af0", "metadata": {}, "source": [ "### Experiment 2: RAG with Custom Prompt\n", "\n", "Go into Phoenix to see the results of our experiment.\n", "\n", "The refusal score is a little high - we want to get it down closer to 0.5 since we know 50% of our queries are unanswerable. 
Let's see if modifying the system prompt used for the LLM generation component of our RAG system helps.\n" ] }, { "cell_type": "code", "execution_count": 41, "id": "904d96cc", "metadata": {}, "outputs": [], "source": [ "from textwrap import dedent\n", "\n", "custom_system_prompt = \"\"\"You are an expert at answering questions about a given context.\n", "\\nAlways answer the query using the provided context information, and not prior knowledge.\n", "\\nSome rules to follow:\n", "\\n1. Never directly reference the given context in your answer.\n", "\\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...'\n", "or anything along those lines.\n", "\\n3. Do NOT use prior knowledge to answer the question. Only use the context provided.\n", "\\n4. If you cannot find the answer in the context, say 'I cannot find that information.' When in\n", "doubt, default to responding 'I cannot find that information.'\n", "\"\"\"\n", "custom_query_engine = index.as_query_engine(system_prompt=dedent(custom_system_prompt))" ] }, { "cell_type": "code", "execution_count": 42, "id": "400f09fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿงช Experiment started.\n", "๐Ÿ“บ View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/experiments\n", "๐Ÿ”— View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/compare?experimentId=RXhwZXJpbWVudDo4Mg==\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running tasks |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 48/48 (100.0%) | โณ 01:35<00:00 | 1.99s/it\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "โœ… Task runs completed.\n", "๐Ÿง  Evaluation started.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "running experiment evaluations |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 48/48 (100.0%) | โณ 00:17<00:00 | 2.82it/s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Experiment completed: 48 task runs, 1 evaluator runs, 48 evaluations\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "experiment = await AsyncClient().experiments.run_experiment(\n", " dataset=ds,\n", " task=partial(run_rag_task, rag_engine=custom_query_engine),\n", " experiment_name=\"custom-prompt\",\n", " evaluators=[refusal_evaluator],\n", " concurrency=10,\n", " # dry_run=3,\n", ")" ] }, { "cell_type": "markdown", "id": "d3c1b209", "metadata": {}, "source": [ "Check out the results of this experiment in Phoenix.\n", "\n", "Nice, we are heading in the right direction! Our refusal score went down a bit closer to 0.5, indicating that our RAG system is correctly refusing to answer more queries.\n" ] }, { "cell_type": "markdown", "id": "dbdb7104", "metadata": {}, "source": [ "# Conclusion\n" ] }, { "cell_type": "markdown", "id": "be17c5a8", "metadata": {}, "source": [ "In this notebook, we have covered a lot! Now you know:\n", "\n", "1. How to evaluate traces using different types of evaluators:\n", " - custom LLM classifiers\n", " - built-in metrics\n", " - heuristic functions using the `create_evaluator` decorator\n", "2. How to build and iterate on an LLM Evaluator using experiments\n", "3. 
How to iterate on an application using experiments and evaluators\n", "\n", "For more information, check out our [Documentation!](https://arize-phoenix.readthedocs.io/projects/evals/en/latest/)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.23" } }, "nbformat": 4, "nbformat_minor": 5 }
