@arizeai/phoenix-mcp

Official
by Arize-ai
evaluations_with_error_handling.ipynb (56.4 kB)
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Retrieval Relevance Evals</h1>\n", "\n", "Phoenix evals are designed to be robust to many kinds of errors, providing many tools to control error handling and retry behavior, as well as the ability to surface details about what happened during long eval runs.\n", "\n", "In this notebook, we'll simulate various kinds of errors that might happen while running evals and show different ways Phoenix evals can work with them.\n", "\n", "## Install Dependencies and Import Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "N_EVAL_SAMPLE_SIZE = 40" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -qq \"arize-phoenix-evals\" \"openai>=1\" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio 'httpx<0.28'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.\n", "\n", "Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "\n", "nest_asyncio.apply()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "from collections import Counter\n", "from getpass import getpass\n", "\n", "import pandas as pd\n", "\n", "from phoenix.evals import (\n", "    RAG_RELEVANCY_PROMPT_RAILS_MAP,\n", "    RAG_RELEVANCY_PROMPT_TEMPLATE,\n", "    OpenAIModel,\n", "    download_benchmark_dataset,\n", "    llm_classify,\n", ")\n", "\n", "pd.set_option(\"display.max_colwidth\", None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df = download_benchmark_dataset(\n", "    task=\"binary-relevance-classification\", dataset_name=\"wiki_qa-train\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure a test LLM\n", "\n", "Configure your OpenAI API key." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Input Dataset\n", "Sample size determines run time.\n", "We recommend iterating on a small sample first (about 100 rows),\n", "then scaling up to a larger test set." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)\n", "df_sample = df_sample.rename(\n", "    columns={\n", "        \"query_text\": \"input\",\n", "        \"document_text\": \"reference\",\n", "    },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run LLM Evals\n", "Run relevance evals against a subset of the data.\n", "Instantiate the LLM and set parameters." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up test model wrapper\n", "\n", "First, to demonstrate error handling while running evals, we'll remove some required input data from our sampled dataset.\n", "\n", "Second, we'll create a buggy model that inherits from the `OpenAIModel` wrapper to simulate spurious errors that might occur when running evals." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df_sample.loc[28, \"reference\"] = None\n", "df_sample.loc[37, \"input\"] = None" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "\n", "class FunnyAIModel(OpenAIModel):\n", "    async def _async_generate(self, *args, **kwargs):\n", "        if random.random() < 0.3:\n", "            raise RuntimeError(\"What could have possibly happened here?!\")\n", "        return await super()._async_generate(*args, **kwargs)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "funny_model = FunnyAIModel(\n", "    model=\"gpt-4o\",\n", "    temperature=0.0,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Hello! Yes, I'm here and working. How can I assist you today?\"" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "funny_model(\"Hello world, this is a test if you are working?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking execution details\n", "\n", "By default, evals that raise an exception are retried (with a default maximum of 10). However, if input data is missing and a prompt cannot be generated from a template, that row will immediately fail. 
By default, `llm_classify` then returns early, and any rows that have not yet run are left without an eval.\n", "\n", "Alongside the eval labels, the output includes columns that list every exception encountered during execution, a status that summarizes what happened for each row, and timing information." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "955d0a9da9cd4e8db2fb9f235078ebeb", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could 
have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Retries exhausted after 1 attempts: Missing template variables: reference\n" ] } ], "source": [ "rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())\n", "evals_with_exception_info = llm_classify(\n", " dataframe=df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=funny_model,\n", " rails=rails,\n", " concurrency=3,\n", " include_exceptions=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>label</th>\n", " <th>exceptions</th>\n", " <th>execution_status</th>\n", " <th>execution_seconds</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.541415</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.614199</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>unrelated</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.526719</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.680465</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>unrelated</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.593109</td>\n", " </tr>\n", " <tr>\n", 
" <th>5</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>1.033485</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.511282</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>unrelated</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.599838</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.712328</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.643351</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.544921</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.547776</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>unrelated</td>\n", " <td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.492994</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.479243</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.629159</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.506721</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.525353</td>\n", " </tr>\n", " <tr>\n", " 
<th>17</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.550200</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.626456</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.495438</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.752875</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.485587</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.812941</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.575621</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.817680</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.485074</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>None</td>\n", " <td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000610</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>None</td>\n", " <td>[PhoenixTemplateMappingError('Missing template variables: reference')]</td>\n", " <td>MISSING INPUT</td>\n", " <td>0.000100</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " 
<td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>31</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>32</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>33</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>35</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>36</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>37</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>38</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>39</th>\n", " <td>None</td>\n", " <td>[]</td>\n", " <td>DID NOT RUN</td>\n", " <td>0.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " label \\\n", "0 relevant \n", "1 relevant \n", "2 unrelated \n", "3 unrelated \n", "4 unrelated \n", "5 unrelated \n", "6 relevant \n", "7 unrelated \n", "8 relevant \n", "9 relevant \n", "10 unrelated \n", "11 relevant \n", "12 unrelated \n", "13 relevant \n", "14 unrelated \n", "15 relevant \n", "16 relevant \n", "17 unrelated \n", "18 relevant \n", "19 relevant \n", "20 relevant \n", "21 relevant \n", "22 unrelated \n", "23 None \n", "24 relevant \n", "25 unrelated \n", "26 unrelated \n", "27 None \n", "28 None \n", "29 None \n", "30 None \n", "31 None \n", "32 None \n", "33 None \n", "34 None 
\n", "35 None \n", "36 None \n", "37 None \n", "38 None \n", "39 None \n", "\n", " exceptions \\\n", "0 [] \n", "1 [] \n", "2 [RuntimeError('What could have possibly happened here?!')] \n", "3 [] \n", "4 [RuntimeError('What could have possibly happened here?!')] \n", "5 [] \n", "6 [] \n", "7 [RuntimeError('What could have possibly happened here?!')] \n", "8 [] \n", "9 [RuntimeError('What could have possibly happened here?!')] \n", "10 [] \n", "11 [] \n", "12 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "13 [RuntimeError('What could have possibly happened here?!')] \n", "14 [] \n", "15 [RuntimeError('What could have possibly happened here?!')] \n", "16 [] \n", "17 [] \n", "18 [RuntimeError('What could have possibly happened here?!')] \n", "19 [] \n", "20 [] \n", "21 [] \n", "22 [] \n", "23 [] \n", "24 [RuntimeError('What could have possibly happened here?!')] \n", "25 [] \n", "26 [] \n", "27 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "28 [PhoenixTemplateMappingError('Missing template variables: reference')] \n", "29 [] \n", "30 [] \n", "31 [] \n", "32 [] \n", "33 [] \n", "34 [] \n", "35 [] \n", "36 [] \n", "37 [] \n", "38 [] \n", "39 [] \n", "\n", " execution_status execution_seconds \n", "0 COMPLETED 0.541415 \n", "1 COMPLETED 0.614199 \n", "2 COMPLETED WITH RETRIES 0.526719 \n", "3 COMPLETED 0.680465 \n", "4 COMPLETED WITH RETRIES 0.593109 \n", "5 COMPLETED 1.033485 \n", "6 COMPLETED 0.511282 \n", "7 COMPLETED WITH RETRIES 0.599838 \n", "8 COMPLETED 0.712328 \n", "9 COMPLETED WITH RETRIES 0.643351 \n", "10 COMPLETED 0.544921 \n", "11 COMPLETED 0.547776 \n", "12 COMPLETED WITH RETRIES 0.492994 \n", "13 COMPLETED WITH RETRIES 0.479243 \n", "14 COMPLETED 0.629159 \n", "15 COMPLETED WITH RETRIES 0.506721 \n", "16 COMPLETED 0.525353 \n", "17 COMPLETED 0.550200 \n", "18 COMPLETED WITH RETRIES 0.626456 \n", "19 
COMPLETED 0.495438 \n", "20 COMPLETED 0.752875 \n", "21 COMPLETED 0.485587 \n", "22 COMPLETED 0.812941 \n", "23 DID NOT RUN 0.000000 \n", "24 COMPLETED WITH RETRIES 0.575621 \n", "25 COMPLETED 0.817680 \n", "26 COMPLETED 0.485074 \n", "27 DID NOT RUN 0.000610 \n", "28 MISSING INPUT 0.000100 \n", "29 DID NOT RUN 0.000000 \n", "30 DID NOT RUN 0.000000 \n", "31 DID NOT RUN 0.000000 \n", "32 DID NOT RUN 0.000000 \n", "33 DID NOT RUN 0.000000 \n", "34 DID NOT RUN 0.000000 \n", "35 DID NOT RUN 0.000000 \n", "36 DID NOT RUN 0.000000 \n", "37 DID NOT RUN 0.000000 \n", "38 DID NOT RUN 0.000000 \n", "39 DID NOT RUN 0.000000 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "evals_with_exception_info" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that after a terminal error occurs, `llm_classify` stops early and some rows are left in a `DID NOT RUN` state. We can use a `Counter` to show how many evals did not finish or encountered an error." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'COMPLETED': 17,\n", " 'DID NOT RUN': 13,\n", " 'COMPLETED WITH RETRIES': 9,\n", " 'MISSING INPUT': 1})" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(evals_with_exception_info[\"execution_status\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuring Early Exit Behavior\n", "\n", "You can also pass `exit_on_error=False` to `llm_classify`, so that rows that are missing inputs or that fail during execution are skipped instead of stopping the run. This setting can be combined with `max_retries` to fully configure exception handling behavior." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "571cca2fc79d4655b554a245dddb5676", "version_major": 2, "version_minor": 0 }, "text/plain": [ "llm_classify | | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Retries exhausted after 3 attempts: What could have possibly happened here?!\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!')\n", "Requeuing...\n", "Retries exhausted after 1 attempts: Missing template variables: reference\n", "Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!')\n", 
"Requeuing...\n", "Retries exhausted after 1 attempts: Missing template variables: input\n" ] } ], "source": [ "rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())\n", "all_evals = llm_classify(\n", " dataframe=df_sample,\n", " template=RAG_RELEVANCY_PROMPT_TEMPLATE,\n", " model=funny_model,\n", " rails=rails,\n", " concurrency=3,\n", " max_retries=2,\n", " exit_on_error=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>label</th>\n", " <th>exceptions</th>\n", " <th>execution_status</th>\n", " <th>execution_seconds</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.656291</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>1.214457</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>1.006221</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>None</td>\n", " <td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>FAILED</td>\n", " <td>0.000806</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.598868</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>unrelated</td>\n", " 
<td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.620612</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.491993</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.498494</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.460664</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.543839</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.361752</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.827295</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.565072</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.441669</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.860275</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.361320</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.684321</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.595008</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.480569</td>\n", " </tr>\n", " 
<tr>\n", " <th>19</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.658973</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.672698</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.372545</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.506847</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.670335</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.545959</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>unrelated</td>\n", " <td>[RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.515250</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.585052</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.518192</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>None</td>\n", " <td>[PhoenixTemplateMappingError('Missing template variables: reference')]</td>\n", " <td>MISSING INPUT</td>\n", " <td>0.000257</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.478919</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.522542</td>\n", " </tr>\n", " <tr>\n", " 
<th>31</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.643352</td>\n", " </tr>\n", " <tr>\n", " <th>32</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.471015</td>\n", " </tr>\n", " <tr>\n", " <th>33</th>\n", " <td>relevant</td>\n", " <td>[RuntimeError('What could have possibly happened here?!')]</td>\n", " <td>COMPLETED WITH RETRIES</td>\n", " <td>0.485109</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.384154</td>\n", " </tr>\n", " <tr>\n", " <th>35</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.481911</td>\n", " </tr>\n", " <tr>\n", " <th>36</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.486317</td>\n", " </tr>\n", " <tr>\n", " <th>37</th>\n", " <td>None</td>\n", " <td>[PhoenixTemplateMappingError('Missing template variables: input')]</td>\n", " <td>MISSING INPUT</td>\n", " <td>0.000209</td>\n", " </tr>\n", " <tr>\n", " <th>38</th>\n", " <td>unrelated</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.580683</td>\n", " </tr>\n", " <tr>\n", " <th>39</th>\n", " <td>relevant</td>\n", " <td>[]</td>\n", " <td>COMPLETED</td>\n", " <td>0.377904</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " label \\\n", "0 relevant \n", "1 relevant \n", "2 unrelated \n", "3 None \n", "4 unrelated \n", "5 unrelated \n", "6 relevant \n", "7 unrelated \n", "8 relevant \n", "9 relevant \n", "10 unrelated \n", "11 relevant \n", "12 unrelated \n", "13 relevant \n", "14 unrelated \n", "15 relevant \n", "16 relevant \n", "17 unrelated \n", "18 relevant \n", "19 relevant \n", "20 relevant \n", "21 relevant \n", "22 unrelated \n", "23 unrelated \n", "24 relevant \n", "25 unrelated \n", "26 unrelated \n", "27 relevant \n", "28 None \n", "29 relevant \n", "30 relevant \n", "31 relevant \n", "32 relevant \n", "33 relevant \n", "34 
unrelated \n", "35 relevant \n", "36 unrelated \n", "37 None \n", "38 unrelated \n", "39 relevant \n", "\n", " exceptions \\\n", "0 [RuntimeError('What could have possibly happened here?!')] \n", "1 [] \n", "2 [] \n", "3 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "4 [] \n", "5 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "6 [RuntimeError('What could have possibly happened here?!')] \n", "7 [] \n", "8 [] \n", "9 [] \n", "10 [] \n", "11 [] \n", "12 [] \n", "13 [] \n", "14 [] \n", "15 [] \n", "16 [] \n", "17 [] \n", "18 [] \n", "19 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "20 [] \n", "21 [RuntimeError('What could have possibly happened here?!')] \n", "22 [] \n", "23 [] \n", "24 [] \n", "25 [RuntimeError('What could have possibly happened here?!'), RuntimeError('What could have possibly happened here?!')] \n", "26 [] \n", "27 [] \n", "28 [PhoenixTemplateMappingError('Missing template variables: reference')] \n", "29 [] \n", "30 [] \n", "31 [] \n", "32 [] \n", "33 [RuntimeError('What could have possibly happened here?!')] \n", "34 [] \n", "35 [] \n", "36 [] \n", "37 [PhoenixTemplateMappingError('Missing template variables: input')] \n", "38 [] \n", "39 [] \n", "\n", " execution_status execution_seconds \n", "0 COMPLETED WITH RETRIES 0.656291 \n", "1 COMPLETED 1.214457 \n", "2 COMPLETED 1.006221 \n", "3 FAILED 0.000806 \n", "4 COMPLETED 0.598868 \n", "5 COMPLETED WITH RETRIES 0.620612 \n", "6 COMPLETED WITH RETRIES 0.491993 \n", "7 COMPLETED 0.498494 \n", "8 COMPLETED 0.460664 \n", "9 COMPLETED 0.543839 \n", "10 COMPLETED 0.361752 \n", "11 COMPLETED 0.827295 \n", "12 COMPLETED 0.565072 \n", "13 COMPLETED 0.441669 \n", "14 COMPLETED 0.860275 \n", "15 COMPLETED 0.361320 \n", "16 
COMPLETED 0.684321 \n", "17 COMPLETED 0.595008 \n", "18 COMPLETED 0.480569 \n", "19 COMPLETED WITH RETRIES 0.658973 \n", "20 COMPLETED 0.672698 \n", "21 COMPLETED WITH RETRIES 0.372545 \n", "22 COMPLETED 0.506847 \n", "23 COMPLETED 0.670335 \n", "24 COMPLETED 0.545959 \n", "25 COMPLETED WITH RETRIES 0.515250 \n", "26 COMPLETED 0.585052 \n", "27 COMPLETED 0.518192 \n", "28 MISSING INPUT 0.000257 \n", "29 COMPLETED 0.478919 \n", "30 COMPLETED 0.522542 \n", "31 COMPLETED 0.643352 \n", "32 COMPLETED 0.471015 \n", "33 COMPLETED WITH RETRIES 0.485109 \n", "34 COMPLETED 0.384154 \n", "35 COMPLETED 0.481911 \n", "36 COMPLETED 0.486317 \n", "37 MISSING INPUT 0.000209 \n", "38 COMPLETED 0.580683 \n", "39 COMPLETED 0.377904 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_evals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `exit_on_error=False`, no evals should be left in a `DID NOT RUN` state." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'COMPLETED': 30,\n", " 'COMPLETED WITH RETRIES': 7,\n", " 'MISSING INPUT': 2,\n", " 'FAILED': 1})" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(all_evals[\"execution_status\"])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }
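
The status counts shown in the notebook suggest a natural follow-up: deciding which rows deserve another attempt. The snippet below is a minimal sketch using only the standard library, not part of the notebook or of the Phoenix API. The status strings are copied from the outputs above; the helper names (`RETRYABLE_STATUSES`, `summarize_statuses`, `rows_to_rerun`) and the sample list are hypothetical. It assumes that `MISSING INPUT` rows should not be rerun until their data is repaired, which is a design choice rather than documented library behavior.

```python
from collections import Counter

# Statuses treated as worth another attempt (an assumption for this sketch).
# The strings themselves are copied from the notebook's execution_status column.
RETRYABLE_STATUSES = {"DID NOT RUN", "FAILED"}


def summarize_statuses(statuses):
    """Count how many rows ended in each execution status."""
    return Counter(statuses)


def rows_to_rerun(statuses):
    """Return positional indices of rows whose status is retryable.

    MISSING INPUT rows are left out on purpose: a rerun cannot succeed
    until the missing template variables are filled in.
    """
    return [i for i, status in enumerate(statuses) if status in RETRYABLE_STATUSES]


# Hypothetical stand-in for a list built from the execution_status column.
example_statuses = [
    "COMPLETED",
    "COMPLETED WITH RETRIES",
    "FAILED",
    "MISSING INPUT",
    "DID NOT RUN",
]

summary = summarize_statuses(example_statuses)
rerun_indices = rows_to_rerun(example_statuses)
print(summary)
print(rerun_indices)
```

With a real result DataFrame, the list could be obtained with something like `all_evals["execution_status"].tolist()`, and the returned positions used to select the rows of `df_sample` to submit again.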
