evals_introduction.ipynb
{ "cells": [ { "cell_type": "markdown", "id": "dacc824e", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n", " <br>\n", " <br>\n", " <a href=\"https://arize-phoenix.readthedocs.io/projects/evals/en/latest/\">Evals Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Arize Phoenix Evals 2.0</h1>\n", "\n", "We are excited to introduce `arize-phoenix-evals` 2.0, an open-source library providing tools to evaluate AI systems so you can build faster and with more confidence. We have rebuilt the library from the ground up to make evaluation faster, easier, and more powerful.\n", "\n", "**In this notebook, you will learn more about:**\n", "\n", "1. Our guiding principles\n", "2. The core library abstractions\n", "3. Usage examples\n", "4. What's changed between 2.0 and the previous version\n", "\n", "#### Our Guiding Principles\n", "\n", "**Fast:** We are optimizing for maximum speed, minimal headache.\n", "\n", "**Ergonomic:** It should be user-friendly and easy to pick up.\n", "\n", "**Flexible:** We make minimal assumptions about the shape of your data or evals.\n", "\n", "**Powerful:** Built with extensibility in mind, the library enables complex evaluation tasks.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "aa7312e8", "metadata": {}, "outputs": [], "source": [ "! pip install arize-phoenix \"arize-phoenix-evals>=2.0.0\" openai pandas openinference-instrumentation-openai --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "ebb216a5", "metadata": {}, "outputs": [], "source": [ "# set up phoenix app and tracing\n", "import phoenix as px\n", "from phoenix.otel import register\n", "\n", "px.launch_app()\n", "tracer_provider = register(auto_instrument=True)" ] }, { "cell_type": "markdown", "id": "b026cceb", "metadata": {}, "source": [ "## LLM Configuration\n", "\n", "**Core Design Principle:** The library should work with any LLM model and provider.\n", "\n", "The LLM wrapper unifies generation tasks across model providers by delegating to the most commonly installed client SDKs (OpenAI, LangChain, LiteLLM) via adapters.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "037001c1", "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "code", "execution_count": 4, "id": "dd52ac96", "metadata": {}, "outputs": [], "source": [ "from phoenix.evals.llm import LLM\n", "\n", "llm = LLM(\n", " provider=\"openai\", model=\"gpt-4o-mini\"\n", ") # you could also specify the client e.g. 
\"langchain\", \"litellm\", \"openai\"" ] }, { "cell_type": "markdown", "id": "632848d2", "metadata": {}, "source": [ "## Evaluators and Scores\n", "\n", "An evaluation is defined as any process that returns a `Score`.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "3080621a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hallucination result:\n", "{\n", " \"name\": \"hallucination\",\n", " \"score\": 1.0,\n", " \"label\": \"factual\",\n", " \"explanation\": \"The response correctly states that Paris is the capital of France, which is supported by the context provided. Thus, it does not contain any false information or hallucinations.\",\n", " \"metadata\": {\n", " \"model\": \"gpt-4o-mini\"\n", " },\n", " \"source\": \"llm\",\n", " \"direction\": \"maximize\"\n", "}\n" ] } ], "source": [ "from phoenix.evals.metrics import (\n", " HallucinationEvaluator,\n", ")\n", "\n", "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n", "hallucination_evaluator = HallucinationEvaluator(llm=llm)\n", "result = hallucination_evaluator.evaluate(\n", " {\n", " \"input\": \"What is the capital of France?\",\n", " \"output\": \"Paris is the capital of France.\",\n", " \"context\": \"Paris is the capital and largest city of France.\",\n", " }\n", ")\n", "print(\"Hallucination result:\")\n", "result[0].pretty_print()" ] }, { "cell_type": "markdown", "id": "b03f84a5", "metadata": {}, "source": [ "**Core Design Principle:** The output of evaluators should be rich with information.\n", "\n", "Evaluators always return a **list** of `Score` objects. Often, this will be a list of length 1, but some evaluators may return multiple scores for a single `eval_input` (e.g. precision/recall or multi-criteria evals).\n" ] }, { "cell_type": "markdown", "id": "b220eca8", "metadata": {}, "source": [ "## Built-In Metrics\n" ] }, { "cell_type": "markdown", "id": "44004e3b", "metadata": {}, "source": [ "### Precision, Recall, F1 (multi-score)\n", "\n", "A single evaluator can return multiple scores!\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "a465ddb9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Results:\n", "Score(name='precision', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')\n", "Score(name='recall', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')\n", "Score(name='f1', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')\n" ] } ], "source": [ "from phoenix.evals.metrics import PrecisionRecallFScore\n", "\n", "precision_recall_fscore = PrecisionRecallFScore(positive_label=\"yes\")\n", "result = precision_recall_fscore.evaluate(\n", " {\"output\": [\"no\", \"yes\", \"yes\"], \"expected\": [\"yes\", \"no\", \"yes\"]}\n", ")\n", "print(\"Results:\")\n", "print(result[0])\n", "print(result[1])\n", "print(result[2])" ] }, { "cell_type": "markdown", "id": "05595bd1", "metadata": {}, "source": [ "## Custom LLM Classification Evaluators\n", "\n", "This is similar to `llm_classify`, for LLM-as-a-judge evaluations that output a label and explanation.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "150d7c6d", "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "{\n", " \"name\": \"sentiment\",\n", " \"score\": 1.0,\n", " \"label\": \"positive\",\n", " \"explanation\": \"The text expresses a strong positive emotion of love, indicating a favorable sentiment towards something.\",\n", " \"metadata\": {\n", " \"model\": \"gpt-4o-mini\"\n", " },\n", " \"source\": \"llm\",\n", " \"direction\": \"maximize\"\n", "}\n" ] } ], "source": [ "from phoenix.evals import ClassificationEvaluator\n", "from phoenix.evals.llm import LLM\n", "\n", "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n", "\n", "evaluator = ClassificationEvaluator(\n", " name=\"sentiment\",\n", " llm=llm,\n", " prompt_template=\"Classify the sentiment of this text: {text}\",\n", " choices={\"positive\": 1.0, \"negative\": 0.0, \"neutral\": 0.5}, # specify custom score mapping!\n", ")\n", "\n", "result = evaluator.evaluate({\"text\": \"I love this!\"})\n", "result[0].pretty_print()" ] }, { "cell_type": "markdown", "id": "25e181fd", "metadata": {}, "source": [ "### About the `ClassificationEvaluator`\n", "\n", "**New features:**\n", "\n", "- Specify scores for each label\n", "- Runs on single records (not just a dataframe)\n", "- Leverages model tool calling / structured output for more reliable output parsing\n", "- There is also a factory function `create_classifier` to create `ClassificationEvaluator` objects.\n", "\n", "This abstraction can be easily extended to support multi-criteria evaluations where a judge is asked to evaluate an input across multiple dimensions in one request.\n" ] }, { "cell_type": "markdown", "id": "638889b9", "metadata": {}, "source": [ "## Input Mapping and Transformation\n", "\n", "**Core Design Principle:** The inputs to an evaluator should be well-defined and discoverable.\n", "\n", "Every evaluator has an `input_schema` which describes what inputs it expects.\n" ] }, { "cell_type": "markdown", "id": "e808dbbb", "metadata": {}, "source": [ "### Use `.describe()` to inspect an `Evaluator`'s input schema\n", "\n", "Because pydantic `BaseModel` is used for the `input_schema`, input fields can be annotated with types, descriptions, and even aliases.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "343098ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'hallucination',\n", " 'source': 'llm',\n", " 'direction': 'maximize',\n", " 'input_schema': {'properties': {'input': {'description': 'The input query.',\n", " 'title': 'Input',\n", " 'type': 'string'},\n", " 'output': {'description': 'The response to the query.',\n", " 'title': 'Output',\n", " 'type': 'string'},\n", " 'context': {'description': 'The context or reference text.',\n", " 'title': 'Context',\n", " 'type': 'string'}},\n", " 'required': ['input', 'output', 'context'],\n", " 'title': 'HallucinationInputSchema',\n", " 'type': 'object'}}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hallucination_evaluator.describe() # requires strings for input, output, and context" ] }, { "cell_type": "code", "execution_count": 9, "id": "aa53db77", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'exact_match',\n", " 'source': 'heuristic',\n", " 'direction': 'maximize',\n", " 'input_schema': {'properties': {'output': {'title': 'Output',\n", " 'type': 'string'},\n", " 'expected': {'title': 'Expected', 'type': 'string'}},\n", " 'required': ['output', 'expected'],\n", " 'title': 'Exact_matchInput',\n", " 'type': 'object'}}" ] }, "execution_count": 9, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "from phoenix.evals.metrics import exact_match\n", "\n", "exact_match.describe() # requires string output and expected" ] }, { "cell_type": "markdown", "id": "cb78bcd1", "metadata": {}, "source": [ "### Use `input_mapping` to map/transform data into expected `input_schema`\n", "\n", "An evaluator's input arguments may not perfectly match those in your example or dataset. Or, you may want to run multiple evaluators on the same example, but they have different or conflicting `input_schema`'s.\n", "\n", "You may have noticed that `Evaluators` accept an `eval_input` payload rather than keyword arguments.\n", "\n", "**Core Design Principle:** You should not have to modify your data to run evaluations.\n", "\n", "To extract the values from a nested `eval_input` payload, provide an `input_mapping` that maps evaluator's input fields to a path spec in your original data.\n", "\n", "**Possible Mapping Values:**\n", "\n", "- top-level keys in your JSON\n", "- a path spec following JSON path syntax\n", "- callable functions\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "53f9a351", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"name\": \"hallucination\",\n", " \"score\": 0.0,\n", " \"label\": \"hallucinated\",\n", " \"explanation\": \"The provided data lacks specific details in the query, context, or response, making it impossible to determine if the response is factual or hallucinated. Therefore, I cannot classify the response as either.\",\n", " \"metadata\": {\n", " \"model\": \"gpt-4o-mini\"\n", " },\n", " \"source\": \"llm\",\n", " \"direction\": \"maximize\"\n", "}\n" ] } ], "source": [ "# example nested eval input for a RAG system\n", "eval_input = {\n", " \"input\": {\"query\": \"user input query\"},\n", " \"output\": {\n", " \"responses\": [\"model answer\", \"model answer 2\"],\n", " \"documents\": [\"doc A\", \"doc B\"],\n", " },\n", " \"expected\": \"correct answer\",\n", "}\n", "\n", "# in order to run the hallucination evaluator, we need to process the eval_input to the fit the input schema\n", "input_mapping = {\n", " \"input\": \"input.query\", # dot notation to access nested keys\n", " \"output\": \"output.responses[0]\", # brackets to access list elements\n", " \"context\": lambda x: \" \".join(\n", " x[\"output\"][\"documents\"]\n", " ), # lambda function to combine the document chunks\n", "}\n", "\n", "# the evaluator uses the input_mapping to transform the eval_input into the expected input schema\n", "result = hallucination_evaluator.evaluate(eval_input, input_mapping)\n", "result[0].pretty_print()" ] }, { "cell_type": "markdown", "id": "09d15d39", "metadata": {}, "source": [ "### Use `bind_evaluator` to bind an `input_mapping` to an `Evaluator` for reuse\n", "\n", "Note: We don't need to remap \"expected\" for the `exact_match` eval because it already exists in our `eval_input`\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "8eb96ffd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"name\": \"hallucination\",\n", " \"score\": 0.0,\n", " \"label\": \"hallucinated\",\n", " \"explanation\": \"The response does not align with the context provided, indicating it may include inaccurate information.\",\n", " \"metadata\": {\n", " \"model\": \"gpt-4o-mini\"\n", " },\n", " \"source\": \"llm\",\n", " \"direction\": \"maximize\"\n", "}\n", "{\n", " \"name\": \"exact_match\",\n", " \"score\": 0.0,\n", " \"metadata\": 
{},\n", " \"source\": \"heuristic\",\n", " \"direction\": \"maximize\"\n", "}\n" ] }, { "data": { "text/plain": [ "[None, None]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import bind_evaluator\n", "\n", "# we can bind an input_mapping to an evaluator ahead of call time for easier sequential evals\n", "evaluators = [\n", " bind_evaluator(hallucination_evaluator, input_mapping),\n", " bind_evaluator(exact_match, {\"output\": \"output.responses[0]\"}),\n", "]\n", "scores = []\n", "for evaluator in evaluators:\n", " scores.append(evaluator.evaluate(eval_input)) # no need to pass input_mapping each time\n", "\n", "[score[0].pretty_print() for score in scores]" ] }, { "cell_type": "markdown", "id": "bf3b5a01", "metadata": {}, "source": [ "## A Convenient Decorator\n", "\n", "Use the `create_evaluator` decorator to turn any function that returns something \"score-like\" into an `Evaluator`.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "4a374439", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Score(name='text_length', score=0.5, label='short', explanation=None, metadata={}, source='heuristic', direction='maximize')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import create_evaluator\n", "\n", "\n", "# heuristic evaluator that returns a tuple of score and label\n", "@create_evaluator(name=\"text_length\")\n", "def text_length_score(text: str) -> tuple[float, str]:\n", " \"\"\"Score text based on length (longer = better, up to a point)\"\"\"\n", " length = len(text)\n", " if length < 10:\n", " score = 0.0\n", " label = \"too_short\"\n", " elif length < 50:\n", " score = 0.5\n", " label = \"short\"\n", " elif length < 200:\n", " score = 1.0\n", " label = \"good_length\"\n", " else:\n", " score = 0.8\n", " label = \"too_long\"\n", "\n", " return (score, label)\n", "\n", "\n", "text_length_score.evaluate({\"text\": \"This is a test\"})" ] }, { "cell_type": "code", "execution_count": 13, "id": "69e57a21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'name': 'keyword_presence',\n", " 'source': 'heuristic',\n", " 'direction': 'maximize',\n", " 'input_schema': {'properties': {'text': {'title': 'Text', 'type': 'string'},\n", " 'keywords': {'items': {'type': 'string'},\n", " 'title': 'Keywords',\n", " 'type': 'array'}},\n", " 'required': ['text', 'keywords'],\n", " 'title': 'Keyword_presenceInput',\n", " 'type': 'object'}}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from phoenix.evals import Score, create_evaluator\n", "\n", "\n", "# heuristic evaluator that returns a Score object with metadata\n", "@create_evaluator(name=\"keyword_presence\", source=\"heuristic\", direction=\"maximize\")\n", "def keyword_presence_score(text: str, keywords: list[str]) -> tuple[float, str, str]:\n", " \"\"\"Score text based on presence of keywords\"\"\"\n", " text_lower = text.lower()\n", " keyword_list = keywords\n", "\n", " found_keywords = [k for k in keyword_list if k in text_lower]\n", " score = len(found_keywords) / len(keyword_list) if keyword_list else 0.0\n", "\n", " return Score(\n", " score=score,\n", " label=f\"found_{len(found_keywords)}_of_{len(keyword_list)}\",\n", " explanation=f\"Found keywords: {found_keywords}\",\n", " metadata={\"found_keywords\": found_keywords, \"total_keywords\": len(keyword_list)},\n", " )\n", "\n", "\n", "keyword_presence_score.describe() # input schema 
## Dataframe Evaluation

Run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with added columns:

1. `{score_name}_score` (one column per score) contains the JSON-serialized score (or None if the evaluation failed)
2. `{evaluator_name}_execution_details` (one column per evaluator) contains information about the execution status, duration, and any exceptions that occurred.

**Notes:**

- Use `bind_evaluator` to bind `input_mappings` to your evaluators so they match your dataframe columns.

### Example 1: Async version with multiple evaluators

```python
import pandas as pd

from phoenix.evals import async_evaluate_dataframe, bind_evaluator
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

exact_match._input_mapping = {}  # unset the input mapping from earlier

df = pd.DataFrame(
    {
        "output": ["Yes", "Yes", "No"],
        "expected": ["Yes", "No", "No"],
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o-mini")

hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = await async_evaluate_dataframe(df, [hallucination_evaluator, exact_match])
result.head()
```

The result keeps the original columns and adds `hallucination_execution_details`, `exact_match_execution_details`, `hallucination_score`, and `exact_match_score`. Every execution-details cell reads `{"status": "COMPLETED", "exceptions": [], ...}`, and the score cells hold the serialized `Score` JSON with these values:

| | output | expected | response | hallucination_score | exact_match_score |
|---|---|---|---|---|---|
| 0 | Yes | Yes | First test | 0.0 | 1.0 |
| 1 | Yes | No | Another test | 1.0 | 0.0 |
| 2 | No | No | Third test | 1.0 | 1.0 |
\n", "\n", " exact_match_execution_details \\\n", "0 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "1 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "2 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "\n", " hallucination_score \\\n", "0 {\"name\": \"hallucination\", \"score\": 0.0, \"label... \n", "1 {\"name\": \"hallucination\", \"score\": 1.0, \"label... \n", "2 {\"name\": \"hallucination\", \"score\": 1.0, \"label... \n", "\n", " exact_match_score \n", "0 {\"name\": \"exact_match\", \"score\": 1.0, \"metadat... \n", "1 {\"name\": \"exact_match\", \"score\": 0.0, \"metadat... \n", "2 {\"name\": \"exact_match\", \"score\": 1.0, \"metadat... " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "from phoenix.evals import async_evaluate_dataframe, bind_evaluator\n", "from phoenix.evals.llm import LLM\n", "from phoenix.evals.metrics import HallucinationEvaluator, exact_match\n", "\n", "exact_match._input_mapping = {} # unset the input mapping from earlier\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"output\": [\"Yes\", \"Yes\", \"No\"],\n", " \"expected\": [\"Yes\", \"No\", \"No\"],\n", " \"context\": [\"This is a test\", \"This is another test\", \"This is a third test\"],\n", " \"query\": [\n", " \"What is the name of this test?\",\n", " \"What is the name of this test?\",\n", " \"What is the name of this test?\",\n", " ],\n", " \"response\": [\"First test\", \"Another test\", \"Third test\"],\n", " }\n", ")\n", "\n", "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n", "\n", "hallucination_evaluator = bind_evaluator(\n", " HallucinationEvaluator(llm=llm), {\"input\": \"query\", \"output\": \"response\"}\n", ")\n", "\n", "result = await async_evaluate_dataframe(df, [hallucination_evaluator, exact_match])\n", "result.head()" ] }, { "cell_type": "markdown", "id": "bc19f57c", "metadata": {}, "source": [ "### Example 2: Sync version with multi-score evaluator\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e6544f39", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>output</th>\n", " <th>expected</th>\n", " <th>precision_recall_fscore_execution_details</th>\n", " <th>precision_score</th>\n", " <th>recall_score</th>\n", " <th>f1_score</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>[Yes, Yes, No]</td>\n", " <td>[Yes, No, No]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 0.5, \"metadata\"...</td>\n", " <td>{\"name\": \"recall\", \"score\": 1.0, \"metadata\": {...</td>\n", " <td>{\"name\": \"f1\", \"score\": 0.6666666666666666, \"m...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>[Yes, No, No]</td>\n", " <td>[Yes, No, No]</td>\n", " <td>{\"status\": \"COMPLETED\", \"exceptions\": [], \"exe...</td>\n", " <td>{\"name\": \"precision\", \"score\": 1.0, \"metadata\"...</td>\n", " <td>{\"name\": \"recall\", \"score\": 1.0, \"metadata\": {...</td>\n", " <td>{\"name\": \"f1\", \"score\": 1.0, \"metadata\": {\"bet...</td>\n", " </tr>\n", " </tbody>\n", 
"</table>\n", "</div>" ], "text/plain": [ " output expected \\\n", "0 [Yes, Yes, No] [Yes, No, No] \n", "1 [Yes, No, No] [Yes, No, No] \n", "\n", " precision_recall_fscore_execution_details \\\n", "0 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "1 {\"status\": \"COMPLETED\", \"exceptions\": [], \"exe... \n", "\n", " precision_score \\\n", "0 {\"name\": \"precision\", \"score\": 0.5, \"metadata\"... \n", "1 {\"name\": \"precision\", \"score\": 1.0, \"metadata\"... \n", "\n", " recall_score \\\n", "0 {\"name\": \"recall\", \"score\": 1.0, \"metadata\": {... \n", "1 {\"name\": \"recall\", \"score\": 1.0, \"metadata\": {... \n", "\n", " f1_score \n", "0 {\"name\": \"f1\", \"score\": 0.6666666666666666, \"m... \n", "1 {\"name\": \"f1\", \"score\": 1.0, \"metadata\": {\"bet... " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "from phoenix.evals import evaluate_dataframe\n", "from phoenix.evals.metrics import PrecisionRecallFScore\n", "\n", "precision_recall_fscore = PrecisionRecallFScore(positive_label=\"Yes\")\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"output\": [[\"Yes\", \"Yes\", \"No\"], [\"Yes\", \"No\", \"No\"]],\n", " \"expected\": [[\"Yes\", \"No\", \"No\"], [\"Yes\", \"No\", \"No\"]],\n", " }\n", ")\n", "\n", "result = evaluate_dataframe(df, [precision_recall_fscore])\n", "result.head()" ] }, { "cell_type": "markdown", "id": "4562865b", "metadata": {}, "source": [ "# Practice: BYO Judge\n", "\n", "**Your task:** Create a custom LLM judge to classify text complexity. Inputs can be classified into one of the following labels: simple, moderate, or complex. For your use case, simple text is better than moderate or complex.\n", "\n", "Use the following 3 examples to test your new evaluator:\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "b303d2b4", "metadata": {}, "outputs": [], "source": [ "data = [\n", " {\n", " \"text\": \"AI is when computers learn to do things like people, like recognizing faces or playing games.\"\n", " },\n", " {\n", " \"text\": \"Machine learning is a method in artificial intelligence where systems improve their performance by learning from data, without being explicitly programmed for each task\"\n", " },\n", " {\n", " \"text\": \"Artificial intelligence systems employing deep reinforcement learning utilize hierarchical neural architectures to iteratively optimize policy gradients across high-dimensional state-action spaces, converging toward sub-optimal equilibria in stochastic environments via backpropagated reward signals and temporally extended credit assignment mechanisms.\"\n", " },\n", "]" ] }, { "cell_type": "code", "execution_count": null, "id": "99a0c630", "metadata": {}, "outputs": [], "source": [ "# write your judge here" ] }, { "cell_type": "code", "execution_count": null, "id": "6153d5fa", "metadata": {}, "outputs": [], "source": [ "# test your judge on the examples here" ] }, { "cell_type": "markdown", "id": "663a6682", "metadata": {}, "source": [ "# Practice: BYO Heuristic Evaluator\n", "\n", "**Your task:** Turn the following function into an Evaluator that calculates the Levenshtein distance between two strings.\n", "\n", "Note: Smaller values indicate higher similarity (lower score = better).\n", "\n", "Run the Evaluator on the following data:\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "f7128859", "metadata": {}, "outputs": [], "source": [ "eval_input = {\n", " \"input\": {\"query\": \"What is the capital of 
France?\"},\n", " \"output\": {\"response\": \"It is Paris\"},\n", " \"expected\": \"Paris\",\n", "}" ] }, { "cell_type": "code", "execution_count": 24, "id": "dd1112c5", "metadata": {}, "outputs": [], "source": [ "# turn this function into a heuristic evaluator\n", "def levenshtein_distance(s1: str, s2: str) -> int:\n", " \"\"\"\n", " Compute the Levenshtein distance between two strings s1 and s2.\n", " \"\"\"\n", " m, n = len(s1), len(s2)\n", "\n", " dp = [[0] * (n + 1) for _ in range(m + 1)]\n", "\n", " for i in range(m + 1):\n", " dp[i][0] = i\n", " for j in range(n + 1):\n", " dp[0][j] = j\n", "\n", " for i in range(1, m + 1):\n", " for j in range(1, n + 1):\n", " cost = 0 if s1[i - 1] == s2[j - 1] else 1\n", " dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)\n", "\n", " return dp[m][n]" ] }, { "cell_type": "code", "execution_count": null, "id": "a71c057a", "metadata": {}, "outputs": [], "source": [ "# test your evaluator on the example above.\n", "# hint: use an input_mapping to map/transform the input to the function's expected arguments." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }
