{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n", " <br>\n", " <br>\n", " <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Quickstart: Datasets and Experiments</h1>\n", "\n", "Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows you how to get up and running quickly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install Phoenix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"arize-phoenix[evals]\" openai 'httpx<0.28'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Launch Phoenix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import phoenix as px\n", "\n", "px.launch_app().view()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set your OpenAI API key." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import openai\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "openai.api_key = openai_api_key\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize a Phoenix Client. This acts as the main entry point for interacting with the Phoenix API\n", "and can be installed independently of Phoenix itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client import AsyncClient\n", "\n", "px_client = AsyncClient()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload a dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = await px_client.datasets.create_dataset(\n", " name=\"experiment-quickstart-dataset\",\n", " inputs=[{\"question\": \"What is Paul Graham known for?\"}],\n", " outputs=[{\"answer\": \"Co-founding Y Combinator and writing on startups and techology.\"}],\n", " metadata=[{\"topic\": \"tech\"}],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tasks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a task to evaluate." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Any\n", "\n", "from openai import OpenAI\n", "\n", "openai_client = OpenAI()\n", "\n", "task_prompt_template = \"Answer in a few words: {question}\"\n", "\n", "\n", "def task(input: Any) -> str:\n", " question = input[\"question\"]\n", " message_content = task_prompt_template.format(question=question)\n", " response = openai_client.chat.completions.create(\n", " model=\"gpt-4o\", messages=[{\"role\": \"user\", \"content\": message_content}]\n", " )\n", " return response.choices[0].message.content or \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define evaluators as functions. If the function has only one argument, it will be called with the task output. Otherwise, evaluator functions can be defined with any combination of the following arguments:\n", "\n", "- `input`: The input field of the dataset example\n", "- `output`: The output of the task\n", "- `expected`: The expected or reference output of the dataset example\n", "- `reference`: An alias for `expected`\n", "- `metadata`: Metadata associated with the dataset example\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def contains_keyword(output: str) -> float:\n", " keywords = [\"Y Combinator\", \"YC\"]\n", " output_lower = output.lower()\n", " return 1.0 if any(keyword.lower() in output_lower for keyword in keywords) else 0.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Any, Dict\n", "\n", "\n", "def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:\n", " # https://en.wikipedia.org/wiki/Jaccard_index\n", " actual_words = set(output.lower().split(\" \"))\n", " expected_words = set(expected[\"answer\"].lower().split(\" \"))\n", " words_in_common = actual_words.intersection(expected_words)\n", " all_words = actual_words.union(expected_words)\n", " return len(words_in_common) / len(all_words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or LLMs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client.experiments import create_evaluator\n", "\n", "eval_prompt_template = \"\"\"\n", "Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.\n", "Output only a single word (accurate or inaccurate).\n", "\n", "QUESTION: {question}\n", "\n", "REFERENCE_ANSWER: {reference_answer}\n", "\n", "ANSWER: {answer}\n", "\n", "ACCURACY (accurate / inaccurate):\n", "\"\"\"\n", "\n", "\n", "@create_evaluator(kind=\"llm\") # need the decorator or the kind will default to \"code\"\n", "def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:\n", " message_content = eval_prompt_template.format(\n", " question=input[\"question\"], reference_answer=expected[\"answer\"], answer=output\n", " )\n", " response = openai_client.chat.completions.create(\n", " model=\"gpt-4o\", messages=[{\"role\": \"user\", \"content\": message_content}]\n", " )\n", " response_message_content = response.choices[0].message.content.lower().strip()\n", " return 1.0 if response_message_content == \"accurate\" else 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run an experiment and evaluate the results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment = await px_client.experiments.run_experiment(\n", " dataset=dataset,\n", " task=task,\n", " experiment_name=\"initial-experiment\",\n", " evaluators=[contains_keyword, accuracy],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run more evaluators after the fact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment = await px_client.experiments.evaluate_experiment(\n", " experiment=experiment, evaluators=[contains_keyword, jaccard_similarity]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def show_evaluation_summary(exp: Any):\n", " contains_keyword_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"contains_keyword\"\n", " ]\n", " jaccard_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"jaccard_similarity\"\n", " ]\n", " accuracy_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"accuracy\"\n", " ]\n", "\n", " print(\"📊 Evaluation Results:\")\n", " if contains_keyword_scores:\n", " avg_contains = sum(contains_keyword_scores) / len(contains_keyword_scores)\n", " print(f\" Contains Keyword: {avg_contains:.3f} (n={len(contains_keyword_scores)})\")\n", "\n", " if accuracy_scores:\n", " avg_accuracy = sum(accuracy_scores) / len(accuracy_scores)\n", " print(f\" Accuracy: {avg_accuracy:.3f} (n={len(accuracy_scores)})\")\n", "\n", " if jaccard_scores:\n", " avg_jaccard = sum(jaccard_scores) / len(jaccard_scores)\n", " print(f\" Jaccard Similarity: {avg_jaccard:.3f} (n={len(jaccard_scores)})\")\n", "\n", "\n", "show_evaluation_summary(experiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And iterate 🚀" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }
