{
"cells": [
{
"cell_type": "markdown",
"id": "dca316ce",
"metadata": {},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"id": "e0dbe604",
"metadata": {},
"source": [
"# Evaluation of Customer Reviews using Repetitions in Phoenix\n",
"\n",
"This notebook walks through how to generate synthetic customer reviews, upload them into **Phoenix**, and run evaluations to identify patterns and repetitions. \n",
"We’ll go step by step: generating data, structuring it into a dataset, and finally running experiments inside Phoenix to compare model outputs against reference labels. \n",
"Along the way, we’ll also look at screenshots of the Phoenix UI to see how datasets and experiments are visualized. \n"
]
},
{
"cell_type": "markdown",
"id": "3fffd4f3",
"metadata": {},
"source": [
"In the background, please set up a local instance of Phoenix. \n",
"One way to do that is in your terminal, install arize-phoenix & run `phoenix serve`. For more information on other ways to run Phoenix locally, please check out our [documentation on self hosting](https://arize.com/docs/phoenix/self-hosting). "
]
},
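{
"cell_type": "markdown",
"id": "b7e1a2f0",
"metadata": {},
"source": [
"If Phoenix is running locally, the clients used below will pick up the default collector endpoint automatically. \n",
"The optional cell below is a minimal sketch that sets the endpoint explicitly; `http://localhost:6006` is the default address for `phoenix serve`, so adjust it if your setup differs. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4d2e3a1",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Optional (a sketch): point the Phoenix clients at a local instance explicitly.\n",
"# http://localhost:6006 is the default for `phoenix serve`; adjust if yours differs.\n",
"os.environ.setdefault(\"PHOENIX_COLLECTOR_ENDPOINT\", \"http://localhost:6006\")"
]
},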
{
"cell_type": "markdown",
"id": "a82d6871",
"metadata": {},
"source": [
"### Setup & Installation\n",
"We start by installing the required dependencies: \n",
"- **pandas** for data manipulation \n",
"- **openai** for LLM calls \n",
"- **arize-phoenix** to log and evaluate results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2935272c",
"metadata": {},
"outputs": [],
"source": [
"%pip install pandas openai arize-phoenix"
]
},
{
"cell_type": "markdown",
"id": "6facc3b2",
"metadata": {},
"source": [
"### Importing Libraries\n",
"Next, we import the libraries needed to: \n",
"- Generate synthetic customer reviews using the OpenAI API \n",
"- Register Phoenix for tracking experiments \n",
"- Prepare data and send it into Phoenix for evaluation \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91129268",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import re\n",
"from getpass import getpass\n",
"\n",
"import pandas as pd\n",
"from openai import AsyncOpenAI\n",
"\n",
"from phoenix.client import AsyncClient\n",
"from phoenix.otel import register\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
"\n",
"openai_client = AsyncOpenAI()\n",
"\n",
"client = AsyncClient()\n",
"\n",
"tracer_provider = register(project_name=\"generating-datasets\", auto_instrument=True)"
]
},
{
"cell_type": "markdown",
"id": "137a42d3",
"metadata": {},
"source": [
"### Generating Synthetic Customer Reviews\n",
"Here, we create a **few-shot prompt** that instructs the model to generate 25 product reviews for clothing items. \n",
"This ensures we have a realistic dataset to evaluate with multiple tones and sentiments. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a8a6926",
"metadata": {},
"outputs": [],
"source": [
"few_shot_prompt = \"\"\"\n",
"You are a creative writer simulating customer product reviews for a clothing brand.\n",
"Generate exactly 25 unique reviews. Each review should be a few sentences long (max 200 words each) and sound like something a real customer might write.\n",
"\n",
"Balance them across the following categories:\n",
"1. Highly Positive & Actionable → clear praise AND provides constructive suggestions for improvement.\n",
"2. Positive but Generic → generally favorable but vague.\n",
"3. Neutral / Mixed → highlights both pros and cons.\n",
"4. Negative but Actionable → critical but with constructive feedback.\n",
"5. Highly Negative & Non-Constructive → strongly negative, unhelpful venting.\n",
"6. Off-topic → not about clothing at all (e.g., a review mistakenly left about a different product or service). Don't say anything about how the product is not about clothing.\n",
"\n",
"Constraints:\n",
"- Cover all 6 categories across the 25 reviews.\n",
"- Use a natural human voice, with realistic details.\n",
"- Constructive feedback should be specific and actionable.\n",
"- Make them really hard for someone else to classify. Add ambiguous reviews and reviews that are not clear, such as \"The shirt is fine. Not bad, not great. Might buy again\"\n",
"- Decide the classified label randomly first and then write the review. Double check all the reviews and make sure you classify them correctly.\n",
"\n",
"OUTPUT SHAPE (JSON array ONLY; no extra text):\n",
"[\n",
" {\n",
" \"input\": str,\n",
" \"label\": \"highly positive & actionable\" | \"positive but generic\" | \"neutral\" | \"negative but actionable\" | \"highly negative\" | \"off-topic\",\n",
" }\n",
"]\n",
"\n",
"Style Examples, Here are examples for guidance (do not repeat):\n",
"{\n",
" \"input\": \"I absolutely love the new denim jacket I purchased. The fit is perfect, the stitching feels durable, and I’ve already gotten compliments. The inside lining is soft and makes it comfortable to wear for hours. One small suggestion would be to add an inner pocket for a phone or keys — that would make it perfect. Overall, I’ll definitely be back for more.\",\n",
" \"label\": \"highly positive & actionable\"\n",
"}\n",
"{\n",
" \"input\": \"The T-shirt I bought was nice. The color was good and it felt comfortable. I liked it overall and would probably buy again.\",\n",
" \"label\": \"positive but generic\"\n",
"}\n",
"{\n",
" \"input\": \"The dress arrived on time and the material is soft. However, the sizing runs a bit small, and the shade of blue was lighter than pictured. It’s not bad, but I’m not as excited about it as I hoped.\",\n",
" \"label\": \"neutral\"\n",
"}\n",
"{\n",
" \"input\": \"The shoes looked stylish but the soles wore down quickly after just a month. If the company improved the durability of the soles, these would be a great buy. Right now, I don’t think they’re worth the price.\",\n",
" \"label\": \"negative but actionable\"\n",
"}\n",
"{\n",
" \"input\": \"This sweater is terrible. The worst thing I’ve ever bought. Waste of money.\",\n",
" \"label\": \"highly negative & non-constructive\"\n",
"}\n",
"{\n",
" \"input\": \"I'm very disappointed in my delivery. The dog food arrived late and was leaking.\",\n",
" \"label\": \"off-topic\"\n",
"}\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "e07ca64a",
"metadata": {},
"source": [
"### Running the LLM to Generate Data\n",
"We send our prompt to the OpenAI model to generate the reviews. \n",
"The output will be a structured set of text responses simulating customer feedback. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e88bc163",
"metadata": {},
"outputs": [],
"source": [
"resp = await openai_client.chat.completions.create(\n",
" model=\"gpt-5\",\n",
" messages=[{\"role\": \"user\", \"content\": few_shot_prompt}],\n",
")\n",
"content = resp.choices[0].message.content.strip()\n",
"\n",
"try:\n",
" data = json.loads(content)\n",
"except json.JSONDecodeError:\n",
" m = re.search(r\"\\[\\s*{.*}\\s*\\]\\s*$\", content, re.S)\n",
" assert m, \"Model did not return a JSON array.\"\n",
" data = json.loads(m.group(0))"
]
},
{
"cell_type": "markdown",
"id": "c08662cd",
"metadata": {},
"source": [
"### Creating a DataFrame\n",
"We load the generated responses into a **pandas DataFrame** with two columns: \n",
"- `input`: the customer review text \n",
"- `label`: the sentiment category we will later evaluate \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "659fd27e",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(data)[[\"input\", \"label\"]]\n",
"df"
]
},
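{
"cell_type": "markdown",
"id": "d5f6a7b8",
"metadata": {},
"source": [
"### Sanity-Checking the Generated Labels\n",
"Before uploading, it's worth a quick look at the generated labels. \n",
"The cell below is a small sketch that assumes the model followed the output shape above: it normalizes the labels, prints the category distribution, and flags anything outside the expected set. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6a7b8c9",
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on the generated reviews (a sketch; assumes the output shape above).\n",
"expected_labels = {\n",
"    \"highly positive & actionable\",\n",
"    \"positive but generic\",\n",
"    \"neutral\",\n",
"    \"negative but actionable\",\n",
"    \"highly negative & non-constructive\",\n",
"    \"off-topic\",\n",
"}\n",
"\n",
"# Normalize casing/whitespace so small formatting differences don't look like new labels.\n",
"df[\"label\"] = df[\"label\"].str.strip().str.lower()\n",
"print(df[\"label\"].value_counts())\n",
"\n",
"unexpected = df.loc[~df[\"label\"].isin(expected_labels), \"label\"].unique()\n",
"if len(unexpected):\n",
"    print(\"Unexpected labels:\", list(unexpected))"
]
},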
{
"cell_type": "markdown",
"id": "d91ac12e",
"metadata": {},
"source": [
"### Uploading to Phoenix\n",
"We now create a **Phoenix dataset** named `clothing-product-reviews` from our DataFrame. This allows us to track, explore, and evaluate the generated reviews inside Phoenix. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed69072e",
"metadata": {},
"outputs": [],
"source": [
"dataset_name = \"my-customer-product-reviews\"\n",
"dataset = await client.datasets.create_dataset(\n",
" name=dataset_name,\n",
" dataframe=df,\n",
" input_keys=[\"input\"],\n",
" output_keys=[\"label\"],\n",
")\n",
"print(\"Dataset created.\")"
]
},
{
"cell_type": "markdown",
"id": "22b4a183",
"metadata": {},
"source": [
"This is what your uploaded dataset will look like in the Phoenix UI! \n",
"\n",
"<img alt=\"uploaded dataset image\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/repetitions_dataset_view.png\" width=\"900\"/>"
]
},
{
"cell_type": "markdown",
"id": "96c013ba",
"metadata": {},
"source": [
"### Defining the Evaluation Task\n",
"We define a task function that represents how we want to evaluate each review. \n",
"This is where you could run another LLM pass (or a heuristic) to classify the review. \n",
"Phoenix wraps each run into an **Example** object for easy logging. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6714c759",
"metadata": {},
"outputs": [],
"source": [
"async def my_task(theInput) -> str:\n",
" TASK_PROMPT = f\"\"\"\n",
" You will be given a single customer review about products from a clothing brand.\n",
" Your job is to classify the type of review into a label.\n",
" Please provide an explanation as to how you came to your answer.\n",
"\n",
" Allowed labels:\n",
" - Highly Positive & Actionable\n",
" - Positive but Generic\n",
" - Neutral / Mixed\n",
" - Negative but Actionable\n",
" - Highly Negative & Non-Constructive\n",
" - Off-topic\n",
"\n",
" Here is the customer review: {theInput}\n",
"\n",
" RESPONSE FORMAT:\n",
" First provide your explanation, then on a new line write \"LABEL:\" followed by the exact label.\n",
" Example:\n",
" EXPLANATION: This review shows mixed sentiment with both positive and negative aspects...\n",
" LABEL: Neutral / Mixed\n",
" \"\"\"\n",
" resp = await openai_client.chat.completions.create(\n",
" model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": TASK_PROMPT}], temperature=1.0\n",
" )\n",
" content = resp.choices[0].message.content.strip()\n",
"\n",
" if \"LABEL:\" in content:\n",
" label = content.split(\"LABEL:\")[-1].strip()\n",
" return label\n",
" else:\n",
" return content.split(\"\\n\")[-1].strip()"
]
},
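{
"cell_type": "markdown",
"id": "f7b8c9d0",
"metadata": {},
"source": [
"Before launching the full experiment, a quick smoke test on a single review helps confirm that the task returns a clean label. The cell below is just a sketch that calls `my_task` directly on the first review in our DataFrame. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8c9d0e1",
"metadata": {},
"outputs": [],
"source": [
"# Smoke test (a sketch): run the task once on a single review before the full experiment.\n",
"sample_review = df.iloc[0][\"input\"]\n",
"print(\"Review:\", sample_review[:120], \"...\")\n",
"print(\"Predicted label:\", await my_task(sample_review))"
]
},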
{
"cell_type": "markdown",
"id": "9c1e8957",
"metadata": {},
"source": [
"### Running an Experiment\n",
"We run an experiment on our dataset using the defined task. \n",
"This produces a labeled set of outputs that we can compare against our expectations. \n",
"Phoenix records: \n",
"- inputs (customer reviews) \n",
"- outputs (model classifications) \n",
"- metadata (timing, tokens, cost, etc.) \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c5afa1f",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client.experiments import async_run_experiment\n",
"\n",
"experiment = await async_run_experiment(\n",
" dataset=dataset,\n",
" task=my_task,\n",
" experiment_name=\"testing explanations\",\n",
" client=client,\n",
" repetitions=3,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "cbb164c6",
"metadata": {},
"source": [
"This is what your uploaded experiment will look like in the Phoenix UI! You can click through the arrows as you want to look through each of the repetitions\n",
"\n",
"<img alt=\"uploaded dataset image\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/repetitions_experiment_view.png\" width=\"900\"/>"
]
},
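{
"cell_type": "markdown",
"id": "b9d0e1f2",
"metadata": {},
"source": [
"### Checking Consistency Across Repetitions\n",
"Beyond the UI, you may want a quick programmatic read on how consistent the classifications are. \n",
"The cell below is a minimal, local sketch that is independent of the Phoenix experiment: it re-runs `my_task` a few times per review and reports how often all repetitions agree and how often the majority vote matches the reference label. The `normalize` helper is an assumption that maps the task's \"Neutral / Mixed\" phrasing onto the dataset's \"neutral\" label. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0e1f2a3",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from collections import Counter\n",
"\n",
"# Local consistency check (a sketch, independent of the Phoenix experiment):\n",
"# re-run the task a few times per review and summarize agreement.\n",
"# Note: this fires many concurrent API calls; lower N_REPS or batch if you hit rate limits.\n",
"N_REPS = 3\n",
"\n",
"\n",
"def normalize(label: str) -> str:\n",
"    # Assumption: the task's \"Neutral / Mixed\" phrasing corresponds to the dataset's\n",
"    # \"neutral\" label; the other labels already match after lowercasing.\n",
"    label = label.strip().lower()\n",
"    return \"neutral\" if label.startswith(\"neutral\") else label\n",
"\n",
"\n",
"async def label_with_reps(review: str, n: int = N_REPS) -> list:\n",
"    labels = await asyncio.gather(*(my_task(review) for _ in range(n)))\n",
"    return [normalize(label) for label in labels]\n",
"\n",
"\n",
"all_reps = await asyncio.gather(*(label_with_reps(review) for review in df[\"input\"]))\n",
"\n",
"agree = sum(len(set(reps)) == 1 for reps in all_reps)\n",
"majority_correct = sum(\n",
"    Counter(reps).most_common(1)[0][0] == normalize(ref)\n",
"    for reps, ref in zip(all_reps, df[\"label\"])\n",
")\n",
"\n",
"print(f\"All repetitions agree on {agree}/{len(all_reps)} reviews\")\n",
"print(f\"Majority vote matches the reference on {majority_correct}/{len(all_reps)} reviews\")"
]
},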
{
"cell_type": "markdown",
"id": "46921a0e",
"metadata": {},
"source": [
"### Improving our Task\n",
"After analyzing our experiment results and the results of our repetitions, we can iterate on our eval template \n",
"and see if there are any gaps in our prompt, or other things we may need to redefine. \n",
"\n",
"Using the same Task Prompt, I added clearer definitions for the labels as the biggest change to see if there's any improvement."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa13dfbf",
"metadata": {},
"outputs": [],
"source": [
"async def improve_my_task(theInput) -> str:\n",
" TASK_PROMPT = f\"\"\"\n",
" You will be given a single customer review about products from a clothing brand, related to fashion clothing and apparel.\n",
" Your job is to classify the type of review into a label.\n",
" Please provide an explanation as to how you came to your answer.\n",
"\n",
" Allowed labels:\n",
" - Highly Positive & Actionable: clear praise AND provides constructive suggestions for improvement.\n",
" - Positive but Generic: generally favorable but vague.\n",
" - Neutral / Mixed: highlights both pros and cons.\n",
" - Negative but Actionable: critical but with constructive feedback.\n",
" - Highly Negative & Non-Constructive: strongly negative, unhelpful venting.\n",
" - Off-topic: not about clothing at all (e.g., a review mistakenly left about a different product or service).\n",
"\n",
" Here is the customer review: {theInput}\n",
"\n",
" RESPONSE FORMAT:\n",
" First provide your explanation, then on a new line write \"LABEL:\" followed by the exact label.\n",
" Example:\n",
" EXPLANATION: This review shows mixed sentiment with both positive and negative aspects...\n",
" LABEL: Neutral / Mixed\n",
" \"\"\"\n",
" resp = await openai_client.chat.completions.create(\n",
" model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": TASK_PROMPT}], temperature=1.0\n",
" )\n",
" content = resp.choices[0].message.content.strip()\n",
"\n",
" if \"LABEL:\" in content:\n",
" label = content.split(\"LABEL:\")[-1].strip()\n",
" return label\n",
" else:\n",
" return content.split(\"\\n\")[-1].strip()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2caef7ca",
"metadata": {},
"outputs": [],
"source": [
"from phoenix.client.experiments import async_run_experiment\n",
"\n",
"experiment = await async_run_experiment(\n",
" dataset=dataset,\n",
" task=improve_my_task,\n",
" experiment_name=\"improving my evaluation task\",\n",
" client=client,\n",
" repetitions=3,\n",
")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}