# summarization.ipynb
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n", " </p>\n", "</center>\n", "<h1 align=\"center\">Prompt Template Iteration for a Summarization Service</h1>\n", "\n", "Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.\n", "\n", "In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:\n", "\n", "- Upload a **dataset** of **examples** containing articles and human-written reference summaries to Phoenix\n", "- Define an **experiment task** that summarizes a news article\n", "- Devise **evaluators** for length and ROUGE score\n", "- Run **experiments** to iterate on your prompt template and to compare the summaries produced by different LLMs\n", "\n", "⚠️ This tutorial requires and OpenAI API key, and optionally, an Anthropic API key.\n", "\n", "Let's get started!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies and Import Libraries\n", "\n", "Install requirements and import libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U arize-phoenix anthropic openai openinference-instrumentation-openai openinference-instrumentation-anthropic rouge tiktoken datasets pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Any, Dict\n", "\n", "import pandas as pd\n", "\n", "pd.set_option(\"display.max_colwidth\", None) # display full cells of dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch Phoenix\n", "\n", "Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import phoenix as px\n", "\n", "px.launch_app().view()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client import AsyncClient\n", "\n", "px_client = AsyncClient()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instrument Your Application\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openinference.instrumentation.anthropic import AnthropicInstrumentor\n", "from openinference.instrumentation.openai import OpenAIInstrumentor\n", "\n", "from phoenix.otel import register\n", "\n", "tracer_provider = register(endpoint=\"http://127.0.0.1:6006/v1/traces\")\n", "OpenAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)\n", "AnthropicInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Your Dataset\n", "\n", "Download your [data](https://huggingface.co/datasets/abisee/cnn_dailymail) from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM generated summaries.\n", "\n", "Upload the data as a **dataset** in Phoenix and follow the link in the cell output to inspect the individual **examples** of the dataset. Later in the notebook, you will run **experiments** over this dataset in order to iteratively improve your summarization application." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime, timezone\n", "\n", "from datasets import load_dataset\n", "\n", "hf_ds = load_dataset(\"abisee/cnn_dailymail\", \"3.0.0\")\n", "df = (\n", " hf_ds[\"test\"]\n", " .to_pandas()\n", " .sample(n=10, random_state=0)\n", " .reset_index(drop=True)\n", " .rename(columns={\"highlights\": \"summary\"})\n", ")\n", "dataset = await px_client.datasets.create_dataset(\n", " name=f\"news-article-summaries-{datetime.now(timezone.utc)}\",\n", " dataframe=df,\n", " input_keys=[\"article\"],\n", " output_keys=[\"summary\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Your Experiment Task\n", "\n", "A **task** is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An **experiment** maps a task across all the examples in a dataset and optionally executes **evaluators** to grade the task outputs.\n", "\n", "You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "if os.environ.get(\"OPENAI_API_KEY\") is None:\n", " os.environ[\"OPENAI_API_KEY\"] = getpass(\"🔑 Enter your OpenAI API key: \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define a function to format a prompt template and invoke an OpenAI model on an example." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import AsyncOpenAI\n", "\n", "openai_client = AsyncOpenAI()\n", "\n", "\n", "async def summarize_article_openai(input: Any, prompt_template: str, model: str) -> str:\n", " formatted_prompt_template = prompt_template.format(article=input[\"article\"])\n", " response = await openai_client.chat.completions.create(\n", " model=model,\n", " messages=[\n", " {\"role\": \"assistant\", \"content\": formatted_prompt_template},\n", " ],\n", " )\n", " assert response.choices\n", " return response.choices[0].message.content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this function, you can use `functools.partial` to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import textwrap\n", "from functools import partial\n", "\n", "template = \"\"\"\n", "Summarize the article:\n", "\n", "ARTICLE\n", "=======\n", "{article}\n", "\n", "SUMMARY\n", "=======\n", "\"\"\"\n", "gpt_4o = \"gpt-4o-2024-05-13\"\n", "task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)\n", "test_example = dataset.examples[0]\n", "print(textwrap.fill(await task(test_example[\"input\"]), width=100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Your Evaluators\n", "\n", "Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:\n", "\n", "- ROUGE-1 precision is the proportion of overlapping tokens (present in both reference and generated summaries) that are present in the generated summary (number of overlapping tokens / number of tokens in the generated summary)\n", "- ROUGE-1 recall is the proportion of overlapping tokens that are present in the reference summary (number of overlapping tokens / number of tokens in the reference summary)\n", "- ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.\n", "\n", "Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 1 / 2 are considered excellent, and a [model fine-tuned on this particular dataset achieved a rouge score of ~0.44](https://huggingface.co/datasets/abisee/cnn_dailymail#supported-tasks-and-leaderboards).\n", "\n", "Since we also care about conciseness, you'll also define an evaluator to count the number of tokens in each generated summary.\n", "\n", "Note that you can use any third-party library you like while defining evaluators (in your case, `rouge` and `tiktoken`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "from rouge import Rouge\n", "\n", "\n", "# convenience functions\n", "def _rouge_1(hypothesis: str, reference: str) -> Dict[str, Any]:\n", " scores = Rouge().get_scores(hypothesis, reference)\n", " return scores[0][\"rouge-1\"]\n", "\n", "\n", "def _rouge_1_f1_score(hypothesis: str, reference: str) -> float:\n", " return _rouge_1(hypothesis, reference)[\"f\"]\n", "\n", "\n", "def _rouge_1_precision(hypothesis: str, reference: str) -> float:\n", " return _rouge_1(hypothesis, reference)[\"p\"]\n", "\n", "\n", "def _rouge_1_recall(hypothesis: str, reference: str) -> float:\n", " return _rouge_1(hypothesis, reference)[\"r\"]\n", "\n", "\n", "# evaluators\n", "def rouge_1_f1_score(output: str, expected: Dict[str, Any]) -> float:\n", " return _rouge_1_f1_score(hypothesis=output, reference=expected[\"summary\"])\n", "\n", "\n", "def rouge_1_precision(output: str, expected: Dict[str, Any]) -> float:\n", " return _rouge_1_precision(hypothesis=output, reference=expected[\"summary\"])\n", "\n", "\n", "def rouge_1_recall(output: str, expected: Dict[str, Any]) -> float:\n", " return _rouge_1_recall(hypothesis=output, reference=expected[\"summary\"])\n", "\n", "\n", "def num_tokens(output: str) -> int:\n", " encoding = tiktoken.encoding_for_model(gpt_4o)\n", " return len(encoding.encode(output))\n", "\n", "\n", "EVALUATORS = [rouge_1_f1_score, rouge_1_precision, rouge_1_recall, num_tokens]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Experiments and Iterate on Your Prompt Template\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_experiment_results = await px_client.experiments.run_experiment(\n", " dataset=dataset,\n", " task=task,\n", " experiment_name=\"initial-template\",\n", " experiment_description=\"first experiment using a simple prompt template\",\n", " experiment_metadata={\"vendor\": \"openai\", \"model\": gpt_4o},\n", " evaluators=EVALUATORS,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def show_evaluation_summary(exp):\n", " rouge_f1_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"rouge_1_f1_score\"\n", " ]\n", " rouge_precision_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"rouge_1_precision\"\n", " ]\n", " rouge_recall_scores = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"rouge_1_recall\"\n", " ]\n", " token_counts = [\n", " run.result[\"score\"]\n", " for run in exp[\"evaluation_runs\"]\n", " if run.result and \"score\" in run.result and run.name == \"num_tokens\"\n", " ]\n", "\n", " print(\"Template results:\")\n", " print(\n", " f\" ROUGE-1 F1: {sum(rouge_f1_scores) / len(rouge_f1_scores):.3f} (n={len(rouge_f1_scores)})\"\n", " )\n", " print(\n", " f\" ROUGE-1 Precision: {sum(rouge_precision_scores) / len(rouge_precision_scores):.3f} (n={len(rouge_precision_scores)})\"\n", " )\n", " print(\n", " f\" ROUGE-1 Recall: {sum(rouge_recall_scores) / len(rouge_recall_scores):.3f} (n={len(rouge_recall_scores)})\"\n", " )\n", " print(\n", " f\" Average tokens: {sum(token_counts) / len(token_counts):.1f} (n={len(token_counts)})\"\n", " )\n", "\n", "\n", "show_evaluation_summary(initial_experiment_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our initial prompt template contained little guidance. It resulted in an ROUGE-1 F1-score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "concise_template = \"\"\"\n", "Summarize the article in two to four sentences. 
Be concise and include only the most important information.

ARTICLE
=======
{article}

SUMMARY
=======
"""
task = partial(summarize_article_openai, prompt_template=concise_template, model=gpt_4o)
second_experiment_results = await px_client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    experiment_name="concise-template",
    experiment_description="explicitly instruct the LLM to be concise",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

show_evaluation_summary(second_experiment_results)
```

Inspecting the experiment results, you'll notice that the average `num_tokens` has indeed decreased, but the generated summaries are still far more verbose than the reference summaries.

Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.

```python
train_df = (
    hf_ds["train"]
    .to_pandas()
    .sample(n=5, random_state=42)
    .head()
    .rename(columns={"highlights": "summary"})
)

example_template = """
ARTICLE
=======
{article}

SUMMARY
=======
{summary}
"""

examples = "\n".join(
    [
        example_template.format(article=row["article"], summary=row["summary"])
        for _, row in train_df.iterrows()
    ]
)

few_shot_template = """
Summarize the article in two to four sentences. Be concise and include only the most important information, as in the examples below.

EXAMPLES
========

{examples}


Now summarize the following article.

ARTICLE
=======
{article}

SUMMARY
=======
"""

few_shot_template = few_shot_template.format(
    examples=examples,
    article="{article}",
)
print(few_shot_template)
```

Now run the experiment.

```python
task = partial(summarize_article_openai, prompt_template=few_shot_template, model=gpt_4o)
few_shot_experiment_results = await px_client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    experiment_name="few-shot-template",
    experiment_description="include examples",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)

show_evaluation_summary(few_shot_experiment_results)
```

By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.

## Compare With Another Model (Optional)

⚠️ This section requires an Anthropic API key.

Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task.
Anthropic's Claude is notable for producing concise and to-the-point output.

First, enter your Anthropic API key if it is not already present.

```python
import os
from getpass import getpass

if os.environ.get("ANTHROPIC_API_KEY") is None:
    os.environ["ANTHROPIC_API_KEY"] = getpass("🔑 Enter your Anthropic API key: ")
```

Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment. Tasks can optionally accept special argument names that are bound to specific values; for instance, `input` is bound to the input of the dataset example associated with an experiment run.

```python
from anthropic import AsyncAnthropic

anthropic_client = AsyncAnthropic()


async def summarize_article_anthropic(input: Any, prompt_template: str, model: str) -> str:
    formatted_prompt_template = prompt_template.format(article=input["article"])
    message = await anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": formatted_prompt_template}],
    )
    return message.content[0].text


claude_35_sonnet = "claude-3-5-sonnet-20240620"
task = partial(
    summarize_article_anthropic, prompt_template=few_shot_template, model=claude_35_sonnet
)

anthropic_experiment_results = await px_client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    experiment_name="anthropic-few-shot",
    experiment_description="anthropic",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)

show_evaluation_summary(anthropic_experiment_results)
```

If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:

```
Here is a concise 3-sentence summary of the article...
```

See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.

## Synopsis and Next Steps

Congrats! In this tutorial, you have:

- Created a Phoenix dataset
- Defined an experimental task and custom evaluators
- Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall

As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the [OpenAI fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning/create-a-fine-tuned-model) to train a bespoke model for your needs.
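As a rough starting point for that last step, the sketch below shows one way you might convert article/summary pairs into the chat-formatted JSONL used for OpenAI fine-tuning. It reuses the `df` dataframe created earlier purely for illustration (in practice you would export a larger training split), and you should verify the exact record format and supported models against the current fine-tuning documentation.

```python
# Rough sketch: write article/summary pairs as chat-formatted JSONL for fine-tuning.
# Reuses the `df` dataframe from earlier in the notebook; in practice, export a
# larger training split and confirm the record format in the OpenAI docs.
import json

with open("summarization_finetune.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {
            "messages": [
                {"role": "system", "content": "Summarize the article in two to four sentences."},
                {"role": "user", "content": row["article"]},
                {"role": "assistant", "content": row["summary"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```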
