phoenix_prompt_tutorial.ipynb

# Phoenix Prompts Tutorial - Companion Notebook

This notebook accompanies the Phoenix Prompts Quickstart documentation. Follow along with the docs for detailed explanations.

**Prerequisites:**
- Phoenix running locally (`phoenix serve`)
- OpenAI API key set as `OPENAI_API_KEY` environment variable

```python
import os
from getpass import getpass

# Set OpenAI API key if not already set
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```
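
Everything below assumes the default local Phoenix at `http://localhost:6006`. If your instance lives elsewhere or requires an API key, a small optional sketch like the following (using the same `PHOENIX_COLLECTOR_ENDPOINT` / `PHOENIX_API_KEY` environment variables that the Part 4 helper reads) points the SDKs at it before anything else runs; skip it if you are running `phoenix serve` locally without auth.

```python
import os

# Optional: target a non-default Phoenix instance (assumption: the standard
# environment variables read by the Phoenix SDKs and by Part 4's helper below).
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")
# os.environ["PHOENIX_API_KEY"] = "your-phoenix-api-key"  # only if your instance requires auth
```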

---
# Part 1: Find and Edit Prompts

## Step 1: Locate Bad Spans in Traces

First, let's build and trace a support agent to generate some traces we can inspect.

```python
# Define the classification system prompt
system_prompt = """
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

Return just the category, no other text for the support query.
"""
```

```python
import json

import requests

url = "https://storage.googleapis.com/arize-phoenix-assets/assets/images/guidelines.json"
response = requests.get(url)

with open("guidelines.json", "wb") as f:
    f.write(response.content)

with open("guidelines.json", "r") as f:
    guidelines = json.load(f)

print("✅ Loaded guidelines.json")
```

```python
# Build and Trace Support Agent
from openai import OpenAI

from phoenix.otel import register

# Setup Phoenix tracing with auto-instrumentation for OpenAI
tracer_provider = register(project_name="support-agent", auto_instrument=True)
tracer = tracer_provider.get_tracer(__name__)

client = OpenAI()


@tracer.tool
def retrieve_guidelines(classification: str) -> str:
    """Retrieve guidelines based on the support query classification."""
    return guidelines.get(classification, "No guidelines found.")


@tracer.chain
def handle_support_query(query: str) -> str:
    # Step 1: Classify the query
    classification_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    classification = classification_response.choices[0].message.content

    # Step 2: Retrieve guidelines based on classification
    guideline = retrieve_guidelines(classification)

    # Step 3: Generate final response using guidelines
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Respond to the support query using the following guidelines:\n{guideline}",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```

```python
# Run the agent on some test queries to generate traces
queries = [
    "warranty reg page says 404",
    "every time i click settings, bye",
    "when's dark mode? u said soon",
    "calendar sync eats my events",
    "cant dl my info, button grayed",
]

for query in queries:
    print(f"Query: {query}")
    result = handle_support_query(query)
    print(f"Response: {result[:200]}...")
    print("-" * 50)
```

## Step 2: Replay Span and Edit Prompt in Playground

Open Phoenix UI and navigate to your traces. Click on a span and use the **Playground** to:
1. Save the original prompt as `support-classifier`
2. Edit the prompt and test changes
3. Save the edited version as Version 2
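
If you would rather do the initial save in code than in the Playground, the sketch below shows one way using the Phoenix client and the `PromptVersion` API that Part 3 of this notebook also uses. The `model_provider="OPENAI"` string and the `{{query}}` user template are assumptions about how you want the saved prompt structured.

```python
from phoenix.client import Client
from phoenix.client.types.prompts import PromptVersion

px_client = Client()

# Programmatic alternative to saving the prompt from the Playground UI,
# reusing the system_prompt defined above.
px_client.prompts.create(
    name="support-classifier",
    version=PromptVersion(
        [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user", "content": [{"type": "text", "text": "{{query}}"}]},
        ],
        model_name="gpt-4o-mini",
        model_provider="OPENAI",
        template_format="MUSTACHE",
        description="Initial support-classifier prompt saved from code",
    ),
)
```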

## Step 3: Load Edited Prompt Back Into Your Code

```python
from phoenix.client import Client

px_client = Client()

# Pull the latest version
prompt = px_client.prompts.get(prompt_identifier="support-classifier")

# Or pull a specific version
# prompt = px_client.prompts.get(prompt_version_id="YOUR_VERSION_ID")

print(f"Loaded prompt: {prompt._model_name}")
```

---
# Part 2: Test Prompts at Scale

## Step 1: Load Dataset of Inputs

```python
import pandas as pd

from phoenix.client import Client

px_client = Client()

# Load our support query dataset
support_query_csv_url = (
    "https://storage.googleapis.com/arize-phoenix-assets/assets/images/support_queries.csv"
)
support_query_df = pd.read_csv(support_query_csv_url)

print(f"Loaded {len(support_query_df)} examples")
print(support_query_df.head())
```

```python
# Upload dataset to Phoenix
support_query_dataset = px_client.datasets.create_dataset(
    dataframe=support_query_df,
    name="support-query-dataset",
    input_keys=["query"],
    output_keys=["ground_truth"],
)

print(f"✅ Created dataset: {support_query_dataset.id}")
```

## Step 2: Run Experiment with Our Current Prompt

### Define Task Function

```python
from openai import AsyncOpenAI

from phoenix.client import Client

async_openai_client = AsyncOpenAI()
px_client = Client()

prompt = px_client.prompts.get(prompt_identifier="support-classifier")
model = prompt._model_name
messages = prompt._template["messages"]

# Edit user prompt to match dataset input key "query"
messages[1]["content"][0]["text"] = "{{query}}"


async def task(input):
    task_messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"][0]["text"]}]}
        for m in messages
    ]
    task_messages[1]["content"][0]["text"] = input["query"]
    response = await async_openai_client.chat.completions.create(
        model=model,
        messages=task_messages,
    )
    return response.choices[0].message.content


print(f"Task defined with model: {model}")
```
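
Before launching the full experiment, it can help to spot-check the task on a single row. A minimal sketch, reusing the `task` defined above and one of the queries traced earlier:

```python
# Quick sanity check: run the task once on a single input before the full experiment.
sample_output = await task({"query": "cant dl my info, button grayed"})
print(sample_output)  # expect a single category label, e.g. "Data Export"
```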

### Define Evaluators

```python
# Analysis evaluator template for rich feedback
analysis_evaluator_template = """
You are an expert support ticket classifier evaluator.

Your task: Given a user query, the predicted classification from a model, and the correct classification, decide if the prediction is correct, explain why, identify possible confusion reasons, highlight the exact part(s) of the query that best support the correct classification, and (if incorrect) label the type of error made.

Here are the available classes:

Account Creation, Login Issues, Password Reset, Two-Factor Authentication, Profile Updates,
Billing Inquiry, Refund Request, Subscription Upgrade/Downgrade, Payment Method Update, Invoice Request,
Order Status, Shipping Delay, Product Return, Warranty Claim, Technical Bug Report, Feature Request,
Integration Help, Data Export, Security Concern, Terms of Service Question, Privacy Policy Question,
Compliance Inquiry, Accessibility Support, Language Support, Mobile App Issue, Desktop App Issue,
Email Notifications, Marketing Preferences, Beta Program Enrollment, General Feedback

---

**Inputs:**
- Query: {query}
- Predicted classification: {output}
- Correct classification: {ground_truth}

---

**Error Type Definitions**:
- **broad_vs_specific** → The model picked a broader category instead of the more specific correct one (or vice versa).
- **keyword_bias** → The model latched onto an isolated keyword that led to the wrong class.
- **multi_intent_confusion** → The query had multiple possible intents; model picked the less dominant one.
- **ambiguous_query** → The query was unclear or underspecified.
- **off_topic** → The query doesn't match any class well; model still guessed.
- **paraphrase_gap** → The model failed to recognize a non-standard phrasing of the correct intent.
- **other** → Any other reason.
- **none** → Use only if correctness is "correct".

---

**Output Format (JSON)**:
  "correctness": "correct" or "incorrect",
  "explanation": "Brief explanation of why the predicted classification is correct or incorrect.",
  "confusion_reason": "If incorrect, explain why the model may have made this choice. If correct, say 'no confusion'.",
  "error_type": "One of the error types above. Use 'none' if correct.",
  "evidence_span": "Exact phrase(s) from the query that strongly indicate the correct classification.",
  "prompt_fix_suggestion": "One clear instruction to add to the classifier prompt to prevent this error."
"""
```

```python
from phoenix.evals import create_evaluator
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4.1")


def normalize(label):
    return label.strip().strip('"').strip("'").lower()


async def ground_truth_evaluator(expected, output):
    """Simple evaluator: checks if output matches ground truth."""
    return normalize(expected.get("ground_truth")) == normalize(output)


SCHEMA = {
    "type": "object",
    "properties": {
        "correctness": {"type": "string", "enum": ["correct", "incorrect"]},
        "explanation": {"type": "string"},
        "confusion_reason": {"type": "string"},
        "error_type": {"type": "string"},
        "evidence_span": {"type": "string"},
        "prompt_fix_suggestion": {"type": "string"},
    },
    "required": [
        "correctness",
        "explanation",
        "confusion_reason",
        "error_type",
        "evidence_span",
        "prompt_fix_suggestion",
    ],
    "additionalProperties": False,
}


@create_evaluator(name="output_evaluator", kind="llm")
def analysis_evaluator(input, expected, output):
    """LLM evaluator: provides rich feedback on classification errors."""
    query = input.get("query")
    ground_truth = expected.get("ground_truth")

    prompt = (
        analysis_evaluator_template.replace("{query}", query)
        .replace("{ground_truth}", ground_truth)
        .replace("{output}", output)
    )
    obj = llm.generate_object(prompt=prompt, schema=SCHEMA)
    correctness = obj["correctness"]
    score = 1.0 if correctness == "correct" else 0.0
    explanation = (
        f"correctness: {correctness}; "
        f"explanation: {obj.get('explanation', '')}; "
        f"confusion_reason: {obj.get('confusion_reason', '')}; "
        f"error_type: {obj.get('error_type', '')}; "
        f"evidence_span: {obj.get('evidence_span', '')}; "
        f"prompt_fix_suggestion: {obj.get('prompt_fix_suggestion', '')};"
    )
    return {"score": score, "label": correctness, "explanation": explanation}


print("✅ Evaluators defined")
```

### Run Experiment

```python
from phoenix.client.experiments import async_run_experiment

experiment = await async_run_experiment(
    dataset=support_query_dataset,
    task=task,
    evaluators=[ground_truth_evaluator, analysis_evaluator],
    experiment_name="support-classifier-baseline",
)
```

## Step 3: Analyze Experiment Results

Navigate to the Phoenix UI to view experiment results. Filter for incorrect classifications:
```
evals["output_evaluator"].score == 0
```

Filter for broad_vs_specific errors:
```
'broad_vs_specific' in evals["output_evaluator"].explanation
```
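
The same slicing can be done outside the UI. As a rough sketch, reusing the `experiment` object from the baseline run and the `/v1/experiments/{id}/json` export that Part 4's helper also relies on, you can tally the `error_type` values the analysis evaluator packs into its explanations:

```python
from collections import Counter

import requests

# Fetch the baseline experiment's results as JSON (same endpoint used in Part 4).
resp = requests.get(f"http://localhost:6006/v1/experiments/{experiment.id}/json")
resp.raise_for_status()

error_types = Counter()
for entry in resp.json():
    for annotation in entry.get("annotations", []) or []:
        if annotation.get("name") == "output_evaluator":
            # The evaluator encodes its fields as "key: value;" pairs in the explanation.
            for item in annotation.get("explanation", "").split(";"):
                if ":" in item and item.strip().startswith("error_type"):
                    error_types[item.split(":", 1)[1].strip()] += 1

print(error_types)
```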

---
# Part 3: Compare Prompt Versions

## Edit Prompt Template (Version 3)

Based on our analysis, we'll add an instruction to address broad_vs_specific errors.

```python
from phoenix.client import Client
from phoenix.client.types.prompts import PromptVersion

px_client = Client()

# New instruction to address broad_vs_specific errors
broad_vs_specific_instruction = """When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback)."""

# Get existing prompt
existing = px_client.prompts.get(prompt_identifier="support-classifier")

# Modify the template
messages = existing._template["messages"]
messages[0]["content"][0]["text"] += "\n\n" + broad_vs_specific_instruction

# Create new version with modifications
new_version = PromptVersion(
    messages,
    model_name=existing._model_name,
    model_provider=existing._model_provider,
    template_format=existing._template_format,
    description="Added broad_vs_specific rule",
)

# Save as new version
version_3 = px_client.prompts.create(
    name="support-classifier",
    version=new_version,
)

print(f"✅ Created Version 3: {version_3.id}")
```

## Edit Prompt Parameters (Version 4)

Now let's create another version with adjusted model parameters.

```python
# Get existing prompt (fresh copy)
existing = px_client.prompts.get(prompt_identifier="support-classifier")

new_version = PromptVersion(
    existing._template["messages"],
    model_name="gpt-4.1-mini",
    model_provider=existing._model_provider,
    template_format="MUSTACHE",
    description="Using temperature=0.3, top_p=0.8, model_name=gpt-4.1-mini",
)

# Set invocation parameters
new_version._invocation_parameters = {
    "temperature": 0.3,
    "top_p": 0.8,
}

version_4 = px_client.prompts.create(
    name="support-classifier",
    version=new_version,
)

print(f"✅ Created Version 4: {version_4.id}")
```

## Compare Prompt Versions

Copy the version IDs from Phoenix UI and run experiments to compare.

```python
from openai import AsyncOpenAI

from phoenix.client import Client
from phoenix.client.experiments import async_run_experiment

px_client = Client()
async_openai_client = AsyncOpenAI()

# Get dataset
dataset = px_client.datasets.get_dataset(dataset="support-query-dataset")

# Version IDs - REPLACE WITH YOUR VERSION IDs FROM PHOENIX UI
VERSION_3 = "REPLACE_WITH_VERSION_3_ID"
VERSION_4 = "REPLACE_WITH_VERSION_4_ID"

# Get prompt versions
prompt_v3 = px_client.prompts.get(prompt_version_id=VERSION_3)
prompt_v4 = px_client.prompts.get(prompt_version_id=VERSION_4)

print(f"Version 3 model: {prompt_v3._model_name}")
print(f"Version 4 model: {prompt_v4._model_name}")
```

```python
# Define reusable task factory
def create_task(prompt):
    model = prompt._model_name
    messages = prompt._template["messages"].copy()

    async def task(input):
        # Create a copy to avoid mutating the original
        task_messages = [
            {"role": m["role"], "content": [{"type": "text", "text": m["content"][0]["text"]}]}
            for m in messages
        ]
        task_messages[1]["content"][0]["text"] = input["query"]
        response = await async_openai_client.chat.completions.create(
            model=model,
            messages=task_messages,
        )
        return response.choices[0].message.content

    return task
```
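
Note that `create_task` forwards only the model name, so Version 4's stored `temperature`/`top_p` never reach the OpenAI call. If you want those settings applied during the experiment, a variant along the lines of the sketch below (assuming the `_invocation_parameters` attribute used elsewhere in this notebook is populated on fetched versions) passes them through:

```python
def create_task_with_params(prompt):
    """Like create_task, but also forwards stored invocation parameters."""
    model = prompt._model_name
    messages = prompt._template["messages"]
    params = getattr(prompt, "_invocation_parameters", None) or {}

    async def task(input):
        task_messages = [
            {"role": m["role"], "content": [{"type": "text", "text": m["content"][0]["text"]}]}
            for m in messages
        ]
        task_messages[1]["content"][0]["text"] = input["query"]
        response = await async_openai_client.chat.completions.create(
            model=model,
            messages=task_messages,
            **params,  # e.g. temperature=0.3, top_p=0.8 for Version 4
        )
        return response.choices[0].message.content

    return task
```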

```python
# Run experiment with Version 3
print("🧪 Running experiment with Version 3...")
experiment_v3 = await async_run_experiment(
    dataset=dataset,
    task=create_task(prompt_v3),
    evaluators=[ground_truth_evaluator, analysis_evaluator],
    experiment_name="support-classifier-v3",
)

# Run experiment with Version 4
print("\n🧪 Running experiment with Version 4...")
experiment_v4 = await async_run_experiment(
    dataset=dataset,
    task=create_task(prompt_v4),
    evaluators=[ground_truth_evaluator, analysis_evaluator],
    experiment_name="support-classifier-v4",
)

print(f"\n✅ Compare results at: http://localhost:6006/datasets/{dataset.id}/experiments")
```

---
# Part 4: Optimize with Prompt Learning

## Install the Prompt Learning SDK

```bash
git clone https://github.com/priyanjindal/prompt-learning.git
cd prompt-learning
pip install .
```

## Load Experiment for Training

```python
import os

import pandas as pd
import requests


def process_experiment(experiment_id, feedback_columns=None):
    """
    Fetch experiment data from Phoenix API and process it into a DataFrame.

    Args:
        experiment_id: The Phoenix experiment ID
        feedback_columns: List of feedback field names to extract from annotations

    Returns:
        pd.DataFrame: Processed experiment data
    """
    url = f"{os.environ.get('PHOENIX_COLLECTOR_ENDPOINT', 'http://localhost:6006')}/v1/experiments/{experiment_id}/json"
    headers = {}
    if os.environ.get("PHOENIX_API_KEY"):
        headers["Authorization"] = f"Bearer {os.environ['PHOENIX_API_KEY']}"

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(
            f"Failed to fetch experiment data: {response.status_code} {response.text}"
        )

    results = response.json()

    # Build DataFrame from experiment results
    data = []
    for entry in results:
        row = {
            "input": entry.get("input", {}),
            "output": entry.get("output"),
            "ground_truth": entry.get("reference_output", {}).get("ground_truth"),
        }
        # Extract query from input
        if isinstance(row["input"], dict):
            row["query"] = row["input"].get("query", "")

        # Extract feedback from annotations
        if feedback_columns and entry.get("annotations"):
            # Find the output_evaluator annotation
            for annotation in entry["annotations"]:
                if annotation.get("name") == "output_evaluator":
                    eval_output = annotation.get("explanation", "")
                    for item in eval_output.split(";"):
                        if ":" in item:
                            key, value = item.split(":", 1)
                            key = key.strip()
                            if key in feedback_columns:
                                row[key] = value.strip()
                    break

        data.append(row)

    return pd.DataFrame(data)
```

```python
# REPLACE with your Version 4 experiment ID from Phoenix UI
EXPERIMENT_V4_ID = "REPLACE_WITH_EXPERIMENT_V4_ID"

# Feedback columns from analysis_evaluator
feedback_columns = [
    "correctness",
    "explanation",
    "confusion_reason",
    "error_type",
    "evidence_span",
    "prompt_fix_suggestion",
]

processed_experiment_data = process_experiment(
    experiment_id=EXPERIMENT_V4_ID, feedback_columns=feedback_columns
)

print(f"Processed {len(processed_experiment_data)} rows")
print(processed_experiment_data.head())
```

## Load Unoptimized Prompt

```python
import os

from phoenix.client import Client

px_client = Client()

# REPLACE with the prompt version ID you want to optimize
PROMPT_VERSION_ID = "REPLACE_WITH_PROMPT_VERSION_ID"
unoptimized_prompt = px_client.prompts.get(prompt_version_id=PROMPT_VERSION_ID)

# Extract system prompt from messages[0]
system_prompt = unoptimized_prompt._template["messages"][0]["content"][0]["text"]

print(f"Loaded system prompt ({len(system_prompt)} chars)")
```

## Optimize Prompt (Version 5)

```python
from prompt_learning import PromptLearningOptimizer

# Initialize optimizer with existing system prompt
optimizer = PromptLearningOptimizer(
    prompt=system_prompt, model_choice="gpt-4o", openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Run optimization
optimized_system_prompt = optimizer.optimize(
    dataset=processed_experiment_data,
    output_column="output",
    feedback_columns=feedback_columns,
    context_size_k=90000,
)

print("\n" + "=" * 60)
print("OPTIMIZED PROMPT")
print("=" * 60)
print(optimized_system_prompt[:500] + "...")
```
\"text\": \"{{query}}\"}]},\n", "]\n", "\n", "# Create new version with optimized prompt\n", "new_version = PromptVersion(\n", " optimized_messages,\n", " model_name=unoptimized_prompt._model_name,\n", " model_provider=unoptimized_prompt._model_provider,\n", " template_format=\"MUSTACHE\",\n", " description=\"Optimized with Prompt Learning from V4 experiment\",\n", ")\n", "\n", "# Preserve invocation parameters if any\n", "if unoptimized_prompt._invocation_parameters:\n", " new_version._invocation_parameters = unoptimized_prompt._invocation_parameters\n", "\n", "# Push to Phoenix\n", "optimized_prompt = px_client.prompts.create(\n", " name=\"support-classifier\",\n", " version=new_version,\n", ")\n", "\n", "print(f\"✅ Created optimized prompt version: {optimized_prompt.id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Measure New Prompt Version's Performance\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client.experiments import async_run_experiment\n", "\n", "print(\"🧪 Running experiment with Prompt Learning optimized prompt...\")\n", "\n", "experiment_optimized = await async_run_experiment(\n", " dataset=support_query_dataset,\n", " task=create_task(optimized_prompt),\n", " evaluators=[ground_truth_evaluator, analysis_evaluator],\n", " experiment_name=\"support-classifier-optimized\",\n", ")\n", "\n", "print(f\"\\n✅ Experiment completed: {experiment_optimized.id}\")\n", "print(\"📊 View results: http://localhost:6006/experiments\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# 🎉 Summary\n", "\n", "Congratulations! You've completed the Phoenix Prompts Tutorial!\n", "\n", "**You've learned how to:**\n", "- **Store and version prompts** in Phoenix's Prompt Hub\n", "- **Create and upload datasets** to Phoenix from CSV files or DataFrames\n", "- **Build custom evaluators** - both code-based and LLM-based with structured output\n", "- **Run experiments** to test prompts at scale with automatic evaluation tracking\n", "- **Compare prompt versions** side-by-side to measure the impact of changes\n", "- **Optimize prompts with Prompt Learning** - using experiment feedback to automatically generate improvements\n" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }
