{ "cells": [ { "cell_type": "markdown", "id": "e541c8c8", "metadata": { "id": "e541c8c8" }, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "id": "1c2c8e6f", "metadata": { "id": "1c2c8e6f" }, "source": [ "# Trace-Level Evals for a Movie Recommendation Agent" ] }, { "cell_type": "markdown", "id": "a68d55ab", "metadata": { "id": "a68d55ab" }, "source": [ "This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can gain insights into how well the system is performing on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying successes and failures for end-to-end performance.\n", "\n", "In this notebook, you will:\n", "- Build and capture interactions (traces) from your movie recommendation agent\n", "- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage\n", "- Format the evaluation outputs to match Arize’s schema and log them to the platform\n", "- Learn a robust pipeline for assessing trace-level performance\n", "\n", "✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook." ] }, { "cell_type": "markdown", "id": "549464f8", "metadata": { "id": "549464f8" }, "source": [ "# Set Up Keys & Dependencies" ] }, { "cell_type": "code", "id": "5b425a65", "metadata": { "collapsed": true, "id": "5b425a65" }, "source": [ "%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "15652859", "metadata": { "id": "15652859" }, "source": [ "import os\n", "from getpass import getpass\n", "\n", "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n", " phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n", "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n", "\n", "\n", "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", " phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n", "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "f47e0638", "metadata": { "id": "f47e0638" }, "source": [ "# Configure Tracing" ] }, { "metadata": {}, "cell_type": "code", "source": [ "from phoenix.otel import register\n", "\n", "# configure the Phoenix tracer\n", "tracer_provider = register(project_name=\"movie-rec-agent\", auto_instrument=True)" ], "id": "36dbc00a43257651", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "0a3919bd", "metadata": { "id": "0a3919bd" }, 
"source": [ "# Build Movie Recommendation System" ] }, { "cell_type": "markdown", "id": "7c81c9bb", "metadata": { "id": "7c81c9bb" }, "source": [ "First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:\n", "1. Movie Selector: Based on the desired genre indicated by the user, choose up to 5 recent movies availabtle for streaming\n", "2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.\n", "3. Preview Summarizer: For each movie, return a 1-2 sentence description\n", "\n", "Our most ideal flow involves a user simply giving the system a type of movie they are looking for, and in return, the user gets a list of options returned with descriptions and reviews." ] }, { "cell_type": "markdown", "id": "c64844a7", "metadata": { "id": "c64844a7" }, "source": [ "Let's test our agent & view traces in Arize" ] }, { "cell_type": "code", "id": "f1731db8", "metadata": { "id": "f1731db8" }, "source": [ "import ast\n", "from typing import List, Union\n", "\n", "from agents import Agent, Runner, function_tool\n", "from openai import OpenAI\n", "from opentelemetry import trace\n", "\n", "tracer = trace.get_tracer(__name__)\n", "\n", "client = OpenAI()\n", "\n", "\n", "@function_tool\n", "def movie_selector_llm(genre: str) -> List[str]:\n", " prompt = (\n", " f\"List up to 5 recent popular streaming movies in the {genre} genre. \"\n", " \"Provide only movie titles as a Python list of strings.\"\n", " )\n", " response = client.chat.completions.create(\n", " model=\"gpt-4\",\n", " messages=[{\"role\": \"user\", \"content\": prompt}],\n", " temperature=0.7,\n", " max_tokens=150,\n", " )\n", " content = response.choices[0].message.content\n", " try:\n", " movie_list = ast.literal_eval(content)\n", " if isinstance(movie_list, list):\n", " return movie_list[:5]\n", " except Exception:\n", " return content.split(\"\\n\")\n", "\n", "\n", "@function_tool\n", "def reviewer_llm(movies: Union[str, List[str]]) -> str:\n", " if isinstance(movies, list):\n", " movies_str = \", \".join(movies)\n", " prompt = f\"Sort the following movies by rating from highest to lowest and provide a short review for each:\\n{movies_str}\"\n", " else:\n", " prompt = f\"Provide a short review and rating for the movie: {movies}\"\n", " response = client.chat.completions.create(\n", " model=\"gpt-4\",\n", " messages=[{\"role\": \"user\", \"content\": prompt}],\n", " temperature=0.7,\n", " max_tokens=300,\n", " )\n", " return response.choices[0].message.content.strip()\n", "\n", "\n", "@function_tool\n", "def preview_summarizer_llm(movie: str) -> str:\n", " prompt = f\"Write a 1-2 sentence summary describing the movie '{movie}'.\"\n", " response = client.chat.completions.create(\n", " model=\"gpt-4\",\n", " messages=[{\"role\": \"user\", \"content\": prompt}],\n", " temperature=0.7,\n", " max_tokens=100,\n", " )\n", " return response.choices[0].message.content.strip()" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "8ee91369", "metadata": { "id": "8ee91369" }, "source": [ "agent = Agent(\n", " name=\"MovieRecommendationAgentLLM\",\n", " tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],\n", " instructions=(\n", " \"You are a helpful movie recommendation assistant with access to three tools:\\n\"\n", " \"1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\\n\"\n", " \"2. 
Next, we'll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.

```python
questions = [
    "Which Batman movie should I watch?",
    "I want to watch a good romcom",
    "What is a very scary horror movie?",
    "Name a feel-good holiday movie",
    "Recommend a musical with great songs",
    "Give me a classic drama from the 90s",
]

for question in questions:
    result = await Runner.run(agent, question)
```

# Get Span Data from Phoenix

Before running our evaluations, we first retrieve the span data from Phoenix. We then group the spans by trace and separate the input and output values.

```python
from phoenix.client import AsyncClient

px_client = AsyncClient()
primary_df = await px_client.spans.get_spans_dataframe(project_identifier="movie-rec-agent")
```

```python
import pandas as pd

# Group spans by trace: keep the first input value per trace and join all
# non-null output values into a single string.
trace_df = primary_df.groupby("context.trace_id").agg(
    {
        "attributes.input.value": "first",
        "attributes.output.value": lambda x: " ".join(x.dropna()),
    }
)

trace_df.head()
```
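Before defining evaluators, it can help to spot-check the grouped dataframe. A quick sketch, assuming `trace_df` from the previous cell is non-empty:

```python
# Print the first trace's input and the start of its concatenated outputs
# to verify that grouping by trace id worked as expected.
sample = trace_df.iloc[0]
print("Input: ", sample["attributes.input.value"])
print("Output:", sample["attributes.output.value"][:500], "...")
```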
# Define and Run Evaluators

In this tutorial, we will evaluate two aspects: tool usage and recommendation relevance. You can add any additional evaluation templates you like. We then run the evaluations using an LLM as the judge; the `{attributes.input.value}` and `{attributes.output.value}` placeholders in each template are filled from the matching columns of `trace_df`.

```python
TOOL_CALLING_ORDER = """
You are evaluating the correctness of the tool calling order in an LLM application's trace.

You will be given:
1. The user input that initiated the trace
2. The full trace output, including the sequence of tool calls made by the agent

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` →
- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.
- A proper answer involves calls to reviews, summaries, and recommendations where relevant.
2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.
"""
```

```python
RECOMMENDATION_RELEVANCE = """
You are evaluating the relevance of movie recommendations provided by an LLM application.

You will be given:
1. The user input that initiated the trace
2. The list of movie recommendations output by the system

##
User Input:
{attributes.input.value}

Recommendations:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` →
- All recommended movies match the requested genre or criteria in the user input.
- The recommendations are relevant to the user's request and are not repetitive.
2. `incorrect` → One or more recommendations do not match the requested genre or criteria.
"""
```

```python
import os

import nest_asyncio

from phoenix.evals import OpenAIModel, llm_classify

nest_asyncio.apply()

model = OpenAIModel(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o-mini",
    temperature=0.0,
)

rails = ["correct", "incorrect"]

tool_eval_results = llm_classify(
    dataframe=trace_df,
    template=TOOL_CALLING_ORDER,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

tool_eval_results
```

```python
relevance_eval_results = llm_classify(
    dataframe=trace_df,
    template=RECOMMENDATION_RELEVANCE,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

relevance_eval_results
```
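Before logging, you can eyeball how often each evaluator returned `correct`. A quick sketch using the result dataframes above:

```python
# Share of traces labeled correct/incorrect by each evaluator
print(tool_eval_results["label"].value_counts(normalize=True))
print(relevance_eval_results["label"].value_counts(normalize=True))
```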
] }, { "cell_type": "code", "id": "aec87930", "metadata": { "id": "aec87930", "ExecuteTime": { "end_time": "2025-08-25T21:08:47.221747Z", "start_time": "2025-08-25T21:08:47.174358Z" } }, "source": [ "root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n", "\n", "tool_eval_results = tool_eval_results[[\"label\", \"explanation\"]]\n", "\n", "# Merge tool correctness eval results with trace_df\n", "tool_correctness_df = pd.merge(\n", " trace_df, tool_eval_results, left_index=True, right_index=True, how=\"left\"\n", ")\n", "\n", "# Merge with root spans to get valid span IDs\n", "tool_correctness_df = pd.merge(\n", " tool_correctness_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n", ").set_index(\"context.span_id\", drop=False)\n", "\n", "relevance_eval_results = relevance_eval_results[[\"label\", \"explanation\"]]\n", "\n", "# Merge relevance eval results with trace_df\n", "relevance_df = pd.merge(\n", " trace_df, relevance_eval_results, left_index=True, right_index=True, how=\"left\"\n", ")\n", "\n", "# Merge with root spans to get valid span IDs\n", "relevance_df = pd.merge(\n", " relevance_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n", ").set_index(\"context.span_id\", drop=False)\n", "\n", "\n", "# Log to Phoenix\n", "await px_client.spans.log_span_annotations_dataframe(\n", " dataframe=tool_correctness_df,\n", " annotation_name=\"Tool Correctness\",\n", " annotator_kind=\"LLM\",\n", ")\n", "await px_client.spans.log_span_annotations_dataframe(\n", " dataframe=relevance_df,\n", " annotation_name=\"Recommendation Relevance\",\n", " annotator_kind=\"LLM\",\n", ")" ], "outputs": [], "execution_count": 32 }, { "cell_type": "markdown", "id": "xVG-DsKlWb6c", "metadata": { "id": "xVG-DsKlWb6c" }, "source": [ "![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace_level_evals_phoenix.png)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }
