{
"cells": [
{
"cell_type": "markdown",
"id": "e541c8c8",
"metadata": {
"id": "e541c8c8"
},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"id": "1c2c8e6f",
"metadata": {
"id": "1c2c8e6f"
},
"source": [
"# Trace-Level Evals for a Movie Recommendation Agent"
]
},
{
"cell_type": "markdown",
"id": "a68d55ab",
"metadata": {
"id": "a68d55ab"
},
"source": [
"This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can gain insights into how well the system is performing on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying successes and failures for end-to-end performance.\n",
"\n",
"In this notebook, you will:\n",
"- Build and capture interactions (traces) from your movie recommendation agent\n",
"- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage\n",
"- Format the evaluation outputs to match Arize’s schema and log them to the platform\n",
"- Learn a robust pipeline for assessing trace-level performance\n",
"\n",
"✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook."
]
},
{
"cell_type": "markdown",
"id": "549464f8",
"metadata": {
"id": "549464f8"
},
"source": [
"# Set Up Keys & Dependencies"
]
},
{
"cell_type": "code",
"id": "5b425a65",
"metadata": {
"collapsed": true,
"id": "5b425a65"
},
"source": [
"%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "15652859",
"metadata": {
"id": "15652859"
},
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n",
" phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n",
"\n",
"\n",
"if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
" phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
"os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "f47e0638",
"metadata": {
"id": "f47e0638"
},
"source": [
"# Configure Tracing"
]
},
{
"metadata": {},
"cell_type": "code",
"source": [
"from phoenix.otel import register\n",
"\n",
"# configure the Phoenix tracer\n",
"tracer_provider = register(project_name=\"movie-rec-agent\", auto_instrument=True)"
],
"id": "36dbc00a43257651",
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "0a3919bd",
"metadata": {
"id": "0a3919bd"
},
"source": [
"# Build Movie Recommendation System"
]
},
{
"cell_type": "markdown",
"id": "7c81c9bb",
"metadata": {
"id": "7c81c9bb"
},
"source": [
"First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:\n",
"1. Movie Selector: Based on the desired genre indicated by the user, choose up to 5 recent movies availabtle for streaming\n",
"2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.\n",
"3. Preview Summarizer: For each movie, return a 1-2 sentence description\n",
"\n",
"Our most ideal flow involves a user simply giving the system a type of movie they are looking for, and in return, the user gets a list of options returned with descriptions and reviews."
]
},
{
"cell_type": "markdown",
"id": "c64844a7",
"metadata": {
"id": "c64844a7"
},
"source": [
"Let's test our agent & view traces in Arize"
]
},
{
"cell_type": "code",
"id": "f1731db8",
"metadata": {
"id": "f1731db8"
},
"source": [
"import ast\n",
"from typing import List, Union\n",
"\n",
"from agents import Agent, Runner, function_tool\n",
"from openai import OpenAI\n",
"from opentelemetry import trace\n",
"\n",
"tracer = trace.get_tracer(__name__)\n",
"\n",
"client = OpenAI()\n",
"\n",
"\n",
"@function_tool\n",
"def movie_selector_llm(genre: str) -> List[str]:\n",
" prompt = (\n",
" f\"List up to 5 recent popular streaming movies in the {genre} genre. \"\n",
" \"Provide only movie titles as a Python list of strings.\"\n",
" )\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=150,\n",
" )\n",
" content = response.choices[0].message.content\n",
" try:\n",
" movie_list = ast.literal_eval(content)\n",
" if isinstance(movie_list, list):\n",
" return movie_list[:5]\n",
" except Exception:\n",
" return content.split(\"\\n\")\n",
"\n",
"\n",
"@function_tool\n",
"def reviewer_llm(movies: Union[str, List[str]]) -> str:\n",
" if isinstance(movies, list):\n",
" movies_str = \", \".join(movies)\n",
" prompt = f\"Sort the following movies by rating from highest to lowest and provide a short review for each:\\n{movies_str}\"\n",
" else:\n",
" prompt = f\"Provide a short review and rating for the movie: {movies}\"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=300,\n",
" )\n",
" return response.choices[0].message.content.strip()\n",
"\n",
"\n",
"@function_tool\n",
"def preview_summarizer_llm(movie: str) -> str:\n",
" prompt = f\"Write a 1-2 sentence summary describing the movie '{movie}'.\"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=100,\n",
" )\n",
" return response.choices[0].message.content.strip()"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "8ee91369",
"metadata": {
"id": "8ee91369"
},
"source": [
"agent = Agent(\n",
" name=\"MovieRecommendationAgentLLM\",\n",
" tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],\n",
" instructions=(\n",
" \"You are a helpful movie recommendation assistant with access to three tools:\\n\"\n",
" \"1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\\n\"\n",
" \"2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\\n\"\n",
" \"3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\\n\\n\"\n",
" \"Your goal is to provide a helpful, user-friendly response combining relevant information.\"\n",
" ),\n",
")\n",
"\n",
"\n",
"async def main():\n",
" user_input = \"Which comedy movie should I watch?\"\n",
" result = await Runner.run(agent, user_input)\n",
" print(result.final_output)\n",
"\n",
"\n",
"await main()"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "6bef5b97",
"metadata": {
"id": "6bef5b97"
},
"source": [
"Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit."
]
},
{
"cell_type": "code",
"id": "5cf62bca",
"metadata": {
"id": "5cf62bca"
},
"source": [
"questions = [\n",
" \"Which Batman movie should I watch?\",\n",
" \"I want to watch a good romcom\",\n",
" \"What is a very scary horror movie?\",\n",
" \"Name a feel-good holiday movie\",\n",
" \"Recommend a musical with great songs\",\n",
" \"Give me a classic drama from the 90s\",\n",
"]\n",
"\n",
"for question in questions:\n",
" result = await Runner.run(agent, question)"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "f819c8ee",
"metadata": {
"id": "f819c8ee"
},
"source": [
"# Get Span Data from Phoenix"
]
},
{
"cell_type": "markdown",
"id": "0de0363e",
"metadata": {
"id": "0de0363e"
},
"source": [
"Before running our evaluations, we first retrieve the span data from Arize. We then group the spans by trace and separate the input and output values."
]
},
{
"metadata": {},
"cell_type": "code",
"source": [
"from phoenix.client import AsyncClient\n",
"\n",
"px_client = AsyncClient()\n",
"primary_df = await px_client.spans.get_spans_dataframe(project_identifier=\"movie-rec-agent\")"
],
"id": "981913a80fa0a780",
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "91b78eee",
"metadata": {
"id": "91b78eee"
},
"source": [
"import pandas as pd\n",
"\n",
"trace_df = primary_df.groupby(\"context.trace_id\").agg(\n",
" {\n",
" \"attributes.input.value\": \"first\",\n",
" \"attributes.output.value\": lambda x: \" \".join(x.dropna()),\n",
" }\n",
")\n",
"\n",
"trace_df.head()"
],
"outputs": [],
"execution_count": null
},
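{
"metadata": {},
"cell_type": "markdown",
"source": [
"Optionally, spot-check one grouped trace to confirm the aggregation worked as expected. This is a minimal sanity check; the slices below just keep the printout short."
],
"id": "3f2a9c1d7b4e5f60"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"# Inspect a single grouped trace to verify its input and concatenated output\n",
"example_trace = trace_df.iloc[0]\n",
"print(\"Input: \", str(example_trace[\"attributes.input.value\"])[:200])\n",
"print(\"Output:\", str(example_trace[\"attributes.output.value\"])[:500])"
],
"id": "8c5d1e2f9a3b4c70",
"outputs": [],
"execution_count": null
},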
{
"cell_type": "markdown",
"id": "d014913f",
"metadata": {
"id": "d014913f"
},
"source": [
"# Define and Run Evaluators"
]
},
{
"cell_type": "markdown",
"id": "a426e4ca",
"metadata": {
"id": "a426e4ca"
},
"source": [
"In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge."
]
},
{
"cell_type": "code",
"id": "d0f1768d",
"metadata": {
"id": "d0f1768d"
},
"source": [
"TOOL_CALLING_ORDER = \"\"\"\n",
"You are evaluating the correctness of the tool calling order in an LLM application's trace.\n",
"\n",
"You will be given:\n",
"1. The user input that initiated the trace\n",
"2. The full trace output, including the sequence of tool calls made by the agent\n",
"\n",
"##\n",
"User Input:\n",
"{attributes.input.value}\n",
"\n",
"Trace Output:\n",
"{attributes.output.value}\n",
"##\n",
"\n",
"Respond with exactly one word: `correct` or `incorrect`.\n",
"1. `correct` →\n",
"- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.\n",
"- A proper answer involves calls to reviews, summaries, and recommendations where relevant.\n",
"2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.\n",
"\"\"\""
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "131a5e7f",
"metadata": {
"id": "131a5e7f"
},
"source": [
"RECOMMENDATION_RELEVANCE = \"\"\"\n",
"You are evaluating the relevance of movie recommendations provided by an LLM application.\n",
"\n",
"You will be given:\n",
"1. The user input that initiated the trace\n",
"2. The list of movie recommendations output by the system\n",
"\n",
"##\n",
"User Input:\n",
"{attributes.input.value}\n",
"\n",
"Recommendations:\n",
"{attributes.output.value}\n",
"##\n",
"\n",
"Respond with exactly one word: `correct` or `incorrect`.\n",
"1. `correct` →\n",
"- All recommended movies match the requested genre or criteria in the user input.\n",
"- The recommendations should be relevant to the user's request and shouldn't be repetitive.\n",
"- `incorrect` → one or more recommendations do not match the requested genre or criteria.\n",
"\"\"\""
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "ce8d7f1b",
"metadata": {
"id": "ce8d7f1b"
},
"source": [
"import os\n",
"\n",
"import nest_asyncio\n",
"\n",
"from phoenix.evals import OpenAIModel, llm_classify\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"model = OpenAIModel(\n",
" api_key=os.environ[\"OPENAI_API_KEY\"],\n",
" model=\"gpt-4o-mini\",\n",
" temperature=0.0,\n",
")\n",
"\n",
"rails = [\"correct\", \"incorrect\"]\n",
"\n",
"tool_eval_results = llm_classify(\n",
" dataframe=trace_df,\n",
" template=TOOL_CALLING_ORDER,\n",
" model=model,\n",
" rails=rails,\n",
" provide_explanation=True,\n",
" verbose=False,\n",
")\n",
"\n",
"tool_eval_results"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"id": "efd9620e",
"metadata": {
"id": "efd9620e"
},
"source": [
"relevance_eval_results = llm_classify(\n",
" dataframe=trace_df,\n",
" template=RECOMMENDATION_RELEVANCE,\n",
" model=model,\n",
" rails=rails,\n",
" provide_explanation=True,\n",
" verbose=False,\n",
")\n",
"\n",
"relevance_eval_results"
],
"outputs": [],
"execution_count": null
},
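{
"metadata": {},
"cell_type": "markdown",
"source": [
"As an optional sanity check, tally the labels from each evaluator to get a rough pass rate across traces. This is a minimal sketch that relies only on the `label` column returned by `llm_classify`."
],
"id": "2b7c4d9e1f6a3850"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"# Count correct/incorrect labels per evaluator for a rough pass rate\n",
"summary = pd.DataFrame(\n",
"    {\n",
"        \"Tool Calling Order\": tool_eval_results[\"label\"].value_counts(),\n",
"        \"Recommendation Relevance\": relevance_eval_results[\"label\"].value_counts(),\n",
"    }\n",
").fillna(0)\n",
"\n",
"summary"
],
"id": "6e1f8a2b5c9d4e31",
"outputs": [],
"execution_count": null
},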
{
"cell_type": "markdown",
"id": "4613e975",
"metadata": {
"id": "4613e975"
},
"source": [
"# Log Results Back to Phoenix"
]
},
{
"cell_type": "markdown",
"id": "afd4401d",
"metadata": {
"id": "afd4401d"
},
"source": [
"The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations."
]
},
{
"cell_type": "code",
"id": "aec87930",
"metadata": {
"id": "aec87930",
"ExecuteTime": {
"end_time": "2025-08-25T21:08:47.221747Z",
"start_time": "2025-08-25T21:08:47.174358Z"
}
},
"source": [
"root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n",
"\n",
"tool_eval_results = tool_eval_results[[\"label\", \"explanation\"]]\n",
"\n",
"# Merge tool correctness eval results with trace_df\n",
"tool_correctness_df = pd.merge(\n",
" trace_df, tool_eval_results, left_index=True, right_index=True, how=\"left\"\n",
")\n",
"\n",
"# Merge with root spans to get valid span IDs\n",
"tool_correctness_df = pd.merge(\n",
" tool_correctness_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
").set_index(\"context.span_id\", drop=False)\n",
"\n",
"relevance_eval_results = relevance_eval_results[[\"label\", \"explanation\"]]\n",
"\n",
"# Merge relevance eval results with trace_df\n",
"relevance_df = pd.merge(\n",
" trace_df, relevance_eval_results, left_index=True, right_index=True, how=\"left\"\n",
")\n",
"\n",
"# Merge with root spans to get valid span IDs\n",
"relevance_df = pd.merge(\n",
" relevance_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
").set_index(\"context.span_id\", drop=False)\n",
"\n",
"\n",
"# Log to Phoenix\n",
"await px_client.spans.log_span_annotations_dataframe(\n",
" dataframe=tool_correctness_df,\n",
" annotation_name=\"Tool Correctness\",\n",
" annotator_kind=\"LLM\",\n",
")\n",
"await px_client.spans.log_span_annotations_dataframe(\n",
" dataframe=relevance_df,\n",
" annotation_name=\"Recommendation Relevance\",\n",
" annotator_kind=\"LLM\",\n",
")"
],
"outputs": [],
"execution_count": 32
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}