{
"cells": [
{
"cell_type": "markdown",
"id": "e541c8c8",
"metadata": {
"id": "e541c8c8"
},
"source": [
"<center>\n",
" <p style=\"text-align:center\">\n",
" <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
" <br>\n",
" <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
" |\n",
" <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
" |\n",
" <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
" </p>\n",
"</center>"
]
},
{
"cell_type": "markdown",
"id": "1c2c8e6f",
"metadata": {
"id": "1c2c8e6f"
},
"source": [
"# Trace-Level Evals for a Movie Recommendation Agent"
]
},
{
"cell_type": "markdown",
"id": "a68d55ab",
"metadata": {
"id": "a68d55ab"
},
"source": [
"This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can gain insights into how well the system is performing on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying successes and failures for end-to-end performance.\n",
"\n",
"In this notebook, you will:\n",
"- Build and capture interactions (traces) from your movie recommendation agent\n",
"- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage\n",
"- Format the evaluation outputs to match Arize’s schema and log them to the platform\n",
"- Learn a robust pipeline for assessing trace-level performance\n",
"\n",
"✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook."
]
},
{
"cell_type": "markdown",
"id": "549464f8",
"metadata": {
"id": "549464f8"
},
"source": [
"# Set Up Keys & Dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b425a65",
"metadata": {
"collapsed": true,
"id": "5b425a65"
},
"outputs": [],
"source": [
"%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15652859",
"metadata": {
"id": "15652859"
},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n",
" phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n",
"\n",
"\n",
"if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
" phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
"os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n",
"\n",
"if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
" openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
"os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
]
},
{
"cell_type": "markdown",
"id": "f47e0638",
"metadata": {
"id": "f47e0638"
},
"source": [
"# Configure Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36dbc00a43257651",
"metadata": {
"id": "36dbc00a43257651"
},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"# configure the Phoenix tracer\n",
"tracer_provider = register(project_name=\"movie-rec-agent\", auto_instrument=True)"
]
},
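{
"cell_type": "markdown",
"id": "manual-instrumentation-note",
"metadata": {
"id": "manual-instrumentation-note"
},
"source": [
"With `auto_instrument=True`, `register` discovers and activates any installed OpenInference instrumentors (here, the OpenAI and OpenAI Agents ones). If you prefer explicit control, a roughly equivalent manual setup for the OpenAI instrumentor looks like this sketch:\n",
"\n",
"```python\n",
"from openinference.instrumentation.openai import OpenAIInstrumentor\n",
"from phoenix.otel import register\n",
"\n",
"tracer_provider = register(project_name=\"movie-rec-agent\")\n",
"OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)\n",
"```"
]
},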
{
"cell_type": "markdown",
"id": "7c81c9bb",
"metadata": {
"id": "7c81c9bb"
},
"source": [
"First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:\n",
"1. Movie Selector: Based on the desired genre indicated by the user, choose up to 5 recent movies availabtle for streaming\n",
"2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.\n",
"3. Preview Summarizer: For each movie, return a 1-2 sentence description\n",
"\n",
"Our most ideal flow involves a user simply giving the system a type of movie they are looking for, and in return, the user gets a list of options returned with descriptions and reviews."
]
},
{
"cell_type": "markdown",
"id": "c64844a7",
"metadata": {
"id": "c64844a7"
},
"source": [
"Let's test our agent & view traces in Arize"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1731db8",
"metadata": {
"id": "f1731db8"
},
"outputs": [],
"source": [
"import ast\n",
"from typing import List, Union\n",
"\n",
"from agents import Agent, Runner, function_tool\n",
"from openai import OpenAI\n",
"from opentelemetry import trace\n",
"\n",
"tracer = trace.get_tracer(__name__)\n",
"\n",
"client = OpenAI()\n",
"\n",
"\n",
"@function_tool\n",
"def movie_selector_llm(genre: str) -> List[str]:\n",
" prompt = (\n",
" f\"List up to 5 recent popular streaming movies in the {genre} genre. \"\n",
" \"Provide only movie titles as a Python list of strings.\"\n",
" )\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=150,\n",
" )\n",
" content = response.choices[0].message.content\n",
" try:\n",
" movie_list = ast.literal_eval(content)\n",
" if isinstance(movie_list, list):\n",
" return movie_list[:5]\n",
" except Exception:\n",
" return content.split(\"\\n\")\n",
"\n",
"\n",
"@function_tool\n",
"def reviewer_llm(movies: Union[str, List[str]]) -> str:\n",
" if isinstance(movies, list):\n",
" movies_str = \", \".join(movies)\n",
" prompt = f\"Sort the following movies by rating from highest to lowest and provide a short review for each:\\n{movies_str}\"\n",
" else:\n",
" prompt = f\"Provide a short review and rating for the movie: {movies}\"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=300,\n",
" )\n",
" return response.choices[0].message.content.strip()\n",
"\n",
"\n",
"@function_tool\n",
"def preview_summarizer_llm(movie: str) -> str:\n",
" prompt = f\"Write a 1-2 sentence summary describing the movie '{movie}'.\"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": prompt}],\n",
" temperature=0.7,\n",
" max_tokens=100,\n",
" )\n",
" return response.choices[0].message.content.strip()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ee91369",
"metadata": {
"id": "8ee91369"
},
"outputs": [],
"source": [
"agent = Agent(\n",
" name=\"MovieRecommendationAgentLLM\",\n",
" tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],\n",
" instructions=(\n",
" \"You are a helpful movie recommendation assistant with access to three tools:\\n\"\n",
" \"1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\\n\"\n",
" \"2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\\n\"\n",
" \"3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\\n\\n\"\n",
" \"Your goal is to provide a helpful, user-friendly response combining relevant information.\"\n",
" ),\n",
")\n",
"\n",
"\n",
"async def main():\n",
" user_input = \"Which comedy movie should I watch?\"\n",
" result = await Runner.run(agent, user_input)\n",
" print(result.final_output)\n",
"\n",
"\n",
"await main()"
]
},
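{
"cell_type": "markdown",
"id": "top-level-await-note",
"metadata": {
"id": "top-level-await-note"
},
"source": [
"Note: the top-level `await main()` above works because Jupyter already runs an event loop. If you adapt this notebook into a plain Python script, wrap the entry point instead, e.g.:\n",
"\n",
"```python\n",
"import asyncio\n",
"\n",
"asyncio.run(main())\n",
"```"
]
},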
{
"cell_type": "markdown",
"id": "6bef5b97",
"metadata": {
"id": "6bef5b97"
},
"source": [
"Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cf62bca",
"metadata": {
"id": "5cf62bca"
},
"outputs": [],
"source": [
"questions = [\n",
" \"Which Batman movie should I watch?\",\n",
" \"I want to watch a good romcom\",\n",
" \"What is a very scary horror movie?\",\n",
" \"Name a feel-good holiday movie\",\n",
" \"Recommend a musical with great songs\",\n",
" \"Give me a classic drama from the 90s\",\n",
"]\n",
"\n",
"for question in questions:\n",
" result = await Runner.run(agent, question)"
]
},
{
"cell_type": "markdown",
"id": "f819c8ee",
"metadata": {
"id": "f819c8ee"
},
"source": [
"# Get Span Data from Phoenix"
]
},
{
"cell_type": "markdown",
"id": "0de0363e",
"metadata": {
"id": "0de0363e"
},
"source": [
"Before running our evaluations, we first retrieve the span data from Arize. We then group the spans by trace and separate the input and output values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "981913a80fa0a780",
"metadata": {
"id": "981913a80fa0a780"
},
"outputs": [],
"source": [
"from phoenix.client import AsyncClient\n",
"\n",
"px_client = AsyncClient()\n",
"primary_df = await px_client.spans.get_spans_dataframe(project_identifier=\"movie-rec-agent\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91b78eee",
"metadata": {
"id": "91b78eee"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"trace_df = primary_df.groupby(\"context.trace_id\").agg(\n",
" {\n",
" \"attributes.input.value\": \"first\",\n",
" \"attributes.output.value\": lambda x: \" \".join(x.dropna()),\n",
" }\n",
")\n",
"trace_df = trace_df.rename(\n",
" columns={\n",
" \"attributes.input.value\": \"input\",\n",
" \"attributes.output.value\": \"output\",\n",
" }\n",
")\n",
"trace_df.head()"
]
},
{
"cell_type": "markdown",
"id": "d014913f",
"metadata": {
"id": "d014913f"
},
"source": [
"# Define and Run Evaluators"
]
},
{
"cell_type": "markdown",
"id": "a426e4ca",
"metadata": {
"id": "a426e4ca"
},
"source": [
"In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0f1768d",
"metadata": {
"id": "d0f1768d"
},
"outputs": [],
"source": [
"TOOL_CALLING_ORDER = \"\"\"\n",
"You are evaluating the correctness of the tool calling order in an LLM application's trace.\n",
"\n",
"You will be given:\n",
"1. The user input that initiated the trace\n",
"2. The full trace output, including the sequence of tool calls made by the agent\n",
"\n",
"##\n",
"User Input:\n",
"{input}\n",
"\n",
"Trace Output:\n",
"{output}\n",
"##\n",
"\n",
"Respond with exactly one word: `correct` or `incorrect`.\n",
"1. `correct` →\n",
"- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.\n",
"- A proper answer involves calls to reviews, summaries, and recommendations where relevant.\n",
"2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "131a5e7f",
"metadata": {
"id": "131a5e7f"
},
"outputs": [],
"source": [
"RECOMMENDATION_RELEVANCE = \"\"\"\n",
"You are evaluating the relevance of movie recommendations provided by an LLM application.\n",
"\n",
"You will be given:\n",
"1. The user input that initiated the trace\n",
"2. The list of movie recommendations output by the system\n",
"\n",
"##\n",
"User Input:\n",
"{input}\n",
"\n",
"Recommendations:\n",
"{output}\n",
"##\n",
"\n",
"Respond with exactly one word: `correct` or `incorrect`.\n",
"1. `correct` →\n",
"- All recommended movies match the requested genre or criteria in the user input.\n",
"- The recommendations should be relevant to the user's request and shouldn't be repetitive.\n",
"- `incorrect` → one or more recommendations do not match the requested genre or criteria.\n",
"\"\"\""
]
},
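{
"cell_type": "markdown",
"id": "extra-template-example",
"metadata": {
"id": "extra-template-example"
},
"source": [
"As an illustration of extending the suite, here is one hypothetical additional template (not used in the run below) that checks whether each recommendation comes with both a summary and a review. You would wire it in with another `create_classifier` call and append it to the evaluator list:\n",
"\n",
"```python\n",
"RESPONSE_COMPLETENESS = \"\"\"\n",
"You are evaluating the completeness of a movie recommendation response.\n",
"\n",
"##\n",
"User Input:\n",
"{input}\n",
"\n",
"Response:\n",
"{output}\n",
"##\n",
"\n",
"Respond with exactly one word: `correct` or `incorrect`.\n",
"1. `correct` → Every recommended movie includes both a short summary and a review or rating.\n",
"2. `incorrect` → At least one recommended movie is missing a summary or a review.\n",
"\"\"\"\n",
"```"
]
},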
{
"cell_type": "code",
"execution_count": null,
"id": "ce8d7f1b",
"metadata": {
"id": "ce8d7f1b"
},
"outputs": [],
"source": [
"from phoenix.evals import create_classifier\n",
"from phoenix.evals.evaluators import async_evaluate_dataframe\n",
"from phoenix.evals.llm import LLM\n",
"\n",
"llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n",
"\n",
"\n",
"tone_evaluator = create_classifier(\n",
" name=\"tool calling\",\n",
" llm=llm,\n",
" prompt_template=TOOL_CALLING_ORDER,\n",
" choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
")\n",
"\n",
"relevance_evaluator = create_classifier(\n",
" name=\"relevance\",\n",
" llm=llm,\n",
" prompt_template=RECOMMENDATION_RELEVANCE,\n",
" choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
")\n",
"\n",
"\n",
"results_df = await async_evaluate_dataframe(\n",
" dataframe=trace_df,\n",
" evaluators=[tone_evaluator, relevance_evaluator],\n",
")\n",
"results_df.head()"
]
},
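{
"cell_type": "markdown",
"id": "score-sanity-check",
"metadata": {
"id": "score-sanity-check"
},
"source": [
"Before logging, it can help to sanity-check the aggregate results. Exact output column names can vary by `arize-phoenix-evals` version, so this sketch discovers score columns rather than assuming their names:\n",
"\n",
"```python\n",
"# Inspect what the evaluators added, then average any numeric score columns\n",
"print(results_df.columns.tolist())\n",
"score_cols = [c for c in results_df.columns if \"score\" in c.lower()]\n",
"print(results_df[score_cols].mean(numeric_only=True))\n",
"```"
]
},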
{
"cell_type": "markdown",
"id": "4613e975",
"metadata": {
"id": "4613e975"
},
"source": [
"# Log Results Back to Phoenix"
]
},
{
"cell_type": "markdown",
"id": "afd4401d",
"metadata": {
"id": "afd4401d"
},
"source": [
"The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "IF25CDHBcxPL",
"metadata": {
"id": "IF25CDHBcxPL"
},
"outputs": [],
"source": [
"from phoenix.evals.utils import to_annotation_dataframe\n",
"\n",
"root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n",
"\n",
"# Merge results with root spans to align on trace_id\n",
"results_with_spans = pd.merge(\n",
" results_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
").set_index(\"context.span_id\", drop=False)\n",
"\n",
"# Format for Phoenix logging\n",
"annotation_df = to_annotation_dataframe(results_with_spans)\n",
"\n",
"await px_client.spans.log_span_annotations_dataframe(\n",
" dataframe=annotation_df,\n",
" annotator_kind=\"LLM\",\n",
")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}