{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Gxv2-tMAGIG3" }, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "id": "luyELn4GGIG4" }, "source": [ "# Session Level Evals for an AI Tutor" ] }, { "cell_type": "markdown", "metadata": { "id": "PH-e9oudGIG4" }, "source": [ "This tutorial demonstrates how to run session-level evaluations on conversations with an AI tutor. You'll log the results back to Phoenix for further monitoring and analysis. Session-level evaluations are valuable because they provide a holistic view of the entire interaction, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.\n", "\n", "In this tutorial, you will:\n", "- Trace and aggregate multi-turn interactions into structured sessions\n", "- Evaluate sessions across multiple dimensions such as Correctness, Goal Completion, and Frustration\n", "- Format the evaluation outputs to match the Phoenix schema and log them to the platform\n", "\n", "By the end, you’ll have a robust evaluation pipeline for analyzing and comparing session-level performance.\n", "\n", "✅ You’ll need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an Anthropic API key to run this notebook." ] }, { "cell_type": "markdown", "metadata": { "id": "kBepM2asGIG5" }, "source": [ "# Set up Dependencies & Keys" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aVXFvSJlGIG5" }, "outputs": [], "source": [ "%pip install openinference-instrumentation-anthropic openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio anthropic" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EAclg1SyGIG6" }, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n", " phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n", "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n", "\n", "\n", "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", " phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n", "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n", "\n", "if not (anthropic_api_key := os.getenv(\"ANTHROPIC_API_KEY\")):\n", " anthropic_api_key = getpass(\"🔑 Enter your Anthropic API key: \")\n", "os.environ[\"ANTHROPIC_API_KEY\"] = anthropic_api_key" ] }, { "cell_type": "markdown", "metadata": { "id": "2hZF50xAGIG6" }, "source": [ "# Configure Tracing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2hoGRK7-GIG6" }, "outputs": [], "source": [ "from phoenix.otel import register\n", "\n", "# configure the Phoenix tracer\n", "tracer_provider = register(project_name=\"ai-tutor-session\", auto_instrument=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "TWBoBKyVGIG6" }, "source": [ "# Build and Run AI Tutor" ] }, { "cell_type": "markdown", "metadata": { "id": "xiFsdf88GIG7" }, 
"source": [ "In this example, we demonstrate how to evaluate AI tutor sessions. The tutor begins by receiving a user ID, topic, and question. It then explains the topic to the student and engages them with follow-up questions in a multi-turn conversation, continuing until the student ends the session. Our goal is to assess the overall quality of this interaction from start to finish." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UbZtq1KIGIG7" }, "outputs": [], "source": [ "import uuid\n", "\n", "import anthropic\n", "from openinference.instrumentation import using_attributes\n", "\n", "client = anthropic.Anthropic(api_key=os.getenv(\"ANTHROPIC_API_KEY\"))\n", "\n", "\n", "def run_session(user_id: str, topic: str, question: str):\n", " session_id = f\"tutor-{uuid.uuid4()}\"\n", " chat = [\n", " {\n", " \"role\": \"system\",\n", " \"content\": (\n", " f\"You are a thoughtful AI tutor teaching {topic}. \"\n", " \"Ask questions, give hints, and only suggest full answers \"\n", " \"when student shows correct reasoning.\"\n", " ),\n", " },\n", " {\"role\": \"user\", \"content\": question},\n", " ]\n", "\n", " while True:\n", " with using_attributes(session_id=session_id, user_id=user_id):\n", " messages = []\n", " for msg in chat:\n", " if msg[\"role\"] == \"system\":\n", " if not messages:\n", " messages.append({\"role\": \"user\", \"content\": msg[\"content\"]})\n", " else:\n", " messages.append(msg)\n", "\n", " resp = client.messages.create(\n", " model=\"claude-3-5-sonnet-20241022\",\n", " messages=messages,\n", " max_tokens=1000,\n", " temperature=0.5,\n", " )\n", " assistant_msg = resp.content[0].text.strip()\n", " assistant_msg += \"\\n\\n(You can type 'DONE' if you're finished.)\"\n", "\n", " chat.append({\"role\": \"assistant\", \"content\": assistant_msg})\n", " print(f\"Tutor: {assistant_msg}\")\n", "\n", " student_input = input(\"> your answer: \")\n", " if student_input.strip().upper() == \"DONE\":\n", " print(\"✅ Student is DONE — ending session.\")\n", " break\n", "\n", " chat.append({\"role\": \"user\", \"content\": student_input})\n", " return session_id" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lYlq8KctGIG7" }, "outputs": [], "source": [ "# Ask any question to the AI tutor!\n", "run_session(user_id=\"Sanjana\", topic=\"Science\", question=\"Why is the sky blue?\")" ] }, { "cell_type": "markdown", "metadata": { "id": "SY-O-94iGIG7" }, "source": [ "# Prepare Spans for Session-Level Evaluation" ] }, { "cell_type": "markdown", "metadata": { "id": "23dIrbxNGIG7" }, "source": [ "These following cells prepare the data for session-level evaluation. We start by loading all spans into a DataFrame, then sort them chronologically and group them by session ID. You can also group the spans by user ID.\n", "\n", "Next, we separate user inputs from AI responses, and finally, store the structured results in a dataframe. We will use this dataframe to run our evaluations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hPPPIIhqGIG7" }, "outputs": [], "source": [ "from phoenix.client import Client\n", "\n", "client = Client()\n", "primary_df = client.spans.get_spans_dataframe(project_identifier=\"ai-tutor-session\")" ] }, { "cell_type": "markdown", "metadata": { "id": "F3WKlr0XGIG7" }, "source": [ "Here, we group our spans together to make a session dataframe. We also include logic to truncate part of the sesssion messages if token limits are exceeded. This prevents context window issues for longer sessions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hCbogVhMGIG7" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "\n", "def truncate_text(text, max_chars, strategy=\"end\"):\n", " \"\"\"Truncate text to max_chars using the specified strategy.\"\"\"\n", " if not text or len(text) <= max_chars:\n", " return text\n", "\n", " if strategy == \"start\":\n", " return \"...\" + text[-(max_chars - 3) :]\n", " elif strategy == \"middle\":\n", " half = (max_chars - 3) // 2\n", " return text[:half] + \"...\" + text[-half:]\n", " else: # \"end\"\n", " return text[: max_chars - 3] + \"...\"\n", "\n", "\n", "def estimate_session_size(messages):\n", " \"\"\"Estimate total character count of session content.\"\"\"\n", " return sum(len(msg) for msg in messages if isinstance(msg, str))\n", "\n", "\n", "def prepare_sessions(\n", " df: pd.DataFrame,\n", " max_chars_per_value=10000, # Limit for each individual message\n", " max_chars_per_session=700000, # Based on claude-3-7-sonnet-latest having 200k tokens (~4 chars/token)\n", " truncation_strategy=\"end\",\n", ") -> pd.DataFrame:\n", " \"\"\"\n", " Collapse spans into a single row per session with truncation support,\n", " preserving message order (user/output interleaved).\n", " \"\"\"\n", " sessions = []\n", "\n", " # Sort and group\n", " grouped = df.sort_values(\"start_time\").groupby(\"attributes.session.id\", as_index=False)\n", "\n", " for session_id, group in grouped:\n", " # Collect all messages in order\n", " messages = []\n", " for _, row in group.iterrows():\n", " if pd.notna(row.get(\"attributes.input.value\")):\n", " messages.append(\n", " truncate_text(\n", " row[\"attributes.input.value\"], max_chars_per_value, truncation_strategy\n", " )\n", " )\n", " if pd.notna(row.get(\"attributes.output.value\")):\n", " messages.append(\n", " truncate_text(\n", " row[\"attributes.output.value\"], max_chars_per_value, truncation_strategy\n", " )\n", " )\n", "\n", " # Estimate total session size\n", " total_chars = estimate_session_size(messages)\n", "\n", " # Truncate session-level size if needed\n", " if total_chars > max_chars_per_session:\n", " print(f\"Session {session_id} exceeds {max_chars_per_session} chars. Truncating...\")\n", "\n", " # Keep messages evenly from start and end (half-half)\n", " keep_half = len(messages) // 2\n", " messages = messages[: keep_half // 2] + messages[-(keep_half - keep_half // 2) :]\n", "\n", " # Optional: truncate remaining messages again more aggressively\n", " total_chars = estimate_session_size(messages)\n", " if total_chars > max_chars_per_session:\n", " aggressive_limit = max_chars_per_value // 2\n", " messages = [\n", " truncate_text(m, aggressive_limit, truncation_strategy) for m in messages\n", " ]\n", "\n", " sessions.append(\n", " {\n", " \"session_id\": session_id,\n", " \"messages\": messages,\n", " \"trace_count\": group[\"context.trace_id\"].nunique(),\n", " }\n", " )\n", "\n", " return pd.DataFrame(sessions)\n", "\n", "\n", "sessions_df = prepare_sessions(primary_df, truncation_strategy=\"middle\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Z8XoQuI3GIG8" }, "outputs": [], "source": [ "sessions_df" ] }, { "cell_type": "markdown", "metadata": { "id": "CqZ1dUUPGIG8" }, "source": [ "# Session Correctness Eval" ] }, { "cell_type": "markdown", "metadata": { "id": "xeeP6qaSGIG8" }, "source": [ "We are ready to begin running our evals. 
```python
sessions_df
```

# Session Correctness Eval

We are ready to begin running our evals. Let's start with an eval that ensures the AI tutor is giving the student factual information:

```python
SESSION_CORRECTNESS_PROMPT = """
You are an expert tutor assistant evaluating the **correctness and educational quality** of an AI tutor's session with a student.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

An effective and correct tutoring session should:
- Provide factually and conceptually accurate explanations
- Correctly answer student questions
- Clarify misunderstandings if they occur
- Build upon previous context in a coherent way
- Avoid hallucinations, vague responses, or incorrect reasoning

##
Messages:
{messages}
##

Based on the above, evaluate the session **only for correctness and educational soundness**.

Respond with a single word: `correct` or `incorrect`.

- Respond with `correct` if the AI tutor consistently provides accurate, clear, and educationally sound answers.
- Respond with `incorrect` if the AI tutor gives factually wrong, misleading, or incoherent explanations at any point.
"""
```

```python
import anthropic
import nest_asyncio

from phoenix.evals import AnthropicModel, llm_classify

nest_asyncio.apply()

# Configure your evaluation model using Claude 3.7 Sonnet
model = AnthropicModel(
    model="claude-3-7-sonnet-latest",
)

# Run the evaluation
rails = ["correct", "incorrect"]
eval_results_correctness = llm_classify(
    data=sessions_df,
    template=SESSION_CORRECTNESS_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_correctness
```
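`llm_classify` returns one row per session with `label` and `explanation` columns. A quick sketch to summarize the verdicts before moving on:

```python
# Distribution of correctness labels across all evaluated sessions.
print(eval_results_correctness["label"].value_counts())
```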
# Session Frustration Eval

This evaluation makes sure the student isn't getting frustrated with the tutor:

```python
SESSION_FRUSTRATION_PROMPT = """
You are an AI assistant evaluating whether a student became frustrated during a tutoring session with an AI tutor.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

Signs of student frustration may include:
- Repeating or rephrasing the same question multiple times
- Expressing confusion ("I don't get it", "This doesn't make sense", etc.)
- Disagreeing with the tutor's responses
- Asking for clarification frequently without resolution
- Expressing annoyance, impatience, or disengagement
- Abruptly ending the session

##
Messages:
{messages}
##

Based on the above, evaluate whether the student showed signs of frustration at any point in the session.

Respond with a single word: `frustrated` or `not_frustrated`.

- Respond with `frustrated` if there is evidence of confusion, dissatisfaction, or emotional frustration.
- Respond with `not_frustrated` if the student appears to stay engaged and satisfied throughout.
"""
```

```python
# Run the evaluation
rails = ["frustrated", "not_frustrated"]
eval_results_frustration = llm_classify(
    data=sessions_df,
    template=SESSION_FRUSTRATION_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_frustration
```
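To pull flagged sessions out for manual review, you can filter `sessions_df` with the eval labels. A minimal sketch, assuming the `llm_classify` output rows line up positionally with its input:

```python
# Sessions the evaluator marked as frustrated, for closer inspection.
frustrated_mask = (eval_results_frustration["label"] == "frustrated").values
sessions_df[frustrated_mask]
```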
# Session Goal Achievement Eval

Finally, we evaluate whether the tutor helped the student reach their learning goals:

```python
SESSION_GOAL_ACHIEVEMENT_PROMPT = """
You are an AI assistant evaluating whether the AI tutor successfully helped the student achieve their learning goals during a tutoring session.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

To determine if the student’s goals were achieved, consider:
- Whether the AI tutor addressed the student’s questions and requests directly
- Whether the explanations provided resolved the student’s doubts or problems
- Whether the student’s inputs indicate understanding or closure by the end
- Whether the conversation logically progressed toward completing the student’s objectives

##
Messages:
{messages}
##

Evaluate the session and respond with a single word: `achieved` or `not_achieved`.

- Respond with `achieved` if the tutoring session successfully met the student’s learning goals and resolved their questions.
- Respond with `not_achieved` if the session left the student’s questions unanswered or goals unmet.
"""
```

```python
# Run the evaluation
rails = ["achieved", "not_achieved"]
eval_results_goal_achievement = llm_classify(
    data=sessions_df,
    template=SESSION_GOAL_ACHIEVEMENT_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_goal_achievement
```
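With all three evals complete, it can help to line the verdicts up per session before logging. A minimal sketch (again assuming positional alignment with `sessions_df`; run this before the logging cell below, which overwrites the eval DataFrames):

```python
# One row per session with all three verdicts side by side.
summary_df = sessions_df[["session_id"]].copy()
summary_df["correctness"] = eval_results_correctness["label"].values
summary_df["frustration"] = eval_results_frustration["label"].values
summary_df["goal_achievement"] = eval_results_goal_achievement["label"].values
summary_df
```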
# Log Evaluations Back to Phoenix

Finally, we can log the evaluation results back to Phoenix. In the Sessions tab of your project, you will see the evaluation results populate for each session.

```python
from phoenix.client import AsyncClient

# --- Find the root span for each session ---
root_spans = primary_df.sort_values("start_time").drop_duplicates(
    subset=["attributes.session.id"], keep="first"
)[["attributes.session.id", "context.span_id"]]

# --- Merge Session Correctness Eval with Session Data ---
eval_results_correctness = eval_results_correctness[["label", "explanation"]]

eval_results_correctness = pd.merge(
    sessions_df, eval_results_correctness, left_index=True, right_index=True
)

correctness_final_df = pd.merge(
    eval_results_correctness,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
correctness_final_df = correctness_final_df.set_index("context.span_id", drop=False)

# --- Merge Frustration Eval with Session Data ---
eval_results_frustration = eval_results_frustration[["label", "explanation"]]

eval_results_frustration = pd.merge(
    sessions_df, eval_results_frustration, left_index=True, right_index=True
)

frustration_final_df = pd.merge(
    eval_results_frustration,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
frustration_final_df = frustration_final_df.set_index("context.span_id", drop=False)

# --- Merge Goal Achievement Eval with Session Data ---
eval_results_goal_achievement = eval_results_goal_achievement[["label", "explanation"]]

eval_results_goal_achievement = pd.merge(
    sessions_df, eval_results_goal_achievement, left_index=True, right_index=True
)

goal_final_df = pd.merge(
    eval_results_goal_achievement,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
goal_final_df = goal_final_df.set_index("context.span_id", drop=False)

# --- Log each annotation set (top-level await works in Jupyter) ---
px_client = AsyncClient()
await px_client.spans.log_span_annotations_dataframe(
    dataframe=correctness_final_df,
    annotation_name="Session Correctness",
    annotator_kind="LLM",
)
await px_client.spans.log_span_annotations_dataframe(
    dataframe=frustration_final_df,
    annotation_name="Session Frustration",
    annotator_kind="LLM",
)
await px_client.spans.log_span_annotations_dataframe(
    dataframe=goal_final_df,
    annotation_name="Session Goal Achievement",
    annotator_kind="LLM",
)
```

![Session Eval Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-session-level-evals.png)
