@arizeai/phoenix-mcp

by Arize-ai
agents-cookbook.ipynb (29 kB)
<center>
  <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
  <br>
  <a href="https://arize.com/docs/phoenix/">Docs</a> | <a href="https://github.com/Arize-ai/phoenix">GitHub</a> | <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
</center>

# Using Phoenix with AI agents

This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:

* Create a customer support agent using a router template
* Trace the agent activity, including function calling
* Create a dataset to benchmark performance
* Evaluate agent performance using code, human annotation, and LLM as a judge
* Experiment with different prompts and models

# Initial setup

We'll set up our libraries, keys, and OpenAI tracing using Phoenix.

### Install Libraries

```python
!pip install -qq "arize-phoenix[evals,llama-index]" "llama-index-llms-openai" "openai>=1" gcsfs nest_asyncio openinference-instrumentation-openai
```

### Setup Keys

```python
import os
from getpass import getpass

if not (phoenix_endpoint := os.getenv("PHOENIX_COLLECTOR_ENDPOINT")):
    phoenix_endpoint = getpass("🔑 Enter your Phoenix Collector Endpoint: ")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = phoenix_endpoint

if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")
os.environ["PHOENIX_API_KEY"] = phoenix_api_key

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
```

### Setup Tracing

To follow along with this tutorial, you'll need to sign up for Phoenix Cloud, the free hosted version of our open-source solution for observing LLM applications. You can see the [guide here](https://arize.com/docs/phoenix/quickstart), or use the code below after getting your API key.

```python
import phoenix as px
from phoenix.otel import register

# Configure the Phoenix tracer and auto-instrument OpenAI calls
tracer_provider = register(project_name="agents-cookbook", auto_instrument=True)
```
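Since `register` returns a standard OpenTelemetry tracer provider, you can also open custom spans around agent steps beyond what auto-instrumentation emits. Here's a minimal sketch; the span name and attribute key are illustrative choices, not anything Phoenix requires.

```python
# Optional: a minimal sketch of wrapping an agent call in a custom span.
# The span name and attribute key below are illustrative, not required by Phoenix.
tracer = tracer_provider.get_tracer(__name__)

with tracer.start_as_current_span("customer-support-agent") as span:
    span.set_attribute("user.question", "Where is my package?")
    # ... call the agent here and record anything useful on the span ...
```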
# Create customer support agent

We'll be creating a customer support agent that uses function calling, following the architecture below:

<img src="https://storage.cloud.google.com/arize-assets/tutorials/images/agent_architecture.png" width="800"/>

### Setup functions and create customer support agent

We define the following 6 functions below:

1. product_comparison
2. product_search
3. customer_support
4. track_package
5. product_details
6. apply_discount_code

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A.",
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B.",
                    },
                },
                "required": ["product_a_id", "product_b_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search for products based on criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string.",
                    },
                    "category": {
                        "type": "string",
                        "description": "The category to filter the search.",
                    },
                    "min_price": {
                        "type": "number",
                        "description": "The minimum price of the products to search.",
                        "default": 0,
                    },
                    "max_price": {
                        "type": "number",
                        "description": "The maximum price of the products to search.",
                    },
                    "page": {
                        "type": "integer",
                        "description": "The page number for pagination.",
                        "default": 1,
                    },
                    "page_size": {
                        "type": "integer",
                        "description": "The number of results per page.",
                        "default": 20,
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support).",
                    }
                },
                "required": ["issue_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Returns details for a given product id.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The id of a product to look up.",
                    }
                },
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Applies the discount code to a given order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The id of the order to apply the discount code to.",
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply.",
                    },
                },
                "required": ["order_id", "discount_code"],
            },
        },
    },
]
```
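These schemas only describe the tools to the model; this cookbook never executes them. If you want to close the loop, one option is to route a returned tool call to a stub implementation. The handlers below are hypothetical placeholders for illustration.

```python
import json


# Hypothetical stub handler; the cookbook does not define real implementations.
def track_package_impl(tracking_number):
    return {"tracking_number": tracking_number, "status": "in transit"}


TOOL_HANDLERS = {"track_package": track_package_impl}


def dispatch(tool_call):
    """Execute one entry from response.choices[0].message.tool_calls."""
    handler = TOOL_HANDLERS.get(tool_call.function.name)
    if handler is None:
        return f"No handler registered for {tool_call.function.name}"
    kwargs = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    return handler(**kwargs)
```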
We define a function below called run_prompt, which uses OpenAI's chat completions API with our tools.

```python
import openai


def run_prompt(input):
    client = openai.Client()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        tools=tools,
        tool_choice="auto",
        messages=[
            {
                "role": "system",
                "content": " ",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )

    message = response.choices[0].message
    tool_calls = message.tool_calls or []

    # Return the text answer if the model responded directly; otherwise return the tool calls.
    if message.content:
        return message.content
    return tool_calls
```

Let's test it and see if it returns the right function! Based on whether we set tool_choice to "auto" or "required", the router will have different behavior.

```python
run_prompt("Hi, I'd like to apply a discount code to my order.")
```
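With tool_choice set to "auto", the model may answer in plain text instead of calling a function, while "required" forces it to pick one of the six tools. Here's a quick sketch of forcing the call and reading back the selected function and its extracted arguments:

```python
# A sketch of forcing the router to always choose a tool with tool_choice="required".
client = openai.Client()
forced = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    tools=tools,
    tool_choice="required",
    messages=[{"role": "user", "content": "Hi, I'd like to apply a discount code to my order."}],
)

tool_call = forced.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # e.g. "apply_discount_code"
print(tool_call.function.arguments)  # the extracted parameters as a JSON string
```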
Now that we have a basic agent, let's generate a dataset of questions and run the prompt against this dataset!

# Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our customer support agent.

```python
GEN_TEMPLATE = """
You are an assistant that generates complex customer service questions.
The questions should often involve:

Multiple Categories: Questions that could logically fall into more than one category (e.g., combining product details with a discount code).
Vague Details: Questions with limited or vague information that require clarification to categorize correctly.
Mixed Intentions: Queries where the customer's goal or need is unclear or seems to conflict within the question itself.
Indirect Language: Use of indirect or polite phrasing that obscures the direct need or request (e.g., using "I was wondering if..." or "Perhaps you could help me with...").

For specific categories:

Track Package: Include vague timing references (e.g., "recently" or "a while ago") instead of specific dates.
Product Comparison and Product Search: Include generic descriptors without specific product names or IDs (e.g., "high-end smartphones" or "energy-efficient appliances").
Apply Discount Code: Include questions about discounts that might apply to hypothetical or past situations, or without mentioning if they have made a purchase.
Product Details: Ask for comparisons or details that involve multiple products or categories ambiguously (e.g., "Tell me about your range of electronics that are good for home office setups").

Examples of more challenging questions:
"There's an issue with one of the items I think I bought last month—what should I do?"
"I need help with something I ordered, or maybe I'm just looking for something new. Can you help?"

Some questions should be straightforward uses of the provided functions.

Respond with a list, one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 25 questions. Be sure there are no duplicate questions.
"""
```

```python
import nest_asyncio
import pandas as pd

nest_asyncio.apply()

from phoenix.evals import OpenAIModel

pd.set_option("display.max_colwidth", 500)

model = OpenAIModel(model="gpt-4o", max_tokens=1300)
```

```python
resp = model(GEN_TEMPLATE)
```

```python
split_response = resp.strip().split("\n")

questions_df = pd.DataFrame(split_response, columns=["question"])
print(questions_df)
```

Now let's use this dataset and run it against the router prompt above!

```python
response_df = questions_df.copy(deep=True)
response_df["response"] = response_df["question"].apply(run_prompt)
response_df["response"] = response_df["response"].astype(str)
```

```python
response_df
```

# Evaluating your agent

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.
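Before bringing in an LLM judge, a purely code-based check can catch gross routing failures. Here's a rough sketch that relies on the stringified `response` column we created above: tool-call responses are the repr of ChatCompletionMessageToolCall objects, so a simple substring test separates them from direct text replies.

```python
# Heuristic, code-based sanity check (not a Phoenix evaluator): count how many
# responses were routed to a function versus answered directly in text.
is_tool_call = response_df["response"].str.contains("ChatCompletionMessageToolCall")
print(f"{is_tool_call.sum()} of {len(response_df)} responses routed to a function")
print(response_df.loc[~is_tool_call, "question"].head())  # questions answered directly
```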
Here, we define our evaluation templates to judge whether the router correctly decided to make a function call, whether it selected the right function, and whether it extracted the parameters correctly.

```python
ROUTER_EVAL_TEMPLATE = """You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the response. You must determine whether the response
decided to call the correct function.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the agent should have made a function call instead of responding directly and did not, or the function call chosen was the incorrect one.
"correct" means the selected function would correctly and fully answer the user's question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""

FUNCTION_SELECTION_EVAL_TEMPLATE = """You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the function call. You must determine whether the function call
will return the answer to the Question. Please focus on whether the very specific
question can be answered by the function call.
Your response must be a single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.
"incorrect" means that the function call will not provide an answer to the Question.
"correct" means the function call will definitely provide an answer to the Question.
"not-applicable" means that the response was not a function call.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""

PARAMETER_EXTRACTION_EVAL_TEMPLATE = """You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be a single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question.
You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"correct" means the function call parameters match the JSON below and provide only relevant information.
"not-applicable" means that the response was not a function call.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""
```
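llm_classify fills the {question} and {response} placeholders from the matching columns of the dataframe you pass in. If you'd like to spot-check what the judge actually sees, you can render one row yourself with plain string formatting (a sketch using the first row of response_df):

```python
# Preview the fully rendered judge prompt for the first row. llm_classify does the
# equivalent substitution for every row of the dataframe passed via `data=`.
row = response_df.iloc[0]
print(ROUTER_EVAL_TEMPLATE.format(question=row["question"], response=row["response"]))
```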
Let's run evaluations using Phoenix's llm_classify function on the responses dataframe we generated above!
```python
from phoenix.evals import OpenAIModel, llm_classify

rails = ["incorrect", "correct"]

router_eval_df = llm_classify(
    data=response_df,
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

function_selection_eval_df = llm_classify(
    data=response_df,
    template=FUNCTION_SELECTION_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

parameter_extraction_eval_df = llm_classify(
    data=response_df,
    template=PARAMETER_EXTRACTION_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)
```

Let's inspect the results of our evaluations!

```python
router_eval_df
```

```python
function_selection_eval_df
```

```python
parameter_extraction_eval_df
```
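Each eval dataframe includes a label column, so a short loop gives a quick scorecard across the three checks:

```python
# Summarize the share of "correct" labels for each evaluator.
for name, df in [
    ("router", router_eval_df),
    ("function selection", function_selection_eval_df),
    ("parameter extraction", parameter_extraction_eval_df),
]:
    accuracy = (df["label"] == "correct").mean()
    print(f"{name}: {accuracy:.0%} correct")
```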
# Create an experiment

With the dataset of questions we generated above, we can use Phoenix experiments to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

```python
import os
from uuid import uuid1

# Point the Phoenix client at Phoenix Cloud
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

client = px.Client()

dataset = client.upload_dataset(
    dataframe=questions_df,
    dataset_name="agents-cookbook-" + str(uuid1()),
    input_keys=["question"],
)
```

```python
import nest_asyncio

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.types import EvaluationResult


def generic_eval(input, output, prompt_template):
    df_in = pd.DataFrame({"question": [str(input["question"])], "response": [str(output)]})
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=df_in,
        template=prompt_template,
        model=OpenAIModel(model="gpt-4o"),
        rails=rails,
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    score = (
        1 if rails and label == rails[0] else 0
    )  # The first item in rails ("correct") is scored as 1
    explanation = eval_df["explanation"][0]
    return EvaluationResult(score=score, label=label, explanation=explanation)


def routing_eval(input, output):
    return generic_eval(input, output, ROUTER_EVAL_TEMPLATE)


def function_call_eval(input, output):
    return generic_eval(input, output, FUNCTION_SELECTION_EVAL_TEMPLATE)


def parameter_extraction_eval(input, output):
    return generic_eval(input, output, PARAMETER_EXTRACTION_EVAL_TEMPLATE)
```

```python
dataset
```

```python
import phoenix as px
from phoenix.experiments import run_experiment


def prompt_gen_task(input):
    return run_prompt(input["question"])


experiment = run_experiment(
    dataset=dataset,
    task=prompt_gen_task,
    evaluators=[routing_eval, function_call_eval, parameter_extraction_eval],
    experiment_name="agents-cookbook",
)
```
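To compare prompts or models, you can define a variant task and run a second experiment against the same dataset, then compare the runs side by side in Phoenix. Here's a sketch that simply swaps in a different model; the "gpt-4o" choice and experiment name are illustrative.

```python
# A sketch of a second experiment that swaps in a different model for the router.
def run_prompt_gpt4o(input):
    client = openai.Client()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative alternative model
        temperature=0,
        tools=tools,
        tool_choice="auto",
        messages=[{"role": "user", "content": input["question"]}],
    )
    message = response.choices[0].message
    return message.content or message.tool_calls or []


experiment_gpt4o = run_experiment(
    dataset=dataset,
    task=run_prompt_gpt4o,
    evaluators=[routing_eval, function_call_eval, parameter_extraction_eval],
    experiment_name="agents-cookbook-gpt-4o",
)
```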
