{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "SUknhuHKyc-E" }, "source": [ "<p style=\"text-align:center\">\n", " <img alt=\"Ragas\" src=\"https://github.com/explodinggradients/ragas/blob/main/docs/_static/imgs/logo.png?raw=true\" width=\"250\">\n", "<br>\n", "<p>\n", " <a href=\"https://github.com/explodinggradients/ragas\">Ragas</a>\n", " |\n", " <a href=\"https://docs.arize.com/arize/\">Arize</a>\n", " |\n", " <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg\">Slack Community</a>\n", "</p>\n", "\n", "<center><h1>Tracing and Evaluating AI agents</h1></center>\n", "\n", "This guide will walk you through the process of creating and evaluating agents using Ragas and Arize Phoenix. We'll cover the following steps:\n", "\n", "* Build an agent that solves math problems with the OpenAI Agents SDK\n", "\n", "* Trace agent activity to monitor interactions\n", "\n", "* Generate a benchmark dataset for performance analysis\n", "\n", "* Evaluate agent performance using Ragas" ] }, { "cell_type": "markdown", "metadata": { "id": "baTNFxbwX1e2" }, "source": [ "# Initial setup\n", "\n", "\n", "We'll setup our libraries, keys, and OpenAI tracing using Phoenix." ] }, { "cell_type": "markdown", "metadata": { "id": "n69HR7eJswNt" }, "source": [ "### Install Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q \"arize-phoenix>=8.0.0\" arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai arize-phoenix-evals \"arize[Datasets]\" --upgrade\n", "\n", "!pip install langchain-openai\n", "\n", "!pip install -q openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio ragas openai-agents --upgrade" ] }, { "cell_type": "markdown", "metadata": { "id": "jQnyEnJisyn3" }, "source": [ "### Setup Keys" ] }, { "cell_type": "markdown", "metadata": { "id": "cmuChHhPd-Z7" }, "source": [ "Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n", "if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n", " os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n", "\n", "if not os.environ.get(\"OPENAI_API_KEY\"):\n", " os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")" ] }, { "cell_type": "markdown", "metadata": { "id": "kfid5cE99yN5" }, "source": [ "### Setup Tracing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.otel import register\n", "\n", "# Setup OpenTelemetry\n", "tracer_provider = register(\n", " project_name=\"agents-cookbook\",\n", " endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n", " auto_instrument=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "bLVAqLi5_KAi" }, "source": [ "# Create your first agent with the OpenAI SDK" ] }, { "cell_type": "markdown", "metadata": { "id": "2RpnS8n8AirE" }, "source": [ "Here we've setup a basic agent that can solve math problems.\n", "\n", "We have a function tool that can solve math equations, and an agent that can use this tool.\n", "\n", "We'll use the `Runner` class to run the agent and get the final output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from agents import Runner, function_tool\n", "\n", "\n", "@function_tool\n", "def solve_equation(equation: str) -> str:\n", " \"\"\"Use python to evaluate the math equation, instead of thinking about it yourself.\n", "\n", " Args:\"\n", " equation: string which to pass into eval() in python\n", " \"\"\"\n", " return str(eval(equation))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from agents import Agent\n", "\n", "agent = Agent(\n", " name=\"Math Solver\",\n", " instructions=\"You solve math problems by evaluating them with python and returning the result\",\n", " tools=[solve_equation],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = await Runner.run(agent, \"what is 15 + 28?\")\n", "\n", "# Run Result object\n", "print(result)\n", "\n", "# Get the final output\n", "print(result.final_output)\n", "\n", "# Get the entire list of messages recorded to generate the final output\n", "print(result.to_input_list())" ] }, { "cell_type": "markdown", "metadata": { "id": "sq4rcseCGKRc" }, "source": [ "Now we have a basic agent, let's evaluate whether the agent responded correctly." ] }, { "cell_type": "markdown", "metadata": { "id": "y23F9ZP5AirG" }, "source": [ "# Evaluating our agent\n", "\n", "Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:\n", "1. Tool call accuracy - did our agent choose the right tool with the right arguments?\n", "2. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?" ] }, { "cell_type": "markdown", "metadata": { "id": "dr4C55YBAirG" }, "source": [ "Let's setup our evaluation by defining our task function, our evaluator, and our dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "from agents import Runner\n", "\n", "\n", "# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.\n", "async def solve_math_problem(input):\n", " if isinstance(input, dict):\n", " input = next(iter(input.values()))\n", " result = await Runner.run(agent, input)\n", " return {\"final_output\": result.final_output, \"messages\": result.to_input_list()}\n", "\n", "\n", "result = asyncio.run(solve_math_problem(\"What is 15 + 28?\"))\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": { "id": "9SHK2ZhxAirH" }, "source": [ "This is helper code which converts the agent messages into a format that Ragas can use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def conversation_to_ragas_sample(messages, reference_equation=None, reference_answer=None):\n", " \"\"\"\n", " Convert a single conversation into a Ragas MultiTurnSample.\n", "\n", " Args:\n", " conversation: Dictionary containing conversation data with 'conversation' key\n", " reference_equation: Optional string with the reference equation for evaluation\n", "\n", " Returns:\n", " MultiTurnSample: Formatted sample for Ragas evaluation\n", " \"\"\"\n", "\n", " import json\n", "\n", " from ragas.dataset_schema import MultiTurnSample\n", " from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage\n", "\n", " ragas_messages = []\n", " pending_tool_call = None\n", " reference_tool_calls = None\n", "\n", " for item in messages:\n", " role = item.get(\"role\")\n", " item_type = item.get(\"type\")\n", "\n", " if role == \"user\":\n", " ragas_messages.append(HumanMessage(content=item[\"content\"]))\n", " pending_tool_call = None\n", "\n", " elif item_type == \"function_call\":\n", " args = json.loads(item[\"arguments\"])\n", " pending_tool_call = ToolCall(name=item[\"name\"], args=args)\n", "\n", " elif item_type == \"function_call_output\":\n", " if pending_tool_call is not None:\n", " ragas_messages.append(AIMessage(content=\"\", tool_calls=[pending_tool_call]))\n", " ragas_messages.append(\n", " ToolMessage(\n", " content=item[\"output\"],\n", " name=pending_tool_call.name,\n", " tool_call_id=\"tool_call_1\",\n", " )\n", " )\n", " pending_tool_call = None\n", " else:\n", " print(\"[WARN] ToolMessage without preceding ToolCall — skipping.\")\n", "\n", " elif role == \"assistant\":\n", " content = (\n", " item[\"content\"][0][\"text\"]\n", " if isinstance(item.get(\"content\"), list)\n", " else item.get(\"content\", \"\")\n", " )\n", " ragas_messages.append(AIMessage(content=content))\n", " print(\"Ragas_messages\", ragas_messages)\n", "\n", " if reference_equation:\n", " # Look for the first function call to extract the actual tool call\n", " for item in messages:\n", " if item.get(\"type\") == \"function_call\":\n", " args = json.loads(item[\"arguments\"])\n", " reference_tool_calls = [ToolCall(name=item[\"name\"], args=args)]\n", " break\n", "\n", " return MultiTurnSample(user_input=ragas_messages, reference_tool_calls=reference_tool_calls)\n", "\n", " elif reference_answer:\n", " return MultiTurnSample(user_input=ragas_messages, reference=reference_answer)\n", "\n", " else:\n", " return MultiTurnSample(user_input=ragas_messages)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here is an example of the function in 
action\n", "sample = conversation_to_ragas_sample(\n", " # This is a list of messages recorded for \"Calculate 15 + 28.\"\n", " result[\"messages\"],\n", " reference_equation=\"15 + 28\",\n", " reference_answer=\"43\",\n", ")\n", "print(sample)" ] }, { "cell_type": "markdown", "metadata": { "id": "o85Vvi_tAirI" }, "source": [ "Now let's setup our evaluator. We'll import both metrics we're measuring from Ragas, and use the `multi_turn_ascore(sample)` to get the results.\n", "\n", "The `AgentGoalAccuracyWithReference` metric compares the final output to the reference to see if the goal was accomplished.\n", "\n", "The `ToolCallAccuracy` metric compares the tool call to the reference tool call to see if the tool call was made correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain_openai import ChatOpenAI\n", "from ragas.llms import LangchainLLMWrapper\n", "from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy\n", "\n", "\n", "# Setup evaluator LLM and metrics\n", "async def tool_call_evaluator(input, output):\n", " sample = conversation_to_ragas_sample(output[\"messages\"], reference_equation=input[\"question\"])\n", " tool_call_accuracy = ToolCallAccuracy()\n", " return await tool_call_accuracy.multi_turn_ascore(sample)\n", "\n", "\n", "async def goal_evaluator(input, output):\n", " sample = conversation_to_ragas_sample(\n", " output[\"messages\"], reference_answer=output[\"final_output\"]\n", " )\n", " evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\n", " goal_accuracy = AgentGoalAccuracyWithReference(llm=evaluator_llm)\n", " return await goal_accuracy.multi_turn_ascore(sample)\n", "\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": { "id": "k0Qvn8tAs9vL" }, "source": [ "# Create synthetic dataset of questions\n", "\n", "Using the template below, we're going to generate a dataframe of 10 questions we can use to test our math problem solving agent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MATH_GEN_TEMPLATE = \"\"\"\n", "You are an assistant that generates diverse math problems for testing a math solver agent.\n", "The problems should include:\n", "\n", "Basic Operations: Simple addition, subtraction, multiplication, division problems.\n", "Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.\n", "Exponents and Roots: Problems involving powers, square roots, and other nth roots.\n", "Percentages: Problems involving calculating percentages of numbers or finding percentage changes.\n", "Fractions: Problems with addition, subtraction, multiplication, or division of fractions.\n", "Algebra: Simple algebraic expressions that can be evaluated with specific values.\n", "Sequences: Finding sums, products, or averages of number sequences.\n", "Word Problems: Converting word problems into mathematical equations.\n", "\n", "Do not include any solutions in your generated problems.\n", "\n", "Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.\n", "Generate 10 diverse math problems. 
# Create an experiment

With the dataset of questions we generated above, we can use Phoenix's experiments feature to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

```python
import phoenix as px

dataset_df = pd.DataFrame(
    {
        "question": [conv["question"] for conv in conversations],
        "final_output": [conv["final_output"] for conv in conversations],
    }
)

dataset = px.Client().upload_dataset(
    dataframe=dataset_df,
    dataset_name="math-questions",
    input_keys=["question"],
    output_keys=["final_output"],
)

print(dataset)
```

Finally, we run our experiment and view the results in Phoenix.

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset, task=solve_math_problem, evaluators=[goal_evaluator, tool_call_evaluator]
)
```

![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/ragas_results.gif)
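The evaluators are also plain async functions, so as an optional cross-check of our own (not part of the original notebook) you can compute aggregate scores locally, without the Phoenix UI:

```python
# Compute mean scores locally, reusing the evaluators and conversations above.
async def score_all(conversations):
    scores = []
    for conv in conversations:
        row = {"question": conv["question"]}
        out = {"final_output": conv["final_output"], "messages": conv["messages"]}
        scores.append((await goal_evaluator(row, out), await tool_call_evaluator(row, out)))
    return scores

scores = asyncio.run(score_all(conversations))
print("mean goal accuracy:", sum(g for g, _ in scores) / len(scores))
print("mean tool call accuracy:", sum(t for _, t in scores) / len(scores))
```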
