{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"<p style=\"text-align:center\">\n",
" <img alt=\"Ragas\" src=\"https://github.com/explodinggradients/ragas/blob/main/docs/_static/imgs/logo.png?raw=true\" width=\"250\">\n",
"<br>\n",
"<p>\n",
" <a href=\"https://github.com/explodinggradients/ragas\">Ragas</a>\n",
" |\n",
" <a href=\"https://docs.arize.com/arize/\">Arize</a>\n",
" |\n",
" <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg\">Slack Community</a>\n",
"</p>\n",
"\n",
"<center><h1>Tracing and Evaluating AI agents</h1></center>\n",
"\n",
"This guide will walk you through the process of creating and evaluating agents using Ragas and Arize Phoenix. We'll cover the following steps:\n",
"\n",
"* Build an agent that solves math problems with the OpenAI Agents SDK\n",
"\n",
"* Trace agent activity to monitor interactions\n",
"\n",
"* Generate a benchmark dataset for performance analysis\n",
"\n",
"* Evaluate agent performance using Ragas"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "baTNFxbwX1e2"
},
"source": [
"# Initial setup\n",
"\n",
"\n",
"We'll setup our libraries, keys, and OpenAI tracing using Phoenix."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q \"arize-phoenix>=8.0.0\" arize-phoenix-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai arize-phoenix-evals \"arize[Datasets]\" --upgrade\n",
"\n",
"!pip install langchain-openai\n",
"\n",
"!pip install -q openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio ragas openai-agents --upgrade"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Keys"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cmuChHhPd-Z7"
},
"source": [
"Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"if not os.environ.get(\"OPENAI_API_KEY\"):\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"# Setup OpenTelemetry\n",
"tracer_provider = register(\n",
" project_name=\"agents-cookbook\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bLVAqLi5_KAi"
},
"source": [
"# Create your first agent with the OpenAI SDK"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2RpnS8n8AirE"
},
"source": [
"Here we've setup a basic agent that can solve math problems.\n",
"\n",
"We have a function tool that can solve math equations, and an agent that can use this tool.\n",
"\n",
"We'll use the `Runner` class to run the agent and get the final output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from agents import Runner, function_tool\n",
"\n",
"\n",
"@function_tool\n",
"def solve_equation(equation: str) -> str:\n",
" \"\"\"Use python to evaluate the math equation, instead of thinking about it yourself.\n",
"\n",
" Args:\"\n",
" equation: string which to pass into eval() in python\n",
" \"\"\"\n",
" return str(eval(equation))"
]
},
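{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `eval()` executes arbitrary Python, calling it on model-generated strings is risky outside of a demo. A more defensive variant could parse the expression and allow only arithmetic. The sketch below is illustrative and not part of the original agent; `safe_eval` is a hypothetical helper."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import operator\n",
"\n",
"# Hypothetical helper: evaluate arithmetic-only expressions by walking the\n",
"# AST and permitting just the operators whitelisted below.\n",
"_OPS = {\n",
"    ast.Add: operator.add,\n",
"    ast.Sub: operator.sub,\n",
"    ast.Mult: operator.mul,\n",
"    ast.Div: operator.truediv,\n",
"    ast.Pow: operator.pow,\n",
"    ast.USub: operator.neg,\n",
"}\n",
"\n",
"\n",
"def safe_eval(expression: str) -> float:\n",
"    def _eval(node):\n",
"        if isinstance(node, ast.Expression):\n",
"            return _eval(node.body)\n",
"        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):\n",
"            return node.value\n",
"        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:\n",
"            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))\n",
"        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:\n",
"            return _OPS[type(node.op)](_eval(node.operand))\n",
"        raise ValueError(f\"Unsupported expression: {expression!r}\")\n",
"\n",
"    return _eval(ast.parse(expression, mode=\"eval\"))\n",
"\n",
"\n",
"print(safe_eval(\"(15 + 28) * 2\"))  # 86"
]
},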
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from agents import Agent\n",
"\n",
"agent = Agent(\n",
" name=\"Math Solver\",\n",
" instructions=\"You solve math problems by evaluating them with python and returning the result\",\n",
" tools=[solve_equation],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = await Runner.run(agent, \"what is 15 + 28?\")\n",
"\n",
"# Run Result object\n",
"print(result)\n",
"\n",
"# Get the final output\n",
"print(result.final_output)\n",
"\n",
"# Get the entire list of messages recorded to generate the final output\n",
"print(result.to_input_list())"
]
},
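{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, `result.to_input_list()` returns a list of dicts roughly shaped like the sketch below (illustrative values, not captured output). The conversion helper in the evaluation section keys off the `role` and `type` fields of these items."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative shape only -- the exact content varies from run to run.\n",
"example_items = [\n",
"    {\"role\": \"user\", \"content\": \"what is 15 + 28?\"},\n",
"    {\"type\": \"function_call\", \"name\": \"solve_equation\", \"arguments\": '{\"equation\": \"15 + 28\"}'},\n",
"    {\"type\": \"function_call_output\", \"output\": \"43\"},\n",
"    {\"role\": \"assistant\", \"content\": [{\"text\": \"15 + 28 = 43\", \"type\": \"output_text\"}]},\n",
"]"
]
},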
{
"cell_type": "markdown",
"metadata": {
"id": "sq4rcseCGKRc"
},
"source": [
"Now we have a basic agent, let's evaluate whether the agent responded correctly."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y23F9ZP5AirG"
},
"source": [
"# Evaluating our agent\n",
"\n",
"Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:\n",
"1. Tool call accuracy - did our agent choose the right tool with the right arguments?\n",
"2. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dr4C55YBAirG"
},
"source": [
"Let's setup our evaluation by defining our task function, our evaluator, and our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"from agents import Runner\n",
"\n",
"\n",
"# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.\n",
"async def solve_math_problem(input):\n",
" if isinstance(input, dict):\n",
" input = next(iter(input.values()))\n",
" result = await Runner.run(agent, input)\n",
" return {\"final_output\": result.final_output, \"messages\": result.to_input_list()}\n",
"\n",
"\n",
"result = asyncio.run(solve_math_problem(\"What is 15 + 28?\"))\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9SHK2ZhxAirH"
},
"source": [
"This is helper code which converts the agent messages into a format that Ragas can use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def conversation_to_ragas_sample(messages, reference_equation=None, reference_answer=None):\n",
" \"\"\"\n",
" Convert a single conversation into a Ragas MultiTurnSample.\n",
"\n",
" Args:\n",
" conversation: Dictionary containing conversation data with 'conversation' key\n",
" reference_equation: Optional string with the reference equation for evaluation\n",
"\n",
" Returns:\n",
" MultiTurnSample: Formatted sample for Ragas evaluation\n",
" \"\"\"\n",
"\n",
" import json\n",
"\n",
" from ragas.dataset_schema import MultiTurnSample\n",
" from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage\n",
"\n",
" ragas_messages = []\n",
" pending_tool_call = None\n",
" reference_tool_calls = None\n",
"\n",
" for item in messages:\n",
" role = item.get(\"role\")\n",
" item_type = item.get(\"type\")\n",
"\n",
" if role == \"user\":\n",
" ragas_messages.append(HumanMessage(content=item[\"content\"]))\n",
" pending_tool_call = None\n",
"\n",
" elif item_type == \"function_call\":\n",
" args = json.loads(item[\"arguments\"])\n",
" pending_tool_call = ToolCall(name=item[\"name\"], args=args)\n",
"\n",
" elif item_type == \"function_call_output\":\n",
" if pending_tool_call is not None:\n",
" ragas_messages.append(AIMessage(content=\"\", tool_calls=[pending_tool_call]))\n",
" ragas_messages.append(\n",
" ToolMessage(\n",
" content=item[\"output\"],\n",
" name=pending_tool_call.name,\n",
" tool_call_id=\"tool_call_1\",\n",
" )\n",
" )\n",
" pending_tool_call = None\n",
" else:\n",
" print(\"[WARN] ToolMessage without preceding ToolCall — skipping.\")\n",
"\n",
" elif role == \"assistant\":\n",
" content = (\n",
" item[\"content\"][0][\"text\"]\n",
" if isinstance(item.get(\"content\"), list)\n",
" else item.get(\"content\", \"\")\n",
" )\n",
" ragas_messages.append(AIMessage(content=content))\n",
" print(\"Ragas_messages\", ragas_messages)\n",
"\n",
" if reference_equation:\n",
" # Look for the first function call to extract the actual tool call\n",
" for item in messages:\n",
" if item.get(\"type\") == \"function_call\":\n",
" args = json.loads(item[\"arguments\"])\n",
" reference_tool_calls = [ToolCall(name=item[\"name\"], args=args)]\n",
" break\n",
"\n",
" return MultiTurnSample(user_input=ragas_messages, reference_tool_calls=reference_tool_calls)\n",
"\n",
" elif reference_answer:\n",
" return MultiTurnSample(user_input=ragas_messages, reference=reference_answer)\n",
"\n",
" else:\n",
" return MultiTurnSample(user_input=ragas_messages)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here is an example of the function in action\n",
"sample = conversation_to_ragas_sample(\n",
" # This is a list of messages recorded for \"Calculate 15 + 28.\"\n",
" result[\"messages\"],\n",
" reference_equation=\"15 + 28\",\n",
" reference_answer=\"43\",\n",
")\n",
"print(sample)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o85Vvi_tAirI"
},
"source": [
"Now let's setup our evaluator. We'll import both metrics we're measuring from Ragas, and use the `multi_turn_ascore(sample)` to get the results.\n",
"\n",
"The `AgentGoalAccuracyWithReference` metric compares the final output to the reference to see if the goal was accomplished.\n",
"\n",
"The `ToolCallAccuracy` metric compares the tool call to the reference tool call to see if the tool call was made correctly."
]
},
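{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check before wiring the metrics into evaluator functions, you can score the `sample` we built above directly. `ToolCallAccuracy` needs no LLM, so this minimal sketch runs locally:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ragas.metrics import ToolCallAccuracy\n",
"\n",
"# The sample carries reference_tool_calls, which is what ToolCallAccuracy\n",
"# compares the agent's recorded tool calls against.\n",
"score = await ToolCallAccuracy().multi_turn_ascore(sample)\n",
"print(score)"
]
},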
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from ragas.llms import LangchainLLMWrapper\n",
"from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy\n",
"\n",
"\n",
"# Setup evaluator LLM and metrics\n",
"async def tool_call_evaluator(input, output):\n",
" sample = conversation_to_ragas_sample(output[\"messages\"], reference_equation=input[\"question\"])\n",
" tool_call_accuracy = ToolCallAccuracy()\n",
" return await tool_call_accuracy.multi_turn_ascore(sample)\n",
"\n",
"\n",
"async def goal_evaluator(input, output):\n",
" sample = conversation_to_ragas_sample(\n",
" output[\"messages\"], reference_answer=output[\"final_output\"]\n",
" )\n",
" evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\n",
" goal_accuracy = AgentGoalAccuracyWithReference(llm=evaluator_llm)\n",
" return await goal_accuracy.multi_turn_ascore(sample)\n",
"\n",
"\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "k0Qvn8tAs9vL"
},
"source": [
"# Create synthetic dataset of questions\n",
"\n",
"Using the template below, we're going to generate a dataframe of 10 questions we can use to test our math problem solving agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MATH_GEN_TEMPLATE = \"\"\"\n",
"You are an assistant that generates diverse math problems for testing a math solver agent.\n",
"The problems should include:\n",
"\n",
"Basic Operations: Simple addition, subtraction, multiplication, division problems.\n",
"Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.\n",
"Exponents and Roots: Problems involving powers, square roots, and other nth roots.\n",
"Percentages: Problems involving calculating percentages of numbers or finding percentage changes.\n",
"Fractions: Problems with addition, subtraction, multiplication, or division of fractions.\n",
"Algebra: Simple algebraic expressions that can be evaluated with specific values.\n",
"Sequences: Finding sums, products, or averages of number sequences.\n",
"Word Problems: Converting word problems into mathematical equations.\n",
"\n",
"Do not include any solutions in your generated problems.\n",
"\n",
"Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.\n",
"Generate 10 diverse math problems. Ensure there are no duplicate problems.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"import pandas as pd\n",
"\n",
"from phoenix.evals import OpenAIModel\n",
"\n",
"nest_asyncio.apply()\n",
"pd.set_option(\"display.max_colwidth\", 500)\n",
"\n",
"# Initialize the model\n",
"model = OpenAIModel(model=\"gpt-4o\", max_tokens=1300)\n",
"\n",
"# Generate math problems\n",
"resp = model(MATH_GEN_TEMPLATE)\n",
"\n",
"# Create DataFrame\n",
"split_response = resp.strip().split(\"\\n\")\n",
"math_problems_df = pd.DataFrame(split_response, columns=[\"question\"])\n",
"math_problems_df = math_problems_df[math_problems_df[\"question\"].str.strip() != \"\"]\n",
"\n",
"print(math_problems_df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oGIbV49kHp4H"
},
"source": [
"Now let's use this dataset and run it with the agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"from agents import Runner\n",
"\n",
"# Run agent on all problems and collect conversation history\n",
"conversations = []\n",
"for idx, row in math_problems_df.iterrows():\n",
" result = asyncio.run(solve_math_problem(row[\"question\"]))\n",
" conversations.append(\n",
" {\n",
" \"question\": row[\"question\"],\n",
" \"final_output\": result[\"final_output\"],\n",
" \"messages\": result[\"messages\"],\n",
" }\n",
" )\n",
"\n",
"print(f\"Processed {len(conversations)} problems\")\n",
"print(conversations[0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cHYgS5cpRE3b"
},
"source": [
"# Create an experiment\n",
"\n",
"With our dataset of questions we generated above, we can use our experiments feature to track changes across models, prompts, parameters for our agent."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SgTEu7U4Rd5i"
},
"source": [
"Let's create this dataset and upload it into the platform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import phoenix as px\n",
"\n",
"dataset_df = pd.DataFrame(\n",
" {\n",
" \"question\": [conv[\"question\"] for conv in conversations],\n",
" \"final_output\": [conv[\"final_output\"] for conv in conversations],\n",
" }\n",
")\n",
"\n",
"dataset = px.Client().upload_dataset(\n",
" dataframe=dataset_df,\n",
" dataset_name=\"math-questions\",\n",
" input_keys=[\"question\"],\n",
" output_keys=[\"final_output\"],\n",
")\n",
"\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ykdwc-XSgMJx"
},
"source": [
"Finally, we run our experiment and view the results in Phoenix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from phoenix.experiments import run_experiment\n",
"\n",
"experiment = run_experiment(\n",
" dataset, task=solve_math_problem, evaluators=[goal_evaluator, tool_call_evaluator]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T088EMx4gXKS"
},
"source": [
""
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}