{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SUknhuHKyc-E"
},
"source": [
"<center>\n",
"<p style=\"text-align:center\">\n",
"<img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
"<br>\n",
"<br>\n",
"<a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
"|\n",
"<a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
"|\n",
"<a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg?__hstc=259489365.a667dfafcfa0169c8aee4178d115dc81.1733501603539.1733501603539.1733501603539.1&__hssc=259489365.1.1733501603539&__hsfp=3822854628&submissionGuid=381a0676-8f38-437b-96f2-fc10875658df#/shared-invite/email\">Community</a>\n",
"</p>\n",
"</center>\n",
"<h1 align=\"center\">Tracing and Evaluating OpenAI Agents</h1>\n",
"\n",
"\n",
"This guide shows you how to create and evaluate agents with Phoenix to improve performance. We'll go through the following steps:\n",
"\n",
"* Create an agent using the OpenAI agents SDK\n",
"\n",
"* Trace the agent activity\n",
"\n",
"* Create a dataset to benchmark performance\n",
"\n",
"* Run an experiment to evaluate agent performance using LLM as a judge\n",
"\n",
"* Learn how to evaluate traces in production"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "baTNFxbwX1e2"
},
"source": [
"# Initial setup\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n69HR7eJswNt"
},
"source": [
"### Install Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KaDomv7x7yux"
},
"outputs": [],
"source": [
"!pip install -q \"arize-phoenix>=8.0.0\" openinference-instrumentation-openai-agents openinference-instrumentation-openai --upgrade\n",
"!pip install -q openai nest_asyncio openai-agents"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQnyEnJisyn3"
},
"source": [
"### Setup Dependencies and Keys\n",
"\n",
"Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lcyYCP8U7yux"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
"if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
" os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
"\n",
"if not os.environ.get(\"OPENAI_API_KEY\"):\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")"
]
},
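{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather use a self-hosted Phoenix instance, point the collector endpoint at your own deployment instead. The cell below is an optional, minimal sketch that assumes a local Phoenix server on its default port (6006); if you go this route, also drop the explicit `endpoint` argument from the `register()` call in the Setup Tracing cell below so it picks up the environment variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: point at a self-hosted Phoenix instance instead of Phoenix Cloud.\n",
"# This sketch assumes Phoenix is running locally on its default port; adjust the URL for your deployment.\n",
"USE_SELF_HOSTED_PHOENIX = False\n",
"\n",
"if USE_SELF_HOSTED_PHOENIX:\n",
"    os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"http://localhost:6006\"\n",
"    # Self-hosted instances without auth don't need the API key header\n",
"    os.environ.pop(\"PHOENIX_CLIENT_HEADERS\", None)"
]
},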
{
"cell_type": "markdown",
"metadata": {
"id": "kfid5cE99yN5"
},
"source": [
"### Setup Tracing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5mIEk4sX7yuy"
},
"outputs": [],
"source": [
"from phoenix.otel import register\n",
"\n",
"# Setup Tracing\n",
"tracer_provider = register(\n",
" project_name=\"openai-agents-cookbook\",\n",
" endpoint=\"https://app.phoenix.arize.com/v1/traces\",\n",
" auto_instrument=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bLVAqLi5_KAi"
},
"source": [
"# Create your first agent with the OpenAI SDK"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K-VJR06F7yuy"
},
"source": [
"Here we've setup a basic agent that can solve math problems.\n",
"\n",
"We have a function tool that can solve math equations, and an agent that can use this tool.\n",
"\n",
"We'll use the `Runner` class to run the agent and get the final output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XtqFH4Y67yuy"
},
"outputs": [],
"source": [
"from agents import Runner, function_tool\n",
"\n",
"\n",
"@function_tool\n",
"def solve_equation(equation: str) -> str:\n",
" \"\"\"Use python to evaluate the math equation, instead of thinking about it yourself.\n",
"\n",
" Args:\n",
" equation: string which to pass into eval() in python\n",
" \"\"\"\n",
" return str(eval(equation))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1DcpXaNn7yuy"
},
"outputs": [],
"source": [
"from agents import Agent\n",
"\n",
"agent = Agent(\n",
" name=\"Math Solver\",\n",
" instructions=\"You solve math problems by evaluating them with python and returning the result\",\n",
" tools=[solve_equation],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e4Te7N9u7yuz"
},
"outputs": [],
"source": [
"result = await Runner.run(agent, \"what is 15 + 28?\")\n",
"\n",
"# Run Result object\n",
"print(result)\n",
"\n",
"# Get the final output\n",
"print(result.final_output)\n",
"\n",
"# Get the entire list of messages recorded to generate the final output\n",
"print(result.to_input_list())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sq4rcseCGKRc"
},
"source": [
"Now we have a basic agent, let's evaluate whether the agent responded correctly!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "orIdvsw87yuz"
},
"source": [
"# Evaluating our agent\n",
"\n",
"Agents can go awry for a variety of reasons.\n",
"1. Tool call accuracy - did our agent choose the right tool with the right arguments?\n",
"2. Tool call results - did the tool respond with the right results?\n",
"3. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?\n",
"\n",
"We'll setup a simple evaluator that will check if the agent's response is correct, you can read about different types of agent evals [here](https://docs.arize.com/arize/llm-evaluation-and-annotations/how-does-evaluation-work/agent-evaluation)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wogYKH8t7yuz"
},
"source": [
"Let's setup our evaluation by defining our task function, our evaluator, and our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Gey3x1PK7yuz"
},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"\n",
"# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.\n",
"async def solve_math_problem(dataset_row: dict):\n",
" result = await Runner.run(agent, dataset_row.get(\"question\"))\n",
" return {\n",
" \"final_output\": result.final_output,\n",
" \"messages\": result.to_input_list(),\n",
" }\n",
"\n",
"\n",
"dataset_row = {\"question\": \"What is 15 + 28?\"}\n",
"\n",
"result = asyncio.run(solve_math_problem(dataset_row))\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DbSBhwo77yuz"
},
"source": [
"Next, we create our evaluator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vGBH4Mem7yuz"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from phoenix.evals import OpenAIModel, llm_classify\n",
"\n",
"\n",
"def correctness_eval(input, output):\n",
" # Template for evaluating math problem solutions\n",
" MATH_EVAL_TEMPLATE = \"\"\"\n",
" You are evaluating whether a math problem was solved correctly.\n",
"\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {question}\n",
" ************\n",
" [Response]: {response}\n",
" [END DATA]\n",
"\n",
" Assess if the answer to the math problem is correct. First work out the correct answer yourself,\n",
" then compare with the provided response. Consider that there may be different ways to express the same answer\n",
" (e.g., \"43\" vs \"The answer is 43\" or \"5.0\" vs \"5\").\n",
"\n",
" Your answer must be a single word, either \"correct\" or \"incorrect\"\n",
" \"\"\"\n",
"\n",
" # Run the evaluation\n",
" rails = [\"correct\", \"incorrect\"]\n",
" eval_df = llm_classify(\n",
" data=pd.DataFrame([{\"question\": input[\"question\"], \"response\": output[\"final_output\"]}]),\n",
" template=MATH_EVAL_TEMPLATE,\n",
" model=OpenAIModel(model=\"gpt-4.1\"),\n",
" rails=rails,\n",
" provide_explanation=True,\n",
" )\n",
" label = eval_df[\"label\"][0]\n",
" score = 1 if label == \"correct\" else 0\n",
" return score"
]
},
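{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of point 1 above (tool call accuracy), the heuristic below checks whether `solve_equation` appears anywhere in the recorded messages. It is intentionally rough (a string match rather than an argument-level check), but because it has the same `(input, output)` signature as `correctness_eval`, you could also pass it in the `evaluators` list when running the experiment later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of a tool-call check: did the agent actually call solve_equation?\n",
"# This is a heuristic string match over the recorded messages, not a full argument-level check.\n",
"def tool_call_eval(input, output):\n",
"    used_tool = any(\"solve_equation\" in str(message) for message in output[\"messages\"])\n",
"    return 1 if used_tool else 0\n",
"\n",
"\n",
"print(tool_call_eval(dataset_row, result))"
]
},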
{
"cell_type": "markdown",
"metadata": {
"id": "k0Qvn8tAs9vL"
},
"source": [
"# Create synthetic dataset of questions\n",
"\n",
"Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem solving agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VJo84Qpy7yuz"
},
"outputs": [],
"source": [
"MATH_GEN_TEMPLATE = \"\"\"\n",
"You are an assistant that generates diverse math problems for testing a math solver agent.\n",
"The problems should include:\n",
"\n",
"Basic Operations: Simple addition, subtraction, multiplication, division problems.\n",
"Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.\n",
"Exponents and Roots: Problems involving powers, square roots, and other nth roots.\n",
"Percentages: Problems involving calculating percentages of numbers or finding percentage changes.\n",
"Fractions: Problems with addition, subtraction, multiplication, or division of fractions.\n",
"Algebra: Simple algebraic expressions that can be evaluated with specific values.\n",
"Sequences: Finding sums, products, or averages of number sequences.\n",
"Word Problems: Converting word problems into mathematical equations.\n",
"\n",
"Do not include any solutions in your generated problems.\n",
"\n",
"Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.\n",
"Generate 25 diverse math problems. Ensure there are no duplicate problems.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tty8VKJq7yuz"
},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"pd.set_option(\"display.max_colwidth\", 500)\n",
"\n",
"# Initialize the model\n",
"model = OpenAIModel(model=\"gpt-4o\", max_tokens=1300)\n",
"\n",
"# Generate math problems\n",
"resp = model(MATH_GEN_TEMPLATE)\n",
"\n",
"# Create DataFrame\n",
"split_response = resp.strip().split(\"\\n\")\n",
"math_problems_df = pd.DataFrame(split_response, columns=[\"question\"])\n",
"print(math_problems_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oGIbV49kHp4H"
},
"source": [
"Now let's use this dataset and run it with the agent!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hWDTJnsh736a"
},
"source": [
"# Experiment in Development"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zsUpEuyYbR7r"
},
"source": [
"During development, experimentation helps iterate quickly by revealing agent failures during evaluation. You can test against datasets to refine prompts, logic, and tool usage before deploying.\n",
"\n",
"In this section, we run our agent against the dataset defined above and evaluate for correctness using LLM as Judge."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cHYgS5cpRE3b"
},
"source": [
"## Create an experiment\n",
"\n",
"With our dataset of questions we generated above, we can use our experiment feature to track changes across models, prompts, parameters for our agent."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SgTEu7U4Rd5i"
},
"source": [
"Let's create this dataset and upload it into the platform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dYRk0FGI7yuz"
},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"import phoenix as px\n",
"\n",
"unique_id = uuid.uuid4()\n",
"\n",
"dataset_name = \"math-questions-\" + str(uuid.uuid4())[:5]\n",
"\n",
"# Upload the dataset to Phoenix\n",
"dataset = px.Client().upload_dataset(\n",
" dataframe=math_problems_df,\n",
" input_keys=[\"question\"],\n",
" dataset_name=f\"math-questions-{unique_id}\",\n",
")\n",
"print(dataset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UjFfDaYR7yuz"
},
"outputs": [],
"source": [
"from phoenix.experiments import run_experiment\n",
"\n",
"initial_experiment = run_experiment(\n",
" dataset,\n",
" task=solve_math_problem,\n",
" evaluators=[correctness_eval],\n",
" experiment_description=\"Solve Math Problems\",\n",
" experiment_name=f\"solve-math-questions-{str(uuid.uuid4())[:5]}\",\n",
")"
]
},
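{
"cell_type": "markdown",
"metadata": {},
"source": [
"Experiments are tied to the dataset, so you can change the agent and rerun the same evaluation to compare runs side by side in Phoenix. The cell below is an optional sketch: the revised instructions for `agent_v2` are just an example of a change you might test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of comparing a revised agent against the same dataset.\n",
"# The revised instructions below are only an example; swap in whatever change you want to test.\n",
"agent_v2 = Agent(\n",
"    name=\"Math Solver v2\",\n",
"    instructions=(\n",
"        \"You solve math problems by evaluating them with python and returning the result. \"\n",
"        \"Always call the solve_equation tool; never do the arithmetic yourself.\"\n",
"    ),\n",
"    tools=[solve_equation],\n",
")\n",
"\n",
"\n",
"async def solve_math_problem_v2(dataset_row: dict):\n",
"    result = await Runner.run(agent_v2, dataset_row.get(\"question\"))\n",
"    return {\n",
"        \"final_output\": result.final_output,\n",
"        \"messages\": result.to_input_list(),\n",
"    }\n",
"\n",
"\n",
"followup_experiment = run_experiment(\n",
"    dataset,\n",
"    task=solve_math_problem_v2,\n",
"    evaluators=[correctness_eval],\n",
"    experiment_description=\"Solve Math Problems (revised instructions)\",\n",
"    experiment_name=f\"solve-math-questions-v2-{str(uuid.uuid4())[:5]}\",\n",
")"
]
},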
{
"cell_type": "markdown",
"metadata": {
"id": "QrMPyC0jY9I7"
},
"source": [
"## View Traces in Phoenix"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MyPCrSReEoNb"
},
"source": [
"# Evaluating in Production"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Fmyqd99hSpcR"
},
"source": [
"In production, evaluation provides real-time insights into how agents perform on user data.\n",
"\n",
"This section simulates a live production setting, showing how you can collect traces, model outputs, and evaluation results in real time.\n",
"\n",
"Another option is to pull traces from completed production runs and batch process evaluations on them. You can then log the results of those evaluations in Phoenix."
]
},
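{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before wiring up the live setup, here is a minimal sketch of that batch approach. It assumes the project name registered earlier and Phoenix's exported span schema (columns like `span_kind`, `attributes.input.value`, and `attributes.output.value`, with the span ID as the dataframe index); adjust the span filter, column handling, and template to match your own traces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of batch evaluation over completed traces (assumptions noted above).\n",
"from phoenix.trace import SpanEvaluations\n",
"\n",
"BATCH_EVAL_TEMPLATE = \"\"\"\n",
"You are evaluating whether a math problem was solved correctly.\n",
"[Question]: {question}\n",
"[Response]: {response}\n",
"First work out the correct answer yourself, then compare it with the response.\n",
"Your answer must be a single word, either \"correct\" or \"incorrect\".\n",
"\"\"\"\n",
"\n",
"# Pull spans recorded so far for this project; the dataframe is indexed by span ID\n",
"spans_df = px.Client().get_spans_dataframe(project_name=\"openai-agents-cookbook\")\n",
"\n",
"# Keep top-level agent spans and rename the input/output columns to match the template\n",
"agent_spans = spans_df[spans_df[\"span_kind\"] == \"AGENT\"]\n",
"batch_df = agent_spans[[\"attributes.input.value\", \"attributes.output.value\"]].rename(\n",
"    columns={\"attributes.input.value\": \"question\", \"attributes.output.value\": \"response\"}\n",
")\n",
"\n",
"if not batch_df.empty:\n",
"    batch_evals = llm_classify(\n",
"        data=batch_df,\n",
"        template=BATCH_EVAL_TEMPLATE,\n",
"        model=OpenAIModel(model=\"gpt-4.1\"),\n",
"        rails=[\"correct\", \"incorrect\"],\n",
"        provide_explanation=True,\n",
"    )\n",
"    batch_evals[\"score\"] = (batch_evals[\"label\"] == \"correct\").astype(int)\n",
"\n",
"    # Log the batch results back to Phoenix so they appear alongside the traces\n",
"    px.Client().log_evaluations(SpanEvaluations(eval_name=\"correctness_batch\", dataframe=batch_evals))"
]
},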
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PUtZQK7G_ciH"
},
"outputs": [],
"source": [
"!pip install openinference-instrumentation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1B-e1KPv8yMV"
},
"outputs": [],
"source": [
"from opentelemetry.trace import StatusCode, format_span_id"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "buzwGdQgQVDO"
},
"source": [
"After importing the necessary libraries, we set up a tracer object to enable span creation for tracing our task function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OIrTZk9G_r1P"
},
"outputs": [],
"source": [
"tracer = tracer_provider.get_tracer(__name__)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "odzG40dvP3pT"
},
"source": [
"Next, we update our correctness evaluator to return both a label and an explanation, enabling metadata to be captured during tracing.\n",
"\n",
"We also revise the task function to include `with` clauses that generate structured spans in Phoenix. These spans capture key details such as input values, output values, and the results of the evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UW09ZMtpJS_K"
},
"outputs": [],
"source": [
"# This is our modified correctness evaluator.\n",
"def correctness_eval(input, output):\n",
" # Template for evaluating math problem solutions\n",
" MATH_EVAL_TEMPLATE = \"\"\"\n",
" You are evaluating whether a math problem was solved correctly.\n",
"\n",
" [BEGIN DATA]\n",
" ************\n",
" [Question]: {question}\n",
" ************\n",
" [Response]: {response}\n",
" [END DATA]\n",
"\n",
" Assess if the answer to the math problem is correct. First work out the correct answer yourself,\n",
" then compare with the provided response. Consider that there may be different ways to express the same answer\n",
" (e.g., \"43\" vs \"The answer is 43\" or \"5.0\" vs \"5\").\n",
"\n",
" Your answer must be a single word, either \"correct\" or \"incorrect\"\n",
" \"\"\"\n",
"\n",
" # Run the evaluation\n",
" rails = [\"correct\", \"incorrect\"]\n",
" eval_df = llm_classify(\n",
" data=pd.DataFrame([{\"question\": input[\"question\"], \"response\": output[\"final_output\"]}]),\n",
" template=MATH_EVAL_TEMPLATE,\n",
" model=OpenAIModel(model=\"gpt-4.1\"),\n",
" rails=rails,\n",
" provide_explanation=True,\n",
" )\n",
"\n",
" return eval_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This is our modified task function.\n",
"async def solve_math_problem(dataset_row: dict):\n",
" with tracer.start_as_current_span(name=\"agent\", openinference_span_kind=\"agent\") as agent_span:\n",
" question = dataset_row.get(\"question\")\n",
" agent_span.set_input(question)\n",
" agent_span.set_status(StatusCode.OK)\n",
"\n",
" result = await Runner.run(agent, question)\n",
" agent_span.set_output(result.final_output)\n",
"\n",
" task_result = {\n",
" \"final_output\": result.final_output,\n",
" \"messages\": result.to_input_list(),\n",
" }\n",
"\n",
" # Evaluation span for correctness\n",
" with tracer.start_as_current_span(\n",
" \"correctness-evaluator\",\n",
" openinference_span_kind=\"evaluator\",\n",
" ) as eval_span:\n",
" evaluation_result = correctness_eval(dataset_row, task_result)\n",
" eval_span.set_attribute(\"eval.label\", evaluation_result[\"label\"][0])\n",
" eval_span.set_attribute(\"eval.explanation\", evaluation_result[\"explanation\"][0])\n",
"\n",
" # Logging our evaluation\n",
" span_id = format_span_id(eval_span.get_span_context().span_id)\n",
" score = 1 if evaluation_result[\"label\"][0] == \"correct\" else 0\n",
" eval_data = {\n",
" \"span_id\": span_id,\n",
" \"label\": evaluation_result[\"label\"][0],\n",
" \"score\": score,\n",
" \"explanation\": evaluation_result[\"explanation\"][0],\n",
" }\n",
" df = pd.DataFrame([eval_data])\n",
" from phoenix.client import AsyncClient\n",
"\n",
" px_client = AsyncClient()\n",
" await px_client.spans.log_span_annotations_dataframe(\n",
" dataframe=df,\n",
" annotation_name=\"correctness\",\n",
" annotator_kind=\"LLM\",\n",
" )\n",
"\n",
" return task_result\n",
"\n",
"\n",
"dataset_row = {\"question\": \"What is 15 + 28?\"}\n",
"\n",
"result = asyncio.run(solve_math_problem(dataset_row))\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jqUGE_6oSU-5"
},
"source": [
"Finally, we run an experiment to simulate traces in production."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c6Xk99_fAlPY"
},
"outputs": [],
"source": [
"from phoenix.experiments import run_experiment\n",
"\n",
"initial_experiment = run_experiment(\n",
" dataset,\n",
" task=solve_math_problem,\n",
" experiment_description=\"Solve Math Problems\",\n",
" experiment_name=f\"solve-math-questions-{str(uuid.uuid4())[:5]}\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1EzwX1GtPr03"
},
"source": [
"## View Traces and Evaluator Results in Phoenix as Traces Populate"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}