Skip to main content
Glama

@arizeai/phoenix-mcp

Official
by Arize-ai
generating_synthetic_datasets.ipynb19.7 kB
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FyLuI5YGow3V" }, "source": [ "<center>\n", " <p style=\"text-align:center\">\n", " <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n", " <br>\n", " <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n", " |\n", " <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n", " |\n", " <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n", " </p>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "id": "avfBA7w3o0mM" }, "source": [ "# **Generating Synthetic Datasets using LLMs**" ] }, { "cell_type": "markdown", "metadata": { "id": "l1OF_apdr1rk" }, "source": [ "Synthetic datasets are a powerful way to test and refine your LLM applications, especially when real-world data is limited, sensitive, or hard to collect. By guiding the model to generate structured examples, you can quickly create datasets that cover common scenarios, complex multi-step cases, and edge cases like typos or out-of-scope queries.\n", "\n", "In this notebook, we’ll walk through different strategies for dataset generation and show how they can be used to run experiments and test evaluators." ] }, { "cell_type": "markdown", "metadata": { "id": "GUeScEYno51J" }, "source": [ "# **Set up Dependencies and Keys**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qq openai arize-phoenix-client arize-phoenix-otel arize-phoenix-evals openinference-instrumentation-openai" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "\n", "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n", " phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n", "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n", "\n", "\n", "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", " phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n", "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n", "\n", "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n", " openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.otel import register\n", "\n", "tracer_provider = register(project_name=\"generating-datasets\", auto_instrument=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from openai import AsyncOpenAI\n", "\n", "openai_client = AsyncOpenAI()" ] }, { "cell_type": "markdown", "metadata": { "id": "OY5kw0M8qC3r" }, "source": [ "# **Creating Synthetic Benchmark Datasets to Test Evaluators**" ] }, { "cell_type": "markdown", "metadata": { "id": "UKKlZvNILZaY" }, "source": [ "**Goal:**\n", "Create a synthetic dataset that allows you to test the accuracy and coverage of your evaluator.\n", "\n", "**Use Case:**\n", "Feed the generated dataset into an LLM-as-a-Judge or other evaluator to ensure it correctly labels intent, identifies errors, and handles a variety of query types including edge cases and noisy inputs.\n", "\n", "----\n", "\n", "Synthetic data is especially useful when you want to stress-test evaluators such as an LLM-as-a-Judge across a wide range of scenarios. By generating examples systematically, you can cover straightforward cases, tricky edge cases, ambiguous queries, and noisy inputs, ensuring your evaluator captures different angles of behavior.\n", "\n", "A strong synthetic dataset, in this case, serves as the benchmark dataset, providing a reliable benchmark for evaluating and comparing application changes.\n", "\n", "In the example below, we generate customer support queries in JSON, each with a user query, intent label, and sample response. This dataset can then be used to check how well your evaluator identifies intent and judges correctness across varied cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generate_queries_template = \"\"\"\n", "Generate 30 synthetic customer support classification examples.\n", "Ensure good coverage across intents (refund, order_status, product_info),\n", "and include both correct and incorrect classifications.\n", "Each entry should follow this JSON schema:\n", "\n", "{\n", " \"input\": \"string (the user query)\",\n", " \"output\": \"refund | order_status | product_info (the predicted intent)\",\n", " \"classification\": \"correct | incorrect\"\n", "}\n", "Respond ONLY with valid JSON array, no code fences, no extra text.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = await openai_client.chat.completions.create(\n", " model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": generate_queries_template}]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "support_data = json.loads(resp.choices[0].message.content)\n", "df_support_data = pd.DataFrame(support_data)\n", "df_support_data.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "meh-wcA_MoWz" }, "source": [ "## Upload Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client import AsyncClient\n", "\n", "client = AsyncClient()\n", "\n", "dataset = await client.datasets.create_dataset(\n", " dataframe=df_support_data,\n", " name=\"customer_support_queries\",\n", " input_keys=[\"input\"],\n", " output_keys=[\"output\", \"classification\"],\n", ")\n", "df = dataset.to_dataframe()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "wPqEQdW8_x39" }, "source": [ "## Example Usage: Test LLM Judge Effectiveness" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm_judge_template = \"\"\"\n", "You are an evaluator judging whether a model's classification of a customer support query is correct.\n", "The possible classifications are: refund, order_status, product_info\n", "\n", "Query: {input}\n", "Model Prediction: {output}\n", "\n", "Decide if the model's prediction is correct or incorrect.\n", "Respond ONLY with one of: \"correct\" or \"incorrect\".\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.evals import OpenAIModel, llm_classify\n", "\n", "\n", "async def task_function(input, reference):\n", " response_classification = llm_classify(\n", " data=pd.DataFrame([{\"input\": input[\"input\"], \"output\": reference[\"output\"]}]),\n", " template=llm_judge_template,\n", " model=OpenAIModel(model=\"gpt-4.1\"),\n", " rails=[\"correct\", \"incorrect\"],\n", " provide_explanation=True,\n", " )\n", " label = response_classification.iloc[0][\"label\"]\n", " return label\n", "\n", "\n", "def evaluate_response(output, reference):\n", " expected_label = reference[\"classification\"]\n", " predicted_label = output\n", " return 1 if expected_label == predicted_label else 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from phoenix.client.experiments import async_run_experiment\n", "\n", "initial_experiment = await async_run_experiment(\n", " dataset=dataset,\n", " task=task_function,\n", " evaluators=[evaluate_response],\n", " experiment_name=\"evaluator performance\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "_0Z-Zutmucs1" }, "source": [ "# **Using Few Shot Examples for Synthetic Dataset Generation**" ] }, { "cell_type": "markdown", "metadata": { "id": "sbNjiP8tucCZ" }, "source": [ "**Goal:**\n", "Guide the LLM to generate synthetic examples that reflect different types of queries and scenarios while maintaining consistent labeling and structure. This approach allows for more customization and higher-quality examples in your dataset. Your few-shot examples can be real data as well.\n", "\n", "-----\n", "\n", "Few-shot prompting allows you to guide an LLM by showing a handful of examples, which helps produce more consistent and realistic outputs. In this approach, we provide a few labeled customer queries with their intents and sample responses, and ask the model to generate additional examples in the same format.\n", "\n", "This is particularly useful for testing evaluators such as an LLM-as-a-Judge, because it ensures the synthetic dataset reflects patterns, labels, and structures the evaluator is expected to handle. By controlling the examples in the prompt, you can produce a dataset that covers a variety of scenarios, including tricky or ambiguous queries, to check whether your evaluator captures different angles of behavior." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "few_shot_prompt = \"\"\"\n", "Generate synthetic customer support classification examples.\n", "Ensure good coverage across intents (refund, order_status, product_info),\n", "and include both correct and incorrect classifications.\n", "Here are some examples of synthetic customer queries and labels:\n", "\n", "Example 1:\n", "{\n", " \"user_query\": \"Ughhh I bought sneakers that squeak louder than a rubber duck... how do I return these?\",\n", " \"intent\": \"refund\",\n", " \"response\": \"Oh no, squeaky shoes aren’t fun! Let’s get that return started. Could you share your order number?\",\n", " \"classification\": \"correct\"\n", "}\n", "\n", "Example 2:\n", "{\n", " \"user_query\": \"My package has been saying 'out for delivery' since last Tuesday… did it decide to take a vacation? Is it actually going to show up?\",\n", " \"intent\": \"refund\",\n", " \"response\": \"Looks like your package is taking its sweet time. Let me check where it’s stuck — can you give me the tracking number?\",\n", " \"classification: \"incorrect\"\n", "}\n", "\n", "\n", "Example 3:\n", "{\n", " \"user_query\": \"Thinking about upgrading my blender… does your new model actually crush ice?\",\n", " \"intent\": \"product_info\",\n", " \"response\": \"Haha our blender keeps its promises! It can definitely crush ice. Would you like more details on the specs?\",\n", " \"classification\": \"correct\"\n", "}\n", "\n", "Now generate 25 new examples in the same format, keeping the reesponses friendly.\n", "Respond ONLY with valid JSON array, no code fences, no extra text.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = await openai_client.chat.completions.create(\n", " model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": few_shot_prompt}]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "few_shot_data = json.loads(resp.choices[0].message.content)\n", "few_shot_df = pd.DataFrame(few_shot_data)\n", "few_shot_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "QtHnw0AILQfV" }, "source": [ "## Upload Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_support_queries_few_shot_dataset = await client.datasets.create_dataset(\n", " dataframe=few_shot_df,\n", " name=\"customer_support_queries_few_shot\",\n", " input_keys=[\"user_query\"],\n", " output_keys=[\"intent\", \"response\", \"classification\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "KI4okudfszQJ" }, "source": [ "## Example Usage: Test LLM Judge Effectiveness" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm_judge_template = \"\"\"\n", "You are an evaluator judging whether a model's classification of a customer support query is correct.\n", "The possible classifications are: refund, order_status, product_info\n", "\n", "Query: {query}\n", "Model Prediction: {intent}\n", "\n", "Decide if the model's prediction is correct or incorrect.\n", "Respond ONLY with one of: \"correct\" or \"incorrect\".\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "async def task_function(input, reference):\n", " response_classification = llm_classify(\n", " data=pd.DataFrame([{\"query\": input[\"user_query\"], \"intent\": reference[\"intent\"]}]),\n", " template=llm_judge_template,\n", " model=OpenAIModel(model=\"gpt-4.1\"),\n", " rails=[\"correct\", \"incorrect\"],\n", " provide_explanation=True,\n", " )\n", " label = response_classification.iloc[0][\"label\"]\n", " return label\n", "\n", "\n", "def evaluate_response(output, reference):\n", " expected_label = reference[\"classification\"]\n", " predicted_label = output\n", " return 1 if expected_label == predicted_label else 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_experiment = await async_run_experiment(\n", " dataset=customer_support_queries_few_shot_dataset,\n", " task=task_function,\n", " evaluators=[evaluate_response],\n", " experiment_name=\"evaluator performance\",\n", " client=client,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "YEekr2TI4Cyl" }, "source": [ "# **Creating Synthetic Datasets for Agents**" ] }, { "cell_type": "markdown", "metadata": { "id": "09dwlnP_4xj1" }, "source": [ "**Goal:**\n", "Build synthetic test data that captures a wide range of queries to evaluate an agent’s reliability and safety.\n", "\n", "**Use Case:**\n", "Test how an agent handles in-scope requests, refuses out-of-scope queries, and manages edge cases, adversarial inputs, and noisy data.\n", "\n", "------\n", "\n", "When creating synthetic datasets, first define the agent’s capabilities and boundaries (tools, in-scope vs. out-of-scope). Then organize queries into categories to ensure balanced coverage:\n", "\n", "1. Happy-path: simple, common requests\n", "2. Complex: multi-step or reasoning-heavy\n", "3. Adversarial / refusal: out-of-scope or unsafe\n", "4. Edge cases: ambiguous or incomplete inputs\n", "5. Noise: typos, slang, multilingual\n", "\n", "This structure makes it easier to stress-test the agent across realistic scenarios and confirm it behaves consistently.\n", "\n", "**Why This Approach?**\n", "\n", "This structure ensures comprehensive evaluation (core tasks, edge conditions, and safety) and systematic coverage (no major scenario overlooked). By simulating a wide range of real-world interactions, you can validate that the agent is reliable, robust, and safe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AGENT_DATASET_PROMPT = \"\"\"\n", "You are helping me create a synthetic test dataset for evaluating an AI agent.\n", "The agent has the following capabilities:\n", "- search products, compare items, track orders, answer shipping questions\n", "\n", "The dataset should cover a wide variety of use cases, not just the “happy path.”\n", "Generate realistic **user queries**, grouped into categories:\n", "\n", "1. **Happy-path**: straightforward, common use cases where the agent should succeed.\n", "2. **Complex / multi-step**: queries requiring reasoning, multiple steps, or tool calls.\n", "3. **Edge cases**: ambiguous requests, incomplete info, or queries with constraints.\n", "4. **Adversarial / refusal**: queries that are out-of-scope or unsafe (where the agent should refuse or fallback).\n", "5. **Noise / robustness**: queries with typos, slang, or in multiple languages.\n", "\n", "For each example, return JSON with this schema:\n", "{\n", " \"category\": \"happy_path | multi_step | edge_case | adversarial | noise\",\n", " \"query\": \"string (the user’s input)\",\n", " \"expected_action\": \"string (the tool, behavior, or refusal the agent should take)\",\n", " \"expected_outcome\": \"string (what a correct response would look like at a high level)\"\n", "}\n", "\n", "Generate **10 examples total**, ensuring at least a few from each category.\n", "The queries should be diverse, realistic, and not repetitive.\n", "\n", "Respond ONLY with valid JSON, no code fences, no extra text.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = await openai_client.chat.completions.create(\n", " model=\"gpt-4o-mini\", messages=[{\"role\": \"user\", \"content\": AGENT_DATASET_PROMPT}]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "agent_data = json.loads(resp.choices[0].message.content)\n", "agent_data_df = pd.DataFrame(agent_data)\n", "agent_data_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "fEysEH7DM_Wa" }, "source": [ "## Upload Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_support_agent_datasets = await client.datasets.create_dataset(\n", " dataframe=agent_data_df,\n", " name=\"customer_support_agent\",\n", " input_keys=[\"category\", \"query\"],\n", " output_keys=[\"expected_action\", \"expected_outcome\"],\n", ")" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Arize-ai/phoenix'

If you have feedback or need assistance with the MCP directory API, please join our Discord server