---
description: >-
  The following are the key steps of running an experiment, illustrated by a
  simple example.
---
# Run Experiments
## Setup
Make sure you have the Phoenix client and the instrumentors needed for the experiment setup. For this example we will use the OpenAI instrumentor to trace the LLM calls.
```bash
pip install arize-phoenix-client arize-phoenix-otel openinference-instrumentation-openai openai datasets duckdb pandas
```
## Run Experiments
The key steps of running an experiment are:
1. **Define/upload a `Dataset`** (e.g. a dataframe)
* Each record of the dataset is called an `Example`
2. **Define a task**
* A task is a function that takes each `Example` and returns an output
3. **Define Evaluators**
* An `Evaluator` is a function that evaluates the output for each `Example`
4. **Run the experiment**
We'll start by initializing the Phoenix client to connect to your deployed Phoenix instance.
```python
from phoenix.client import Client
# Initialize client - automatically reads from environment variables:
# PHOENIX_BASE_URL and PHOENIX_API_KEY (if using Phoenix Cloud)
client = Client()
# Or explicitly configure for your Phoenix instance:
# client = Client(base_url="https://your-phoenix-instance.com", api_key="your-api-key")
```
### Load a Dataset
A dataset can be as simple as a list of strings inside a dataframe. More sophisticated datasets can also be extracted from traces based on actual production data. Here we just have a small list of questions that we want to ask an LLM about NBA games:
**Create pandas dataframe**
```python
import pandas as pd
df = pd.DataFrame(
    {
        "question": [
            "Which team won the most games?",
            "Which team won the most games in 2015?",
            "Who led the league in 3 point shots?",
        ]
    }
)
```
The dataframe can be sent to `Phoenix` via the `Client`. `input_keys` and `output_keys` are column names of the dataframe, representing the input/output to the task in question. Here we have just questions, so we leave the outputs blank:
**Upload dataset to Phoenix**
```python
dataset = client.datasets.create_dataset(
    name="nba-questions",
    dataframe=df,
    input_keys=["question"],
    output_keys=[],
)
```
Each row of the dataset is called an `Example`.
### Create a Task
A task is any function/process that returns a JSON-serializable output. A task can also be an `async` function, but we use a sync function here for simplicity. If the task is a function of one argument, then that argument will be bound to the `input` field of the dataset example.
```python
def task(x):
    return ...
```
For our example, we'll ask an LLM to build SQL queries based on our questions, run them against a database, and obtain a set of results:
**Set Up Database**
```python
import duckdb
from datasets import load_dataset
data = load_dataset("suzyanil/nba-data")["train"]
conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())
```
**Set Up Prompt and LLM**
```python
from textwrap import dedent
import openai
# Create OpenAI client (separate from Phoenix client)
openai_client = openai.Client()
columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")
LLM_MODEL = "gpt-4o"
columns_str = ",".join(column["column_name"] + ": " + column["column_type"] for column in columns)
system_prompt = dedent(f"""
You are a SQL expert, and you are given a single table named nba with the following columns:
{columns_str}\n
Write a SQL query corresponding to the user's
request. Return just the query text, with no formatting (backticks, markdown, etc.).""")
def generate_query(question):
    response = openai_client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


def text2sql(question):
    results = error = None
    query = None
    try:
        query = generate_query(question)
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)
    return {"query": query, "results": results, "error": error}
```
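As a quick sanity check, you can call `text2sql` directly and inspect the shape of its output before wiring it into an experiment; the exact query text will vary between runs, and an `OPENAI_API_KEY` must be set in your environment:
```python
# Illustrative local check only; the printed query will differ from run to run.
sample = text2sql("Which team won the most games?")
print(sample)
# e.g. {"query": "SELECT ...", "results": [...], "error": None}
```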
**Define `task` as a Function**
Recall that each row of the dataset is encapsulated as an `Example` object, and that the input keys were defined when we uploaded the dataset:
```python
def task(x):
    return text2sql(x["question"])
```
**More complex `task` inputs**
More complex tasks can use additional information. These values can be accessed by defining a task function with specific parameter names, which are bound to special values associated with the dataset example:
<table><thead><tr><th width="203">Parameter name</th><th width="226">Description</th><th>Example</th></tr></thead><tbody><tr><td><code>input</code></td><td>example input</td><td><code>def task(input): ...</code></td></tr><tr><td><code>expected</code></td><td>example output</td><td><code>def task(expected): ...</code></td></tr><tr><td><code>reference</code></td><td>alias for <code>expected</code></td><td><code>def task(reference): ...</code></td></tr><tr><td><code>metadata</code></td><td>example metadata</td><td><code>def task(metadata): ...</code></td></tr><tr><td><code>example</code></td><td><code>Example</code> object</td><td><code>def task(example): ...</code></td></tr></tbody></table>
A `task` can be defined as a sync or async function that takes any number of the above argument names in any order!
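For instance, here is a minimal sketch of a task that pulls in both the example input and its metadata; the `season` metadata key is hypothetical and only illustrates how a bound parameter could be used:
```python
# A sketch only: parameter names (not positions) determine what gets bound.
async def task(input, metadata):
    question = input["question"]
    season = metadata.get("season")  # hypothetical metadata field, for illustration
    if season:
        question = f"{question} (limit your answer to the {season} season)"
    return text2sql(question)
```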
### Define Evaluators
An evaluator is any function that takes the task output and returns an assessment. Here we'll simply check whether the queries succeeded in obtaining any result from the database:
```python
def no_error(output) -> bool:
    return not bool(output.get("error"))


def has_results(output) -> bool:
    return bool(output.get("results"))
```
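Since evaluators are plain functions, you can exercise them locally on a hand-written sample of the task output before attaching them to an experiment:
```python
# Illustrative check against a made-up output that matches the text2sql shape.
sample_output = {"query": "SELECT 1", "results": [{"1": 1}], "error": None}
assert no_error(sample_output)
assert has_results(sample_output)
```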
### Run an Experiment
**Instrument OpenAI**
Instrumenting the LLM calls also gives us spans and traces that are linked to the experiment and can be examined in the Phoenix UI:
```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register
tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```
**Run the Task and Evaluators**
Running an experiment is as easy as calling `run_experiment` with the components we defined above. The results of the experiment will show up in Phoenix:
```python
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[no_error, has_results],
)
```
### Add More Evaluations
If you want to attach more evaluations to the same experiment after the fact, you can do so with `evaluate_experiment`.
```python
evaluators = [
    # add evaluators here
]
experiment = client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=evaluators,
)
```
If you no longer have access to the original `experiment` object, you can retrieve it from Phoenix using the `get_experiment` client method.
```python
experiment_id = "experiment-id"  # set your experiment ID here
experiment = client.experiments.get_experiment(experiment_id=experiment_id)
evaluators = [
    # add evaluators here
]
experiment = client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=evaluators,
)
```
#### Dry Run
Sometimes we may want to do a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. Both `run_experiment()` and `evaluate_experiment()` accept a `dry_run=` parameter for this purpose: it executes the task and evaluators on a small subset without sending data to the Phoenix server. Setting `dry_run=True` selects one sample from the dataset, and setting it to a number, e.g. `dry_run=3`, selects multiple. The sampling is also deterministic, so you can keep re-running it for debugging purposes.
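For example, a dry run over three samples, reusing the dataset, task, and evaluators defined above, might look like this:
```python
# Nothing is sent to the Phoenix server when dry_run is set.
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[no_error, has_results],
    dry_run=3,  # or dry_run=True to select a single sample
)
```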