# Comparing LlamaIndex Query Engines with a Pairwise Evaluator

This tutorial sets up an experiment to determine which LlamaIndex query engine is preferred by an evaluation LLM. Using LlamaIndex's `PairwiseComparisonEvaluator`, we compare responses from different engines and identify which one produces more helpful or relevant outputs.

{% hint style="info" %}
See the LlamaIndex [notebook](https://github.com/run-llama/llama_index/blob/a7c79201bbc5e195a0447ae557980791010b4747/docs/docs/examples/evaluation/pairwise_eval.ipynb) for more info.
{% endhint %}

***

## Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the Colab notebook above.

## Upload Dataset to Phoenix

Here, we grab 7 examples from a Hugging Face dataset.

```python
import pandas as pd
from time import time_ns

from phoenix.client import Client

# Sample a handful of creative-writing examples from databricks-dolly-15k.
sample_size = 7
category = "creative_writing"
url = "hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl"
df = pd.read_json(url, lines=True)
df = df.loc[df.category == category, ["instruction", "response"]]
df = df.sample(sample_size, random_state=42)

px_client = Client()
dataset = px_client.datasets.create_dataset(
    name=f"{category}_{time_ns()}",
    dataframe=df,
)
```

{% embed url="https://storage.googleapis.com/arize-phoenix-assets/assets/images/pairwise-eval-cookbook-1.png" %}

## Define Task Function

The task function can be either sync or async.

```python
from llama_index.llms.openai import OpenAI

async def task(input):
    return (await OpenAI(model="gpt-3.5-turbo").acomplete(input["instruction"])).text
```

## Dry-Run Experiment

Conduct a dry-run experiment on 3 randomly selected examples.

```python
from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task, dry_run=3)
```

## Define Evaluators For Each Experiment Run

Evaluators can be sync or async. The function arguments `output` and `expected` refer to the attributes of the same name in the `ExperimentRun` data structure.

The `PairwiseComparisonEvaluator` in **LlamaIndex** is used to **compare two outputs side-by-side** and determine which one is preferred. This setup allows you to:

* Run automated A/B tests on different LlamaIndex query engine configurations (a query-engine task sketch appears at the end of this page)
* Capture LLM-based preference data to guide iteration
* Aggregate pairwise win rates and qualitative feedback (a win-rate aggregation sketch also appears at the end of this page)

```python
from typing import Optional, Tuple
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI

# Illustrative type aliases for the evaluator's (score, explanation) return pair.
Score = Optional[float]
Explanation = Optional[str]

llm = OpenAI(temperature=0, model="gpt-4o")

async def pairwise(output, input, expected) -> Tuple[Score, Explanation]:
    ans = await PairwiseComparisonEvaluator(llm=llm).aevaluate(
        query=input["instruction"],
        response=output,
        second_response=expected["response"],
    )
    return ans.score, ans.feedback
evaluators = [pairwise]
```

```python
from phoenix.experiments import evaluate_experiment
experiment = evaluate_experiment(experiment, evaluators)
```

## View Results in Phoenix

{% embed url="https://storage.googleapis.com/arize-phoenix-assets/assets/images/pairwise-eval-cookbook-2.png" %}
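
## Variation: Use a Query Engine as the Task

The task above calls the LLM directly, but the same experiment can compare actual LlamaIndex query engine configurations. The sketch below is a minimal, hedged example that is not part of the original notebook: it assumes a hypothetical `documents` list of LlamaIndex `Document` objects that you have already loaded, and picks an arbitrary `similarity_top_k` setting.

```python
from llama_index.core import VectorStoreIndex

# Hypothetical: `documents` is a list of llama_index Document objects you have
# already loaded (e.g. with a reader); it is not defined in the notebook above.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2)

async def engine_task(input):
    # Query the engine with the dataset example's instruction and return plain text.
    response = await query_engine.aquery(input["instruction"])
    return str(response)
```

Swapping `task` for `engine_task` in `run_experiment` lets the same pairwise evaluator judge one engine configuration against the dataset's reference responses; repeating the experiment with a different configuration gives you an A/B comparison.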
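
## Aggregate a Pairwise Win Rate

Phoenix shows per-run scores and explanations in the UI, but you can also aggregate a win rate directly from `PairwiseComparisonEvaluator` results. The snippet below is a standalone sketch using illustrative example data (the `pairs` list is not drawn from the dataset above); it assumes LlamaIndex's pairwise scoring convention of 1.0 when the first response is preferred, 0.0 when the second is, and 0.5 for a tie.

```python
import asyncio

from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI

# Illustrative pairs: (query, candidate_response, reference_response).
pairs = [
    ("Write a haiku about autumn.", "Leaves drift on cool wind...", "Crisp air, amber light..."),
    ("Describe a sunrise in two sentences.", "The sky blushes pink...", "Light spills over the hills..."),
]

async def pairwise_win_rate(pairs):
    evaluator = PairwiseComparisonEvaluator(llm=OpenAI(temperature=0, model="gpt-4o"))
    results = [
        await evaluator.aevaluate(query=q, response=candidate, second_response=reference)
        for q, candidate, reference in pairs
    ]
    # Assumed convention: score is 1.0 if the candidate wins, 0.0 if the
    # reference wins, 0.5 for a tie, so the mean is the candidate's win rate.
    scores = [r.score for r in results if r.score is not None]
    feedback = [r.feedback for r in results]
    return sum(scores) / len(scores), feedback

rate, feedback = asyncio.run(pairwise_win_rate(pairs))
print(f"Candidate win rate: {rate:.2f}")
```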
