# Batch Evaluations
## Dataframe Evaluation Methods
* `evaluate_dataframe` for synchronous dataframe evaluations
* `async_evaluate_dataframe` for asynchronous evaluations; it runs evaluations concurrently and lets you set the concurrency level.
Both methods run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with added columns for each evaluator:
1. `{score_name}_score` columns contain the JSON-serialized score (or `None` if the evaluation failed), one column per score the evaluator returns.
2. `{evaluator_name}_execution_details` contains information about the execution status, duration, and any exceptions that occurred.
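For instance, here is a minimal sketch of reading the added columns back out, assuming `result` is the augmented dataframe returned by one of the calls in the examples below and that the evaluator produced a score named `precision`; the actual column names depend on your evaluators.
```python
import json

# Hypothetical column name following the `{score_name}_score` pattern above;
# check result.columns for the names your evaluators actually produce.
for raw_score in result["precision_score"]:
    if raw_score is None:
        continue  # this row's evaluation failed; see the execution details column
    score = json.loads(raw_score)  # each cell holds the score serialized as JSON
    print(score)
```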
#### Notes:
* Bind an `input_mapping` to each evaluator beforehand so that the evaluator's required fields match your dataframe columns (see `bind_evaluator` in example 2).
* Failed evaluations: if an evaluation fails, the failure details are recorded in the `{evaluator_name}_execution_details` column and the score is `None` (see the sketch after these notes).
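A sketch of how you might isolate and inspect failed rows, using illustrative column names that follow the patterns above (`hallucination_score` and `HallucinationEvaluator_execution_details` are assumptions; verify against `result.columns`):
```python
# Illustrative column names following the `{score_name}_score` and
# `{evaluator_name}_execution_details` patterns described above.
failed = result[result["hallucination_score"].isna()]
for details in failed["HallucinationEvaluator_execution_details"]:
    print(details)  # execution status, duration, and any recorded exception
```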
#### Examples
1. An evaluator that returns more than one score:
```python
import pandas as pd

from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

# Each row holds a list of predicted labels and a list of expected labels.
df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()
```
2. Running multiple evaluators, one bound with an `input_mapping`:
```python
import pandas as pd

from phoenix.evals import bind_evaluator, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

df = pd.DataFrame(
    {
        # exact_match columns
        "output": ["Yes", "Yes", "No"],
        "expected": ["Yes", "No", "No"],
        # hallucination columns (need mapping)
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")

# Map the evaluator's expected "input" and "output" fields to the dataframe's
# "query" and "response" columns.
hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()
```
3. Asynchronous evaluation:
```python
import pandas as pd

from phoenix.evals import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

df = pd.DataFrame(
    {
        "context": ["This is a test", "This is another test", "This is a third test"],
        "input": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "output": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(llm=llm)

# `await` works at the top level in a notebook; inside a script, wrap this call
# in an async function and run it with asyncio.run().
result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
result.head()
```
See [Using Evals with Phoenix](using-evals-with-phoenix.md) to learn how to run evals on project traces and upload them to Phoenix.