metrics.md•2.46 kB
# Built-in metrics and evaluators
This module includes all of the built-in evaluators that are available to use out of the box. 
## exact_match
Evaluator to determine if two strings are an exact match. Behavior: returns 1.0 if `output == expected`, else 0.0
No text normalization is performed. 
Examples
```python
from phoenix.evals.metrics.exact_match import exact_match
# 1) No mapping
scores = exact_match({"output": "no", "expected": "yes"})
print(scores[0].score)  # 0.0
# 2) With field mapping
scores = exact_match(
    {"prediction": "yes", "gold": "yes"},
    input_mapping={"output": "prediction", "expected": "gold"},
)
print(scores[0].score)  # 1.0
```
## HallucinationEvaluator
Evaluation to determine if a response to a query is grounded in the context or hallucinated. 
- Inherits: `ClassificationEvaluator`
- Required fields: `{input, output, context}` 
- Choices: `{"hallucinated": 0.0, "factual": 1.0}`
Examples
```python
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o-mini", client="openai")
hallucination = HallucinationEvaluator(llm=llm)
eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France.",
}
scores = hallucination(eval_input)
print(scores[0].label, scores[0].score)
```
Note: requires an LLM that supports tool calling or structured output. 
## Precision Recall F Score 
Calculates the precision, recall, and f score (default f1) given lists of output and expected values. 
Returns: 
A list of three `Score` objects: precision, recall, and f 
Notes:
- Works for binary or multi-class classification
- Inputs can be lists of integers or string labels. 
- If binary, 1.0 is presumed positive. Otherwise, provide `positive_label` for best results.
- Beta is configurable if you wish to calculate an F score other than F1.
- Default averaging technique is macro, but it is configurable.
Examples
```python
from phoenix.evals.metrics.precision_recall import PrecisionRecallFScore
precision_recall_fscore = PrecisionRecallFScore(positive_label="yes") # can also specify beta and averaging technique
result = precision_recall_fscore({"output": ["no", "yes", "yes"], "expected": ["yes", "no", "yes"]})
print("Results:")
print(result[0]) # precision
print(result[1]) # recall
print(result[2]) # f1
```