# Built-in metrics and evaluators
This module contains all of the built-in evaluators that are available to use out of the box.
## exact_match
Evaluator that determines whether two strings are an exact match. Returns 1.0 if `output == expected`, else 0.0.
No text normalization is performed.
Examples
```python
from phoenix.evals.metrics.exact_match import exact_match
# 1) No mapping
scores = exact_match({"output": "no", "expected": "yes"})
print(scores[0].score) # 0.0
# 2) With field mapping
scores = exact_match(
    {"prediction": "yes", "gold": "yes"},
    input_mapping={"output": "prediction", "expected": "gold"},
)
print(scores[0].score) # 1.0
```
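To score several records at once, a plain Python loop over single-record calls is enough. The sketch below assumes only the call signature shown above; the `records` list is hypothetical.
```python
from phoenix.evals.metrics.exact_match import exact_match
# Hypothetical batch of records using the default field names.
records = [
    {"output": "yes", "expected": "yes"},
    {"output": "no", "expected": "yes"},
    {"output": "maybe", "expected": "maybe"},
]
# Score each record individually and compute the overall accuracy.
scores = [exact_match(record)[0].score for record in records]
print(sum(scores) / len(scores))  # 0.666...
```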
## HallucinationEvaluator
Evaluator that determines whether a response to a query is grounded in the context or hallucinated.
- Inherits: `ClassificationEvaluator`
- Required fields: `{input, output, context}`
- Choices: `{"hallucinated": 0.0, "factual": 1.0}`
Examples
```python
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o-mini", client="openai")
hallucination = HallucinationEvaluator(llm=llm)
eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France.",
}
scores = hallucination(eval_input)
print(scores[0].label, scores[0].score)
```
Note: requires an LLM that supports tool calling or structured output.
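The same evaluator instance can be reused across records, with one LLM request per call. The sketch below is a minimal batch loop under that assumption; the records are hypothetical and the labels depend on the model.
```python
from phoenix.evals.llm import LLM
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
llm = LLM(provider="openai", model="gpt-4o-mini", client="openai")
hallucination = HallucinationEvaluator(llm=llm)
# Hypothetical batch of records; each needs the required input, output, and context fields.
eval_inputs = [
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "context": "Paris is the capital and largest city of France.",
    },
    {
        "input": "Who wrote Hamlet?",
        "output": "Hamlet was written by Charles Dickens.",
        "context": "Hamlet is a tragedy written by William Shakespeare.",
    },
]
# One LLM call per record; labels are model-dependent.
labels = [hallucination(eval_input)[0].label for eval_input in eval_inputs]
print(labels)  # e.g. ['factual', 'hallucinated']
```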
## PrecisionRecallFScore
Calculates precision, recall, and F score (F1 by default) given lists of output and expected values.
Returns:
A list of three `Score` objects: precision, recall, and F score
Notes:
- Works for binary or multi-class classification.
- Inputs can be lists of integers or string labels.
- If binary, 1.0 is presumed positive. Otherwise, provide `positive_label` for best results.
- Beta is configurable if you wish to calculate an F score other than F1.
- The default averaging technique is macro, but it is configurable (a configuration sketch follows the example below).
Examples
```python
from phoenix.evals.metrics.precision_recall import PrecisionRecallFScore
precision_recall_fscore = PrecisionRecallFScore(positive_label="yes")  # beta and the averaging technique can also be specified
result = precision_recall_fscore(
    {"output": ["no", "yes", "yes"], "expected": ["yes", "no", "yes"]}
)
print("Results:")
print(result[0])  # precision
print(result[1])  # recall
print(result[2])  # F1
```
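The notes above mention that beta and the averaging technique are configurable. The sketch below assumes the constructor exposes `beta` and `average` keyword arguments for these options; verify the exact parameter names against the signature in your installed version.
```python
from phoenix.evals.metrics.precision_recall import PrecisionRecallFScore
# Assumed keyword names: `beta` for the F-beta weight and `average` for the
# averaging technique; check your installed version's signature.
f2 = PrecisionRecallFScore(beta=2.0, average="weighted")
# Multi-class string labels, scored with weighted averaging.
result = f2(
    {
        "output": ["cat", "dog", "dog", "bird"],
        "expected": ["cat", "dog", "bird", "bird"],
    }
)
print(result[0])  # precision
print(result[1])  # recall
print(result[2])  # F2 score
```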