# Evaluators

At its core, an Evaluator is anything that returns a Score.

## Abstractions

- Score: immutable result container
- Evaluator: base class for sync/async evaluation with input validation and mapping
- LLMEvaluator: base class that integrates with an LLM and a prompt template
- ClassificationEvaluator: LLM-powered single-criteria classification
- create_classifier: factory for ClassificationEvaluator
- create_evaluator decorator: turn any function into an evaluator

### Score

- Every score has the following properties:
  - name: The human-readable name of the score/evaluator.
  - source: The origin of the evaluation signal (llm, heuristic, or human).
  - direction: The optimization direction, i.e., whether a higher score is better or worse.
- Scores may also have some of the following properties:
  - score: The numeric score.
  - label: The categorical outcome (e.g., "good", "bad", or another label).
  - explanation: A brief rationale or justification for the result (e.g., for LLM-as-a-judge).
  - metadata: Arbitrary extra context such as model details, intermediate scores, or run info.

### Evaluator

- The abstract base class for all evaluators.
- Subclasses implement `_evaluate(eval_input) -> list[Score]` and optionally `_async_evaluate(eval_input)`.

### Input Mapping

- Evaluators use Pydantic schemas for input validation and type safety. Many evaluator subclasses infer the `input_schema`, but users can always provide their own.
- Required fields are enforced. Missing or empty values raise `ValueError`.
- Use `input_mapping` to map evaluator-required field names to your input keys or paths.
- You can bind an `input_mapping` to an evaluator for reuse with multiple inputs using `.bind` or `bind_evaluator`.

### LLMEvaluator

- Abstract base class for evaluators that use LLMs.
- Infers `input_schema` from the prompt template placeholders when not provided.
- Synchronous `evaluate` requires a sync `LLM`; asynchronous `async_evaluate` works with the same `LLM` instance, which supports both sync and async methods.

### ClassificationEvaluator

- Main evaluator type for LLM-as-a-judge style evaluations (see the sketch below).
- Frames LLM judgement as a classification task in which the LLM is asked to produce a categorical label from a set of choices.
- Behavior:
  - `choices` may be a list of labels, a mapping of label -> score, or label -> (score, description).
  - Returns one `Score` with a `label`, plus an optional `score` (from the mapping) and `explanation`.
  - Validates that the generated label is within the provided choices.
  - Explanations are generated by default, but may be turned off.
- Requires an LLM that supports tool calling or structured output.

### Heuristic Evaluators

- `create_evaluator(name: str, source: SourceType = "heuristic", direction: DirectionType = "maximize")` registers a function as an evaluator. The input schema is inferred from the function signature.
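
For illustration, here is a minimal sketch of how these pieces might fit together. It assumes `ClassificationEvaluator` is importable from `phoenix.evals` and that an unbound evaluator is invoked via `evaluate(eval_input)`; both are assumptions based on the descriptions above, not verified API details.

```python
from phoenix.evals import ClassificationEvaluator  # import path assumed
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

# LLM-as-a-judge: `choices` maps categorical labels to numeric scores.
relevance = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="Query: {query}\nResponse: {response}\nIs the response relevant to the query?",
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

# `evaluate` signature assumed; expected to return list[Score] with label, score, and explanation.
scores = relevance.evaluate(
    {"query": "How do I reset my device?", "response": "Go to settings > reset."}
)
```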

# Input Mapping and Binding

The evals library provides powerful input mapping capabilities that allow you to extract and transform data from complex nested structures.

## Input Mapping Types

The `input_mapping` parameter accepts several types of mappings:

1. **Simple key mapping**: `{"field": "key"}` - maps an evaluator field to an input key
2. **Path mapping**: `{"field": "nested.path"}` - uses JSON path syntax from [jsonpath-ng](https://pypi.org/project/jsonpath-ng/)
3. **Callable mapping**: `{"field": lambda x: x["key"]}` - custom extraction logic

### Path Mapping Examples

```python
# Nested dictionary access
input_mapping = {
    "query": "input.query",
    "context": "input.documents",
    "response": "output.answer",
}

# Array indexing
input_mapping = {
    "first_doc": "input.documents[0]",
    "last_doc": "input.documents[-1]",
}

# Combined nesting and list indexing
input_mapping = {
    "user_query": "data.user.messages[0].content",
}
```

### Callable Mappings

For complex transformations, use callable functions that accept an `eval_input` payload:

```python
# Callable example
def extract_context(eval_input):
    docs = eval_input.get("input", {}).get("documents", [])
    return " ".join(docs[:3])  # Join the first 3 documents


input_mapping = {
    "query": "input.query",
    "context": extract_context,
    "response": "output.answer",
}

# Lambda example
input_mapping = {
    "user_query": lambda x: x["input"]["query"].lower(),
    "context": lambda x: " ".join(x["documents"][:3]),
}
```

## Pydantic Input Schemas

Evaluators use Pydantic models for input validation and type safety. Most of the time (e.g., for `ClassificationEvaluator` or functions decorated with `create_evaluator`) the input schema is inferred, but you can always define your own. The Pydantic model lets you annotate input fields with additional information such as aliases or descriptions.

```python
from typing import List

from pydantic import BaseModel


class HallucinationInput(BaseModel):
    query: str
    context: List[str]
    response: str


evaluator = HallucinationEvaluator(
    name="hallucination",
    llm=llm,
    prompt_template="...",
    input_schema=HallucinationInput,
)
```

### Schema Inference

Most evaluators automatically infer schemas when one is not provided at instantiation.

LLM evaluators infer schemas from prompt templates:

```python
# This creates a schema with required str fields: query, context, response
evaluator = LLMEvaluator(
    name="hallucination",
    llm=llm,
    prompt_template="Query: {query}\nContext: {context}\nResponse: {response}",
)
```

Decorated function evaluators infer schemas from the function signature:

```python
@create_evaluator(name="exact_match")
def exact_match(output: str, expected: str) -> Score:
    ...

# creates an input_schema with required str fields: output, expected
{'properties': {'output': {'title': 'Output', 'type': 'string'},
                'expected': {'title': 'Expected', 'type': 'string'}},
 'required': ['output', 'expected']}
```

## Binding System

Use `bind_evaluator` or `.bind` to create a pre-configured evaluator with a fixed input mapping. At evaluation time, you only need to provide the `eval_input`; the mapping is handled internally.

```python
from phoenix.evals import bind_evaluator

# Create a bound evaluator with a fixed mapping
bound_evaluator = bind_evaluator(
    evaluator,
    {
        "query": "input.query",
        "context": "input.documents",
        "response": "output.answer",
    },
)

# Run evaluation
scores = bound_evaluator({
    "input": {"query": "How do I reset?", "documents": ["Manual", "Guide"]},
    "output": {"answer": " Go to settings > reset. "},
})
```

### BoundEvaluator Features

- **Static validation**: Mapping syntax is validated at creation time.
- **Introspection**: `describe()` shows mapping details alongside the schema (see the sketch below).
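
The same binding can also be created directly on the evaluator via the `.bind` method mentioned earlier. A short sketch, assuming `.bind` accepts the same mapping as `bind_evaluator` and that `describe()` returns a printable summary:

```python
# Assumed to be equivalent to bind_evaluator(evaluator, {...}).
bound_evaluator = evaluator.bind(
    {
        "query": "input.query",
        "context": "input.documents",
        "response": "output.answer",
    }
)

# Introspection: shows the input schema alongside the bound mapping
# (the exact output format is not specified in this document).
print(bound_evaluator.describe())
```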

# Dataframe Evaluation

- `evaluate_dataframe`: for synchronous dataframe evaluations
- `async_evaluate_dataframe`: an asynchronous version for faster execution and the ability to specify concurrency

Both methods run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with two added columns per score:

1. `{score_name}_score` contains the JSON-serialized score (or None if the evaluation failed)
2. `{evaluator_name}_execution_details` contains information about the execution status, duration, and any exceptions that occurred (see the inspection sketch after the examples below)

### Notes

- Bind `input_mappings` to your evaluators beforehand so they match your dataframe columns.
- Do not use dot notation in dataframe column names (e.g., "input.query"), because it will interfere with the input mapping.
- Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., 'same_name_score'). This can lead to data loss as later scores overwrite earlier ones.
- Similarly, evaluator names should be unique so that the execution_details columns don't collide.
- Failed evaluations: If an evaluation fails, the failure details are recorded in the execution_details column and the score will be None.

### Examples

1) Evaluator that returns more than one score:

```python
import pandas as pd

from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()
```

2) Running multiple evaluators, one bound with an input_mapping:

```python
import pandas as pd

from phoenix.evals import bind_evaluator, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

df = pd.DataFrame(
    {
        # exact_match columns
        "output": ["Yes", "Yes", "No"],
        "expected": ["Yes", "No", "No"],
        # hallucination columns (need mapping)
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()
```

3) Asynchronous evaluation:

```python
import pandas as pd

from phoenix.evals import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

df = pd.DataFrame(
    {
        "context": ["This is a test", "This is another test", "This is a third test"],
        "input": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "output": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(llm=llm)

result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
result.head()
```
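
To inspect the augmented dataframe from the examples above, something like the following can be used. The column names (derived from the score name and the evaluator name) and the JSON string serialization of the score column are assumptions based on the naming convention described above.

```python
import json

row = result.iloc[0]

# Score columns are named `{score_name}_score`; "hallucination" is assumed here.
raw_score = row.get("hallucination_score")
if raw_score is not None:
    score = json.loads(raw_score) if isinstance(raw_score, str) else raw_score
    print(score.get("label"), score.get("score"), score.get("explanation"))

# Execution details columns are named `{evaluator_name}_execution_details`;
# the evaluator name used here is likewise an assumption.
print(row.get("HallucinationEvaluator_execution_details"))
```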

## FAQ

### Why do evaluators accept a payload and an input_mapping vs. kwargs?

Different evaluators require different keyword arguments to operate. These arguments may not perfectly match those in your example or dataset.

Let's say our example looks like this, where the inputs and outputs contain nested dictionaries:

```python
eval_input = {
    "input": {"query": "user input query", "documents": ["doc A", "doc B"]},
    "output": {"response": "model answer"},
    "expected": "correct answer",
}
```

We want to run two evaluators over this example:

- `Hallucination`, which requires `query`, `context`, and `response`
- `exact_match`, which requires `expected` and `output`

Rather than modifying our data to fit the two evaluators, we make the evaluators fit the data. Binding an `input_mapping` enables the evaluators to run on the same payload; the map/transform steps are handled by the evaluator itself.

### How are missing or optional fields handled?

The input mapping system distinguishes between required and optional fields:

- **Required fields**: Must be present and non-empty, or evaluation fails
- **Optional fields**: Can be missing or empty, and are only included if successfully extracted from the input

For optional fields, use the binding system or ensure your input schema marks fields as optional:

```python
from typing import Optional

from pydantic import BaseModel


class MyInput(BaseModel):
    required_field: str
    optional_field: Optional[str] = None
```
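
As an end-to-end illustration, a heuristic evaluator with an optional parameter might look like the sketch below. The import paths, the `Score` constructor arguments, and the `evaluate` call are assumptions based on the API described in this document rather than verified signatures.

```python
from typing import Optional

from phoenix.evals import Score, create_evaluator  # import paths assumed


@create_evaluator(name="contains_greeting")
def contains_greeting(output: str, greeting: Optional[str] = None) -> Score:
    """`output` is required; `greeting` is optional and falls back to a default."""
    target = greeting or "hello"
    return Score(
        name="contains_greeting",
        score=1.0 if target.lower() in output.lower() else 0.0,
        source="heuristic",
        direction="maximize",
    )


# `greeting` may be omitted because it has a default; omitting `output` raises ValueError.
scores = contains_greeting.evaluate({"output": "Hello there!"})
```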
