mcp-run-python

Official

Overview Schema Related Servers Score Discussions

report-evaluators.md•23.7 KiB

# Report Evaluators Report evaluators analyze entire experiment results rather than individual cases. Use them to compute experiment-wide statistics like confusion matrices, precision-recall curves, accuracy scores, or custom summary tables. ## How Report Evaluators Work Regular [evaluators](overview.md) run once per case and assess individual outputs. Report evaluators run once per experiment _after_ all cases have been evaluated, receiving the full [`EvaluationReport`][pydantic_evals.reporting.EvaluationReport] as input. ``` Cases executed → Case evaluators run → Report evaluators run → Final report ``` Results from report evaluators are stored as **analyses** on the report and, when Logfire is configured, are attached to the experiment span as structured attributes for visualization. ## Using Report Evaluators Pass report evaluators to `Dataset` via the `report_evaluators` parameter: ```python from pydantic_evals import Case, Dataset from pydantic_evals.evaluators import ConfusionMatrixEvaluator def my_classifier(text: str) -> str: text = text.lower() if 'cat' in text or 'meow' in text: return 'cat' elif 'dog' in text or 'bark' in text: return 'dog' return 'unknown' dataset = Dataset( cases=[ Case(name='cat', inputs='The cat goes meow', expected_output='cat'), Case(name='dog', inputs='The dog barks', expected_output='dog'), ], report_evaluators=[ ConfusionMatrixEvaluator( predicted_from='output', expected_from='expected_output', title='Animal Classification', ), ], ) report = dataset.evaluate_sync(my_classifier) # report.analyses contains the ConfusionMatrix result ``` ## Built-in Report Evaluators ### ConfusionMatrixEvaluator Builds a confusion matrix comparing predicted vs expected labels across all cases. ```python from pydantic_evals.evaluators import ConfusionMatrixEvaluator ConfusionMatrixEvaluator( predicted_from='output', expected_from='expected_output', title='My Confusion Matrix', ) ``` **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `predicted_from` | `'expected_output' \| 'output' \| 'metadata' \| 'labels'` | `'output'` | Source for predicted values | | `predicted_key` | `str \| None` | `None` | Key to extract when using `metadata` or `labels` | | `expected_from` | `'expected_output' \| 'output' \| 'metadata' \| 'labels'` | `'expected_output'` | Source for expected/true values | | `expected_key` | `str \| None` | `None` | Key to extract when using `metadata` or `labels` | | `title` | `str` | `'Confusion Matrix'` | Title shown in reports | **Returns:** [`ConfusionMatrix`][pydantic_evals.reporting.analyses.ConfusionMatrix] **Data Sources:** - `'output'` — the task's actual output (converted to string) - `'expected_output'` — the case's expected output (converted to string) - `'metadata'` — a value from the case's metadata dict (requires `key`) - `'labels'` — a label result from a case-level evaluator (requires `key`) **Example — classification with expected outputs:** ```python from pydantic_evals import Case, Dataset from pydantic_evals.evaluators import ConfusionMatrixEvaluator dataset = Dataset( cases=[ Case(inputs='meow', expected_output='cat'), Case(inputs='woof', expected_output='dog'), Case(inputs='chirp', expected_output='bird'), ], report_evaluators=[ ConfusionMatrixEvaluator( predicted_from='output', expected_from='expected_output', ), ], ) ``` **Example — using evaluator labels:** If a case-level evaluator produces a label like `predicted_class`, you can reference it: ```python from dataclasses import dataclass from pydantic_evals import Case, Dataset from pydantic_evals.evaluators import ( ConfusionMatrixEvaluator, Evaluator, EvaluatorContext, ) @dataclass class ClassifyOutput(Evaluator): def evaluate(self, ctx: EvaluatorContext) -> dict[str, str]: # Classify the output into a category return {'predicted_class': categorize(ctx.output)} def categorize(output: str) -> str: return 'positive' if 'good' in output.lower() else 'negative' dataset = Dataset( cases=[Case(inputs='test', expected_output='positive')], evaluators=[ClassifyOutput()], report_evaluators=[ ConfusionMatrixEvaluator( predicted_from='labels', predicted_key='predicted_class', expected_from='expected_output', ), ], ) ``` --- ### PrecisionRecallEvaluator Computes a precision-recall curve with AUC (area under the curve) from numeric scores and binary ground-truth labels. ```python from pydantic_evals.evaluators import PrecisionRecallEvaluator PrecisionRecallEvaluator( score_from='scores', score_key='confidence', positive_from='assertions', positive_key='is_correct', ) ``` **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `score_key` | `str` | _(required)_ | Key in scores or metrics dict | | `positive_from` | `'expected_output' \| 'assertions' \| 'labels'` | _(required)_ | Source for ground-truth binary labels | | `positive_key` | `str \| None` | `None` | Key in assertions or labels dict | | `score_from` | `'scores' \| 'metrics'` | `'scores'` | Source for numeric scores | | `title` | `str` | `'Precision-Recall Curve'` | Title shown in reports | | `n_thresholds` | `int` | `100` | Number of threshold points on the curve | **Returns:** [`PrecisionRecall`][pydantic_evals.reporting.analyses.PrecisionRecall] + [`ScalarResult`][pydantic_evals.reporting.analyses.ScalarResult] (AUC) The AUC is computed at full resolution (using every unique score as a threshold) for accuracy, then the curve points are downsampled to `n_thresholds` for display. The AUC is returned both on the curve (for chart rendering) and as a separate `ScalarResult` for querying and sorting. **Score Sources:** - `'scores'` — a numeric score from a case-level evaluator (looked up by `score_key`) - `'metrics'` — a custom metric set during task execution (looked up by `score_key`) **Positive Sources:** - `'assertions'` — a boolean assertion from a case-level evaluator (looked up by `positive_key`) - `'labels'` — a label result cast to boolean (looked up by `positive_key`) - `'expected_output'` — the case's expected output cast to boolean **Example:** ```python from dataclasses import dataclass from typing import Any from pydantic_evals import Case, Dataset from pydantic_evals.evaluators import ( Evaluator, EvaluatorContext, PrecisionRecallEvaluator, ) @dataclass class ConfidenceEvaluator(Evaluator): def evaluate(self, ctx: EvaluatorContext) -> dict[str, Any]: confidence = calculate_confidence(ctx.output) return { 'confidence': confidence, # numeric score 'is_correct': ctx.output == ctx.expected_output, # boolean assertion } def calculate_confidence(output: str) -> float: return 0.85 # placeholder dataset = Dataset( cases=[ Case(inputs='test 1', expected_output='cat'), Case(inputs='test 2', expected_output='dog'), ], evaluators=[ConfidenceEvaluator()], report_evaluators=[ PrecisionRecallEvaluator( score_from='scores', score_key='confidence', positive_from='assertions', positive_key='is_correct', ), ], ) ``` --- ### ROCAUCEvaluator Computes an ROC (Receiver Operating Characteristic) curve and AUC from numeric scores and binary ground-truth labels. The ROC curve plots the True Positive Rate against the False Positive Rate at various threshold values, with a dashed random-baseline diagonal for reference. ```python from pydantic_evals.evaluators import ROCAUCEvaluator ROCAUCEvaluator( score_key='confidence', positive_from='assertions', positive_key='is_correct', ) ``` **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `score_key` | `str` | _(required)_ | Key in scores or metrics dict | | `positive_from` | `'expected_output' \| 'assertions' \| 'labels'` | _(required)_ | Source for ground-truth binary labels | | `positive_key` | `str \| None` | `None` | Key in assertions or labels dict | | `score_from` | `'scores' \| 'metrics'` | `'scores'` | Source for numeric scores | | `title` | `str` | `'ROC Curve'` | Title shown in reports | | `n_thresholds` | `int` | `100` | Number of threshold points on the curve | **Returns:** [`LinePlot`][pydantic_evals.reporting.analyses.LinePlot] + [`ScalarResult`][pydantic_evals.reporting.analyses.ScalarResult] (AUC) The AUC is computed at full resolution. The chart includes a dashed "Random" baseline diagonal from (0, 0) to (1, 1) for visual comparison. **Score and Positive Sources:** Same as [`PrecisionRecallEvaluator`](#precisionrecallevaluator). --- ### KolmogorovSmirnovEvaluator Computes a Kolmogorov-Smirnov plot and KS statistic from numeric scores and binary ground-truth labels. The KS plot shows the empirical CDFs (cumulative distribution functions) of the score distribution for positive and negative cases. The KS statistic is the maximum vertical distance between the two CDFs — higher values indicate better class separation. ```python from pydantic_evals.evaluators import KolmogorovSmirnovEvaluator KolmogorovSmirnovEvaluator( score_key='confidence', positive_from='assertions', positive_key='is_correct', ) ``` **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `score_key` | `str` | _(required)_ | Key in scores or metrics dict | | `positive_from` | `'expected_output' \| 'assertions' \| 'labels'` | _(required)_ | Source for ground-truth binary labels | | `positive_key` | `str \| None` | `None` | Key in assertions or labels dict | | `score_from` | `'scores' \| 'metrics'` | `'scores'` | Source for numeric scores | | `title` | `str` | `'KS Plot'` | Title shown in reports | | `n_thresholds` | `int` | `100` | Number of threshold points on the curve | **Returns:** [`LinePlot`][pydantic_evals.reporting.analyses.LinePlot] + [`ScalarResult`][pydantic_evals.reporting.analyses.ScalarResult] (KS Statistic) **Score and Positive Sources:** Same as [`PrecisionRecallEvaluator`](#precisionrecallevaluator). --- ## Custom Report Evaluators Write custom report evaluators by inheriting from [`ReportEvaluator`][pydantic_evals.evaluators.ReportEvaluator] and implementing the `evaluate` method: ```python from dataclasses import dataclass from pydantic_evals.evaluators import ReportEvaluator, ReportEvaluatorContext from pydantic_evals.reporting.analyses import ScalarResult @dataclass class AccuracyEvaluator(ReportEvaluator): """Computes overall accuracy as a scalar metric.""" def evaluate(self, ctx: ReportEvaluatorContext) -> ScalarResult: cases = ctx.report.cases if not cases: return ScalarResult(title='Accuracy', value=0.0, unit='%') correct = sum( 1 for case in cases if case.output == case.expected_output ) accuracy = correct / len(cases) * 100 return ScalarResult(title='Accuracy', value=accuracy, unit='%') ``` ### ReportEvaluatorContext The context passed to `evaluate()` contains: - `ctx.name` — the experiment name - `ctx.report` — the full [`EvaluationReport`][pydantic_evals.reporting.EvaluationReport] with all case results - `ctx.experiment_metadata` — optional experiment-level metadata dict Through `ctx.report.cases`, you can access each case's inputs, outputs, expected outputs, scores, labels, assertions, metrics, and attributes. ### Return Types Report evaluators must return a `ReportAnalysis` or a `list[ReportAnalysis]`. The available analysis types are: #### ScalarResult A single numeric statistic: ```python from pydantic_evals.reporting.analyses import ScalarResult ScalarResult( title='Accuracy', value=93.3, unit='%', description='Percentage of correctly classified cases.', ) ``` | Field | Type | Description | |-------|------|-------------| | `title` | `str` | Display name | | `value` | `float \| int` | The numeric value | | `unit` | `str \| None` | Optional unit label (e.g., `'%'`, `'ms'`) | | `description` | `str \| None` | Optional longer description | --- #### TableResult A generic table of data: ```python from pydantic_evals.reporting.analyses import TableResult TableResult( title='Per-Class Metrics', columns=['Class', 'Precision', 'Recall', 'F1'], rows=[ ['cat', 0.95, 0.90, 0.924], ['dog', 0.88, 0.92, 0.899], ], description='Precision, recall, and F1 per class.', ) ``` | Field | Type | Description | |-------|------|-------------| | `title` | `str` | Display name | | `columns` | `list[str]` | Column headers | | `rows` | `list[list[str \| int \| float \| bool \| None]]` | Row data | | `description` | `str \| None` | Optional longer description | --- #### ConfusionMatrix A confusion matrix (typically produced by `ConfusionMatrixEvaluator`, but can be constructed directly): ```python from pydantic_evals.reporting.analyses import ConfusionMatrix ConfusionMatrix( title='Sentiment', class_labels=['positive', 'negative', 'neutral'], matrix=[ [45, 3, 2], # expected=positive [5, 40, 5], # expected=negative [1, 2, 47], # expected=neutral ], ) ``` | Field | Type | Description | |-------|------|-------------| | `title` | `str` | Display name | | `class_labels` | `list[str]` | Ordered labels for both axes | | `matrix` | `list[list[int]]` | `matrix[expected][predicted]` = count | | `description` | `str \| None` | Optional longer description | --- #### PrecisionRecall Precision-recall curve data (typically produced by `PrecisionRecallEvaluator`): | Field | Type | Description | |-------|------|-------------| | `title` | `str` | Display name | | `curves` | `list[PrecisionRecallCurve]` | One or more curves | | `description` | `str \| None` | Optional longer description | Each `PrecisionRecallCurve` contains a `name`, a list of `PrecisionRecallPoint`s (with `threshold`, `precision`, `recall`), and an optional `auc` value. --- #### LinePlot A generic XY line chart with labeled axes, supporting multiple curves. Use this for ROC curves, KS plots, calibration curves, or any custom line chart: ```python from pydantic_evals.reporting.analyses import LinePlot, LinePlotCurve, LinePlotPoint LinePlot( title='ROC Curve', x_label='False Positive Rate', y_label='True Positive Rate', x_range=(0, 1), y_range=(0, 1), curves=[ LinePlotCurve( name='Model (AUC: 0.95)', points=[LinePlotPoint(x=0.0, y=0.0), LinePlotPoint(x=0.1, y=0.8), LinePlotPoint(x=1.0, y=1.0)], ), LinePlotCurve( name='Random', points=[LinePlotPoint(x=0, y=0), LinePlotPoint(x=1, y=1)], style='dashed', ), ], ) ``` | Field | Type | Description | |-------|------|-------------| | `title` | `str` | Display name | | `x_label` | `str` | Label for the x-axis | | `y_label` | `str` | Label for the y-axis | | `x_range` | `tuple[float, float] \| None` | Optional fixed range for x-axis | | `y_range` | `tuple[float, float] \| None` | Optional fixed range for y-axis | | `curves` | `list[LinePlotCurve]` | One or more curves to plot | | `description` | `str \| None` | Optional longer description | Each `LinePlotCurve` contains a `name`, a list of `LinePlotPoint`s (with `x`, `y`), an optional `style` (`'solid'` or `'dashed'`), and an optional `step` interpolation mode (`'start'`, `'middle'`, or `'end'`) for step functions like empirical CDFs. `LinePlot` is the recommended return type for custom curve-based evaluators — any evaluator that returns a `LinePlot` will be rendered as a line chart in the Logfire UI without requiring any frontend changes. ### Returning Multiple Analyses A single report evaluator can return multiple analyses by returning a list: ```python from dataclasses import dataclass from pydantic_evals.evaluators import ReportEvaluator, ReportEvaluatorContext from pydantic_evals.reporting.analyses import ReportAnalysis, ScalarResult, TableResult @dataclass class ClassificationSummary(ReportEvaluator): """Produces both a scalar accuracy and a per-class metrics table.""" def evaluate(self, ctx: ReportEvaluatorContext) -> list[ReportAnalysis]: cases = ctx.report.cases if not cases: return [] labels = sorted({str(c.expected_output) for c in cases if c.expected_output}) # Scalar: overall accuracy correct = sum(1 for c in cases if c.output == c.expected_output) accuracy = ScalarResult( title='Accuracy', value=correct / len(cases) * 100, unit='%' ) # Table: per-class breakdown rows = [] for label in labels: tp = sum(1 for c in cases if str(c.output) == label and str(c.expected_output) == label) fp = sum(1 for c in cases if str(c.output) == label and str(c.expected_output) != label) fn = sum(1 for c in cases if str(c.output) != label and str(c.expected_output) == label) p = tp / (tp + fp) if (tp + fp) > 0 else 0.0 r = tp / (tp + fn) if (tp + fn) > 0 else 0.0 f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0 rows.append([label, round(p, 3), round(r, 3), round(f1, 3)]) table = TableResult( title='Per-Class Metrics', columns=['Class', 'Precision', 'Recall', 'F1'], rows=rows, ) return [accuracy, table] ``` ### Async Report Evaluators Report evaluators support async `evaluate` methods, handled automatically via `evaluate_async`: ```python from dataclasses import dataclass from pydantic_evals.evaluators import ReportEvaluator, ReportEvaluatorContext from pydantic_evals.reporting.analyses import ScalarResult @dataclass class AsyncAccuracy(ReportEvaluator): async def evaluate(self, ctx: ReportEvaluatorContext) -> ScalarResult: # Can use async I/O here (e.g., call an external API) cases = ctx.report.cases correct = sum(1 for c in cases if c.output == c.expected_output) return ScalarResult( title='Accuracy', value=correct / len(cases) * 100 if cases else 0.0, unit='%', ) ``` ## Serialization Report evaluators are serialized to and from YAML/JSON dataset files using the same format as case-level evaluators. This means datasets with report evaluators can be fully round-tripped through file serialization. **Example YAML dataset with report evaluators:** ```yaml # yaml-language-server: $schema=./test_cases_schema.json name: classifier_eval cases: - name: cat_test inputs: The cat meows expected_output: cat - name: dog_test inputs: The dog barks expected_output: dog report_evaluators: - ConfusionMatrixEvaluator - PrecisionRecallEvaluator: score_key: confidence positive_from: assertions positive_key: is_correct ``` Built-in report evaluators (`ConfusionMatrixEvaluator`, `PrecisionRecallEvaluator`, `ROCAUCEvaluator`, `KolmogorovSmirnovEvaluator`) are recognized automatically. For custom report evaluators, pass them via `custom_report_evaluator_types`: ```python {test="skip" lint="skip"} from pydantic_evals import Dataset dataset = Dataset[str, str, None].from_file( 'test_cases.yaml', custom_report_evaluator_types=[MyCustomReportEvaluator], ) ``` Similarly, when saving a dataset with custom report evaluators, pass them to `to_file` so the JSON schema includes them: ```python {test="skip" lint="skip"} dataset.to_file( 'test_cases.yaml', custom_report_evaluator_types=[MyCustomReportEvaluator], ) ``` ## Viewing Analyses in Logfire When [Logfire is configured](../how-to/logfire-integration.md), analyses are automatically attached to the experiment span as the `logfire.experiment.analyses` attribute. The Logfire UI renders them as interactive visualizations: - **Confusion matrices** are displayed as heatmaps - **Precision-recall curves** are rendered as line charts with AUC in the legend - **Line plots** (ROC curves, KS plots, etc.) are rendered as line charts with configurable axes - **Scalar results** are shown as labeled values - **Tables** are rendered as formatted data tables When comparing multiple experiments in the Logfire Evals view, analyses of the same type are displayed side by side for easy comparison. ## Complete Example A full example combining case-level evaluators with report evaluators: ```python from dataclasses import dataclass from typing import Any from pydantic_evals import Case, Dataset from pydantic_evals.evaluators import ( ConfusionMatrixEvaluator, Evaluator, EvaluatorContext, KolmogorovSmirnovEvaluator, PrecisionRecallEvaluator, ReportEvaluator, ReportEvaluatorContext, ROCAUCEvaluator, ) from pydantic_evals.reporting.analyses import ScalarResult def my_classifier(text: str) -> str: text = text.lower() if 'cat' in text or 'meow' in text: return 'cat' elif 'dog' in text or 'bark' in text: return 'dog' elif 'bird' in text or 'chirp' in text: return 'bird' return 'unknown' # Case-level evaluator: runs per case @dataclass class ConfidenceEvaluator(Evaluator): def evaluate(self, ctx: EvaluatorContext) -> dict[str, Any]: confidence = compute_confidence(ctx.output, ctx.inputs) is_correct = ctx.output == ctx.expected_output return { 'confidence': confidence, 'is_correct': is_correct, } def compute_confidence(output: str, inputs: str) -> float: return 0.85 # placeholder # Report-level evaluator: runs once over the full report @dataclass class AccuracyEvaluator(ReportEvaluator): def evaluate(self, ctx: ReportEvaluatorContext) -> ScalarResult: cases = ctx.report.cases correct = sum(1 for c in cases if c.output == c.expected_output) return ScalarResult( title='Accuracy', value=correct / len(cases) * 100 if cases else 0.0, unit='%', ) dataset = Dataset( cases=[ Case(inputs='The cat meows', expected_output='cat'), Case(inputs='The dog barks', expected_output='dog'), Case(inputs='A bird chirps', expected_output='bird'), ], evaluators=[ConfidenceEvaluator()], report_evaluators=[ ConfusionMatrixEvaluator( predicted_from='output', expected_from='expected_output', title='Animal Classification', ), PrecisionRecallEvaluator( score_from='scores', score_key='confidence', positive_from='assertions', positive_key='is_correct', ), ROCAUCEvaluator( score_from='scores', score_key='confidence', positive_from='assertions', positive_key='is_correct', ), KolmogorovSmirnovEvaluator( score_from='scores', score_key='confidence', positive_from='assertions', positive_key='is_correct', ), AccuracyEvaluator(), ], ) report = dataset.evaluate_sync(my_classifier) # Access analyses programmatically for analysis in report.analyses: print(f'{analysis.type}: {analysis.title}') #> confusion_matrix: Animal Classification #> precision_recall: Precision-Recall Curve #> scalar: Precision-Recall Curve AUC #> line_plot: ROC Curve #> scalar: ROC Curve AUC #> line_plot: KS Plot #> scalar: KS Statistic #> scalar: Accuracy ``` ## Next Steps - **[Built-in Evaluators](built-in.md)** — Case-level evaluator reference - **[Custom Evaluators](custom.md)** — Writing case-level evaluators - **[Logfire Integration](../how-to/logfire-integration.md)** — Viewing analyses in the Logfire UI

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/pydantic/pydantic-ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

report-evaluators.md•23.7 KiB