# Multi-Run Evaluation
Run each case multiple times to measure variability and get more reliable aggregate results.
## Overview
AI systems are inherently stochastic — the same input can produce different outputs across runs. The `repeat` parameter lets you run each case multiple times and automatically aggregates the results, giving you a clearer picture of your system's typical behavior.
## Basic Usage
Pass `repeat` to [`evaluate()`][pydantic_evals.dataset.Dataset.evaluate] or [`evaluate_sync()`][pydantic_evals.dataset.Dataset.evaluate_sync]:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ]
)


def task(inputs: str) -> str:
    return inputs.upper()


# Run each case 5 times
report = dataset.evaluate_sync(task, repeat=5)

# 2 cases × 5 repeats = 10 total runs
print(len(report.cases))
#> 10
```
When `repeat > 1`, each run gets an indexed name like `greeting [1/5]`, `greeting [2/5]`, etc., while the original case name is preserved in [`source_case_name`][pydantic_evals.reporting.ReportCase.source_case_name] for grouping.
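For example, you can pull out all runs of a single case by filtering on `source_case_name` (a minimal sketch building on the example above; the run names are sorted so the output doesn't rely on any particular ordering of `report.cases`):

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ]
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=5)

# Collect every run that came from the original 'greeting' case
greeting_runs = [c for c in report.cases if c.source_case_name == 'greeting']
print(len(greeting_runs))
#> 5

# Runs carry indexed names; sort them rather than assuming run order
print(sorted(c.name for c in greeting_runs)[:2])
#> ['greeting [1/5]', 'greeting [2/5]']
```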
## Accessing Grouped Results
Use [`case_groups()`][pydantic_evals.reporting.EvaluationReport.case_groups] to access runs organized by original case, with per-group aggregated statistics:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='greeting', inputs='Say hello'),
        Case(name='farewell', inputs='Say goodbye'),
    ]
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

groups = report.case_groups()
assert groups is not None  # None for single-run (repeat=1)

print(len(groups))
#> 2

group_names = [g.name for g in groups]
print(group_names)
#> ['greeting', 'farewell']

# Each group has 3 runs and aggregated statistics
for group in groups:
    assert len(group.runs) == 3
    assert len(group.failures) == 0
    assert group.summary.task_duration > 0
```
Each [`ReportCaseGroup`][pydantic_evals.reporting.ReportCaseGroup] contains:
- `name` — the original case name
- `runs` — the individual [`ReportCase`][pydantic_evals.reporting.ReportCase] results
- `failures` — any runs that raised exceptions
- `summary` — a [`ReportCaseAggregate`][pydantic_evals.reporting.ReportCaseAggregate] with averaged scores, metrics, labels, assertions, and durations
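For example, per-group summaries make it easy to spot cases whose runs disagree. A minimal sketch, assuming groups come back in dataset order and that `summary.assertions` holds the averaged assertion pass rate described above:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(name='easy', inputs='hello', expected_output='HELLO'),
        Case(name='hard', inputs='world', expected_output='WORLD'),
    ],
    evaluators=[EqualsExpected()],
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

groups = report.case_groups()
assert groups is not None

for group in groups:
    # Averaged assertion pass rate across this case's runs (1.0 = all passed)
    print(f'{group.name}: {len(group.runs)} runs, assertions={group.summary.assertions}')
#> easy: 3 runs, assertions=1.0
#> hard: 3 runs, assertions=1.0
```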
## Aggregation
With `repeat > 1`, the report's [`averages()`][pydantic_evals.reporting.EvaluationReport.averages] uses a two-level aggregation strategy:
1. **Per-group averages**: Each case's runs are averaged into a group summary
2. **Cross-group averages**: The group summaries are averaged to produce the final result
This ensures each original case contributes equally to the overall averages, regardless of how many of its runs succeeded or failed:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(name='easy', inputs='hello', expected_output='HELLO'),
        Case(name='hard', inputs='world', expected_output='WORLD'),
    ],
    evaluators=[EqualsExpected()],
)


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task, repeat=3)

averages = report.averages()
assert averages is not None

print(f'Overall assertion rate: {averages.assertions}')
#> Overall assertion rate: 1.0
```
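To see why the two levels matter, here is a plain-Python sketch with made-up scores (illustration only, not library code). If one case loses a run to a failure, averaging per group first keeps both cases weighted equally in the overall result:

```python
# Hypothetical per-run scores; 'hard' has one fewer successful run
runs = {
    'easy': [1.0, 1.0, 1.0],
    'hard': [0.0, 1.0],
}

# Level 1: average each case's runs into a group summary
group_means = {name: sum(scores) / len(scores) for name, scores in runs.items()}
print(group_means)
#> {'easy': 1.0, 'hard': 0.5}

# Level 2: average the group summaries, so each case counts exactly once
print(sum(group_means.values()) / len(group_means))
#> 0.75

# A flat average over all five runs would overweight 'easy'
print(sum(s for scores in runs.values() for s in scores) / 5)
#> 0.8
```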
## Default Behavior
When `repeat=1` (the default), behavior is identical to a standard evaluation — no run indexing, no `source_case_name`, and `case_groups()` returns `None`:
```python
from pydantic_evals import Case, Dataset

dataset = Dataset(cases=[Case(name='test', inputs='hello')])


def task(inputs: str) -> str:
    return inputs.upper()


report = dataset.evaluate_sync(task)  # repeat=1 by default

assert report.case_groups() is None
assert all(c.source_case_name is None for c in report.cases)
```
## Next Steps
- **[Concurrency & Performance](concurrency.md)** — Control parallel execution with `max_concurrency`
- **[Metrics & Attributes](metrics-attributes.md)** — Track custom metrics across runs
- **[Logfire Integration](logfire-integration.md)** — Visualize multi-run results