# Built-in Evaluators

Pydantic Evals provides several built-in evaluators for common evaluation tasks.

## Comparison Evaluators

### EqualsExpected

Check if the output exactly equals the expected output from the case.

```python
from pydantic_evals.evaluators import EqualsExpected

EqualsExpected()
```

**Parameters:** None

**Returns:** `bool` - `True` if `ctx.output == ctx.expected_output`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

dataset = Dataset(
    cases=[
        Case(
            name='addition',
            inputs='2 + 2',
            expected_output='4',
        ),
    ],
    evaluators=[EqualsExpected()],
)
```

**Notes:**

- Skips evaluation if `expected_output` is `None` (returns empty dict `{}`)
- Uses Python's `==` operator, so works with any comparable types
- For structured data, considers nested equality

---

### Equals

Check if the output equals a specific value.

```python
from pydantic_evals.evaluators import Equals

Equals(value='expected_result')
```

**Parameters:**

- `value` (Any): The value to compare against
- `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** `bool` - `True` if `ctx.output == value`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Equals

# Check output is always "success"
dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        Equals(value='success', evaluation_name='is_success'),
    ],
)
```

**Use Cases:**

- Checking for sentinel values
- Validating consistent outputs
- Testing classification into specific categories

---

### Contains

Check if the output contains a specific value or substring.

```python
from pydantic_evals.evaluators import Contains

Contains(
    value='substring',
    case_sensitive=True,
    as_strings=False,
)
```

**Parameters:**

- `value` (Any): The value to search for
- `case_sensitive` (bool): Case-sensitive comparison for strings (default: `True`)
- `as_strings` (bool): Convert both values to strings before checking (default: `False`)
- `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** [`EvaluationReason`][pydantic_evals.evaluators.EvaluationReason] - Pass/fail with explanation

**Behavior:**

For **strings**: checks substring containment

- `Contains(value='hello', case_sensitive=False)`
- Matches: "Hello World", "say hello", "HELLO"
- Doesn't match: "hi there"

For **lists/tuples**: checks membership

- `Contains(value='apple')`
- Matches: `['apple', 'banana']`, `('apple',)`
- Doesn't match: `['apples', 'orange']`

For **dicts**: checks key-value pairs

- `Contains(value={'name': 'Alice'})`
- Matches: `{'name': 'Alice', 'age': 30}`
- Doesn't match: `{'name': 'Bob'}`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check for required keywords
        Contains(value='terms and conditions', case_sensitive=False),
        # Check for PII (fail if found)
        # Note: Use a custom evaluator that returns False when PII found
        # (see the sketch below)
    ],
)
```

**Use Cases:**

- Required content verification
- Keyword detection
- PII/sensitive data detection
- Multi-value validation
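The PII check mentioned in the example above isn't something `Contains` can express directly, because the assertion should *fail* when the pattern is present. Below is a minimal sketch of such a custom evaluator, following the subclass-`Evaluator` pattern described in [Custom Evaluators](custom.md); the `NoPII` name and the regex patterns are illustrative only and would need tuning for real use:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class NoPII(Evaluator):
    """Illustrative evaluator: fail when the output appears to contain PII."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        text = str(ctx.output)
        # Very rough email and phone patterns - placeholders, not production-grade detection.
        has_email = re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', text) is not None
        has_phone = re.search(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', text) is not None
        # True (pass) only when neither pattern is found.
        return not (has_email or has_phone)
```

An evaluator like this can sit in the `evaluators` list alongside `Contains` checks.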
---

## Type Validation

### IsInstance

Check if the output is an instance of a type with the given name.

```python
from pydantic_evals.evaluators import IsInstance

IsInstance(type_name='str')
```

**Parameters:**

- `type_name` (str): The type name to check (uses `__name__` or `__qualname__`)
- `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** [`EvaluationReason`][pydantic_evals.evaluators.EvaluationReason] - Pass/fail with type information

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check output is always a string
        IsInstance(type_name='str'),
        # Check for Pydantic model
        IsInstance(type_name='MyModel'),
        # Check for dict
        IsInstance(type_name='dict'),
    ],
)
```

**Notes:**

- Matches against both `__name__` and `__qualname__` of the type
- Works with built-in types (`str`, `int`, `dict`, `list`, etc.)
- Works with custom classes and Pydantic models
- Checks the entire MRO (Method Resolution Order) for inheritance (see the sketch below)

**Use Cases:**

- Format validation
- Structured output verification
- Type consistency checks
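To illustrate the MRO note above, here is a small, hypothetical end-to-end run: the task returns a `Dog`, and `IsInstance(type_name='Animal')` still passes because `Animal` appears in `Dog`'s MRO. The `Animal`/`Dog` classes and the `classify` task are placeholders, not part of the library:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance


class Animal:
    pass


class Dog(Animal):
    pass


async def classify(inputs: str) -> Animal:
    # Placeholder task: always returns a Dog instance.
    return Dog()


dataset = Dataset(
    cases=[Case(name='mro_check', inputs='fido')],
    # Passes even though the concrete type is Dog, because Animal is in Dog's MRO.
    evaluators=[IsInstance(type_name='Animal')],
)

report = dataset.evaluate_sync(classify)
report.print()
```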
---

## Performance Evaluation

### MaxDuration

Check if task execution time is under a maximum threshold.

```python
from datetime import timedelta

from pydantic_evals.evaluators import MaxDuration

MaxDuration(seconds=2.0)
# or
MaxDuration(seconds=timedelta(seconds=2))
```

**Parameters:**

- `seconds` (float | timedelta): Maximum allowed duration

**Returns:** `bool` - `True` if `ctx.duration <= seconds`

**Example:**

```python
from datetime import timedelta

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import MaxDuration

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # SLA: must respond in under 2 seconds
        MaxDuration(seconds=2.0),
        # Or using timedelta
        MaxDuration(seconds=timedelta(milliseconds=500)),
    ],
)
```

**Use Cases:**

- SLA compliance
- Performance regression testing
- Latency requirements
- Timeout validation

**See Also:** [Concurrency & Performance](../how-to/concurrency.md)

---

## LLM-as-a-Judge

### LLMJudge

Use an LLM to evaluate subjective qualities based on a rubric.

```python
from pydantic_evals.evaluators import LLMJudge

LLMJudge(
    rubric='Response is accurate and helpful',
    model='openai:gpt-4o',
    include_input=False,
    include_expected_output=False,
    model_settings=None,
    score=False,
    assertion={'include_reason': True},
)
```

**Parameters:**

- `rubric` (str): The evaluation criteria (required)
- `model` (Model | KnownModelName | None): Model to use (default: `'openai:gpt-4o'`)
- `include_input` (bool): Include task inputs in the prompt (default: `False`)
- `include_expected_output` (bool): Include expected output in the prompt (default: `False`)
- `model_settings` (ModelSettings | None): Custom model settings (see the sketch at the end of this section)
- `score` (OutputConfig | False): Configure score output (default: `False`)
- `assertion` (OutputConfig | False): Configure assertion output (default: includes reason)

**Returns:** Depends on `score` and `assertion` parameters (see below)

**Output Modes:**

By default, returns a **boolean assertion** with reason:

- `LLMJudge(rubric='Response is polite')`
- Returns: `{'LLMJudge_pass': EvaluationReason(value=True, reason='...')}`

Return a **score** (0.0 to 1.0) instead:

- `LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion=False)`
- Returns: `{'LLMJudge_score': EvaluationReason(value=0.85, reason='...')}`

Return **both** score and assertion:

- `LLMJudge(rubric='Response quality', score={'include_reason': True}, assertion={'include_reason': True})`
- Returns: `{'LLMJudge_score': EvaluationReason(value=0.85, reason='...'), 'LLMJudge_pass': EvaluationReason(value=True, reason='...')}`

**Customize evaluation names:**

- `LLMJudge(rubric='Response is factually accurate', assertion={'evaluation_name': 'accuracy', 'include_reason': True})`
- Returns: `{'accuracy': EvaluationReason(value=True, reason='...')}`

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Basic accuracy check
        LLMJudge(
            rubric='Response is factually accurate',
            include_input=True,
        ),
        # Quality score with different model
        LLMJudge(
            rubric='Overall response quality',
            model='anthropic:claude-3-7-sonnet-latest',
            score={'evaluation_name': 'quality', 'include_reason': False},
            assertion=False,
        ),
        # Check against expected output
        LLMJudge(
            rubric='Response matches the expected answer semantically',
            include_input=True,
            include_expected_output=True,
        ),
    ],
)
```

**See Also:** [LLM Judge Deep Dive](llm-judge.md)
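The `model_settings` parameter takes the same settings object used elsewhere with Pydantic AI models. A minimal sketch, assuming `pydantic_ai` is available in your environment; lowering the temperature is one way to make judge verdicts more repeatable across runs:

```python
from pydantic_ai.settings import ModelSettings
from pydantic_evals.evaluators import LLMJudge

# Lower temperature for more consistent judge verdicts across runs.
judge = LLMJudge(
    rubric='Response is factually accurate',
    include_input=True,
    model_settings=ModelSettings(temperature=0.0),
)
```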
---

## Span-Based Evaluation

### HasMatchingSpan

Check if OpenTelemetry spans match a query (requires Logfire configuration).

```python
from pydantic_evals.evaluators import HasMatchingSpan

HasMatchingSpan(
    query={'name_contains': 'tool_call'},
    evaluation_name='called_tool',
)
```

**Parameters:**

- `query` ([`SpanQuery`][pydantic_evals.otel.SpanQuery]): Query to match against spans
- `evaluation_name` (str | None): Custom name for this evaluation in reports

**Returns:** `bool` - `True` if any span matches the query

**Example:**

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Check that a specific tool was called
        HasMatchingSpan(
            query={'name_contains': 'search_database'},
            evaluation_name='used_database',
        ),
        # Check for errors
        HasMatchingSpan(
            query={'has_attributes': {'error': True}},
            evaluation_name='had_errors',
        ),
        # Check duration constraints
        HasMatchingSpan(
            query={
                'name_equals': 'llm_call',
                'max_duration': 2.0,  # seconds
            },
            evaluation_name='llm_fast_enough',
        ),
    ],
)
```

**See Also:** [Span-Based Evaluation](span-based.md)

---

## Quick Reference Table

| Evaluator | Purpose | Return Type | Cost | Speed |
|-----------|---------|-------------|------|-------|
| [`EqualsExpected`][pydantic_evals.evaluators.EqualsExpected] | Exact match with expected | `bool` | Free | Instant |
| [`Equals`][pydantic_evals.evaluators.Equals] | Equals specific value | `bool` | Free | Instant |
| [`Contains`][pydantic_evals.evaluators.Contains] | Contains value/substring | `bool` + reason | Free | Instant |
| [`IsInstance`][pydantic_evals.evaluators.IsInstance] | Type validation | `bool` + reason | Free | Instant |
| [`MaxDuration`][pydantic_evals.evaluators.MaxDuration] | Performance threshold | `bool` | Free | Instant |
| [`LLMJudge`][pydantic_evals.evaluators.LLMJudge] | Subjective quality | `bool` and/or `float` | $$ | Slow |
| [`HasMatchingSpan`][pydantic_evals.evaluators.HasMatchingSpan] | Behavioral check | `bool` | Free | Fast |

## Combining Evaluators

Best practice is to combine fast deterministic checks with slower LLM evaluations:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        # Fast checks first (fail fast)
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Expensive LLM checks last
        LLMJudge(rubric='Response is helpful and accurate'),
    ],
)
```

This approach:

1. Catches format/structure issues immediately
2. Validates required content quickly
3. Only runs expensive LLM evaluation if basic checks pass
4. Provides comprehensive quality assessment
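To actually run a dataset like the one above, pass your task function to the dataset and print the resulting report. A minimal sketch, assuming the placeholder `answer_question` task and that credentials for the `LLMJudge` model (OpenAI by default) are configured:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge, MaxDuration


async def answer_question(question: str) -> str:
    # Placeholder task; in practice this would call your agent or model.
    return 'Paris is the capital of France.'


dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?')],
    evaluators=[
        # Fast deterministic checks first, then the LLM judge.
        IsInstance(type_name='str'),
        MaxDuration(seconds=2.0),
        LLMJudge(rubric='Response answers the question accurately', include_input=True),
    ],
)

report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
```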

## Next Steps

- **[LLM Judge](llm-judge.md)** - Deep dive on LLM-as-a-Judge evaluation
- **[Custom Evaluators](custom.md)** - Write your own evaluation logic
- **[Span-Based Evaluation](span-based.md)** - Using OpenTelemetry spans for behavioral checks