# Evaluators Overview

Evaluators are the core of Pydantic Evals. They analyze task outputs and provide scores, labels, or pass/fail assertions.

## When to Use Different Evaluators

### Deterministic Checks (Fast & Reliable)

Use deterministic evaluators when you can define exact rules:

| Evaluator | Use Case | Example |
|-----------|----------|---------|
| [`EqualsExpected`][pydantic_evals.evaluators.EqualsExpected] | Exact output match | Structured data, classification |
| [`Equals`][pydantic_evals.evaluators.Equals] | Equals specific value | Checking for sentinel values |
| [`Contains`][pydantic_evals.evaluators.Contains] | Substring/element check | Required keywords, PII detection |
| [`IsInstance`][pydantic_evals.evaluators.IsInstance] | Type validation | Format validation |
| [`MaxDuration`][pydantic_evals.evaluators.MaxDuration] | Performance threshold | SLA compliance |
| [`HasMatchingSpan`][pydantic_evals.evaluators.HasMatchingSpan] | Behavior verification | Tool calls, code paths |

**Advantages:**

- Fast execution (microseconds to milliseconds)
- Deterministic results
- No cost
- Easy to debug

**When to use:**

- Format validation (JSON structure, type checking)
- Required content checks (must contain X, must not contain Y)
- Performance requirements (latency, token counts)
- Behavioral checks (which tools were called, which code paths executed)

### LLM-as-a-Judge (Flexible & Nuanced)

Use [`LLMJudge`][pydantic_evals.evaluators.LLMJudge] when evaluation requires understanding or judgment:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='What is 2+2?', expected_output='4')],
    evaluators=[
        LLMJudge(
            rubric='Response is factually accurate based on the input',
            include_input=True,
        )
    ],
)
```

**Advantages:**

- Can evaluate subjective qualities (helpfulness, tone, creativity)
- Understands natural language
- Can follow complex rubrics
- Flexible across domains

**Disadvantages:**

- Slower (seconds per evaluation)
- Costs money
- Non-deterministic
- Can have biases

**When to use:**

- Factual accuracy
- Relevance and helpfulness
- Tone and style
- Completeness
- Following instructions
- RAG quality (groundedness, citation accuracy)

### Custom Evaluators

Custom evaluators let you apply any evaluation logic the framework doesn't provide out of the box; they are frequently useful for domain-specific checks:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ValidSQL(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        try:
            import sqlparse

            sqlparse.parse(ctx.output)
            return True
        except Exception:
            return False
```

**When to use:**

- Domain-specific validation (SQL syntax, regex patterns, business rules)
- External API calls (running generated code, checking databases)
- Complex calculations (precision/recall, BLEU scores; see the sketch below)
- Integration checks (does the API call succeed?)
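The "complex calculations" bullet above can be made concrete with a scoring evaluator. The following is a hedged sketch rather than a built-in evaluator: `KeywordRecall` is a hypothetical name, and it assumes both `ctx.output` and `ctx.expected_output` are strings.

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class KeywordRecall(Evaluator):
    """Hypothetical evaluator: fraction of expected tokens that appear in the output."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        expected = set(str(ctx.expected_output or '').lower().split())
        if not expected:
            return 0.0
        produced = set(str(ctx.output).lower().split())
        return len(expected & produced) / len(expected)
```

Because it returns a `float`, it shows up as a score in reports rather than a pass/fail assertion (see the return types below).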
## Evaluation Types

!!! info "Detailed Return Types Guide"
    For full detail about precisely what custom Evaluators may return, see [Custom Evaluator Return Types](custom.md#return-types).

Evaluators essentially return three types of results:

### 1. Assertions (bool)

Pass/fail checks that appear as ✔ or ✗ in reports:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class HasKeyword(Evaluator):
    keyword: str

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return self.keyword in ctx.output
```

**Use for:** Binary checks, quality gates, compliance requirements

### 2. Scores (int or float)

Numeric metrics:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ConfidenceScore(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        # Analyze and return score
        return 0.87  # 87% confidence
```

**Use for:** Quality metrics, ranking, A/B testing, regression tracking

### 3. Labels (str)

Categorical classifications:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SentimentClassifier(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> str:
        if 'error' in ctx.output.lower():
            return 'error'
        elif 'success' in ctx.output.lower():
            return 'success'
        return 'neutral'
```

**Use for:** Classification, error categorization, quality buckets

### Multiple Results

You can return multiple evaluations from a single evaluator:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ComprehensiveCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | float | str]:
        return {
            'valid_format': self._check_format(ctx.output),  # bool
            'quality_score': self._score_quality(ctx.output),  # float
            'category': self._classify(ctx.output),  # str
        }

    def _check_format(self, output: str) -> bool:
        return True

    def _score_quality(self, output: str) -> float:
        return 0.85

    def _classify(self, output: str) -> str:
        return 'good'
```

## Combining Evaluators

Mix and match evaluators to create comprehensive evaluation suites:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Fast deterministic checks first
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Slower LLM checks after
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
        ),
    ],
)
```
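To run a suite like this end to end, pass the task under test to the dataset. The following is a minimal sketch: the `answer` task function is hypothetical, and it assumes the `Dataset.evaluate_sync` and `report.print` entry points (check the dataset docs if your version differs):

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, IsInstance


async def answer(inputs: str) -> str:
    # Hypothetical task under test -- normally this would call your agent.
    return 'result with required_field'


dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[IsInstance(type_name='str'), Contains(value='required_field')],
)

report = dataset.evaluate_sync(answer)  # runs the task and every evaluator for each case
report.print(include_input=True, include_output=True)
```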
## Case-specific evaluators

Case-specific evaluators are one of the most powerful features for building comprehensive evaluation suites. You can attach evaluators to individual [`Case`][pydantic_evals.Case] objects that only run for those specific cases:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='greeting_response',
            inputs='Say hello',
            evaluators=[
                # This evaluator only runs for this case
                LLMJudge(
                    rubric='Response is warm and friendly, uses casual tone',
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='formal_response',
            inputs='Write a business email',
            evaluators=[
                # Different requirements for this case
                LLMJudge(
                    rubric='Response is professional and formal, uses business language',
                    include_input=True,
                ),
            ],
        ),
    ],
    evaluators=[
        # This runs for ALL cases
        IsInstance(type_name='str'),
    ],
)
```

### Why Case-Specific Evaluators Matter

Case-specific evaluators solve a fundamental problem with one-size-fits-all evaluation: **if you could write a single evaluator rubric that perfectly captured your requirements across all cases, you'd just incorporate that rubric into your agent's instructions**. (Note: this is less relevant when you want to use a cheaper model in production and assess it with a more expensive model, but in many cases it makes sense to use the best model you can in production.)

The power of case-specific evaluation comes from the nuance:

- **Different cases have different requirements**: A customer support response needs empathy; a technical API response needs precision
- **Avoid "inmates running the asylum"**: If your LLMJudge rubric is generic enough to work everywhere, your agent should already be following it
- **Capture nuanced golden behavior**: Each case can specify exactly what "good" looks like for that scenario
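Relatedly, when the model under test is cheaper than the one you trust to judge it (per the note above), you can point individual judges at a stronger model. This is a hedged sketch: it assumes `LLMJudge` accepts a `model` argument, and the model name shown is only an example; see the LLM Judge page linked under Next Steps for the exact parameters.

```python
from pydantic_evals import Case
from pydantic_evals.evaluators import LLMJudge

empathy_case = Case(
    name='empathetic_support',
    inputs='I want my money back',
    evaluators=[
        LLMJudge(
            rubric='Acknowledges the request empathetically and offers concrete next steps',
            include_input=True,
            # Assumed parameter: judge with a stronger model than the agent under test.
            model='openai:gpt-4o',
        ),
    ],
)
```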
### Building Golden Datasets with Case-Specific LLMJudge

A particularly powerful pattern is using case-specific [`LLMJudge`][pydantic_evals.evaluators.LLMJudge] evaluators to quickly build comprehensive, maintainable evaluation suites. Instead of needing exact `expected_output` values, you can describe what you care about:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='handle_refund_request',
            inputs={'query': 'I want my money back', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Acknowledge the refund request empathetically
                    2. Ask for the reason for the refund
                    3. Mention our 30-day refund policy
                    4. NOT process the refund immediately (needs manager approval)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_shipping_question',
            inputs={'query': 'Where is my order?', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Confirm the order number
                    2. Provide tracking information
                    3. Give estimated delivery date
                    4. Be brief and factual (not overly apologetic)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_angry_customer',
            inputs={'query': 'This is completely unacceptable!', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Prioritize de-escalation with empathy
                    2. Avoid being defensive
                    3. Offer concrete next steps
                    4. Use phrases like "I understand" and "Let me help"
                    """,
                    include_input=True,
                ),
            ],
        ),
    ],
)
```

This approach lets you:

- **Build comprehensive test suites quickly**: Just describe what you want per case
- **Maintain easily**: Update rubrics as requirements change, without regenerating outputs
- **Cover edge cases naturally**: Add new cases with specific requirements as you discover them
- **Capture domain knowledge**: Each rubric documents what "good" means for that scenario

The LLM evaluator excels at understanding nuanced requirements and assessing compliance, making this a practical way to create thorough evaluation coverage without brittleness.

## Async vs Sync

Evaluators can be sync or async:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SyncEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return True


async def some_async_operation() -> bool:
    return True


@dataclass
class AsyncEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> bool:
        result = await some_async_operation()
        return result
```

Pydantic Evals handles both automatically. Use async when:

- Making API calls
- Running database queries
- Performing I/O operations
- Calling LLMs (like [`LLMJudge`][pydantic_evals.evaluators.LLMJudge])

## Evaluation Context

All evaluators receive an [`EvaluatorContext`][pydantic_evals.evaluators.EvaluatorContext]:

- `ctx.inputs` - Task inputs
- `ctx.output` - Task output (to evaluate)
- `ctx.expected_output` - Expected output (if provided)
- `ctx.metadata` - Case metadata (if provided)
- `ctx.duration` - Task execution time (seconds)
- `ctx.span_tree` - OpenTelemetry spans (if logfire configured)
- `ctx.metrics` - Custom metrics dict
- `ctx.attributes` - Custom attributes dict

This gives evaluators full context to make informed assessments.

## Error Handling

If an evaluator raises an exception, it's captured as an [`EvaluatorFailure`][pydantic_evals.evaluators.EvaluatorFailure]:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


def risky_operation(output: str) -> bool:
    # This might raise an exception
    if 'error' in output:
        raise ValueError('Found error in output')
    return True


@dataclass
class RiskyEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # If this raises an exception, it will be captured
        result = risky_operation(ctx.output)
        return result
```

Failures appear in `report.cases[i].evaluator_failures` with:

- Evaluator name
- Error message
- Full stacktrace

Use retry configuration to handle transient failures (see [Retry Strategies](../how-to/retry-strategies.md)).
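To see a captured failure without crashing the run, here is a hedged sketch: the `AlwaysFails` evaluator and `task` function are hypothetical, and it assumes the `evaluate_sync` entry point used earlier, while the failure details are only accessed through the `report.cases[i].evaluator_failures` path described above.

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class AlwaysFails(Evaluator):
    """Hypothetical evaluator that raises, to demonstrate failure capture."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        raise RuntimeError('evaluator crashed')


async def task(inputs: str) -> str:
    return inputs.upper()


dataset = Dataset(cases=[Case(name='demo', inputs='hello')], evaluators=[AlwaysFails()])
report = dataset.evaluate_sync(task)

for report_case in report.cases:
    for failure in report_case.evaluator_failures:
        # Each failure records the evaluator name, error message, and stacktrace.
        print(failure)
```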
## Next Steps

- **[Built-in Evaluators](built-in.md)** - Complete reference of all provided evaluators
- **[LLM Judge](llm-judge.md)** - Deep dive on LLM-as-a-Judge evaluation
- **[Custom Evaluators](custom.md)** - Write your own evaluation logic
- **[Span-Based Evaluation](span-based.md)** - Evaluate using OpenTelemetry spans