# Span-Based Evaluation
Evaluate AI system behavior by analyzing OpenTelemetry spans captured during execution.
!!! note "Requires Logfire"
Span-based evaluation requires `logfire` to be installed and configured:
```bash
pip install 'pydantic-evals[logfire]'
```
## Overview
Span-based evaluation enables you to evaluate **how** your AI system executes, not just **what** it produces. This is essential for complex agents, where correct behavior depends on the execution path taken, not just the final output.
### Why Span-Based Evaluation?
Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient—if the output is correct, the task succeeded. But for complex multi-step agents, the _process_ matters as much as the result:
- **A correct answer reached incorrectly** - An agent might produce the right output by accident (e.g., guessing, using cached data when it should have searched, calling the wrong tools but getting lucky)
- **Verification of required behaviors** - You need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- **Performance and efficiency** - The agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- **Safety and compliance** - Critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, or guardrails weren't bypassed
### Real-World Scenarios
Span-based evaluation is particularly valuable for:
- **RAG systems** - Verify documents were retrieved and reranked before generation, not just that the answer included citations
- **Multi-agent coordination** - Ensure the orchestrator delegated to the right specialist agents in the correct order
- **Tool-calling agents** - Confirm specific tools were used (or avoided), and in the expected sequence
- **Debugging and regression testing** - Catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- **Production alignment** - Ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
### How It Works
When you configure Logfire (`logfire.configure()`), Pydantic Evals captures all OpenTelemetry spans generated during task execution. You can then write evaluators that assert conditions on:
- **Which tools were called** - `HasMatchingSpan(query={'name_contains': 'search_tool'})`
- **Code paths executed** - Verify specific functions ran or particular branches taken
- **Timing characteristics** - Check that operations complete within SLA bounds
- **Error conditions** - Detect retries, fallbacks, or specific failure modes
- **Execution structure** - Verify parent-child relationships, delegation patterns, or execution order
This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
## Basic Usage
```python
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
# Configure logfire to capture spans
logfire.configure(send_to_logfire='if-token-present')
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Check that database was queried
HasMatchingSpan(
query={'name_contains': 'database_query'},
evaluation_name='used_database',
),
],
)
```
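The dataset above only declares the check; the task under test also has to emit spans for the evaluator to inspect. Below is a minimal end-to-end sketch, assuming a hypothetical `lookup` task that opens a `database_query` span via Logfire:

```python
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan

logfire.configure(send_to_logfire='if-token-present')


async def lookup(query: str) -> str:
    # Any spans opened here (directly, or by instrumented libraries) are
    # captured and made available to span-based evaluators.
    with logfire.span('database_query'):
        return f'result for {query}'


dataset = Dataset(
    cases=[Case(inputs='test')],
    evaluators=[
        HasMatchingSpan(
            query={'name_contains': 'database_query'},
            evaluation_name='used_database',
        ),
    ],
)

report = dataset.evaluate_sync(lookup)
report.print()
```

The `used_database` result then shows up per case in the printed report.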
## HasMatchingSpan Evaluator
The [`HasMatchingSpan`][pydantic_evals.evaluators.HasMatchingSpan] evaluator checks if any span matches a query:
```python
from pydantic_evals.evaluators import HasMatchingSpan
HasMatchingSpan(
query={'name_contains': 'test'},
evaluation_name='span_check',
)
```
**Returns:** `bool` - `True` if any span matches the query
## SpanQuery Reference
A [`SpanQuery`][pydantic_evals.otel.SpanQuery] is a dictionary with query conditions:
### Name Conditions
Match spans by name:
```python
# Exact name match
{'name_equals': 'search_database'}
# Contains substring
{'name_contains': 'tool_call'}
# Regex pattern
{'name_matches_regex': r'llm_call_\d+'}
```
### Attribute Conditions
Match spans with specific attributes:
```python
# Has specific attribute values
{'has_attributes': {'operation': 'search', 'status': 'success'}}
# Has attribute keys (any value)
{'has_attribute_keys': ['user_id', 'request_id']}
```
### Duration Conditions
Match based on execution time:
```python
from datetime import timedelta
# Minimum duration
{'min_duration': 1.0} # seconds
{'min_duration': timedelta(seconds=1)}
# Maximum duration
{'max_duration': 5.0} # seconds
{'max_duration': timedelta(seconds=5)}
# Range
{'min_duration': 0.5, 'max_duration': 2.0}
```
### Logical Operators
Combine conditions:
```python
# NOT
{'not_': {'name_contains': 'error'}}
# AND (all must match)
{'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# OR (any must match)
{'or_': [
{'name_equals': 'search'},
{'name_equals': 'query'},
]}
```
### Child/Descendant Conditions
Query relationships between spans:
```python
# Count direct children
{'min_child_count': 1}
{'max_child_count': 5}
# Some child matches query
{'some_child_has': {'name_contains': 'retry'}}
# All children match query
{'all_children_have': {'max_duration': 0.5}}
# No children match query
{'no_child_has': {'has_attributes': {'error': True}}}
# Descendant queries (recursive)
{'min_descendant_count': 5}
{'some_descendant_has': {'name_contains': 'api_call'}}
```
### Ancestor/Depth Conditions
Query span hierarchy:
```python
# Depth (root spans have depth 0)
{'min_depth': 1} # Not a root span
{'max_depth': 2} # At most 2 levels deep
# Ancestor queries
{'some_ancestor_has': {'name_equals': 'agent_run'}}
{'all_ancestors_have': {'max_duration': 10.0}}
{'no_ancestor_has': {'has_attributes': {'error': True}}}
```
### Stop Recursing
Control recursive queries:
```python
{
'some_descendant_has': {'name_contains': 'expensive'},
'stop_recursing_when': {'name_equals': 'boundary'},
}
# Only search descendants until hitting a span named 'boundary'
```
## Practical Examples
### Verify Tool Usage
Check that specific tools were called:
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Must call search tool
HasMatchingSpan(
query={'name_contains': 'search_tool'},
evaluation_name='used_search',
),
        # Must NOT call dangerous tool (anchored at the root span, since
        # HasMatchingSpan passes as soon as *any* span matches the query)
        HasMatchingSpan(
            query={'and_': [
                {'max_depth': 0},
                {'not_': {'some_descendant_has': {'name_contains': 'delete_database'}}},
            ]},
            evaluation_name='safe_execution',
        ),
],
)
```
### Check Multiple Tools
Verify a sequence of operations:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
HasMatchingSpan(
query={'name_contains': 'retrieve_context'},
evaluation_name='retrieved_context',
),
HasMatchingSpan(
query={'name_contains': 'generate_response'},
evaluation_name='generated_response',
),
HasMatchingSpan(
query={'and_': [
{'name_contains': 'cite'},
{'has_attribute_keys': ['source_id']},
]},
evaluation_name='added_citations',
),
]
```
### Performance Assertions
Ensure operations meet latency requirements:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Database queries should be fast
HasMatchingSpan(
query={'and_': [
{'name_contains': 'database'},
{'max_duration': 0.1}, # 100ms max
]},
evaluation_name='fast_db_queries',
),
# Overall should complete quickly
HasMatchingSpan(
query={'and_': [
{'name_equals': 'task_execution'},
{'max_duration': 2.0},
]},
evaluation_name='within_sla',
),
]
```
### Error Detection
Check for error conditions:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
    # No errors occurred (anchored at the root span, since HasMatchingSpan
    # passes as soon as *any* span matches the query)
    HasMatchingSpan(
        query={'and_': [
            {'max_depth': 0},
            {'not_': {'some_descendant_has': {'has_attributes': {'error': True}}}},
        ]},
        evaluation_name='no_errors',
    ),
# Retries happened
HasMatchingSpan(
query={'name_contains': 'retry'},
evaluation_name='had_retries',
),
# Fallback was used
HasMatchingSpan(
query={'name_contains': 'fallback_model'},
evaluation_name='used_fallback',
),
]
```
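Note that `HasMatchingSpan` passes as soon as *any* span matches its query, which is why absence checks like `no_errors` above are anchored at the root span. An alternative that can read more clearly is a small custom evaluator that inverts `span_tree.any()`; a minimal sketch (the `NoMatchingSpan` name is illustrative, not part of the library):

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.otel import SpanQuery


@dataclass
class NoMatchingSpan(Evaluator):
    """Passes only if no span in the tree matches the query."""

    query: SpanQuery

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return not ctx.span_tree.any(self.query)


# e.g. assert that no span anywhere recorded an error attribute
no_errors = NoMatchingSpan(query={'has_attributes': {'error': True}})
```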
### Complex Behavioral Checks
Verify sophisticated behavior patterns:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Agent delegated to sub-agent
HasMatchingSpan(
query={'and_': [
{'name_contains': 'agent'},
{'some_child_has': {'name_contains': 'delegate'}},
]},
evaluation_name='used_delegation',
),
# Made multiple LLM calls with retries
HasMatchingSpan(
query={'and_': [
{'name_contains': 'llm_call'},
{'some_descendant_has': {'name_contains': 'retry'}},
{'min_descendant_count': 3},
]},
evaluation_name='retry_pattern',
),
]
```
## Custom Evaluators with SpanTree
For more complex span analysis, write custom evaluators:
```python
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class CustomSpanCheck(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | int]:
span_tree = ctx.span_tree
# Find specific spans
llm_spans = span_tree.find(lambda node: 'llm' in node.name)
tool_spans = span_tree.find(lambda node: 'tool' in node.name)
# Calculate metrics
total_llm_time = sum(
span.duration.total_seconds() for span in llm_spans
)
return {
'used_llm': len(llm_spans) > 0,
'used_tools': len(tool_spans) > 0,
'tool_count': len(tool_spans),
'llm_fast': total_llm_time < 2.0,
}
```
### SpanTree API
The [`SpanTree`][pydantic_evals.otel.SpanTree] provides methods for span analysis:
```python
from pydantic_evals.otel import SpanTree
# Example API (requires span_tree from context)
def example_api(span_tree: SpanTree) -> None:
span_tree.find(lambda n: True) # Find all matching nodes
span_tree.any({'name_contains': 'test'}) # Check if any span matches
span_tree.all({'name_contains': 'test'}) # Check if all spans match
span_tree.count({'name_contains': 'test'}) # Count matching spans
# Iteration
for node in span_tree:
print(node.name, node.duration, node.attributes)
```
### SpanNode Properties
Each [`SpanNode`][pydantic_evals.otel.SpanNode] has:
```python
from pydantic_evals.otel import SpanNode
# Example properties (requires node from context)
def example_properties(node: SpanNode) -> None:
_ = node.name # Span name
_ = node.duration # timedelta
_ = node.attributes # dict[str, AttributeValue]
_ = node.start_timestamp # datetime
_ = node.end_timestamp # datetime
_ = node.children # list[SpanNode]
_ = node.descendants # list[SpanNode] (recursive)
_ = node.ancestors # list[SpanNode]
_ = node.parent # SpanNode | None
```
## Debugging Span Queries
### View Spans in Logfire
If you're sending data to Logfire, you can view all spans in the web UI to understand the trace structure.
### Print Span Tree
```python
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class DebugSpans(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> bool:
for node in ctx.span_tree:
print(f"{' ' * len(node.ancestors)}{node.name} ({node.duration})")
return True
```
### Query Testing
Test queries incrementally:
```python
from pydantic_evals.evaluators import HasMatchingSpan
# Start simple
query = {'name_contains': 'tool'}
# Add conditions gradually
query = {'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# Test in evaluator
HasMatchingSpan(query=query, evaluation_name='test')
```
## Use Cases
### RAG System Verification
Verify retrieval-augmented generation workflow:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Retrieved documents
HasMatchingSpan(
query={'name_contains': 'vector_search'},
evaluation_name='retrieved_docs',
),
# Reranked results
HasMatchingSpan(
query={'name_contains': 'rerank'},
evaluation_name='reranked_results',
),
# Generated with context
HasMatchingSpan(
query={'and_': [
{'name_contains': 'generate'},
{'has_attribute_keys': ['context_ids']},
]},
evaluation_name='used_context',
),
]
```
### Multi-Agent Systems
Verify agent coordination:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Master agent ran
HasMatchingSpan(
query={'name_equals': 'master_agent'},
evaluation_name='master_ran',
),
# Delegated to specialist
HasMatchingSpan(
query={'and_': [
{'name_contains': 'specialist_agent'},
{'some_ancestor_has': {'name_equals': 'master_agent'}},
]},
evaluation_name='delegated_correctly',
),
    # No circular delegation: no agent span is nested inside another agent
    # while also containing yet another agent (anchored at the root span,
    # since HasMatchingSpan passes as soon as *any* span matches)
    HasMatchingSpan(
        query={'and_': [
            {'max_depth': 0},
            {'not_': {'some_descendant_has': {'and_': [
                {'name_contains': 'agent'},
                {'some_ancestor_has': {'name_contains': 'agent'}},
                {'some_descendant_has': {'name_contains': 'agent'}},
            ]}}},
        ]},
        evaluation_name='no_circular_delegation',
    ),
]
```
### Tool Usage Patterns
Verify intelligent tool selection:
```python
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
    # Search ran as part of answering (a search span nested under an answer span)
HasMatchingSpan(
query={'and_': [
{'name_contains': 'search'},
{'some_ancestor_has': {'name_contains': 'answer'}},
]},
evaluation_name='searched_before_answering',
),
    # Tool spans have a bounded number of direct children (a loose proxy for
    # avoiding tool-call loops; see the sketch below for a precise count)
HasMatchingSpan(
query={'and_': [
{'name_contains': 'tool'},
{'max_child_count': 5},
]},
evaluation_name='reasonable_tool_usage',
),
]
```
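The `max_child_count` check above is only a loose proxy for loop detection. For a hard cap on the total number of tool spans, a custom evaluator can count matches directly; a minimal sketch (the `BoundedToolCalls` name and limit are illustrative):

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class BoundedToolCalls(Evaluator):
    """Passes if the total number of tool spans stays within a limit."""

    max_calls: int = 5

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Count every span whose name contains 'tool' across the whole tree
        return ctx.span_tree.count({'name_contains': 'tool'}) <= self.max_calls
```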
## Best Practices
1. **Start Simple**: Begin with basic name queries, add complexity as needed
2. **Use Descriptive Names**: Name your spans well in your application code
3. **Test Queries**: Verify queries work before running full evaluations
4. **Combine with Other Evaluators**: Use span checks alongside output validation (see the example after this list)
5. **Document Expectations**: Comment why specific spans should/shouldn't exist
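For example, a dataset can assert on both the final answer and the behavior that produced it. A minimal sketch combining `HasMatchingSpan` with the built-in `EqualsExpected` (the case values and span name are illustrative):

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, HasMatchingSpan

dataset = Dataset(
    cases=[Case(inputs='capital of France?', expected_output='Paris')],
    evaluators=[
        # The output must match the expected value...
        EqualsExpected(),
        # ...and it must have been produced by calling the search tool
        HasMatchingSpan(
            query={'name_contains': 'search_tool'},
            evaluation_name='used_search',
        ),
    ],
)
```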
## Next Steps
- **[Logfire Integration](../how-to/logfire-integration.md)** - Set up Logfire for span capture
- **[Custom Evaluators](custom.md)** - Write advanced span analysis
- **[Built-in Evaluators](built-in.md)** - Other evaluator types