---
name: aggregating-event-datasets
description: Aggregate and summarize event datasets (logs) using OPAL statsby. Use when you need to count, sum, or calculate statistics across log events. Covers make_col for derived columns, statsby for aggregation, group_by for grouping, aggregation functions (count, sum, avg, percentile), and topk for top N results. Returns single summary row per group across entire time range. For time-series trends, see time-series-analysis skill.
---
# Aggregating Event Datasets
Event datasets (logs) can be aggregated to create summaries and statistics. This skill teaches you how to use `statsby` to aggregate log data into meaningful insights using OPAL.
## When to Use This Skill
- Counting occurrences (error count by namespace, log volume by pod)
- Calculating statistics (average, sum, percentiles) across events
- Grouping events by dimensions (namespace, pod, container, service)
- Finding top N results by a metric (top 10 error sources, busiest pods)
- Creating summary reports across entire time range
**Note**: This skill covers `statsby` which returns **one summary row per group** across the entire time range. For time-series trends (multiple rows per group over time), see the `time-series-analysis` skill.
## Prerequisites
- Access to Observe tenant via MCP
- Understanding of event datasets (see filtering-event-datasets skill)
- Dataset with `log` interface (or any Event dataset)
## Key Concepts
### statsby - Statistical Aggregation
`statsby` is the primary aggregation verb for event datasets. It:
- Groups events by specified dimensions
- Applies aggregation functions (count, sum, avg, etc.)
- Returns **one row per group** across the entire query time range
**Syntax**:
```opal
statsby aggregation_function(), group_by(dimension1, dimension2, ...)
```
### Common Aggregation Functions
- `count()` - Count number of events
- `sum(field)` - Sum values of a field
- `avg(field)` - Average value of a field
- `min(field)` - Minimum value
- `max(field)` - Maximum value
- `percentile(field, p)` - Percentile (e.g., p=0.95 for 95th percentile)
- `any_not_null(field)` - Any non-null value from the group
### topk vs sort/limit
- **`topk N, max(metric)`** - Get top N results by a specific metric (semantically correct for "top performers")
- **`sort desc(metric) | limit N`** - Alternative but less clear intent
- **Use topk** for aggregated results - it's more explicit about intent
## Discovery Workflow
Start with dataset discovery (same as filtering-event-datasets):
**Step 1: Find dataset**
```
discover_context("kubernetes logs")
```
**Step 2: Get schema**
```
discover_context(dataset_id="YOUR_DATASET_ID")
```
Note fields you'll use for:
- **Filtering** (before aggregation)
- **Grouping** (dimensions to aggregate by)
- **Calculating** (fields to sum, average, etc.)
## Basic Patterns
### Pattern 1: Simple Count
**Use case**: Count total events
```opal
statsby count()
```
**Explanation**: Counts all events in the time range. Returns single row with total count.
**Output**:
```
count
5831
```
### Pattern 2: Count by Dimension
**Use case**: Count events grouped by a field (e.g., namespace)
```opal
make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby count(), group_by(namespace)
| topk 10, max(count)
```
**Explanation**:
1. `make_col` creates a derived column `namespace` from nested field
2. `statsby` counts events, grouped by namespace
3. `topk` returns top 10 namespaces by count
**Output**:
```
namespace,count,_c_rank
default,5805,1
kube-system,648,2
observe,64,3
```
### Pattern 3: Count with Filtering
**Use case**: Count errors per namespace
```opal
filter contains(body, "error")
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby error_count:count(), group_by(namespace)
| topk 10, max(error_count)
```
**Explanation**: Filters for errors first, then counts by namespace. Notice we name the count `error_count` for clarity.
### Pattern 4: Multiple Dimensions
**Use case**: Count by namespace AND pod
```opal
make_col
namespace:string(resource_attributes."k8s.namespace.name"),
pod:pod
| statsby count(), group_by(namespace, pod)
| topk 20, max(count)
```
**Explanation**: Groups by multiple dimensions. Each unique (namespace, pod) combination gets one row.
### Pattern 5: Multiple Aggregations
**Use case**: Calculate multiple statistics in one query
```opal
filter stream = "stderr"
| make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby
stderr_count:count(),
group_by(namespace)
| topk 10, max(stderr_count)
```
**Explanation**: You can calculate multiple aggregations in a single `statsby` call.
## Complete Example
End-to-end workflow for analyzing errors across your infrastructure.
**Scenario**: Find which services, namespaces, and pods are producing the most errors in the last 24 hours.
**Step 1: Discovery**
```
discover_context("kubernetes logs")
```
Found: Dataset "Kubernetes Explorer/Kubernetes Logs" (ID: 42161740)
**Step 2: Build query**
```opal
filter contains(body, "error") or contains(body, "ERROR")
| make_col
namespace:string(resource_attributes."k8s.namespace.name"),
pod:pod,
container:container
| statsby error_count:count(), group_by(namespace, pod, container)
| topk 20, max(error_count)
```
**Step 3: Execute**
```
execute_opal_query(
query="[query above]",
primary_dataset_id="42161740",
time_range="24h"
)
```
**Step 4: Interpret results**
```csv
namespace,pod,container,error_count,_c_rank
kube-system,calico-node-74d4r,calico-node,33,1
kube-system,calico-node-hhvbf,calico-node,31,2
kube-system,calico-node-ghk2s,calico-node,31,3
kube-system,calico-kube-controllers-759cd8b574-fzr49,calico-kube-controllers,31,4
```
**Analysis**:
- Most errors are in `kube-system` namespace
- `calico-node` pods are the primary error source
- All errors are from the same container (`calico-node`)
- Total of 126 errors across top 4 sources in 24h
**Next steps**: Investigate the specific calico-node errors to understand the root cause.
## Advanced Patterns
### Pattern 6: Conditional Aggregation
**Use case**: Count errors vs total, calculate error rate
```opal
make_col
namespace:string(resource_attributes."k8s.namespace.name"),
is_error:if(contains(body, "error"), 1, 0)
| statsby
total:count(),
error_count:sum(is_error),
group_by(namespace)
| make_col error_rate:float64(error_count)/float64(total)
| topk 10, max(error_rate)
```
**Explanation**:
1. Create boolean flag `is_error` (1 or 0)
2. Count total events and sum error flags
3. Calculate error rate as derived column
4. Show top 10 by error rate
**Note**: OPAL doesn't have `count_if()`, so use `if()` + `sum()` pattern.
### Pattern 7: Type Conversions
**Use case**: Safely handle type conversions for nested fields
```opal
make_col
namespace:string(resource_attributes."k8s.namespace.name"),
pod:string(pod),
container:string(container)
| statsby count(), group_by(namespace, pod, container)
| topk 20, max(count)
```
**Explanation**: Wrap fields in `string()`, `int64()`, `float64()` for type safety, especially with nested fields.
## Common Pitfalls
### Pitfall 1: Forgetting make_col Before statsby
❌ **Wrong**:
```opal
statsby count(), group_by(resource_attributes."k8s.namespace.name")
# Error: Can't group by nested field directly
```
✅ **Correct**:
```opal
make_col namespace:string(resource_attributes."k8s.namespace.name")
| statsby count(), group_by(namespace)
```
**Why**: `statsby` group_by needs simple column names. Use `make_col` to extract nested fields first.
### Pitfall 2: Using align Instead of statsby
❌ **Wrong**:
```opal
align options(bins: 1), count:count()
aggregate total:sum(count)
# align is for METRICS only!
```
✅ **Correct**:
```opal
statsby count()
# statsby is for EVENTS
```
**Why**: `align` is only for metric datasets. Events use `statsby` for aggregation.
### Pitfall 3: Using limit Instead of topk After Aggregation
❌ **Wrong** (less clear):
```opal
statsby error_count:count(), group_by(namespace)
| sort desc(error_count)
| limit 10
```
✅ **Correct**:
```opal
statsby error_count:count(), group_by(namespace)
| topk 10, max(error_count)
```
**Why**: `topk` explicitly states "top N by this metric" - clearer intent than arbitrary limit.
### Pitfall 4: Confusing statsby with timechart
❌ **Wrong** (if you want summary):
```opal
timechart 1h, count(), group_by(namespace)
# Returns multiple rows per namespace (time-series)
```
✅ **Correct** (for summary):
```opal
statsby count(), group_by(namespace)
# Returns one row per namespace (total)
```
**Why**:
- `statsby` = Single summary across time range
- `timechart` = Time-series with multiple rows per group
## Tips and Best Practices
- **Name your aggregations**: Use descriptive names like `error_count:count()` instead of just `count()`
- **Filter before aggregating**: Apply filters before `statsby` for better performance
- **Use topk for top N**: More explicit than sort/limit
- **Type conversion**: Wrap nested fields in `string()` for safety
- **Test with limit first**: When developing, filter to small dataset before aggregating
- **Small time ranges**: Start with 1h or 24h, expand once query is working
## Aggregation Function Reference
**Counting**:
- `count()` - Count all events in group
**Numeric**:
- `sum(field)` - Sum values
- `avg(field)` - Average
- `min(field)` - Minimum
- `max(field)` - Maximum
- `percentile(field, p)` - Percentile (0.0 to 1.0)
**String/Any**:
- `any_not_null(field)` - Any non-null value from group
## Additional Resources
For more details, see:
- [RESEARCH.md](../../RESEARCH.md) - Tested patterns and findings
- [OPAL Documentation](https://docs.observeinc.com/en/latest/content/query-language-reference/) - Official OPAL docs
## Related Skills
- [filtering-event-datasets] - For filtering events before aggregation
- [time-series-analysis] - For time-series trends with timechart
- [working-with-nested-fields] - Deep dive on nested field access
---
**Last Updated**: November 14, 2025
**Version**: 1.0
**Tested With**: Observe OPAL v2.x