---
name: detecting-anomalies
description: Detect anomalies in metrics and time-series data using OPAL statistical methods. Use when you need to identify unusual patterns, spikes, drops, or outliers in observability data. Covers statistical outlier detection (Z-score, IQR), threshold-based alerts, rate-of-change detection with window functions, and moving average baselines. Choose pattern based on data distribution and anomaly type.
---
# Detecting Anomalies
Detect anomalies in metrics and time-series data using OPAL statistical methods. This skill covers multiple detection patterns for different types of anomalies: statistical outliers, sudden spikes/drops, threshold violations, and deviations from moving baselines.
Use when you need to:
- Identify unusual spikes or drops in request volume, errors, or latency
- Detect values exceeding normal statistical bounds
- Alert on sudden percentage changes (traffic doubling, sudden drops)
- Compare current values to moving averages
- Find outliers in skewed distributions
## Key Concepts
### Anomaly Detection Approaches
**Statistical Methods** (good for gradual changes):
- Z-Score (standard deviation) - Assumes normal distribution
- IQR (Interquartile Range) - Robust to skewed data
- Percentile thresholds - Compare to historical baseline
**Temporal Methods** (good for sudden changes):
- Rate of change - Detect sudden spikes/drops
- Moving average deviation - Compare to recent baseline
**Threshold Methods** (simple and interpretable):
- Static thresholds - Known limits (CPU > 90%)
- Dynamic thresholds - Calculated from baseline (current > avg * 1.5)
### When to Use Each Pattern
```
What type of anomaly?
├─ Known threshold (e.g., "CPU > 90%")
│ └─> Threshold-Based Detection (Pattern 3)
│
├─ Statistical outliers (unusual values)
│ ├─ Normal distribution?
│ │ └─> Z-Score Method (Pattern 1)
│ │
│ └─ Skewed distribution?
│ └─> IQR Method (Pattern 2)
│
├─ Sudden spikes/drops
│ └─> Rate of Change (Pattern 4)
│
└─ Deviation from recent baseline
└─> Moving Average (Pattern 5)
```
## Pattern 1: Statistical Outlier Detection (Z-Score)
**Concept**: Detect values beyond N standard deviations from the mean
**When to use**:
- Metrics with relatively stable baseline
- Data roughly follows normal distribution
- Need statistically grounded detection
**Query**:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
stddev_val:stddev(metric_value),
current_val:sum(metric_value),
group_by(service_name)
| make_col z_score:(current_val - avg_val) / stddev_val
| make_col upper_bound:avg_val + (2 * stddev_val)
| make_col lower_bound:avg_val - (2 * stddev_val)
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
| sort desc(z_score)
| limit 20
```
**Threshold tuning**:
- `z > 2` or `z < -2`: ~95% confidence (moderate sensitivity)
- `z > 3` or `z < -3`: ~99.7% confidence (low false positives)
- `z > 1.5` or `z < -1.5`: ~87% confidence (high sensitivity)
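For lower sensitivity, only the anomaly condition changes (and the bound columns, if you keep them, should use the same multiplier). A minimal variant of the query above with the 3-sigma bound:
```opal
| make_col is_anomaly:if(z_score > 3 or z_score < -3, true, false)
```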
**Example result**:
```
service_name: featureflagservice
avg_val: 11.5
stddev_val: 13.9
current_val: 46
z_score: 2.48
is_anomaly: true
```
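In this example, `z_score = (46 - 11.5) / 13.9 ≈ 2.48`, and `upper_bound = 11.5 + (2 * 13.9) = 39.3`, so the current value of 46 exceeds the 2-sigma bound and is flagged.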
**Pros**:
- Statistically grounded
- Well-understood confidence intervals
- Good for normally distributed data
**Cons**:
- Assumes normal distribution
- Sensitive to extreme outliers in baseline
- Requires sufficient historical data
## Pattern 2: IQR (Interquartile Range) Method
**Concept**: Detect values outside Tukey's fences (more than 1.5 * IQR beyond the quartiles)
**When to use**:
- Skewed distributions (latency, error counts)
- Presence of natural outliers in baseline
- More robust alternative to Z-score
**Query**:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p25:percentile(metric_value, 0.25),
p75:percentile(metric_value, 0.75),
current_val:sum(metric_value),
group_by(service_name)
| make_col iqr:p75 - p25
| make_col upper_fence:p75 + (1.5 * iqr)
| make_col lower_fence:p25 - (1.5 * iqr)
| make_col is_outlier:if(current_val > upper_fence or current_val < lower_fence, true, false)
| filter is_outlier = true
| sort desc(current_val)
| limit 20
```
**Threshold tuning**:
- `1.5 * IQR`: Standard outliers (moderate sensitivity)
- `3 * IQR`: Extreme outliers (low false positives)
- `1 * IQR`: More sensitive detection
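To flag only extreme outliers, widen the fences in the query above. A minimal sketch of the two changed lines:
```opal
| make_col upper_fence:p75 + (3 * iqr)
| make_col lower_fence:p25 - (3 * iqr)
```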
**Example result**:
```
service_name: featureflagservice
p25: 1.75
p75: 16.75
iqr: 15
upper_fence: 39.25
current_val: 46
is_outlier: true
```
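In this example, `iqr = 16.75 - 1.75 = 15`, so `upper_fence = 16.75 + (1.5 * 15) = 39.25`; the current value of 46 clears the fence and is flagged.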
**Pros**:
- Robust to skewed distributions
- Quartile-based, so extreme values in the baseline do not distort the fences
- No assumption of normality
**Cons**:
- Less interpretable than Z-score
- Can over-flag heavy-tailed distributions, where many legitimate values fall outside the fences
- Requires sufficient data for percentile calculation
## Pattern 3: Threshold-Based Detection
**Concept**: Simple comparison against fixed or dynamic thresholds
**When to use**:
- Known capacity limits (CPU > 90%, memory > 80%)
- SLO violations (error rate > 1%, latency > 500ms)
- Business rules (orders < 100 per hour)
**Static Threshold**:
```opal
align options(bins: 1), total_calls:sum(m("span_call_count_5m"))
aggregate current_rate:sum(total_calls), group_by(service_name)
| make_col threshold:100000
| make_col is_high:if(current_rate > threshold, true, false)
| filter is_high = true
| sort desc(current_rate)
```
**Dynamic Threshold** (baseline comparison):
```opal
align options(bins: 1), metric_value:sum(m("span_call_count_5m"))
aggregate baseline:avg(metric_value),
current:sum(metric_value),
group_by(service_name)
| make_col threshold:baseline * 1.5
| make_col is_anomaly:if(current > threshold, true, false)
| filter is_anomaly = true
```
**Threshold multiplier guidance**:
- `1.5x`: High sensitivity (more alerts)
- `2x`: Moderate sensitivity (balanced)
- `3x`: Low sensitivity (only major spikes)
**Pros**:
- Simple and interpretable
- No assumptions about distribution
- Clear business meaning
**Cons**:
- Requires domain knowledge to set thresholds
- Static thresholds may not adapt to changing baselines
- May miss subtle anomalies
## Pattern 4: Rate of Change Detection
**Concept**: Detect sudden spikes or drops by comparing to previous time period
**When to use**:
- Detect sudden traffic spikes or drops
- Identify rapid changes in behavior
- Alert on percentage change thresholds
**Query**:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| make_col value_change:metric_value - previous_value
| make_col pct_change:if(previous_value > 0, (value_change / previous_value) * 100, 0)
| make_col is_spike:if(pct_change > 100 or pct_change < -50, true, false)
| filter is_spike = true
| sort desc(pct_change)
| limit 20
```
**Critical syntax**: Use `window(lag(...), group_by(...))` NOT `lag(...) over (partition by...)`
**Threshold examples**:
- `pct_change > 100`: value more than doubled (2x)
- `pct_change > 200`: more than tripled (3x)
- `pct_change < -50`: dropped by more than half
- `pct_change < -75`: dropped by more than 75%
**Example result**:
```
service_name: frontend
metric_value: 50
previous_value: 2
value_change: 48
pct_change: 2400
is_spike: true
```
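In this example, `pct_change = ((50 - 2) / 2) * 100 = 2400`. Note the very low previous value of 2; this is exactly the small-baseline case the minimum value filter below guards against.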
**Pros**:
- Detects sudden changes regardless of absolute value
- Adapts to current baseline automatically
- Effective for early spike detection
**Cons**:
- Sensitive to very low baseline values (small numbers can cause large percentage changes)
- May produce false positives during normal ramp-up/down
- Requires at least 2 time periods of data
**Best practices** (combined in the sketch below):
- Add a minimum value filter to avoid division by small numbers
- Use different thresholds for increases vs decreases
- Consider an absolute change threshold in addition to percentage
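A minimal sketch of the spike-side condition with all three guards applied to the Pattern 4 query (the constants 10, 20, and 100 are illustrative and should be tuned per metric; drops would get a mirrored condition with their own thresholds):
```opal
| make_col is_spike:if(previous_value > 10 and value_change > 20 and pct_change > 100, true, false)
```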
## Pattern 5: Moving Average Baseline
**Concept**: Compare current value to recent moving average using sliding window
**When to use**:
- Smooth noisy metrics for baseline
- Detect deviations from recent average
- Adaptive baseline that follows trends
**Query**:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col moving_avg:window(avg(metric_value), group_by(service_name), frame(back:30m))
| make_col deviation:metric_value - moving_avg
| make_col pct_deviation:if(moving_avg > 0, (deviation / moving_avg) * 100, 0)
| make_col is_anomaly:if(pct_deviation > 50 or pct_deviation < -50, true, false)
| filter is_anomaly = true
| sort desc(pct_deviation)
| limit 20
```
**Frame options**:
- `frame(back:10m)`: Short-term baseline (10-minute average)
- `frame(back:30m)`: Medium-term baseline (30-minute average)
- `frame(back:1h)`: Longer-term baseline (1-hour average)
**Deviation thresholds**:
- `> 50%`: Moderate deviation from recent average
- `> 100%`: Doubling compared to recent average
- `> 25%`: More sensitive detection
**Pros**:
- Adapts to changing baselines and trends
- Smooths noisy data
- Good for metrics with daily/hourly patterns
**Cons**:
- Slower to detect anomalies (due to averaging)
- May miss anomalies during rapid baseline shifts
- Requires sufficient lookback data
**Best practices**:
- Choose frame duration based on metric volatility
- Shorter frames for fast-changing metrics
- Longer frames for more stable baselines
## Pattern 6: Percentile-Based Threshold
**Concept**: Compare current value to historical percentile (p95, p99)
**When to use**:
- SLO violations (latency > p95)
- Detect values above "normal high"
- Comparing current to historical baseline
**Query**:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p95:percentile(metric_value, 0.95),
p99:percentile(metric_value, 0.99),
current:sum(metric_value),
group_by(service_name)
| make_col is_anomaly:if(current > p95, true, false)
| filter is_anomaly = true
| sort desc(current)
```
**Percentile choices**:
- `p95`: Detect top 5% unusual values (moderate sensitivity)
- `p99`: Detect top 1% extreme values (low false positives)
- `p90`: Detect top 10% (high sensitivity)
**Pros**:
- Percentile-based SLOs are industry standard
- Automatically adapts to data distribution
- Clear meaning (top X% of values)
**Cons**:
- Unidirectional as written: detects unusually high values but not drops (see the symmetric sketch below)
- Requires sufficient historical data
- May not detect subtle shifts in distribution
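If drops also matter, the same query can add a low-percentile floor. A minimal sketch, assuming symmetric bounds make sense for the metric (the 0.05 percentile is illustrative):
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p05:percentile(metric_value, 0.05),
    p95:percentile(metric_value, 0.95),
    current:sum(metric_value),
    group_by(service_name)
| make_col is_anomaly:if(current > p95 or current < p05, true, false)
| filter is_anomaly = true
| sort desc(current)
```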
## Common Patterns
### Pattern: Combine Multiple Detection Methods
Increase confidence by requiring multiple methods to agree:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
stddev_val:stddev(metric_value),
p95:percentile(metric_value, 0.95),
current:sum(metric_value),
group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_zscore_anomaly:if(z_score > 2 or z_score < -2, true, false)
| make_col is_percentile_anomaly:if(current > p95, true, false)
| make_col is_anomaly:if(is_zscore_anomaly = true and is_percentile_anomaly = true, true, false)
| filter is_anomaly = true
```
**Use case**: Reduce false positives by requiring consensus
### Pattern: Multi-Metric Correlation
Detect anomalies across correlated metrics:
```opal
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
| make_col error_rate:if(total_requests > 0, (float64(total_errors) / float64(total_requests)) * 100, 0)
| make_col threshold:1.0
| make_col is_high_error:if(error_rate > threshold and total_requests > 100, true, false)
| filter is_high_error = true
```
**Use case**: Alert when error rate AND request volume both indicate issues
### Pattern: Time-Series Trending
Track anomalies over time using timechart:
```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value),
stddev_val:stddev(metric_value),
current:sum(metric_value),
group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
```
**Result**: Multiple rows per service showing anomalies across time buckets
**Use case**: Visualize when and how often anomalies occur
## OPAL Syntax Key Points
### Window Functions (LAG/LEAD)
**CRITICAL**: OPAL uses the `window()` function, NOT the SQL `OVER` clause!
**✅ CORRECT Syntax**:
```opal
make_col prev:window(lag(column, 1), group_by(dimension))
make_col next:window(lead(column, 1), group_by(dimension))
make_col moving_avg:window(avg(column), group_by(dimension), frame(back:30m))
```
**❌ WRONG Syntax** (SQL-style):
```opal
lag(column, 1) over (partition by dimension order by time)
```
**Window function components**:
- `lag(column, offset)`: Access previous row value
- `lead(column, offset)`: Access next row value
- `group_by(dimension)`: Partition by dimension
- `frame(back:duration)`: Sliding window lookback period
### Derived Columns Must Use Separate make_col
**❌ WRONG** - Cannot reference derived column in same make_col:
```opal
make_col upper_bound:avg + (2 * stddev),
is_anomaly:if(value > upper_bound, true, false)
```
**✅ CORRECT** - Use separate make_col statements:
```opal
make_col upper_bound:avg + (2 * stddev)
| make_col is_anomaly:if(value > upper_bound, true, false)
```
### Metrics Query Patterns
**Summary (one row per group)**:
```opal
align options(bins: 1), metric:sum(m("metric_name"))
aggregate result:sum(metric), group_by(dimension)
```
**Note**: No pipe `|` between `align options(bins: 1)` and `aggregate`!
**Time-series (multiple rows per group)**:
```opal
align 5m, metric:sum(m("metric_name"))
| aggregate result:sum(metric), group_by(dimension)
```
**Note**: Pipe `|` required between `align 5m` and `aggregate`!
### Period-Over-Period Comparison with Timeshift + Union
For comparing entire periods (e.g., "this hour" vs "exactly 1 hour ago"), use the `timeshift` + `union` pattern with subquery definitions.
**Key Difference**:
- **`window(lag())`**: Compares adjacent buckets (5-min to 5-min, approximate)
- **`timeshift + union`**: Compares entire periods (exact time offset: 1h, 1d, 1w)
**Working Example** (✅ works in all query contexts):
```opal
@current <- @ {
align rate:sum(m("span_call_count_5m"))
aggregate current_sum:sum(rate), group_by(service_name)
}
@previous <- @ {
timeshift 1h # Shift BEFORE align!
align rate:sum(m("span_call_count_5m"))
aggregate prev_sum:sum(rate), group_by(service_name)
}
@combined <- @current {
union @previous
aggregate current:any_not_null(current_sum),
previous:any_not_null(prev_sum),
group_by(service_name)
make_col change:current - previous
make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change)
}
<- @combined {
filter abs_pct_change > 50
sort desc(abs_pct_change)
limit 10
}
```
**Critical Points**:
1. **`@subquery <- @`**: Use `@` alone to reference the primary input dataset
2. **`timeshift` BEFORE `align`**: Operates on raw data, shifts timestamps before aggregation
3. **Separate aggregation**: Both series must be aggregated independently
4. **`any_not_null()` collapses union**: Combines current/previous into single row per dimension
5. **Works everywhere**: MCP queries, worksheets, and monitors all support this syntax
**Use Cases**:
- Day-over-day comparison: "Today vs yesterday" (use `timeshift 1d`)
- Week-over-week trending: "This week vs last week" (use `timeshift 7d`)
- Hour-over-hour spikes: "This hour vs 1 hour ago" (use `timeshift 1h`)
- SLA violations: "Current vs same period last month" (use `timeshift 30d`)
**Tested Results**:
- ✅ Detected 200% increase in service request rate (18 vs 6)
- ✅ Detected 92% drop in request volume (1 vs 13)
- ✅ Works with any timeshift duration (1h, 6h, 1d, 7d, etc.)
**Comparison with window(lag())**:
| Feature | window(lag(rate, N)) | timeshift + union |
|---------|---------------------|-------------------|
| **Time precision** | Approximate (N buckets back) | Exact (fixed time offset) |
| **Example** | `lag(rate, 12)` ≈ 1 hour (if buckets are 5min) | `timeshift 1h` = exactly 60 minutes |
| **Complexity** | Simple, one query | More complex, subqueries + union |
| **Use case** | Real-time spike detection | Period-over-period reporting |
| **Best for** | "Current vs previous bucket" | "Current vs same time yesterday" |
**When to use each**:
- Use `window(lag())` for: Real-time alerts, simple spike detection, fast queries
- Use `timeshift + union` for: Exact period comparison, day-over-day reports, SLA tracking
## Troubleshooting
### Issue: "Unknown function 'over()'"
**Cause**: Using SQL window function syntax instead of OPAL syntax
**Solution**: Use `window(lag(...), group_by(...))` instead of `lag(...) over (...)`
**Example**:
```opal
# WRONG
make_col prev:lag(value, 1) over (partition by service order by time)
# CORRECT
make_col prev:window(lag(value, 1), group_by(service))
```
### Issue: High false positive rate
**Cause**: Threshold too sensitive or baseline includes anomalies
**Solutions**:
1. **Increase threshold**: Use 3-sigma instead of 2-sigma for Z-score
2. **Combine methods**: Require multiple detection methods to agree
3. **Filter baseline**: Exclude known anomaly periods from baseline calculation
4. **Add minimum value filter**: Avoid alerting on very low absolute values
**Example with minimum value filter**:
```opal
| make_col is_spike:if(pct_change > 100 and metric_value > 10, true, false)
```
### Issue: Missing anomalies (false negatives)
**Cause**: Threshold too strict or wrong detection method for data type
**Solutions**:
1. **Decrease threshold**: Use 1.5-sigma or lower percentile (p90 instead of p95)
2. **Try different method**: IQR if data is skewed, rate-of-change for sudden spikes
3. **Check data distribution**: Visualize baseline to understand normal range
4. **Use multiple methods**: Catch different types of anomalies
### Issue: Division by zero or very small numbers
**Cause**: Calculating percentage change when previous value is zero or very small
**Solution**: Add conditional check for minimum denominator:
```opal
make_col pct_change:if(previous_value > 5, (value_change / previous_value) * 100, 0)
```
### Issue: Window function returns null values
**Cause**: First row in group has no previous value for `lag()`
**Solution**: This is expected behavior: the first row in each group has `null` for `lag()`. Filter out nulls or provide a default:
```opal
make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| filter not is_null(previous_value)
```
Or supply a default through a conditional (a default argument is not directly supported in the current `lag()` syntax):
```opal
make_col pct_change:if(is_null(previous_value), 0, (value_change / previous_value) * 100)
```
## Key Takeaways
1. **Choose detection method based on anomaly type and data distribution**
- Z-Score for normal distributions
- IQR for skewed data
- Rate-of-change for sudden spikes
- Moving average for trend deviations
2. **OPAL window functions use different syntax from SQL**
- Use `window(lag(...), group_by(...))` NOT `lag(...) over (...)`
- Works with both metrics (align) and raw datasets
3. **Combine multiple methods to reduce false positives**
- Require Z-score AND percentile agreement
- Add minimum value filters for rate-of-change
- Correlate multiple metrics (requests + errors)
4. **Tune thresholds based on metric characteristics**
- Volatile metrics: Higher thresholds (3-sigma, 100% change)
- Stable metrics: Lower thresholds (2-sigma, 50% change)
- Test and iterate based on false positive rate
5. **Derived columns require separate make_col statements**
- Cannot reference newly created column in same make_col
- Use pipeline of make_col statements for sequential calculations
6. **Frame specification enables sliding window calculations**
- `frame(back:30m)` for 30-minute moving average
- Shorter frames for fast-changing metrics
- Longer frames for stable baselines
7. **Metrics queries have two distinct patterns**
- `options(bins: 1)` for summary (no pipe before aggregate)
- `align 5m` for time-series (pipe required before aggregate)
8. **Statistical methods work best with sufficient historical data**
- Need enough data points for meaningful stddev/percentiles
- Consider minimum sample size (e.g., 24 hours of 5m buckets = 288 samples)
9. **Rate-of-change detection is powerful but requires careful tuning**
- Very effective for early spike detection
- Prone to false positives with low baseline values
- Add minimum value and absolute change filters
10. **Test detection patterns against historical data**
- Validate false positive rate on known-good periods
- Verify detection on known anomaly events
- Adjust thresholds based on operational feedback
## When to Use This Skill
Use detecting-anomalies skill when:
- User asks to check for anomalies
- Creating alert rules for unusual behavior
- Investigating performance degradation or incidents
- Identifying outliers in service metrics
- Detecting sudden traffic spikes or drops
- Comparing current values to historical baselines
- Setting up SLO violation alerts
- Analyzing metrics for unusual patterns
Cross-references:
- aggregating-gauge-metrics (for metric query patterns)
- analyzing-tdigest-metrics (for percentile-based detection)
- time-series-analysis (for temporal trending)
- working-with-intervals (for span-based anomaly detection)
- window-functions-deep-dive (to better understand window functions)