---
name: detecting-anomalies
description: Detect anomalies in metrics and time-series data using OPAL statistical methods. Use when you need to identify unusual patterns, spikes, drops, or outliers in observability data. Covers statistical outlier detection (Z-score, IQR), threshold-based alerts, rate-of-change detection with window functions, and moving average baselines. Choose pattern based on data distribution and anomaly type.
---

# Detecting Anomalies

Detect anomalies in metrics and time-series data using OPAL statistical methods. This skill covers multiple detection patterns for different types of anomalies: statistical outliers, sudden spikes/drops, threshold violations, and deviations from moving baselines.

Use when you need to:
- Identify unusual spikes or drops in request volume, errors, or latency
- Detect values exceeding normal statistical bounds
- Alert on sudden percentage changes (traffic doubling, sudden drops)
- Compare current values to moving averages
- Find outliers in skewed distributions

## Key Concepts

### Anomaly Detection Approaches

**Statistical Methods** (good for gradual changes):
- Z-Score (standard deviation) - Assumes normal distribution
- IQR (Interquartile Range) - Robust to skewed data
- Percentile thresholds - Compare to historical baseline

**Temporal Methods** (good for sudden changes):
- Rate of change - Detect sudden spikes/drops
- Moving average deviation - Compare to recent baseline

**Threshold Methods** (simple and interpretable):
- Static thresholds - Known limits (CPU > 90%)
- Dynamic thresholds - Calculated from baseline (current > avg * 1.5)

### When to Use Each Pattern

```
What type of anomaly?
├─ Known threshold (e.g., "CPU > 90%")
│  └─> Threshold-Based Detection (Pattern 3)
│
├─ Statistical outliers (unusual values)
│  ├─ Normal distribution?
│  │  └─> Z-Score Method (Pattern 1)
│  │
│  └─ Skewed distribution?
│     └─> IQR Method (Pattern 2)
│
├─ Sudden spikes/drops
│  └─> Rate of Change (Pattern 4)
│
└─ Deviation from recent baseline
   └─> Moving Average (Pattern 5)
```
## Pattern 1: Statistical Outlier Detection (Z-Score)

**Concept**: Detect values beyond N standard deviations from the mean.

**When to use**:
- Metrics with a relatively stable baseline
- Data that roughly follows a normal distribution
- Need for statistically grounded detection

**Query**:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), current_val:sum(metric_value), group_by(service_name)
| make_col z_score:(current_val - avg_val) / stddev_val
| make_col upper_bound:avg_val + (2 * stddev_val)
| make_col lower_bound:avg_val - (2 * stddev_val)
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
| sort desc(z_score)
| limit 20
```

**Threshold tuning**:
- `z > 2` or `z < -2`: ~95% confidence (moderate sensitivity)
- `z > 3` or `z < -3`: ~99.7% confidence (low false positives)
- `z > 1.5` or `z < -1.5`: ~87% confidence (high sensitivity)

**Example result**:

```
service_name: featureflagservice
avg_val: 11.5
stddev_val: 13.9
current_val: 46
z_score: 2.48
is_anomaly: true
```

**Pros**:
- Statistically grounded
- Well-understood confidence intervals
- Good for normally distributed data

**Cons**:
- Assumes a normal distribution
- Sensitive to extreme outliers in the baseline
- Requires sufficient historical data

## Pattern 2: IQR (Interquartile Range) Method

**Concept**: Detect values beyond the interquartile range using Tukey's fences.

**When to use**:
- Skewed distributions (latency, error counts)
- Presence of natural outliers in the baseline
- A more robust alternative to Z-score

**Query**:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p25:percentile(metric_value, 0.25), p75:percentile(metric_value, 0.75), current_val:sum(metric_value), group_by(service_name)
| make_col iqr:p75 - p25
| make_col upper_fence:p75 + (1.5 * iqr)
| make_col lower_fence:p25 - (1.5 * iqr)
| make_col is_outlier:if(current_val > upper_fence or current_val < lower_fence, true, false)
| filter is_outlier = true
| sort desc(current_val)
| limit 20
```

**Threshold tuning**:
- `1.5 * IQR`: Standard outliers (moderate sensitivity)
- `3 * IQR`: Extreme outliers (low false positives)
- `1 * IQR`: More sensitive detection

**Example result**:

```
service_name: featureflagservice
p25: 1.75
p75: 16.75
iqr: 15
upper_fence: 39.25
current_val: 46
is_outlier: true
```

**Pros**:
- Robust to skewed distributions
- Not affected by extreme values
- Based on quartiles (median-based)

**Cons**:
- Less interpretable than Z-score
- May miss anomalies in heavy-tailed distributions
- Requires sufficient data for percentile calculation
## Pattern 3: Threshold-Based Detection

**Concept**: Simple comparison against fixed or dynamic thresholds.

**When to use**:
- Known capacity limits (CPU > 90%, memory > 80%)
- SLO violations (error rate > 1%, latency > 500ms)
- Business rules (orders < 100 per hour)

**Static Threshold**:

```opal
align options(bins: 1), total_calls:sum(m("span_call_count_5m"))
aggregate current_rate:sum(total_calls), group_by(service_name)
make_col threshold:100000
| make_col is_high:if(current_rate > threshold, true, false)
| filter is_high = true
| sort desc(current_rate)
```

**Dynamic Threshold** (baseline comparison):

```opal
align options(bins: 1), metric_value:sum(m("span_call_count_5m"))
aggregate baseline:avg(metric_value), current:sum(metric_value), group_by(service_name)
make_col threshold:baseline * 1.5
| make_col is_anomaly:if(current > threshold, true, false)
| filter is_anomaly = true
```

**Threshold multiplier guidance**:
- `1.5x`: High sensitivity (more alerts)
- `2x`: Moderate sensitivity (balanced)
- `3x`: Low sensitivity (only major spikes)

**Pros**:
- Simple and interpretable
- No assumptions about distribution
- Clear business meaning

**Cons**:
- Requires domain knowledge to set thresholds
- Static thresholds may not adapt to changing baselines
- May miss subtle anomalies

## Pattern 4: Rate of Change Detection

**Concept**: Detect sudden spikes or drops by comparing to the previous time period.

**When to use**:
- Detect sudden traffic spikes or drops
- Identify rapid changes in behavior
- Alert on percentage change thresholds

**Query**:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| make_col value_change:metric_value - previous_value
| make_col pct_change:if(previous_value > 0, (value_change / previous_value) * 100, 0)
| make_col is_spike:if(pct_change > 100 or pct_change < -50, true, false)
| filter is_spike = true
| sort desc(pct_change)
| limit 20
```

**Critical syntax**: Use `window(lag(...), group_by(...))`, NOT `lag(...) over (partition by ...)`.

**Threshold examples**:
- `pct_change > 100`: 2x increase (doubling)
- `pct_change > 200`: 3x increase
- `pct_change < -50`: 50% drop
- `pct_change < -75`: 75% drop

**Example result**:

```
service_name: frontend
metric_value: 50
previous_value: 2
value_change: 48
pct_change: 2400
is_spike: true
```

**Pros**:
- Detects sudden changes regardless of absolute value
- Adapts to the current baseline automatically
- Effective for early spike detection

**Cons**:
- Sensitive to very low baseline values (small numbers can cause large percentage changes)
- May produce false positives during normal ramp-up/down
- Requires at least 2 time periods of data

**Best practices** (combined in the sketch below):
- Add a minimum value filter to avoid division by small numbers
- Use different thresholds for increases vs. decreases
- Consider an absolute change threshold in addition to percentage
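A minimal sketch applying all three practices to the rate-of-change query above, built only from constructs shown in this skill; the minimum baseline of `10` and the absolute-change floor of `20` are illustrative values to tune against your own metrics:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| make_col value_change:metric_value - previous_value
| make_col pct_change:if(previous_value > 0, (value_change / previous_value) * 100, 0)
| make_col big_enough:if(previous_value > 10 and (value_change > 20 or value_change < -20), true, false)  # minimum baseline + minimum absolute change
| make_col is_spike:if(big_enough = true and (pct_change > 100 or pct_change < -50), true, false)  # asymmetric up (+100%) vs. down (-50%) thresholds
| filter is_spike = true
| sort desc(pct_change)
```

Splitting the guard into its own `make_col` keeps each derived column in a separate statement, per the OPAL constraint covered under Syntax Key Points below.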
## Pattern 5: Moving Average Baseline

**Concept**: Compare the current value to a recent moving average using a sliding window.

**When to use**:
- Smooth noisy metrics for a baseline
- Detect deviations from the recent average
- Adaptive baseline that follows trends

**Query**:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col moving_avg:window(avg(metric_value), group_by(service_name), frame(back:30m))
| make_col deviation:metric_value - moving_avg
| make_col pct_deviation:if(moving_avg > 0, (deviation / moving_avg) * 100, 0)
| make_col is_anomaly:if(pct_deviation > 50 or pct_deviation < -50, true, false)
| filter is_anomaly = true
| sort desc(pct_deviation)
| limit 20
```

**Frame options**:
- `frame(back:10m)`: Short-term baseline (10-minute average)
- `frame(back:30m)`: Medium-term baseline (30-minute average)
- `frame(back:1h)`: Longer-term baseline (1-hour average)

**Deviation thresholds**:
- `> 50%`: Moderate deviation from the recent average
- `> 100%`: Doubling compared to the recent average
- `> 25%`: More sensitive detection

**Pros**:
- Adapts to changing baselines and trends
- Smooths noisy data
- Good for metrics with daily/hourly patterns

**Cons**:
- Slower to detect anomalies (due to averaging)
- May miss anomalies during rapid baseline shifts
- Requires sufficient lookback data

**Best practices** (illustrated in the sketch below):
- Choose frame duration based on metric volatility
- Shorter frames for fast-changing metrics
- Longer frames for more stable baselines
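One way to hedge the frame choice is to compute two moving averages and alert when the short-term average diverges from the long-term one. This is a sketch using only the `window()`/`frame()` constructs above; the 10m/1h frame pair and the 50% divergence threshold are illustrative starting points, not tested values:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col short_avg:window(avg(metric_value), group_by(service_name), frame(back:10m))  # fast, noisy baseline
| make_col long_avg:window(avg(metric_value), group_by(service_name), frame(back:1h))  # slow, stable baseline
| make_col divergence_pct:if(long_avg > 0, ((short_avg - long_avg) / long_avg) * 100, 0)
| make_col is_shift:if(divergence_pct > 50 or divergence_pct < -50, true, false)
| filter is_shift = true
| sort desc(divergence_pct)
```

The short frame reacts quickly but is noisy; comparing it against the longer frame trades a little detection latency for fewer false positives.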
## Pattern 6: Percentile-Based Threshold

**Concept**: Compare the current value to a historical percentile (p95, p99).

**When to use**:
- SLO violations (latency > p95)
- Detect values above the "normal high"
- Comparing current values to a historical baseline

**Query**:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p95:percentile(metric_value, 0.95), p99:percentile(metric_value, 0.99), current:sum(metric_value), group_by(service_name)
| make_col is_anomaly:if(current > p95, true, false)
| filter is_anomaly = true
| sort desc(current)
```

**Percentile choices**:
- `p95`: Detect top 5% unusual values (moderate sensitivity)
- `p99`: Detect top 1% extreme values (low false positives)
- `p90`: Detect top 10% (high sensitivity)

**Pros**:
- Percentile-based SLOs are an industry standard
- Automatically adapts to the data distribution
- Clear meaning (top X% of values)

**Cons**:
- Unidirectional (only detects high values, not drops; see the two-sided sketch below)
- Requires sufficient historical data
- May not detect subtle shifts in distribution
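The unidirectionality is straightforward to work around by adding a low-percentile fence to catch drops as well. This sketch assumes `percentile()` accepts low quantiles such as `0.05` the same way it accepts `0.25` in Pattern 2, which is worth verifying in your environment:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p05:percentile(metric_value, 0.05), p95:percentile(metric_value, 0.95), current:sum(metric_value), group_by(service_name)
| make_col is_anomaly:if(current > p95 or current < p05, true, false)  # flag both unusually high and unusually low values
| filter is_anomaly = true
| sort desc(current)
```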
## Common Patterns

### Pattern: Combine Multiple Detection Methods

Increase confidence by requiring multiple methods to agree:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), p95:percentile(metric_value, 0.95), current:sum(metric_value), group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_zscore_anomaly:if(z_score > 2 or z_score < -2, true, false)
| make_col is_percentile_anomaly:if(current > p95, true, false)
| make_col is_anomaly:if(is_zscore_anomaly = true and is_percentile_anomaly = true, true, false)
| filter is_anomaly = true
```

**Use case**: Reduce false positives by requiring consensus.

### Pattern: Multi-Metric Correlation

Detect anomalies across correlated metrics:

```opal
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
make_col error_rate:if(total_requests > 0, (float64(total_errors) / float64(total_requests)) * 100, 0)
| make_col threshold:1.0
| make_col is_high_error:if(error_rate > threshold and total_requests > 100, true, false)
| filter is_high_error = true
```

**Use case**: Alert when error rate AND request volume both indicate issues.

### Pattern: Time-Series Trending

Track anomalies over time using a timechart:

```opal
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), current:sum(metric_value), group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
```

**Result**: Multiple rows per service showing anomalies across time buckets.

**Use case**: Visualize when and how often anomalies occur.

## OPAL Syntax Key Points

### Window Functions (LAG/LEAD)

**CRITICAL**: OPAL uses the `window()` function, NOT the SQL `OVER` clause!

**✅ CORRECT Syntax**:

```opal
make_col prev:window(lag(column, 1), group_by(dimension))
make_col next:window(lead(column, 1), group_by(dimension))
make_col moving_avg:window(avg(column), group_by(dimension), frame(back:30m))
```

**❌ WRONG Syntax** (SQL-style):

```opal
lag(column, 1) over (partition by dimension order by time)
```

**Window function components**:
- `lag(column, offset)`: Access the previous row value
- `lead(column, offset)`: Access the next row value
- `group_by(dimension)`: Partition by dimension
- `frame(back:duration)`: Sliding window lookback period

### Derived Columns Must Use Separate make_col

**❌ WRONG** - Cannot reference a derived column in the same make_col:

```opal
make_col upper_bound:avg + (2 * stddev), is_anomaly:if(value > upper_bound, true, false)
```

**✅ CORRECT** - Use separate make_col statements:

```opal
make_col upper_bound:avg + (2 * stddev)
| make_col is_anomaly:if(value > upper_bound, true, false)
```

### Metrics Query Patterns

**Summary (one row per group)**:

```opal
align options(bins: 1), metric:sum(m("metric_name"))
aggregate result:sum(metric), group_by(dimension)
```

**Note**: No pipe `|` between `align options(bins: 1)` and `aggregate`!

**Time-series (multiple rows per group)**:

```opal
align 5m, metric:sum(m("metric_name"))
| aggregate result:sum(metric), group_by(dimension)
```

**Note**: Pipe `|` required between `align 5m` and `aggregate`!

### Period-Over-Period Comparison with Timeshift + Union

For comparing entire periods (e.g., "this hour" vs. "exactly 1 hour ago"), use the `timeshift` + `union` pattern with subquery definitions.

**Key Difference**:
- **`window(lag())`**: Compares adjacent buckets (5-min to 5-min, approximate)
- **`timeshift + union`**: Compares entire periods (exact time offset: 1h, 1d, 1w)

**Working Example** (✅ works in all query contexts):

```opal
@current <- @ {
  align rate:sum(m("span_call_count_5m"))
  aggregate current_sum:sum(rate), group_by(service_name)
}

@previous <- @ {
  timeshift 1h  # Shift BEFORE align!
  align rate:sum(m("span_call_count_5m"))
  aggregate prev_sum:sum(rate), group_by(service_name)
}

@combined <- @current {
  union @previous
  aggregate current:any_not_null(current_sum), previous:any_not_null(prev_sum), group_by(service_name)
  make_col change:current - previous
  make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
  make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change)
}

<- @combined {
  filter abs_pct_change > 50
  sort desc(abs_pct_change)
  limit 10
}
```

**Critical Points**:
1. **`@subquery <- @`**: Use `@` alone to reference the primary input dataset
2. **`timeshift` BEFORE `align`**: Operates on raw data, shifting timestamps before aggregation
3. **Separate aggregation**: Both series must be aggregated independently
4. **`any_not_null()` collapses the union**: Combines current/previous into a single row per dimension
5. **Works everywhere**: MCP queries, worksheets, and monitors all support this syntax

**Use Cases**:
- Day-over-day comparison: "Today vs. yesterday" (use `timeshift 1d`)
- Week-over-week trending: "This week vs. last week" (use `timeshift 7d`)
- Hour-over-hour spikes: "This hour vs. 1 hour ago" (use `timeshift 1h`)
- SLA violations: "Current vs. same period last month" (use `timeshift 30d`)

**Tested Results**:
- ✅ Detected a 200% increase in service request rate (18 vs. 6)
- ✅ Detected a 92% drop in request volume (1 vs. 13)
- ✅ Works with any timeshift duration (1h, 6h, 1d, 7d, etc.)

**Comparison with window(lag())**:

| Feature | window(lag(rate, N)) | timeshift + union |
|---------|---------------------|-------------------|
| **Time precision** | Approximate (N buckets back) | Exact (fixed time offset) |
| **Example** | `lag(rate, 12)` ≈ 1 hour (if buckets are 5min) | `timeshift 1h` = exactly 60 minutes |
| **Complexity** | Simple, one query | More complex, subqueries + union |
| **Use case** | Real-time spike detection | Period-over-period reporting |
| **Best for** | "Current vs previous bucket" | "Current vs same time yesterday" |

**When to use each**:
- Use `window(lag())` for: Real-time alerts, simple spike detection, fast queries
- Use `timeshift + union` for: Exact period comparison, day-over-day reports, SLA tracking
## Troubleshooting

### Issue: "Unknown function 'over()'"

**Cause**: Using SQL window function syntax instead of OPAL syntax.

**Solution**: Use `window(lag(...), group_by(...))` instead of `lag(...) over (...)`.

**Example**:

```opal
# WRONG
make_col prev:lag(value, 1) over (partition by service order by time)

# CORRECT
make_col prev:window(lag(value, 1), group_by(service))
```

### Issue: High false positive rate

**Cause**: Threshold too sensitive, or baseline includes anomalies.

**Solutions**:
1. **Increase threshold**: Use 3-sigma instead of 2-sigma for Z-score
2. **Combine methods**: Require multiple detection methods to agree
3. **Filter baseline**: Exclude known anomaly periods from the baseline calculation
4. **Add minimum value filter**: Avoid alerting on very low absolute values

**Example with minimum value filter**:

```opal
| make_col is_spike:if(pct_change > 100 and metric_value > 10, true, false)
```

### Issue: Missing anomalies (false negatives)

**Cause**: Threshold too strict, or wrong detection method for the data type.

**Solutions**:
1. **Decrease threshold**: Use 1.5-sigma or a lower percentile (p90 instead of p95)
2. **Try a different method**: IQR if data is skewed, rate-of-change for sudden spikes
3. **Check data distribution**: Visualize the baseline to understand the normal range
4. **Use multiple methods**: Catch different types of anomalies

### Issue: Division by zero or very small numbers

**Cause**: Calculating percentage change when the previous value is zero or very small.

**Solution**: Add a conditional check for a minimum denominator:

```opal
make_col pct_change:if(previous_value > 5, (value_change / previous_value) * 100, 0)
```

### Issue: Window function returns null values

**Cause**: The first row in a group has no previous value for `lag()`.

**Solution**: This is expected behavior - the first row will have `null` for `lag()`. Filter nulls or provide a default:

```opal
make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| filter not is_null(previous_value)
```

Or use a default value (though not directly supported in the current lag syntax):

```opal
make_col pct_change:if(is_null(previous_value), 0, (value_change / previous_value) * 100)
```

## Key Takeaways

1. **Choose detection method based on anomaly type and data distribution**
   - Z-Score for normal distributions
   - IQR for skewed data
   - Rate-of-change for sudden spikes
   - Moving average for trend deviations

2. **OPAL window functions use different syntax from SQL**
   - Use `window(lag(...), group_by(...))`, NOT `lag(...) over (...)`
   - Works with both metrics (align) and raw datasets

3. **Combine multiple methods to reduce false positives**
   - Require Z-score AND percentile agreement
   - Add minimum value filters for rate-of-change
   - Correlate multiple metrics (requests + errors)

4. **Tune thresholds based on metric characteristics**
   - Volatile metrics: Higher thresholds (3-sigma, 100% change)
   - Stable metrics: Lower thresholds (2-sigma, 50% change)
   - Test and iterate based on the false positive rate

5. **Derived columns require separate make_col statements**
   - Cannot reference a newly created column in the same make_col
   - Use a pipeline of make_col statements for sequential calculations

6. **Frame specification enables sliding window calculations**
   - `frame(back:30m)` for a 30-minute moving average
   - Shorter frames for fast-changing metrics
   - Longer frames for stable baselines

7. **Metrics queries have two distinct patterns**
   - `options(bins: 1)` for summary (no pipe before aggregate)
   - `align 5m` for time-series (pipe required before aggregate)

8. **Statistical methods work best with sufficient historical data**
   - Need enough data points for meaningful stddev/percentiles
   - Consider minimum sample size (e.g., 24 hours of 5m buckets = 288 samples)

9. **Rate-of-change detection is powerful but requires careful tuning**
   - Very effective for early spike detection
   - Prone to false positives with low baseline values
   - Add minimum value and absolute change filters

10. **Test detection patterns against historical data**
    - Validate the false positive rate on known-good periods
    - Verify detection on known anomaly events
    - Adjust thresholds based on operational feedback

## When to Use This Skill

Use the detecting-anomalies skill when:
- User asks to check for anomalies
- Creating alert rules for unusual behavior
- Investigating performance degradation or incidents
- Identifying outliers in service metrics
- Detecting sudden traffic spikes or drops
- Comparing current values to historical baselines
- Setting up SLO violation alerts
- Analyzing metrics for unusual patterns

Cross-references:
- aggregating-gauge-metrics (for metric query patterns)
- analyzing-tdigest-metrics (for percentile-based detection)
- time-series-analysis (for temporal trending)
- working-with-intervals (for span-based anomaly detection)
- window-functions-deep-dive (to better understand window functions)
