---
name: aggregating-gauge-metrics
description: Aggregate pre-computed metrics (gauge, counter, delta types) using OPAL. Use when analyzing request counts, error rates, resource utilization, or any numeric metrics over time. Covers align + m() + aggregate pattern, summary vs time-series output, and common aggregation functions. For percentile metrics (tdigest), see analyzing-tdigest-metrics skill.
---

# Aggregating Gauge Metrics

Pre-computed metrics in Observe store aggregated measurements at regular intervals (typically every 5 minutes). This skill teaches how to query gauge, counter, and delta metric types using OPAL.

## When to Use This Skill

- Analyzing request counts, error rates, or throughput metrics
- Tracking resource utilization (CPU, memory, network)
- Computing totals, averages, or rates across time periods
- Creating dashboards with time-series charts
- Working with any gauge, counter, or delta metric type
- When you need summary statistics or trends over time

## Prerequisites

- Access to Observe tenant via MCP
- Understanding that metrics are pre-aggregated (not raw events)
- Metric dataset with type: gauge, counter, or delta
- Use `discover_context()` to find and inspect metrics

## Key Concepts

### What Are Gauge Metrics?
**Gauge metrics** are pre-aggregated numeric measurements collected at regular intervals.

**Pre-aggregated**: Already summarized at collection time (typically 5-minute intervals)

- More efficient than querying raw data
- Faster query performance
- Lower storage costs

**Common Metric Types**:

- **Gauge**: Point-in-time value (CPU utilization, memory usage, queue depth)
- **Counter**: Monotonically increasing value (total requests, bytes sent)
- **Delta**: Change between intervals (requests per interval, errors per interval)

**Examples**:

- `span_call_count_5m` - Number of requests per 5-minute interval
- `span_error_count_5m` - Number of errors per 5-minute interval
- `system_cpu_utilization_ratio` - CPU utilization percentage
- `k8s_pod_memory_available_bytes` - Available memory in bytes

### CRITICAL: The align Verb is REQUIRED

Unlike datasets (Events/Intervals), metrics **MUST** use the `align` verb:

```opal
# WRONG - Will not work ❌
m("span_call_count_5m")
| statsby total:sum(metric)

# CORRECT - Must use align ✅
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
```

**Why align is required**: Metrics are stored as time-series data that must be aligned to a time grid before aggregation.
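For intuition, aligning to a time grid can be pictured as snapping timestamped samples to fixed bucket boundaries and combining the values that land in each bucket. A minimal Python sketch of that idea (illustrative only — not how Observe implements `align`):

```python
from collections import defaultdict

def align_sum(samples, bucket_seconds=300):
    """Snap (timestamp, value) samples onto a fixed time grid and
    sum the values that land in each bucket."""
    buckets = defaultdict(float)
    for ts, value in samples:
        bucket_start = (ts // bucket_seconds) * bucket_seconds  # floor to grid
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))

# Hypothetical raw samples: (unix_timestamp, request_count)
samples = [(0, 10), (120, 5), (290, 7), (310, 3), (650, 4)]
print(align_sum(samples))  # {0: 22.0, 300: 3.0, 600: 4.0}
```

Only once every series sits on the same grid can values from different series be combined bucket-by-bucket, which is why `align` must come before `aggregate`.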
### Summary vs Time-Series Output

OPAL metrics queries can produce two different output types:

| Output Type | Pattern | Result | Use Case |
|-------------|---------|--------|----------|
| **Summary** | `options(bins: 1)` | One row per group | Totals, overall statistics |
| **Time-Series** | `5m`, `1h`, or default | Many rows per group | Trending, dashboards, charts |

**Summary pattern** - Single statistics across entire time range:

```opal
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
```

Output: One row per service

**Time-series pattern** - Values over time buckets:

```opal
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate), group_by(service_name)
```

Output: Multiple rows per service (one per 5-minute bucket)

**CRITICAL Syntax Difference**:

- Summary (`bins: 1`): NO pipe `|` between align and aggregate
- Time-series (`5m`): YES pipe `|` between align and aggregate

## Discovery Workflow

**Step 1: Search for metrics**

```
discover_context("request count", result_type="metric")
discover_context("error", result_type="metric")
discover_context("cpu memory", result_type="metric")
```

**Step 2: Get detailed metric schema**

```
discover_context(metric_name="span_call_count_5m")
```

**Step 3: Verify metric type**

Look for: `Type: gauge` (or `counter`, `delta`)

**Step 4: Note available dimensions**

These are used for `group_by()`:

- `service_name`, `service_namespace`
- `environment`, `span_name`
- `k8s_namespace_name`, `k8s_pod_name`
- etc. (shown in discovery output)

**Step 5: Write query**

Use `align` + `m()` + `aggregate` pattern with correct dimensions.

## Basic Patterns

### Pattern 1: Total Count Across Time Range

Get overall totals (summary output):

```opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
```

**Output**: Single row with total count across entire time range.

**No group_by**: Aggregates everything together.
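The summary and time-series modes differ only in their grouping key: summary collapses every bucket into one row per group, while time-series keeps one row per (group, bucket) pair. A rough Python sketch of the two output shapes (hypothetical data; the `service_name` values mirror the OPAL examples):

```python
from collections import defaultdict

# Hypothetical aligned rows: (service_name, bucket_start, value)
rows = [
    ("frontend", 0, 10), ("frontend", 300, 5),
    ("cart", 0, 2), ("cart", 300, 4),
]

# Summary (bins: 1) shape: one row per service
summary = defaultdict(int)
for svc, _, value in rows:
    summary[svc] += value
print(dict(summary))  # {'frontend': 15, 'cart': 6}

# Time-series (5m) shape: one row per (service, bucket) pair
series = defaultdict(int)
for svc, bucket, value in rows:
    series[(svc, bucket)] += value
print(dict(series))  # one entry per service per bucket
```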
### Pattern 2: Totals Per Group

Get totals broken down by dimension:

```opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
```

**Output**: One row per service with total requests.

**group_by**: Use any dimension from metric schema.

### Pattern 3: Average Values Per Group

Calculate averages across time range:

```opal
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), group_by(service_name)
```

**Output**: Average CPU utilization per service.

**avg() function**: Used twice - once in align, once in aggregate.

### Pattern 4: Multiple Aggregations

Compute several statistics together:

```opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate), average:avg(rate), maximum:max(rate), group_by(service_name)
```

**Output**: Multiple columns per service (total, average, maximum).

### Pattern 5: Time-Series for Trending

Track values over time buckets:

```opal
align 5m, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_5min:sum(rate), group_by(service_name)
```

**Output**: Multiple rows per service (one per 5-minute interval).

**Note**: Pipe `|` required after align for time-series pattern.

**Output columns**:

- `_c_bucket` - Time bucket identifier
- `valid_from`, `valid_to` - Bucket boundaries
- Metric values

## Common Use Cases

### Counting Total Requests by Service

```opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10
```

**Use case**: Identify top services by request volume.

### Counting Errors with Fill for Zero Values

```opal
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
```

**Use case**: Show all services, even those with zero errors.

**fill verb**: Replaces null values with 0.
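The effect of `fill` can be pictured as joining the result against the full list of groups and substituting zero wherever a group has no value. An illustrative Python version (the service names are assumptions, and this is a sketch of the idea, not the verb's actual implementation):

```python
# Services known to exist vs. services that actually reported errors
all_services = ["frontend", "cart", "checkout"]
error_counts = {"frontend": 35}  # cart and checkout produced no error rows

# fill total_errors:0 — every service gets a value, defaulting to 0
filled = {svc: error_counts.get(svc, 0) for svc in all_services}
print(filled)  # {'frontend': 35, 'cart': 0, 'checkout': 0}
```

Without the fill step, services with zero errors would simply be absent from the output rather than reported as 0.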
### Tracking Request Rate Over Time

```opal
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
```

**Use case**: Hourly request trends for dashboards.

**Output**: Time-series data for charting.

### Multiple Metrics in One Query

```opal
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
| make_col error_rate:float64(total_errors) / float64(total_requests)
```

**Use case**: Calculate error rate from two metrics.

**make_col**: Add derived column after aggregation.

### Resource Utilization Averages

```opal
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), max_cpu:max(cpu), group_by(k8s_pod_name)
| sort desc(avg_cpu)
| limit 20
```

**Use case**: Find pods with highest CPU usage.

## Complete Example

**Scenario**: You want to analyze request and error rates for your microservices over the last 24 hours.
**Step 1: Discover available metrics**

```
discover_context("request error", result_type="metric")
```

Found metrics:

- `span_call_count_5m` (type: gauge)
- `span_error_count_5m` (type: gauge)

**Step 2: Get metric details**

```
discover_context(metric_name="span_call_count_5m")
```

Available dimensions: `service_name`, `service_namespace`, `environment`, `span_name`

**Step 3: Query for summary statistics**

```opal
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(total_requests)
```

**Step 4: Interpret results**

| service_name | total_requests | total_errors | error_rate |
|--------------|----------------|--------------|------------|
| frontend-proxy | 15660 | 0 | 0.0 |
| frontend | 15263 | 35 | 0.23 |
| featureflagservice | 11693 | 0 | 0.0 |
| productcatalogservice | 8813 | 0 | 0.0 |

**Insight**: Frontend has a 0.23% error rate - investigate errors.

**Step 5: Get hourly trends**

```opal
align 1h, requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
| aggregate requests_per_hour:sum(requests), errors_per_hour:sum(errors), group_by(service_name)
| filter service_name = "frontend"
```

**Output**: Time-series showing frontend requests and errors per hour.

## Common Pitfalls

### Pitfall 1: Forgetting align Verb

❌ **Wrong**:

```opal
m("span_call_count_5m")
| statsby total:sum(metric)
```

✅ **Correct**:

```opal
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate)
```

**Why**: Metrics MUST use the `align` verb - it's required, not optional.
### Pitfall 2: Wrong Pipe Usage

❌ **Wrong** (pipe with bins:1):

```opal
align options(bins: 1), rate:sum(m("metric"))
| aggregate total:sum(rate)
```

❌ **Wrong** (no pipe with time duration):

```opal
align 5m, rate:sum(m("metric"))
aggregate total:sum(rate)
```

✅ **Correct**:

```opal
# Summary - NO pipe
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)

# Time-series - YES pipe
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate)
```

**Why**: Syntax differs between summary and time-series patterns.

### Pitfall 3: Grouping by Non-Existent Dimension

❌ **Wrong**:

```opal
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
```

Error: "field 'service_name' does not exist"

✅ **Correct**:

```opal
# First: discover_context(metric_name="metric") to see available dimensions
# Then: use only dimensions that exist
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(correct_dimension_name)
```

**Why**: Not all metrics have the same dimensions - always check first.

### Pitfall 4: Using statsby Instead of aggregate

❌ **Wrong**:

```opal
align options(bins: 1), rate:sum(m("metric"))
statsby total:sum(rate)
```

✅ **Correct**:

```opal
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)
```

**Why**: After `align`, use `aggregate` (not `statsby`, which is for datasets).
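Each aggregation function appears twice in these queries: once in `align` (combining points into a bucket value per series) and once in `aggregate` (combining across series). For `sum`, the two stages compose cleanly because summation is associative. A minimal Python sketch of this two-stage idea (illustrative data only, not OPAL semantics):

```python
# Two-stage aggregation: per-series bucket sums, then a cross-series sum.
series_a = [10, 5, 7]  # hypothetical points for one series
series_b = [3, 4]      # hypothetical points for another series

# Stage 1 (align-like): reduce each series to a single bucket value
stage1 = [sum(series_a), sum(series_b)]  # [22, 7]

# Stage 2 (aggregate-like): combine the per-series results
total = sum(stage1)
print(total)  # 29

# sum is associative, so the two-stage result equals a single pass
assert total == sum(series_a + series_b)
```

Note that not every function composes this way: an average of per-series averages generally differs from the overall average, so always check which statistic you actually need before stacking functions.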
## Aggregation Functions Reference

Common functions used with gauge metrics:

```opal
# Summing values
align options(bins: 1), metric:sum(m("metric_name"))
aggregate total:sum(metric)

# Averaging values
align options(bins: 1), metric:avg(m("metric_name"))
aggregate average:avg(metric)

# Maximum value
align options(bins: 1), metric:max(m("metric_name"))
aggregate maximum:max(metric)

# Minimum value
align options(bins: 1), metric:min(m("metric_name"))
aggregate minimum:min(metric)

# Count of samples
align options(bins: 1), metric:count(m("metric_name"))
aggregate sample_count:count(metric)
```

**Pattern**: Function used in both `align` and `aggregate`.

## Time Bucket Options

Common time durations for time-series queries:

```opal
align 1m, ...   # 1-minute buckets
align 5m, ...   # 5-minute buckets (common)
align 15m, ...  # 15-minute buckets
align 1h, ...   # 1-hour buckets
align 1d, ...   # 1-day buckets
```

**Default**: `align` without duration uses automatic binning (300 bins).

## Best Practices

1. **Always use discover_context() first** to find metrics and check dimensions
2. **Verify metric type** - this skill is for gauge/counter/delta (NOT tdigest)
3. **Use summary pattern** (`bins: 1`) for single statistics, reports, totals
4. **Use time-series pattern** (`5m`, `1h`) for dashboards, trending, charts
5. **Remember pipe rule**: bins:1 = no pipe, time duration = yes pipe
6. **Use fill** to replace nulls with zeros for complete results
7. **Add sort + limit** for top-N queries to avoid overwhelming output
8. **Check available dimensions** before using group_by

## Related Skills

- **analyzing-tdigest-metrics** - For percentile metrics (latency, duration p95/p99)
- **time-series-analysis** - For event/interval trending with timechart (different from metrics)
- **aggregating-event-datasets** - For aggregating raw events with statsby (different from metrics)
- **working-with-intervals** - For calculating durations from raw interval data

## Summary

Gauge metrics are pre-aggregated measurements that **require** the `align` verb:

- **Core pattern**: `align` + `m()` + `aggregate`
- **Metric types**: gauge, counter, delta (NOT tdigest)
- **Two output modes**:
  - Summary: `options(bins: 1)` → one row per group, NO pipe
  - Time-series: `5m`, `1h` → many rows per group, YES pipe
- **Common functions**: sum, avg, max, min, count
- **Discovery**: Use `discover_context()` to find metrics and dimensions

**Key distinction**: Metrics are pre-aggregated (use `align`), while Events/Intervals are raw data (use `statsby`/`timechart`).

---

**Last Updated**: November 14, 2025
**Version**: 1.0
**Tested With**: Observe OPAL (ServiceExplorer/Service Metrics)
