Skip to main content
Glama
SKILL.md16.6 kB
--- name: subquery-patterns-and-union description: Use OPAL subquery syntax (@labels) and union operations to combine multiple datasets or time periods. Essential for period-over-period comparisons, multi-dataset analysis, and complex data transformations. Covers @label <- @ syntax, timeshift for temporal shifts, union for combining results, and any_not_null() for collapsing grouped data. --- # Subquery Patterns and Union Operations ## Overview OPAL subqueries using `@label` syntax enable powerful multi-dataset and multi-period analysis. This skill covers: - Subquery syntax with `@label <- @dataset` - Union operations to combine multiple result sets - Timeshift for period-over-period comparisons - Best practices for complex data transformations ## When to Use This Skill Use subqueries and union when you need to: - **Period-over-period comparison**: Compare current vs previous hour/day/week (metrics or events) - **Time-series comparison**: Chart trends over time with period-over-period data - **Complex transformations**: Build intermediate results for multi-stage calculations (e.g., SLO tracking, error budgets) ## Core Concepts ### Subquery Syntax ```opal # Basic pattern @label <- @dataset_reference { # OPAL pipeline } # Reference primary input @current <- @ { # Process primary dataset } # Reference named dataset (requires dataset_aliases parameter) @other <- @dataset_name { # Process other dataset } # Use subquery results <- @label { # Continue processing } ``` **Key Points**: - `@` alone references the primary input dataset - `@dataset_name` references a named dataset via aliases - `@label` creates a reusable intermediate result - `<- @label` continues the pipeline from that subquery ### Union Operation Union combines multiple result sets with matching columns: ```opal @set_a <- @ { # First result set } @set_b <- @ { # Second result set with same column structure } <- @set_a { union @set_b # Combined results } ``` **Important**: - Union requires matching column names - Rows from both sources appear in output - Use `any_not_null()` to collapse sparse union results - Union happens AFTER aggregation, not before ### Timeshift Verb Timeshift moves row timestamps forward (positive) or backward (negative): ```opal timeshift 1h # Move 1 hour forward timeshift -1d # Move 1 day backward timeshift 30m # Move 30 minutes forward ``` **Critical Rule**: Apply timeshift BEFORE align when working with metrics! ```opal # CORRECT timeshift 1h align rate:sum(m("metric")) # WRONG align rate:sum(m("metric")) timeshift 1h # Too late! Align already processed time buckets ``` ## Pattern 1: Period-Over-Period Comparison (Metrics) **Use Case**: Compare current metrics to previous period (hour, day, week) **Strategy**: 1. Create `@current` subquery with current period aggregation 2. Create `@previous` subquery with timeshift + same aggregation 3. Union both, then collapse with `any_not_null()` 4. Calculate change and percentage change ### Example: Compare Current Hour vs Previous Hour ```opal # Current period (last 1h) @current <- @ { align rate:sum(m("span_call_count_5m")) aggregate current_sum:sum(rate), group_by(service_name) } # Previous period (1h before that) @previous <- @ { timeshift 1h # Shift BEFORE align! align rate:sum(m("span_call_count_5m")) aggregate prev_sum:sum(rate), group_by(service_name) } # Combine both periods @combined <- @current { union @previous aggregate current:any_not_null(current_sum), previous:any_not_null(prev_sum), group_by(service_name) } # Calculate changes <- @combined { make_col change:current - previous make_col pct_change:if(previous > 0, (change / previous) * 100, 0) make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change) filter abs_pct_change > 50 # Show only significant changes sort desc(abs_pct_change) } ``` **Why This Works**: - `timeshift 1h` moves the "previous" data timestamps forward by 1 hour - When combined with current data, both align to same time buckets - `any_not_null()` picks the non-null value from each period - Result: side-by-side comparison in same row **Sample Output**: ``` service_name current previous change pct_change frontend-service 45000 15000 30000 200.0 checkout-api 8000 15000 -7000 -46.7 payment-service 500 10000 -9500 -95.0 ``` ### Day-Over-Day Comparison ```opal @today <- @ { align rate:sum(m("span_call_count_5m")) aggregate today_sum:sum(rate), group_by(service_name) } @yesterday <- @ { timeshift 1d # 24 hours align rate:sum(m("span_call_count_5m")) aggregate yesterday_sum:sum(rate), group_by(service_name) } @combined <- @today { union @yesterday aggregate today:any_not_null(today_sum), yesterday:any_not_null(yesterday_sum), group_by(service_name) } <- @combined { make_col change:today - yesterday make_col pct_change:if(yesterday > 0, (change / yesterday) * 100, 0) sort desc(pct_change) } ``` ### Week-Over-Week Comparison ```opal @this_week <- @ { align 1h, rate:sum(m("span_call_count_5m")) aggregate week_sum:sum(rate), group_by(service_name) } @last_week <- @ { timeshift 7d # One week align 1h, rate:sum(m("span_call_count_5m")) aggregate last_week_sum:sum(rate), group_by(service_name) } @combined <- @this_week { union @last_week aggregate this_week:any_not_null(week_sum), last_week:any_not_null(last_week_sum), group_by(service_name) } <- @combined { make_col growth:this_week - last_week make_col growth_pct:if(last_week > 0, (growth / last_week) * 100, 0) sort desc(growth_pct) } ``` ## Pattern 2: Period-Over-Period Comparison (Events/Intervals) **Use Case**: Compare raw event/span counts across time periods **Strategy**: Same union pattern, but use `statsby` instead of `align` + `aggregate` ### Example: Error Count This Hour vs Last Hour ```opal @current <- @ { filter error = true statsby current_errors:count(), group_by(service_name) } @previous <- @ { timeshift 1h filter error = true statsby prev_errors:count(), group_by(service_name) } @combined <- @current { union @previous aggregate current:any_not_null(current_errors), previous:any_not_null(prev_errors), group_by(service_name) } <- @combined { make_col error_change:current - previous make_col pct_change:if(previous > 0, (error_change / previous) * 100, 0) filter current > 10 # Only services with significant errors sort desc(current) } ``` **Key Difference**: Use `statsby` for event datasets, `align` + `aggregate` for metrics. ## Pattern 3: Time-Series Period Comparison **Use Case**: Chart current vs previous period trends over time ### Example: Current vs Previous Week (Hourly Buckets) ```opal @current <- @ { align 1h, rate:sum(m("span_call_count_5m")) aggregate current_rate:sum(rate), group_by(service_name) } @previous <- @ { timeshift 7d align 1h, rate:sum(m("span_call_count_5m")) aggregate prev_rate:sum(rate), group_by(service_name) } @combined <- @current { union @previous aggregate current:any_not_null(current_rate), previous:any_not_null(prev_rate), group_by(service_name, _c_bucket) } <- @combined { make_col change:current - previous make_col pct_change:if(previous > 0, (change / previous) * 100, 0) } ``` **Output**: Time-series with both periods aligned by bucket, suitable for line charts showing trends over time. **Note**: This pattern returns multiple rows per service (one per time bucket). For summary comparisons, use Pattern 1 instead. ## Pattern 4: Building Intermediate Results **Use Case**: Complex calculations requiring multiple steps ### Example: Calculate Error Budget Consumption ```opal # Step 1: Get total requests and errors @base <- @ { align options(bins: 1), rate:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m")) aggregate total_requests:sum(rate), total_errors:sum(errors), group_by(service_name) } # Step 2: Calculate SLO metrics @slo <- @base { make_col error_rate:if(total_requests > 0, total_errors / total_requests, 0) make_col success_rate:1 - error_rate make_col slo_target:0.999 # 99.9% SLO make_col error_budget:1 - slo_target } # Step 3: Calculate budget consumption <- @slo { make_col budget_consumed:if(error_budget > 0, error_rate / error_budget, 0) make_col budget_remaining:1 - budget_consumed make_col status:if(budget_consumed > 1, "VIOLATED", if(budget_consumed > 0.8, "WARNING", "HEALTHY")) filter total_requests > 1000 # Only services with traffic sort desc(budget_consumed) } ``` **Sample Output**: ``` service_name total_requests error_rate budget_consumed status adservice 870 0.0276 27.6 VIOLATED cartservice 2303 0.0091 9.1 VIOLATED frontend 15108 0.0016 1.6 VIOLATED productcatalog 8838 0.0000 0.0 HEALTHY ``` **Why This Works**: - `@base` subquery aggregates raw metrics (requests + errors) - `@slo` subquery builds on `@base`, adding calculated SLO fields - Final stage uses `@slo` results to compute budget status - Each stage can reference all columns from previous stages ## Understanding any_not_null() The `any_not_null()` function is crucial for union patterns: ```opal # After union, you typically have sparse data: # Row 1: current_sum=100, prev_sum=null # Row 2: current_sum=null, prev_sum=80 aggregate current:any_not_null(current_sum), previous:any_not_null(prev_sum), group_by(service_name) # Result: # Row 1: current=100, previous=80 ``` **How it works**: - Groups by service_name (or other dimensions) - For each group, finds any non-null value across all union rows - Collapses sparse union into single row per group **Alternative functions**: - `any()` - Picks arbitrary value (may be null) - `min()` / `max()` - Numeric min/max (only for numbers) - `any_not_null()` - Best for union collapse (picks any non-null) ## Common Patterns Summary | Use Case | Subqueries Needed | Key Verbs | |----------|-------------------|-----------| | Period-over-period (metrics) | 2+ | `timeshift`, `align`, `union`, `any_not_null()` | | Period-over-period (events) | 2+ | `timeshift`, `statsby`, `union`, `any_not_null()` | | Time-series comparison | 2+ | `timeshift`, `align`, `union`, `any_not_null()`, `group_by(_c_bucket)` | | Complex calculations | 1-3 | `make_col`, pipeline stages | **Note**: For A/B comparisons across different filter conditions, use conditional columns with `if()` statements instead of subqueries. For multi-dataset joins, use `lookup` or `join` verbs (see **working-with-resources** skill). ## Troubleshooting ### Issue: "Columns don't match in union" **Cause**: Union requires exact column name matches **Solution**: Ensure both subqueries produce same column names ```opal # WRONG - column names don't match @a <- @ { aggregate count_a:count() } @b <- @ { aggregate count_b:count() } <- @a { union @b } # Error! # CORRECT - same column names @a <- @ { aggregate cnt:count() } @b <- @ { aggregate cnt:count() } <- @a { union @b } # Works! ``` ### Issue: "All nulls after any_not_null()" **Cause**: group_by dimensions don't align across union sources **Solution**: Verify both subqueries group by same dimensions ```opal # WRONG - different group_by @a <- @ { aggregate cnt:count(), group_by(service_name) } @b <- @ { aggregate cnt:count(), group_by(namespace) } <- @a { union @b; aggregate total:any_not_null(cnt), group_by(service_name) } # Result: Nulls (no matching groups) # CORRECT - same group_by @a <- @ { aggregate cnt:count(), group_by(service_name) } @b <- @ { aggregate cnt:count(), group_by(service_name) } <- @a { union @b; aggregate total:any_not_null(cnt), group_by(service_name) } ``` ### Issue: "Timeshift has no effect" **Cause**: Timeshift applied AFTER align (too late!) **Solution**: Always timeshift BEFORE align ```opal # WRONG - timeshift after align @previous <- @ { align rate:sum(m("metric")) timeshift 1h # Too late! } # CORRECT - timeshift before align @previous <- @ { timeshift 1h # First! align rate:sum(m("metric")) } ``` ### Issue: "Can't reference @label" **Cause**: Trying to use label before it's defined **Solution**: Define subquery first, then reference it ```opal # WRONG - @combined used before definition <- @combined { ... } @combined <- @ { ... } # CORRECT - define first @combined <- @ { ... } <- @combined { ... } ``` ## Performance Considerations ### When to Use Subqueries vs Single Query **Use subqueries when**: - Period-over-period comparison (timeshift required) - Building complex intermediate results - Readability improves significantly **Avoid subqueries when**: - Simple A/B comparison (use conditional columns) - Single metric aggregation - Performance is critical (subqueries add overhead) ### Optimization Tips 1. **Filter early**: Apply filters in subqueries, not after union ```opal # GOOD @current <- @ { filter service_name = "frontend" # Filter early align rate:sum(m("metric")) } # BAD @current <- @ { align rate:sum(m("metric")) } <- @current { filter service_name = "frontend" # Filter late (processes all services) } ``` 2. **Use options(bins: 1) for summaries**: Reduces data volume in union ```opal @current <- @ { align options(bins: 1), rate:sum(m("metric")) # Single row per service aggregate total:sum(rate), group_by(service_name) } ``` 3. **Limit union sources**: Each union source adds processing cost ## Comparison: Subquery Union vs Window Functions Both can solve period-over-period comparison, but have different tradeoffs: | Aspect | Subquery + Union | Window(lag) | |--------|------------------|-------------| | **Syntax** | More verbose | More concise | | **Flexibility** | Can compare any time periods | Limited to adjacent rows | | **Performance** | Processes data twice | Single pass | | **Time buckets** | Aligns arbitrary periods | Only sequential buckets | | **Use case** | Day-over-day, week-over-week | Row-to-row change detection | **Example: Hour-over-hour rate of change** ```opal # Window approach (simpler for sequential buckets) align 1h, rate:sum(m("span_call_count_5m")) | make_col prev_rate:window(lag(rate, 1), group_by(service_name)) | make_col change:rate - prev_rate | make_col pct:if(prev_rate > 0, (change / prev_rate) * 100, 0) # Union approach (more flexible, can compare any offset) @current <- @ { align 1h, rate:sum(m("span_call_count_5m")) aggregate current:sum(rate), group_by(service_name) } @previous <- @ { timeshift 1h align 1h, rate:sum(m("span_call_count_5m")) aggregate prev:sum(rate), group_by(service_name) } @combined <- @current { union @previous aggregate current:any_not_null(current), prev:any_not_null(prev), group_by(service_name) } <- @combined { make_col change:current - prev make_col pct:if(prev > 0, (change / prev) * 100, 0) } ``` **Recommendation**: - Use **window(lag)** for simple sequential comparisons (previous hour, previous bucket) - Use **union + timeshift** for arbitrary period comparisons (same hour yesterday, last week) ## Related Skills - **window-functions-deep-dive** - Covers window(lag/lead/avg) patterns - **detecting-anomalies** - Uses both union and window patterns - **aggregating-gauge-metrics** - Foundation for metric aggregation - **time-series-analysis** - Time bucketing with timechart ## Key Takeaways 1. **Subquery syntax**: `@label <- @dataset { pipeline }` 2. **Union combines**: Same schema, different filters/time periods 3. **Timeshift before align**: Critical for metric comparisons 4. **any_not_null()**: Collapses sparse union results 5. **Not always best**: Consider conditional columns for simple A/B tests 6. **Performance matters**: Union processes data multiple times When in doubt about subquery syntax or complex union patterns, use `learn_observe_skill("OPAL subquery")` for official documentation.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/rustomax/observe-experimental-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server