SuperDB MCP Server

Overview Schema Related Servers Score Discussions

superdb-mcp
docs
tutorials

fork_for_window.md•8.65 KiB

--- title: "Fork as a Window Function Workaround" description: "Using fork as a workaround for window functions to do per-group selection." layout: default nav_order: 3 parent: Tutorials superdb_version: "0.1.0" last_updated: "2026-02-20" --- # Fork as a Window Function Workaround Window functions like `ROW_NUMBER() OVER (PARTITION BY ...)` are not yet available in SuperDB ([brimdata/super#5921][issue]). This tutorial shows how to use `fork` to achieve per-group selection — picking the top N items from each group. [issue]: https://github.com/brimdata/super/issues/5921 ## The Problem You have a pool of available EC2 instances spread across availability zones. You need to pick instances while maximizing AZ distribution — taking an equal number from each zone rather than filling up from one. ```mdtest-input instances.sup {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-003",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} {id:"i-007",az:"us-east-1c"} {id:"i-008",az:"us-east-1c"} ``` Distribution: 3 in `us-east-1a`, 1 in `us-east-1b`, 4 in `us-east-1c`. ## What You'd Want (Window Functions) In SQL with window functions, this would be straightforward: ```sql SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY az ORDER BY id) as rn FROM instances ) WHERE rn <= 2 ``` This assigns a row number within each AZ group, then filters to keep only the first 2 per group. But SuperDB doesn't support this yet. ## The Fork Approach `fork` splits the input stream into parallel branches. Each branch receives a copy of **all** the input records, processes them independently, and the results from every branch are merged back together into a single stream. Here's the full query — we'll break it down step by step after: ```mdtest-command super -s -c " from instances.sup | fork ( where az=='us-east-1a' | head 2 ) ( where az=='us-east-1b' | head 2 ) ( where az=='us-east-1c' | head 2 ) | sort az, id " ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} ``` ### Step by Step **Step 1: `from instances.sup`** — reads all 8 records into the stream: ```mdtest-command super -s -c "from instances.sup" ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-003",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} {id:"i-007",az:"us-east-1c"} {id:"i-008",az:"us-east-1c"} ``` **Step 2: `fork`** — sends all 8 records into each of three branches. Each branch sees the full input and processes it independently. **Branch 1:** `where az=='us-east-1a'` filters to 3 records, then `head 2` keeps the first 2: ```mdtest-command super -s -c "from instances.sup | where az=='us-east-1a' | head 2" ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} ``` (i-003 was filtered out by `head 2`) **Branch 2:** `where az=='us-east-1b'` filters to 1 record, `head 2` returns what's available: ```mdtest-command super -s -c "from instances.sup | where az=='us-east-1b' | head 2" ``` ```mdtest-output {id:"i-004",az:"us-east-1b"} ``` Only 1 instance exists in this AZ. `head 2` doesn't error or pad — it just returns what's there. **Branch 3:** `where az=='us-east-1c'` filters to 4 records, `head 2` keeps the first 2: ```mdtest-command super -s -c "from instances.sup | where az=='us-east-1c' | head 2" ``` ```mdtest-output {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} ``` (i-007 and i-008 were filtered out by `head 2`) **Step 3: implicit combine** — after the fork closes, results from all three branches merge back into a single stream of 5 records. Fork branches run in parallel and finish in nondeterministic order, so the combined output may be interleaved differently on each run. This is why the final `sort` matters. **Step 4: `sort az, id`** — sorts the combined results for clean, predictable output: ```mdtest-command super -s -c " from instances.sup | fork ( where az=='us-east-1a' | head 2 ) ( where az=='us-east-1b' | head 2 ) ( where az=='us-east-1c' | head 2 ) | sort az, id " ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} ``` 2 from `us-east-1a`, 1 from `us-east-1b` (all it had), 2 from `us-east-1c` — as balanced as possible given the available pool. ## Why Not Just Sort and Head? Without fork, you might try: ```mdtest-command super -s -c "from instances.sup | sort az, id | head 5" ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-003",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} ``` All 3 from `us-east-1a`, the 1 from `us-east-1b`, and only 1 from `us-east-1c`. That's unbalanced — it fills up from the first AZ alphabetically instead of distributing evenly. ## Verifying the Distribution You can check the balance of your selection by piping through an aggregate: ```mdtest-command super -s -c " from instances.sup | fork ( where az=='us-east-1a' | head 2 ) ( where az=='us-east-1b' | head 2 ) ( where az=='us-east-1c' | head 2 ) | aggregate count:=count() by az | sort az " ``` ```mdtest-output {az:"us-east-1a",count:2} {az:"us-east-1b",count:1} {az:"us-east-1c",count:2} ``` ## Alternative: Self-Join for Row Numbering There's a pure SQL approach that doesn't require fork and works dynamically with any number of groups. The idea: for each record, count how many records in the same group have an id less than or equal to it. This simulates `ROW_NUMBER() OVER (PARTITION BY az ORDER BY id)`. ```mdtest-command super -s -c " select a.id, a.az, count(*) as row_num from instances.sup a join instances.sup b on a.az = b.az and b.id <= a.id group by a.id, a.az order by a.az, a.id " ``` ```mdtest-output {id:"i-001",az:"us-east-1a",row_num:1} {id:"i-002",az:"us-east-1a",row_num:2} {id:"i-003",az:"us-east-1a",row_num:3} {id:"i-004",az:"us-east-1b",row_num:1} {id:"i-005",az:"us-east-1c",row_num:1} {id:"i-006",az:"us-east-1c",row_num:2} {id:"i-007",az:"us-east-1c",row_num:3} {id:"i-008",az:"us-east-1c",row_num:4} ``` Step by step, for record `i-006` in `us-east-1c`: 1. The self-join matches `i-006` against all `us-east-1c` records with `id <= 'i-006'`: that's `i-005` and `i-006` itself. 2. `count(*)` = 2, so `row_num` = 2. Now filter to keep only the first 2 per group: ```mdtest-command super -s -c " with ranked as ( select a.id, a.az, count(*) as row_num from instances.sup a join instances.sup b on a.az = b.az and b.id <= a.id group by a.id, a.az ) select id, az from ranked where row_num <= 2 order by az, id " ``` ```mdtest-output {id:"i-001",az:"us-east-1a"} {id:"i-002",az:"us-east-1a"} {id:"i-004",az:"us-east-1b"} {id:"i-005",az:"us-east-1c"} {id:"i-006",az:"us-east-1c"} ``` Same result as fork, but no hardcoded AZ names — works with any number of groups dynamically. ## Trade-offs **Fork** is simple and fast (linear scan per branch), but requires hardcoding group values. Best when groups are known and stable (like AZs in a region). **Self-join** is dynamic and handles any number of groups automatically, but is O(n^2) per group since every record is joined against all peers with a smaller key. Fine for small datasets, potentially slow for large ones. **With window functions** ([brimdata/super#5921][issue]), the query would be both dynamic and efficient — handling any number of groups with a single linear pass and supporting sophisticated ranking (e.g., ordering within groups by launch time, instance type preference, etc.). | Approach | Dynamic groups? | Time complexity | Notes | |------------------|-----------------|------------------|--------------------------------------------| | Fork | No | O(n) per branch | Groups must be hardcoded | | Self-join | Yes | O(n^2) per group | Every record joined against its group peers | | Window functions | Yes | O(n log n) | Sort + single pass (not yet available) | For a refresher on what those mean in practice ([Big O notation](https://en.wikipedia.org/wiki/Big_O_notation)): | Notation | Name | 100 records | 10,000 records | Growth | |------------|-------------|-------------|----------------|---------------| | O(n) | Linear | 100 | 10,000 | Scales nicely | | O(n log n) | Linearithmic| ~664 | ~132,877 | Typical sort | | O(n^2) | Quadratic | 10,000 | 100,000,000 | Gets slow fast|

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/chrismo/superdb-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

fork_for_window.md•8.65 KiB