Open Census MCP Server

phase4b_decision_log_DEC021.md•3.3 KiB

# DEC-4B-021: Judge Pass Count and Statistical Power

**Date:** 2026-02-13

**Context:** Determining how many judge passes to run within Gemini 3 Pro's 250 calls/day rate limit. Each pass = 39 queries × 1 ordering. Budget allows up to 6 passes per judge per day (234 calls).

## Power Analysis

### Primary Test: Paired Wilcoxon Signed-Rank (N=39)

| Effect size d | Power | Detectable Δ (sd=2) | Detectable Δ (sd=3) |
|---|---|---|---|
| 0.3 (small) | 0.55 | 0.6 CQS pts | 0.9 CQS pts |
| 0.4 | 0.77 | 0.8 CQS pts | 1.2 CQS pts |
| 0.5 (medium) | 0.91 | 1.0 CQS pts | 1.5 CQS pts |
| 0.8 (large) | 0.999 | 1.6 CQS pts | 2.4 CQS pts |

### Subgroup Power (the real constraint)

| Subgroup | N | d=0.5 power | d=0.8 power |
|---|---|---|---|
| All queries | 39 | 0.91 | 0.999 |
| Edge cases | 23 | 0.72 | 0.975 |
| Normal queries | 16 | 0.56 | 0.90 |
| Traps only | 11 | 0.41 | 0.75 |

### Diminishing Returns of Additional Passes

With noise/signal ratio = 1.0 (conservative assumption):

| Passes | Measurements/query | Effective d | Power (N=39) | Power (N=23) | Marginal gain |
|---|---|---|---|---|---|
| 2 | 6 | 0.463 | 0.87 | 0.68 | n/a |
| 4 | 12 | 0.480 | 0.89 | 0.70 | +0.02 |
| 6 | 18 | 0.487 | 0.90 | 0.71 | +0.01 |
| 8 | 24 | 0.490 | 0.91 | 0.72 | +0.005 |
| 12 | 36 | 0.493 | 0.90 | 0.71 | -0.002 |

**Key insight:** Additional passes reduce measurement error but do not increase the number of independent observations. The bottleneck is N=39 queries, not judge noise. Going from 6 to 12 passes buys ~1% power, which is not worth a second day of API calls.

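
The "Effective d" column in the diminishing-returns table is consistent with a simple attenuation model in which averaging m measurements per query shrinks the noise variance by a factor of m. The sketch below assumes d_eff = d / sqrt(1 + r/m), with r the noise/signal variance ratio (1.0 here) and a true effect of d = 0.5. That formula is inferred from the table's values rather than stated in the log, and the function name `effective_d` is hypothetical.

```python
import math

JUDGES = 3  # three judges, per the decision log


def effective_d(true_d: float, noise_ratio: float, measurements: int) -> float:
    """Effect size after averaging `measurements` noisy scores per query.

    Assumed model (not stated in the log): noise variance shrinks as 1/m,
    so d_eff = true_d / sqrt(1 + noise_ratio / m).
    """
    return true_d / math.sqrt(1.0 + noise_ratio / measurements)


for passes in (2, 4, 6, 8, 12):
    m = passes * JUDGES  # measurements per query across all judges
    print(f"{passes:>2} passes, {m:>2} measurements: d_eff = {effective_d(0.5, 1.0, m):.3f}")
```

Under this assumption, d_eff approaches the true d of 0.5 as m grows but can never exceed it, which is one way to see why extra passes flatten out while N stays fixed at 39.
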
## Decision

**6 passes per judge (234 calls/judge, 702 total), single-day execution.**

| Pass | Ordering | Purpose |
|---|---|---|
| 1 | control_first | Base scoring |
| 2 | treatment_first | Position bias measurement |
| 3 | control_first | Test-retest #1 |
| 4 | treatment_first | Test-retest #1 × position |
| 5 | control_first | Test-retest #2 (confidence intervals) |
| 6 | treatment_first | Test-retest #2 × position (CIs) |

### What this enables

- Position bias: full battery, both orderings, 3 replications each
- Test-retest ICC: 6 measurements per query per judge (3 per ordering)
- Per-query confidence intervals: bootstrappable from 18 measurements (3 judges × 6 passes)
- Google position bias reproduction: direct comparison to FedSurvConceptMapper finding
- Split-half reliability: can split 6 passes into two groups of 3

### What this does NOT fix

- Trap-query subgroup (N=11) is underpowered at d=0.5 (power=0.41)
- Normal-query subgroup (N=16) is underpowered at d=0.5 (power=0.56)
- If subgroup effects matter, **add more queries** in a future battery, don't add passes

### Rate limit compliance

- Gemini 3 Pro hard limit: 250/day
- Our usage: 234/day (93.6% utilization)
- Headroom: 16 calls for retries on parse failures

## Trade-off

More passes vs. more queries. Passes reduce noise; queries increase statistical power. At N=39, we are noise-limited only for the overall comparison. For subgroup analysis, we are sample-size-limited. Future work should expand the battery (especially edge cases and traps) rather than increase passes.

> "Model stochasticity is a parameter to manage; system nondeterminism is a failure mode to control." (B. Webb)

The 6-pass design manages judge stochasticity (measurement noise) while accepting the fixed N=39 as a known limitation documented for future improvement.
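
The call budget and alternating pass schedule from the Decision section can be checked with a few lines of arithmetic. This is a minimal sketch: the constants come from the log, while the variable names and the `pass_schedule` helper are hypothetical. Like the log, it compares per-judge usage against the 250 calls/day limit.

```python
QUERIES = 39       # battery size
PASSES = 6         # passes per judge
JUDGES = 3         # number of judges
DAILY_LIMIT = 250  # Gemini 3 Pro calls/day


def pass_schedule(n_passes: int) -> list[str]:
    """Alternate orderings, starting with control_first as in the pass table."""
    return [
        "control_first" if i % 2 == 0 else "treatment_first"
        for i in range(n_passes)
    ]


calls_per_judge = QUERIES * PASSES           # one call per query per pass
total_calls = calls_per_judge * JUDGES       # across all judges
headroom = DAILY_LIMIT - calls_per_judge     # calls left for retries
utilization = calls_per_judge / DAILY_LIMIT  # fraction of the daily limit used
max_passes = DAILY_LIMIT // QUERIES          # most whole passes that fit in a day

print(f"calls/judge={calls_per_judge}, total={total_calls}, "
      f"headroom={headroom}, utilization={utilization:.1%}, max passes={max_passes}")
for number, ordering in enumerate(pass_schedule(PASSES), start=1):
    print(number, ordering)
```

A seventh pass would need 273 calls, which exceeds the daily limit, so 6 is the largest whole number of passes that fits in a single day.
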
