# Chapter 6B: Statistical Analysis Plan
[← Ch. 6](06_success_criteria.md) | [README](README.md) | [Ch. 7 →](07_risk_register.md)
---
## 6B.1 Purpose
This chapter specifies every statistical test, bias diagnostic, and effect measure
to be computed from the judge scoring output. The governing principle is
**collect everything, analyze selectively** — computation is cheap, re-running
pipelines with API costs is not.
All metrics below are computed and stored. Reporting decisions (what goes in the
paper vs. supplementary materials) are made after seeing the data.
---
## 6B.2 Data Collection Requirements
### 6B.2.1 Per-Judge-Call Record
Every judge invocation must record:
| Field | Type | Purpose |
|---|---|---|
| `query_id` | str | Links to battery |
| `judge_model` | str | Exact model string |
| `judge_vendor` | str | anthropic / openai / google |
| `presentation_order` | str | "control_first" or "treatment_first" |
| `scores_response_a` | dict | D1-D6 scores for Response A |
| `scores_response_b` | dict | D1-D6 scores for Response B |
| `preference` | str | "A" / "B" / "tie" |
| `confidence` | dict | Per-dimension confidence (1-5 scale) |
| `reasoning` | str | Full judge reasoning text |
| `response_a_label` | str | "control" or "treatment" (for analysis, not shown to judge) |
| `response_b_label` | str | "control" or "treatment" |
| `latency_ms` | int | Judge call latency |
| `input_tokens` | int | Prompt tokens consumed |
| `output_tokens` | int | Judge response tokens |
| `timestamp` | str | ISO 8601 |
| `run_id` | str | Batch run identifier |
### 6B.2.2 Position Bias Subset
For a subset of queries (minimum 10, ideally all 39), run each judge in
**both orderings** (control-first and treatment-first). This doubles the
judge calls for this subset but enables direct measurement of position bias.
Total judge calls budget:
- Base (one ordering): 39 queries × 3 judges = 117 calls
- Position bias (second, reversed ordering, all 39): 39 × 3 = 117 additional calls
- Test-retest (10-query subset, second run): 10 × 3 = 30 calls
- **Total: 117 + 117 + 30 = 264 calls**
### 6B.2.3 Derived Fields (computed post-hoc)
| Field | Derivation |
|---|---|
| `control_total_cqs` | Sum of D1-D6 for control response |
| `treatment_total_cqs` | Sum of D1-D6 for treatment response |
| `cqs_delta` | treatment_total - control_total |
| `control_char_count` | len(control response text) |
| `treatment_char_count` | len(treatment response text) |
| `char_ratio` | treatment_chars / control_chars |
| `gate_failure_control` | True if control D6 = 0 |
| `gate_failure_treatment` | True if treatment D6 = 0 |
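A minimal sketch of the derivation step, assuming score dicts are keyed `"D1"`–`"D6"` as in §6B.2.1 (the length fields, which need the raw response texts, are omitted):

```python
def derive_fields(rec):
    """Compute post-hoc fields for one judge-call record (dict)."""
    # Map the A/B slots back to conditions using the hidden labels.
    by_label = {rec["response_a_label"]: rec["scores_response_a"],
                rec["response_b_label"]: rec["scores_response_b"]}
    ctrl = sum(by_label["control"].values())
    treat = sum(by_label["treatment"].values())
    return {
        "control_total_cqs": ctrl,
        "treatment_total_cqs": treat,
        "cqs_delta": treat - ctrl,
        "gate_failure_control": by_label["control"]["D6"] == 0,
        "gate_failure_treatment": by_label["treatment"]["D6"] == 0,
    }
```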
---
## 6B.3 Inter-Rater Agreement
### 6B.3.1 Primary: Krippendorff's Alpha (ordinal)
Computed per dimension across all 3 judges, using the ordinal difference
function (0 < 1 < 2). This is appropriate for the CQS scale, where the
ordering of scores is meaningful but the intervals between them need not be equal.
| Interpretation | α range |
|---|---|
| Poor | < 0.2 |
| Fair | 0.2 – 0.4 |
| Moderate | 0.4 – 0.6 |
| Substantial | 0.6 – 0.8 |
| Near-perfect | > 0.8 |
**Threshold:** α ≥ 0.4 (moderate) per dimension. If any dimension < 0.2,
that dimension's scores are unreliable and should be reported but flagged.
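A self-contained sketch of ordinal alpha via the coincidence-matrix formulation, for transparency (in practice the `krippendorff` PyPI package may be preferable):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_ordinal(units):
    """units: one list of ordinal ratings per query (e.g. the 3 judges'
    scores for one dimension). Returns Krippendorff's alpha, ordinal metric."""
    units = [u for u in units if len(u) >= 2]
    values = sorted({v for u in units for v in u})
    # Coincidence matrix: each ordered pair within a unit, weighted 1/(m-1).
    coincidence = Counter()
    for u in units:
        m = len(u)
        for a, b in permutations(range(m), 2):
            coincidence[(u[a], u[b])] += 1.0 / (m - 1)
    n_c = {c: sum(coincidence[(c, k)] for k in values) for c in values}
    n_total = sum(n_c.values())

    def delta2(c, k):
        # Ordinal distance: squared deviation of cumulative marginals.
        lo, hi = min(c, k), max(c, k)
        between = sum(n_c[g] for g in values if lo <= g <= hi)
        return (between - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    d_o = sum(coincidence[(c, k)] * delta2(c, k)
              for c in values for k in values if c != k)
    d_e = sum(n_c[c] * n_c[k] * delta2(c, k)
              for c in values for k in values if c != k) / (n_total - 1)
    return 1.0 - d_o / d_e if d_e else 1.0
```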
### 6B.3.2 Secondary: Fleiss' Kappa
Multi-rater agreement treating scores as nominal categories. Less appropriate
than α-ordinal for our scale but widely recognized. Report for comparability
with harmonization study (κ = 0.611 rater tier, 0.843 arbitrator tier).
### 6B.3.3 Pairwise: Cohen's Kappa
All 3 pairs (Claude-GPT, Claude-Gemini, GPT-Gemini). Identifies which judge
pairs agree most/least. If one judge is an outlier, this reveals it.
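A minimal unweighted Cohen's kappa over the three judge pairs, shown on hypothetical per-dimension scores:

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two judges' score lists."""
    n = len(a)
    labels = sorted(set(a) | set(b))
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical D1 scores from the three judges over the same queries:
scores = {
    "claude": [2, 1, 2, 0, 2, 1, 2, 2],
    "gpt":    [2, 1, 2, 1, 2, 1, 2, 2],
    "gemini": [2, 2, 2, 0, 1, 1, 2, 2],
}
pairwise = {pair: cohens_kappa(scores[pair[0]], scores[pair[1]])
            for pair in combinations(scores, 2)}
```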
### 6B.3.4 Percent Agreement
Simple agreement rate per dimension and overall. Report alongside κ/α for
context (κ can be misleadingly low with skewed distributions).
---
## 6B.4 Bias Diagnostics
### 6B.4.1 Position Bias
**Method:** For each query scored in both orderings, compute:
- `swap_consistency`: Does the judge give the same preference regardless of order?
- `position_effect`: Mean score difference (Response A score - Response B score)
across all calls, regardless of which is treatment. If > 0, first-position bias.
**Metric:** Swap consistency rate (% of queries where preference is invariant
to ordering). Per Zheng et al. 2023, position bias is significant when
consistency < 80%.
**Statistical test:** Paired t-test or Wilcoxon on per-dimension scores between
orderings for the same query-judge pair.
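A sketch of both position-bias metrics, assuming preferences have already been mapped from A/B slots back to condition labels:

```python
def swap_consistency(paired_prefs):
    """Fraction of query-judge pairs whose preference (in condition terms:
    'control'/'treatment'/'tie') is invariant to presentation order."""
    hits = sum(p1 == p2 for p1, p2 in paired_prefs)
    return hits / len(paired_prefs)

def position_effect(records):
    """Mean (Response A total - Response B total) across all calls,
    regardless of condition; > 0 indicates first-position inflation."""
    diffs = [sum(r["scores_response_a"].values())
             - sum(r["scores_response_b"].values()) for r in records]
    return sum(diffs) / len(diffs)
```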
### 6B.4.2 Verbosity Bias
**Method:** Correlate response length (char count and token count) with
CQS total score across all judge calls.
**Metrics:**
- Spearman's ρ between `response_char_count` and `total_cqs` (across both conditions)
- Partial correlation controlling for condition (control/treatment)
- Regression: `CQS ~ length + condition + length:condition`
**Interpretation:** If ρ > 0.7, judges are rewarding length. If the interaction
term (length:condition) is significant, the length effect differs by condition —
which would mean treatment is getting a boost from verbosity, not content.
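A sketch of the length-score correlation on hypothetical data (the partial correlation and the interaction regression would use e.g. statsmodels, not shown here):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical char counts and total CQS across judge calls:
lengths = np.array([820, 1450, 990, 2100, 1300, 760, 1680, 1120])
cqs = np.array([7, 9, 8, 11, 9, 6, 10, 8])

# Rank correlation; rho near 1 would flag verbosity-confounded scoring.
rho, p = spearmanr(lengths, cqs)
```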
### 6B.4.3 Self-Enhancement / Vendor Bias
**Method:** For each judge, compute mean CQS awarded to control and treatment
responses. Then compare across judges to detect systematic vendor-specific inflation.
**Metrics:**
- Per-judge: Mean treatment CQS, mean control CQS, mean delta
- Self-vendor indicator: Does Claude-judge rate treatment (generated by Claude)
higher than other judges rate treatment?
- Cross-vendor comparison table:
| Judge | Mean Ctrl CQS | Mean Treat CQS | Δ | Treat Preference % |
|---|---|---|---|---|
| Claude Opus | | | | |
| GPT-5.2 | | | | |
| Gemini 3 Pro | | | | |
**Statistical test:** Kruskal-Wallis on treatment CQS scores grouped by judge
vendor. If p < 0.05, at least one judge scores systematically differently.
**Self-enhancement ratio:** Claude_treatment_mean / non_Claude_treatment_mean.
Values > 1.05 suggest self-enhancement bias.
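A sketch of the Kruskal-Wallis test and the self-enhancement ratio on hypothetical per-judge treatment scores:

```python
from scipy.stats import kruskal

# Hypothetical treatment total-CQS per judge:
claude = [10, 11, 9, 12, 10, 11]
gpt = [9, 10, 9, 11, 10, 10]
gemini = [10, 10, 8, 11, 9, 10]

# At least one judge scores differently if p < 0.05.
stat, p = kruskal(claude, gpt, gemini)

# Self-enhancement ratio: Claude's mean over the non-Claude mean.
ratio = (sum(claude) / len(claude)) / (
    (sum(gpt) + sum(gemini)) / (len(gpt) + len(gemini)))
```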
### 6B.4.4 Leniency / Severity Bias
**Rationale:** Some judges may be systematically lenient or harsh across all responses.
**Metric:** Per-judge mean CQS across both conditions. Report the range.
If one judge's mean is > 1.5 CQS points above or below the others, flag as
leniency/severity bias.
---
## 6B.5 Treatment Effect Analysis
### 6B.5.1 Primary: Paired Wilcoxon Signed-Rank Test
Paired by query_id. Tests whether treatment CQS differs from control CQS.
Non-parametric (appropriate for ordinal scale, N=39).
Computed for:
- Total CQS (0-12)
- Each dimension D1-D6 individually
- Query subgroups (normal vs edge)
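A minimal sketch of the primary test on hypothetical paired totals:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-query total CQS, paired by query_id:
control = np.array([7, 8, 6, 9, 7, 8, 5, 7, 6, 8])
treatment = np.array([9, 10, 8, 10, 10, 11, 8, 9, 8, 9])

# Two-sided by default; zeros in the differences are dropped (zero_method).
stat, p = wilcoxon(treatment, control)
```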
### 6B.5.2 Effect Size: Cohen's d (paired)
`d = mean(Δ) / sd(Δ)` where Δ = treatment_cqs - control_cqs per query.
| Interpretation | d |
|---|---|
| Small | 0.2 |
| Medium | 0.5 |
| Large | 0.8 |
Also compute rank-biserial correlation r as a non-parametric alternative.
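Both effect sizes can be sketched as follows, with `deltas` being the per-query treatment-minus-control differences:

```python
import statistics

def paired_cohens_d(deltas):
    """d = mean(delta) / sd(delta) over per-query paired differences."""
    return statistics.mean(deltas) / statistics.stdev(deltas)

def rank_biserial(deltas):
    """Matched-pairs rank-biserial r from signed ranks (zeros dropped)."""
    nz = [d for d in deltas if d != 0]
    s = sorted(nz, key=abs)
    ranks = {}  # average rank per |delta|, handling ties
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and abs(s[j]) == abs(s[i]):
            j += 1
        avg = (i + 1 + j) / 2  # average of ranks i+1 .. j
        ranks[abs(s[i])] = avg
        i = j
    w_pos = sum(ranks[abs(d)] for d in nz if d > 0)
    w_neg = sum(ranks[abs(d)] for d in nz if d < 0)
    return (w_pos - w_neg) / (w_pos + w_neg)
```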
### 6B.5.3 Subgroup Analysis
Split by `difficulty` (normal / tricky / trap):
- Normal queries: expect small or no treatment effect (TOST equivalence test)
- Tricky queries: expect moderate treatment advantage
- Trap queries: expect large treatment advantage
Split by `category`:
- Per-category mean delta and significance test
- Identifies which failure modes pragmatics most effectively addresses
### 6B.5.4 TOST Equivalence Testing
For normal queries, where we hypothesize no treatment effect, run
Two One-Sided Tests (TOST) with an equivalence margin of ±1 CQS point.
If TOST p < 0.05, we can claim treatment and control are equivalent on
easy queries (pragmatics don't hurt).
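A TOST sketch using two one-sided one-sample t-tests of the deltas against the ±1 margin:

```python
import numpy as np
from scipy.stats import ttest_1samp

def tost_p(deltas, margin=1.0):
    """TOST p-value: reject both one-sided nulls (mean <= -margin and
    mean >= +margin) to claim equivalence within the margin."""
    p_lower = ttest_1samp(deltas, -margin, alternative="greater").pvalue
    p_upper = ttest_1samp(deltas, margin, alternative="less").pvalue
    return max(p_lower, p_upper)
```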
### 6B.5.5 D6 Gate Analysis
- Proportion of D6=0 (gate failure) in control vs treatment
- McNemar's test for paired proportions
- If treatment has significantly fewer gate failures, this is a key finding
about grounding preventing hallucination
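A sketch of the exact (binomial) form of McNemar's test, run on the counts of discordant pairs:

```python
from scipy.stats import binomtest

def mcnemar_exact_p(b, c):
    """Exact McNemar p-value. b: queries where only control fails the D6
    gate; c: queries where only treatment fails. Concordant pairs drop out."""
    n = b + c
    return binomtest(min(b, c), n, 0.5).pvalue if n else 1.0
```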
---
## 6B.6 Internal Consistency / Reliability
### 6B.6.1 Test-Retest Reliability
Run the 10-query subset through each judge twice (same prompt, separate API calls).
Compute ICC (intraclass correlation coefficient, two-way random, absolute agreement)
per dimension and total CQS.
| Interpretation | ICC |
|---|---|
| Poor | < 0.5 |
| Moderate | 0.5 – 0.75 |
| Good | 0.75 – 0.9 |
| Excellent | > 0.9 |
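A self-contained ICC(2,1) sketch (two-way random, absolute agreement), with test and retest scores as the two columns:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1), absolute agreement. x: (n_subjects, k_raters) matrix."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Two-way ANOVA decomposition: rows = subjects, columns = raters/runs.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```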
### 6B.6.2 Confidence Calibration
If judges provide per-dimension confidence (1-5 scale):
- Compute correlation between confidence and agreement with majority vote
- High confidence should predict agreement; low confidence should predict disagreement
- If inverse or null relationship, confidence scores are uninformative
### 6B.6.3 Cronbach's Alpha (internal consistency of CQS dimensions)
Treat D1-D6 as items on a 6-item scale. Compute Cronbach's α to assess
whether the dimensions measure a coherent underlying construct.
| Interpretation | α |
|---|---|
| Unacceptable | < 0.5 |
| Poor | 0.5 – 0.6 |
| Questionable | 0.6 – 0.7 |
| Acceptable | 0.7 – 0.8 |
| Good | > 0.8 |
Note: Very high α (> 0.95) would suggest dimension redundancy — the 6
dimensions may be measuring the same thing with different labels.
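A minimal Cronbach's alpha over a responses × dimensions score matrix:

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_responses, k_items) array, here k = 6 for D1-D6."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total CQS
    return k / (k - 1) * (1 - item_vars / total_var)
```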
---
## 6B.7 Human Calibration Metrics
When human scoring is available from the calibration packet:
### 6B.7.1 Judge-Human Agreement
- Cohen's κ per dimension: each judge vs. human expert
- ICC between each judge and human for total CQS
- Identify which judge most closely matches human expert judgment
### 6B.7.2 Systematic Disagreement Patterns
- Where judges disagree with humans, is the pattern systematic?
- Do judges consistently over-score or under-score specific dimensions?
- Plot per-dimension: human score vs. judge score (scatter + regression line)
---
## 6B.8 Summary Metrics Table
For the paper, the minimum reporting set:
| Metric | Purpose | Section |
|---|---|---|
| Krippendorff's α per dimension | Inter-rater reliability | 6B.3.1 |
| Position swap consistency % | Position bias magnitude | 6B.4.1 |
| Length-CQS Spearman ρ | Verbosity bias magnitude | 6B.4.2 |
| Self-enhancement ratio | Vendor bias magnitude | 6B.4.3 |
| Wilcoxon p (edge cases) | Treatment effect significance | 6B.5.1 |
| Cohen's d (edge cases) | Treatment effect size | 6B.5.2 |
| TOST p (normal queries) | Equivalence on easy queries | 6B.5.4 |
| McNemar p (D6 gate) | Grounding prevents hallucination | 6B.5.5 |
| ICC test-retest | Judge reliability | 6B.6.1 |
| Cronbach's α | CQS internal consistency | 6B.6.3 |
Everything else is supplementary / diagnostic.
---
## 6B.9 Analysis Scripts Specification
All analysis code must:
1. Read from the judge scoring JSONL (single input file)
2. Output all metrics to a structured JSON report
3. Generate publication-ready tables (LaTeX and Markdown)
4. Generate diagnostic plots (saved as PNG/SVG)
5. Be deterministic (seeded where applicable)
6. Run without modification after judge scoring completes