# Numbers Registry — Publication Data Catalogue
**Purpose:** Single source of truth for every statistic cited in publication materials. Every number traces to a source file, computation script, SRS requirement, and V&V certification status.
**Rule:** No number appears in the paper, slides, or handout unless it has an entry here with status CERTIFIED or COUNTABLE. COMPUTED numbers may be drafted against but require V&V promotion (Maintenance Rule 2) before final citation; numbers with status PENDING or UNTRACED must be resolved before citation.
**Last updated:** 2026-02-21 (session 3: cost analysis COST-001–013, commit 24c9232)
---
## Status Definitions
| Status | Meaning |
|--------|---------|
| CERTIFIED | Produced by SRS-documented script, independently verified by V&V script (exit 0) |
| COUNTABLE | Deterministic count from a static artifact (file line count, YAML entries, DB query). Reproducible by inspection. |
| COMPUTED | Produced by SRS-documented script but no independent V&V verification yet |
| PENDING | Known to exist in raw data but no documented script produces the aggregate |
| UNTRACED | Cited in handoffs/threads but source not yet confirmed |
---
## Section 1: Study Design Parameters
| ID | Number | Description | Source File | Script | SRS Req | V&V | Status |
|----|--------|-------------|-------------|--------|---------|-----|--------|
| SD-001 | 39 | Total queries in test battery | `src/eval/battery/queries.yaml` | N/A (count) | VR-010 | — | COUNTABLE |
| SD-002 | 3 | Experimental conditions (control, RAG, pragmatics) | `src/eval/judge_config.yaml` | N/A (design) | VR-024–026 | — | COUNTABLE |
| SD-003 | 3 | Judge vendors (Anthropic, OpenAI, Google) | `src/eval/judge_config.yaml` | N/A (design) | VR-031 | — | COUNTABLE |
| SD-004 | 6 | Passes per comparison (3 vendors × 2 orderings) | `src/eval/judge_config.yaml` | N/A (design) | VR-032 | — | COUNTABLE |
| SD-005 | 3 | Pairwise comparisons (C-R, C-P, R-P) | `src/eval/judge_config.yaml` | N/A (design) | VR-041 | — | COUNTABLE |
| SD-006 | 2,106 | Total Stage 2 judge records. 0 parse failures. Backfills: 3 Anthropic (rag_vs_pragmatics, 2026-02-21 commit 2112ec7), 3 Google (control_vs_pragmatics, 2026-02-19). All files 702/702 clean. Zero delta on aggregates confirmed. | `results/v2_redo/stage2/*.jsonl` | N/A (count) | VR-041 | verify_registry_counts.py | COUNTABLE |
| SD-007 | 702 | Judge records per comparison (39 × 6 passes × 3 vendors) | `results/v2_redo/stage2/*.jsonl` | N/A (count) | VR-041 | verify_registry_counts.py | COUNTABLE |
| SD-008 | 5 | CQS dimensions scored (D1–D5) | `src/eval/judge_prompts.py` | N/A (design) | VR-037 | — | COUNTABLE |
| SD-009 | 38% / 62% | Normal (15) / edge case (24) split. Categories: normal(15), geographic_edge(7), small_area(4), temporal(4), ambiguity(3), product_mismatch(3), persona(3). SRS VR-010 corrected from 41%/59% to match actual battery. | `src/eval/battery/queries.yaml` | N/A (count) | VR-010 | verify_registry_counts.py | COUNTABLE |
| SD-010 | 1 | Caller model (`claude-sonnet-4-5-20250929`) | `src/eval/judge_config.yaml` | N/A (config) | VR-021 | — | COUNTABLE |
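The COUNTABLE counts above are pure arithmetic over the design parameters. A minimal check that mirrors the formulas the registry itself states in SD-004, SD-006, and SD-007 (a reproduction of the stated arithmetic, not a substitute for `verify_registry_counts.py`):

```python
# Arithmetic behind the COUNTABLE design counts, mirroring the registry's own
# stated formulas (SD-004: 3 vendors x 2 orderings; SD-007: 39 x 6 x 3).
QUERIES = 39        # SD-001
VENDORS = 3         # SD-003
ORDERINGS = 2       # per-vendor position swap (SD-004)
COMPARISONS = 3     # SD-005: C-R, C-P, R-P

passes = VENDORS * ORDERINGS                           # SD-004 -> 6
records_per_comparison = QUERIES * passes * VENDORS    # SD-007 -> 702
total_records = records_per_comparison * COMPARISONS   # SD-006 -> 2106
print(passes, records_per_comparison, total_records)   # 6 702 2106
```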
---
## Section 2: Pragmatics Layer Parameters
| ID | Number | Description | Source File | Script | SRS Req | V&V | Status |
|----|--------|-------------|-------------|--------|---------|-----|--------|
| PL-001 | 36 | Curated pragmatic items in compiled ACS pack (36 context rows, 35 threads). **NOTE:** Previously cited as 35 — actual compiled pack has 36. | `packs/acs.db` (context table) | N/A (DB count) | — | `verify_registry_counts.py` | COUNTABLE |
| PL-002 | 36 | Staged pragmatic items in `staging/acs/*.json` (18 category files). **NOTE:** Previously cited as 47 — DISCREPANCY. Actual count is 36 from verify script. The 47 figure is untraced; may reflect an older version or items staged in other domains. | `staging/acs/*.json` | `verify_registry_counts.py` | — | `paper/registry_verification_report.md` | COUNTABLE |
| PL-003 | — | Source documents for pragmatic extraction | Quarry pipeline / provenance | — | VR-081 | — | PENDING |
| PL-004 | 39/39 (100%) | Grounding gate: Pragmatics called `get_methodology_guidance` on all 39 queries. Control 0/39 and RAG 0/39 is correct by design — VR-025 excludes the tool from their tool list, VR-027/028 verify zero contamination. **Not a discrepancy.** | `results/v2_redo/stage1/*_responses_*.jsonl` | `verify_registry_counts.py` | VR-025, VR-027, VR-028 | `paper/registry_verification_report.md` | COUNTABLE |
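PL-004 is produced by `verify_registry_counts.py`; the sketch below illustrates the kind of JSONL scan involved. The record schema assumed here (`tool_calls` as a list of `{"name": ...}` dicts) is hypothetical, for illustration only, and may differ from the actual Stage 1 format:

```python
import json

def grounding_compliance(jsonl_lines, tool_name="get_methodology_guidance"):
    """Count records whose response includes a call to `tool_name`.

    Schema is an assumed stand-in for the real Stage 1 format scanned by
    verify_registry_counts.py against results/v2_redo/stage1/.
    """
    called = total = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        total += 1
        if any(c.get("name") == tool_name for c in rec.get("tool_calls", [])):
            called += 1
    return called, total

# Illustrative records (hypothetical schema), not real Stage 1 data:
sample = [
    json.dumps({"query_id": 1, "tool_calls": [{"name": "get_methodology_guidance"}]}),
    json.dumps({"query_id": 2, "tool_calls": [{"name": "acs_lookup"}]}),
]
print(grounding_compliance(sample))  # (1, 2)
```

On the real pragmatics files this style of scan should report 39/39; on control and RAG files, 0/39 by design per VR-025.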
---
## Section 3: Stage 2 — CQS Judge Scores (CERTIFIED)
Source: `results/v2_redo/stage2/analysis/aggregate_statistics.md` + `.json`
Script: `src/eval/aggregate_analysis.py`
V&V: `src/eval/aggregate_analysis.py` (self-documenting per VR-048; SRS Section 8.9 registry)
SRS: VR-048, VR-060–065
### 3a. Omnibus Test
| ID | Number | Description | Status |
|----|--------|-------------|--------|
| S2-001 | 42.01 | Friedman χ²(2, N=39) composite CQS | CERTIFIED |
| S2-002 | p < 0.001 | Omnibus p-value | CERTIFIED |
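The statistic behind S2-001 can be sanity-checked from first principles. A from-scratch sketch (midranks for ties, but no tie-correction factor; the certified value comes from `src/eval/aggregate_analysis.py`, whose exact implementation may differ):

```python
def friedman_chi2(scores):
    """Friedman chi-square for k conditions over N paired subjects.

    `scores` is a list of N rows, each holding k condition scores.
    Sanity-check sketch only, not a replacement for aggregate_analysis.py.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mid = (i + j) / 2 + 1  # average 1-based rank of the tied run
            for m in range(i, j + 1):
                ranks[order[m]] = mid
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Toy data: 3 subjects, third condition always highest.
print(round(friedman_chi2([[1, 2, 3], [1, 2, 3], [1, 2, 3]]), 6))  # 6.0
```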
### 3b. Pairwise Comparisons (Holm-corrected)
| ID | Comparison | CQS Δ | Cohen's d | 95% CI | p (Holm) | Eff. N | Status |
|----|------------|-------|-----------|--------|----------|--------|--------|
| S2-010 | Pragmatics vs Control | +0.539 | 1.440 | [0.421, 0.651] | < 0.001 | 36/39 | CERTIFIED |
| S2-011 | Pragmatics vs RAG | +0.385 | 0.922 | [0.256, 0.513] | < 0.001 | 32/39 | CERTIFIED |
| S2-012 | RAG vs Control | +0.154 | 0.546 | [0.072, 0.244] | 0.0017 | 30/39 | CERTIFIED |
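Holm correction, as named in the table header, is a step-down adjustment over the family of three pairwise tests. A minimal sketch with illustrative p-values (not the registry's raw per-comparison p-values):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values for a family of m tests.

    Sketch of the correction named in Section 3b; certified values come
    from aggregate_analysis.py.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p by (m - rank), enforce monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Illustrative inputs only:
print([round(p, 4) for p in holm_adjust([0.0004, 0.0001, 0.0017])])
# [0.0008, 0.0003, 0.0017]
```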
### 3c. Per-Dimension Omnibus
| ID | Dimension | χ²(2) | p | Status |
|----|-----------|-------|---|--------|
| S2-020 | D1 (Accuracy) | 16.25 | < 0.001 | CERTIFIED |
| S2-021 | D2 (Completeness) | 14.59 | < 0.001 | CERTIFIED |
| S2-022 | D3 (Uncertainty) | 44.83 | < 0.001 | CERTIFIED |
| S2-023 | D4 (Clarity) | 29.36 | < 0.001 | CERTIFIED |
| S2-024 | D5 (Harm Avoidance) | 16.13 | < 0.001 | CERTIFIED |
### 3d. Per-Dimension Effect Sizes (Cohen's d)
| ID | Dimension | Prag vs Ctrl | Prag vs RAG | RAG vs Ctrl | Status |
|----|-----------|-------------|-------------|-------------|--------|
| S2-030 | D1 | 0.541 | 0.515 | 0.190 | CERTIFIED |
| S2-031 | D2 | 0.537 | 0.297 | 0.246 | CERTIFIED |
| S2-032 | D3 | 1.353 | 1.040 | 0.417 | CERTIFIED |
| S2-033 | D4 | 0.957 | 0.577 | 0.546 | CERTIFIED |
| S2-034 | D5 | 0.732 | 0.521 | 0.148 | CERTIFIED |
### 3e. Condition Means
| ID | Condition | CQS Mean | Status |
|----|-----------|----------|--------|
| S2-040 | Pragmatics | 1.5282 | CERTIFIED |
| S2-041 | RAG | 1.1436 | CERTIFIED |
| S2-042 | Control | 0.9897 | CERTIFIED |
### 3f. Stratum Analysis (Normal vs Edge)
Source: `results/v2_redo/stage2/analysis/stratum_analysis.md` + `.json`
Script: `src/eval/stratum_analysis.py`
SRS: VR-101, VR-102, VR-103
*Key finding: No overfit. Pragmatics effect is LARGER on normal queries (d=2.347) than edge (d=1.135). Edge-greater hypothesis not supported (p=0.987).*
#### Normal Queries (n=15)
| ID | Comparison | Δ CQS | Cohen's d | Wilcoxon p | Status |
|----|------------|-------|-----------|-----------|--------|
| SA-001 | Pragmatics vs Control | +0.707 | 2.347 | < 0.001 | COMPUTED |
| SA-002 | Pragmatics vs RAG | +0.580 | 1.436 | 0.0011 | COMPUTED |
| SA-003 | RAG vs Control | +0.127 | 0.458 | 0.1370 (ns) | COMPUTED |
*Power note: n=15 Wilcoxon power ~0.56 at d=0.5, ~0.94 at d=1.0. SA-003 underpowered for small effects.*
#### Edge Queries (n=24)
| ID | Comparison | Δ CQS | Cohen's d | Wilcoxon p | Status |
|----|------------|-------|-----------|-----------|--------|
| SA-010 | Pragmatics vs Control | +0.433 | 1.135 | < 0.001 | COMPUTED |
| SA-011 | Pragmatics vs RAG | +0.263 | 0.683 | 0.0041 | COMPUTED |
| SA-012 | RAG vs Control | +0.171 | 0.590 | 0.0024 | COMPUTED |
#### Between-Stratum (VR-103)
| ID | Comparison | Normal Δ | Edge Δ | ΔΔ (Edge−Normal) | p(Edge>Normal) | Status |
|----|------------|---------|--------|------------------|----------------|--------|
| SA-020 | Pragmatics vs Control | +0.707 | +0.433 | −0.273 | 0.9866 | COMPUTED |
| SA-021 | Pragmatics vs RAG | +0.580 | +0.263 | −0.318 | 0.9873 | COMPUTED |
| SA-022 | RAG vs Control | +0.127 | +0.171 | +0.044 | 0.3471 | COMPUTED |
### 3g. Efficiency Analysis
Source: `results/v2_redo/stage1/analysis/overhead_analysis.md` + `.json`
Script: `src/eval/overhead_analysis.py`
Note: Uses actual API token counts (input_tokens, output_tokens) per record.
*Key finding: Pragmatics has higher raw token overhead than RAG (+465% vs +307% over control), but delivers targeted context (21.8 of 36 items = 61% per query, API-served) vs RAG's fixed top-5 brute-force injection. Quality difference: pragmatics CQS 1.53 vs RAG 1.14.*
| ID | Metric | Control | RAG | Pragmatics | Status |
|----|--------|---------|-----|------------|--------|
| EFF-001 | Response file size (bytes) | 1,169,849 | 2,262,441 | 1,631,664 | COMPUTED |
| EFF-002 | Mean input tokens/query | 5,830 | 23,746 | 32,929 | COMPUTED |
| EFF-003 | Mean output tokens/query | 679 | 714 | 975 | COMPUTED |
| EFF-004 | Input overhead vs control | — | +307.3% | +464.8% | COMPUTED |
| EFF-005 | Mean tool calls/query | 3.5 | 3.3 | 4.0 | COMPUTED |
| EFF-006 | Mean retrieval context (RAG) | — | 16,026 chars | — | COMPUTED |
| EFF-007 | Mean guidance response (pragmatics) | — | — | 16,106 chars | COMPUTED |
| EFF-008 | Mean pragmatic items/query | — | — | 21.8 of 36 (61%) | COMPUTED |
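EFF-004 follows directly from the EFF-002 means. A quick reproduction (uses the rounded table means, so this is a consistency check, not the script of record):

```python
# Reproducing EFF-004 from the EFF-002 mean input tokens/query.
control, rag, prag = 5830, 23746, 32929

def overhead_pct(x):
    """Percent input-token overhead relative to the control condition."""
    return 100.0 * (x - control) / control

print(f"RAG +{overhead_pct(rag):.1f}%  Pragmatics +{overhead_pct(prag):.1f}%")
# RAG +307.3%  Pragmatics +464.8%
```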
### 3h. Cost Analysis
Source: `results/v2_redo/stage1/analysis/cost_analysis.md` + `.json`
Script: `src/eval/cost_analysis.py`
**Citation:** Anthropic. (2026). Claude model pricing and API overview. Retrieved February 21, 2026, from https://platform.claude.com/docs/en/about-claude/models/overview
**Pricing used:**
- Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`, used in experiment): $3.00/$15.00 per MTok input/output
- Claude Opus 4.6 (`claude-opus-4-6`, premium reference): $5.00/$25.00 per MTok input/output
*Key finding: Pragmatics is 2.2× more cost-effective than RAG (CQS per marginal dollar: 6.28 vs 2.83). Absolute costs are negligible — full 39-query battery costs $4.42 at Sonnet pricing, $7.37 at Opus pricing.*
#### Sonnet 4.5 Pricing ($3/$15 per MTok)
| ID | Metric | Control | RAG | Pragmatics | Status |
|----|--------|---------|-----|------------|--------|
| COST-001 | Cost per query | $0.0277 | $0.0819 | $0.1134 | COMPUTED |
| COST-002 | Total battery cost (39 queries) | $1.08 | $3.20 | $4.42 | COMPUTED |
| COST-003 | Marginal cost per query vs control | — | $0.0543 | $0.0857 | COMPUTED |
| COST-004 | CQS per marginal dollar | — | 2.83 | **6.28** | COMPUTED |
| COST-005 | Pragmatics vs RAG cost-effectiveness ratio | — | — | **2.2×** | COMPUTED |
#### Opus 4.6 Pricing ($5/$25 per MTok)
| ID | Metric | Control | RAG | Pragmatics | Status |
|----|--------|---------|-----|------------|--------|
| COST-010 | Cost per query | $0.0461 | $0.1366 | $0.1890 | COMPUTED |
| COST-011 | Total battery cost (39 queries) | $1.80 | $5.33 | $7.37 | COMPUTED |
| COST-012 | Marginal cost per query vs control | — | $0.0905 | $0.1429 | COMPUTED |
| COST-013 | CQS per marginal dollar | — | 1.70 | **3.77** | COMPUTED |
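The COST values are reproducible by hand from the efficiency means and the posted per-MTok rates. A sketch (uses the rounded EFF-002/003 means, so small last-decimal differences from the per-record computation in `cost_analysis.py` are possible):

```python
def cost_per_query(in_tok, out_tok, in_rate, out_rate):
    """USD per query from mean token counts (EFF-002/003) and $/MTok rates."""
    return (in_tok * in_rate + out_tok * out_rate) / 1e6

# Mean (input, output) tokens per query from EFF-002/EFF-003:
tokens = {"control": (5830, 679), "rag": (23746, 714), "pragmatics": (32929, 975)}

for name, (i, o) in tokens.items():
    print(f"{name:10s} sonnet ${cost_per_query(i, o, 3, 15):.4f}"
          f"  opus ${cost_per_query(i, o, 5, 25):.4f}")

# COST-004 for pragmatics: marginal CQS (S2-040 minus S2-042) per marginal dollar.
marginal = cost_per_query(32929, 975, 3, 15) - cost_per_query(5830, 679, 3, 15)
print(round((1.5282 - 0.9897) / marginal, 2))  # 6.28
```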
---
## Section 4: Stage 3 — Fidelity & Auditability (CERTIFIED)
Source: `results/v2_redo/stage3/analysis/fidelity_summary.md` + `.json`
Script: `src/eval/fidelity_aggregate.py`
V&V: `src/eval/fidelity_qc.py` (exit 0, all 35 checks pass)
SRS: VR-091–096 (aggregate), VR-097–100 (QC)
### 4a. Overall Fidelity
| ID | Condition | Fidelity | Subst. Fidelity | Error Rate | Total Claims | Status |
|----|-----------|----------|-----------------|------------|-------------|--------|
| S3-001 | Control | 78.3% | 100.0% | 0.0% | 253 | CERTIFIED |
| S3-002 | RAG | 74.6% | 98.9% | 0.8% | 355 | CERTIFIED |
| S3-003 | Pragmatics | 91.2% | 99.7% | 0.3% | 353 | CERTIFIED |
*All values from `fidelity_summary.json` (certified source).*
### 4b. Overall Auditability
| ID | Condition | Auditable | Partially | Unauditable | Subst. Claims | Status |
|----|-----------|-----------|-----------|-------------|---------------|--------|
| S3-010 | Control | 6.2% | 63.0% | 15.2% | 257 | CERTIFIED |
| S3-011 | RAG | 21.8% | 76.0% | 17.8% | 242 | CERTIFIED |
| S3-012 | Pragmatics | 29.5% | 51.8% | 18.7% | 278 | CERTIFIED |
*All values from `fidelity_summary.json` (certified source). Non-claims excluded from auditability denominators per VR-054.*
---
## Section 5: V1 vs V2 Reconciliation
These numbers are NOT cited in the paper. They document why V1 numbers differ from V2, per `talks/fcsm_2026/2026-02-21_v1_to_v2_redesign.md`.
| ID | Metric | V1 Value | V2 Value | Reason for Divergence | Status |
|----|--------|----------|----------|----------------------|--------|
| RC-001 | Pragmatics fidelity | 91.6% | 91.2% | Different Stage 1 responses (pre/post leakage fix) | DOCUMENTED |
| RC-002 | Pragmatics auditability | 72.8% | 29.5% | V1 measured treatment-only; V2 symmetric across conditions; different responses | DOCUMENTED |
| RC-003 | Control auditability | 8.1% | 6.2% | Different Stage 1 responses | DOCUMENTED |
---
## Section 5b: RAG Condition Parameters
Source: `src/eval/rag_retriever.py`, `results/rag_ablation/index/metadata.json`
Verified by: `verify_registry_counts.py`
| ID | Parameter | Value | Source | Status |
|----|-----------|-------|--------|--------|
| RAG-001 | Embedding model | all-MiniLM-L6-v2 (384-dim) | `rag_retriever.py` | COUNTABLE |
| RAG-002 | Index type | FAISS IndexFlatIP (cosine) | `rag_retriever.py` | COUNTABLE |
| RAG-003 | Top-k retrieval | 5 | `rag_retriever.py:27` | COUNTABLE |
| RAG-004 | Total chunks indexed | 311 | `results/rag_ablation/index/qc_report.txt` (D&M: 210, Handbook: 85, Geography: 16) | COUNTABLE |
| RAG-005 | Source documents | 3 | `results/rag_ablation/index/sources.txt` | COUNTABLE |
| RAG-005a | RAG source doc 1 | ACS General Handbook 2020 (89pp) | `sources.txt` | COUNTABLE |
| RAG-005b | RAG source doc 2 | ACS Design & Methodology 2024 (238pp) | `sources.txt` | COUNTABLE |
| RAG-005c | RAG source doc 3 | ACS Geography Handbook 2020 (27pp) | `sources.txt` | COUNTABLE |
| RAG-005d | RAG total source pages | 354pp (89 + 238 + 27) | Derived | COUNTABLE |
| RAG-005e | Source overlap with pragmatics | 3 of 3 shared (identical source docs). RAG indexes all as chunks; pragmatics cites same 3 via curated items | `sources.txt` vs `neo4j-pragmatics: Context.provenance` | COUNTABLE |
| RAG-006 | Bootstrap iterations | 10,000 | `judge_config.yaml` (analysis section) | COUNTABLE |
| RAG-007 | Bootstrap seed | Not set (non-deterministic) | `judge_config.yaml` | COUNTABLE |
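For readers without FAISS at hand, the operation described by RAG-001 through RAG-003 (exact inner product over unit-normalized embeddings, i.e. cosine top-k) can be illustrated in pure Python. The toy 3-dim vectors below stand in for the real 384-dim MiniLM embeddings; this is an illustration of the index semantics, not the retriever implementation:

```python
import math

def top_k_cosine(query_vec, chunk_vecs, k=5):
    """Exact cosine top-k: what FAISS IndexFlatIP computes when vectors are
    unit-normalized (RAG-002, RAG-003). Pure-Python stand-in for illustration;
    the real index holds 311 chunks of 384-dim MiniLM embeddings."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    q = unit(query_vec)
    scores = [(sum(a * b for a, b in zip(q, unit(c))), i)
              for i, c in enumerate(chunk_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# Toy 3-dim chunk vectors (hypothetical):
chunks = [[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0], [0, 0, 1]]
print(top_k_cosine([1, 0, 0], chunks, k=2))  # [0, 2]
```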
---
## Section 5c: Determinism Verification
Source: `results/rag_ablation/analysis/determinism_test_real.json`
Script: `scripts/test_determinism_real.py`
CC Task: `cc_tasks/2026-02-15_real_determinism_test.md`
| ID | Number | Description | Source | Status |
|----|--------|-------------|--------|--------|
| DET-001 | 39/39 | Queries with identical context retrieval across 2 replications + original | `determinism_test_real.json` | COUNTABLE |
| DET-002 | 0 | Mismatches (run1 vs run2) | `determinism_test_real.json` | COUNTABLE |
| DET-003 | 0 | Mismatches (run1 vs original Stage 1) | `determinism_test_real.json` | COUNTABLE |
| DET-004 | 100% | Deterministic reproducibility rate | DET-001 / SD-001 | COUNTABLE |
**What this proves:** Given identical topic parameters, the pragmatics retrieval layer returns identical context sets every time. The system is deterministic — no stochastic retrieval, no embedding drift, no random sampling. This is by design: topic→thread→context is a graph traversal, not a vector search.
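The determinism claim can be illustrated with a toy sketch. The dict-based schema below is hypothetical; the production layer performs the equivalent topic→thread→context walk over Neo4j:

```python
def retrieve_context(graph, topic):
    """Deterministic topic -> thread -> context traversal, the design claim
    behind DET-001..DET-004. `graph` uses a hypothetical adjacency schema;
    sorting makes the output order reproducible as well as the set."""
    contexts = []
    for thread in sorted(graph["topics"].get(topic, [])):
        contexts.extend(sorted(graph["threads"].get(thread, [])))
    return contexts

# Toy fixture (hypothetical names, not real pack content):
g = {
    "topics": {"margins_of_error": ["t1", "t2"]},
    "threads": {"t1": ["ctx_moe_basics"], "t2": ["ctx_small_area"]},
}
runs = {tuple(retrieve_context(g, "margins_of_error")) for _ in range(3)}
print(len(runs))  # 1 -- identical context set on every replication
```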
---
## Section 5d: Study Design Derivation
Decision pedigree for key design parameters.
| ID | Parameter | Value | Derivation | Decision Record |
|----|-----------|-------|------------|----------------|
| DRV-001 | Battery size (n=39) | 39 | Paired Wilcoxon at d=0.5, α=0.05, power=0.80 requires ~35 pairs. Stratified: 15 normal + 24 edge. Constrained by Gemini 250/day rate limit (234 calls = 93.6% utilization) | DEC-4B-009, DEC-4B-021 |
| DRV-002 | Normal/edge split | 38%/62% | Equivalence testing (no-harm claim on normal) needs 15-20; superiority testing (edge cases) at d=0.8 needs 15-20. Edge oversampled because that's where pragmatics value-add is hypothesized | DEC-4B-009 |
| DRV-003 | Judge passes per comparison | 6 | 3 vendors × 2 orderings. 6→12 passes buys ~1% power — not worth cost. Bottleneck is N=39 queries, not judge noise | DEC-4B-021 |
| DRV-004 | Edge case oversampling rationale | 62% edge | Hypothesis is directional: pragmatics help on hard queries, neutral on easy. More power needed where effect matters. Not arbitrary. | DEC-4B-009 |
---
## Section 5f: Extraction Provenance
Source: Quarry Neo4j `SourceDocument` nodes + file metadata
| ID | Number | Description | Source | Status |
|----|--------|-------------|--------|--------|
| EXT-001 | 3 | Source documents — **identical** for RAG and pragmatics (by design, for fair comparison) | `sources.txt` + quarry SourceDocument nodes | COUNTABLE |
| EXT-002 | 34 | Pragmatic items pipeline-extracted via quarry (28 from Handbook ACS-GEN-001 + 6 from D&M ACS-DM-2024) | `neo4j-pragmatics: Context.provenance` | COUNTABLE |
| EXT-003 | 2 | Pragmatic items manually extracted (human + AI source material review): ACS-IND-001 geography from Geography Handbook + ACS-GQ-001 group quarters from CPS-HBM-001. Same sources, not pipeline-extracted. | provenance query | COUNTABLE |
| EXT-004 | 36 | Total pragmatic items (34 pipeline + 2 manual) = PL-001 | EXT-002 + EXT-003 | COUNTABLE |
| EXT-005 | 89 | Pages in ACS General Handbook 2020 | Document metadata | COUNTABLE |
| EXT-006 | 238 | Pages in ACS Design & Methodology Report 2024 | Document metadata | COUNTABLE |
| EXT-007 | 27 | Pages in ACS Geography Handbook 2020 ("Geography and the American Community Survey: What Data Users Need to Know") | Document metadata, user-confirmed | COUNTABLE |
| EXT-008 | 354 | Total source pages (89 + 238 + 27) | Derived | COUNTABLE |
| EXT-009 | 5,233 | Quarry KG nodes from Handbook + D&M (the 2 quarry-extracted docs) | `MATCH (n)-[:SOURCED_FROM]->(s) WHERE s.catalog_id IN [...]` | COUNTABLE |
| EXT-010 | 0.65% | Extraction yield: 34 quarry-extracted items / 5,233 nodes | EXT-002 / EXT-009 | COUNTABLE |
**Key design point:** Both conditions used the **same 3 source documents**. RAG indexed all 3 as 311 chunks for brute-force top-5 retrieval. The pragmatics pipeline extracted 34 items from 2 of the documents; the remaining 2 items (EXT-003) were manually extracted via human + AI source material review. The independent variable is the representation method, not the source material.
**Dual extraction paths:** The geography handbook didn't yield usable pipeline-extracted pragmatics, a finding in itself. Some expert judgment is implicit in how practitioners *use* documents, not explicit in document text. The pipeline captures explicit knowledge; manual extraction via SME conversation captures tacit knowledge. A mature system needs both paths. The 2 manually extracted items are a proof of concept for the Phase 2 expert validation pathway (structured interviews to elicit tacit knowledge from Census methodology specialists).
---
## Section 5e: Pending Analysis (CC Tasks Created)
| ID | Analysis | CC Task | Registry Section (after completion) |
|----|----------|---------|------------------------------------|
| ~~PEND-001~~ | ~~Stratum treatment effect~~ | `cc_tasks/2026-02-21_stratum_analysis.md` | **DONE** → Section 3f: SA-001–022 |
| ~~PEND-002~~ | ~~Token overhead~~ | `cc_tasks/2026-02-21_overhead_analysis.md` | **DONE** → Section 3g: EFF-001–008 |
---
## Section 6: Derived / Interpretive Numbers
Numbers computed from certified data for narrative use (e.g., "X times higher"). These must trace to certified source numbers.
| ID | Statement | Derivation | Source IDs | Status |
|----|-----------|------------|------------|--------|
| DV-001 | "very large effect size" (d > 1.0) | Cohen's d = 1.440 for Prag vs Ctrl | S2-010 | CERTIFIED |
| DV-002 | "large effect size" (d > 0.8) | Cohen's d = 0.922 for Prag vs RAG | S2-011 | CERTIFIED |
| DV-003 | "medium effect size" (d > 0.5) | Cohen's d = 0.546 for RAG vs Ctrl | S2-012 | CERTIFIED |
| DV-004 | D3 largest effect | d = 1.353 Prag vs Ctrl, largest of D1–D5 | S2-032 | CERTIFIED |
| DV-005 | Fidelity gap Prag vs RAG | 91.2% - 74.6% = 16.6 pp | S3-003, S3-002 | CERTIFIED |
| DV-006 | Fidelity gap Prag vs Ctrl | 91.2% - 78.3% = 12.9 pp | S3-003, S3-001 | CERTIFIED |
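The label thresholds implied by DV-001 through DV-003 can be captured as a small lookup, useful when drafting narrative text against new d values. Thresholds follow Cohen's conventions plus this registry's "very large" cutoff at d > 1.0:

```python
def effect_size_label(d):
    """Map |Cohen's d| to the narrative labels used in DV-001..DV-003."""
    d = abs(d)
    if d > 1.0:
        return "very large"
    if d > 0.8:
        return "large"
    if d > 0.5:
        return "medium"
    if d > 0.2:
        return "small"
    return "negligible"

# Matches DV-001, DV-002, DV-003 respectively:
print(effect_size_label(1.440), effect_size_label(0.922), effect_size_label(0.546))
# very large large medium
```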
---
## Section 7: Gaps — Numbers Needed But Not Yet Traceable
| ID | Description | Likely Source | Action Needed |
|----|-------------|--------------|---------------|
| ~~GAP-001~~ | ~~2,106 judge record count~~ | — | **CLOSED** → SD-006 COUNTABLE (2,106 confirmed, 0 parse failures) |
| ~~GAP-002~~ | ~~35 pragmatic item count~~ | — | **CLOSED** → PL-001 COUNTABLE (36 in compiled pack) |
| ~~GAP-003~~ | ~~47 staged item count~~ | — | **CLOSED** → PL-002 COUNTABLE (36 actual; 47 was untraced ghost number) |
| ~~GAP-004~~ | ~~Grounding compliance~~ | — | **CLOSED** → PL-004 COUNTABLE (39/39 pragmatics; 0/39 control/RAG correct by design per VR-025) |
| ~~GAP-005~~ | ~~Normal/edge split~~ | — | **CLOSED** → SD-009 COUNTABLE (38%/62%; SRS VR-010 corrected to match) |
| ~~GAP-006~~ | ~~Per-dimension RAG vs Ctrl d~~ | — | **CLOSED** → S2-030–034 filled from aggregate_statistics.json |
| ~~GAP-007~~ | ~~Substantive fidelity, error rates~~ | — | **CLOSED** → S3-001–003, S3-010–012 filled from fidelity_summary.json |
| ~~GAP-008~~ | ~~Bootstrap CI parameters~~ | — | **CLOSED** → 10,000 iterations (`judge_config.yaml analysis.bootstrap_iterations`); no seed documented in config |
| ~~GAP-009~~ | ~~RAG index parameters~~ | — | **CLOSED** → FAISS IndexFlatIP cosine, all-MiniLM-L6-v2 (384-dim), top-k=5 (`rag_retriever.py:27`), 311 chunks, 3 source docs (`results/rag_ablation/index/metadata.json`) |
| ~~GAP-010~~ | ~~Source document count for pragmatics~~ | — | **CLOSED** → Same 3 docs for both conditions (354pp; EXT-008). 34 pipeline-extracted + 2 manually extracted = 36 items. Section 5f (EXT-001–010). |
| ~~GAP-011~~ | ~~Stratum treatment effect~~ | — | **CLOSED** → Section 3f (SA-001–022). Normal d=2.347 > Edge d=1.135. No overfitting. |
| ~~GAP-012~~ | ~~Token overhead~~ | — | **CLOSED** → Section 3g (EFF-001–008). Pragmatics +465% tokens, RAG +307%. Old handoff note (120%/36%) superseded. |
| ~~GAP-013~~ | ~~Pragmatics development procedure narrative~~ | — | **CLOSED** → `paper/sections/05_extraction_pipeline.md` (dual extraction paths, determinism, curation, compilation pipeline) |
| ~~GAP-014~~ | ~~API-driven architecture advantage~~ | — | **CLOSED** → `paper/sections/08_discussion_sidecar.md` (sidecar pattern, central maintenance, multi-vendor, cost-effectiveness, scaling) |
---
## Maintenance Rules
1. **Adding a number:** Create entry with appropriate status. If PENDING/UNTRACED, add to Section 7 gaps.
2. **Promoting to CERTIFIED:** Requires V&V script in SRS Section 8.9 registry with exit 0.
3. **Promoting to COUNTABLE:** Requires deterministic source (file count, config value, YAML entry count). Document the exact command to reproduce.
4. **Citing in paper:** Reference by ID (e.g., "S2-010"). Section files in `paper/sections/` should use these IDs in comments to maintain traceability.
5. **Number changes:** If a certified number changes due to reanalysis, update here AND note the previous value in Section 5 (reconciliation) with reason.