# DOE Plan: RAG Ablation Experiment
## Experiment ID: EXP-RAG-001
## Status: PLANNING
## Start Date: 2026-02-15
## Deadline: 2026-03-01 (slides due)
---
## 1. Hypothesis
**H₀:** Section-level RAG on the same source documents produces equivalent CQS dimension scores to the bare LLM control condition.
**H₁:** RAG produces higher scores on some dimensions (likely D4 Definitions, possibly D2 Methodology) but NOT on D3 (Uncertainty) or D5 (Reproducibility).
**H₂ (the paper's claim):** Pragmatics >> RAG > Control on D3/D5, demonstrating that curated expert judgment adds value beyond document retrieval.
## 2. Experimental Design
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Design | Within-query paired comparison | Same 39 queries, matched to existing control |
| Conditions | RAG vs Control (pairwise) | Mirrors existing Pragmatics vs Control design |
| Queries | 39 (15 normal, 24 edge) | Same test battery as primary evaluation |
| Judge vendors | 3 (Anthropic, OpenAI, Google) | Same panel |
| Passes per vendor | 6 (3 control-first, 3 treatment-first) | Same counterbalancing |
| Total judge records | 702 (39 × 3 × 6) | Matches primary evaluation exactly |
| Rubric | CQS D1-D5 (D6 excluded per DEC-4B-023) | Same rubric |
| Fidelity check | Yes, Stage 3 on RAG responses | Equal treatment |
| Caller model | Same as primary eval (check judge_config.yaml) | Must match |
## 3. RAG Condition Specification
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Chunking | Section-level, ~600 tokens with 100-token overlap | Reasonable default, deliberately untuned |
| Embedding | all-MiniLM-L6-v2 (384-dim) | Already in project |
| Index | FAISS IndexFlatIP, L2-normalized | Simplest possible |
| Retrieval | Top-5 chunks per query | Standard default |
| Injection | System prompt prepend as "Reference Materials" | How RAG actually works |
| Agent loop | None — single-shot completion | RAG doesn't use tools |
| Tool access | None | Fair comparison to control |
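The retrieval math behind the index spec above is simple enough to sketch without FAISS: over L2-normalized vectors, `IndexFlatIP` inner-product search is cosine similarity. A dependency-free numpy sketch (toy 4-dim vectors standing in for the 384-dim MiniLM embeddings; the function name is illustrative, not the retriever module's API):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Top-k chunk indices by cosine similarity.

    Over L2-normalized vectors, inner product equals cosine similarity,
    which is what IndexFlatIP computes after faiss.normalize_L2.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                 # inner product per chunk
    idx = np.argsort(-scores)[:k]  # top-k by similarity
    return idx, scores[idx]

# Toy embeddings: chunk 0 matches the query exactly, chunk 2 is close
chunks = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
idx, scores = top_k_chunks(np.array([1.0, 0.0, 0.0, 0.0]), chunks, k=2)
```

The real index build would use `sentence-transformers` for encoding and `faiss.IndexFlatIP` for search; this sketch is only the similarity logic for Gate Check A spot-checks.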
### Source Documents (from neo4j provenance)
| Doc ID | File | Role |
|--------|------|------|
| ACS-GEN-001 | `knowledge-base/source-docs/OtherACS/acs_general_handbook_2020.pdf` | Primary — Ch. 1,3,4,7,8,10,12 |
| ACS-DM-2024 | `knowledge-base/source-docs/census-methodology/acs_design_methodology_report_2024.pdf` | Primary |
| Census Geography | `knowledge-base/source-docs/OtherACS/acs_geography_handbook_2020.pdf` | Primary |
| ACS Accuracy | `knowledge-base/source-docs/OtherACS/ACS_Accuracy_of_Data_2023.pdf` | Secondary |
| Multiyear Accuracy | `knowledge-base/source-docs/OtherACS/MultiyearACSAccuracyofData2023.pdf` | Secondary |
| Understanding ACS | `knowledge-base/source-docs/OtherACS/Understanding and Using ACS Data_ What All Data Users Need to Know.pdf` | Secondary |
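The "Reference Materials" injection row above can be sketched as a plain string prepend. Field names (`doc_id`, `section`, `text`) and the exact header wording are assumptions to be fixed in the CC task, not the pipeline's settled schema:

```python
def build_system_prompt(base_prompt, chunks):
    """Prepend retrieved chunks to the system prompt as 'Reference Materials'."""
    refs = "\n\n".join(
        f"[{c['doc_id']} / {c['section']}]\n{c['text']}" for c in chunks
    )
    return f"Reference Materials:\n{refs}\n\n{base_prompt}"

prompt = build_system_prompt(
    "You answer questions about ACS data.",
    [{"doc_id": "ACS-GEN-001", "section": "Ch. 3", "text": "Sample text."}],
)
```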
## 4. Output Locations
All new outputs go to `results/rag_ablation/`; the Step 0 snapshot of the existing analysis goes to `results/stage2_pragmatics_v_control/`. No existing result files are modified.
```
results/
├── stage2_pragmatics_v_control/ ← Snapshot of existing analysis (Step 0)
│ ├── README.md
│ ├── effect_sizes.csv
│ ├── irr_scores.csv
│ ├── bias_diagnostics.csv
│ ├── preference_summary.csv
│ ├── stratified_effects.csv
│ ├── fidelity_summary.csv
│ ├── test_retest.csv
│ └── aggregate_report.md
├── rag_ablation/
│ ├── index/
│ │ ├── chunks.jsonl
│ │ ├── faiss_index.bin
│ │ ├── metadata.json
│ │ └── sources.txt
│ ├── stage1/
│ │ └── rag_responses.jsonl
│ ├── stage2/
│ │ ├── judge_scores_*.jsonl
│ │ └── checkpoints/
│ ├── stage3/
│ │ └── fidelity_results.jsonl
│ └── analysis/
│ ├── three_group_comparison.csv
│ ├── friedman_tests.csv
│ ├── posthoc_pairwise.csv
│ ├── rag_vs_control_effects.csv
│ ├── rag_fidelity.csv
│ ├── rag_preferences.csv
│ ├── rag_bias_diagnostics.csv
│ ├── rag_test_retest.csv
│ └── aggregate_report.md
```
## 5. QC Procedure
**Principle: CC executes, CC self-checks, human verifies.**
Each pipeline step that produces artifacts follows three layers:
1. **Execution** — CC runs the script
2. **Automated QC** — CC runs a QC script with pass/fail criteria, writes report
to `qc_report.txt` alongside artifacts. CC stops on FAIL.
3. **Human inspection** — We review QC report and spot-check before next phase.
QC reports are committed for provenance. See lab notebook entry 2026-02-15 in
`talks/fcsm_2026/notes.md` for rationale (lesson learned from Phase 4B data
contamination where CC "fixed" a bug without ground-truth verification).
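The automated-QC layer can be as simple as a dict of named pass/fail checks written to `qc_report.txt` alongside the artifacts. A minimal sketch (check names and the helper are illustrative):

```python
from pathlib import Path
import tempfile

def run_qc(checks, out_dir):
    """Write PASS/FAIL lines to qc_report.txt; return overall pass flag."""
    lines = [f"{'PASS' if ok else 'FAIL'}  {name}" for name, ok in checks.items()]
    passed = all(checks.values())
    lines.append(f"OVERALL: {'PASS' if passed else 'FAIL'}")
    (Path(out_dir) / "qc_report.txt").write_text("\n".join(lines) + "\n")
    return passed  # caller (CC) stops the pipeline on False

# Example: QC after judge scoring, with the expected counts from Section 2
out_dir = tempfile.mkdtemp()
ok = run_qc({"record_count == 702": True, "parse_failures == 0": True}, out_dir)
```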
## 6. Execution Plan
### Phase A: Infrastructure (Day 1)
| Step | Task | CC Task | Depends On | Est. Time | Status |
|------|------|---------|------------|-----------|--------|
| A0 | Snapshot existing results to `stage2_pragmatics_v_control/` | Part of main CC task | — | 2 min | ☐ |
| A1 | Build RAG index from source documents | CC task Step 1 | A0 | 30 min | ☐ |
| A2 | Create RAG retriever module | CC task Step 2 | A1 | 15 min | ☐ |
| A3 | Add RAG condition to generate_responses.py | CC task Step 3 | A2 | 20 min | ☐ |
**Gate Check A:** Verify index has reasonable chunk count (expect 150-300 chunks from ~300 pages). Spot-check 3 retrieval queries manually — do the returned chunks seem relevant?
### Phase B: Data Generation (Day 1-2)
| Step | Task | CC Task | Depends On | Est. Time | Status |
|------|------|---------|------------|-----------|--------|
| B1 | Generate RAG responses for 39 queries | CC task Step 4 | A3 | 30 min + ~$3 API | ☐ |
| B2 | Manual spot-check: read 5 RAG responses vs control | Manual | B1 | 15 min | ☐ |
**Gate Check B:** Read NORM-001, GEO-002, SML-002, AMB-001, NORM-008 RAG responses. Do they cite handbook language? Do they have table IDs? (Expected: handbook language yes, table IDs no.) If RAG responses look identical to control, something is wrong with retrieval injection.
### Phase C: Judge Scoring (Day 2-3)
| Step | Task | CC Task | Depends On | Est. Time | Est. Cost | Status |
|------|------|---------|------------|-----------|-----------|-------|
| C1 | OpenAI judge: RAG vs Control | CC task Step 5 | B1 | 30 min | ~$4 | ☐ |
| C2 | Anthropic judge: RAG vs Control | CC task Step 5 | B1 | 30 min | ~$12 | ☐ |
| C3 | Google judge: RAG vs Control | CC task Step 5 | B1 | 20 min | ~$8 | ☐ |
**Gate Check C:** Verify 702 records, 0 parse failures, 234 per vendor. Check that presentation order counterbalancing is correct.
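Gate Check C is mechanical enough to script. A sketch, assuming each judge record carries `vendor`, `query_id`, `order`, and `parse_error` fields (names illustrative, not necessarily the pipeline's actual schema):

```python
from collections import Counter

def check_judge_records(records):
    """Gate Check C over the merged judge_scores_*.jsonl records."""
    assert len(records) == 702, f"expected 702 records, got {len(records)}"
    assert not any(r["parse_error"] for r in records), "parse failures present"
    per_vendor = Counter(r["vendor"] for r in records)
    assert all(n == 234 for n in per_vendor.values()), dict(per_vendor)
    # Counterbalancing: 3 control-first + 3 treatment-first per vendor/query
    per_cell = Counter((r["vendor"], r["query_id"], r["order"]) for r in records)
    assert all(n == 3 for n in per_cell.values()), "counterbalancing off"
```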
### Phase D: Fidelity (Day 3)
| Step | Task | CC Task | Depends On | Est. Time | Status |
|------|------|---------|------------|-----------|--------|
| D1 | Stage 3 fidelity check on RAG responses | CC task Step 6 | B1 | 20 min | ☐ |
**Gate Check D:** RAG auditability should be close to control's 8.1%, not treatment's 72.8%. If it's above 30%, investigate — the RAG chunks may be leaking table IDs from accuracy documents.
### Phase E: Analysis (Day 3-4)
| Step | Task | CC Task | Depends On | Est. Time | Status |
|------|------|---------|------------|-----------|--------|
| E1 | Build analyze_three_group.py | CC task Step 7 | C1-C3, D1 | 45 min | ☐ |
| E2 | Run analysis, generate CSVs and report | E1 | E1 | 5 min | ☐ |
| E3 | Review numbers, sanity check against hypotheses | Manual | E2 | 30 min | ☐ |
**Gate Check E:** The three-group table should tell a clear story. If RAG ≈ Pragmatics on any dimension, we have a problem. If RAG ≈ Control on all dimensions, the critique is dead, though the experiment will have been overkill.
### Phase F: Documentation (Day 4)
| Step | Task | Depends On | Status |
|------|------|------------|--------|
| F1 | Update SRS with VR-080 through VR-083 | E3 | ☐ |
| F2 | Update rag_fallacy_thinking.md with actual numbers | E3 | ☐ |
| F3 | Create decision record DEC-RAG-001 | E3 | ☐ |
| F4 | Commit everything | F1-F3 | ☐ |
## 7. Analysis Plan
### Primary analysis: Three-group repeated-measures
Unit of analysis: query (n=39)
For each dimension D1-D5:
1. Compute query-level means for each condition (averaging across vendors and passes)
2. Friedman test across 3 conditions (nonparametric repeated-measures)
3. If Friedman p < 0.05: pairwise Wilcoxon signed-rank tests with Bonferroni correction (α = 0.05/3 ≈ 0.0167)
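Steps 2-3 map directly onto `scipy.stats`. A sketch on synthetic query-level means, just to pin down the control flow (the analysis script will read the real per-condition means from the judge records):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

ALPHA = 0.05
BONF_ALPHA = ALPHA / 3  # three pairwise comparisons

def analyze_dimension(control, rag, pragmatics):
    """Friedman across the 3 conditions; pairwise Wilcoxon only if it fires."""
    chi2, p = friedmanchisquare(control, rag, pragmatics)
    out = {"friedman_chi2": chi2, "friedman_p": p, "pairwise": {}}
    if p < ALPHA:
        pairs = {"rag_vs_control": (rag, control),
                 "prag_vs_rag": (pragmatics, rag),
                 "prag_vs_control": (pragmatics, control)}
        for name, (a, b) in pairs.items():
            w, wp = wilcoxon(a, b)
            out["pairwise"][name] = {"W": w, "p": wp,
                                     "significant": wp < BONF_ALPHA}
    return out

# Synthetic query-level means (n=39) with ordered condition effects
rng = np.random.default_rng(0)
ctrl = rng.normal(3.0, 0.5, 39)
res = analyze_dimension(ctrl, ctrl + 0.4, ctrl + 0.9)
```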
### Secondary analyses (equal treatment to primary eval):
- Cohen's d (paired and independent) for RAG vs Control
- Inter-rater reliability (Krippendorff's α) for RAG vs Control scoring
- Position bias check
- Self-enhancement bias check
- Verbosity bias check
- Test-retest reliability per pass-pair
- Judge preference rates (RAG vs Control)
- Stratified effects (normal vs edge cases)
- Fidelity/auditability comparison across all 3 conditions
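For the effect sizes in the list above, a minimal numpy sketch. Paired d is computed as mean difference over SD of differences (the d_z convention); confirm this matches the primary eval's convention before comparing numbers across tables:

```python
import numpy as np

def cohens_d_paired(treatment, control):
    """Paired Cohen's d (d_z): mean per-query difference over its SD."""
    diff = np.asarray(treatment) - np.asarray(control)
    return float(diff.mean() / diff.std(ddof=1))

def cohens_d_independent(a, b):
    """Independent-groups Cohen's d with pooled SD."""
    a, b = np.asarray(a), np.asarray(b)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)

# Synthetic query-level means (n=39) with a modest per-query lift
rng = np.random.default_rng(1)
ctrl = rng.normal(3.0, 0.5, 39)
rag = ctrl + rng.normal(0.4, 0.2, 39)
```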
### The money outputs:
**Table 1: Three-Group CQS Comparison**
| Dimension | Control M(SD) | RAG M(SD) | Pragmatics M(SD) | Friedman χ² | p |
|-----------|--------------|-----------|-------------------|-------------|---|
**Table 2: Pairwise Effect Sizes**
| Dimension | RAG > Ctrl d [CI] | Prag > RAG d [CI] | Prag > Ctrl d [CI] |
|-----------|-------------------|--------------------|--------------------|
**Table 3: Auditability**
| Metric | Control | RAG | Pragmatics |
|--------|---------|-----|------------|
| Auditable | 8.1% | [X]% | 72.8% |
| Fidelity | N/A | [X]% | 91.6% |
## 8. Decision Criteria
| Outcome | Interpretation | Paper Impact |
|---------|---------------|--------------|
| RAG ≈ Control on all dims | Documents in context don't help | Strongest: curation is essential |
| RAG > Control on D4, ≈ on D3/D5 | Retrieval helps definitions, not judgment | Expected: validates the category argument |
| RAG > Control on D3/D5 | Retrieval helps uncertainty/reproducibility | Weakens claim — but check fidelity |
| RAG ≈ Pragmatics | Curation unnecessary, retrieval sufficient | Problem — rethink contribution |
If RAG ≈ Pragmatics: look at fidelity. If RAG auditability is still low, the Pragmatics
condition still wins on reproducibility even if judge scores are similar (judges may not
penalize non-auditability harshly enough).
## 9. Risks and Mitigations
| Risk | Probability | Mitigation |
|------|-------------|------------|
| RAG chunks contain table IDs from accuracy docs | Low | Check chunks manually before running |
| Judge scores inflated by Anthropic self-enhancement | Known | Report OpenAI+Google pooled as primary |
| RAG responses identical to control (retrieval fails) | Low | Gate Check B catches this |
| Cost overrun | Low | Budget ~$27 matching primary eval |
| Pipeline code breaks with new condition | Medium | Build incrementally with gate checks |
| Results don't clearly differentiate | Medium | Fidelity is the backstop metric |
## 10. Estimated Total Cost
| Component | Cost |
|-----------|------|
| Stage 1 RAG generation | ~$3 |
| Stage 2 OpenAI judge | ~$4 |
| Stage 2 Anthropic judge | ~$12 |
| Stage 2 Google judge | ~$8 |
| Stage 3 fidelity | ~$0.50 |
| **Total** | **~$27.50** |
---
## CC Task Location
`cc_tasks/2026-02-15_rag_ablation_condition.md`
## Key Files (will be created)
- `scripts/build_rag_index.py`
- `src/eval/rag_retriever.py`
- `src/eval/analyze_three_group.py`
- `results/rag_ablation/` (entire directory tree)