Open Census MCP Server

10_appendices.md•17.9 KiB

# Appendices

## Appendix A: Complete Test Battery

This appendix lists the full 39-query test battery. Source: `src/eval/battery/queries.yaml`.

**Distribution:** 15 standard queries (category `normal`) and 24 edge-case queries across 7 edge categories.

| # | Query Text | Category | Difficulty |
|---|-----------|----------|------------|
| 1 | What is the total population of California according to the most recent Census data? | normal | normal |
| 2 | What is the median household income in Cook County, Illinois? | normal | normal |
| 3 | How many housing units are in Harris County, Texas? | normal | normal |
| 4 | What percentage of people in New York City have a bachelor's degree or higher? | normal | normal |
| 5 | What is the poverty rate in Maricopa County, Arizona? | normal | normal |
| 6 | What percentage of households in Miami-Dade County rent rather than own their home? | normal | normal |
| 7 | How many people in King County, Washington are 65 or older? | normal | normal |
| 8 | What is the unemployment rate in Wayne County, Michigan? | normal | normal |
| 9 | What is the median age in Travis County, Texas? | normal | normal |
| 10 | What percentage of people in Hennepin County, Minnesota have health insurance? | normal | normal |
| 11 | How many people in Fulton County, Georgia were born in another country? | normal | normal |
| 12 | What is the average household size in Salt Lake County, Utah? | normal | normal |
| 13 | What percentage of workers in Alameda County, California commute by public transit? | normal | normal |
| 14 | How many single-mother households are there in Philadelphia County, Pennsylvania? | normal | normal |
| 15 | What is the median gross rent in Denver County, Colorado? | normal | normal |
| 16 | What is the population of Washington? | geographic_edge | trap |
| 17 | What is the median income in Portland? | geographic_edge | trap |
| 18 | Give me tract-level median income data for rural Loving County, Texas. | geographic_edge | trap |
| 19 | What is the median household income in Alexandria, Virginia? | geographic_edge | tricky |
| 20 | Compare poverty rates in the Bronx and Manhattan. | geographic_edge | tricky |
| 21 | What is the homeownership rate in Nashville, Tennessee? | geographic_edge | tricky |
| 22 | What is the unemployment rate in Washington, DC? | geographic_edge | tricky |
| 23 | What is the median household income in Kalawao County, Hawaii? | small_area | trap |
| 24 | Compare the poverty rates across all census tracts in rural Wyoming. | small_area | trap |
| 25 | What is the income of Asian Americans in Boise, Idaho? | small_area | tricky |
| 26 | I need ACS 1-year data for Gallatin County, Montana. | small_area | tricky |
| 27 | Compare the 2019 and 2020 ACS estimates for health insurance coverage in Florida. | temporal | trap |
| 28 | How has median household income in Philadelphia changed from 2010 to 2022? | temporal | tricky |
| 29 | Has the percentage of people working from home in Denver increased since 2015? | temporal | tricky |
| 30 | What was the median home value in San Francisco in 2005 dollars? | temporal | tricky |
| 31 | How many families are in poverty in Springfield? | ambiguity | trap |
| 32 | What's the income gap between whites and minorities in my area? | ambiguity | trap |
| 33 | Is the economy better in Texas or California? | ambiguity | trap |
| 34 | Give me ACS 1-year estimates for Sioux County, Nebraska. | product_mismatch | tricky |
| 35 | What does the decennial census say about income levels in Ohio? | product_mismatch | tricky |
| 36 | I need monthly employment data from the ACS. | product_mismatch | tricky |
| 37 | My 8th grade class is doing a project on our town. How many people live in Bozeman, Montana and is it growing? | persona_8th_grader | normal |
| 38 | I'm analyzing population trends in Bozeman, MT for a comprehensive plan update. I need the most recent ACS estimates with margins of error, and guidance on comparing to the 2010 baseline. | persona_city_planner | tricky |
| 39 | I'm writing a story about whether Bozeman is really 'booming' as people claim. What do the Census numbers actually show, and how confident should I be in those numbers? | persona_journalist | tricky |

**Difficulty key:** `normal` = standard query with a clear answer; `tricky` = requires methodological care; `trap` = contains a latent error, ambiguity, or fitness-for-use failure that an uninformed response would miss.

---

## Appendix B: Consultation Quality Score (CQS) Rubric

The CQS rubric specifies five quality dimensions (D1–D5), each scored 0–2. The full specification is available at `docs/verification/cqs_rubric_specification.md`. Grounding compliance is reported as a Stage 3 pipeline verification metric alongside fidelity and auditability.

| Dimension | Name | What It Measures | Scoring |
|-----------|------|------------------|---------|
| D1 | Source Selection & Fitness | Right Census product, vintage, geography, and universe | 0 / 1 / 2 |
| D2 | Methodological Soundness | Correct computations, weights, denominators, and formulas | 0 / 1 / 2 |
| D3 | Uncertainty Communication | MOE acknowledged, quantified, and correctly interpreted | 0 / 1 / 2 |
| D4 | Definitional Accuracy | Official Census concepts and reference periods used correctly | 0 / 1 / 2 |
| D5 | Reproducibility & Traceability | Another analyst can replicate the cited numbers | 0 / 1 / 2 |

**Stage 3 verification metrics (pipeline behavior, not CQS dimensions):**

- Fidelity: 91.2% (pragmatics), 74.6% (RAG), 78.3% (control)
- Auditability: 72.8% (pragmatics), 8.1% (control)
- Grounding compliance: 100% (all 39 pragmatics queries consulted methodology guidance before data interpretation)

### Full Scoring Criteria

#### D1: Source Selection & Fitness

**What it measures:** Did the response select the right Census product, vintage, geography level, and population universe for the stated question?

- **Score 0 (Absent):** Wrong product entirely (e.g., decennial for income), wrong vintage, wrong geography level for the population, or no product specified.
- **Score 1 (Partial):** Correct product family but wrong parameters (e.g., ACS 1-year for a 15K-population area), or correct product but without justification.
- **Score 2 (Complete):** Correct product, vintage, geography, and universe, with rationale appropriate to the query context. Also scores 2: correctly determining that no available product meets fitness-for-use requirements, explaining why, and redirecting to alternatives.

**Failure modes:** Using ACS 1-year for geographies below the 65,000-population threshold; mixing decennial and ACS concepts without noting design differences; not specifying vintage when temporal precision matters.

#### D2: Methodological Soundness

**What it measures:** Are computations, weights, denominators, and formulas correct for the stated analysis?

- **Score 0 (Absent):** Fundamental errors (wrong denominator, unweighted counts used for inference, or incorrect derived statistics), or no computation shown.
- **Score 1 (Partial):** Core computation correct but missing weight specification, incomplete formula, or minor unit inconsistency.
- **Score 2 (Complete):** Correct computation with appropriate weights, denominators, and formulas; consistent units and proper aggregation methods.

**Failure modes:** Dividing by total population when the universe is civilian noninstitutionalized; adding MOEs directly instead of combining them by root-sum-of-squares; comparing rates with different bases without noting the difference.

#### D3: Uncertainty Communication

**What it measures:** Does the response acknowledge, quantify, and correctly interpret statistical uncertainty?

- **Score 0 (Absent):** No mention of uncertainty, MOE, or reliability. Estimates presented as exact counts.
- **Score 1 (Partial):** Uncertainty mentioned qualitatively ("estimates may vary") but not quantified, or MOE provided without interpretation.
- **Score 2 (Complete):** MOE or SE provided with the correct confidence level, significance testing appropriate to the survey design, and an explicit reliability assessment. Also scores 2: determining that uncertainty is too high for the estimate to be useful and recommending against its use.

**Failure modes:** Ranking estimates without checking MOE overlap; over-precision (reporting tract-level income to the dollar without MOE); applying a 95% confidence-interval interpretation to ACS data reported at 90% confidence.

#### D4: Definitional Accuracy

**What it measures:** Are official Census concepts, classifications, and reference periods used correctly?

- **Score 0 (Absent):** Key concepts conflated or used incorrectly (e.g., household vs. family, nominal vs. real dollars, point-in-time vs. period estimate).
- **Score 1 (Partial):** Correct concepts but imprecise language, or reference period not specified.
- **Score 2 (Complete):** Official definitions used correctly, reference periods explicit, and cross-source differences flagged when applicable.

**Failure modes:** Treating ACS period estimates as point-in-time snapshots; conflating "household income" with "family income"; comparing ACS and CPS estimates without noting design and definitional differences.

#### D5: Reproducibility & Traceability

**What it measures:** Can another analyst replicate the stated numbers from the cited sources?

- **Score 0 (Absent):** "According to Census data...", with no table ID, variable codes, or geography specification.
- **Score 1 (Partial):** Dataset and year specified but table ID or variable codes missing, or geography described but not with FIPS/GEOID precision.
- **Score 2 (Complete):** Full provenance: dataset, table ID or variable codes, geography (with identifiers), year/vintage, and a description of any filters or transformations.
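As an illustration of the Score 2 provenance fields, the following minimal sketch turns them into a Census Data API request that another analyst could rerun. It assumes the public `api.census.gov` query pattern (`get`, `for`, `in` parameters); the `Provenance` class and its `to_api_url` method are hypothetical names used only for illustration. The example cites median household income (table B19013, estimate and MOE variables) from the 2022 ACS 5-year for Cook County, Illinois (state FIPS 17, county FIPS 031).

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of the provenance fields a Score 2 (D5) response states."""

    year: int                   # vintage, e.g. 2022
    dataset: str                # API dataset path, e.g. "acs/acs5" (ACS 5-year)
    variables: tuple[str, ...]  # variable codes, e.g. ("B19013_001E", "B19013_001M")
    geo_for: str                # target geography with FIPS code, e.g. "county:031"
    geo_in: str = ""            # enclosing geography, e.g. "state:17"

    def to_api_url(self) -> str:
        """Build a Census Data API request that reproduces the cited numbers."""
        params = {"get": ",".join(("NAME", *self.variables)), "for": self.geo_for}
        if self.geo_in:
            params["in"] = self.geo_in
        # Leave ',' and ':' unescaped so the URL stays readable in a citation.
        query = urlencode(params, safe=",:")
        return f"https://api.census.gov/data/{self.year}/{self.dataset}?{query}"


if __name__ == "__main__":
    # Median household income (estimate and MOE) for Cook County, Illinois.
    cook = Provenance(2022, "acs/acs5", ("B19013_001E", "B19013_001M"), "county:031", "state:17")
    print(cook.to_api_url())
```

Requesting the `_001M` variable alongside `_001E` keeps the MOE attached to the estimate, which also serves D3. The same county's GEOID is the concatenated FIPS code, `17031`.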
**Failure modes:** Confabulated table IDs; correct data but no way to verify the source; describing geography colloquially without FIPS codes or GEOIDs.

---

## Appendix C: System Prompts

This appendix gives the system prompt used for each experimental condition. Source: `src/eval/agent_loop.py` and `src/eval/rag_retriever.py`.

All three conditions share the same base system prompt; they differ only in what augments or extends it.

### Base System Prompt (shared across all conditions)

```
You are a statistical consultant helping users access and understand U.S. Census data. Use your available tools to answer the question.
```

### Control Condition

Identical to the base system prompt, with no augmentation. Receives the data retrieval tools (`get_census_data`, `explore_variables`) only.

### RAG Condition

The base system prompt is augmented at runtime with retrieved chunks of methodology documentation. The following template is applied before each query:

```
{base_prompt}

## Reference Materials
The following excerpts from Census methodology documentation may be relevant:

{retrieved_chunks}

Use these materials to inform your response where applicable.
```

Here `{retrieved_chunks}` holds the top 5 chunks retrieved from a 311-chunk FAISS index of ACS methodology documentation, ranked by cosine similarity to the query. The RAG condition receives the same data retrieval tools as the control.

### Pragmatics Condition

Extends the base prompt with a grounding-gate instruction that forces consultation of methodology guidance before data retrieval:

```
You are a statistical consultant helping users access and understand U.S. Census data. Use your available tools to answer the question.

You MUST call get_methodology_guidance FIRST before any other tool calls. This is required for every query — no exceptions. Select topics relevant to the query. After reviewing the methodology guidance, proceed with data retrieval.
```

Receives the data retrieval tools plus `get_methodology_guidance`, which is withheld from the control and RAG conditions.
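Because the gate requires `get_methodology_guidance` to be the first tool call, compliance can be audited directly from tool-call logs. The following is a minimal sketch, assuming each run is logged as an ordered list of tool names; that log format and the function names are hypothetical, and the project's actual grounding-compliance metric may be computed differently.

```python
from typing import Sequence

GATE_TOOL = "get_methodology_guidance"


def follows_grounding_gate(tool_calls: Sequence[str]) -> bool:
    """True if the run's first tool call is the methodology tool, as the prompt requires."""
    return len(tool_calls) > 0 and tool_calls[0] == GATE_TOOL


def grounding_compliance(runs: Sequence[Sequence[str]]) -> float:
    """Share of runs that satisfied the gate (0.0 for an empty batch)."""
    if not runs:
        return 0.0
    return sum(follows_grounding_gate(calls) for calls in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical logs: the second run skips the gate and goes straight to data.
    runs = [
        ["get_methodology_guidance", "explore_variables", "get_census_data"],
        ["get_census_data"],
    ]
    print(grounding_compliance(runs))  # 0.5
```

A run with no tool calls counts as non-compliant here, on the reasoning that an answer produced without any tool use never consulted the guidance.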
The `get_methodology_guidance` tool queries the compiled ACS pragmatics pack (SQLite) and returns structured expert judgment relevant to the query topics.

---

## Appendix D: Design Correction Post-Mortem

The V1 evaluation design contained a confound: the pragmatics condition had access to a methodology guidance tool that the control and RAG conditions lacked, making tool access, not knowledge representation, the independent variable. This was identified and corrected in V2, in which all conditions received identical data tools and differed only in the form of methodology support. Full documentation is in `docs/decisions/ADR-011-v2-evaluation-design-correction.md`.

---

## Appendix E: Pragmatic Item Catalog

The ACS pack contains 36 pragmatic items, listed here sorted by category. Full content (context text, triggers, thread edges, provenance) is available in `staging/acs/*.json` (18 category files).

| Item ID | Category | Latitude | Context (first 100 chars) | Triggers | Thread Edges |
|---------|----------|----------|---------------------------|----------|--------------|
| ACS-BRK-001 | break_in_series | narrow | The 2009-2010 transition marks a break in population controls due to shift from Census 2000... | 4 | 1 |
| ACS-BRK-002 | break_in_series | narrow | The ACS transitioned from long-form decennial census to continuous monthly collection in 200... | 7 | 2 |
| ACS-BRK-003 | break_in_series | narrow | Starting with 2024 data, ACS updated the Period of Military Service question to align with D... | 6 | 1 |
| ACS-CMP-001 | comparison | none | Never directly compare ACS 1-year estimates with 5-year estimates. They represent different ... | 3 | 1 |
| ACS-CMP-002 | comparison | narrow | Consecutive 5-year estimates share 4 out of 5 years of underlying data. This means they are... | 6 | 2 |
| ACS-CMP-003 | comparison | none | Overlapping confidence intervals do NOT prove two estimates are statistically indistinguisha... | 5 | 1 |
| ACS-DIS-001 | disclosure_avoidance | narrow | ACS applies data swapping and noise injection to protect respondent confidentiality. Small-a... | 5 | 2 |
| ACS-DIS-002 | disclosure_avoidance | none | ACS does NOT use differential privacy. The 2020 Decennial Census used differential privacy,... | 4 | 0 |
| ACS-DIS-003 | disclosure_avoidance | narrow | When ACS estimates show a margin of error equal to the estimate itself, or when the Census B... | 5 | 1 |
| ACS-DOL-001 | dollar_values | narrow | When comparing dollar-denominated estimates (income, rent, home value) across different ACS ... | 5 | 1 |
| ACS-EQV-001 | geographic_equivalence | narrow | Some census tracts contain an entire county's population — this occurs in very rural or spar... | 5 | 1 |
| ACS-EQV-002 | geographic_equivalence | narrow | Census Designated Places (CDPs) are statistical entities, not legal jurisdictions. CDPs have... | 5 | 0 |
| ACS-GEO-001 | geography | none | Block group level data is only available in ACS 5-year estimates, not 1-year estimates. This... | 4 | 1 |
| ACS-GEO-002 | geography | wide | Public Use Microdata Areas (PUMAs) have a minimum population of 100,000. PUMA boundaries do ... | 4 | 0 |
| ACS-GEO-003 | geography | wide | Congressional district boundaries change after each decennial census reapportionment. ACS es... | 4 | 0 |
| ACS-GEO-004 | geography | full | ACS geographic boundaries reflect boundaries as of January 1 of the final year in the survey... | 4 | 0 |
| ACS-GQ-001 | group_quarters | narrow | ACS includes group quarters population (college dorms, military barracks, prisons). For comm... | 8 | 2 |
| ACS-GQ-002 | group_quarters | wide | ACS group quarters imputation rates can be very high — up to 30-50% of GQ persons may have ... | 6 | 2 |
| ACS-IND-001 | independent_cities | none | Some US cities are county-equivalents (independent cities) — they do NOT nest inside a count... | 5 | 0 |
| ACS-MOE-001 | margin_of_error | narrow | To calculate standard error from ACS margin of error: SE = MOE / 1.645. ACS MOEs are report... | 3 | 1 |
| ACS-MOE-002 | margin_of_error | narrow | Coefficient of variation (CV) = (SE / estimate) × 100. CV above 40% indicates the estimate ... | 3 | 1 |
| ACS-MOE-003 | margin_of_error | wide | 5-year estimates have smaller margins of error than 1-year estimates for the same geography,... | 3 | 0 |
| ACS-MOE-004 | margin_of_error | narrow | MOE approximation formulas for derived estimates (sums, differences, ratios) assume independ... | 5 | 2 |
| ACS-NRS-001 | nonresponse | narrow | ACS publishes allocation rates (item imputation rates) for every characteristic. High alloca... | 6 | 2 |
| ACS-NRS-002 | nonresponse | narrow | ACS uses hot-deck imputation, which assigns values from a statistically similar responding u... | 7 | 1 |
| ACS-PER-001 | period_estimate | narrow | ACS produces period estimates, not point-in-time estimates. A 5-year estimate represents an ... | 3 | 0 |
| ACS-PCL-001 | population_controls | narrow | ACS estimates at the tract and block group level are NOT controlled to independent populatio... | 5 | 2 |
| ACS-POP-001 | population_threshold | none | ACS 1-year estimates are only published for geographic areas with population of 65,000 or mo... | 3 | 1 |
| ACS-POP-002 | population_threshold | none | ACS 1-year Supplemental Estimates are available for areas with population of 20,000 or more,... | 3 | 0 |
| ACS-POP-003 | population_threshold | none | ACS 5-year estimates are available for all geographic areas, including census tracts and bloc... | 3 | 0 |
| ACS-REL-001 | release_schedule | narrow | As of December 2025, the most recent ACS releases are: ACS 1-year 2024 (released September ... | 4 | 0 |
| ACS-RES-001 | residence_rules | narrow | ACS uses a 'current residence' rule — a person must have lived at an address for 2 months or... | 6 | 2 |
| ACS-SAM-001 | sampling | wide | ACS sampling rates are not uniform. Sparsely populated areas are sampled at rates up to 5x h... | 6 | 2 |
| ACS-SUP-001 | suppression | wide | Some 1-year ACS tables may be suppressed if estimates are deemed too unreliable. Suppression... | 3 | 0 |
| ACS-THR-001 | threshold | narrow | For geographies with total population under approximately 1,000, ACS 5-year estimates may st... | 5 | 2 |
| ACS-THR-002 | threshold | narrow | When a user requests data for a small place (population under 5,000), proactively check whet... | 4 | 1 |

**Latitude key:** `none` = hard constraint (no exceptions); `narrow` = strong guidance with rare exceptions; `wide` = context-dependent; `full` = background information.
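Several catalog items state formulas or thresholds compactly enough to express in code. The following minimal sketch combines ACS-MOE-001 (SE from a 90% MOE), ACS-MOE-002 (CV), the root-sum-of-squares MOE for sums and differences that D2 expects (an approximation that, per ACS-MOE-004, assumes independent components), a z-test for the difference between two estimates (ACS-CMP-003 warns against relying on interval overlap), and the publication thresholds in ACS-POP-001 through ACS-POP-003. All function names are hypothetical. The difference test is the standard two-estimate test at 90% confidence and assumes the estimates are independent, which does not hold for consecutive 5-year estimates that share four of five years of data (ACS-CMP-002).

```python
import math

Z90 = 1.645  # ACS MOEs are published at the 90% confidence level (ACS-MOE-001)


def se_from_moe(moe: float) -> float:
    """Standard error from a published 90% MOE: SE = MOE / 1.645."""
    return moe / Z90


def cv_percent(estimate: float, moe: float) -> float:
    """Coefficient of variation in percent: (SE / estimate) * 100 (ACS-MOE-002 flags CV > 40%)."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return se_from_moe(moe) / estimate * 100


def moe_of_sum(*moes: float) -> float:
    """Approximate MOE of a sum or difference: square root of the summed squared MOEs.

    Assumes the components are independent (ACS-MOE-004)."""
    return math.sqrt(sum(m * m for m in moes))


def significant_difference(est1: float, moe1: float, est2: float, moe2: float,
                           z_crit: float = Z90) -> bool:
    """Two-estimate z-test at 90% confidence; do not judge by CI overlap (ACS-CMP-003)."""
    se_diff = math.sqrt(se_from_moe(moe1) ** 2 + se_from_moe(moe2) ** 2)
    if se_diff == 0:
        return est1 != est2  # no sampling error: any difference is real
    return abs(est1 - est2) / se_diff > z_crit


def available_products(population: int) -> list[str]:
    """ACS products published for an area of this size (ACS-POP-001 to ACS-POP-003)."""
    products = []
    if population >= 65_000:
        products.append("ACS 1-year")
    if population >= 20_000:
        products.append("ACS 1-year supplemental")
    products.append("ACS 5-year")  # published for all geographic areas
    return products


if __name__ == "__main__":
    # 90% intervals [90, 110] and [105, 125] overlap, yet the difference is significant.
    print(significant_difference(100, 10, 115, 10))  # True
    print(round(cv_percent(1000, 164.5), 1))          # 10.0
    print(available_products(30_000))                 # ['ACS 1-year supplemental', 'ACS 5-year']
```

The overlap example is the point of ACS-CMP-003: the difference of 15 exceeds the combined MOE of about 14.1 (the square root of 10² + 10²), so the estimates differ significantly even though their intervals overlap.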
