# DEC-4B-023: D6 Rubric Revision — Groundedness → Pipeline Fidelity
## Date: 2026-02-13
## Discovery
D6 "Groundedness & Faithfulness" consistently scored treatment LOWER than
control (d=-0.23) across all judge vendors. Investigation revealed this to be
a rubric design flaw, not a treatment deficiency.
### Root Cause
D6 asks: "Are all claims traceable to cited Census sources?"
- **Control** makes vague, unfalsifiable claims ("about 15% according to Census data").
  Nothing to trace → nothing to penalize → scores well by default.
- **Treatment** cites specific variables (B17001_002E), FIPS codes (04013), exact
  values with MOEs. Each is a verifiable claim that can be wrong → more surface
  area for penalty.
The rubric rewards vagueness and penalizes specificity. This is **arbitrary
coherence** — the judge has no tools to verify claims, so it scores plausibility,
not actual groundedness.
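The asymmetry is easy to demonstrate with a toy claim extractor. The regex patterns and example sentences below are illustrative assumptions, not the project's actual extraction rules:

```python
import re

# Illustrative patterns (assumptions): ACS variable IDs look like
# B17001_002E; county FIPS codes are 5 bare digits.
VAR_RE = re.compile(r"\b[BCS]\d{5}_\d{3}[EM]\b")
FIPS_RE = re.compile(r"\b\d{5}\b")

control = "About 15% according to Census data."
treatment = "B17001_002E for FIPS 04013 is 612,034 (MOE ±8,450)."

def verifiable_claims(text):
    """Collect claim tokens that could be checked against an API response."""
    return VAR_RE.findall(text) + FIPS_RE.findall(text)

print(verifiable_claims(control))    # → []  (nothing to trace, nothing to penalize)
print(verifiable_claims(treatment))  # → ['B17001_002E', '04013']
```

The control sentence yields zero checkable tokens, so a traceability rubric can never dock it; the treatment sentence exposes two.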
### Key Insight
Groundedness of API-retrieved data is not a subjective judgment. It is a
deterministic measurement: did the response faithfully report what the API
returned? We already have both sides of this comparison in the Stage 1 data
(tool_calls contain API responses; response_text contains what was reported).
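A minimal sketch of having both sides of the comparison in one record. The `tool_calls` and `response_text` field names come from the text above; the nested `args`/`result` shape and all values are assumptions for illustration:

```python
import json

# Hypothetical shape of one Stage 1 JSONL record (values are made up).
record = json.loads("""{
  "condition": "treatment",
  "tool_calls": [
    {"name": "census_acs5",
     "args": {"get": "B17001_002E", "for": "county:013", "in": "state:04"},
     "result": {"B17001_002E": "612034", "state": "04", "county": "013"}}
  ],
  "response_text": "B17001_002E for FIPS 04013 is 612,034."
}""")

api_value = record["tool_calls"][0]["result"]["B17001_002E"]
reported = record["response_text"].replace(",", "")
print(api_value in reported)  # → True: the check is deterministic, no judge needed
```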
## Decision
**Replace D6 subjective judge scoring with automated Pipeline Fidelity metric.**
- Extract verifiable claims from treatment response_text
- Compare against tool_call results in Stage 1 JSONL
- Compute a match rate (exact match for variable codes and FIPS, approximate match within a tolerance for numeric values)
- Control scored N/A (no verifiable claims to check)
- Report as separate metric, not part of 5-dimension judge composite
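The comparison step can be sketched as follows. The 0.5% relative tolerance, the `(key, reported_value)` claim format, and the helper names are assumptions, not the project's actual implementation:

```python
from typing import Optional

def approx_equal(reported: float, actual: float, tol: float = 0.005) -> bool:
    """Approximate match for numeric values (0.5% relative tolerance; assumed)."""
    if actual == 0:
        return reported == 0
    return abs(reported - actual) / abs(actual) <= tol

def fidelity(claims: list[tuple[str, str]], api: dict[str, str]) -> Optional[float]:
    """Match rate over (key, reported_value) claims; None when nothing is checkable."""
    if not claims:
        return None  # control case: no verifiable claims → scored N/A
    hits = 0
    for key, reported in claims:
        actual = api.get(key)
        if actual is None:
            continue  # claim not present in any tool_call result
        try:
            hits += approx_equal(float(reported.replace(",", "")), float(actual))
        except ValueError:
            hits += (reported == actual)  # exact string match for codes/FIPS
    return hits / len(claims)

claims = [("B17001_002E", "612,034"), ("county", "013")]
api = {"B17001_002E": "612034", "state": "04", "county": "013"}
print(fidelity(claims, api))  # → 1.0
print(fidelity([], api))      # → None (control: N/A, not zero)
```

Returning `None` rather than `0.0` for the empty-claims case keeps the control condition out of the denominator, matching the "scored N/A" decision above.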
## Framing
This maps to established frameworks:
- **NIST AI RMF (AI 600-1)**: Valid and reliable AI outputs
- **OMB Circular A-130**: Information quality and traceability
- **FCSM Data Quality Framework**: Auditability of statistical outputs
We say "pipeline fidelity" rather than "groundedness": the metric measures whether
the system faithfully transmits what authoritative sources provide, not whether
its claims merely sound grounded.
## Rubric Impact
- CQS composite becomes D1-D5 (5 dimensions, judge-scored)
- Pipeline Fidelity reported separately (automated, deterministic)
- D6 scores from existing runs retained for methodological discussion
- Paper discusses why subjective LLM-judge scoring is insufficient for
accuracy measurement and proposes automated alternative
## Status
- [x] Discovery documented
- [x] Fidelity extraction script designed and built
- [x] Fidelity metric computed against v3 Stage 1 data
- [x] Symmetric auditability measurement (both conditions)
- [ ] Paper section drafted
- [ ] Integrate into aggregate analysis with Stage 2 results