# RAG Ablation Findings: Strategic Implications
## Date: 2026-02-15
## Context: Post three-group analysis discussion
---
## The "So Fucking What" Test
### Raw Numbers
- 35 ACS pragmatics (hand-curated expert judgments from 3 source documents)
- 311 RAG chunks (blind extraction from same 3 documents)
- Control: bare LLM, no augmentation
### Three-Group Results (0-2 scale)
| Dimension | Control | RAG | Pragmatics |
|-----------|---------|-----|------------|
| D1 (Data Selection) | 1.13 | **1.80** | 1.58 |
| D2 (Geographic) | 0.61 | 1.31 | **1.52** |
| D3 (Uncertainty) | 0.39 | 1.09 | **1.55** |
| D4 (Soundness) | 1.20 | 1.63 | **1.78** |
| D5 (Fitness) | 0.54 | 1.29 | **1.62** |
| Composite | 0.77 | 1.43 | **1.61** |
### Fidelity
| Metric | Control | RAG | Pragmatics |
|--------|---------|-----|------------|
| Fidelity | N/A | 64.9% | 91.6% |
| Mismatches | N/A | 0 | 1.6% |
---
## The Bloom's Taxonomy Framing
RAG operates at Level 1 (Remember): "Here's what the handbook says about MOEs."
Pragmatics operate at Level 5 (Evaluate): "This specific estimate has a CV
that makes it unreliable for your use case. Use this instead."
**The D1 anomaly proves this.** RAG scores HIGHER on data selection (1.80 vs
1.58) because it dumps more recalled information into the response. More
information ≠ better decisions. Knowledge recall is the lowest level of
thinking. The hard dimensions — D3 (uncertainty quantification) and D5
(fitness-for-use) — require judgment and synthesis. That's where pragmatics
pull ahead, and that's where getting it wrong causes actual harm.
The utility of information without judgment is limited. A 500-page handbook
in your context window doesn't help if you don't know which paragraph matters
for THIS query. RAG gives you information. Pragmatics give you judgment
about that information.
---
## The Expert Knowledge Argument
The most valuable pragmatics are NOT found in any document. They are locked
in the heads of expert statisticians and data scientists. Examples:
- "Don't compare 1-year and 5-year estimates directly" — this is in the
handbook, but the JUDGMENT about when users try to do it anyway and how
to redirect them is expert knowledge
- "For Mercer County PA, the CV on poverty estimates exceeds 30%" — no
document says this. An experienced analyst KNOWS this from working with
the data. A pragmatic encodes that judgment permanently.
- "St. Louis is an independent city, not in a county" — geographic edge
cases that every Census analyst learns the hard way
Pragmatics are an opportunity to CODIFY expertise and judgment that would
otherwise be lost to retirement, turnover, and institutional amnesia. This
is knowledge management, not just data management.
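
To make "codify expertise" concrete, here is a minimal sketch of what one such pragmatic could look like as an atomic, provenance-carrying node — the Mercer County judgment from above encoded as structured data. Every field name, ID, and value here is hypothetical, not the project's actual schema; the FIPS codes and table ID are standard Census identifiers used for illustration.

```python
import json

# One expert judgment as an atomic node: trigger conditions, the judgment
# itself, and a provenance chain. Schema is illustrative, not authoritative.
pragmatic = {
    "id": "prag-037",
    "judgment": (
        "Poverty-rate CVs for Mercer County, PA exceed 30%; "
        "recommend 5-year estimates for county-level poverty analysis."
    ),
    "trigger": {                          # query parameters that activate this node
        "variable": "B17001",             # ACS poverty status table
        "geography": {"state": "42", "county": "085"},  # Mercer County, PA (FIPS)
        "dataset": "acs/acs1",
    },
    "action": "warn",
    "provenance": {
        "source": "expert judgment (ACS analyst)",  # not found in any document
        "reviewed_by": "senior statistician",
        "last_reviewed": "2026-02-01",
    },
}

print(json.dumps(pragmatic, indent=2))
```

The point of the structure is that the judgment survives the analyst: it is reviewable, versionable, and attached to exactly the queries it applies to.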
---
## Cost to Own: Pragmatics vs RAG
### RAG Maintenance Cycle
1. Source document updated → re-extract
2. Re-chunk (hope parameters still work)
3. Re-embed (model version matters)
4. Re-index (hope retrieval quality holds)
5. No way to verify without full evaluation re-run
6. No traceability: which queries are affected by the change?
### Pragmatics Maintenance Cycle
1. Source changes → edit the specific node
2. Provenance chain tells you exactly which source changed
3. Deterministic delivery means you know exactly which queries are affected
4. Update is surgical, not wholesale
5. Expert review of one judgment, not re-validation of entire pipeline
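
The surgical-update property in steps 2–3 can be sketched in a few lines: because each node carries a provenance pointer, a source change maps deterministically to the affected nodes, and through their triggers to the affected query patterns. The node structure and source names below are hypothetical.

```python
# Hypothetical pragmatic store: each node cites the source it was derived from.
pragmatics = [
    {"id": "p1", "source": "acs_handbook_ch3", "trigger": {"variable": "B17001"}},
    {"id": "p2", "source": "moe_guidance_2023", "trigger": {"variable": "B19013"}},
    {"id": "p3", "source": "acs_handbook_ch3", "trigger": {"variable": "B01003"}},
]

def affected_by(changed_source: str) -> list[dict]:
    """Return the pragmatics whose provenance cites the changed source."""
    return [p for p in pragmatics if p["source"] == changed_source]

# A handbook chapter changes: the impact set is exact, not probabilistic.
hits = affected_by("acs_handbook_ch3")
print([p["id"] for p in hits])   # → ['p1', 'p3']
```

Contrast with the RAG cycle, where the same source change propagates through chunking, embedding, and indexing with no comparable way to enumerate which retrievals it altered.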
### Scale Comparison
- 35 pragmatics: a few days of expert curation
- 311 RAG chunks: 30 minutes of compute, but unknowable quality without eval
- The pragmatics took more human time but produce AUDITABLE, TRACEABLE,
  DETERMINISTIC results. The RAG chunks took less time but produce
  probabilistic, unauditable, untraceable results.
---
## The Weight Class Argument
35 pragmatics punch astronomically above their weight:
- 35 atomic judgments vs 311 text chunks (9:1 ratio)
- Yet pragmatics win on composite (1.61 vs 1.43), on D3 (1.55 vs 1.09),
on D5 (1.62 vs 1.29)
- And achieve 91.6% fidelity vs 64.9%
Even with RAG, you must curate. Someone has to decide which documents to
include, how to chunk them, how many to retrieve. Those are judgment calls
made blindly. Pragmatics make those judgment calls explicitly, with
provenance and expert review.
What good is having the information if you don't know how to use it?
---
## API-Side Delivery: The Architectural Advantage
RAG requires: vector store, embedding model, retrieval pipeline, chunk
storage. Operationally heavy. Must be hosted and maintained separately
from the data API.
Pragmatics require: a lookup table keyed to query parameters. The Census
API already returns metadata with every call (variable labels, geography
names, vintage). Pragmatics are the same pattern — structured JSON returned
alongside the data. No vector store, no embeddings, no retrieval pipeline.
Call for poverty data in a small county → API returns the data PLUS a
fitness-for-use note: "CV exceeds 30%, consider 5-year estimates."
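
A minimal sketch of that delivery pattern, assuming a lookup keyed on query parameters and merged into the JSON the data API already returns. The endpoint shape, keys, and note text are illustrative, not the real Census API contract.

```python
# Hypothetical pragmatic lookup: (dataset, variable, geography) -> fitness note.
PRAGMATIC_LOOKUP = {
    ("acs/acs1", "B17001", "county:085,state:42"):
        "CV exceeds 30% for this estimate; consider 5-year estimates.",
}

def respond(dataset: str, variable: str, geo: str, data: list) -> dict:
    """Return the data payload plus any fitness-for-use note that matches."""
    response = {"data": data}
    note = PRAGMATIC_LOOKUP.get((dataset, variable, geo))
    if note is not None:
        response["fitness_for_use"] = note   # judgment ships with the data
    return response

r = respond("acs/acs1", "B17001", "county:085,state:42",
            [["B17001_002E", "1234"]])
print(r["fitness_for_use"])
```

No vector store, no embeddings, no retrieval pipeline: a deterministic table lookup on parameters the API already parses for every call.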
This is how ARIA labels work for accessibility. This is how Open Banking
structured APIs replaced screen-scraping. The producer ships the judgment
with the data.
---
## The FCSM Pitch (One Sentence)
Same source documents, three delivery methods. Raw text helps models recall
more (D1). Curated expert judgment helps models reason correctly (D3, D5).
The difference isn't what the model knows — it's whether it gets expert
judgment at the point of decision.
## The Best Practice Recommendation
Don't just publish your data. Don't just publish your documentation.
Publish your expert judgment about fitness-for-use in a structured,
machine-queryable, deterministic form. That's what makes open data
safe for GenAI consumption.
RAG is the easy path. It works. But it doesn't scale to trustworthy.