# How the Pragmatics Layer Works
## End-to-End Data Flow
*Written 2026-02-08. For vocabulary definitions, see `pragmatics_vocabulary.md`.*
---
## The Problem in One Sentence
Census data comes with numbers but not judgment. The pragmatics layer adds the judgment.
---
## Two Paths Into the System
There are exactly two ways pragmatic context reaches the caller LLM. Understanding both is essential.
### Path 1: Auto-Bundled (implicit)
The user asks for data. The MCP tool figures out what guidance applies and bundles it with the response automatically. The caller LLM never explicitly asked for guidance — it just gets it.
```
User: "What's the median income in rural Owsley County, KY?"
│
▼
Caller LLM interprets request
Calls MCP tool: get_census_data
state="21", county="189"
variables=["B19013_001E"]
product="acs5"
│
▼
┌─────────────────────────────────┐
│ MCP get_census_data │
│ │
│ 1. Validates parameters │
│ 2. Calls Census API │
│ 3. Auto-detects triggers: │
│ • product=acs5 → "period_ │
│ estimate" │
│ • geo=county → (no small- │
│ area trigger) │
│ • var=B19* → "dollar_values", │
│ "inflation" │
│ • always → "margin_of_error", │
│ "reliability" │
│ 4. Looks up matching context │
│ 5. Bundles into response │
└────────────┬────────────────────┘
│
▼
Response to Caller LLM:
{
"data": { ... actual Census numbers ... },
"pragmatics": {
"guidance": [
{ "id": "ACS-DOL-001",
"text": "Income estimates are in nominal
dollars. To compare across years,
adjust for inflation using CPI-U.",
"latitude": "none" },
{ "id": "ACS-MOE-001",
"text": "Every ACS estimate has a margin
of error at the 90% confidence
level...",
"latitude": "narrow" }
],
"related": [ ... thread-connected context ... ]
},
"source": { ... API provenance ... }
}
```
**Key insight:** The caller LLM didn't ask for inflation guidance. The MCP saw an income variable (`B19*`) and automatically included it. This is the **parameter-matching** logic in `retriever.py:get_guidance_by_parameters()`.
The mapping rules are simple if/then — no LLM reasoning:
| If... | Then add triggers... |
|-------|---------------------|
| `product == "acs1"` | `population_threshold`, `1yr_acs` |
| `product == "acs5"` | `period_estimate` |
| `geo_level in (tract, block_group)` | `small_area` |
| Variable starts with `B19`, `B25` | `dollar_values`, `inflation` |
| `year in (2009, 2010)` | `break_in_series` |
| Always | `margin_of_error`, `reliability` |
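The table above can be sketched as straight-line Python. This is an illustrative stand-in for `retriever.py:get_guidance_by_parameters()` — the function name and signature here are assumptions, not the real API:

```python
def detect_triggers(product, geo_level, variables, year):
    """Map request parameters to trigger tags using the if/then
    rules from the mapping table. No LLM reasoning involved."""
    triggers = {"margin_of_error", "reliability"}  # always added
    if product == "acs1":
        triggers |= {"population_threshold", "1yr_acs"}
    elif product == "acs5":
        triggers.add("period_estimate")
    if geo_level in ("tract", "block_group"):
        triggers.add("small_area")
    if any(v.startswith(("B19", "B25")) for v in variables):
        triggers |= {"dollar_values", "inflation"}
    if year in (2009, 2010):
        triggers.add("break_in_series")
    return triggers
```

For the Owsley County request above (`product="acs5"`, county-level, `B19013_001E`), this yields exactly the four trigger groups shown in the diagram: `period_estimate`, `dollar_values`, `inflation`, plus the always-on `margin_of_error` and `reliability`.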
### Path 2: Explicit Request (the caller asks)
The caller LLM decides it needs guidance before or instead of pulling data. It calls `get_methodology_guidance` directly with topic tags.
```
User: "Can I compare 2015 and 2022 ACS income data for my city?"
│
▼
Caller LLM recognizes:
- temporal comparison question
- dollar values involved
- needs guidance before pulling data
Calls MCP tool: get_methodology_guidance
topics=["temporal_comparison", "dollar_values"]
domain="acs"
│
▼
┌─────────────────────────────────┐
│ MCP get_methodology_guidance │
│ │
│ 1. Takes topics as-is │
│ (these ARE the triggers) │
│ 2. Searches loaded packs for │
│ context items whose │
│ `triggers` field contains │
│ any matching topic │
│ 3. For each match, traverses │
│ thread edges (max_depth=2) │
│ to find related context │
│ 4. Returns guidance + related │
└────────────┬────────────────────┘
│
▼
Response includes:
- CMP-001: "Don't compare 1-year to 5-year..."
- DOL-001: "Adjust for inflation using CPI-U..."
- Related (via thread): BRK-001: "Known breaks
in ACS series at 2005, 2013, 2020..."
```
**Key insight:** The caller LLM chose which topics to send. This is where ADR-003 matters — you need a reasonably capable LLM to know that a temporal comparison of income involves both `temporal_comparison` and `dollar_values` triggers.
---
## What Happens Inside: Trigger Matching
The word "trigger" does double duty, which causes confusion. Here's the precise definition:
**Triggers are index keys stored on each context item.** They're the tags that make a piece of context findable. A context item about inflation adjustment might have:
```json
{
"context_id": "ACS-DOL-001",
"triggers": ["dollar_values", "inflation", "income_comparison", "cpi"],
"context_text": "Income estimates are in nominal dollars...",
"latitude": "none"
}
```
**Topics are what the caller sends.** When the caller sends `topics=["dollar_values"]`, the retriever scans all loaded context items and returns those where `"dollar_values"` appears in their `triggers` list.
**It's just tag matching.** No embeddings, no semantic search, no reasoning. The caller says "I need context tagged `dollar_values`" and the system returns everything tagged `dollar_values`. Simple.
```
Caller sends topics ──→ matched against ──→ triggers on context items
(what I need) (what I'm tagged with)
```
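Because it is pure tag matching, the whole retrieval step fits in a few lines. A minimal sketch (function name illustrative, not the real retriever API):

```python
def get_guidance_by_topics(context_items, topics):
    """Plain tag matching: return items whose `triggers` list shares
    at least one tag with the requested topics. No embeddings,
    no semantic search, no ranking."""
    wanted = set(topics)
    return [item for item in context_items
            if wanted & set(item["triggers"])]
```

Sending `topics=["dollar_values"]` against a loaded pack returns every item tagged `dollar_values`, such as `ACS-DOL-001` above, and nothing else.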
---
## Thread Traversal: Following Connections
After finding direct matches, the retriever follows **thread edges** to find related context. This is graph traversal, not search.
Each context item can have edges to other context items:
```
ACS-DIS-001 (data swapping/suppression)
│
├──relates_to──→ ACS-SUP-001 (suppression rules)
│
└──relates_to──→ ACS-DIS-003 (**, -, *** symbols)
│
└──relates_to──→ ACS-MOE-002 (CV thresholds)
```
If you query for `["suppression"]` and match ACS-DIS-001, the traversal automatically pulls in ACS-SUP-001 and ACS-DIS-003 (depth 1), and ACS-MOE-002 (depth 2).
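The traversal is a depth-limited breadth-first walk. A minimal sketch, assuming edges are held as an adjacency map (the real pack stores them in SQLite):

```python
from collections import deque

def traverse_threads(edges, start_ids, max_depth=2):
    """Breadth-first walk over thread edges, starting from the
    directly matched items. `edges` maps context_id -> list of
    related context_ids. Returns newly reached items only."""
    seen = set(start_ids)
    related = []
    queue = deque((cid, 0) for cid in start_ids)
    while queue:
        cid, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't expand beyond the depth limit
        for nxt in edges.get(cid, []):
            if nxt not in seen:
                seen.add(nxt)
                related.append(nxt)
                queue.append((nxt, depth + 1))
    return related
```

Walking the example graph from `ACS-DIS-001` yields `ACS-SUP-001` and `ACS-DIS-003` at depth 1 and `ACS-MOE-002` at depth 2, exactly as described above.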
**Three edge types:**
| Edge | Meaning | Example |
|------|---------|---------|
| `inherits` | Parent pack's general version applies | CEN-GEO-001 inherits GEN-TV-001 |
| `relates_to` | Topically connected | DIS-001 relates to SUP-001 |
| `applies_to` | Specific application of general principle | *(future use)* |
---
## Latitude: What the Caller Does With It
Latitude tells the caller LLM how much freedom it has when applying each piece of context. It's metadata *about* the guidance, not the guidance itself.
```
Caller receives:
{ text: "ACS does NOT use differential privacy.", latitude: "none" }
{ text: "CV > 40% is usually unreliable.", latitude: "narrow" }
{ text: "Consider SAIPE for county poverty.", latitude: "wide" }
Caller's reasoning:
"none" → I must state this as fact. No wiggle room.
"narrow" → I should follow this unless I have a specific reason not to.
"wide" → I should mention this but the user's context might override.
"full" → Background info. Use my judgment.
```
The MCP doesn't enforce latitude. It's a signal to the caller LLM about how to weight the context in its reasoning. A well-prompted caller will treat `none` as hard constraints and `full` as optional background.
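One way a caller-side harness might use the signal (this is hypothetical scaffolding, not part of the MCP) is to order bundled guidance strictest-first before injecting it into the prompt, so hard constraints lead:

```python
# Rank by freedom to deviate: "none" is a hard constraint, "full" is background.
LATITUDE_RANK = {"none": 0, "narrow": 1, "wide": 2, "full": 3}

def order_guidance(guidance):
    """Sort guidance items strictest-first for prompt assembly."""
    return sorted(guidance, key=lambda g: LATITUDE_RANK[g["latitude"]])
```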
---
## The Physical Architecture
### Where things live:
```
staging/ ← ASSERTED (human-verified source of truth)
├── general_statistics/
│ └── temporal.json ← 1 context item
├── census/
│ └── geography.json ← 1 context item
└── acs/
├── manifest.json ← pack metadata + version
├── margin_of_error.json ← 3 context items
├── population_threshold.json ← 3 items
├── disclosure_avoidance.json ← 3 items
├── threshold.json ← 2 items
├── independent_cities.json ← 1 item
└── ... (12 category files)
│
│ python scripts/compile_all.py
▼
packs/ ← AUGMENTED (compiled, shippable)
├── general_statistics.db ← SQLite: context + threads
├── census.db ← inherits general_statistics
└── acs.db ← inherits census
```
### The inheritance chain:
```
general_statistics.db (GEN-* items)
▲ inherits
census.db (CEN-* items)
▲ inherits
acs.db (ACS-* items, 27 total)
```
When you `load_pack("acs")`, it automatically loads all three. Queries search across the full chain.
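Resolving the chain is a simple parent walk. A sketch under the assumption that each pack's manifest names its parent (the mapping below is hard-coded for illustration):

```python
# Hypothetical parent map; in practice this would come from each
# pack's manifest.json.
PARENT = {
    "acs": "census",
    "census": "general_statistics",
    "general_statistics": None,
}

def resolve_chain(pack):
    """Return the packs to load, child first, following `inherits`."""
    chain = []
    while pack is not None:
        chain.append(pack)
        pack = PARENT[pack]
    return chain
```

So `load_pack("acs")` would open `acs.db`, `census.db`, and `general_statistics.db`, and trigger matching runs over all 27+ items in the combined chain.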
### The authoring pipeline:
```
Source documents (ACS handbook, methodology docs)
│
│ Manual extraction (today)
│ LLM-assisted extraction (future: MinerU + agent swarms)
▼
Neo4j `pragmatics` database ← AUTHORING ENVIRONMENT
(27 Context nodes, 16 RELATES_TO edges, 17 BELONGS_TO edges)
│
│ scripts/neo4j_to_staging.py
▼
staging/*.json ← ASSERTED (validated against Pydantic)
│
│ scripts/compile_all.py
▼
packs/*.db ← AUGMENTED (shipped with MCP)
│
│ (reverse sync)
│ scripts/staging_to_neo4j.py
▲
staging/*.json
```
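The staging-to-pack compile step can be sketched with only the standard library. Manual field checks stand in for the Pydantic models, and the table schema is illustrative, not the shipped layout:

```python
import json
import sqlite3

# Stand-in for the Pydantic model's required fields.
REQUIRED = ("context_id", "triggers", "context_text", "latitude")

def compile_pack(items, db_path):
    """Validate each staged item and write it into a SQLite pack.
    Returns the number of rows written."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS context ("
        "context_id TEXT PRIMARY KEY, triggers TEXT, "
        "context_text TEXT, latitude TEXT)"
    )
    written = 0
    for item in items:
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            raise ValueError(f"{item.get('context_id', '?')} missing {missing}")
        con.execute(
            "INSERT OR REPLACE INTO context VALUES (?, ?, ?, ?)",
            (item["context_id"], json.dumps(item["triggers"]),
             item["context_text"], item["latitude"]),
        )
        written += 1
    con.commit()
    con.close()
    return written
```

The real `scripts/compile_all.py` also carries thread edges and pack metadata into the `.db`; this sketch shows only the context-item path.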
---
## Concrete Example: Full Round Trip
**Query:** "What's the poverty rate in a small rural tract in Kentucky?"
**Step 1 — Caller LLM interprets** and calls `get_census_data`:
- state="21" (KY), county="189" (Owsley), tract="*"
- variables=["B17001_001E", "B17001_002E"]
- product="acs5"
**Step 2 — MCP auto-detects triggers** from parameters:
- `product=acs5` → `["period_estimate"]`
- `geo_level=tract` → `["small_area"]`
- Always → `["margin_of_error", "reliability"]`
**Step 3 — Retriever searches loaded packs** for context items whose `triggers` field contains any of: `period_estimate`, `small_area`, `margin_of_error`, `reliability`
**Step 4 — Matches found** (example):
- ACS-PER-001: "5-year estimates represent a 60-month average..." (latitude: narrow)
- ACS-THR-001: "Pop <1,000, CV can exceed 100%..." (latitude: narrow)
- ACS-THR-002: "Pop <5,000, proactively warn about MOE..." (latitude: narrow)
- ACS-MOE-001: "Every ACS estimate has a margin of error..." (latitude: narrow)
- ACS-MOE-002: "CV < 15% reliable, 15-30% caution..." (latitude: narrow)
**Step 5 — Thread traversal** from each match (depth 2):
- ACS-THR-001 →relates_to→ ACS-POP-003 (5-year availability)
- ACS-THR-001 →relates_to→ ACS-MOE-003 (5-year vs 1-year precision)
- ACS-MOE-001 →relates_to→ ACS-MOE-002 →relates_to→ ACS-MOE-003
**Step 6 — Bundle and return:**
```json
{
"data": { "B17001_001E": "4500", "B17001_002E": "2100", ... },
"pragmatics": {
"guidance": [ ... 5 direct matches ... ],
"related": [ ... 3 thread-traversed items ... ],
"sources": [{"document": "ACS-GEN-001", "section": "Ch. 7"}, ...]
}
}
```
**Step 7 — Caller LLM reasons** with the bundled context:
> "The poverty rate is approximately 47% (2,100/4,500). However, this is a 5-year average (2018-2022), not a point-in-time snapshot. At the tract level in a county with population under 5,000, the margin of error may be significant — the coefficient of variation likely exceeds 30%, meaning this estimate should be used with caution. Consider using the county-level estimate for more reliable figures."
The caller didn't need to know any of this beforehand. The pragmatics layer told it.
---
## What Each Component Is Responsible For
| Component | Does | Does NOT |
|-----------|------|----------|
| **Caller LLM** | Interprets user question, selects topics, reasons with context, explains to user | Look up triggers, traverse threads, validate parameters |
| **MCP tools** | Validates parameters, calls Census API, matches triggers, traverses threads, bundles pragmatics | Interpret user intent, reason about fitness-for-use, decide what to tell user |
| **Retriever** | Tag matching (topics → triggers), thread traversal, auto-bundling from parameters | Semantic search, embedding-based matching, LLM reasoning |
| **Pack (SQLite)** | Stores context + threads + pack metadata, supports fast lookup | Authoring, versioning, collaborative editing |
| **Neo4j** | Authoring, graph visualization, collaborative editing, schema evolution | Runtime serving (too heavy, wrong tool) |
| **Staging JSON** | Source of truth (asserted), Pydantic-validated, human-readable | Runtime queries (must be compiled first) |
---
## Common Misconceptions
**"Triggers are guidance."** No. Triggers are index keys. They make context *findable*. The guidance is in `context_text`.
**"Topics and triggers are different things."** They're the same strings used in different contexts. Topics are what the caller sends. Triggers are what context items are tagged with. The retriever matches them.
**"Latitude is a weight."** No. Weight implies magnitude (how much does this matter?). Latitude implies freedom (how much can you deviate?). A piece of context can be extremely important (you'd better read it) but have `full` latitude (you can interpret it however you want).
**"The pragmatics layer does reasoning."** No. It does lookup and traversal. All reasoning happens in the caller LLM. The pragmatics layer is a library, not a librarian.
**"Thread traversal is search."** No. It's graph walking. Search finds things you don't know exist. Traversal follows known connections from things you already found. The retriever does search (tag matching) first, then traversal second.