# Pragmatics Gap Detection — Evaluation as Improvement Sensor
**Date:** 2026-02-16
**Section relevance:** Paper §8 Discussion, §10 Future Work
**Tags:** {paper} {slides:backup}
## Insight
The evaluation pipeline is not just a measurement system — it's a continuous
improvement sensor for the pragmatics catalog. Every query where:
1. `get_methodology_guidance` returns empty or irrelevant guidance
2. The model calls the tool with topics that don't match any pragmatic items
3. The response quality is low DESPITE grounding (pragmatics gap, not model gap)
...identifies a specific hole in the curated expert judgment catalog.
## Why This Matters
Because we have full traceability — query → topic selection → guidance returned →
response produced → judge score — we can trace quality failures back to their root
cause. A low D3 (uncertainty communication) score on a pragmatics response where the
model called `get_methodology_guidance` with topic "dollar_values" but got nothing
back tells you: we need a pragmatic item about inflation adjustment.
This is the opposite of the typical ML improvement cycle where you retrain on more
data and hope. Here you have a specific, traceable, actionable gap: this query needed
this guidance and we didn't have it.
## Reusable Benchmarking Utility
The test battery + judge pipeline is reusable and scalable as a benchmarking system.
Each new set of pragmatics packs added to the catalog can be measured against the
same battery. Performance can be tracked over time as the catalog grows:
- 35 ACS items → baseline scores
- Add CPS pack → rerun battery with CPS queries → measure delta
- Add SIPP pack → same
- Each pack addition has a measurable before/after on the same benchmark
The 35 items are ACS-only for this study. CPS pragmatics exist in the system but
aren't exercised by the current test battery. Expanding the battery to include
CPS/SIPP queries tests cross-survey performance with the same pipeline.
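A pack addition's before/after on the shared battery reduces to a per-query delta plus a mean. Sketch only — score dictionaries keyed by query ID are an assumption about the pipeline's output format:

```python
def pack_delta(before: dict[str, float], after: dict[str, float]):
    """Per-query judge-score change after adding a pragmatics pack, plus the mean.

    `before`/`after` map query IDs to composite judge scores from the same battery.
    """
    shared = sorted(before.keys() & after.keys())   # only queries run in both passes
    per_query = {q: round(after[q] - before[q], 3) for q in shared}
    mean_delta = sum(per_query.values()) / len(per_query) if per_query else 0.0
    return per_query, mean_delta
```

Because the battery is fixed, the mean delta is directly comparable across pack additions.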
## Reducing the Stochastic Tax
We validated deterministic retrieval behavior in the pragmatics system — the same
query with the same topics returns the same guidance every time. This reduces the
stochastic tax wherever we can control or account for it. The remaining variance
is in model reasoning, which is the part we WANT to be flexible.
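Determinism at the retrieval layer is cheap to verify mechanically: issue the same request repeatedly and require byte-identical guidance. The `retrieve` callable here is a stand-in for whatever the real retrieval entry point is:

```python
import hashlib
import json

def is_deterministic(retrieve, query: str, topics: list[str], runs: int = 5) -> bool:
    """True iff identical (query, topics) requests yield identical guidance each run."""
    digests = set()
    for _ in range(runs):
        guidance = retrieve(query, topics)  # hypothetical retrieval entry point
        # Canonicalize to JSON before hashing so dict key order can't cause
        # spurious mismatches.
        blob = json.dumps(guidance, sort_keys=True).encode()
        digests.add(hashlib.sha256(blob).hexdigest())
    return len(digests) == 1
```

Run this over the whole battery and any request producing more than one digest is a determinism regression.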
RAG and GraphRAG are meant to "ground the model" and they do — at a cost. But
consider the retrieval quality hierarchy:
- **RAG:** Retrieves by semantic similarity. In homogeneous domains like Census
methodology, embeddings can't differentiate between chunks that are semantically
close but contextually distinct. Retrieval is stochastic — similar queries may
get different chunks depending on embedding noise. No curation, no quality gate.
- **GraphRAG / Knowledge Graphs:** Retrieves by relationship traversal. Better
structural precision than flat RAG, but "if not done right, how do you know?"
The graph encodes whatever was extracted — including extraction errors, missing
relationships, and uncurated noise. The traversal is deterministic given the
graph, but the graph itself may be wrong. No fitness-for-use judgment.
- **Pragmatics:** Retrieves curated expert judgment. Deterministic. Each item has
been examined for correctness, bias, safety, and fitness-for-use. Provenance is
traceable to source documents with page-level citations. Latitude encodes
calibrated uncertainty. The catalog is transparent — you can read every item,
audit every claim, challenge every judgment.
RAG and GraphRAG optimize for retrieval coverage. Pragmatics optimize for retrieval
quality. They're solving different problems. RAG asks "what's relevant?" Pragmatics
asks "what does the expert think you need to know right now?"
## Tacit Knowledge Capture
This connects to a persistent challenge in knowledge management: capturing expert
tacit knowledge. Organizations have struggled with this for decades. Pragmatics offer
a concrete mechanism:
- **Automated candidate extraction:** LLM-assisted extraction from source documents
identifies candidate items (what the documents say)
- **Expert SME capture:** A structured interview/review process where the expert
adds judgment that ISN'T in the documents — the fitness-for-use assessments, the
"I wouldn't trust that estimate because..." wisdom that lives in their head
- **Curation as quality gate:** Every item is reviewed, given a latitude, cited to
sources. This is what makes it trustworthy — not the volume, but the review.
The word "wisdom" may be a stretch, but the direction is right. Pragmatics capture
the layer between "knowing the facts" (data, documents) and "knowing what to do with
the facts" (expert judgment). That's the layer that's hardest to encode and most
valuable to deliver.
## Scaling Implications
At scale (hundreds or thousands of queries beyond the 39-query battery):
- Cluster the topic requests that return empty/low-relevance guidance
- Rank gaps by frequency and judge score impact
- Prioritize new pragmatic items by measurable quality improvement potential
- The catalog grows based on empirical demand, not expert guesswork
This turns the 35-item catalog into a living system that improves based on where it
actually fails, with measurements at every step.
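The clustering-and-ranking step above can be sketched as frequency times judge-score shortfall over unmet topic requests. The 0.8 baseline and the trace field names are assumptions:

```python
from collections import defaultdict

def rank_gaps(traces: list[dict], baseline: float = 0.8):
    """Rank topics with empty guidance by count x cumulative judge-score shortfall."""
    stats = defaultdict(lambda: [0, 0.0])           # topic -> [count, shortfall]
    for t in traces:
        if not t["guidance"]:                       # empty/low-relevance guidance
            stats[t["topic"]][0] += 1
            stats[t["topic"]][1] += max(0.0, baseline - t["score"])
    ranked = sorted(stats.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    return [(topic, count, round(shortfall, 3)) for topic, (count, shortfall) in ranked]
```

The top of the ranking is the next pragmatic item to author: the topic that is both requested often and hurting scores most.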
## Connection to Extraction Pipeline
Gap detection feeds directly into the quarry extraction pipeline:
- Identified gap → search source documents for relevant content
- Extract → curate → add to catalog → rerun affected queries → measure improvement
- Each new pragmatic item has a measurable before/after on real queries
The same pipeline that built the initial 35 items scales to fill gaps discovered
through evaluation.
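Closing the loop means rerunning only the queries a new item targets and recording the delta. `run_battery` here is a hypothetical stand-in for the existing battery + judge pipeline:

```python
def item_before_after(run_battery, catalog: list[str], new_item: str,
                      affected_queries: list[str]) -> dict[str, float]:
    """Measure one new pragmatic item's effect on the queries it should improve.

    run_battery(queries, catalog) is assumed to return {query_id: judge_score}.
    """
    before = run_battery(affected_queries, catalog)
    after = run_battery(affected_queries, catalog + [new_item])
    return {q: round(after[q] - before[q], 3) for q in affected_queries}
```

Every catalog addition thus ships with its own measured improvement on real queries, not an assumed one.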
## For FCSM Narrative
This addresses the scalability critique head-on. "How do you scale beyond 35 items?"
Answer: the evaluation system tells you exactly which items to build next, ranked by
impact. You don't need to boil the ocean — you need to fill the gaps that matter most,
and the system identifies them automatically with full traceability from query to gap
to fix to measured improvement.
The determinism point is the kicker: every other grounding approach introduces its own
stochastic noise into the system. Pragmatics reduce noise at the grounding layer so
the only remaining variance is in the reasoning — which is where you want flexibility.