Open Census MCP Server

ADR-010-evolutionary-vocabulary.md•5.47 KiB

# ADR-010: Evolutionary Controlled Vocabulary with Provisional Extensions

**Status:** Accepted
**Date:** 2026-02-09
**Traces to:** FR-QE-006, FR-QE-014 (new), ADR-008

---

## Context

During the first real extraction of the CPS Handbook (157 chunks, 493 nodes), the LLM produced 21 nodes with `fact_category` values outside the controlled vocabulary:

- `dissemination` (17 nodes): periodicity, release schedules, publication processes
- `definition` (4 nodes): unemployment criteria (misclassified; should be `ConceptDefinition` nodes)

The `dissemination` category is a legitimate survey lifecycle stage that our vocabulary missed. The LLM correctly identified a gap. The `definition` category is a node-type error, not a vocabulary gap.

This presents a fundamental design tension: static controlled vocabularies are brittle and lose signal, while unconstrained vocabularies produce chaos that breaks downstream harvest queries.

## Decision

Implement an **evolutionary vocabulary** with three tiers:

1. **Core vocabulary**: Established terms validated across multiple documents. Enforced strictly. Listed in `FACT_CATEGORIES`, `DIMENSIONS`, etc.
2. **Provisional extensions**: New terms discovered during extraction. Accepted with warnings (not rejections). Tracked in `VOCABULARY_EXTENSIONS` with metadata: source document, first seen date, occurrence count.
3. **Rejected terms**: Terms reviewed and determined to be errors (misclassification, synonyms of existing terms). Listed in `VOCABULARY_REJECTIONS` with mapping to correct term or correct node type.

**Promotion lifecycle:**

```
LLM generates new term
  → Validation warns (not rejects)
  → Term added to VOCABULARY_EXTENSIONS with count=1
  → Term recurs in subsequent document extractions → count increments
  → After 2+ documents use the term → review for promotion to core
  → If term doesn't recur → review for rejection
```

**Double-regressive trait:** Extensions that don't prove their value across documents get pruned.
Terms that repeatedly appear earn permanent status. The system self-repairs by absorbing legitimate vocabulary and shedding noise, but always with human review at promotion/rejection decision points.

## Rationale

### Why not strict enforcement (reject unknown terms)?

- Loses signal. The LLM found a real category (`dissemination`) that our initial vocabulary missed.
- Forces the LLM into junk-drawer categories (`processing` becomes a dumping ground), destroying the semantic precision harvest queries depend on.
- Assumes the vocabulary designers anticipated all categories before seeing real documents. They didn't and won't.

### Why not permissive (accept anything)?

- Vocabulary explodes. Each extraction invents its own terms.
- Harvest queries can't match on categories because the same concept has 5 synonyms.
- No mechanism to detect and correct errors (like `definition` being a node-type mistake).

### Why evolutionary?

- Mirrors how real taxonomies develop in practice: provisional acceptance, validation through use, promotion or pruning.
- Aligns with the project's "Spider and Starfish" principle: centralized enough to maintain coherence, decentralized enough to adapt to reality.
- Provides audit trail: you can see exactly when a term entered the vocabulary, which documents produced it, and whether it was promoted or rejected.
- Prevents compounding debt: errors caught at extraction time don't propagate into harvest and pack compilation.

### Precedent

The COOS project demonstrated empirically that bottom-up vocabulary discovery (extracting categories from what Census actually publishes) produces better taxonomies than top-down design (deciding what categories should exist). This ADR applies the same principle to the quarry extraction vocabulary.
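
As a rough illustration of the promotion lifecycle from the Decision section, the sketch below sorts provisional terms into two human review queues. It is not the project's implementation. The helper names (`record_occurrence`, `review_queues`) and the `documents` field are assumptions added here so that "2+ documents" can be checked; the ADR's metadata only names `first_seen`, `date`, `count`, and `notes`.

```python
def record_occurrence(extensions, term, doc_id, seen_on):
    """Record one sighting of a provisional term in a given document."""
    entry = extensions.setdefault(term, {
        "first_seen": doc_id,
        "date": seen_on,
        "count": 0,
        "documents": [],  # hypothetical field: distinct documents using the term
        "notes": "",
    })
    entry["count"] += 1
    if doc_id not in entry["documents"]:
        entry["documents"].append(doc_id)
    return entry


def review_queues(extensions, min_documents=2):
    """Split provisional terms into review queues; a human makes the final call."""
    promote, reject = [], []
    for term, entry in sorted(extensions.items()):
        if len(entry["documents"]) >= min_documents:
            promote.append(term)   # recurred across documents
        else:
            reject.append(term)    # has not recurred so far
    return {"review_for_promotion": promote, "review_for_rejection": reject}


# Illustrative data only: placeholder terms and document ids.
fact_category_extensions = {}
record_occurrence(fact_category_extensions, "term_a", "doc_1", "2026-02-09")
record_occurrence(fact_category_extensions, "term_a", "doc_2", "2026-02-10")
record_occurrence(fact_category_extensions, "term_b", "doc_1", "2026-02-09")
queues = review_queues(fact_category_extensions)
```

Neither queue changes the vocabulary by itself. Both only flag terms for the human review that the ADR requires at promotion and rejection points.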
## Consequences

**Positive:**

- Vocabulary evolves with the corpus rather than constraining it
- Clear audit trail for every vocabulary change
- Errors detected and corrected early (validation warnings, not silent acceptance)
- Human remains in the loop for promotion/rejection decisions

**Negative:**

- Slightly more complex validation logic (three-tier check instead of simple list membership)
- Requires periodic vocabulary review after each new document extraction
- Provisional terms may produce inconsistent harvest results until promoted or rejected

## Implementation

### config.py changes:

```python
# Core vocabulary (strict)
FACT_CATEGORIES = ["design", "collection", "weighting", "estimation",
                   "variance", "processing", "adjustment", "dissemination"]

# Provisional extensions (warn, don't reject)
VOCABULARY_EXTENSIONS = {
    "fact_category": {
        # "term": {"first_seen": "doc_id", "date": "YYYY-MM-DD", "count": N, "notes": "..."}
    },
    "dimension": {},
    "value_type": {},
    "assertion_type": {},
}

# Rejected terms (map to correct term or node type)
VOCABULARY_REJECTIONS = {
    "fact_category": {
        "definition": {
            "reason": "Node type error — these are ConceptDefinition nodes, not MethodologicalChoice",
            "action": "reclassify_node_type",
            "target_type": "ConceptDefinition",
            "date": "2026-02-09"
        }
    }
}
```

### Validation behavior:

- Core term → accept silently
- Provisional term → accept with INFO log, increment count
- Rejected term → apply correction mapping, WARN log
- Unknown term → accept with WARNING log, auto-add to provisional extensions

---

*Static taxonomies die. Evolutionary ones adapt. The control is in the lifecycle, not the list.*
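
The four validation outcomes listed under "Validation behavior" could be wired together roughly as below. This is a minimal sketch under stated assumptions, not the project's actual validator. The function name `validate_term`, its return shape, the `CORE_VOCABULARY` lookup, and the logger name are hypothetical; the three config structures are trimmed copies of the ones in this ADR.

```python
import logging
from datetime import date

logger = logging.getLogger("vocabulary")  # hypothetical logger name

# Trimmed copies of the config.py structures from this ADR.
FACT_CATEGORIES = ["design", "collection", "weighting", "estimation",
                   "variance", "processing", "adjustment", "dissemination"]
VOCABULARY_EXTENSIONS = {"fact_category": {}}
VOCABULARY_REJECTIONS = {
    "fact_category": {
        "definition": {"action": "reclassify_node_type",
                       "target_type": "ConceptDefinition"},
    }
}
# Hypothetical lookup from field name to its core list.
CORE_VOCABULARY = {"fact_category": FACT_CATEGORIES}


def validate_term(field, term, doc_id):
    """Return (status, correction); correction is set only for rejected terms."""
    # Core term: accept silently.
    if term in CORE_VOCABULARY.get(field, []):
        return "core", None

    # Provisional term: accept with INFO log, increment count.
    extensions = VOCABULARY_EXTENSIONS.setdefault(field, {})
    if term in extensions:
        extensions[term]["count"] += 1
        logger.info("provisional %s=%r (count=%d)",
                    field, term, extensions[term]["count"])
        return "provisional", None

    # Rejected term: hand back the correction mapping, WARNING log.
    rejected = VOCABULARY_REJECTIONS.get(field, {})
    if term in rejected:
        logger.warning("rejected %s=%r, action=%s",
                       field, term, rejected[term]["action"])
        return "rejected", rejected[term]

    # Unknown term: accept with WARNING log, auto-add to provisional extensions.
    extensions[term] = {"first_seen": doc_id,
                        "date": date.today().isoformat(),
                        "count": 1, "notes": ""}
    logger.warning("unknown %s=%r added to provisional extensions", field, term)
    return "unknown", None
```

Returning the correction mapping instead of applying it keeps the sketch independent of how nodes are represented; the caller would decide how to act on `reclassify_node_type`.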
