Open Census MCP Server

phase3-sys-arch-ontologies.md•25 KiB

# Phase 3 System Architecture - Official Statistical Ontologies Integration ## Core Concept: Leverage Official Statistical Ontologies **Our Value-Add:** Human language complexity translation using authoritative statistical ontologies **Official Ontologies (Domain Expert Maintained):** - **COOS (Census and Opinion Ontology for Statistics)** - Census Bureau's official statistical ontology - **Census Address Ontology** - Official geographic relationship ontology - **STATO (Statistical Methods Ontology)** - Peer-reviewed statistical methodology ontology **tidycensus (Kyle Walker's Domain):** Census API complexity - FIPS codes, API endpoints, MOE calculations, data formatting --- ## Human Language Complexity Examples ### Geographic Complexity Translation - **"the northeast"** → COOS geographic regions → 6 specific states: CT, ME, MA, NH, RI, VT - **"rural areas"** → Census Address Ontology → urban-rural classification codes + geographic filtering - **"major cities"** → Official geographic hierarchy → population threshold + administrative level decision - **"Austin"** → Geographic disambiguation using official place classification ### Variable Complexity Translation - **"teacher salaries"** → COOS concept resolution → occupation-specific routing → BLS not Census - **"income"** → STATO methodology → median not mean + proper universe + statistical caveats - **"poverty"** → COOS poverty concepts → official poverty measure + threshold definition + exclusions ### Statistical Complexity Translation - **"average"** → STATO reasoning → median for skewed distributions, mean for normal distributions - **"compare"** → Official statistical methods → proper geographic resolution + sample size adequacy - **"rate"** → STATO rate methodology → proper denominator + universe definition + reliability checks --- ```mermaid graph TB subgraph "User Layer" U[User Query: "How much do teachers make in Austin?"] CD[Claude Desktop] U --> CD end subgraph "MCP Protocol Layer" CD --> MCP[MCP Server Entry Point] end subgraph "Intelligence Layer - Phase 3 Enhanced" MCP --> QP[Query Parser & Router] QP --> SI[Semantic Index ⚡ <100ms Core Queries] QP --> KB[Knowledge Base 📚 RAG Vector Search] SI --> SM[Static Mappings 🎯 Power Law Variables] SI --> FC[Fuzzy Concept Matcher 🔍 Alias Expansion] KB --> VDB[Vector Database ChromaDB + Sentence Transformers] KB --> DOC[R Documentation Corpus Census Methodology] end subgraph "Data Retrieval Layer" SM --> RE[R Engine tidycensus Integration] FC --> RE KB --> RE RE --> GP[Geography Parser Location → FIPS Codes] RE --> VM[Variable Mapper Concepts → Census Variables] RE --> TC[tidycensus Core R Subprocess] end subgraph "External APIs" TC --> CAPI[Census Bureau APIs ACS/Decennial Data] TC --> TIGER[TIGER Geographic Data Shapefiles & Boundaries] end subgraph "Response Layer" RE --> SP[Statistical Processor MOE Calculations & Validation] SP --> RF[Response Formatter Context + Methodology Notes] RF --> MCP end style SI fill:#e1f5fe,stroke:#01579b,stroke-width:3px style SM fill:#f3e5f5,stroke:#4a148c,stroke-width:2px style FC fill:#fff3e0,stroke:#e65100,stroke-width:2px style RE fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px ``` ## Geographic Intelligence Translation Architecture ```mermaid graph LR subgraph "Human Geographic Concepts" HG1["the northeast"] HG2["rural counties"] HG3["Harris County"] HG4["major cities"] HG5["which state has highest..."] end subgraph "Geography Translator Engine" HG1 --> GT1[Regional Mapper Northeast → CT,ME,MA,NH,RI,VT] HG2 --> GT2[Classification Mapper Rural → NCHS urban-rural codes] HG3 --> GT3[Disambiguation Engine Harris County → Harris County, Texas] HG4 --> GT4[Hierarchy Selector Major cities → population threshold + geography level] HG5 --> GT5[Comparison Router National comparison → all states analysis] end subgraph "tidycensus-Compatible Output" GT1 --> TC1[geography='state' state=c('CT','ME','MA','NH','RI','VT')] GT2 --> TC2[geography='county' + rural filter logic] GT3 --> TC3[geography='county' state='TX', county='Harris'] GT4 --> TC4[geography='place' + population threshold filter] GT5 --> TC5[geography='state' state=NULL (all states)] end style GT1 fill:#e1f5fe,stroke:#01579b,stroke-width:2px style GT2 fill:#f3e5f5,stroke:#4a148c,stroke-width:2px style GT3 fill:#fff3e0,stroke:#e65100,stroke-width:2px style GT4 fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px style GT5 fill:#fce4ec,stroke:#880e4f,stroke-width:2px ``` ## The 4 Essential Capabilities (Not Individual Tools) ### 1. Demography - Variable Intelligence Translation ```mermaid graph LR D1["teacher salary"] --> DT1[Domain Router → BLS not Census] D2["median income"] --> DT2[Variable Mapper → B19013_001 + why median] D3["poverty rate"] --> DT3[Concept Definer → B17001_002 + universe] D4["average income"] --> DT4[Statistical Advisor → Use median for income] style DT1 fill:#e1f5fe style DT2 fill:#f3e5f5 style DT3 fill:#fff3e0 style DT4 fill:#e8f5e8 ``` ### 2. Geography - Spatial Intelligence Translation ```mermaid graph LR G1["the northeast"] --> GT1[Regional Resolver → Multi-state analysis] G2["rural counties"] --> GT2[Classification Filter → Geographic filtering] G3["Harris County"] --> GT3[Disambiguator → Harris County, Texas] G4["which state highest"] --> GT4[Comparison Router → National analysis] style GT1 fill:#e1f5fe style GT2 fill:#f3e5f5 style GT3 fill:#fff3e0 style GT4 fill:#e8f5e8 ``` ### 3. Statistics - Methodological Intelligence ```mermaid graph LR S1[Margin of Error] --> ST1[Interpretation Engine Confidence intervals] S2[Sample Size] --> ST2[Reliability Checker Adequate/inadequate] S3[Median vs Mean] --> ST3[Measure Selector Appropriate statistic] S4[Statistical Validity] --> ST4[Quality Controller Suppression rules] style ST1 fill:#e1f5fe style ST2 fill:#f3e5f5 style ST3 fill:#fff3e0 style ST4 fill:#e8f5e8 ``` ### 4. Statistical Reasoning - Domain Intelligence ```mermaid graph LR R1["What is average teacher salary?"] --> RT1[Context Provider US average + BLS guidance + suggest location specificity] R2[Data Source Routing] --> RT2[Agency Router Census vs BLS vs Other] R3[Limitation Explanation] --> RT3[Scope Clarifier What we can/cannot answer] R4[Question Improvement] --> RT4[Query Enhancer Guide to better questions] style RT1 fill:#e1f5fe style RT2 fill:#f3e5f5 style RT3 fill:#fff3e0 style RT4 fill:#e8f5e8 ``` ## LLM-Powered Automated Mapping Pipeline ### Automated Concept Mapping Strategy **O3's Manual Assumption:** 200 concepts × manual analyst work = hundreds of hours **Our LLM Reality:** 200 concepts × automated processing = hours of compute + selective expert review ```mermaid graph TB subgraph "Automated Mapping Pipeline" COOS[COOS Concepts ~200 statistical concepts] CENSUS[Census Variables 28,000+ ACS variables] COOS --> LLM[LLM Concept Mapper Bulk automated processing] CENSUS --> LLM LLM --> CONF[Confidence Scoring Statistical validation] CONF --> HIGH[High Confidence ≥95% Auto-accept mappings] CONF --> MED[Medium Confidence 70-95% Expert review queue] CONF --> LOW[Low Confidence <70% Research/flag for improvement] HIGH --> VALID[Validated Mappings Authoritative concept→variable] MED --> EXPERT[Expert Review Domain specialist validation] EXPERT --> VALID LOW --> RESEARCH[Additional Research Graph relationship discovery] RESEARCH --> EXPERT end subgraph "Graph Relationship Discovery" VALID --> NEO4J[Neo4j Graph Database Concept relationship mapping] NEO4J --> CLUSTER[Concept Clustering Find related statistical concepts] CLUSTER --> EXPAND[Mapping Expansion Use relationships to fill gaps] end subgraph "Quality Assurance" VALID --> QA[Quality Metrics Precision/recall scoring] QA --> PROVENANCE[Provenance Metadata Who mapped, when, confidence] PROVENANCE --> PUBLISH[Published Mappings Community validation] end style LLM fill:#e1f5fe,stroke:#01579b,stroke-width:4px style CONF fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px style NEO4J fill:#fff3e0,stroke:#e65100,stroke-width:2px style PROVENANCE fill:#f3e5f5,stroke:#4a148c,stroke-width:2px ``` ## Smart Deduplication & Scalable Mapping Strategy ### Variable Deduplication Reality Check **O3's Cost Concern:** 28k variables × $0.01 = $280 (overestimated) **Our Reality:** 28k variables → ~2k unique concepts × $0.01 = $20-30 total #### Census Variable Structure Analysis ```python # Most Census variables are demographic/geographic splits of same concepts VARIABLE_PATTERNS = { "B19013": { # Median household income "base_concept": "median_household_income", "total": "B19013_001", # All households "by_race": ["B19013A_001", "B19013B_001", "B19013C_001", ...], # 9 variants "by_age": ["B19013_002", "B19013_003", ...], # Age brackets # 20+ variables, 1 statistical concept }, "B17001": { # Poverty status "base_concept": "poverty_status", "variants": ["B17001_001", "B17001_002", ...], # 59+ variants # All represent same concept: poverty threshold comparison } } # Deduplication impact: 28,000 variables → ~2,000 unique statistical concepts ``` #### Hierarchical Mapping Strategy ```python class ScalableConceptMapper: """Map concepts once, expand to all variables programmatically""" def __init__(self): self.variable_families = self._group_by_statistical_concept() self.unique_concepts = self._extract_unique_concepts() # ~2k concepts def map_concepts_efficiently(self): """Two-phase mapping: concepts first, expansion second""" # Phase 1: Map unique concepts only (LLM cost: ~$20-30) concept_mappings = self._map_unique_concepts_with_llm() # Phase 2: Programmatic expansion to all variables (cost: $0) full_mappings = self._expand_concepts_to_variables(concept_mappings) return full_mappings def _group_by_statistical_concept(self) -> Dict: """Group 28k variables by underlying statistical concept""" families = {} for var_id, metadata in self.census_variables.items(): # Extract base statistical concept (ignore demographic splits) concept_key = self._normalize_concept(metadata['concept']) if concept_key not in families: families[concept_key] = { "representative_variable": var_id, "concept_definition": metadata['concept'], "all_variables": [], "demographic_splits": [] } families[concept_key]["all_variables"].append(var_id) # Track demographic patterns for expansion if "_" in var_id: # Has demographic suffix families[concept_key]["demographic_splits"].append(var_id) return families def _map_unique_concepts_with_llm(self) -> Dict: """LLM mapping for ~2k unique concepts, not 28k variables""" mappings = {} for concept_key, family in self.variable_families.items(): # Map the statistical concept once coos_mapping = self._llm_map_concept( concept=family["concept_definition"], representative_var=family["representative_variable"] ) mappings[concept_key] = { **coos_mapping, "expansion_pattern": family["all_variables"], "demographic_variants": family["demographic_splits"] } return mappings def _expand_concepts_to_variables(self, concept_mappings: Dict) -> Dict: """Expand concept mappings to all 28k variables programmatically""" full_mappings = {} for concept_key, mapping in concept_mappings.items(): # Map all variables in this family to same COOS concept for var_id in mapping["expansion_pattern"]: full_mappings[var_id] = { "coos_concept": mapping["coos_concept"], "statistical_type": mapping["statistical_type"], "base_concept": concept_key, "is_demographic_variant": var_id in mapping["demographic_variants"], "confidence": mapping["confidence"], "provenance": { **mapping["provenance"], "expansion_method": "programmatic_from_base_concept" } } return full_mappings ``` ### Cost-Efficient Processing Pipeline ### Actual Token Cost Analysis (2024 Pricing) #### Real-World Cost Calculation ```python # Typical concept mapping call TOKENS_PER_CALL = { "input": 500, # Concept + candidate variables "output": 200, # JSON mapping response "total": 700 } # Actual GPT pricing (December 2024) GPT_PRICING = { "gpt_4_1_nano": { "input": 0.100 / 1_000_000, # $0.100 per 1M tokens "output": 0.400 / 1_000_000, # $0.400 per 1M tokens "cost_per_call": (500 * 0.100 + 200 * 0.400) / 1_000_000, # $0.00013 }, "gpt_4_1_mini": { "input": 0.400 / 1_000_000, # $0.400 per 1M tokens "output": 1.600 / 1_000_000, # $1.600 per 1M tokens "cost_per_call": (500 * 0.400 + 200 * 1.600) / 1_000_000, # $0.00052 }, "gpt_4_1_full": { "input": 2.000 / 1_000_000, # $2.000 per 1M tokens "output": 8.000 / 1_000_000, # $8.000 per 1M tokens "cost_per_call": (500 * 2.000 + 200 * 8.000) / 1_000_000, # $0.0026 } } # Comprehensive Census mapping costs CENSUS_MAPPING_COSTS = { "unique_concepts": 2000, "gpt_4_1_nano": 2000 * 0.00013, # $0.26 total "gpt_4_1_mini": 2000 * 0.00052, # $1.04 total "gpt_4_1_full": 2000 * 0.0026, # $5.20 total } ``` #### Previous Cost Estimates vs Reality ```python COST_COMPARISON = { "o3_estimate": "$10-20", "our_estimate": "$20-30", "actual_cost_nano": "$0.26", "actual_cost_mini": "$1.04", "actual_cost_full": "$5.20", "overestimate_factor": "5-100x too high" } ``` #### Federal Statistical System Comprehensive Mapping ```python FEDERAL_SYSTEM_COSTS = { "all_agencies": { "estimated_concepts": 5000, "gpt_4_1_nano_cost": "$0.65", "gpt_4_1_mini_cost": "$2.60", "gpt_4_1_full_cost": "$13.00" }, "annual_maintenance": { "concept_updates": 500, # 10% annual change "update_cost": "$0.26-$1.30", "total_annual_cost": "<$15" } } ``` ### Economic Reality: Cost is Not a Factor #### Strategic Implications - **Complete Census mapping:** Price of a coffee ($1-5) - **Entire federal statistical system:** Price of lunch ($2-15) - **Annual maintenance:** Negligible operational cost - **Quality vs cost trade-off:** Use GPT-4.1 full for highest accuracy at $5 total #### Budget Allocation Strategy ```python BUDGET_ALLOCATION = { "llm_mapping_costs": "$5-15", # Trivial "expert_validation": "$500-2000", # 10-40 hours @ $50/hr "software_development": "$5000-15000", # Developer time "infrastructure": "$100-500/month", # Container hosting "total_project_cost": "$6000-18000", "llm_percentage": "0.1% of total budget" } ``` **LLM costs are literally a rounding error.** Focus shifts entirely to quality and expert validation, not budget constraints. #### Smart Pre-filtering Enhancement ```python def enhanced_prefiltering(self, concept: Dict) -> List[str]: """Multi-stage filtering to minimize LLM token usage""" # Stage 1: Concept family matching (free) concept_family = self._identify_concept_family(concept) candidate_families = self._get_related_families(concept_family) # Stage 2: String similarity within families (free) scored_concepts = [] for family in candidate_families: similarity = self._calculate_concept_similarity(concept, family) if similarity > 0.7: scored_concepts.append((family, similarity)) # Stage 3: Send only top 3 concept families to LLM top_candidates = sorted(scored_concepts, reverse=True)[:3] # Result: 80-90% token reduction from smart filtering return [family for family, score in top_candidates] ``` ### Performance Benchmarking Framework #### Concrete Performance Targets (O3's Specification) ```python # pytest-benchmark scaffold for falsifiable performance testing import pytest from pytest_benchmark import BenchmarkFixture class PerformanceTargets: """Concrete benchmarks per O3's specification""" DATASET_SIZE = 1000 # COOS→Census mappings PLATFORM = "M2 laptop, Python 3.11, uvicorn async" TARGET_COLD_CACHE = 2.0 # ms TARGET_WARM_CACHE = 0.3 # ms @pytest.mark.benchmark def test_concept_resolution_performance(benchmark: BenchmarkFixture): """P95 latency for /resolve?concept=household_income""" ontology = OntologyLoader() ontology.load_mappings(size=1000) # Full dataset def resolve_concept(): return ontology.resolve_concept("household_income") result = benchmark(resolve_concept) # P95 latency assertion assert result.stats.percentiles[95] <= 2.0 # 2ms target @pytest.mark.benchmark def test_warm_cache_performance(benchmark: BenchmarkFixture): """Warm cache performance target""" ontology = OntologyLoader() ontology.resolve_concept("household_income") # Prime cache def resolve_cached(): return ontology.resolve_concept("household_income") result = benchmark(resolve_cached) assert result.stats.mean <= 0.3 # 0.3ms target ``` ### Ontology Scope Decisions (O3's Recommendations) #### STATO Scope Clarification ```python # O3: "STATO is for methods, not subjects" - Decision needed STATO_SCOPE_DECISION = "OUT_OF_SPRINT_3" # Explicit decision # Sprint 3: Focus on COOS concepts → Census variables only # Sprint 4: Add STATO methodology metadata to validated mappings SPRINT_3_ONTOLOGIES = { "coos": { "scope": "statistical_concepts", "purpose": "concept_to_variable_mapping", "priority": "core" }, "stato": { "scope": "statistical_methods", "purpose": "methodology_guidance", "priority": "sprint_4" # Deferred }, "address": { "scope": "geographic_primitives", "purpose": "regional_translation", "priority": "micro_ontology" # Hand-coded essentials } } ``` #### Geographic Micro-Ontology (O3's DIY Recommendation) ```python # Skip PDF extraction, hand-code the dozen primitives we need GEOGRAPHIC_MICRO_ONTOLOGY = { "regions": { "northeast": ["CT", "ME", "MA", "NH", "RI", "VT"], "southeast": ["AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "VA", "WV"], "midwest": ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"], "west": ["AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY"] }, "classifications": { "rural": {"nchs_code": "<3", "geography": "county"}, "suburban": {"nchs_code": "3-4", "geography": "county"}, "urban": {"nchs_code": ">4", "geography": "county"}, "major_cities": {"population": ">100000", "geography": "place"} }, "administrative": { "levels": ["us", "state", "county", "place", "tract"], "cbsa_support": True, "zip_support": False # Explicit limitation } } ``` ### Authoritative Mapping Examples #### COOS Concepts → Census Variables ```json { "coos:MedianHouseholdIncome": { "census_variables": ["B19013_001"], "universe": "Households", "statistical_method": "median", "stato_methodology": "stato:MedianCalculation", "reliability_notes": "Available for geographies with 65+ households", "why_median": "Income distributions are right-skewed; median more representative than mean", "validation_status": "expert_reviewed" }, "coos:PovertyRate": { "census_variables": { "numerator": "B17001_002", "denominator": "B17001_001" }, "calculation": "B17001_002 / B17001_001 * 100", "statistical_method": "rate", "stato_methodology": "stato:RateCalculation", "universe": "Population for whom poverty status is determined", "exclusions": "Institutionalized population, military group quarters", "validation_status": "peer_reviewed" }, "coos:TeacherSalary": { "census_availability": false, "recommended_source": "BLS", "bls_classification": "SOC 25-2000", "reasoning": "Census lacks occupation-specific salary detail", "coos_classification": "coos:OccupationSpecificIncome", "routing_rule": "occupation_specific → BLS_OES", "validation_status": "expert_confirmed" } } ``` #### STATO Methods → Census Implementation ```json { "stato:MedianCalculation": { "when_to_use": "Skewed distributions (income, home values, rent)", "census_implementation": "Pre-calculated in B-tables", "advantages": "Robust to outliers, interpretable (50th percentile)", "census_variables_using_median": ["B19013_001", "B25077_001", "B25064_001"], "alternative_methods": { "mean": "Available in some C-tables, sensitive to outliers", "geometric_mean": "Rare, used for rates and ratios" } }, "stato:RateCalculation": { "definition": "Part/whole relationship expressed as percentage", "census_pattern": "Detail table variables / summary table totals", "margin_of_error": "Use ratio estimation MOE formulas", "reliability_threshold": "Numerator ≥20 cases for publication" } } ``` ## Official Ontology Data Platform ### Authoritative Knowledge Sources ```yaml # knowledge-base/scripts/config.yaml official_ontologies: coos: name: "Census and Opinion Ontology for Statistics" source: "https://linked-statistics.github.io/COOS/coos.html" format: "RDF/OWL" maintainer: "Census Bureau + Academic Partners" description: "Official statistical concepts and variable relationships" census_address: name: "Census Address Ontology" source: "https://www2.census.gov/geo/pdfs/partnerships/data_guidelines/Census_Address_Ontology.pdf" format: "RDF/OWL" maintainer: "U.S. Census Bureau" description: "Official geographic relationships and hierarchies" stato: name: "Statistical Methods Ontology" source: "https://bioportal.bioontology.org/ontologies/STATO" format: "RDF/OWL" maintainer: "International Statistics Community" description: "Peer-reviewed statistical methodology standards" implementation_sources: tidycensus: variables_api: "https://api.census.gov/data/{year}/{survey}/variables.json" description: "Variable implementation mappings (concept → API variable)" bls: soc_codes: "https://www.bls.gov/soc/" description: "Occupation classification routing" ``` ### Ontology Integration Pipeline ```mermaid graph LR subgraph "Official Ontology Sources" COOS_RDF[COOS Ontology RDF/OWL Format Statistical Concepts] ADDR_RDF[Address Ontology RDF/OWL Format Geographic Relationships] STATO_RDF[STATO Ontology RDF/OWL Format Statistical Methods] end subgraph "Ontology Processing Pipeline" COOS_RDF --> PARSE[parse-ontologies.py RDF → JSON processing] ADDR_RDF --> PARSE STATO_RDF --> PARSE PARSE --> CONCEPT[concept-mapper.py Ontology → Implementation] CONCEPT --> TIDYC[tidycensus Variables API Implementation mappings] CONCEPT --> BLS[BLS Classifications Routing rules] end subgraph "Runtime Optimization" CONCEPT --> FAST[data/ontology/ Fast lookup structures] TIDYC --> FAST BLS --> FAST FAST --> JSON[concept_resolution.json Ontology-guided lookups] FAST --> SQLITE[ontology_reasoning.db Complex relationship queries] end style COOS_RDF fill:#e1f5fe,stroke:#01579b,stroke-width:3px style ADDR_RDF fill:#f3e5f5,stroke:#4a148c,stroke-width:3px style STATO_RDF fill:#fff3e0,stroke:#e65100,stroke-width:3px style PARSE fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px ```

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/brockwebb/open-census-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

phase3-sys-arch-ontologies.md•25 KiB