# Design Document: Raw Knowledge Graph Schema for Pragmatics Extraction
*Version: 3.1 — 2026-02-09*
*Status: Approved for Phase 1 implementation (post bug-fix review)*
*Review History: Five external reviews (Gemini Pro, GPT-5.2, Kimi-K2.5, Grok-4, GPT-5.2 round 3). Bug fixes applied from GPT-5.2 structural review (round 3).*
---
## 1. Purpose & Governing Analogy
The raw knowledge graph is a **clinical assessment system** for statistical data fitness-for-use.
| Medical Analogy | Census MCP Equivalent |
|----------------|----------------------|
| Doctor's medical training | LLM's statistical knowledge (the 90%) |
| Lab tests, patient history, vitals | Extracted `MethodologicalChoice` + `QualityAttribute` (raw KG) |
| Fitness standards by military occupational specialty | `AnalysisTask` + `REQUIRES` edges (what does this job demand?) |
| Fit/unfit determination with conditions | Harvested `ContextItem` (the pragmatic judgment) |
| The patient | The dataset the user wants to use |
**The KG does NOT contain diagnoses.** It contains the test results and the standards. Diagnoses (fitness implications) are derived by matching results against standards — via Cypher pattern-matching, not LLM inference.
**The question this KG answers:** "Given what we know about this survey's procedures and their measurable quality consequences, which analytic tasks are compromised, and how?"
**Immediate scope:** CPS-ACS income comparability. 3-5 source documents.
---
## 2. Architecture Overview
```
┌─────────────────────────────────────────────────┐
│ Layer 0: SEED KNOWLEDGE (human expert, once) │
│ AnalysisTask nodes + REQUIRES edges │
│ ~20-30 rules encoding "what does good look │
│ like for this analytic task?" │
│ Templates: threshold + condition + │
│ violation_template + recommended_action │
└──────────────────────┬──────────────────────────┘
│
┌──────────────────────▼──────────────────────────┐
│ Layer 1: EXTRACTION (LLM, per document) │
│ MethodologicalChoice + QualityAttribute │
│ Factual only. No implications. │
│ "What does this survey do, and what are the │
│ measurable quality consequences?" │
└──────────────────────┬──────────────────────────┘
│
┌──────────────────────▼──────────────────────────┐
│ Layer 2: HARVEST (Cypher, automated) │
│ Pattern-match: REQUIRES vs PRODUCES │
│ Instantiate violation templates │
│ Flag unanticipated interactions │
│ Output: candidate ContextItems │
└──────────────────────┬──────────────────────────┘
│
┌──────────────────────▼──────────────────────────┐
│ Layer 3: CURATION (human, per harvest batch) │
│ Validate instantiated warnings │
│ Adjust severity, scope, wording │
│ Approve for inclusion in pragmatics DB │
│ Expert time: 10-15 min per batch (not hours) │
└──────────────────────┬──────────────────────────┘
│
┌──────────────────────▼──────────────────────────┐
│ Layer 4: EXPORT & COMPILE (automated) │
│ neo4j_to_staging.py → staging JSON │
│ compile_pack.py → SQLite packs (shipped) │
└─────────────────────────────────────────────────┘
```
### Work Distribution
| Activity | Who | Frequency | Effort |
|----------|-----|-----------|--------|
| Define AnalysisTask + REQUIRES patterns | SME | Once per task type | High initial, then stable |
| Extract MethodologicalChoice from docs | LLM + human verification | Per document | Low (factual extraction) |
| Run harvest queries | Automated | Per curation cycle | Near-zero |
| Validate ContextItem instantiations | SME or trained reviewer | Per harvest batch | Medium (validation, not research) |
| Handle flagged unanticipated interactions | SME | As flagged | Targeted, rare |
---
## 3. Node Types (Phase 1: 13 types)
### 3.1 Layer 0: Seed Knowledge (Expert-Curated)
| Node Label | Description | Example | Seeded By |
|------------|-------------|---------|-----------|
| `AnalysisTask` | A category of analytic work a user might perform | "EstimateChangeOverTime", "CrossSurveyComparison", "SmallAreaEstimation" | Expert (one-time) |
| `CanonicalConcept` | Abstract concept that multiple surveys operationalize differently | "Household Income", "Employment" | Expert (one-time) |
**AnalysisTask Properties:**
```
name: str # e.g., "EstimateChangeOverTime"
description: str # What this task involves
typical_use_cases: [str] # Concrete examples
critical_quality_dimensions: [str] # What quality attributes matter
```
**CanonicalConcept Properties:**
```
name: str # e.g., "Household Income"
canonical_definition: str|null # Provisional; sourced to harmonization effort
```
### 3.2 Layer 1: Extracted Knowledge (LLM from Documents)
| Node Label | Description | Example |
|------------|-------------|---------|
| `MethodologicalChoice` | A specific decision or procedure within a survey process | "ACS 5-year estimates pool 60 months of data" |
| `QualityAttribute` | A measurable quality consequence of a methodological choice | "Consecutive 5-year estimates share 48 of 60 months (overlap = 0.8)" |
| `DataProduct` | A specific survey, release, or data product | "CPS ASEC", "ACS 5-Year" |
| `SurveyProcess` | A stage in the survey pipeline | "Sampling", "Weighting", "Estimation" |
| `UniverseDefinition` | Who/what is in scope | "Civilian noninstitutional population aged 16+" |
| `ConceptDefinition` | How a survey operationalizes a concept | "CPS ASEC: total money income = sum of 8 components for previous calendar year" |
| `Threshold` | Specific numeric boundary with operational consequences | "ACS 1-year requires 65,000+ population" |
| `TemporalEvent` | Dated methodology change | "2014 CPS ASEC redesign: income questions changed (split-panel test)" |
| `QualityCaveat` | Known quality issue or limitation | "CPS ASEC income nonresponse ~15%, hot-deck imputed" |
| `SourceDocument` | A source document in the corpus | See §5 |
### 3.3 Layer 2: Derived Knowledge (Harvest Output)
| Node Label | Description | Example |
|------------|-------------|---------|
| `ContextItem` | Materialized fitness-for-use judgment, derived by harvest | "Don't compare consecutive ACS 5-year estimates — they share 4/5 of the data" |
**ContextItem** is NOT extracted. It is produced by Cypher pattern-matching (Layer 2) and validated by humans (Layer 3). It carries full derivation provenance.
### 3.4 Required Properties
**All Extracted Nodes (Layer 1):**
```
# Per-node (minimal):
assertion_type: str # "verbatim" | "paraphrased"
# All other provenance lives on the SOURCED_FROM edge (see §4.6)
```
Note: `assertion_type` no longer includes "inferred" — Layer 1 is factual extraction only. Inference happens in Layer 2 via graph traversal.
**MethodologicalChoice:**
```
fact_category: str # CONTROLLED: "design" | "collection" | "weighting" |
# "estimation" | "variance" | "processing" | "adjustment"
survey: str # "acs" | "cps" | "decennial" | "general"
valid_from: str|null # ISO date (YYYY-MM-DD) when this choice took effect. Use Neo4j date() in queries.
valid_until: str|null # ISO date (YYYY-MM-DD) when superseded. null = current/ongoing.
```
**QualityAttribute:**
```
name: str # Metric identifier within dimension
# e.g., "overlap_fraction", "imputation_rate", "effective_sample_size"
dimension: str # STABLE JOIN KEY for REQUIRES matching
# e.g., "temporal_comparability", "precision", "coverage",
# "definitional_alignment"
value_type: str # "fraction" | "count" | "boolean" | "categorical"
value_number: float|null # Numeric value (use when value_type is fraction or count)
value_string: str|null # Categorical/text value (use when value_type is boolean or categorical)
unit: str|null # e.g., "months", "persons", "percent"
# NOTE: fractions must be 0-1, not percentages. Enforce in extraction prompt.
```
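The typing rules above are mechanical enough to lint before load. A minimal sketch in Python (the helper name `validate_quality_attribute` is illustrative, not part of the pipeline) that enforces the `value_type`/value pairing and the fraction-range rule:

```python
# Minimal validation sketch for QualityAttribute typing rules.
# Hypothetical helper, not a shipped pipeline component.

VALID_VALUE_TYPES = {"fraction", "count", "boolean", "categorical"}

def validate_quality_attribute(qa: dict) -> list:
    """Return a list of violation messages (empty list = valid)."""
    errors = []
    vt = qa.get("value_type")
    if vt not in VALID_VALUE_TYPES:
        errors.append(f"unknown value_type: {vt!r}")
    if vt in ("fraction", "count"):
        if qa.get("value_number") is None:
            errors.append(f"{vt} requires value_number")
        elif vt == "fraction" and not 0.0 <= qa["value_number"] <= 1.0:
            # Fractions must be 0-1, not percentages (see NOTE above).
            errors.append(f"fraction out of range: {qa['value_number']}")
    if vt in ("boolean", "categorical") and qa.get("value_string") is None:
        errors.append(f"{vt} requires value_string")
    return errors
```

A percentage slipping through as `value_number: 80` is caught here rather than silently breaking the `>= threshold_number` comparison at harvest time.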
**ConceptDefinition:**
```
reference_period: str # e.g., "previous calendar year", "past 12 months"
unit_of_analysis: str # e.g., "person", "household", "family"
granularity: str # "composite" | "component"
survey: str # "acs" | "cps" etc.
```
**QualityCaveat (optional for clustering):**
```
tse_type: str|null # OPTIONAL: "coverage" | "sampling" | "nonresponse" |
# "measurement" | "processing" | null
```
**Threshold:**
```
measure: str # What's being thresholded
value: str              # The threshold value, kept as a string so units/ranges survive; parse per measure
operator: str # "gte" | "lte" | "eq"
scope_notes: str|null # Geographic/subgroup/temporal limitations
```
**ContextItem (Layer 2 output):**
```
derived_from_task: str # AnalysisTask name
derived_from_choices: [str] # MethodologicalChoice IDs
instantiated_warning: str # Filled template text
instantiated_recommendation: str
validation_status: str # "pending" | "approved" | "rejected"
derivation_confidence: str # "high" (direct pattern match) | "medium" (flagged interaction)
human_reviewer_notes: str|null
```
---
## 4. Relationship Types (Phase 1: 16 types)
### 4.1 Structural Relationships
| Relationship | From → To | Meaning |
|-------------|-----------|---------|
| `PART_OF` | MethodologicalChoice → SurveyProcess | This choice is part of this pipeline stage |
| `IMPLEMENTS` | DataProduct → SurveyProcess | Informational grouping only. Do NOT use for harvest joins — use APPLIES_TO instead. |
| `APPLIES_TO` | MethodologicalChoice → DataProduct | **Product-scoping edge.** This choice applies to this specific product. Required properties: `valid_from` (ISO date or null), `valid_until` (ISO date or null). **Required on all extracted MethodologicalChoice nodes.** |
| `DEFINED_FOR` | ConceptDefinition → DataProduct | Links concept definitions to specific products. Required for cross-survey comparison queries. |
| `OPERATIONALIZES` | ConceptDefinition → CanonicalConcept | This survey defines this concept this way |
| `TARGETS` | DataProduct → UniverseDefinition | This product covers this population |
| `SOURCED_FROM` | any → SourceDocument | Provenance link |
> ⚠️ **Design Note (Bug Fix #1):** Generic `SurveyProcess` nodes are shared across products.
> Harvest queries MUST join through `APPLIES_TO` for product-scoped results,
> NEVER through `IMPLEMENTS → SurveyProcess ← PART_OF`.
### 4.2 Quality/Consequence Relationships
| Relationship | From → To | Meaning | Required Properties |
|-------------|-----------|---------|---------------------|
| `PRODUCES` | MethodologicalChoice → QualityAttribute | This procedure creates this quality consequence | `mechanism` (how) |
| `REQUIRES` | AnalysisTask → QualityAttribute | This task needs this quality dimension to meet a standard | `threshold`, `condition`, `violation_severity`, `violation_template`, `recommended_action` |
| `TRADES_OFF_WITH` | QualityAttribute → QualityAttribute | Improving one degrades the other | `mechanism` |
| `MITIGATES` | MethodologicalChoice → QualityCaveat | This procedure reduces this quality issue | `mechanism` |
| `QUALIFIES` | QualityCaveat → DataProduct | This caveat affects this product | |
### 4.3 Temporal/Evolution Relationships
| Relationship | From → To | Meaning | Required Properties |
|-------------|-----------|---------|---------------------|
| `SUPERSEDES` | TemporalEvent → MethodologicalChoice | This change replaced this procedure | |
| `CONSTRAINS` | Threshold → DataProduct | This threshold limits this product | |
| `CONFOUNDS` | any → any | These interact to create compounding effects | `mechanism`, `evidence_basis`, `interaction_type` (CONTROLLED: "bias_interaction" \| "variance_interaction" \| "comparability_break" \| "coverage_interaction") |
### 4.4 Derivation Relationship
| Relationship | From → To | Meaning |
|-------------|-----------|---------|
| `DERIVED_FROM` | ContextItem → MethodologicalChoice[] | Reasoning chain from facts to judgment |
### 4.5 The REQUIRES Edge: Where Expert Knowledge Lives
This is the most important relationship in the schema. It encodes reusable expert judgment as structured rules:
```cypher
// Numeric threshold rule
CREATE (task:AnalysisTask {name: "EstimateChangeOverTime"})
CREATE (qa:QualityAttribute {name: "overlap_fraction", dimension: "temporal_comparability"})
CREATE (task)-[r:REQUIRES {
rule_type: "numeric_threshold", // CONTROLLED: "numeric_threshold" | "categorical_match" |
// "categorical_mismatch" | "boolean_required"
threshold_number: 0.2, // For numeric rules
threshold_string: null, // For categorical rules
condition: "data_overlap_between_consecutive_periods",
violation_severity: "critical",
violation_template: "Consecutive estimates share {value} of underlying data. Standard change estimates are unreliable.",
recommended_action: "Use non-overlapping periods or apply published variance correction factors."
}]->(qa)
// Categorical match rule
CREATE (task2:AnalysisTask {name: "CrossSurveyComparison"})
CREATE (qa2:QualityAttribute {name: "reference_period_alignment", dimension: "definitional_alignment"})
CREATE (task2)-[r2:REQUIRES {
rule_type: "categorical_match",
threshold_number: null,
threshold_string: "matching",
condition: "reference_periods_must_align_across_products",
violation_severity: "high",
violation_template: "Reference periods differ: {survey1} uses {period1} while {survey2} uses {period2}. Direct comparison produces systematic bias.",
recommended_action: "Restrict to overlapping reference windows or apply published reconciliation factors."
}]->(qa2)
```
**Harvest queries are dispatched by `rule_type`** — one query pattern per type, not a single generic query. See §6 for patterns.
One REQUIRES edge generates warnings for ALL DataProducts where the pattern matches. Seed once, harvest forever.
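The `{placeholder}` template convention can be exercised outside Neo4j. A sketch of a fill helper (the function name `instantiate_template` is illustrative, not an existing script) that leaves unknown placeholders intact so gaps stay visible to the Layer 3 reviewer:

```python
import re

def instantiate_template(template: str, values: dict) -> str:
    """Fill {placeholder} slots from `values`. Unknown placeholders are
    left untouched so missing data is visible during curation."""
    def fill(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)
    return re.sub(r"\{(\w+)\}", fill, template)
```

For example, filling the numeric-threshold template above with `{"value": 0.8}` yields "Consecutive estimates share 0.8 of underlying data. ..." while an unfilled `{survey2}` survives verbatim for the reviewer to spot.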
### 4.6 Provenance Model
All extraction provenance lives on the `SOURCED_FROM` relationship edge, NOT on node properties.
**SOURCED_FROM edge properties:**
```
source_section: str # Chapter/section reference
source_page: str|int # Page number or range
raw_text: str # Original text passage
extraction_model: str # Which LLM extracted this
extraction_date: str # ISO date
```
**Why edge-based:** Enables multi-citation (one node sourced from multiple document passages). Coverage queries use `MATCH (n)-[:SOURCED_FROM]->(d:SourceDocument)`, and that edge is guaranteed to exist for every valid Layer 1 node.
**Rule:** Every Layer 1 node MUST have at least one `SOURCED_FROM` edge. Nodes without provenance edges are invalid.
**ContextItem provenance** uses `DERIVED_FROM` edges (not `SOURCED_FROM`) linking to the MethodologicalChoice/QualityAttribute nodes that generated it, plus properties: `harvest_query_id`, `harvest_date`, `derivation_confidence`.
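The "at least one `SOURCED_FROM` edge" invariant is easy to check mechanically before load. A sketch, assuming nodes and edges have been exported to plain Python structures (the shapes here are illustrative):

```python
def find_orphan_nodes(nodes, sourced_from_edges):
    """Return IDs of Layer 1 nodes lacking any SOURCED_FROM edge.
    `nodes` maps node_id -> layer number; edges are (node_id, doc_id) pairs."""
    cited = {node_id for node_id, _doc in sourced_from_edges}
    return sorted(nid for nid, layer in nodes.items()
                  if layer == 1 and nid not in cited)
```

Running this on each extraction batch (or the Cypher equivalent in the database) turns the validity rule into a gate rather than a convention.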
---
## 5. Source Document Nodes
```cypher
(:SourceDocument {
catalog_id: "CPS-TP-077",
title: "Design and Methodology: Current Population Survey",
year: 2006,
pages: 131,
local_path: "knowledge-base/census_cps/CPS-Tech-Paper-77.pdf",
  ingestion_status: "complete",   // CONTROLLED: "complete" | "partial" | "pending"
ingestion_date: "2026-02-08",
pages_extracted: [1,2,3,...],
pages_total: 131
})
```
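The status vocabulary can be derived from the page bookkeeping rather than set by hand. A sketch under that assumption (the real pipeline may assign status differently):

```python
def derive_ingestion_status(doc: dict) -> str:
    """Derive ingestion_status from page coverage, mirroring the
    "complete" | "partial" | "pending" controlled vocabulary."""
    extracted = set(doc.get("pages_extracted", []))
    total = doc.get("pages_total", 0)
    if not extracted:
        return "pending"
    return "complete" if len(extracted) >= total else "partial"
```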
---
## 6. Harvest Queries (Layer 2)
### 6.1a Numeric Threshold Violations
```cypher
// Harvest pattern for numeric rules (overlap, sample size, CV, etc.)
MATCH (task:AnalysisTask)-[req:REQUIRES]->(qa_std:QualityAttribute)
WHERE req.rule_type = "numeric_threshold"
MATCH (mc:MethodologicalChoice)-[ap:APPLIES_TO]->(dp:DataProduct)
WHERE ap.valid_until IS NULL OR date(ap.valid_until) >= date() // current choices only
MATCH (mc)-[:PRODUCES]->(qa_obs:QualityAttribute)
WHERE qa_obs.dimension = qa_std.dimension
  AND qa_obs.name = qa_std.name   // match the specific metric, not just the dimension
  AND qa_obs.value_number IS NOT NULL
  AND qa_obs.value_number >= req.threshold_number
RETURN {
task: task.name,
product: dp.name,
warning: replace(req.violation_template, "{value}", toString(qa_obs.value_number)),
recommendation: req.recommended_action,
severity: req.violation_severity,
source_facts: collect(mc.name),
confidence: "high"
} AS candidate_context_item
```
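The numeric-threshold pattern can be mirrored in plain Python to unit-test seed rules before running them against Neo4j. A sketch with illustrative record shapes (dicts standing in for REQUIRES edges and PRODUCES output; not pipeline code):

```python
def harvest_numeric_violations(rules, observations):
    """Match observed QualityAttributes against numeric REQUIRES rules.
    `rules` carry task, dimension, name, threshold_number, violation_template,
    violation_severity; `observations` carry product, dimension, name,
    value_number. Returns candidate ContextItems."""
    candidates = []
    for rule in rules:
        if rule["rule_type"] != "numeric_threshold":
            continue
        for obs in observations:
            if (obs["dimension"] == rule["dimension"]
                    and obs["name"] == rule["name"]   # metric match, not just dimension
                    and obs.get("value_number") is not None
                    and obs["value_number"] >= rule["threshold_number"]):
                candidates.append({
                    "task": rule["task"],
                    "product": obs["product"],
                    "warning": rule["violation_template"].replace(
                        "{value}", str(obs["value_number"])),
                    "severity": rule["violation_severity"],
                    "confidence": "high",
                })
    return candidates
```

Seeding a rule with threshold 0.2 and feeding it an observed `overlap_fraction` of 0.8 should produce exactly one candidate warning; 0.1 should produce none.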
### 6.1b Categorical Mismatch Violations (Cross-Survey)
```cypher
// Harvest pattern for categorical match rules (reference period, unit of analysis, etc.)
MATCH (task:AnalysisTask)-[req:REQUIRES]->(qa_std:QualityAttribute)
WHERE req.rule_type = "categorical_match"
MATCH (concept:CanonicalConcept)
MATCH (op1:ConceptDefinition)-[:OPERATIONALIZES]->(concept)
MATCH (op1)-[:DEFINED_FOR]->(dp1:DataProduct)
MATCH (op2:ConceptDefinition)-[:OPERATIONALIZES]->(concept)
MATCH (op2)-[:DEFINED_FOR]->(dp2:DataProduct)
WHERE elementId(dp1) < elementId(dp2)   // emit each product pair once
AND qa_std.dimension = "definitional_alignment"
AND op1.reference_period <> op2.reference_period
RETURN {
task: task.name,
concept: concept.name,
products: [dp1.name, dp2.name],
warning: replace(replace(replace(replace(req.violation_template,
"{survey1}", op1.survey), "{period1}", op1.reference_period),
"{survey2}", op2.survey), "{period2}", op2.reference_period),
recommendation: req.recommended_action,
severity: req.violation_severity,
confidence: "high"
} AS candidate_context_item
```
### 6.2 Cross-Survey Concept Misalignment
See §6.1b above — concept misalignment is now a REQUIRES-driven harvest pattern.
### 6.3 Universe Mismatches
```cypher
MATCH (dp1:DataProduct)-[:TARGETS]->(u1:UniverseDefinition)
MATCH (dp2:DataProduct)-[:TARGETS]->(u2:UniverseDefinition)
WHERE dp1.name CONTAINS "CPS" AND dp2.name CONTAINS "ACS" AND u1 <> u2
RETURN {
  // universe description read from the node's name; raw_text is reserved for SOURCED_FROM edges (§4.6)
  warning: "Population scope differs: " + dp1.name + " targets " +
           u1.name + " while " + dp2.name + " targets " + u2.name,
severity: "critical",
confidence: "high"
} AS candidate_context_item
```
### 6.4 Unanticipated Interactions (Flagged for Expert Review)
```cypher
// Co-occurring methodology changes affecting same quality dimension
MATCH (mc1:MethodologicalChoice)-[:PRODUCES]->(qa1:QualityAttribute)
MATCH (mc2:MethodologicalChoice)-[:PRODUCES]->(qa2:QualityAttribute)
WHERE elementId(mc1) < elementId(mc2)   // evaluate each pair once
AND qa1.dimension = qa2.dimension
AND (mc1.valid_from IS NULL OR mc2.valid_until IS NULL
OR date(mc1.valid_from) <= date(mc2.valid_until))
AND (mc2.valid_from IS NULL OR mc1.valid_until IS NULL
OR date(mc2.valid_from) <= date(mc1.valid_until))
AND NOT EXISTS { MATCH (mc1)-[:CONFOUNDS]-(mc2) }
RETURN {
type: "potential_unanticipated_interaction",
choices: [mc1.name, mc2.name],
dimension: qa1.dimension,
confidence: "low",
action: "Expert review needed"
} AS flagged_interaction
```
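The validity-window test above treats null as open-ended. The same semantics as a standalone helper, useful for unit-testing the overlap logic outside Neo4j (the function name is illustrative):

```python
from datetime import date

def windows_overlap(from1, until1, from2, until2):
    """True if two validity windows overlap. None means open-ended,
    matching the null handling in the Cypher query above."""
    def le(a, b):
        # a <= b, with None treated as unbounded on that side
        return a is None or b is None or a <= b
    # Standard interval test: start1 <= end2 AND start2 <= end1
    return le(from1, until2) and le(from2, until1)
```

So a choice effective from 2014 onward does not overlap one superseded at the end of 2013, but does overlap anything still current.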
### 6.5 Coverage Report
```cypher
MATCH (d:SourceDocument)
OPTIONAL MATCH (n)-[:SOURCED_FROM]->(d)
RETURN d.catalog_id, d.ingestion_status,
d.pages_total, count(DISTINCT n) AS nodes_extracted
```
### 6.6 Unconnected Facts (Extraction Gaps)
```cypher
MATCH (mc:MethodologicalChoice)
WHERE NOT (mc)-[:PRODUCES]->(:QualityAttribute)
OPTIONAL MATCH (mc)-[src:SOURCED_FROM]->(doc:SourceDocument)
RETURN mc.fact_category, doc.catalog_id AS source_document,
src.source_page, src.raw_text
ORDER BY mc.fact_category, src.source_page
```
---
## 7. Extraction Prompt Engineering (Layer 1)
### 7.1 Core Principle
**The extraction prompt asks for FACTS and MEASUREMENTS, not judgments.**
Instead of: "What are the fitness implications of this methodology?"
Ask: "What specific procedures does this survey use, and what are their measurable quality consequences?"
This division of labor is more reliable: LLMs extract stated facts consistently but generate fitness implications inconsistently.
### 7.2 Extraction Targets
For each text chunk, extract:
1. **MethodologicalChoice**: What does the survey do? Include `fact_category`, scope, validity period.
2. **QualityAttribute**: What measurable quality consequence does this produce? Include dimension, value_type, value.
3. **ConceptDefinition**: How does this survey define/measure a concept? Include reference_period, unit_of_analysis.
4. **UniverseDefinition**: Who is in/out of scope?
5. **Threshold**: Any numeric boundaries with operational consequences?
6. **TemporalEvent**: Any dated methodology changes?
7. **QualityCaveat**: Any documented quality issues, limitations, error sources?
### 7.3 Prompt Requirements
1. **Penalize summary, reward operational detail.** "Ignore general descriptions. Focus on specific constraints, thresholds, adjustment procedures, edge cases, and processing rules."
2. **Force scope specificity** — "for CPS ASEC post-2014" not "for CPS"
3. **Extract the PRODUCES relationship** — every MethodologicalChoice should be paired with its quality consequence where documented
4. **Extract `fact_category`** from controlled vocabulary
5. **Extract `reference_period` and `unit_of_analysis`** for every ConceptDefinition
6. **Preserve page numbers**
### 7.4 Deduplication Strategy
Do NOT deduplicate during extraction. After extraction:
1. Cluster nodes by semantic similarity + matching `fact_category` + matching `survey`
2. Maintain all evidence citations
3. Merge only with human/Opus review
4. For overlapping documents: mark later edition as `preferred_source`, preserve both
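Step 1 can be sketched with stdlib tools, using `difflib` string similarity as a stand-in for real semantic similarity (an assumption; production clustering would likely use embeddings):

```python
from difflib import SequenceMatcher
from itertools import groupby

def cluster_candidates(nodes, threshold=0.85):
    """Group candidate MethodologicalChoice dicts by (fact_category, survey),
    then cluster within each group by name similarity. SequenceMatcher is a
    stand-in for an embedding-based similarity measure."""
    key = lambda n: (n["fact_category"], n["survey"])
    clusters = []
    for _k, group in groupby(sorted(nodes, key=key), key=key):
        group_clusters = []
        for node in group:
            for cluster in group_clusters:
                ratio = SequenceMatcher(None, cluster[0]["name"], node["name"]).ratio()
                if ratio >= threshold:
                    cluster.append(node)   # merge candidate for human review
                    break
            else:
                group_clusters.append([node])
        clusters.extend(group_clusters)
    return clusters
```

Each multi-member cluster is a merge *candidate*; per step 3, the actual merge still requires human review, and all evidence citations travel with the cluster.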
---
## 8. Seeding Guide (Layer 0)
### 8.1 Initial AnalysisTask Set (Seed 5-10 for CPS-ACS Income)
| Task Name | Description | Critical Quality Dimensions |
|-----------|-------------|----------------------------|
| `EstimateChangeOverTime` | Compare estimates across time periods | temporal_comparability, overlap_structure, seasonal_adjustment_status |
| `CrossSurveyComparison` | Compare estimates across surveys (CPS vs ACS) | definitional_alignment, universe_match, reference_period_match |
| `SmallAreaEstimation` | Estimate for sub-state geographies | effective_sample_size, direct_estimate_reliability |
| `SubgroupAnalysis` | Estimate for demographic subpopulations | subgroup_sample_size, design_effect |
| `IncomeDistributionAnalysis` | Analyze income distribution shape | topcoding_effects, imputation_method, component_coverage |
### 8.2 REQUIRES Edge Examples
```cypher
// Task: CrossSurveyComparison requires matching reference periods
CREATE (t:AnalysisTask {name: "CrossSurveyComparison"})
CREATE (qa:QualityAttribute {name: "reference_period_alignment", dimension: "definitional_alignment"})
CREATE (t)-[:REQUIRES {
rule_type: "categorical_match",
threshold_number: null,
threshold_string: "matching",
condition: "reference_periods_must_align",
violation_severity: "high",
violation_template: "Reference periods differ: {survey1} uses {period1} while {survey2} uses {period2}. Direct comparison produces systematic bias.",
recommended_action: "Restrict to overlapping reference windows or apply published reconciliation factors."
}]->(qa)
// Task: CrossSurveyComparison requires matching universes
CREATE (qa2:QualityAttribute {name: "universe_alignment", dimension: "coverage"})
CREATE (t)-[:REQUIRES {
rule_type: "categorical_match",
threshold_number: null,
threshold_string: "matching",
condition: "target_populations_must_align",
violation_severity: "critical",
violation_template: "{survey1} targets {universe1} while {survey2} targets {universe2}. Aggregates are not comparable without universe restriction.",
recommended_action: "Restrict both surveys to overlapping population (e.g., civilian noninstitutional 16+)."
}]->(qa2)
```
### 8.3 Seeding Process
1. Expert identifies 5-10 common analytic tasks for CPS-ACS income work
2. For each task, defines 3-5 REQUIRES edges with violation templates
3. Templates use `{placeholder}` syntax filled from graph data at harvest time
4. Total seeding effort: ~1 day for initial set
5. New tasks added incrementally as new surveys/use cases arise
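Templates can be linted at seed time so a typo in a `{placeholder}` never reaches harvest. A sketch (the allowed-placeholder set is illustrative, per-rule-type in practice):

```python
import re

def lint_template(template: str, allowed: set) -> set:
    """Return placeholders in `template` that are not in `allowed`.
    An empty set means the template is safe to seed."""
    found = set(re.findall(r"\{(\w+)\}", template))
    return found - allowed
```

Running this over every REQUIRES edge during step 2 catches misspelled slots (e.g. `{perod1}`) before they surface as unfilled text in a curated warning.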
---
## 9. Phase 2 Decision Points
| Decision | Trigger | Action |
|----------|---------|--------|
| Split `MethodologicalChoice` | `fact_category` shows clear clustering | Promote categories to node labels |
| Add `Scope` node | Unscoped generalizations cause false positives | Promote scope properties to nodes |
| TSE-type `QualityCaveat` | Caveats are unclusterable | Require `tse_type` |
| Claim reification | >10 documents, contradictions across editions | Add Claim + Evidence layer |
| `ComparabilityAssessment` | `OPERATIONALIZES` + `CanonicalConcept` insufficient | Add structured comparison nodes |
| Split `CONFOUNDS` | Edge overuse | Promote to BIASES, INCREASES_VARIANCE, etc. |
| `WeightingStage` nodes | Weighting pipeline detail needed for diagnosis | Add ordered sub-process nodes |
| `DataCollectionMode` nodes | Mode effects need first-class representation | Add mode nodes with BIASES edges |
---
## 10. Pre-Extraction Checklist
- [ ] `SourceDocument` nodes created for target documents
- [ ] `CanonicalConcept` nodes seeded (Household Income, Family Income, Personal Income, Earnings, Money Income, Employment)
- [ ] `DataProduct` nodes seeded (CPS ASEC, CPS Basic Monthly, ACS 1-Year, ACS 5-Year)
- [ ] `SurveyProcess` nodes seeded (Sampling, Collection, Weighting, Estimation, Processing, Dissemination)
- [ ] `AnalysisTask` nodes seeded (5-10 per §8.1)
- [ ] `REQUIRES` edges defined with templates (per §8.2)
- [ ] Extraction prompt written, tested on 2-3 sample pages
- [ ] `fact_category` controlled vocabulary finalized
- [ ] Neo4j constraints/indexes created
---
## 11. Relationship to Existing Architecture
| Component | Role |
|-----------|------|
| Neo4j `raw` database | The quarry — Layer 0 seeds + Layer 1 extractions |
| Neo4j `pragmatics` database | Layer 3 approved ContextItems |
| `staging/*.json` | Export from pragmatics DB (downstream, not authoring surface) |
| `packs/*.db` | Compiled SQLite (shipped to users) |
| `scripts/extract/` | Extraction pipeline code and prompts |
| `scripts/harvest/` | Harvest query scripts (Layer 2) |
| `provenance_catalog` table | Coverage tracking in compiled packs |
---
## Appendix A: External Review Summary
Schema reviewed across five rounds by four independent models.
| Model | Round | Key Contribution |
|-------|-------|-----------------|
| Gemini Pro | 1, 2 | Split MethodologicalFact, add AnalyticTask, soft-type via category property |
| GPT-5.2 | 1, 2 | Scope as first-class, reference_period/unit_of_analysis critical, claim reification |
| Kimi-K2.5 | 1 | Abandon FitnessImplication as extractable → derive via harvest. Process-centric design. |
| Grok-4 | 1 | Confirmed template-based harvest architecture, concrete REQUIRES seeding patterns |
| GPT-5.2 | 3 | Structural bugs: product-scoping, query path validity, type safety, provenance redundancy |
**Convergent finding (all four):** Fitness implications should be derived, not extracted. Expert knowledge belongs in reusable task/requirement structures, not per-document prose.
**Round 3 structural review (GPT-5.2):** Five blocking bugs identified and fixed: product-scoping via generic SurveyProcess, missing DEFINED_FOR relationship, provenance redundancy, QualityAttribute type safety, REQUIRES threshold heterogeneity.
**Key architectural shift:** From "extract facts AND implications" to "extract facts, seed standards, harvest violations."