# Quarry Extraction Toolkit
Custom extraction pipeline for Census methodology PDFs → Neo4j quarry database. Replaces llm-graph-builder (see ADR-008) and ships as a project component (ADR-009).
## What It Does
The quarry toolkit implements a five-layer (Layers 0–4) knowledge graph extraction architecture:
1. **Layer 0 (Seed):** Expert-curated AnalysisTask + REQUIRES edges encoding quality standards
2. **Layer 1 (Extract):** LLM extraction of facts from PDFs → MethodologicalChoice, QualityAttribute, etc.
3. **Layer 2 (Harvest):** Pattern-matching queries generate candidate ContextItems
4. **Layer 3 (Curate):** Expert validation of harvest output (external to toolkit)
5. **Layer 4 (Export):** Transform to pragmatics staging JSON (stub, not yet implemented)
**Key insight:** Fitness implications are DERIVED by Cypher queries, not extracted from documents.
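To make that concrete, here is a sketch of what a derivation query might look like. The Cypher below is illustrative only (the real harvest queries live in `docs/design/raw_kg_schema.md` §6); node labels come from the schema, but the property names and the match pattern are assumptions.

```python
# Illustrative query shape: fitness implications emerge from matching Layer 0
# requirements (REQUIRES edges) against Layer 1 extracted facts -- they are
# never extracted as text from any document.
DERIVE_FITNESS = """
MATCH (t:AnalysisTask)-[r:REQUIRES]->(q:QualityAttribute)
MATCH (c:MethodologicalChoice)-[:PRODUCES]->(q)
WHERE q.numeric_value < r.threshold
RETURN t.task_id AS task, c.choice_id AS choice,
       q.numeric_value AS observed, r.threshold AS required
"""

def summarize(records):
    """Turn driver result rows (as dicts) into human-readable violation lines."""
    return [
        f"{r['task']}: {r['choice']} observed {r['observed']} < required {r['required']}"
        for r in records
    ]
```

A result row like `{"task": "t1", "choice": "c1", "observed": 2, "required": 5}` renders as `t1: c1 observed 2 < required 5` — the "implication" exists only as query output, never as a stored node.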
## Prerequisites
- **Python:** 3.11+
- **Neo4j:** 5.x with `quarry` database created and online
- **Anthropic API key:** Get from https://console.anthropic.com
- **Packages:** `pip install docling anthropic neo4j`
## Installation
From the repository root:
```bash
pip install docling anthropic neo4j
```
## Environment Variables
- `ANTHROPIC_API_KEY` — Required for extraction
- `NEO4J_PASSWORD` — Optional, defaults to `i'llbeback` in config
- `NEO4J_URI` — Optional, defaults to `bolt://localhost:7687`
- `LOG_LEVEL` — Optional, defaults to `INFO`
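The defaults above can be read with a few lines of standard-library code. This is a sketch, not the actual `scripts/quarry/config.py` implementation; only the variable names and default values are taken from this README.

```python
import os

def load_settings(env=os.environ):
    """Read the documented environment variables, falling back to the
    documented defaults. ANTHROPIC_API_KEY has no default: extraction
    fails without it."""
    return {
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY"),
        "neo4j_password": env.get("NEO4J_PASSWORD", "i'llbeback"),
        "neo4j_uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```

Passing an empty dict (`load_settings({})`) yields all defaults with `anthropic_api_key` set to `None`, which is why the missing-key error in Troubleshooting only fires at extraction time.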
## Quick Start
```bash
# 1. Seed Layer 0 (AnalysisTask + REQUIRES edges)
python -m scripts.quarry.seed
# 2. Extract a document (22-page CPS Handbook example)
python -m scripts.quarry.extract --source cps_handbook
# 3. Run harvest queries
python -m scripts.quarry.harvest
```
## Usage
### Seed Layer 0
```bash
# Dry run: print Cypher without executing
python -m scripts.quarry.seed --dry-run
# Execute: create Layer 0 nodes and relationships
python -m scripts.quarry.seed
```
Idempotent (uses MERGE). Safe to run multiple times.
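The idempotency comes from Cypher `MERGE`, which matches an existing node on its key before creating one. A dict-backed stand-in makes the upsert semantics concrete; the Cypher string and the `poverty_rate` task are illustrative, not the real seed data.

```python
# Shape of a seed statement (parameters supplied via the driver):
SEED_TASK = """
MERGE (t:AnalysisTask {task_id: $task_id})
SET t.name = $name
"""

def merge_node(graph, label, key, props):
    """MERGE-like upsert against a plain dict: create the node if absent,
    update its properties if present. Running it twice leaves one node."""
    node = graph.setdefault((label, key), {})
    node.update(props)
    return node

graph = {}
merge_node(graph, "AnalysisTask", "poverty_rate", {"name": "Poverty rate estimation"})
merge_node(graph, "AnalysisTask", "poverty_rate", {"name": "Poverty rate estimation"})
# graph still holds exactly one AnalysisTask node
```

The same reasoning explains why re-running the seed script never duplicates Layer 0: every statement keys on a stable identifier.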
### Chunk a PDF
```bash
# Test Docling chunking on CPS Handbook
python -m scripts.quarry.chunk --source cps_handbook
# Show first 5 chunks
python -m scripts.quarry.chunk --source cps_handbook --limit 5
```
Verifies section-aware chunking: the chunk count should differ from the page count (equal counts indicate Docling fell back to page-level splitting — see Troubleshooting).
### Extract Knowledge
```bash
# Full extraction
python -m scripts.quarry.extract --source cps_handbook
# Dry run: show prompts and JSON output without writing to Neo4j
python -m scripts.quarry.extract --source cps_handbook --dry-run
# Limit to first 3 chunks (for testing)
python -m scripts.quarry.extract --source cps_handbook --limit 3
```
Extraction writes:
- MethodologicalChoice, QualityAttribute, ConceptDefinition, etc. (Layer 1 nodes)
- APPLIES_TO, PRODUCES, DEFINED_FOR, etc. (relationships per schema v3.1)
- SOURCED_FROM edges linking all extracted nodes to SourceDocument
- Exactly 1 SourceDocument node per PDF
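The write order above can be sketched as a small plan builder. Everything here is hypothetical (property names, the example IDs); it only shows why `MERGE` on the SourceDocument key guarantees exactly one node per PDF, with each extracted node linked back via SOURCED_FROM.

```python
def write_plan(catalog_id, label, node_id):
    """Build (cypher, params) pairs for writing one extracted fact:
    upsert the SourceDocument, upsert the Layer 1 node, link them."""
    return [
        ("MERGE (d:SourceDocument {catalog_id: $catalog_id})",
         {"catalog_id": catalog_id}),
        (f"MERGE (n:{label} {{id: $node_id}})",
         {"node_id": node_id}),
        (f"MATCH (n:{label} {{id: $node_id}}), "
         "(d:SourceDocument {catalog_id: $catalog_id}) "
         "MERGE (n)-[:SOURCED_FROM]->(d)",
         {"node_id": node_id, "catalog_id": catalog_id}),
    ]
```

Because every statement is a `MERGE` keyed on an identifier, replaying the plan for a second chunk of the same PDF reuses the existing SourceDocument instead of creating another.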
### Run Harvest Queries
```bash
python -m scripts.quarry.harvest
```
Runs all harvest queries from `docs/design/raw_kg_schema.md` §6:
- §6.1a: Numeric threshold violations
- §6.1b: Categorical mismatches (reference periods, universes)
- §6.2: Temporal comparability breaks
- §6.4: Unanticipated interactions
- §6.5: Extraction coverage report
- §6.6: Unconnected facts (gaps)
Results saved to `tmp/harvest_results_{timestamp}.json`.
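Since each run writes a fresh timestamped file, a small helper (not part of the toolkit — a convenience sketch assuming the timestamps sort lexicographically, e.g. ISO-style) can load the newest results:

```python
import glob
import json
import os

def latest_harvest(results_dir="tmp"):
    """Load the most recent tmp/harvest_results_{timestamp}.json,
    or return None if no harvest has been run yet."""
    paths = glob.glob(os.path.join(results_dir, "harvest_results_*.json"))
    if not paths:
        return None
    with open(max(paths)) as f:  # lexicographic max == newest timestamp
        return json.load(f)
```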
## Extending: Add a New Source Document
1. Add entry to `SOURCE_CATALOG` in `scripts/quarry/config.py`:
```python
"my_document": {
"catalog_id": "my_doc_id_2024",
"title": "My Document Title",
"year": 2024,
"survey": "acs", # or "cps", "general"
"local_path": "knowledge-base/source-docs/my-document.pdf"
}
```
2. Run extraction:
```bash
python -m scripts.quarry.extract --source my_document
```
3. Run harvest to generate ContextItems:
```bash
python -m scripts.quarry.harvest
```
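Before running extraction on a new entry, a quick sanity check catches catalog typos early. This checker is a sketch, not part of the toolkit; the required keys mirror the example above, and the survey values are the ones this README mentions.

```python
REQUIRED_KEYS = {"catalog_id", "title", "year", "survey", "local_path"}
KNOWN_SURVEYS = {"acs", "cps", "general"}

def check_catalog_entry(entry):
    """Return a list of problems with a SOURCE_CATALOG entry (empty if OK)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if entry.get("survey") not in KNOWN_SURVEYS:
        problems.append(f"unknown survey: {entry.get('survey')!r}")
    if not str(entry.get("local_path", "")).endswith(".pdf"):
        problems.append("local_path should point at a PDF")
    return problems
```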
## Schema Reference
- **Full schema:** `docs/design/raw_kg_schema.md` v3.1
- **Machine-readable:** `scripts/quarry/schema.json`
- **Architecture:** `docs/design/quarry_extraction_pipeline.md`
- **Design narrative:** `docs/design/kg_schema_design_narrative.md`
## File Structure
```
scripts/quarry/
├── __init__.py # Package init
├── config.py # Configuration, controlled vocabularies
├── schema.json # Machine-readable schema
├── utils.py # Shared utilities (Neo4j, Anthropic, validation)
├── chunk.py # PDF → Docling chunks
├── prompts.py # Extraction prompt templates
├── seed.py # Layer 0 seeding
├── extract.py # Main extraction pipeline (largest file)
├── harvest.py # Layer 2 harvest queries
├── export.py # Layer 4 export stub (not yet implemented)
└── README.md # This file
```
## Decisions & Architecture
- **ADR-008:** Why llm-graph-builder was replaced (custom needs, Docling integration)
- **ADR-009:** Quarry toolkit ships as project component (not external dependency)
- **ADR-007:** KG-first authoring workflow
- **ADR-001:** Neo4j authoring vs SQLite runtime separation
## Troubleshooting
**"Missing ANTHROPIC_API_KEY"**
→ Set environment variable: `export ANTHROPIC_API_KEY='your-key'`
**"MENTIONS relationships found"**
→ The extraction prompt needs fixing: MENTIONS is not a valid relationship type in schema v3.1.
**Chunk count equals page count**
→ Docling chunking not working correctly. Check `HierarchicalChunker` configuration.
**Validation errors**
→ Check logs for specific failures. Common: controlled vocabulary violations, missing required properties.
## Known Limitations
- Export pipeline (Layer 4) is stub only — harvest curation workflow not yet designed
- Entity resolution is basic (MERGE on ID/name) — may need deduplication pass
- No batch processing for multiple documents — run extract once per source
- Cost tracking is an estimate only (Sonnet 4.5 pricing may change)