# ADR-008: Custom Extraction Pipeline over llm-graph-builder
**Status:** Accepted
**Date:** 2026-02-09
**Deciders:** Brock Webb
**Traces To:** KG.5, KG.8, KG.9, KG.10, Phase 5
## Context
Week 1 of the FCSM sprint requires a pipeline to extract structured knowledge from Census methodology PDFs into the quarry Neo4j database. The initial plan was to use neo4j-labs/llm-graph-builder as the extraction engine.
## Decision
**Build a custom extraction pipeline** in `scripts/quarry/` that ships with the census-mcp-server project, replacing llm-graph-builder for all extraction work.
## Rationale
After a full integration test with llm-graph-builder (CPS Handbook of Methods, 22 pages), we found:
### llm-graph-builder Limitations (Empirical)
1. **Schema API is read-only** — the `/schema` POST endpoint only queries existing labels; it does not accept schema definitions, despite documentation suggesting otherwise. Required workaround: seed skeleton nodes directly via Cypher.
2. **Page-based chunking only** — PyMuPDFLoader returns one Document per page. No paragraph, section, or semantic chunking. Census methodology docs need section-aware chunking for coherent extraction.
3. **LLMGraphTransformer does not populate typed properties** — Of 6 QualityAttribute nodes extracted, all had null `dimension`, `name`, and `value_type`. The `additional_instructions` parameter suggests properties but doesn't enforce them. Tool-use mode extraction treats properties as optional.
4. **Only 3 PRODUCES edges from 93 MethodologicalChoice nodes** — The critical join path for harvest queries (REQUIRES → QualityAttribute ← PRODUCES) was nearly empty. Required a full enrichment pass to fix.
5. **Source document hallucination** — One PDF produced 9 distinct SourceDocument nodes ("CPS Handbook of Methods", "Handbook of Methods", "CPS Technical Documentation", "CPS History Timeline", etc.). No entity resolution.
6. **291 MENTIONS fallback relationships** — When LLMGraphTransformer can't match extracted relationships to the allowed schema, it falls back to generic MENTIONS. This is noise.
7. **Frontend-dependent configuration** — Schema setting via Graph Enhancement tab requires running the React frontend. The backend API has no equivalent.
8. **Neo4j MCP is single-database** — Claude Desktop config allows only one NEO4J_DATABASE. Switching between quarry and pragmatics requires config changes or direct Python scripts.
9. **Workflow fragmented into four disconnected manual scripts** — seed, extract, enrich, harvest — each with hardcoded paths and no error recovery.
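The Cypher workaround from limitation 1 can be sketched as follows. The label names come from this ADR; the property keys and the `_skeleton` placeholder value are illustrative assumptions, not the actual v3.1 schema:

```python
# Hypothetical sketch of the skeleton-seeding workaround: because the /schema
# POST endpoint is read-only, one placeholder node per label is MERGEd so the
# labels already "exist" when llm-graph-builder queries the database.
SKELETON = {
    "SourceDocument": ["title"],
    "MethodologicalChoice": ["name"],
    "QualityAttribute": ["dimension", "name", "value_type"],
}

def skeleton_statements(skeleton: dict[str, list[str]]) -> list[str]:
    """Build one idempotent MERGE per label, to run once against the quarry DB."""
    stmts = []
    for label, props in skeleton.items():
        body = ", ".join(f"{p}: '_skeleton'" for p in props)
        stmts.append(f"MERGE (:{label} {{{body}}})")
    return stmts

for stmt in skeleton_statements(SKELETON):
    print(stmt)
```

Because MERGE is idempotent, rerunning the seed script is harmless; the placeholder nodes can be deleted after real extraction populates each label.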
### What Worked
- **The schema design (v3.1) is sound** — harvest queries produced real results once PRODUCES edges were populated by the enrichment pass.
- **Direct LLM extraction (enrichment script) outperformed LLMGraphTransformer** — 96% of MethodologicalChoice nodes got PRODUCES edges with properly typed QualityAttribute properties.
- **Harvest found one genuinely valuable insight** — a 75% rotation-group overlap between consecutive CPS months, exceeding the 0.2 threshold.
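The join path these harvest results flow through (the one that was nearly empty in limitation 4) can be sketched as a single Cypher query. The relationship types and node labels are from this ADR; the aliases, the `value`/`dimension` property names, the `$threshold` filter, and the unlabeled consumer node are illustrative assumptions:

```python
# Hedged sketch of the harvest join: facts surface where a MethodologicalChoice
# PRODUCES a QualityAttribute that some consumer node REQUIRES.
HARVEST_QUERY = """
MATCH (mc:MethodologicalChoice)-[:PRODUCES]->(qa:QualityAttribute),
      (consumer)-[:REQUIRES]->(qa)
WHERE toFloat(qa.value) > $threshold
RETURN mc.name AS choice, qa.dimension AS dimension, qa.value AS value
"""

def harvest(session, threshold: float = 0.2):
    """Run against the quarry database via an open neo4j session."""
    return [dict(record) for record in session.run(HARVEST_QUERY, threshold=threshold)]
```

With only 3 PRODUCES edges, this MATCH returns almost nothing; after the enrichment pass populated the edges, the same query produced the rotation-group finding above.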
### Custom Pipeline Advantages
- **Section-aware chunking** — respect document structure (headers, numbered sections, tables)
- **Direct structured JSON extraction** — one prompt per chunk with full property schema enforcement, no LLMGraphTransformer intermediary
- **Single pipeline** — PDF → chunk → extract → write → harvest in one script
- **Shippable as project tooling** — `scripts/quarry/` becomes a toolkit others can use to build their own pragmatics knowledge bases
- **Full property control** — controlled vocabularies (fact_category, dimension, value_type) enforced at extraction time
- **Entity resolution** — deduplicate at write time via MERGE on canonical names
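A minimal sketch of the write-time entity resolution in the last bullet, assuming a hand-maintained alias table — the aliases shown echo the duplicate titles from limitation 5, but the table and helper names are hypothetical:

```python
# Write-time dedup: normalize each extracted title to a canonical form, then
# MERGE keyed only on the canonical title so variants collapse into one node.
ALIASES = {  # hypothetical mapping of observed variants -> canonical title
    "handbook of methods": "CPS Handbook of Methods",
    "cps technical documentation": "CPS Handbook of Methods",
}

def canonical_title(raw: str) -> str:
    """Case-fold and collapse whitespace before the alias lookup."""
    key = " ".join(raw.lower().split())
    return ALIASES.get(key, raw.strip())

def merge_source_document(title: str) -> tuple[str, dict]:
    """Parameterized Cypher plus its parameters, ready for session.run()."""
    return ("MERGE (d:SourceDocument {title: $title})",
            {"title": canonical_title(title)})

query, params = merge_source_document("  Handbook of Methods ")
print(params["title"])  # -> CPS Handbook of Methods
```

Because the MERGE key is the canonical title rather than the raw extracted string, the nine-node fan-out seen with llm-graph-builder collapses to a single SourceDocument node at write time.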
## Consequences
- llm-graph-builder remains installed for reference/experimentation but is not part of the census-mcp-server pipeline
- All extraction tooling lives in census-mcp-server repo under `scripts/quarry/`
- Extraction depends on: Anthropic API key, Neo4j with quarry database, PyMuPDF, langchain-anthropic
- The quarry toolkit becomes a documented, reproducible methodology for the FCSM paper
## Alternatives Considered
1. **Fix llm-graph-builder** — Add section chunking, property enforcement, entity resolution. Rejected: fighting upstream design decisions. The tool is built for generic demo extraction, not domain-specific structured knowledge engineering.
2. **Use llm-graph-builder for extraction, custom scripts for enrichment** — This is what we tested. It works but doubles the API calls and adds complexity. The enrichment script proved direct extraction is cleaner.
3. **Skip KG entirely, author pragmatics manually** — Rejected: cross-document inference (CPS vs ACS definitional differences) requires graph traversal. Manual authoring can't scale to the full Census methodology corpus.