# Phase 4.2: LLM-Native Field Extraction for Registry Review MCP
**Version:** 2.0.0 (Revised)
**Date:** November 12, 2025
**Status:** Ready for Implementation
**Phase:** 4.2 (Enhancement to Phase 4)
**Target Completion:** 2-3 weeks
---
## Executive Summary
Phase 4.1's regex-based extraction has critical limitations: it only parses MM/DD/YYYY dates, produces false positives ("maps dating" extracted as an owner name), cannot read images, and relies on brittle patterns. **Phase 4.2 upgrades to LLM-native extraction** for universal format support, context awareness, and image analysis.
**Key Improvements:**
- ✅ Any date format (MM/DD/YYYY, "August 15 2022", international)
- ✅ Context-aware disambiguation (project start vs sampling vs imagery dates)
- ✅ Image reading (scanned land titles, maps, tables)
- ✅ Semantic understanding (Nick = Nicholas, fuzzy name matching)
- ✅ Registry-agnostic (Regen, Verra, Gold Standard, CAR)
**Deployment Constraint:** MCP server will be **hosted remotely** and accessed by various agents. All extraction must be **self-contained** in server code.
**Cost:** ~$0.30-0.50 per session | **ROI:** ~400x-1100x (saves 3-5 hours at $50-75/hour; see Cost Analysis)
---
## Problem Statement
### Current Limitations
**Regex extraction failures:**
- False positives: "maps dating" extracted as owner name
- Missed formats: "August 15th, 2022" not parsed
- Poor disambiguation: All dates labeled as 'project_start_date'
- No image support: Cannot read scanned land titles
- Registry-specific: Only works for Regen format (C06-4997)
### Target Capabilities
LLM extraction enables:
1. **Universal formats**: Any date, name, ID pattern
2. **Context understanding**: Distinguishes date types, name variations
3. **Image analysis**: Reads scanned documents via Claude Vision
4. **Confidence scoring**: 0.0-1.0 with reasoning for audit trail
---
## Architecture
### Integration Pattern
```
MCP Server (validation_tools.py)
└─> cross_validate()
    ├─> IF llm_extraction_enabled:
    │   └─> extract_fields_with_llm()  [NEW]
    │       ├─> DateExtractor (AsyncAnthropic)
    │       ├─> LandTenureExtractor (AsyncAnthropic)
    │       └─> ProjectIDExtractor (AsyncAnthropic)
    └─> ELSE:
        └─> Regex extraction (fallback)
```
### Key Components
**1. Extraction Module** (`src/registry_review_mcp/extractors/llm_extractors.py`)
- `ExtractedField`: Pydantic model (value, field_type, source, confidence, reasoning)
- `DateExtractor`: Specialized date extraction with type classification
- `LandTenureExtractor`: Owner names, areas, tenure types (with image support)
- `ProjectIDExtractor`: Registry-agnostic ID patterns
- `extract_fields_with_llm()`: Main interface called by `cross_validate()`
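A minimal sketch of the `ExtractedField` model described above, assuming Pydantic v2; the field names follow the list, while the exact types and constraints are implementation assumptions:
```python
# Sketch only: field names mirror the component list above; types/constraints are assumptions.
from pydantic import BaseModel, Field


class ExtractedField(BaseModel):
    value: str | float              # extracted value (dates normalized to ISO, areas numeric)
    field_type: str                 # e.g. "project_start_date", "owner_name", "area_hectares"
    source: str                     # document/page citation for the audit trail
    confidence: float = Field(ge=0.0, le=1.0)  # 0.0-1.0 per the acceptance criteria
    reasoning: str = ""             # brief LLM justification for the extraction
```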
**2. Cache Layer** (`src/registry_review_mcp/utils/cache.py`)
- File-based TTL cache (24-hour default)
- Namespace-scoped (date_extraction, land_tenure, project_ids)
- Document-level caching to avoid duplicate API calls
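A minimal sketch of the cache layer, assuming a JSON-file-per-key layout under a local `.cache/` directory; the actual storage path and serialization are implementation details:
```python
# Sketch only: on-disk layout and cache directory are assumptions; namespace/TTL behaviour mirrors the spec.
import json
import time
from pathlib import Path
from typing import Any


class Cache:
    def __init__(self, namespace: str, ttl_seconds: int = 24 * 3600,
                 base_dir: Path = Path(".cache")):
        self.dir = base_dir / namespace            # namespace-scoped directory
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds                     # 24-hour default TTL

    def _path(self, key: str) -> Path:
        return self.dir / f"{key}.json"

    def get(self, key: str) -> Any | None:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["stored_at"] > self.ttl:
            return None                            # expired entry counts as a miss
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        self._path(key).write_text(json.dumps({"stored_at": time.time(), "value": value}))
```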
**3. Helper Functions** (in `llm_extractors.py`)
```python
def extract_doc_id(source: str) -> str | None
def extract_doc_name(source: str) -> str
def extract_page(source: str) -> int | None
def group_fields_by_document(fields: list[ExtractedField]) -> list[dict]
```
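Hedged sketches of these helpers, assuming a placeholder citation format like `"DOC-003 | Land Title.pdf | page 3"`; the real `source` format produced by the extractors will dictate the final parsing:
```python
# Sketch only: the "<doc_id> | <doc_name> | page <n>" source format is a placeholder assumption.
import re


def extract_doc_id(source: str) -> str | None:
    """Return a leading document/registry identifier if the citation carries one."""
    match = re.match(r"\s*([A-Z]{1,4}\d*-\d+)", source)
    return match.group(1) if match else None


def extract_doc_name(source: str) -> str:
    """Return the document-name segment of the citation, or the whole string."""
    parts = [p.strip() for p in source.split("|")]
    return parts[1] if len(parts) > 1 else parts[0]


def extract_page(source: str) -> int | None:
    """Return the page number if the citation mentions one."""
    match = re.search(r"page\s+(\d+)", source, re.IGNORECASE)
    return int(match.group(1)) if match else None
```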
### Configuration
```python
# Add to src/registry_review_mcp/config/settings.py
class Settings(BaseSettings):
    # LLM Extraction
    anthropic_api_key: str = Field(default="")
    llm_extraction_enabled: bool = Field(default=False)  # Conservative default
    llm_model: str = Field(default="claude-sonnet-4-20250514")
    llm_max_tokens: int = Field(default=4000)
    llm_temperature: float = Field(default=0.0)
    llm_confidence_threshold: float = Field(default=0.7)

    # Cost Management
    max_api_calls_per_session: int = Field(default=50)
    api_call_timeout_seconds: int = Field(default=30)
```
### Prompt Strategy
**Date Extraction:**
- Role: Date extraction specialist for carbon credit projects
- Task: Extract all dates, classify by type (project_start, imagery, sampling, baseline, monitoring)
- Output: JSON array with value, field_type, source, confidence, reasoning
- Context: Parse any format, use section headers for disambiguation
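A hedged sketch of what the external date-extraction template could look like; the wording is illustrative only and the real template will be tuned during Week 1:
```python
# Sketch only: illustrative prompt text, not the final external template.
DATE_EXTRACTION_PROMPT = """\
You are a date extraction specialist for carbon credit project documents.

Extract every date you find and classify each one as:
project_start_date, imagery_date, sampling_date, baseline_date, or monitoring_period.

Rules:
- Parse any format (MM/DD/YYYY, "August 15th, 2022", international) and normalize to ISO 8601.
- Use surrounding section headers and sentences to disambiguate the date type.
- Return ONLY a JSON array of objects with keys:
  value, field_type, source, confidence (0.0-1.0), reasoning.
"""
```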
**Land Tenure Extraction:**
- Role: Land tenure specialist
- Task: Extract owner names, areas, tenure types
- Special: Read images (scanned land titles), handle name variations
- Output: JSON with owner_name, area_hectares, tenure_type, confidence
**Project ID Extraction:**
- Role: Project ID specialist
- Task: Find all project ID occurrences across documents
- Patterns: Regen (C06-4997), Verra (VCS-1234), Gold Standard (GS-5678), CAR, ACR
- Output: JSON with value, source, confidence
---
## Implementation Plan
### Week 1: Foundation & Date Extraction
**Day 1: Setup**
- Add `anthropic>=0.40.0` to dependencies
- Create `extractors/` module and `ExtractedField` model
- Implement `Cache` class with TTL and namespace support
- Add configuration settings to `settings.py`
**Day 2-3: Date Extractor**
- Implement `DateExtractor` with `AsyncAnthropic` client
- Create specialized date extraction prompt (external template)
- Handle markdown + images from marker output
- Implement caching and error handling
- Unit tests: format parsing, disambiguation, caching
**Day 4: Validation**
- Test against Botany Farm REQ-007 (Project Start Date)
- Verify all date formats recognized
- Compare accuracy vs regex baseline
**Day 5: Land Tenure Extractor**
- Implement `LandTenureExtractor` with image support
- Test with scanned land title images
- Handle name variations (Nick = Nicholas)
### Week 2: IDs, Integration & Delivery
**Day 1: Project ID Extractor**
- Implement `ProjectIDExtractor`
- Registry-agnostic pattern recognition
- Cross-document consistency checking
**Day 2-3: Integration**
- Update `cross_validate()` to call LLM extraction
- Implement helper functions for data transformation
- Add confidence filtering (before transformation)
- Add exception handling with fallback to regex
- Cost tracking (API calls, tokens, estimated USD)
**Day 4: Testing**
- Integration test: LLM output compatible with validation functions
- Feature toggle test: Switch between LLM and regex
- Error handling tests: API failures, low confidence, corrupted images
- Performance benchmark: latency, API calls, cost
**Day 5: Documentation & Deployment**
- Update README with LLM extraction configuration
- Document environment variables for remote deployment
- Create deployment validation checklist
- Test in a Docker container (remote hosting is planned per the deployment constraint)
### Decision Framework
**Model Selection:**
- Default: Claude Sonnet 4 (best quality)
- If cost > $0.50/session: Test Claude Haiku
- Configurable via `llm_model` setting
**Caching Strategy:**
- Per-document hash (invalidates on content change)
- 24-hour TTL
- Session-scoped for isolation
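A sketch of the per-document cache key, assuming the key embeds a hash of the marker markdown so that any content change invalidates the entry:
```python
# Sketch only: hashing document content is one way to get the invalidation behaviour above.
import hashlib


def document_cache_key(doc_name: str, markdown: str, namespace: str) -> str:
    content_hash = hashlib.sha256(markdown.encode("utf-8")).hexdigest()[:16]
    return f"{doc_name}_{namespace}_{content_hash}"   # changes whenever the content changes
```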
**Fallback Logic:**
- LLM extraction fails → Regex extraction
- API unavailable → Regex extraction
- Confidence < threshold → Exclude from validation
---
## Acceptance Criteria
### Must Have (P0)
**Functional:**
- [ ] Extract dates in any format (MM/DD/YYYY, "August 15 2022", international)
- [ ] Correctly disambiguate date types (project start vs sampling vs imagery)
- [ ] Handle name variations ("Nick" = "Nicholas") without false negatives
- [ ] Read scanned land title images (OCR capability)
- [ ] Confidence scores 0.0-1.0 for all extracted fields
- [ ] Confidence filtering prevents low-quality extractions from validation
- [ ] Fallback to regex if API key not set or API unavailable
- [ ] All 61 existing tests pass (no regressions)
**Technical:**
- [ ] Use `AsyncAnthropic` client (non-blocking)
- [ ] Implement `Cache` class with TTL and namespacing
- [ ] Helper functions implemented (`extract_doc_id`, `extract_doc_name`, `extract_page`)
- [ ] Data structures grouped correctly (tenure fields merged by document)
- [ ] Exception handling with automatic fallback
### Should Have (P1)
**Quality:**
- [ ] 95%+ accuracy on date extraction (vs Botany Farm ground truth)
- [ ] 90%+ accuracy on land tenure extraction
- [ ] 100% accuracy on project ID extraction
- [ ] Zero "maps dating" false positives
**Performance:**
- [ ] Full extraction < 30 seconds
- [ ] < 20 API calls per session
- [ ] Cache hit rate > 80% on repeat runs
- [ ] Cost per session < $0.50
**Observability:**
- [ ] Cost tracking (API calls, tokens, estimated USD)
- [ ] Extraction method logged (llm vs regex)
- [ ] Confidence distribution logged
- [ ] API error rate < 5%
### Nice to Have (P2)
- [ ] Parallel extraction for multiple documents
- [ ] Retry logic with exponential backoff
- [ ] Multiple model support (Sonnet → Haiku fallback)
- [ ] Streaming responses for large documents
---
## Testing Strategy
### Test Suite (5-7 tests)
**1. Integration Test: Contract Compatibility**
```python
async def test_llm_extraction_output_compatible_with_validation():
    """Verify LLM extractor output matches validation function expectations."""
    # Extract with LLM and transform as cross_validate() would
    raw_fields = await extract_fields_with_llm(session_id, evidence_data)
    llm_dates = transform_dates(raw_fields['dates'])
    llm_tenure = group_fields_by_document(raw_fields['tenure'])
    llm_ids = transform_ids(raw_fields['project_ids'])

    # Verify validation functions accept the transformed output
    date_result = await validate_date_alignment(..., llm_dates[0]['date_value'], ...)
    tenure_result = await validate_land_tenure(..., fields=llm_tenure)
    id_result = await validate_project_id(..., occurrences=llm_ids)

    for result in (date_result, tenure_result, id_result):
        assert result is not None  # each validator accepts the LLM output shape
```
**2. Accuracy Test: Ground Truth Comparison**
```python
async def test_llm_extraction_accuracy_botany_farm():
    """Verify extraction achieves target accuracy on Botany Farm."""
    GROUND_TRUTH = {
        'dates': [('2022-01-01', 'project_start_date'), ...],
        'tenure': [('Nicholas Denman', 'owner_name'), (120.5, 'area_hectares')],
        'project_ids': ['4997']
    }
    extracted = await extract_fields_with_llm(session_id, evidence_data)
    precision, recall = calculate_metrics(extracted, GROUND_TRUTH)
    assert precision >= 0.95 and recall >= 0.90

    # Specific Phase 4.1 regression: "maps dating" must not appear as an owner name
    owner_names = [f.value for f in extracted['tenure'] if f.field_type == 'owner_name']
    assert "maps dating" not in owner_names
```
**3. Performance Test: Cost & Latency**
```python
async def test_extraction_performance_and_cost():
    """Verify extraction meets performance and cost targets."""
    tracker = CostTracker()

    # Cold run
    with tracker, TimeIt() as timer1:
        result1 = await extract_fields_with_llm(session_id, evidence_data)

    # Cached run
    with tracker, TimeIt() as timer2:
        result2 = await extract_fields_with_llm(session_id, evidence_data)

    assert timer1.elapsed < 30.0   # Cold < 30s
    assert timer2.elapsed < 5.0    # Cached < 5s
    assert tracker.call_count < 20
    assert tracker.total_cost < 0.50
    assert result1 == result2      # Caching works
```
**4. Feature Toggle Test**
```python
async def test_toggle_between_llm_and_regex():
    """Verify validation can switch between LLM and regex extraction."""
    # LLM extraction
    settings.llm_extraction_enabled = True
    llm_result = await cross_validate(session_id)

    # Regex extraction
    settings.llm_extraction_enabled = False
    regex_result = await cross_validate(session_id)

    assert llm_result['summary']['extraction_method'] == 'llm'
    assert regex_result['summary']['extraction_method'] == 'regex_fallback'
    # LLM is expected to recognize at least as many date formats as the regex baseline
```
**5. Error Handling: API Failure**
```python
async def test_api_failure_fallback_to_regex():
    """Verify graceful degradation when API fails."""
    with mock_api_error(APIError("Rate limit exceeded")):
        result = await cross_validate(session_id)

    assert result is not None  # validation still completed via regex fallback
    assert result['summary']['extraction_method'] == 'regex_fallback'
```
**6-7. Additional Error Tests**
- Low confidence filtering (exclude extractions < 0.7)
- Corrupted image handling (continue with text-only)
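A sketch of the low-confidence filtering test, assuming a hypothetical `mock_llm_extraction()` helper for injecting canned extraction results:
```python
# Sketch only: mock_llm_extraction() is a hypothetical test helper, not an existing fixture.
async def test_low_confidence_extractions_excluded():
    """Fields below the confidence threshold must not reach validation."""
    low = ExtractedField(value="1999-01-01", field_type="project_start_date",
                         source="doc", confidence=0.3, reasoning="ambiguous mention")
    high = ExtractedField(value="2022-01-01", field_type="project_start_date",
                          source="doc", confidence=0.95, reasoning="explicit statement")

    with mock_llm_extraction(dates=[low, high], tenure=[], project_ids=[]):
        result = await cross_validate(session_id)

    assert result["summary"]["extraction_method"] == "llm"
    # Only the high-confidence date should have been passed to validation.
```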
---
## Cost Analysis
### Per-Session Estimate
**Assumptions:**
- Model: Claude Sonnet 4 ($3/MTok input, $15/MTok output)
- Calls per session: ~15 (5 dates, 5 tenure, 5 IDs)
- Average tokens: 5K input, 500 output per call
**Calculation:**
```
Input: 15 × 5,000 tokens = 75,000 tokens = $0.23
Output: 15 × 500 tokens = 7,500 tokens = $0.11
Total: $0.34 per session
```
**Monthly (100 reviews):** $34/month
**ROI:**
- Time saved: 3-5 hours per review
- Value: $150-375 (at $50-75/hour)
- Cost: $0.34
- **ROI: 441x - 1103x**
### Cost Management
**Safeguards:**
1. Hard limit: 50 API calls per session
2. Caching: Never extract same document twice
3. Confidence threshold: Only validate high-quality extractions (≥0.7)
4. Timeout: 30-second timeout per call
**Monitoring:**
- Log API calls to stderr
- Track tokens per session
- Report cost in validation results
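A minimal sketch of the `CostTracker` referenced in the performance test, using the Sonnet 4 pricing from the estimate above; how it is wired into the extractors is an implementation detail:
```python
# Sketch only: pricing constants mirror the estimate above; record() wiring is assumed.
import sys


class CostTracker:
    INPUT_USD_PER_MTOK = 3.0
    OUTPUT_USD_PER_MTOK = 15.0

    def __init__(self):
        self.call_count = 0
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage) -> None:
        """Record one API call from an Anthropic response's `usage` block."""
        self.call_count += 1
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    @property
    def total_cost(self) -> float:
        return (self.input_tokens * self.INPUT_USD_PER_MTOK
                + self.output_tokens * self.OUTPUT_USD_PER_MTOK) / 1_000_000

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        print(f"API calls: {self.call_count}, est. cost: ${self.total_cost:.2f}", file=sys.stderr)
```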
---
## Risks & Mitigation
### Critical Risks (High Impact)
**1. Marker Output Not Available**
- **Impact**: High - Cannot load markdown + images for LLM
- **Probability**: Medium - Depends on preprocessing workflow
- **Mitigation**:
- Check for marker output first (test if `.md` and `_page_*.jpeg` exist)
- Fall back to pypdf text extraction if marker unavailable
- Document marker preprocessing requirement clearly
**2. Data Structure Mismatch**
- **Impact**: High - Breaks existing validation functions
- **Probability**: Medium - Complex transformation logic
- **Mitigation**:
- Implement helper functions carefully with unit tests
- Use `group_fields_by_document()` to merge tenure fields
- Integration test verifies output format matches expectations
**3. Async Client Blocking**
- **Impact**: High - Degrades MCP server performance
- **Probability**: Low - If spec followed correctly
- **Mitigation**:
- Use `AsyncAnthropic` not `Anthropic`
- All `client.messages.create()` calls must be `await`ed
- Test with concurrent requests
### Important Risks (Medium Impact)
**4. Prompt Engineering Iteration**
- **Impact**: Medium - Initial prompts may need tuning
- **Probability**: High - First implementation rarely optimal
- **Mitigation**:
- Build prompt testing harness on Day 1
- Allocate 0.5 days per extractor for tuning
- Use external prompt templates (easy to iterate)
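A sketch of the Day 1 prompt testing harness, assuming a local fixture path; it runs one extractor against a sample document and prints fields for manual review:
```python
# Sketch only: fixture path and script location are assumptions.
import asyncio
from pathlib import Path

from registry_review_mcp.extractors.llm_extractors import DateExtractor


async def main() -> None:
    fixture = Path("tests/fixtures/botany_farm/project_plan.md")  # hypothetical fixture
    extractor = DateExtractor()
    fields = await extractor.extract(fixture.read_text(), images=[], doc_name=fixture.stem)
    for f in sorted(fields, key=lambda f: f.confidence, reverse=True):
        print(f"{f.confidence:.2f}  {f.field_type:<22} {f.value}  ({f.source})")


if __name__ == "__main__":
    asyncio.run(main())
```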
**5. Token Context Limits**
- **Impact**: Medium - Large documents exceed context window
- **Probability**: Low - Most documents < 50 pages
- **Mitigation**:
- Limit to first 10K characters of markdown
- Limit to first 5 images per document
- Log warning if content truncated
**6. Cost Overruns**
- **Impact**: Medium - Budget exceeded
- **Probability**: Low - Hard limits in place
- **Mitigation**:
- 50-call limit per session
- Cost tracking with alerts
- Option to disable LLM extraction
### Standard Risks (Low Impact)
**7. API Rate Limiting**
- **Mitigation**: Retry with exponential backoff, cache aggressively, fall back to regex
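A sketch of the retry-with-backoff wrapper, using the Anthropic SDK's `RateLimitError`; attempt counts and delays are illustrative:
```python
# Sketch only: retry parameters are illustrative, not tuned.
import asyncio
from anthropic import RateLimitError


async def call_with_backoff(coro_factory, max_attempts: int = 3, base_delay: float = 2.0):
    """Retry an API call factory on rate limits, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                              # let cross_validate() fall back to regex
            await asyncio.sleep(base_delay * (2 ** attempt))
```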
**8. LLM Hallucination**
- **Mitigation**: Confidence threshold (≥0.7), reasoning field for audit, validation catches inconsistencies
---
## Success Metrics
### Launch Readiness
- [ ] All P0 acceptance criteria met
- [ ] 5-7 new tests passing
- [ ] All 61 existing tests passing
- [ ] Cost per session < $0.50
- [ ] Accuracy > 90% on Botany Farm
- [ ] Documentation complete
### Post-Launch (2 weeks)
- 95%+ extraction accuracy (vs human review)
- API error rate < 5%
- Cache hit rate > 80%
- Zero production incidents
- ≥5 real-world extractions regex couldn't handle
- Cost tracking shows < $50/month typical usage
---
## Implementation Details
### Async Client Pattern
**CRITICAL:** Use `AsyncAnthropic` not `Anthropic` to avoid blocking MCP event loop.
```python
import base64
from pathlib import Path

from anthropic import AsyncAnthropic

# settings, Cache, ExtractedField, DATE_EXTRACTION_PROMPT and parse_json_response
# are defined elsewhere in llm_extractors.py / the config module.


class DateExtractor:
    def __init__(self):
        self.client = AsyncAnthropic(api_key=settings.anthropic_api_key)
        self.cache = Cache("date_extraction")

    async def extract(self, markdown: str, images: list[Path], doc_name: str) -> list[ExtractedField]:
        # Check cache first to avoid duplicate API calls
        cache_key = f"{doc_name}_dates"
        if cached := self.cache.get(cache_key):
            return [ExtractedField(**f) for f in cached]

        # Build multimodal message (markdown + page images)
        content = [{"type": "text", "text": markdown[:10000]}]   # truncate very large documents
        for img in images[:5]:                                    # limit images per document
            img_data = base64.b64encode(img.read_bytes()).decode()
            content.append({"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": img_data}})

        # Call API (async!)
        response = await self.client.messages.create(
            model=settings.llm_model,
            max_tokens=settings.llm_max_tokens,
            temperature=0.0,
            system=DATE_EXTRACTION_PROMPT,
            messages=[{"role": "user", "content": content}]
        )

        # Parse JSON fields and cache them for the TTL window
        fields = parse_json_response(response.content[0].text)
        self.cache.set(cache_key, [f.model_dump() for f in fields])
        return fields
```
### Data Transformation
**CRITICAL:** Group tenure fields by document before passing to validation.
```python
def group_fields_by_document(fields: list[ExtractedField]) -> list[dict]:
    """Group tenure fields by document into validation-ready format."""
    by_doc = {}
    for field in fields:
        doc_key = extract_doc_id(field.source) or field.source
        if doc_key not in by_doc:
            by_doc[doc_key] = {
                'document_id': extract_doc_id(field.source),
                'document_name': extract_doc_name(field.source),
                'page': extract_page(field.source),
                'source': field.source,
                'confidence': field.confidence
            }
        # Add field value
        if field.field_type == 'owner_name':
            by_doc[doc_key]['owner_name'] = field.value
        elif field.field_type == 'area_hectares':
            by_doc[doc_key]['area_hectares'] = field.value
        elif field.field_type == 'tenure_type':
            by_doc[doc_key]['tenure_type'] = field.value
    return list(by_doc.values())
```
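For example, two fields citing the same land title document merge into a single record; the values and the citation format are illustrative, matching the placeholder format from the helper sketches earlier:
```python
# Illustrative input/output only; the source strings use the placeholder citation format.
fields = [
    ExtractedField(value="Nicholas Denman", field_type="owner_name",
                   source="DOC-003 | Land Title.pdf | page 1", confidence=0.95, reasoning="named owner"),
    ExtractedField(value=120.5, field_type="area_hectares",
                   source="DOC-003 | Land Title.pdf | page 1", confidence=0.90, reasoning="stated area"),
]
grouped = group_fields_by_document(fields)
# -> [{'document_id': 'DOC-003', 'document_name': 'Land Title.pdf', 'page': 1,
#      'source': 'DOC-003 | Land Title.pdf | page 1', 'confidence': 0.95,
#      'owner_name': 'Nicholas Denman', 'area_hectares': 120.5}]
```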
### Integration with cross_validate()
```python
async def cross_validate(session_id: str) -> dict[str, Any]:
    """Run cross-document validation with LLM or regex extraction."""
    state_manager = StateManager(session_id)
    evidence_data = state_manager.read_json("evidence.json")

    # Extract fields with automatic fallback
    try:
        if settings.llm_extraction_enabled and settings.anthropic_api_key:
            print("Using LLM-powered extraction", file=sys.stderr)
            raw_fields = await extract_fields_with_llm(session_id, evidence_data)

            # Transform to validation format
            dates = transform_dates(raw_fields['dates'])
            tenure_fields = group_fields_by_document(raw_fields['tenure'])
            project_ids = transform_ids(raw_fields['project_ids'])

            # Filter by confidence (BEFORE passing to validation)
            dates = [d for d in dates if d['confidence'] >= settings.llm_confidence_threshold]
            tenure_fields = [t for t in tenure_fields if t['confidence'] >= settings.llm_confidence_threshold]
            project_ids = [p for p in project_ids if p['confidence'] >= settings.llm_confidence_threshold]

            extraction_method = "llm"
        else:
            raise ValueError("LLM extraction not available")
    except Exception as e:
        print(f"LLM extraction failed: {e}. Using regex fallback.", file=sys.stderr)
        dates = extract_dates_from_evidence(evidence_data)
        tenure_fields = extract_land_tenure_from_evidence(evidence_data)
        project_ids = extract_project_ids_from_evidence(evidence_data)
        extraction_method = "regex_fallback"

    # Rest of validation logic unchanged
    # ...

    return {
        "summary": {
            "extraction_method": extraction_method,
            # ... other results
        }
    }
```
---
## Related Documents
**Internal Specs:**
- `specs/2025-11-12-registry-review-mcp-REFINED.md` - Main specification
- `docs/PHASE_4_COMPLETION.md` - Current regex implementation
- `docs/PROMPT_DESIGN_PRINCIPLES.md` - Auto-selection patterns
**External References:**
- [Anthropic Agent SDK](https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk)
- `.claude/skills/agentic-engineering/` - Composition patterns
---
## Appendix: Example Improvements
### Date Extraction
**Input:** `"01/01/2022. Monitoring rounds in August – March when soil is dormant."`
**Regex (Phase 4.1):**
- ✅ Extracts: `2022-01-01` as project_start_date
- ❌ Misses: August-March monitoring period
**LLM (Phase 4.2):**
- ✅ Extracts: `2022-01-01` as project_start_date (confidence: 0.98)
- ✅ Extracts: `2022-08-01 to 2023-03-31` as monitoring_period (confidence: 0.85)
- ✅ Provides: Source citations, reasoning for audit trail
### Land Tenure
**Input:** Scanned land registry image + text mentioning "Nick Denman"
**Regex (Phase 4.1):**
- ❌ False positive: "maps dating" as owner
- ❌ Cannot read scanned image
**LLM (Phase 4.2):**
- ✅ Reads image: "Nicholas Denman" from scanned land title
- ✅ Text extraction: "Nick Denman" from project plan
- ✅ Recognizes: Same person (name variation)
- ✅ No false positives
---
**Document Status:** Ready for Implementation
**Approval Required:** Yes
**Next Steps:** Review → Approve → Begin implementation