# Phase 4.2: LLM-Native Field Extraction for Registry Review MCP
**Version:** 2.0.0 (Revised)
**Date:** November 12, 2025
**Status:** Ready for Implementation
**Phase:** 4.2 (Enhancement to Phase 4)
**Target Completion:** 2-3 weeks
---
## Executive Summary
Phase 4.1's regex-based extraction has critical limitations: it only parses MM/DD/YYYY dates, produces false positives ("maps dating" extracted as an owner name), cannot read images, and relies on brittle patterns. **Phase 4.2 upgrades to LLM-native extraction** for universal format support, context awareness, and image analysis.
**Key Improvements:**
- ✅ Any date format (MM/DD/YYYY, "August 15 2022", international)
- ✅ Context-aware disambiguation (project start vs sampling vs imagery dates)
- ✅ Image reading (scanned land titles, maps, tables)
- ✅ Semantic understanding (Nick = Nicholas, fuzzy name matching)
- ✅ Registry-agnostic (Regen, Verra, Gold Standard, CAR)
**Deployment Constraint:** MCP server will be **hosted remotely** and accessed by various agents. All extraction must be **self-contained** in server code.
**Cost:** ~$0.30-0.50 per session | **ROI:** ~400x-1100x (saves 3-5 hours at $50-75/hour; see Cost Analysis)
---
## Problem Statement
### Current Limitations
**Regex extraction failures:**
- False positives: "maps dating" extracted as owner name
- Missed formats: "August 15th, 2022" not parsed
- Poor disambiguation: All dates labeled as 'project_start_date'
- No image support: Cannot read scanned land titles
- Registry-specific: Only works for Regen format (C06-4997)
### Target Capabilities
LLM extraction enables:
1. **Universal formats**: Any date, name, ID pattern
2. **Context understanding**: Distinguishes date types, name variations
3. **Image analysis**: Reads scanned documents via Claude Vision
4. **Confidence scoring**: 0.0-1.0 with reasoning for audit trail
---
## Architecture
### Integration Pattern
```
MCP Server (validation_tools.py)
└─> cross_validate()
    ├─> IF llm_extraction_enabled:
    │   └─> extract_fields_with_llm()  [NEW]
    │       ├─> DateExtractor (AsyncAnthropic)
    │       ├─> LandTenureExtractor (AsyncAnthropic)
    │       └─> ProjectIDExtractor (AsyncAnthropic)
    └─> ELSE:
        └─> Regex extraction (fallback)
```
### Key Components
**1. Extraction Module** (`src/registry_review_mcp/extractors/llm_extractors.py`)
- `ExtractedField`: Pydantic model (value, field_type, source, confidence, reasoning)
- `DateExtractor`: Specialized date extraction with type classification
- `LandTenureExtractor`: Owner names, areas, tenure types (with image support)
- `ProjectIDExtractor`: Registry-agnostic ID patterns
- `extract_fields_with_llm()`: Main interface called by `cross_validate()`
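A minimal sketch of the `ExtractedField` model described above, assuming Pydantic v2; the field names follow the list, while the exact types and constraints are implementation assumptions:
```python
# Sketch only: field names mirror the component list above; types/constraints are assumptions.
from pydantic import BaseModel, Field


class ExtractedField(BaseModel):
    value: str | float              # extracted value (dates normalized to ISO, areas numeric)
    field_type: str                 # e.g. "project_start_date", "owner_name", "area_hectares"
    source: str                     # document/page citation for the audit trail
    confidence: float = Field(ge=0.0, le=1.0)  # 0.0-1.0 per the acceptance criteria
    reasoning: str = ""             # brief LLM justification for the extraction
```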
**2. Cache Layer** (`src/registry_review_mcp/utils/cache.py`)
- File-based TTL cache (24-hour default)
- Namespace-scoped (date_extraction, land_tenure, project_ids)
- Document-level caching to avoid duplicate API calls
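A minimal sketch of the cache layer, assuming a JSON-file-per-key layout under a local `.cache/` directory; the actual storage path and serialization are implementation details:
```python
# Sketch only: on-disk layout and cache directory are assumptions; namespace/TTL behaviour mirrors the spec.
import json
import time
from pathlib import Path
from typing import Any


class Cache:
    def __init__(self, namespace: str, ttl_seconds: int = 24 * 3600,
                 base_dir: Path = Path(".cache")):
        self.dir = base_dir / namespace            # namespace-scoped directory
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds                     # 24-hour default TTL

    def _path(self, key: str) -> Path:
        return self.dir / f"{key}.json"

    def get(self, key: str) -> Any | None:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["stored_at"] > self.ttl:
            return None                            # expired entry counts as a miss
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        self._path(key).write_text(json.dumps({"stored_at": time.time(), "value": value}))
```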
**3. Helper Functions** (in `llm_extractors.py`)
```python
def extract_doc_id(source: str) -> str | None
def extract_doc_name(source: str) -> str
def extract_page(source: str) -> int | None
def group_fields_by_document(fields: list[ExtractedField]) -> list[dict]
```
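Hedged sketches of these helpers, assuming a placeholder citation format like `"DOC-003 | Land Title.pdf | page 3"`; the real `source` format produced by the extractors will dictate the final parsing:
```python
# Sketch only: the "<doc_id> | <doc_name> | page <n>" source format is a placeholder assumption.
import re


def extract_doc_id(source: str) -> str | None:
    """Return a leading document/registry identifier if the citation carries one."""
    match = re.match(r"\s*([A-Z]{1,4}\d*-\d+)", source)
    return match.group(1) if match else None


def extract_doc_name(source: str) -> str:
    """Return the document-name segment of the citation, or the whole string."""
    parts = [p.strip() for p in source.split("|")]
    return parts[1] if len(parts) > 1 else parts[0]


def extract_page(source: str) -> int | None:
    """Return the page number if the citation mentions one."""
    match = re.search(r"page\s+(\d+)", source, re.IGNORECASE)
    return int(match.group(1)) if match else None
```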
### Configuration
```python
# Add to src/registry_review_mcp/config/settings.py
class Settings(BaseSettings):
    # LLM Extraction
    anthropic_api_key: str = Field(default="")
    llm_extraction_enabled: bool = Field(default=False)  # Conservative default
    llm_model: str = Field(default="claude-sonnet-4-20250514")
    llm_max_tokens: int = Field(default=4000)
    llm_temperature: float = Field(default=0.0)
    llm_confidence_threshold: float = Field(default=0.7)

    # Cost Management
    max_api_calls_per_session: int = Field(default=50)
    api_call_timeout_seconds: int = Field(default=30)
```
### Prompt Strategy
**Date Extraction:**
- Role: Date extraction specialist for carbon credit projects
- Task: Extract all dates, classify by type (project_start, imagery, sampling, baseline, monitoring)
- Output: JSON array with value, field_type, source, confidence, reasoning
- Context: Parse any format, use section headers for disambiguation
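A hedged sketch of what the external date-extraction template could look like; the wording is illustrative only and the real template will be tuned during Week 1:
```python
# Sketch only: illustrative prompt text, not the final external template.
DATE_EXTRACTION_PROMPT = """\
You are a date extraction specialist for carbon credit project documents.

Extract every date you find and classify each one as:
project_start_date, imagery_date, sampling_date, baseline_date, or monitoring_period.

Rules:
- Parse any format (MM/DD/YYYY, "August 15th, 2022", international) and normalize to ISO 8601.
- Use surrounding section headers and sentences to disambiguate the date type.
- Return ONLY a JSON array of objects with keys:
  value, field_type, source, confidence (0.0-1.0), reasoning.
"""
```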
**Land Tenure Extraction:**
- Role: Land tenure specialist
- Task: Extract owner names, areas, tenure types
- Special: Read images (scanned land titles), handle name variations
- Output: JSON with owner_name, area_hectares, tenure_type, confidence
**Project ID Extraction:**
- Role: Project ID specialist
- Task: Find all project ID occurrences across documents
- Patterns: Regen (C06-4997), Verra (VCS-1234), Gold Standard (GS-5678), CAR, ACR
- Output: JSON with value, source, confidence
---
## Implementation Plan
### Week 1: Foundation & Date Extraction
**Day 1: Setup**
- Add `anthropic>=0.40.0` to dependencies
- Create `extractors/` module and `ExtractedField` model
- Implement `Cache` class with TTL and namespace support
- Add configuration settings to `settings.py`
**Day 2-3: Date Extractor**
- Implement `DateExtractor` with `AsyncAnthropic` client
- Create specialized date extraction prompt (external template)
- Handle markdown + images from marker output
- Implement caching and error handling
- Unit tests: format parsing, disambiguation, caching
**Day 4: Validation**
- Test against Botany Farm REQ-007 (Project Start Date)
- Verify all date formats recognized
- Compare accuracy vs regex baseline
**Day 5: Land Tenure Extractor**
- Implement `LandTenureExtractor` with image support
- Test with scanned land title images
- Handle name variations (Nick = Nicholas)
### Week 2: IDs, Integration & Delivery
**Day 1: Project ID Extractor**
- Implement `ProjectIDExtractor`
- Registry-agnostic pattern recognition
- Cross-document consistency checking
**Day 2-3: Integration**
- Update `cross_validate()` to call LLM extraction
- Implement helper functions for data transformation
- Add confidence filtering (before transformation)
- Add exception handling with fallback to regex
- Cost tracking (API calls, tokens, estimated USD)
**Day 4: Testing**
- Integration test: LLM output compatible with validation functions
- Feature toggle test: Switch between LLM and regex
- Error handling tests: API failures, low confidence, corrupted images
- Performance benchmark: latency, API calls, cost
**Day 5: Documentation & Deployment**
- Update README with LLM extraction configuration
- Document environment variables for remote deployment
- Create deployment validation checklist
- Test in a Docker container (remote hosting is planned per the deployment constraint)
### Decision Framework
**Model Selection:**
- Default: Claude Sonnet 4 (best quality)
- If cost > $0.50/session: Test Claude Haiku
- Configurable via `llm_model` setting
**Caching Strategy:**
- Per-document hash (invalidates on content change)
- 24-hour TTL
- Session-scoped for isolation
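A sketch of the per-document cache key, assuming the key embeds a hash of the marker markdown so that any content change invalidates the entry:
```python
# Sketch only: hashing document content is one way to get the invalidation behaviour above.
import hashlib


def document_cache_key(doc_name: str, markdown: str, namespace: str) -> str:
    content_hash = hashlib.sha256(markdown.encode("utf-8")).hexdigest()[:16]
    return f"{doc_name}_{namespace}_{content_hash}"   # changes whenever the content changes
```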
**Fallback Logic:**
- LLM extraction fails → Regex extraction
- API unavailable → Regex extraction
- Confidence < threshold → Exclude from validation
---
## Acceptance Criteria
### Must Have (P0)
**Functional:**
- [ ] Extract dates in any format (MM/DD/YYYY, "August 15 2022", international)
- [ ] Correctly disambiguate date types (project start vs sampling vs imagery)
- [ ] Handle name variations ("Nick" = "Nicholas") without false negatives
- [ ] Read scanned land title images (OCR capability)
- [ ] Confidence scores 0.0-1.0 for all extracted fields
- [ ] Confidence filtering prevents low-quality extractions from validation
- [ ] Fallback to regex if API key not set or API unavailable
- [ ] All 61 existing tests pass (no regressions)
**Technical:**
- [ ] Use `AsyncAnthropic` client (non-blocking)
- [ ] Implement `Cache` class with TTL and namespacing
- [ ] Helper functions implemented (`extract_doc_id`, `extract_doc_name`, `extract_page`)
- [ ] Data structures grouped correctly (tenure fields merged by document)
- [ ] Exception handling with automatic fallback
### Should Have (P1)
**Quality:**
- [ ] 95%+ accuracy on date extraction (vs Botany Farm ground truth)
- [ ] 90%+ accuracy on land tenure extraction
- [ ] 100% accuracy on project ID extraction
- [ ] Zero "maps dating" false positives
**Performance:**
- [ ] Full extraction < 30 seconds
- [ ] < 20 API calls per session
- [ ] Cache hit rate > 80% on repeat runs
- [ ] Cost per session < $0.50
**Observability:**
- [ ] Cost tracking (API calls, tokens, estimated USD)
- [ ] Extraction method logged (llm vs regex)
- [ ] Confidence distribution logged
- [ ] API error rate < 5%
### Nice to Have (P2)
- [ ] Parallel extraction for multiple documents
- [ ] Retry logic with exponential backoff
- [ ] Multiple model support (Sonnet → Haiku fallback)
- [ ] Streaming responses for large documents
---
## Testing Strategy
### Test Suite (5-7 tests)
**1. Integration Test: Contract Compatibility**
```python
async def test_llm_extraction_output_compatible_with_validation():
    """Verify LLM extractor output matches validation function expectations."""
    # Extract with LLM and transform as cross_validate() would
    raw_fields = await extract_fields_with_llm(session_id, evidence_data)
    llm_dates = transform_dates(raw_fields['dates'])
    llm_tenure = group_fields_by_document(raw_fields['tenure'])
    llm_ids = transform_ids(raw_fields['project_ids'])

    # Verify validation functions accept the transformed output
    date_result = await validate_date_alignment(..., llm_dates[0]['date_value'], ...)
    tenure_result = await validate_land_tenure(..., fields=llm_tenure)
    id_result = await validate_project_id(..., occurrences=llm_ids)

    for result in (date_result, tenure_result, id_result):
        assert result is not None  # each validator accepts the LLM output shape
```
**2. Accuracy Test: Ground Truth Comparison**
```python
async def test_llm_extraction_accuracy_botany_farm():
    """Verify extraction achieves target accuracy on Botany Farm."""
    GROUND_TRUTH = {
        'dates': [('2022-01-01', 'project_start_date'), ...],
        'tenure': [('Nicholas Denman', 'owner_name'), (120.5, 'area_hectares')],
        'project_ids': ['4997']
    }
    extracted = await extract_fields_with_llm(session_id, evidence_data)
    precision, recall = calculate_metrics(extracted, GROUND_TRUTH)
    assert precision >= 0.95 and recall >= 0.90

    # Specific Phase 4.1 regression: "maps dating" must not appear as an owner name
    owner_names = [f.value for f in extracted['tenure'] if f.field_type == 'owner_name']
    assert "maps dating" not in owner_names
```
**3. Performance Test: Cost & Latency**
```python
async def test_extraction_performance_and_cost():
    """Verify extraction meets performance and cost targets."""
    tracker = CostTracker()

    # Cold run
    with tracker, TimeIt() as timer1:
        result1 = await extract_fields_with_llm(session_id, evidence_data)

    # Cached run
    with tracker, TimeIt() as timer2:
        result2 = await extract_fields_with_llm(session_id, evidence_data)

    assert timer1.elapsed < 30.0   # Cold < 30s
    assert timer2.elapsed < 5.0    # Cached < 5s
    assert tracker.call_count < 20
    assert tracker.total_cost < 0.50
    assert result1 == result2      # Caching works
```
**4. Feature Toggle Test**
```python
async def test_toggle_between_llm_and_regex():
    """Verify validation can switch between LLM and regex extraction."""
    # LLM extraction
    settings.llm_extraction_enabled = True
    llm_result = await cross_validate(session_id)

    # Regex extraction
    settings.llm_extraction_enabled = False
    regex_result = await cross_validate(session_id)

    assert llm_result['summary']['extraction_method'] == 'llm'
    assert regex_result['summary']['extraction_method'] == 'regex_fallback'
    # LLM is expected to recognize at least as many date formats as the regex baseline
```
**5. Error Handling: API Failure**
```python
async def test_api_failure_fallback_to_regex():
    """Verify graceful degradation when API fails."""
    with mock_api_error(APIError("Rate limit exceeded")):
        result = await cross_validate(session_id)

    assert result is not None  # validation still completed via regex fallback
    assert result['summary']['extraction_method'] == 'regex_fallback'
```
**6-7. Additional Error Tests**
- Low confidence filtering (exclude extractions < 0.7)
- Corrupted image handling (continue with text-only)
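A sketch of the low-confidence filtering test, assuming a hypothetical `mock_llm_extraction()` helper for injecting canned extraction results:
```python
# Sketch only: mock_llm_extraction() is a hypothetical test helper, not an existing fixture.
async def test_low_confidence_extractions_excluded():
    """Fields below the confidence threshold must not reach validation."""
    low = ExtractedField(value="1999-01-01", field_type="project_start_date",
                         source="doc", confidence=0.3, reasoning="ambiguous mention")
    high = ExtractedField(value="2022-01-01", field_type="project_start_date",
                          source="doc", confidence=0.95, reasoning="explicit statement")

    with mock_llm_extraction(dates=[low, high], tenure=[], project_ids=[]):
        result = await cross_validate(session_id)

    assert result["summary"]["extraction_method"] == "llm"
    # Only the high-confidence date should have been passed to validation.
```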
---
## Cost Analysis
### Per-Session Estimate
**Assumptions:**
- Model: Claude Sonnet 4 ($3/MTok input, $15/MTok output)
- Calls per session: ~15 (5 dates, 5 tenure, 5 IDs)
- Average tokens: 5K input, 500 output per call
**Calculation:**
```
Input: 15 × 5,000 tokens = 75,000 tokens = $0.23
Output: 15 × 500 tokens = 7,500 tokens = $0.11
Total: $0.34 per session
```
**Monthly (100 reviews):** $34/month
**ROI:**
- Time saved: 3-5 hours per review
- Value: $150-375 (at $50-75/hour)
- Cost: $0.34
- **ROI: 441x - 1103x**
### Cost Management
**Safeguards:**
1. Hard limit: 50 API calls per session
2. Caching: Never extract same document twice
3. Confidence threshold: Only validate high-quality extractions (≥0.7)
4. Timeout: 30-second timeout per call
**Monitoring:**
- Log API calls to stderr
- Track tokens per session
- Report cost in validation results
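A minimal sketch of the `CostTracker` referenced in the performance test, using the Sonnet 4 pricing from the estimate above; how it is wired into the extractors is an implementation detail:
```python
# Sketch only: pricing constants mirror the estimate above; record() wiring is assumed.
import sys


class CostTracker:
    INPUT_USD_PER_MTOK = 3.0
    OUTPUT_USD_PER_MTOK = 15.0

    def __init__(self):
        self.call_count = 0
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage) -> None:
        """Record one API call from an Anthropic response's `usage` block."""
        self.call_count += 1
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    @property
    def total_cost(self) -> float:
        return (self.input_tokens * self.INPUT_USD_PER_MTOK
                + self.output_tokens * self.OUTPUT_USD_PER_MTOK) / 1_000_000

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        print(f"API calls: {self.call_count}, est. cost: ${self.total_cost:.2f}", file=sys.stderr)
```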
---
## Risks & Mitigation
### Critical Risks (High Impact)
**1. Marker Output Not Available**
- **Impact**: High - Cannot load markdown + images for LLM
- **Probability**: Medium - Depends on preprocessing workflow
- **Mitigation**:
- Check for marker output first (test if `.md` and `_page_*.jpeg` exist)
- Fall back to pypdf text extraction if marker unavailable
- Document marker preprocessing requirement clearly
**2. Data Structure Mismatch**
- **Impact**: High - Breaks existing validation functions
- **Probability**: Medium - Complex transformation logic
- **Mitigation**:
- Implement helper functions carefully with unit tests
- Use `group_fields_by_document()` to merge tenure fields
- Integration test verifies output format matches expectations
**3. Async Client Blocking**
- **Impact**: High - Degrades MCP server performance
- **Probability**: Low - If spec followed correctly
- **Mitigation**:
- Use `AsyncAnthropic` not `Anthropic`
- All `client.messages.create()` calls must be `await`ed
- Test with concurrent requests
### Important Risks (Medium Impact)
**4. Prompt Engineering Iteration**
- **Impact**: Medium - Initial prompts may need tuning
- **Probability**: High - First implementation rarely optimal
- **Mitigation**:
- Build prompt testing harness on Day 1
- Allocate 0.5 days per extractor for tuning
- Use external prompt templates (easy to iterate)
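A sketch of the Day 1 prompt testing harness, assuming a local fixture path; it runs one extractor against a sample document and prints fields for manual review:
```python
# Sketch only: fixture path and script location are assumptions.
import asyncio
from pathlib import Path

from registry_review_mcp.extractors.llm_extractors import DateExtractor


async def main() -> None:
    fixture = Path("tests/fixtures/botany_farm/project_plan.md")  # hypothetical fixture
    extractor = DateExtractor()
    fields = await extractor.extract(fixture.read_text(), images=[], doc_name=fixture.stem)
    for f in sorted(fields, key=lambda f: f.confidence, reverse=True):
        print(f"{f.confidence:.2f}  {f.field_type:<22} {f.value}  ({f.source})")


if __name__ == "__main__":
    asyncio.run(main())
```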
**5. Token Context Limits**
- **Impact**: Medium - Large documents exceed context window
- **Probability**: Low - Most documents < 50 pages
- **Mitigation**:
- Limit to first 10K characters of markdown
- Limit to first 5 images per document
- Log warning if content truncated
**6. Cost Overruns**
- **Impact**: Medium - Budget exceeded
- **Probability**: Low - Hard limits in place
- **Mitigation**:
- 50-call limit per session
- Cost tracking with alerts
- Option to disable LLM extraction
### Standard Risks (Low Impact)
**7. API Rate Limiting**
- **Mitigation**: Retry with exponential backoff, cache aggressively, fall back to regex
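A sketch of the retry-with-backoff wrapper, using the Anthropic SDK's `RateLimitError`; attempt counts and delays are illustrative:
```python
# Sketch only: retry parameters are illustrative, not tuned.
import asyncio
from anthropic import RateLimitError


async def call_with_backoff(coro_factory, max_attempts: int = 3, base_delay: float = 2.0):
    """Retry an API call factory on rate limits, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                              # let cross_validate() fall back to regex
            await asyncio.sleep(base_delay * (2 ** attempt))
```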
**8. LLM Hallucination**
- **Mitigation**: Confidence threshold (≥0.7), reasoning field for audit, validation catches inconsistencies
---
## Success Metrics
### Launch Readiness
- [ ] All P0 acceptance criteria met
- [ ] 5-7 new tests passing
- [ ] All 61 existing tests passing
- [ ] Cost per session < $0.50
- [ ] Accuracy > 90% on Botany Farm
- [ ] Documentation complete
### Post-Launch (2 weeks)
- 95%+ extraction accuracy (vs human review)
- API error rate < 5%
- Cache hit rate > 80%
- Zero production incidents
- ≥5 real-world extractions regex couldn't handle
- Cost tracking shows < $50/month typical usage
---
## Implementation Details
### Async Client Pattern
**CRITICAL:** Use `AsyncAnthropic` not `Anthropic` to avoid blocking MCP event loop.
```python
import base64
from pathlib import Path

from anthropic import AsyncAnthropic

# settings, Cache, ExtractedField, DATE_EXTRACTION_PROMPT and parse_json_response
# are defined elsewhere in llm_extractors.py / the config module.


class DateExtractor:
    def __init__(self):
        self.client = AsyncAnthropic(api_key=settings.anthropic_api_key)
        self.cache = Cache("date_extraction")

    async def extract(self, markdown: str, images: list[Path], doc_name: str) -> list[ExtractedField]:
        # Check cache first to avoid duplicate API calls
        cache_key = f"{doc_name}_dates"
        if cached := self.cache.get(cache_key):
            return [ExtractedField(**f) for f in cached]

        # Build multimodal message (markdown + page images)
        content = [{"type": "text", "text": markdown[:10000]}]   # truncate very large documents
        for img in images[:5]:                                    # limit images per document
            img_data = base64.b64encode(img.read_bytes()).decode()
            content.append({"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": img_data}})

        # Call API (async!)
        response = await self.client.messages.create(
            model=settings.llm_model,
            max_tokens=settings.llm_max_tokens,
            temperature=0.0,
            system=DATE_EXTRACTION_PROMPT,
            messages=[{"role": "user", "content": content}]
        )

        # Parse JSON fields and cache them for the TTL window
        fields = parse_json_response(response.content[0].text)
        self.cache.set(cache_key, [f.model_dump() for f in fields])
        return fields
```
### Data Transformation
**CRITICAL:** Group tenure fields by document before passing to validation.
```python
def group_fields_by_document(fields: list[ExtractedField]) -> list[dict]:
    """Group tenure fields by document into validation-ready format."""
    by_doc = {}
    for field in fields:
        doc_key = extract_doc_id(field.source) or field.source
        if doc_key not in by_doc:
            by_doc[doc_key] = {
                'document_id': extract_doc_id(field.source),
                'document_name': extract_doc_name(field.source),
                'page': extract_page(field.source),
                'source': field.source,
                'confidence': field.confidence
            }
        # Add field value
        if field.field_type == 'owner_name':
            by_doc[doc_key]['owner_name'] = field.value
        elif field.field_type == 'area_hectares':
            by_doc[doc_key]['area_hectares'] = field.value
        elif field.field_type == 'tenure_type':
            by_doc[doc_key]['tenure_type'] = field.value
    return list(by_doc.values())
```
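For example, two fields citing the same land title document merge into a single record; the values and the citation format are illustrative, matching the placeholder format from the helper sketches earlier:
```python
# Illustrative input/output only; the source strings use the placeholder citation format.
fields = [
    ExtractedField(value="Nicholas Denman", field_type="owner_name",
                   source="DOC-003 | Land Title.pdf | page 1", confidence=0.95, reasoning="named owner"),
    ExtractedField(value=120.5, field_type="area_hectares",
                   source="DOC-003 | Land Title.pdf | page 1", confidence=0.90, reasoning="stated area"),
]
grouped = group_fields_by_document(fields)
# -> [{'document_id': 'DOC-003', 'document_name': 'Land Title.pdf', 'page': 1,
#      'source': 'DOC-003 | Land Title.pdf | page 1', 'confidence': 0.95,
#      'owner_name': 'Nicholas Denman', 'area_hectares': 120.5}]
```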
### Integration with cross_validate()
```python
async def cross_validate(session_id: str) -> dict[str, Any]:
    """Run cross-document validation with LLM or regex extraction."""
    state_manager = StateManager(session_id)
    evidence_data = state_manager.read_json("evidence.json")

    # Extract fields with automatic fallback
    try:
        if settings.llm_extraction_enabled and settings.anthropic_api_key:
            print("Using LLM-powered extraction", file=sys.stderr)
            raw_fields = await extract_fields_with_llm(session_id, evidence_data)

            # Transform to validation format
            dates = transform_dates(raw_fields['dates'])
            tenure_fields = group_fields_by_document(raw_fields['tenure'])
            project_ids = transform_ids(raw_fields['project_ids'])

            # Filter by confidence (BEFORE passing to validation)
            dates = [d for d in dates if d['confidence'] >= settings.llm_confidence_threshold]
            tenure_fields = [t for t in tenure_fields if t['confidence'] >= settings.llm_confidence_threshold]
            project_ids = [p for p in project_ids if p['confidence'] >= settings.llm_confidence_threshold]

            extraction_method = "llm"
        else:
            raise ValueError("LLM extraction not available")
    except Exception as e:
        print(f"LLM extraction failed: {e}. Using regex fallback.", file=sys.stderr)
        dates = extract_dates_from_evidence(evidence_data)
        tenure_fields = extract_land_tenure_from_evidence(evidence_data)
        project_ids = extract_project_ids_from_evidence(evidence_data)
        extraction_method = "regex_fallback"

    # Rest of validation logic unchanged
    # ...

    return {
        "summary": {
            "extraction_method": extraction_method,
            # ... other results
        }
    }
```
---
## Related Documents
**Internal Specs:**
- `specs/2025-11-12-registry-review-mcp-REFINED.md` - Main specification
- `docs/PHASE_4_COMPLETION.md` - Current regex implementation
- `docs/PROMPT_DESIGN_PRINCIPLES.md` - Auto-selection patterns
**External References:**
- [Anthropic Agent SDK](https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk)
- `.claude/skills/agentic-engineering/` - Composition patterns
---
## Appendix: Example Improvements
### Date Extraction
**Input:** `"01/01/2022. Monitoring rounds in August – March when soil is dormant."`
**Regex (Phase 4.1):**
- ✅ Extracts: `2022-01-01` as project_start_date
- ❌ Misses: August-March monitoring period
**LLM (Phase 4.2):**
- ✅ Extracts: `2022-01-01` as project_start_date (confidence: 0.98)
- ✅ Extracts: `2022-08-01 to 2023-03-31` as monitoring_period (confidence: 0.85)
- ✅ Provides: Source citations, reasoning for audit trail
### Land Tenure
**Input:** Scanned land registry image + text mentioning "Nick Denman"
**Regex (Phase 4.1):**
- ❌ False positive: "maps dating" as owner
- ❌ Cannot read scanned image
**LLM (Phase 4.2):**
- ✅ Reads image: "Nicholas Denman" from scanned land title
- ✅ Text extraction: "Nick Denman" from project plan
- ✅ Recognizes: Same person (name variation)
- ✅ No false positives
---
**Document Status:** Ready for Implementation
**Approval Required:** Yes
**Next Steps:** Review → Approve → Begin implementation