OPENPARLIAMENT_INGESTION_STATUS.md•8.13 kB
# OpenParliament PostgreSQL → Neo4j Ingestion Status
## Overview
Migrating OpenParliament PostgreSQL database data into Neo4j graph database to fill identified gaps:
- **3.67M Hansard statements**
- **5,280 bill texts**
- **35K politician info records**
- **21K election candidacies**
## Completed Modules ✅
### 1. Hansard Statements (`hansard.py`)
**Status**: Module created and currently testing
**Records**: 3.67M statements, 18,416 documents
**File**: `/Users/matthewdufresne/FedMCP/packages/data-pipeline/fedmcp_pipeline/ingest/hansard.py` (562 lines)
**Functions**:
- `ingest_hansard_documents()` - Import debate documents
- `ingest_hansard_statements()` - Import 3.67M statement records
- `link_statements_to_mps()` - Create MADE_BY relationships
- `link_statements_to_documents()` - Create PART_OF relationships
- `link_bill_mentions()` - Create MENTIONS relationships for bill references
- `create_hansard_indexes()` - Create full-text search indexes
- `ingest_hansard_sample()` - Sample import (1,000 statements)
- `ingest_hansard_full()` - Full import (3.67M statements)
**Test Results** (currently running in background):
```
✅ Created 18,416 Document nodes
✅ Created 1,000 Statement nodes
⚠️ 0 MADE_BY relationships (MPs not yet imported from PostgreSQL)
🔄 Creating PART_OF relationships (155,000+ created and counting)
```
**Fixes Applied**:
1. Fixed SQL query error: Changed `SELECT DISTINCT document_id ORDER BY time` to include `time` in SELECT list
2. Fixed API method naming: Changed all `execute_query` to `run_query` throughout module
### 2. Bill Texts (`bill_text.py`)
**Status**: Module created, testing pending
**Records**: 5,280 bill texts with English/French text and summaries
**File**: `/Users/matthewdufresne/FedMCP/packages/data-pipeline/fedmcp_pipeline/ingest/bill_text.py`
**Functions**:
- `ingest_bill_texts()` - Import bill text records
- `link_texts_to_bills()` - Create HAS_TEXT relationships
- `create_bill_text_schema()` - Create indexes and constraints
- `ingest_bill_text_sample()` - Sample import (10 texts)
- `ingest_bill_text_full()` - Full import (5,280 texts)
**Neo4j Schema**:
- Node: `BillText` with properties: id, docid, text_en, text_fr, summary_en, created
- Relationship: `(Bill)-[:HAS_TEXT]->(BillText)`
- Full-text indexes on text_en, text_fr, summary_en
### 3. Politician Info (`politician_info.py`)
**Status**: Module created, testing pending
**Records**: 2,958 politicians with 38,641 info records
**File**: `/Users/matthewdufresne/FedMCP/packages/data-pipeline/fedmcp_pipeline/ingest/politician_info.py`
**Functions**:
- `enrich_politician_info()` - Add biographical data to existing Politician nodes
- `ingest_politician_info_sample()` - Sample enrichment (10 politicians)
- `ingest_politician_info_full()` - Full enrichment (2,958 politicians)
**Info Schema Types** (18 total, top 10):
- alternate_name: 14,731 records
- parl_affil_id: 12,569 records
- parl_id: 2,958 records
- favourite_word: 1,315 records
- email: 921 records
- phone: 920 records
- constituency_offices: 916 records
- parl_mp_id: 886 records
- parlinfo_id: 751 records
- twitter_id: 539 records
## In Progress 🔄
### Hansard Sample Test
**Background Process**: Process ID `95adb9`
**Log File**: `/tmp/hansard_test_v3.log` (note: filename has space, actual path may vary)
**Status**: Creating PART_OF relationships (155,000+ created, still running)
**Expected Completion**: Should finish within next 10-20 minutes
## Remaining Work ⏳
### 1. Elections Module (`elections.py`)
**Not Yet Created**
**Records**: 21,000 election candidacies
**PostgreSQL Table**: `elections_candidacy`
**Planned Functions**:
- `ingest_election_candidacies()` - Import candidacy records
- `link_candidacies_to_politicians()` - Create RAN_FOR relationships
- `ingest_elections_sample()` - Sample import (50 candidacies)
- `ingest_elections_full()` - Full import (21K candidacies)
**Schema Info Needed**:
```sql
-- Run to get table structure:
\d elections_candidacy
-- Key fields likely include:
-- - politician_id (FK to core_politician)
-- - election_id
-- - riding/constituency
-- - party
-- - elected (boolean)
-- - vote_count
```
### 2. Comprehensive Test Script
**Not Yet Created**
**File**: `test_sample_imports.py`
**Should Test**:
1. Hansard sample (1,000 statements)
2. Bill text sample (10 texts)
3. Politician info sample (10 politicians)
4. Elections sample (50 candidacies)
**Validation Checks**:
- Node counts match expected
- Relationship counts correct
- Sample data queries work
- Full-text search indexes functional
### 3. Testing & Validation
**Status**: Only Hansard test running so far
**Remaining Tests**:
- [ ] Test bill_text.py sample import
- [ ] Test politician_info.py sample enrichment
- [ ] Create and test elections.py
- [ ] Run comprehensive test script with all modules
### 4. Documentation
**Optional Final Step**:
- Import guide for each module
- Performance benchmarks
- Common issues and solutions
## Database Connection Info
**PostgreSQL** (OpenParliament):
```
Host: localhost
Port: 5432
Database: openparliament
User: fedmcp
Password: fedmcp2024
```
**Neo4j** (Graph Database):
```
URI: bolt://localhost:7687
User: neo4j
Password: canadagpt2024
```
**Environment File**: `/Users/matthewdufresne/FedMCP/packages/data-pipeline/.env`
## Next Steps
1. **Wait for Hansard test to complete** (~10-20 minutes)
- Check progress: `tail -f /tmp/hansard_test_v3.log`
- Or check background process status
2. **Test bill_text.py module**:
```bash
/Users/matthewdufresne/FedMCP/venv/bin/python \
/Users/matthewdufresne/FedMCP/test_bill_text_sample.py
```
3. **Create elections.py module**:
- Check PostgreSQL schema: `\d elections_candidacy`
- Follow pattern from hansard.py and bill_text.py
- Include sample and full import functions
4. **Create comprehensive test script**:
- Test all 4 modules in sequence
- Validate counts and relationships
- Report any errors or gaps
5. **Run full imports** (after sample tests pass):
- Hansard: 3.67M statements (~2-4 hours estimated)
- Bill texts: 5,280 texts (~5-10 minutes)
- Politician info: 2,958 politicians (~2-5 minutes)
- Elections: 21K candidacies (~5-10 minutes)
## Key Files
```
/Users/matthewdufresne/FedMCP/packages/data-pipeline/fedmcp_pipeline/
├── ingest/
│ ├── __init__.py
│ ├── hansard.py (✅ created, 562 lines)
│ ├── bill_text.py (✅ created)
│ ├── politician_info.py (✅ created)
│ └── elections.py (⏳ not yet created)
├── utils/
│ ├── neo4j_client.py (✅ existing)
│ ├── postgres_client.py (✅ existing)
│ └── progress.py (✅ existing)
/Users/matthewdufresne/FedMCP/
├── test_hansard_sample.py (✅ created, running)
├── test_bill_text_sample.py (✅ created, not tested)
├── test_sample_imports.py (⏳ not yet created)
└── packages/data-pipeline/.env (✅ configured)
```
## Performance Notes
- **Hansard**: Sample of 1,000 statements took ~12 minutes (including 18K documents, 155K+ relationships)
- **Batch Sizes**:
- Documents: 5,000 per batch
- Statements: 10,000 per batch
- Relationships: Created iteratively with progress tracking
- **Full Import Estimates**:
- 3.67M statements: ~2-4 hours
- Creating all relationships: additional 1-2 hours
## Issues Fixed
1. **PostgreSQL DISTINCT/ORDER BY Error**:
- Error: "ORDER BY expressions must appear in select list" with SELECT DISTINCT
- Fix: Added ORDER BY column to SELECT, then filtered unique values in Python
2. **Neo4j API Method Name**:
- Error: 'Neo4jClient' object has no attribute 'execute_query'
- Fix: Changed to correct method name `run_query` throughout codebase
3. **Test Script API Consistency**:
- Updated test_hansard_sample.py to use correct `run_query` method
- Ensures consistency across test scripts and modules
---
**Last Updated**: 2025-11-05 04:37 AM
**Status**: 3 of 4 modules created, Hansard test in progress
**Next**: Complete Hansard test validation, then proceed with remaining modules