# OpenParliament & Lipad Bulk Import Guide
Complete guide for importing 120+ years of Canadian parliamentary data into Neo4j.
---
## Overview
### Data Sources
**OpenParliament PostgreSQL Dump**
- **Coverage**: 1994-present
- **Size**: 1.2GB compressed, 6GB uncompressed
- **Updated**: First of each month
- **Contains**: MPs, debates, statements, bills, votes, committees
- **URL**: https://openparliament.ca/data/openparliament.public.sql.bz2
**Lipad Historical Hansard**
- **Coverage**: 1901-1993
- **Formats**: CSV (daily files), XML, PostgreSQL dump
- **Contains**: Historical Hansard debates and statements
- **URL**: https://www.lipad.ca/data/
**Combined Coverage**: 1901-present (120+ years!)
---
## Prerequisites
### 1. Install PostgreSQL
```bash
# macOS
brew install postgresql@14
brew services start postgresql@14
# Create temporary database
createdb openparliament_temp
```
### 2. Install Dependencies
```bash
# Install bzip2 for decompression
brew install bzip2
# Install Python packages
pip install psycopg2-binary requests
```
### 3. Verify Neo4j is Running
```bash
# Neo4j should be accessible at bolt://localhost:7687
# cypher-shell ships with Neo4j; substitute your credentials
cypher-shell -a bolt://localhost:7687 -u neo4j -p <password> "RETURN 1;"
```
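You can also verify connectivity from Python with the official `neo4j` driver (a minimal sketch; the credentials are placeholders, use the values from your `.env`):

```python
from neo4j import GraphDatabase

# Placeholder credentials; substitute your own
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
driver.verify_connectivity()  # raises ServiceUnavailable if Neo4j is down
driver.close()
print("Neo4j is reachable")
```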
---
## Quick Start
### Option 1: Automated Import (Recommended)
```python
from pathlib import Path

from fedmcp_pipeline.utils.neo4j_client import Neo4jClient
from fedmcp_pipeline.utils.config import Config
from fedmcp_pipeline.ingest.bulk_import import OpenParliamentBulkImporter

config = Config(env_file=Path("packages/data-pipeline/.env"))
pg_conn = "postgresql://localhost:5432/openparliament_temp"

with Neo4jClient(config.neo4j_uri, config.neo4j_user, config.neo4j_password) as neo4j:
    importer = OpenParliamentBulkImporter(neo4j, pg_conn)

    # Full import (no limits)
    stats = importer.import_all(
        download=True,    # Download 1.2GB dump
        load_pg=True,     # Load into PostgreSQL (10-20 min)
        batch_size=1000,  # Neo4j batch size
    )
```
**Time**: ~1-2 hours total
**Result**: Complete 1994-present parliamentary database
### Option 2: Step-by-Step Import
```bash
# Run test script
python test_bulk_import.py
```
Follow prompts to:
1. Download PostgreSQL dump
2. Load into temporary database
3. Import into Neo4j
---
## Performance Comparison
### API-Only Approach (Current)
- MPs: under a minute (343 MPs × 0.1s)
- Debates (1994-present): ~16 minutes (10,000 debates × 0.1s)
- Committees: ~1.4 hours (50,000 meetings × 0.1s)
- **Total**: ~2 hours for modern data only
- **Coverage**: 1994-present
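These estimates are straightforward arithmetic over request counts at ~0.1 s per rate-limited request:

```python
# Back-of-envelope check of the API-only totals above
requests = 343 + 10_000 + 50_000  # MPs + debates + committee meetings
hours = requests * 0.1 / 3600     # ~0.1 s per request
print(f"~{hours:.1f} hours")      # ~1.7 hours, i.e. roughly 2 hours
```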
### Bulk Import Approach (New)
- Download: ~10 minutes (1.2GB)
- PostgreSQL load: ~15 minutes
- Neo4j import: ~30 minutes
- **Total**: ~1 hour for the complete 1994-present dataset
- **Coverage**: 1994-present
**Advantage**: Half the time, and the dump delivers the complete dataset (full statements and relationships) instead of paginated API responses!
---
## What Gets Imported
### From OpenParliament Dump
**Nodes**:
- `MP`: All MPs/politicians (current + historical)
  - Properties: name, party, riding, photo, email, phone
- `Debate`: Hansard sittings
  - Properties: date, parliament, session, number
- `Statement`: Individual speeches/statements
  - Properties: content, heading, time, wordcount
- `Committee`: Parliamentary committees
  - Properties: name, code, chamber
- `Party`: Political parties
- `Riding`: Electoral districts
**Relationships**:
- `(MP)-[:SPOKE]->(Statement)`: Who said what
- `(Statement)-[:IN_DEBATE]->(Debate)`: Statements in debates
- `(MP)-[:MEMBER_OF]->(Committee)`: Committee memberships
- `(MP)-[:BELONGS_TO]->(Party)`: Party affiliations
- `(MP)-[:REPRESENTS]->(Riding)`: Electoral representation
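Once loaded, these relationships can be traversed directly. A minimal sketch with the official `neo4j` driver (credentials are placeholders, and the filter assumes `Debate.date` is stored as a Cypher date):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Top 10 speakers since 2020, by statement count
query = """
MATCH (mp:MP)-[:SPOKE]->(s:Statement)-[:IN_DEBATE]->(d:Debate)
WHERE d.date >= date('2020-01-01')
RETURN mp.name AS name, count(s) AS statements
ORDER BY statements DESC
LIMIT 10
"""
with driver.session() as session:
    for record in session.run(query):
        print(record["name"], record["statements"])
driver.close()
```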
### Database Schema (OpenParliament)
Key PostgreSQL tables:
- `core_politician`: MPs/politicians
- `core_politicianinfo`: Contact info
- `core_party`: Political parties
- `core_ridinginfo`: Electoral districts
- `hansards_document`: Debate sittings
- `hansards_statement`: Individual speeches
- `committees_committee`: Committees
- `committees_committeemeeting`: Meetings
- `committees_committeemember`: Memberships
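These tables join naturally in PostgreSQL before anything touches Neo4j. A hedged sketch (the column names `politician_id`, `time`, and `content_en` are assumptions about the dump's schema; confirm with `\d hansards_statement` in psql):

```python
import psycopg2

conn = psycopg2.connect("postgresql://localhost:5432/openparliament_temp")
with conn.cursor() as cur:
    # Most recent statements with their speakers (assumed column names)
    cur.execute("""
        SELECT p.name, s.time, left(s.content_en, 80) AS preview
        FROM hansards_statement s
        JOIN core_politician p ON p.id = s.politician_id
        ORDER BY s.time DESC
        LIMIT 5
    """)
    for name, time, preview in cur.fetchall():
        print(name, time, preview)
conn.close()
```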
---
## Incremental Updates
After initial bulk load, use API for ongoing updates:
```python
# Daily/weekly incremental update
from fedmcp_pipeline.ingest.parliament import ingest_parliament_data

# Only fetch new data since last update
stats = ingest_parliament_data(
    neo4j_client,
    batch_size=1000,
)
```
This is fast because you're only fetching recent data, not all history.
---
## Adding Lipad Historical Data (1901-1993)
### Phase 2: Historical Gap Fill
**Coming Soon**: Lipad import module for pre-1994 data
```python
from fedmcp_pipeline.ingest.lipad_import import LipadHistoricalImporter

# Import 1901-1993 Hansard
lipad = LipadHistoricalImporter(neo4j_client)
stats = lipad.import_historical_hansard(
    source="csv",  # or "xml" or "postgresql"
    data_dir="/path/to/lipad/data",
)
```
**Coverage after both imports**: 1901-present (120+ years!)
---
## Troubleshooting
### PostgreSQL Connection Issues
```bash
# Check PostgreSQL is running
pg_isready
# Test connection
psql -d openparliament_temp -c "SELECT version();"
```
### Disk Space Issues
**Required Space**:
- Download: 1.2GB
- Extracted SQL: ~6GB
- PostgreSQL database: ~7GB
- **Total**: ~15GB free space recommended
**Cleanup after import**:
```bash
# Drop temporary PostgreSQL database
dropdb openparliament_temp
# Remove downloaded files
rm -rf /tmp/openparliament_import
```
### Memory Issues
If import fails with memory errors:
- Reduce `batch_size` from 1000 to 500 or 100 (see the sketch after this list)
- Import in phases (MPs first, then debates, then committees)
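For example, re-run the Quick Start importer with smaller transactions, skipping the download and PostgreSQL load that already succeeded:

```python
# Same importer as in Quick Start; smaller batches lower peak memory
stats = importer.import_all(
    download=False,  # dump already downloaded
    load_pg=False,   # PostgreSQL already loaded
    batch_size=100,  # was 1000
)
```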
### Slow Import
**Optimization tips**:
1. Disable Neo4j indexes during bulk import and re-enable them after (see the sketch after this list)
2. Increase Neo4j heap size in neo4j.conf
3. Use SSD for PostgreSQL data directory
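For tip 1, a minimal sketch of toggling an index around the import (the index name `mp_name` and the indexed property are illustrative; adapt to your schema, Neo4j 4.x+ syntax):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run("DROP INDEX mp_name IF EXISTS")
    # ... run the bulk import here ...
    session.run("CREATE INDEX mp_name IF NOT EXISTS FOR (m:MP) ON (m.name)")
driver.close()
```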
---
## Architecture
### Data Flow
```
OpenParliament Dump (1.2GB .bz2)
↓
Extract with bunzip2
↓
Load into PostgreSQL (6GB)
↓
Query with psycopg2
↓
Transform to Neo4j format
↓
Batch insert into Neo4j
```
### Why PostgreSQL Intermediate Step?
1. **Speed**: Native PostgreSQL queries are faster than parsing SQL
2. **Joins**: Easy to join tables for related data
3. **Flexibility**: Can query subset of data for testing
4. **Recovery**: If Neo4j import fails, don't need to re-download
---
## Maintenance
### Monthly Updates
The OpenParliament dump is refreshed on the 1st of each month:
```bash
#!/bin/bash
# Automated monthly update script
# Add to cron: 0 2 2 * * /path/to/monthly_update.sh
cd /path/to/FedMCP

# Download latest dump
python - <<'PY'
from pathlib import Path
from fedmcp_pipeline.utils.config import Config
from fedmcp_pipeline.utils.neo4j_client import Neo4jClient
from fedmcp_pipeline.ingest.bulk_import import OpenParliamentBulkImporter

config = Config(env_file=Path("packages/data-pipeline/.env"))
pg_conn = "postgresql://localhost:5432/openparliament_temp"
with Neo4jClient(config.neo4j_uri, config.neo4j_user, config.neo4j_password) as neo4j:
    OpenParliamentBulkImporter(neo4j, pg_conn).download_dump()
PY

# Reload PostgreSQL
dropdb openparliament_temp
createdb openparliament_temp
# ... load process ...

# Incremental import to Neo4j: import only new records
```
---
## Cost-Benefit Analysis
### Initial Setup Cost
- **Time**: 1-2 hours (mostly automated)
- **Complexity**: Medium (requires PostgreSQL setup)
- **Disk**: ~15GB temporary space
### Ongoing Benefits
- **Historical Coverage**: 30+ years (vs weeks of API calls)
- **Speed**: Roughly 60x faster than backfilling the same history through rate-limited API calls
- **Completeness**: Full data including relationships
- **Maintenance**: Monthly refresh vs daily API polling
- **Cost**: Free (no API rate limits)
---
## Next Steps
1. ✅ **Run test import**: `python test_bulk_import.py`
2. **Verify data**: Check Neo4j Browser for imported nodes
3. **Update GraphQL schema**: Add Statement, Debate nodes
4. **Build frontend**: Display historical debates
5. **Add Lipad import**: Fill 1901-1993 gap
6. **Schedule updates**: Automate monthly refreshes
---
## Resources
- OpenParliament Data: https://openparliament.ca/data-download/
- Lipad Project: https://www.lipad.ca/data/
- OpenParliament Schema: https://github.com/michaelmulley/openparliament
- PostgreSQL Docs: https://www.postgresql.org/docs/