FedMCP - Federal Parliamentary Information

MIT License

INGESTION_INFRASTRUCTURE_COMPLETE.md•11.9 kB

# FedMCP Data Ingestion Infrastructure - COMPLETE ✅

## Summary

Your complete data ingestion infrastructure is now operational on GCP! The VM is fully configured and ready to load 120+ years of Canadian parliamentary data into Neo4j.

## What Was Built

### 1. Ingestion VM (canadagpt-ingestion)

- **Machine Type**: n2-standard-4 (4 vCPU, 16GB RAM, 150GB SSD)
- **Zone**: us-central1-a
- **Internal IP**: 10.128.0.4 (connects to Neo4j via internal network)
- **External IP**: 35.193.249.101

### 2. Fully Configured Environment

- ✅ PostgreSQL 14 installed with `openparliament_temp` database
- ✅ Python 3.11 with virtual environment
- ✅ All Python dependencies installed (neo4j, psycopg2-binary, pandas, etc.)
- ✅ FedMCP repository cloned (2.0GB)
- ✅ Environment variables configured (.env file)
- ✅ Neo4j connection tested (bolt://10.128.0.3:7687)
- ✅ Log directory created (~/ingestion_logs/)

### 3. Automated Scripts Created

**Master Ingestion Script**: `~/FedMCP/scripts/run-full-ingestion.sh`

- Runs complete historical import (1901-present with Lipad)
- Includes all accountability data (lobbying, expenses, petitions)
- Logs everything to timestamped files
- Estimated time: 3-4 hours

**Weekly Update Script**: `~/FedMCP/scripts/update-recent-data.sh`

- Updates recent parliamentary data (2022-present)
- Updates lobbying registry, expenses, and petitions
- Can be automated with cron for ongoing operations
- Estimated time: 20 minutes

### 4. Documentation

- Complete guide on VM: `~/README_INGESTION.md`
- Setup guide (this file): `INGESTION_INFRASTRUCTURE_COMPLETE.md`
- Original planning doc: `INGESTION_VM_COMPLETE_GUIDE.md`

## Next Steps: Choose Your Option

### Option A: Quick Start (1994-Present) - 90 minutes

**No prerequisites needed - start immediately!**

```bash
# 1. SSH to the VM
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# 2. Start a tmux session (so it keeps running if you disconnect)
tmux new -s ingestion

# 3. Run the modern bulk import
cd ~/FedMCP
source packages/data-pipeline/venv/bin/activate
python3 test_bulk_import.py 2>&1 | tee ~/ingestion_logs/bulk_import.log

# 4. Detach from tmux (Ctrl+b then d)
# You can disconnect from SSH - it will keep running!
# To reattach later: tmux attach -t ingestion
```

**Expected Results:**

- ~3,000,000 nodes
- Parliamentary data from 1994-2025
- Debates, MPs, Bills, Votes, Committees
- Plus lobbying, expenses, petitions

### Option B: Complete History (1901-Present) - 3-4 hours

**Requires Lipad data download first**

#### Step 1: Download Lipad Data (on your Mac)

```bash
# Visit https://www.lipad.ca/data/ and download CSV format
# Then upload to Google Cloud Storage:
gsutil mb -p canada-gpt-ca -l us-central1 gs://canada-gpt-ca-lipad-data
gsutil cp ~/Downloads/lipad_*.csv gs://canada-gpt-ca-lipad-data/
```

#### Step 2: Run Complete Import (on VM)

```bash
# SSH to VM
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# Download Lipad data from GCS
mkdir -p ~/lipad_data
gsutil -m cp gs://canada-gpt-ca-lipad-data/* ~/lipad_data/

# Start tmux and run complete import
tmux new -s ingestion
cd ~/FedMCP
bash scripts/run-full-ingestion.sh

# Detach: Ctrl+b then d
```

**Expected Results:**

- ~6,000,000 nodes
- Parliamentary data from 1901-2025
- Complete historical coverage
- Plus lobbying, expenses, petitions

## Monitoring Progress

### Check Running Status

```bash
# Reattach to tmux
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
tmux attach -t ingestion

# Or check processes
ps aux | grep python
```

### View Logs

```bash
# SSH to VM
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# List logs
ls -lh ~/ingestion_logs/

# Tail current log
tail -f ~/ingestion_logs/*.log
```

### Query Neo4j Stats

```bash
# SSH to VM and run Python
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
cd ~/FedMCP
source packages/data-pipeline/venv/bin/activate
python3 << 'EOF'
from neo4j import GraphDatabase
driver = GraphDatabase.driver('bolt://10.128.0.3:7687', auth=('neo4j', 'canadagpt2024'))
session = driver.session()

# Node counts by label
result = session.run("MATCH (n) RETURN labels(n)[0] as label, count(*) as count ORDER BY count DESC LIMIT 20")
print("\nTop Node Types:")
for record in result:
    print(f"  {record['label']:20s}: {record['count']:>10,}")

# Total
result = session.run("MATCH (n) RETURN count(n) as total")
print(f"\nTotal Nodes: {result.single()['total']:,}")

session.close()
driver.close()
EOF
```

## Cost Management

### Stop VM When Not Needed (RECOMMENDED)

Stopping the VM saves money: you pay only for disk storage (~$5/month) instead of compute (~$100/month).

```bash
# Stop VM (from your Mac)
gcloud compute instances stop canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# Start when needed
gcloud compute instances start canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
```

### Weekly Updates

After the initial load completes, run weekly updates to keep data current:

```bash
# SSH to VM
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# Run update script
cd ~/FedMCP
bash scripts/update-recent-data.sh
```

**Automate with Cron** (optional):

```bash
# SSH to VM
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca

# Add weekly cron job (Sundays at 2 AM)
crontab -e

# Add this line:
0 2 * * 0 $HOME/FedMCP/scripts/update-recent-data.sh
```

## Verifying Success

After the import completes:

1. **Check Neo4j node count** (should be 3M or 6M depending on option)
2. **Test GraphQL API**:

   ```bash
   curl -X POST https://canadagpt-graph-api-213428056473.us-central1.run.app/graphql \
     -H "Content-Type: application/json" \
     -d '{"query": "{ politicians(options: {limit: 5}) { name party } }"}'
   ```

3. **Test frontend** at http://localhost:3000 (if running locally)
4. **Check logs** for any errors in `~/ingestion_logs/`

## Troubleshooting

### Neo4j Connection Fails

```bash
# SSH to VM and test connection
cd ~/FedMCP
source packages/data-pipeline/venv/bin/activate
python3 << 'EOF'
from neo4j import GraphDatabase
driver = GraphDatabase.driver('bolt://10.128.0.3:7687', auth=('neo4j', 'canadagpt2024'))
driver.verify_connectivity()
print('✅ Connected!')
driver.close()
EOF
```

### PostgreSQL Not Running

```bash
# SSH to VM
sudo systemctl status postgresql
sudo systemctl restart postgresql
```

### Out of Disk Space

```bash
# Check disk usage
df -h
du -sh ~/FedMCP
du -sh /var/lib/postgresql

# Clean up old logs
rm ~/ingestion_logs/old_*.log
```

### Import Fails Mid-Way

```bash
# Check logs for specific error
tail -100 ~/ingestion_logs/*.log

# Most imports are idempotent - safe to re-run
cd ~/FedMCP
bash scripts/run-full-ingestion.sh
```

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                     GCP Infrastructure                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────┐         ┌─────────────────────┐    │
│  │ canadagpt-ingestion │         │ canadagpt-neo4j     │    │
│  │ (n2-standard-4)     │─────────│ (e2-standard-8)     │    │
│  │                     │ Internal│                     │    │
│  │ - PostgreSQL temp   │   VPC   │ - Neo4j 5.x         │    │
│  │ - Python scripts    │         │ - 370k nodes        │    │
│  │ - Data ingestion    │         │   → millions soon   │    │
│  └─────────────────────┘         └─────────────────────┘    │
│             ↑                               ↑               │
│             │                               │               │
│             │                               │               │
│  ┌─────────────────────┐         ┌─────────────────────┐    │
│  │ Data Sources        │         │ GraphQL API         │    │
│  │                     │         │ (Cloud Run)         │    │
│  │ - OpenParliament    │         │                     │    │
│  │ - LEGISinfo         │         │ https://canada...   │    │
│  │ - Lipad             │         │ /graphql            │    │
│  │ - Lobbying Registry │         │                     │    │
│  └─────────────────────┘         └─────────────────────┘    │
│                                             ↓               │
│                                  ┌─────────────────────┐    │
│                                  │ Next.js Frontend    │    │
│                                  │ (your Mac / Cloud)  │    │
│                                  └─────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```

## Quick Reference Commands

**SSH to VM:**

```bash
gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
```

**Start tmux:**

```bash
tmux new -s ingestion
```

**Detach from tmux:** `Ctrl+b` then `d`

**Reattach to tmux:**

```bash
tmux attach -t ingestion
```

**Stop VM:**

```bash
gcloud compute instances stop canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
```

**Start VM:**

```bash
gcloud compute instances start canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca
```

## Files on VM

```
~/
├── FedMCP/                                  # Repository (2.0GB)
│   ├── packages/data-pipeline/              # Main data pipeline
│   │   ├── venv/                            # Python virtual environment
│   │   └── .env                             # Neo4j connection config
│   ├── scripts/
│   │   ├── run-full-ingestion.sh            # Master ingestion script
│   │   └── update-recent-data.sh            # Weekly update script
│   ├── test_bulk_import.py                  # 1994-present import
│   ├── test_complete_historical_import.py   # 1901-present
│   └── test_recent_import.py                # 2022-present only
├── ingestion_logs/                          # All import logs
├── lipad_data/                              # Lipad historical data (if downloaded)
├── fedmcp-repo.tar.gz                       # Original upload (can delete)
└── README_INGESTION.md                      # This guide
```

## Success! 🎉

Your data ingestion infrastructure is **fully operational**. You can now:

1. **Start importing data** using Option A or B above
2. **Monitor progress** via tmux, logs, or Neo4j queries
3. **Verify results** through GraphQL API or frontend
4. **Schedule weekly updates** to keep data current
5. **Manage costs** by stopping VM when not in use

The complete pipeline is ready to populate Neo4j with decades of Canadian parliamentary data, accountability records, and ongoing updates.

---

- **Infrastructure Status**: ✅ COMPLETE
- **VM Status**: 🟢 RUNNING
- **Neo4j Connection**: ✅ TESTED
- **Ready to Ingest**: ✅ YES
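
The node-count check listed under Verifying Success could be scripted along the following lines. This is a minimal sketch and not part of the FedMCP repository: `classify_node_count` and `EXPECTED_NODES` are hypothetical names, and the 20% tolerance is an assumption, since this guide only gives approximate targets (~3,000,000 nodes for Option A, ~6,000,000 for Option B). The `total` argument is the number returned by the `MATCH (n) RETURN count(n) as total` query shown under Query Neo4j Stats.

```python
# Hypothetical helper (not in the FedMCP repo): compares a Neo4j node total
# with the approximate targets this guide gives for each import option.
EXPECTED_NODES = {
    "A": 3_000_000,  # Option A: 1994-present
    "B": 6_000_000,  # Option B: 1901-present
}


def classify_node_count(total: int, option: str, tolerance: float = 0.2) -> str:
    """Return 'ok', 'low', or 'high' for a node total.

    tolerance is an assumed margin (20% by default); the guide states only
    approximate targets, so adjust it to match your own runs.
    """
    expected = EXPECTED_NODES[option]
    if total < expected * (1 - tolerance):
        return "low"
    if total > expected * (1 + tolerance):
        return "high"
    return "ok"


if __name__ == "__main__":
    # 370,000 is the pre-import node count shown in the architecture diagram.
    print(classify_node_count(370_000, "A"))
    print(classify_node_count(3_100_000, "A"))
```

A "low" result after an import reports as finished is a prompt to check `~/ingestion_logs/` for errors before re-running the script.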
