FedMCP - Federal Parliamentary Information

MIT License

FedMCP

INGESTION_VM_COMPLETE_GUIDE.md•8.69 kB

# Complete Guide: FedMCP Data Ingestion VM ## Status - ✅ VM Created: `canadagpt-ingestion` (n2-standard-4, 150GB disk) - ✅ Dependencies Installed: PostgreSQL 14, Python 3.11, Git - ✅ PostgreSQL Database Created: `openparliament_temp` - ⏳ Repository Transfer: In progress (or complete by now) ## VM Details - **Name**: canadagpt-ingestion - **Zone**: us-central1-a - **Internal IP**: 10.128.0.4 - **External IP**: 35.193.249.101 - **Machine Type**: n2-standard-4 (4 vCPU, 16GB RAM) - **Disk**: 150GB SSD ## Quick Start: Complete the Setup ### Step 1: SSH into the VM ```bash gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca ``` ### Step 2: Extract the Repository (if transfer completed) ```bash cd ~ tar xzf fedmcp-repo.tar.gz cd FedMCP ``` **If transfer didn't complete**, clone directly: ```bash cd ~ git clone https://github.com/MattDuf/FedMCP.git cd FedMCP ``` ### Step 3: Install Python Dependencies ```bash # Install for data pipeline cd ~/FedMCP/packages/data-pipeline python3.11 -m venv venv source venv/bin/activate pip install --upgrade pip pip install neo4j psycopg2-binary requests beautifulsoup4 lxml tqdm python-dotenv pandas # Also install for root-level scripts cd ~/FedMCP pip install neo4j psycopg2-binary requests beautifulsoup4 lxml tqdm python-dotenv pandas ``` ### Step 4: Configure Environment ```bash cat > ~/FedMCP/packages/data-pipeline/.env << 'EOF' # Neo4j GCP Connection (internal IP - fast and free!) NEO4J_URI=bolt://10.128.0.3:7687 NEO4J_USER=neo4j NEO4J_PASSWORD=canadagpt2024 # PostgreSQL (local) POSTGRES_URI=postgresql://localhost:5432/openparliament_temp # Environment NODE_ENV=production EOF ``` ### Step 5: Test Connections ```bash # Test Neo4j python3 << 'PYEOF' from neo4j import GraphDatabase driver = GraphDatabase.driver('bolt://10.128.0.3:7687', auth=('neo4j', 'canadagpt2024')) driver.verify_connectivity() print('✅ Neo4j connected!') driver.close() PYEOF # Test PostgreSQL psql -U postgres -d openparliament_temp -c "SELECT version();" ``` ### Step 6: Create Log Directory ```bash mkdir -p ~/ingestion_logs ``` ## Running the Full Ingestion ### Option A: Complete Historical Import (1901-Present) - 3-4 hours **Prerequisites**: Download Lipad data manually (see below) ```bash # Start a tmux session (so it keeps running if you disconnect) tmux new -s ingestion # Run complete import cd ~/FedMCP python3 test_complete_historical_import.py 2>&1 | tee ~/ingestion_logs/full_import.log # Detach from tmux: Press Ctrl+b then d # Reattach later: tmux attach -t ingestion ``` ### Option B: Recent Data Only (2022-Present) - 20 minutes **No prerequisites needed - starts immediately!** ```bash cd ~/FedMCP python3 test_recent_import.py 2>&1 | tee ~/ingestion_logs/recent_import.log ``` ### Option C: Modern Bulk (1994-Present) - 90 minutes ```bash cd ~/FedMCP python3 test_bulk_import.py 2>&1 | tee ~/ingestion_logs/bulk_import.log ``` ## Lipad Historical Data Download (For Option A Only) ### Manual Download Steps 1. On your local machine, visit: https://www.lipad.ca/data/ 2. Download CSV format (2-5GB compressed) 3. Upload to GCS bucket: ```bash # From your Mac gsutil mb -p canada-gpt-ca -l us-central1 gs://canada-gpt-ca-lipad-data gsutil cp ~/Downloads/lipad_*.csv gs://canada-gpt-ca-lipad-data/ ``` 4. Download to VM: ```bash # On the VM mkdir -p ~/lipad_data gsutil -m cp gs://canada-gpt-ca-lipad-data/* ~/lipad_data/ ``` ## Adding Accountability Data (30 minutes) After the main import, add lobbying, expenses, and petitions: ```bash cd ~/FedMCP/packages/data-pipeline source venv/bin/activate # Import lobbying registry (~15 min, ~90MB download) python3 -m fedmcp_pipeline.ingest.lobbying 2>&1 | tee ~/ingestion_logs/lobbying.log # Import MP expenses (~10 min) python3 -m fedmcp_pipeline.ingest.finances 2>&1 | tee ~/ingestion_logs/expenses.log # Import petitions (~5 min) python3 -m fedmcp_pipeline.ingest.parliament --petitions 2>&1 | tee ~/ingestion_logs/petitions.log ``` ## Verification ### Check Database Statistics ```bash # Connect to Neo4j and run these queries python3 << 'PYEOF' from neo4j import GraphDatabase driver = GraphDatabase.driver('bolt://10.128.0.3:7687', auth=('neo4j', 'canadagpt2024')) session = driver.session() # Node counts by label result = session.run("MATCH (n) RETURN labels(n)[0] as label, count(*) as count ORDER BY count DESC") print("\n=== Node Counts ===") for record in result: print(f"{record['label']:20s}: {record['count']:>10,}") # Relationship counts result = session.run("MATCH ()-[r]->() RETURN type(r) as rel_type, count(*) as count ORDER BY count DESC") print("\n=== Relationship Counts ===") for record in result: print(f"{record['rel_type']:30s}: {record['count']:>10,}") # Date coverage result = session.run("MATCH (d:Debate) RETURN min(d.date) as earliest, max(d.date) as latest") record = result.single() if record: print(f"\n=== Debate Coverage ===") print(f"Earliest: {record['earliest']}") print(f"Latest: {record['latest']}") session.close() driver.close() PYEOF ``` ### Expected Results (Complete Historical Import) ``` Total Nodes: ~6,000,000 Total Relationships: ~7,000,000 Earliest Debate: 1901-XX-XX Latest Debate: 2025-XX-XX ``` ## Ongoing Operations ### Weekly Update Script Create `~/FedMCP/update_weekly.sh`: ```bash #!/bin/bash LOG_FILE=~/ingestion_logs/update_$(date +%Y%m%d).log echo "=========================================" | tee $LOG_FILE echo "Weekly Update: $(date)" | tee -a $LOG_FILE echo "=========================================" | tee -a $LOG_FILE cd ~/FedMCP # Update recent data python3 test_recent_import.py 2>&1 | tee -a $LOG_FILE # Update accountability data cd packages/data-pipeline source venv/bin/activate python3 -m fedmcp_pipeline.ingest.lobbying --update 2>&1 | tee -a $LOG_FILE python3 -m fedmcp_pipeline.ingest.finances --update 2>&1 | tee -a $LOG_FILE echo "✅ Update complete: $(date)" | tee -a $LOG_FILE ``` Make it executable and schedule: ```bash chmod +x ~/FedMCP/update_weekly.sh # Add to crontab (runs every Sunday at 2 AM) crontab -e # Add this line: 0 2 * * 0 /home/$USER/FedMCP/update_weekly.sh ``` ## Cost Management ### Option A: Keep VM Running - **Cost**: ~$100/month - **Benefit**: Always ready for updates - **Best for**: Production with frequent updates ### Option B: Stop When Not Needed (RECOMMENDED) - **Cost**: ~$5/month (disk storage only) - **Benefit**: Much cheaper - **Best for**: Occasional updates ```bash # Stop VM (from your local machine) gcloud compute instances stop canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca # Start when needed gcloud compute instances start canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca ``` ## Troubleshooting ### Check Logs ```bash # View ingestion logs ls -lh ~/ingestion_logs/ # Tail a running import tail -f ~/ingestion_logs/full_import.log ``` ### Reattach to tmux Session ```bash tmux list-sessions tmux attach -t ingestion ``` ### PostgreSQL Issues ```bash # Restart PostgreSQL sudo systemctl restart postgresql # Check status sudo systemctl status postgresql # View PostgreSQL logs sudo tail -f /var/log/postgresql/postgresql-14-main.log ``` ### Neo4j Connection Issues ```bash # Check if Neo4j VM is running gcloud compute instances describe canadagpt-neo4j --zone=us-central1-a --format="value(status)" # Test connection python3 -c "from neo4j import GraphDatabase; driver = GraphDatabase.driver('bolt://10.128.0.3:7687', auth=('neo4j', 'canadagpt2024')); driver.verify_connectivity(); print('OK')" ``` ## Next Steps After Ingestion 1. **Verify data via GraphQL API**: ```bash curl -X POST https://canadagpt-graph-api-213428056473.us-central1.run.app/graphql \ -H "Content-Type: application/json" \ -d '{"query": "{ petitions(options: {limit: 5}) { petitionNumber title } }"}' ``` 2. **Test frontend queries** at http://localhost:3000 3. **Stop VM to save costs** (if using Option B above) 4. **Schedule weekly updates** using cron job 5. **Monitor disk usage**: ```bash df -h du -sh ~/FedMCP du -sh /var/lib/postgresql ``` ## Quick Reference **SSH to VM**: ```bash gcloud compute ssh canadagpt-ingestion --zone=us-central1-a --project=canada-gpt-ca ``` **Start tmux session**: ```bash tmux new -s ingestion ``` **Detach from tmux**: `Ctrl+b` then `d` **Reattach to tmux**: ```bash tmux attach -t ingestion ``` **Check running processes**: ```bash ps aux | grep python ``` **Disk usage**: ```bash df -h ``` **Stop VM** (from local machine): ```bash gcloud compute instances stop canadagpt-ingestion --zone=us-central1-a ```

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/northernvariables/FedMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server