# OpenDiscourse GovInfo MCP — AI Agent Usage Guide
Purpose
- Provide AI agents with a concise, actionable guide for setting up the environment, understanding required packages, and using the ingestion functions programmatically and via CLI.
Environment requirements
- Python
  - Version: 3.10 (project configured via .envrc and local .venv)
- OS: Linux or macOS preferred
- Tools:
  - direnv (optional, recommended for automatic env var/venv activation)
  - git
Python packages and libraries (runtime)
- aiohttp — async HTTP client
- aiofiles — async file IO
- tqdm — progress reporting
- lxml — XML parsing
- xmlschema — XML schema validation
- requests — optional utilities in some scripts
Python packages (development/testing)
- pytest, pytest-asyncio, pytest-cov
- black, flake8, mypy
Installation
1) Ensure Python 3.10 is installed.
2) Create and activate a virtualenv (direnv can do this automatically):
   - Enter the repository directory; .envrc, if present, will create/activate .venv (Python 3.10).
   - Or manually:
       python3.10 -m venv .venv
       . .venv/bin/activate
3) Install dependencies:
       pip install --upgrade pip
       pip install -r requirements.txt
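After installation, an agent may want to confirm the environment before starting a run. The following is a minimal sketch, not part of the project: the helper names `missing_modules` and `python_version_ok` are hypothetical, and the module list assumes each runtime package listed above is imported under the same name as its package.

```python
import importlib.util
import sys

# Import names assumed to match the runtime package list in this guide.
RUNTIME_MODULES = ["aiohttp", "aiofiles", "tqdm", "lxml", "xmlschema", "requests"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(expected=(3, 10), actual=None):
    """Compare the interpreter's (major, minor) version with the expected one."""
    actual = sys.version_info[:2] if actual is None else tuple(actual)[:2]
    return actual == tuple(expected)


if __name__ == "__main__":
    print("Python 3.10:", python_version_ok())
    print("Missing packages:", missing_modules(RUNTIME_MODULES) or "none")
```

If any package is reported missing, re-running the dependency install step inside the activated .venv is the likely fix.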
Project layout relevant to agents
- scripts/ingest_govinfo.py — CLI for targeted ingestion
- scripts/ingest_all_govinfo.py — CLI for full-coverage ingestion
- scripts/ingestion/
  - ingestor.py — Async ingestion functions
  - config.py — Ingestion configuration
  - rate_limiter.py — Rate limiter
  - xml_validator.py — Optional validation
- scripts/govinfo_ingest.py — Optional DB ingestion of XML
- docs/agents/ — Agent-oriented documentation and context
Artifacts generated by ingestion
- govinfo_data/{congress}/{doc_type}/
- manifest.json — Run summary and file inventory
- failures.json — Failed URLs (present when failures occur)
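To monitor a run, an agent can read these artifacts directly. The sketch below assumes that manifest.json and failures.json sit inside the govinfo_data/{congress}/{doc_type}/ directory and that both are plain JSON. Their internal schema is not documented here, so the hypothetical helper `load_run_artifacts` returns the parsed content without interpreting it.

```python
import json
from pathlib import Path


def load_run_artifacts(output_dir, congress, doc_type):
    """Load manifest.json and failures.json for one congress/doc_type run.

    Returns (manifest, failures); either value is None when its file is absent.
    The directory layout is an assumption based on this guide.
    """
    run_dir = Path(output_dir) / str(congress) / doc_type

    def read_json(name):
        path = run_dir / name
        if not path.is_file():
            return None
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)

    return read_json("manifest.json"), read_json("failures.json")
```

A `failures` value of None then indicates that no failures.json was written for that run, which this guide describes as the no-failure case.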
Usage modes
- CLI (recommended for batch ingestion)
  - Single congress:
      python -m scripts.ingest_govinfo --congress 118
  - Multiple congresses and types:
      python -m scripts.ingest_govinfo --congress 117 118 --doc-types BILLS BILLSTATUS
  - All configured sessions:
      python -m scripts.ingest_all_govinfo
- Programmatic (AI agent calling into functions)
  - ingest_congress_data(congress: int, doc_types: list[str] | None = None, output_dir: Path | None = None, workers: int = 10) -> dict[str, int]
  - ingest_all_congresses(congresses: list[int] | None = None, doc_types: list[str] | None = None, output_dir: Path | None = None, workers: int = 10) -> dict[int, dict[str, int]]
Example (programmatic)
```python
import asyncio
from pathlib import Path
from scripts.ingestion import ingest_congress_data, ingest_all_congresses

# Single congress
result = asyncio.run(
    ingest_congress_data(congress=118, doc_types=["BILLS", "BILLSTATUS"], output_dir=Path("govinfo_data"), workers=12)
)
print(result)

# Multiple congresses
results = asyncio.run(
    ingest_all_congresses(congresses=[117, 118], doc_types=["BILLS"], output_dir=Path("govinfo_data"), workers=8)
)
print(results)
```
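The value returned by ingest_all_congresses is typed as dict[int, dict[str, int]], so an agent can roll it up into a single summary. The following sketch assumes only that shape. The meaning of the inner keys is not specified in this guide, and the helper `total_counts` is hypothetical.

```python
from collections import Counter


def total_counts(results):
    """Sum the per-congress integer values key by key.

    `results` is expected to look like {congress: {key: count, ...}, ...}.
    """
    totals = Counter()
    for per_congress in results.values():
        totals.update(per_congress)  # adds counts for matching keys
    return dict(totals)
```

For instance, `total_counts(results)` could be logged after the multi-congress call above to give one line of totals per run.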
Best practices for AI agents
- Use module invocation for CLIs (python -m ...) to ensure imports resolve.
- Control concurrency via the workers parameter to avoid rate limiting; adjust GOVINFO_RATE_LIMIT if needed.
- Rely on manifest.json and failures.json for resumability and monitoring.
- For validation, ensure the appropriate schemas exist under scripts/ingestion/schemas and set GOVINFO_VALIDATE_XML=true.
- Avoid modifying core ingestion code; prefer passing parameters via CLI or function args.
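An agent that prefers the CLI can drive it through module invocation without touching the ingestion code. This is a rough sketch using only the flags and the environment variable named in this guide. The wrapper functions `build_ingest_command` and `run_ingest` are hypothetical, and how the CLI reports errors through its exit code is an assumption to verify.

```python
import os
import subprocess
import sys


def build_ingest_command(congresses, doc_types=None):
    """Build the argv for `python -m scripts.ingest_govinfo`."""
    command = [sys.executable, "-m", "scripts.ingest_govinfo", "--congress"]
    command += [str(congress) for congress in congresses]
    if doc_types:
        command += ["--doc-types", *doc_types]
    return command


def run_ingest(congresses, doc_types=None, project_root=".", validate_xml=False):
    """Run the CLI from the project root, optionally enabling XML validation."""
    env = dict(os.environ)
    if validate_xml:
        env["GOVINFO_VALIDATE_XML"] = "true"
    return subprocess.run(
        build_ingest_command(congresses, doc_types),
        cwd=project_root,  # module invocation needs the project root as cwd
        env=env,
        check=False,  # caller inspects returncode instead of catching an exception
    )
```

Running with `cwd` set to the project root keeps the `scripts` package importable, which is the same concern the module-invocation advice addresses.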
Troubleshooting
- Module not found: run from the project root and use python -m scripts.ingest_govinfo or adjust PYTHONPATH.
- Rate limiting (HTTP 429): reduce --workers, tune GOVINFO_RATE_LIMIT conservatively, or introduce backoff.
- Permission issues: create/activate a user-owned .venv and ensure write permissions to govinfo_data.
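For the backoff option mentioned under rate limiting, one common approach is exponential backoff with jitter around an async call. The sketch below is generic and standard-library only. It is not the project's rate limiter, the helper `retry_with_backoff` is hypothetical, and it retries on any exception, whereas a real integration would probably retry only on HTTP 429 responses.

```python
import asyncio
import random


async def retry_with_backoff(operation, attempts=4, base_delay=1.0, max_delay=30.0):
    """Await operation(), retrying failures with exponential backoff and jitter.

    `operation` is a zero-argument callable returning a fresh awaitable each time.
    """
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so workers do not retry in lockstep.
            await asyncio.sleep(random.uniform(0, delay))
```

The callable is passed instead of a coroutine object because a coroutine can be awaited only once, so each retry needs a new one.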