diagnose_dag_failure
Automatically diagnose failed Airflow DAG runs by identifying failed tasks, analyzing logs, extracting EMR application IDs, and providing root cause analysis in one step.
Instructions
One-shot diagnosis of a failed DAG run.

This tool performs every step automatically:

1. Finds the most recent failed run (today or the specified date)
2. Identifies which task(s) failed
3. Reads the failed task logs
4. Extracts the EMR application ID from the 'initialise' task
5. Reads the Spark driver logs (stdout) for Python app errors
6. Returns a complete failure analysis
This replaces the need to call 5-6 tools manually.
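Two of the steps above (finding failed tasks and pulling the EMR application ID out of log text) can be sketched as small helpers. This is a minimal sketch, assuming the logs contain a standard YARN-style application ID and that task-instance records follow the Airflow REST API's `task_id`/`state` shape; the function names are illustrative, not part of the tool's API:

```python
import re

# A YARN/EMR application ID has the form "application_<clusterTimestamp>_<sequence>",
# e.g. "application_1700000000000_0042".
APP_ID_RE = re.compile(r"application_\d+_\d+")

def extract_emr_application_id(log_text: str):
    """Return the first EMR (YARN) application ID found in the log text, or None."""
    match = APP_ID_RE.search(log_text)
    return match.group(0) if match else None

def failed_task_ids(task_instances):
    """Given task-instance records shaped like the Airflow REST API response
    (dicts with 'task_id' and 'state'), return the IDs of failed tasks."""
    return [ti["task_id"] for ti in task_instances if ti.get("state") == "failed"]
```

In practice these would run against output from the Airflow REST API and the S3/EMR log locations; the sketch only shows the filtering and extraction logic.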
Args:
- `dag_id`: The DAG to diagnose (e.g. `ttdcustom_processing`).
- `env`: Target environment: `dev`, `uat`, `test`, or `prod`. IMPORTANT: do NOT guess or default; ask the user which environment if it is not specified.
- `date`: Optional date to check (ISO format or `yesterday`). Default: today.
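A call for yesterday's failed run in `dev` would pass arguments like the following (the argument values are illustrative):

```json
{
  "dag_id": "ttdcustom_processing",
  "env": "dev",
  "date": "yesterday"
}
```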
Returns a comprehensive failure report with root cause analysis.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dag_id | Yes | The DAG to diagnose (e.g. `ttdcustom_processing`) | |
| env | No | Target environment: `dev`, `uat`, `test`, or `prod`; ask the user if not specified | |
| date | No | Date to check (ISO format or `yesterday`) | today |