read_spark_driver_log
Retrieve Spark driver logs from S3 for EMR Serverless job runs, to diagnose application errors, Python output, and Spark framework issues.
Instructions
Read the Spark driver log from S3 for an EMR Serverless job run.
DEFAULT: reads stdout.gz — the PRIMARY log, containing Python print statements, row counts, file paths, and application errors. This is what you want 90% of the time.
Use log_type='stderr' only when you need Spark framework logs (executor allocation, memory warnings, shuffle errors).
Use read_both=True to get BOTH logs in one call (stdout first, then stderr filtered to ERROR lines only).
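The tail_lines and search_text options behave like `grep` followed by `tail` on the decompressed log. A minimal sketch of that filtering logic, assuming an in-memory gzipped log (the function name and sample log are illustrative, not the tool's actual implementation):

```python
import gzip
import io

def tail_log(gz_bytes: bytes, tail_lines: int = 300, search_text=None) -> list:
    """Decompress a .gz log, optionally filter lines, then keep the last N."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt") as f:
        lines = f.read().splitlines()
    if search_text is not None:
        lines = [ln for ln in lines if search_text in ln]
    # tail_lines=-1 means "return all lines"
    return lines if tail_lines == -1 else lines[-tail_lines:]

log = gzip.compress(b"INFO start\nERROR boom\nINFO wrote 42 rows\n")
print(tail_log(log, tail_lines=2))         # last two lines
print(tail_log(log, search_text="ERROR"))  # only matching lines
```

Filtering happens before tailing, so search_text='ERROR' with the default tail_lines returns the last 300 ERROR lines, not ERROR lines within the last 300.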
How to find application_id and job_run_id:
- application_id: from the 'initialise' Airflow task log → 'EMR serverless application created: 00gXXX'
- job_run_id: from the processing Airflow task log → 'EMR serverless job started: 00gXXX'
- Or use list_emr_applications() then list_job_runs()
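The lookup steps above can be sketched as a small parser that pulls both IDs out of Airflow task log text (the regex patterns follow the log lines quoted above; the function name is illustrative):

```python
import re

def extract_emr_ids(initialise_log: str, processing_log: str) -> dict:
    """Pull application_id and job_run_id out of Airflow task log text."""
    app = re.search(r"EMR serverless application created: (\w+)", initialise_log)
    run = re.search(r"EMR serverless job started: (\w+)", processing_log)
    return {
        "application_id": app.group(1) if app else None,
        "job_run_id": run.group(1) if run else None,
    }

ids = extract_emr_ids(
    "INFO - EMR serverless application created: 00g16i3marao0c0t",
    "INFO - EMR serverless job started: 00g16i5g2pm56o0v",
)
print(ids)
```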
Args:
- application_id: The EMR Serverless application ID (e.g. '00g16i3marao0c0t').
- job_run_id: The job run ID (e.g. '00g16i5g2pm56o0v').
- log_type: 'stdout' (default, Python app output) or 'stderr' (Spark framework logs).
- s3_log_uri: Optional full S3 URI to read directly (e.g. 's3://bucket/path/stdout.gz').
- process_name: Optional folder name under spark-logs/ (e.g. 'stackadapt_main'). Speeds up log discovery.
- tail_lines: Number of lines from the end (default 300). Use -1 for all lines.
- search_text: Optional text to filter log lines (e.g. 'ERROR', 'Exception').
- bucket: S3 bucket override (default from config).
- read_both: If True, read BOTH stdout and stderr in one call; stdout shown first, stderr filtered to ERROR lines.
Returns the log content, optionally filtered and tailed.
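When s3_log_uri is not supplied, the tool has to locate the driver log itself. A hedged sketch of how such a key could be built: the spark-logs/<process_name> prefix comes from the process_name description above, while the applications/.../jobs/.../SPARK_DRIVER/ layout is the usual EMR Serverless log convention and may differ in your bucket — treat this path shape as an assumption:

```python
def driver_log_key(application_id: str, job_run_id: str,
                   process_name: str, log_type: str = "stdout") -> str:
    """Build an S3 key for an EMR Serverless Spark driver log (assumed layout)."""
    return (
        f"spark-logs/{process_name}/applications/{application_id}"
        f"/jobs/{job_run_id}/SPARK_DRIVER/{log_type}.gz"
    )

key = driver_log_key("00g16i3marao0c0t", "00g16i5g2pm56o0v", "stackadapt_main")
print(key)
```

Passing process_name skips scanning every folder under spark-logs/, which is why the docs say it speeds up log discovery.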
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| application_id | Yes | The EMR Serverless application ID (e.g. '00g16i3marao0c0t') | |
| job_run_id | Yes | The job run ID (e.g. '00g16i5g2pm56o0v') | |
| log_type | No | 'stdout' (Python app output) or 'stderr' (Spark framework logs) | stdout |
| s3_log_uri | No | Optional full S3 URI to read directly | |
| process_name | No | Optional folder name under spark-logs/; speeds up log discovery | |
| tail_lines | No | Number of lines from the end; -1 for all lines | 300 |
| search_text | No | Optional text to filter log lines (e.g. 'ERROR') | |
| bucket | No | S3 bucket override | from config |
| read_both | No | If True, read both stdout and stderr in one call | |
| env | No | | |