slurm_submit_job
Submit a multi-node GPU training job to Slurm. Generate an sbatch script from your spec and submit it, or preview with dry run.
Instructions
Submit a multi-node GPU training job to Slurm via sbatch.
Generates a complete sbatch script from the provided spec and submits it. Set dry_run=true to preview the script without submitting.
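As a rough illustration of what the generated script might look like, the sketch below renders `#SBATCH` directives from a job spec. The function name `render_sbatch` and the script layout are assumptions made for this example; the tool's actual generator is not shown in this document. The flags themselves are standard sbatch options.

```python
def render_sbatch(job_name, nodes, gpus_per_node, script,
                  partition=None, ntasks_per_node=None, time=None,
                  output="slurm-%j.out", error="slurm-%j.err", account=None):
    """Hypothetical renderer: build sbatch script text from a job spec."""
    # Documented default: ntasks_per_node falls back to gpus_per_node.
    if ntasks_per_node is None:
        ntasks_per_node = gpus_per_node

    directives = [
        ("--job-name", job_name),
        ("--nodes", nodes),
        ("--gpus-per-node", gpus_per_node),
        ("--ntasks-per-node", ntasks_per_node),
        ("--partition", partition),
        ("--time", time),
        ("--output", output),
        ("--error", error),
        ("--account", account),
    ]
    lines = ["#!/bin/bash"]
    # Emit only the directives that were actually set.
    lines += [f"#SBATCH {flag}={value}"
              for flag, value in directives if value is not None]
    # The script body follows the directive header.
    lines += ["", script, ""]
    return "\n".join(lines)


# Illustrative values only; the script body reuses the example from this page.
preview = render_sbatch("llm-pretrain", 4, 8,
                        "torchrun --nproc_per_node=8 train.py",
                        time="12:00:00")
print(preview)
```

With dry_run=true, the tool returns script text of this general kind instead of submitting it, which makes it easy to review the directives before a job is queued.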
Write operation — recorded in the audit log.
Args:

- job_name: Job name (--job-name).
- nodes: Number of nodes to allocate (--nodes).
- gpus_per_node: GPUs per node (--gpus-per-node).
- script: Shell script body, the command to run (e.g. torchrun --nproc_per_node=8 train.py).
- host: Slurm head node hostname (overrides SLURM_HOST).
- partition: Target Slurm partition (--partition).
- ntasks_per_node: MPI tasks per node (default: gpus_per_node).
- time: Wall-clock time limit in HH:MM:SS or D-HH:MM:SS format (--time).
- output: Path to stdout log file (default: slurm-%j.out).
- error: Path to stderr log file (default: slurm-%j.err).
- account: Slurm account / allocation for billing (--account).
- dry_run: If true, return the sbatch script without submitting.
- gateway_id: Gateway UUID for the site where the Slurm cluster is deployed.
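For illustration, a dry-run request might carry arguments like the ones below. All values are made up; only the parameter names come from the argument descriptions above. The helper `check_args` and its regular expression are assumptions for this sketch: the pattern checks only the two time formats named above, not every form Slurm itself accepts.

```python
import re

# Matches HH:MM:SS or D-HH:MM:SS (an assumed client-side check).
TIME_RE = re.compile(r"^(\d+-)?\d{1,2}:\d{2}:\d{2}$")

# Arguments marked as required in the input schema.
REQUIRED = ("job_name", "nodes", "gpus_per_node", "script")


def check_args(args):
    """Return a list of problems found in a hypothetical argument dict."""
    problems = [f"missing required argument: {name}"
                for name in REQUIRED if name not in args]
    if "time" in args and not TIME_RE.match(args["time"]):
        problems.append("time must be HH:MM:SS or D-HH:MM:SS")
    return problems


example_args = {
    "job_name": "llm-pretrain",
    "nodes": 4,
    "gpus_per_node": 8,
    "script": "torchrun --nproc_per_node=8 train.py",
    "time": "1-12:00:00",
    "dry_run": True,  # preview only; nothing is submitted
}

print(check_args(example_args))
print(check_args({"job_name": "x", "time": "90 minutes"}))
```

Running a check like this before calling the tool catches missing required fields and malformed time limits locally, before anything reaches the Slurm head node.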
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
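| host | No | Slurm head node hostname (overrides SLURM_HOST). | |
| time | No | Wall-clock time limit in HH:MM:SS or D-HH:MM:SS format (--time). | |
| error | No | Path to stderr log file. | slurm-%j.err |
| nodes | Yes | Number of nodes to allocate (--nodes). | |
| output | No | Path to stdout log file. | slurm-%j.out |
| script | Yes | Shell script body, the command to run (e.g. torchrun --nproc_per_node=8 train.py). | |
| account | No | Slurm account / allocation for billing (--account). | |
| dry_run | No | If true, return the sbatch script without submitting. | |
| job_name | Yes | Job name (--job-name). | |
| partition | No | Target Slurm partition (--partition). | |
| gateway_id | No | Gateway UUID for the site where the Slurm cluster is deployed. | |
| gpus_per_node | Yes | GPUs per node (--gpus-per-node). | |
| ntasks_per_node | No | MPI tasks per node. | gpus_per_node |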