SLURM Reference

Comprehensive reference for SLURM commands and options

Job Submission Commands

salloc - Obtain a job allocation for interactive use

sbatch - Submit a batch script for later execution

srun - Obtain a job allocation and run an application

Common Options

Option Description
-A, --account=<account> Account to be charged for resources used
-a, --array=<index> Job array specification (sbatch only)
-b, --begin=<time> Initiate job after specified time
-C, --constraint=<features> Required node features
--cpu-bind=<type> Bind tasks to specific CPUs (srun only)
-c, --cpus-per-task=<count> Number of CPUs required per task
-d, --dependency=<state:jobid> Defer job until specified jobs reach specified state
-m, --distribution=<method> Specify distribution methods for remote processes
-e, --error=<filename> File for job error messages (sbatch and srun only)
-x, --exclude=<name> Host names to exclude from job allocation
--exclusive Reserve all CPUs and GPUs on allocated nodes
--export=<name=value> Export specified environment variables
--gpus-per-task=<list> Number of GPUs required per task
-J, --job-name=<name> Job name
-l, --label Prepend task ID to output (srun only)
--mail-type=<type> E-mail notification type (begin, end, fail, requeue, all)
--mail-user=<address> E-mail address
--mem=<size>[units] Memory required per allocated node (e.g., 16GB)
--mem-per-cpu=<size>[units] Memory required per allocated CPU (e.g., 2GB)
-w, --nodelist=<hostnames> Host names to include in job allocation
-N, --nodes=<count> Number of nodes required for the job
-n, --ntasks=<count> Number of tasks to be launched
--ntasks-per-node=<count> Number of tasks to be launched per node
-o, --output=<filename> File for job output (sbatch and srun only)
-p, --partition=<names> Partition in which to run the job
--signal=[B:]<num>[@time] Signal job when approaching time limit
-t, --time=<time> Limit for job run time

Examples:

# Request interactive job on debug partition with 4 CPUs
salloc -p debug -c 4

# Request interactive job with V100 GPU
salloc -p gpu --ntasks=1 --gpus-per-task=v100:1

# Submit batch job
sbatch batch.job
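
A typical batch script combines several of the options above as #SBATCH directives. The following is a minimal sketch of a batch.job file; the account, partition, e-mail address, and program (./my_program) are placeholders to adjust for your site:

#!/bin/bash
#SBATCH --account=myaccount
#SBATCH --partition=main
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=1:00:00
#SBATCH --output=example_%j.out
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=user@example.com

./my_program

To defer the job until another job completes successfully, add a dependency at submission time:

# Submit after job 111111 completes successfully
sbatch --dependency=afterok:111111 batch.job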

Job Management Commands

squeue - View jobs in queue

Option Description
-A, --account=<list> Filter by accounts
-o, --format=<options> Output format to display
-j, --jobs=<list> Filter by job IDs
-l, --long Show more of the available information
--me Filter by your own jobs
-n, --name=<list> Filter by job names
-p, --partition=<list> Filter by partitions
-P, --priority Show pending jobs in the order considered for scheduling (once per partition)
--start Show expected start time for pending jobs
-t, --states=<list> Filter by states
-u, --user=<list> Filter by users

Examples:

# View your own jobs
squeue --me

# View with estimated start times for pending jobs
squeue --me --start

# View jobs on specified partition in long format
squeue -lp epyc-64
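
The --format option selects which columns are displayed; the field widths in the example below are arbitrary:

# View your jobs with custom columns: ID, partition, name, state, elapsed time, node count, reason/nodelist
squeue --me -o "%.10i %.12P %.20j %.8T %.10M %.6D %R"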

scancel - Cancel jobs

Option Description
-A, --account=<account> Restrict to specified account
-n, --name=<job_name> Restrict to jobs with specified name
-w, --nodelist=<hostnames> Restrict to jobs using specified hosts
-p, --partition=<partition> Restrict to specified partition
-u, --user=<username> Restrict to specified user

Examples:

# Cancel specific job
scancel 111111

# Cancel all your own jobs
scancel -u $USER

# Cancel your own jobs on specified partition
scancel -u $USER -p oneweek

# Cancel all pending jobs
scancel -u $USER -t pending
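
The --name option narrows cancellation to jobs that share a name; test_job below is a placeholder:

# Cancel your own jobs with a specific name
scancel -u $USER -n test_job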

sprio - View job priorities

Option Description
-o, --format=<options> Output format to display
-j, --jobs=<list> Filter by job IDs
-l, --long Show more information
-n, --norm Show normalized priority factors
-p, --partition=<list> Filter by partitions
-u, --user=<list> Filter by users

Examples:

# View normalized job priorities for your jobs
sprio -nu $USER

# View priorities for specified partition
sprio -nlp gpu
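
Priorities can also be checked for a single job by ID (using the example job ID from earlier):

# View normalized priority factors for a specific job
sprio -nj 111111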

Partition and Node Information

sinfo - View node and partition info

Option Description
-o, --format=<options> Output format to display
-l, --long Show more information
-N, --Node Show in node-oriented format
-n, --nodes=<hostnames> Filter by host names
-p, --partition=<list> Filter by partitions
-t, --states=<list> Filter by node states
-s, --summarize Show summary information

Examples:

# View all partitions and nodes by state
sinfo

# Summarize node states by partition
sinfo -s

# View nodes in idle state
sinfo --states=idle

# View nodes for partition in long, node-oriented format
sinfo -lNp epyc-64
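
The --format option also customizes sinfo output; in the sketch below, the %C field reports allocated/idle/other/total CPU counts:

# Summarize partitions: name, availability, time limit, node count, state, and A/I/O/T CPUs
sinfo -o "%.15P %.10a %.12l %.8D %.10t %.15C"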

scontrol - View or modify configuration

Option Description
-d, --details Show more details
-o, --oneliner Show information on one line

Examples:

# View partition information
scontrol show partition epyc-64

# View node information
scontrol show node $NODE

# View detailed job information
scontrol -d show job 111111

# View list of hostnames for the current job (expands $SLURM_JOB_NODELIST; run within a job)
scontrol show hostnames

# Hold a pending job
scontrol hold <job_id>

# Release a held job
scontrol release <job_id>
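
scontrol update can modify attributes of your own jobs, subject to site policy (for example, users can typically shorten but not extend a time limit). A sketch using the example job ID and partition names from above:

# Reduce a job's time limit
scontrol update JobId=111111 TimeLimit=1:00:00

# Move a pending job to a different partition
scontrol update JobId=111111 Partition=oneweek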

Environment Variables

SLURM sets these variables within your job:

Variable Description
SLURM_ARRAY_TASK_COUNT Number of tasks in job array
SLURM_ARRAY_TASK_ID Job array task ID
SLURM_CPUS_PER_TASK Number of CPUs requested per task
SLURM_JOB_ACCOUNT Account used for job
SLURM_JOB_ID Job ID
SLURM_JOB_NAME Job name
SLURM_JOB_NODELIST List of nodes allocated to job
SLURM_JOB_NUM_NODES Number of nodes allocated to job
SLURM_JOB_PARTITION Partition used for job
SLURM_NTASKS Number of job tasks
SLURM_PROCID MPI rank of current process
SLURM_SUBMIT_DIR Directory from which job was submitted
SLURM_TASKS_PER_NODE Number of job tasks per node

Examples:

# Set the number of OpenMP threads to match the requested CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch MPI program with the requested number of tasks
srun -n $SLURM_NTASKS ./mpi_program
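
In a job array (see the --array option above), SLURM_ARRAY_TASK_ID identifies each task. A minimal sketch, assuming hypothetical input files input_1.txt, input_2.txt, ... and program ./my_program:

#!/bin/bash
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --time=1:00:00

# Each array task processes its own input file
./my_program input_${SLURM_ARRAY_TASK_ID}.txt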

Job State Codes

The squeue command shows job states:

Status Code Explanation
COMPLETED CD Job has completed successfully
COMPLETING CG Job is finishing but some processes are still active
FAILED F Job terminated with non-zero exit code
PENDING PD Job is waiting for resource allocation
PREEMPTED PR Job was terminated due to preemption
RUNNING R Job is currently running
SUSPENDED S Running job stopped with cores released
STOPPED ST Running job stopped with cores retained
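
State codes can be passed to the squeue --states option described above:

# View your pending jobs
squeue --me -t PD

# View your running jobs
squeue --me -t R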

A full list of job state codes is available in the SLURM documentation.

Job Reason Codes

Why a job is in its current state:

Reason Code Explanation
Priority Higher priority jobs are ahead; job will run eventually
Dependency Waiting for dependent job to complete
Resources Waiting for resources; job will run eventually
InvalidAccount Invalid account; cancel and resubmit
InvalidQoS Invalid QoS; cancel and resubmit
QOSGrpCpuLimit QoS CPU limit reached; job will run eventually
QOSGrpMaxJobsLimit QoS max jobs reached; job will run eventually
QOSGrpNodeLimit QoS node limit reached; job will run eventually
PartitionCpuLimit Partition CPU limit reached; job will run eventually
PartitionMaxJobsLimit Partition max jobs reached; job will run eventually
PartitionNodeLimit Partition node limit reached; job will run eventually
AssociationCpuLimit Association CPU limit reached; job will run eventually
AssociationMaxJobsLimit Association max jobs reached; job will run eventually
AssociationNodeLimit Association node limit reached; job will run eventually
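
The reason appears in the NODELIST(REASON) column of the default squeue output; the %r format field shows it explicitly:

# View the state and reason for your pending jobs
squeue --me -t PD -o "%.10i %.9P %.20j %.8T %r"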

A full list of job reason codes is available in the SLURM documentation.