SLURM Quick Start

Run your first job with step-by-step instructions and job script examples

Before You Begin

Ensure you have:

  1. SSH access to the HPC cluster (see Getting Started)
  2. Your software environment configured (see Software Setup)

Understanding Job Submission

On the HPC cluster, you do not run computationally intensive tasks directly on the login node. Instead, you submit jobs to SLURM, which allocates compute nodes and runs your tasks there.

Key concepts:

  • Login node: Where you land when you SSH in; use it only for editing files and submitting jobs
  • Compute node: Where your jobs actually run; allocated by SLURM
  • Job script: A bash script with SLURM directives that tells the scheduler what resources you need

Your First Job

Step 1: Create a Job Script

Here is an example test.sh sized to make full use of a cpu2mem8 instance (2 CPUs, 8 GB RAM):

#!/bin/bash
#SBATCH --job-name=my_job       # Job name for identification
#SBATCH --constraint="cpu2mem8a|cpu2mem8b"  # Instance type (use | for multiple pools)
#SBATCH --nodes=1               # Request one compute node
#SBATCH --ntasks=1              # Run one task
#SBATCH --cpus-per-task=2       # Use all 2 CPUs on the node
#SBATCH --mem=7GB               # Request 7GB (leave ~1GB for system overhead; see Instance Types and Memory Limits below)
#SBATCH --time=4:00:00          # Maximum runtime (HH:MM:SS)
#SBATCH --output=job_%j.out     # Standard output file (%j = job ID)
#SBATCH --error=job_%j.err      # Standard error file

# Sync the computing environment
source ~/.bashrc  # Or simply expose the PATH: export PATH=$HOME/.pixi/bin:$PATH

# Test commands
echo "Hello from $(hostname)"
python -c "import sys; print(f'Hello from Python {sys.version_info.major}.{sys.version_info.minor}')"
Rscript -e "cat(paste('Hello from R', R.version.string, '\n'))"

Key points:

  • --cpus-per-task=2 gives you both CPUs on the instance; use --cpus-per-task=1 if your code is single-threaded
  • --mem=7GB leaves ~1GB headroom for VM system processes (requesting full GB may cause failures)
  • Use --constraint with | to draw from multiple AWS instance pools for faster scheduling

See Instance Types and Memory Limits below for the available instance types, and Compute Pricing for their costs.

Step 2: Submit Your Job

sbatch test.sh

You will see: Submitted batch job 12345
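
If you submit from another script and want to capture the job ID, sbatch's --parsable flag prints only the ID (plus the cluster name on multi-cluster setups). A minimal sketch:

# Capture the job ID for later use with squeue, scontrol, or seff
JOBID=$(sbatch --parsable test.sh)
echo "Submitted job $JOBID"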

Step 3: Monitor Your Job

# Check your job status
squeue -u $USER

# View detailed job information
scontrol show job 12345

Job states:

  • PD (Pending) - Waiting for resources
  • CF (Configuring) - Node is being configured
  • R (Running) - Job is running
  • CG (Completing) - Job is finishing up
  • CD (Completed) - Job finished successfully
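
To poll a job's state without rerunning squeue by hand, you can wrap it in watch (assuming watch is available on the login node); a small sketch:

# Refresh your job list every 30 seconds; press Ctrl+C to stop
watch -n 30 squeue --me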

Step 4: View Results

When the job completes, SLURM writes the output files named in the job script to your submission directory:

  • .out file: Standard output (stdout) messages
  • .err file: Standard error (stderr) messages

cat job_12345.out
cat job_12345.err
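
While a job is still running, you can follow its output as it is written instead of waiting for completion; the file name below follows the job_%j.out pattern from the script above:

# Stream new stdout lines as they appear; press Ctrl+C to stop
tail -f job_12345.out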

Step 5: Check Resource Usage

After your job completes, use seff to see actual resource consumption:

seff 12345

Example output:

Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 2
CPU Utilized: 00:05:00
CPU Efficiency: 50.00% of 00:10:00 core-walltime
Memory Utilized: 2.00 GB
Memory Efficiency: 28.57% of 7 GB (7.00 GB/node)

This helps you right-size future jobs. In this example only 2 GB of the 7 GB requested was used, so a smaller, cheaper instance would likely suffice next time.
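
If you want more detail than seff provides, sacct can report peak memory (MaxRSS) and CPU time for the job and each of its steps; a minimal sketch using standard sacct format fields:

# Peak memory and CPU time for job 12345 and its steps
sacct -j 12345 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State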

Instance Types and Memory Limits

Use --constraint to select instance types. Set --mem below the instance limit to leave headroom for system processes:

Instance Type    CPUs   Available Memory    Recommended --mem
cpu2mem4         2      4 GB                3 GB
cpu2mem8         2      8 GB                7 GB
cpu4mem32        4      32 GB               30 GB
cpu8mem64        8      64 GB               60 GB
cpu16mem128      16     128 GB              120 GB

Note: If you omit --mem or use --mem=0, the default is 4GB. Be conservative (even stingy!) with memory requests to avoid unnecessary costs.
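
As an example of applying the table, a job that seff shows using about 2 GB could be resubmitted on the smallest instance type. A sketch of the relevant header lines; the a/b pool suffixes follow the same convention as the cpu2mem8 example above and may differ on your cluster:

#SBATCH --constraint="cpu2mem4a|cpu2mem4b"   # Assumed pool names following the a/b convention above
#SBATCH --mem=3GB                            # Recommended limit for cpu2mem4 (see table)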

Job Arrays

Job arrays automatically parallelize many similar tasks from a single script; as shown below, multiple array tasks can also share compute instances efficiently.

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --constraint="cpu2mem16a|cpu2mem16b"
#SBATCH --time=1:00:00
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --mem=7GB
#SBATCH --array=1-10

export PATH=$HOME/.pixi/bin:$PATH

echo "Processing task $SLURM_ARRAY_TASK_ID"
python process_data.py --input data_${SLURM_ARRAY_TASK_ID}.csv

Multiple array jobs can share compute instances efficiently. For example, 4 jobs requesting --mem=7GB each can share 2 instances of cpu2mem16 (16GB per instance):

36266_0  ondemand  my_job  user  CF  0:03  ondemand-dy-cpu2mem16a-2
36266_1  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-2
36266_2  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-3
36266_3  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-3

Important: Do not use --exclusive with job arrays. The --exclusive flag reserves the entire node, preventing efficient resource sharing.
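
A common refinement is to map the array index to a line in a manifest file, so one script can drive any number of inputs; the %N suffix on --array is standard SLURM syntax for capping how many tasks run at once. A minimal sketch, where inputs.txt is a hypothetical file with one input path per line:

#SBATCH --array=1-100%10    # 100 tasks, at most 10 running at a time

# Pick the line of the (hypothetical) manifest matching this task's index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
python process_data.py --input "$INPUT"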

Using Multiple Instance Types

The --constraint option accepts multiple instance types using the OR operator (|):

#SBATCH --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a"

How it works: Each instance type (e.g., cpu2mem16a, cpu2mem16b, cpu4mem32a) maps to a pool of AWS EC2 instances that share that type's specs (for cpu2mem16a and cpu2mem16b, 2 CPUs and 16GB RAM) but run on different underlying hardware (e.g., different AMD or Intel processor models). When you specify multiple types:

  1. SLURM can draw from multiple AWS instance pools
  2. If one pool is unavailable, jobs can still start on another
  3. Jobs start faster because more resources are available

Why use this with job arrays: When submitting many jobs, AWS may not have enough instances of a single type available. Using multiple types increases the chance of getting resources quickly.

Important: Only combine multiple instance types when submitting job arrays or many similar jobs that can fully utilize the allocated resources. For a single job, specifying multiple types risks wasting resources. For example, in the setup --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a" if your job only needs 2 CPUs and 16GB memory but gets assigned to a cpu4mem32a instance, half the resources go unused.

GPU Jobs

GPU jobs require specifying a partition. Available GPU options:

A10G GPU (General purpose)

Interactive session:

srun --partition=gpu-a10g --time=24:00:00 --constraint=gpu4mem96 --pty /bin/bash

Batch job:

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu-a10g
#SBATCH --constraint=gpu1mem64
#SBATCH --time=4:00:00
#SBATCH --output=gpu_%j.out
#SBATCH --error=gpu_%j.err

export PATH=$HOME/.pixi/bin:$PATH
python train_model.py
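
To confirm the GPU is actually visible to your job, it can help to log the device before training starts. A minimal sketch, assuming the NVIDIA driver utilities are on the compute node's PATH:

# Log the allocated GPU's model and memory to the job's .out file
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader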

H100 GPU (High performance)

Interactive session (~$6.88/hr):

srun --partition=gpu-h100 --time=24:00:00 --constraint=gpu1mem256 --pty /bin/bash

See Compute Pricing for GPU instance costs.

Frequently Used SLURM Commands

Job Submission

Command             Description
sbatch script.sh    Submit a batch job
srun --pty bash     Start an interactive shell on a compute node
salloc -c 4         Allocate resources while staying on the login node

For interactive work, srun --pty bash is the simplest option: it allocates resources and immediately opens a shell on a compute node. salloc is more advanced: it allocates resources while keeping you on the login node, so you can run multiple srun commands against the same allocation. To release the allocation, type exit (or press Ctrl+D) to end the session, or run scancel <job_id>.
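
A typical salloc session might look like the sketch below (resource values are arbitrary and my_script.py is a placeholder):

# Hold 4 CPUs for one hour while staying on the login node
salloc -c 4 --mem=7GB --time=1:00:00

# Each srun below runs on the allocated compute node
srun hostname
srun python my_script.py

# Release the allocation when done
exit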

Job Monitoring

Command                      Description
squeue -u $USER              Show your jobs
squeue --me                  Show your jobs (shorthand)
scontrol show job <job_id>   Detailed job information
seff <job_id>                Resource usage after completion

Job Management

Command                      Description
scancel <job_id>             Cancel a specific job
scancel -u $USER             Cancel all your jobs
scancel -u $USER -t pending  Cancel all pending jobs

Cluster Information

Command                        Description
sinfo                          Show partition and node status
sinfo -o "%P %N %c %m %G %f"   Show available instance types
scontrol show node <node>      Detailed node information

Job History

# Completed jobs with resource details
sacct -u $USER --format=JobID,JobName%30,State,Elapsed,AllocTRES%32

Troubleshooting

Job stuck in PENDING

Check the reason:

squeue --me --start

Common reasons:

  • Resources - Waiting for nodes to become available
  • Priority - Higher priority jobs are ahead
  • QOSGrpCpuLimit - Your group has reached its CPU limit
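
For a specific job, the scheduler's stated reason also appears in the scontrol output; a quick way to pull it out:

# Show the line containing the pending reason for job 12345
scontrol show job 12345 | grep -i reason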

Job failed immediately

Check the error output:

cat your_job_<jobid>.err

Common issues:

  • Memory exceeded: Increase --mem or use a larger instance type
  • Time exceeded: Increase --time
  • Command not found: Check your environment setup (export PATH=...)
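
sacct can usually confirm which of these occurred, since it records the final state (e.g., OUT_OF_MEMORY, TIMEOUT, FAILED) along with peak memory. A minimal sketch:

# Final state, exit code, runtime, and peak memory for job 12345
sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS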

Job Management Tips

Job Naming and Tagging

Use these options to organize and track jobs:

Option       Purpose
--job-name   Stable, cost-relevant grouping
--comment    Flexible metadata (project, analysis type)
--account    Billing attribution

Check your account associations:

sacctmgr show user $USER withassoc
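
Put together, these options can sit at the top of a job script; the names below are placeholders, not real projects or accounts:

#SBATCH --job-name=qc_pipeline          # Placeholder: stable name for cost grouping
#SBATCH --comment="projectX, batch 3"   # Placeholder: free-form metadata
#SBATCH --account=my_lab                # Placeholder: must match an association shown by sacctmgr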

Cost-Saving

  1. Match instance to workload: for example, don't request cpu16mem128 for a job that only uses 2 GB
  2. Right-size your jobs: related to the above, use seff to check actual usage and reduce requested resources
  3. Use job arrays: Multiple small jobs can share instances efficiently

For more optimization strategies, see Job Arrays & Cost Optimization.

Next Steps