SLURM Quick Start

Run your first job with step-by-step instructions and job script examples

Before You Begin

Ensure you have:

SSH access to the HPC cluster (see Getting Started)
Your software environment configured (see Software Setup)

Understanding Job Submission

On the HPC cluster, you do not run computationally intensive tasks directly on the login node. Instead, you submit jobs to SLURM, which allocates compute nodes and runs your tasks there.

Key concepts:

Login node: Where you SSH into; for editing files and submitting jobs only
Compute node: Where your jobs actually run; allocated by SLURM
Job script: A bash script with SLURM directives that tells the scheduler what resources you need

Recommended tools:

For text editing: Use micro or helix (hx is the command prompt) as alternatives to the more traditional nano and vim:
- pixi global install helix; pixi global install micro
- Or use VS Code on your computer to view and edit code on the login node
For interactive computing: Use JupyterLab, VS Code or RStudio Server on compute nodes

Your First Job

Step 1: Create a Job Script

Here is an example test.sh using a cpu2mem8 instance (2 CPUs, 8GB RAM) efficiently:

#!/bin/bash                     # Use bash to run this script
#SBATCH --job-name=my_job       # Job name for identification
#SBATCH --constraint="cpu2mem8a|cpu2mem8b"  # Instance type (use | for multiple pools)
#SBATCH --nodes=1               # Request one compute node
#SBATCH --ntasks=1              # Run one task
#SBATCH --cpus-per-task=2       # Use all 2 CPUs on the node
#SBATCH --mem=7GB              # Request 7GB (leave ~1GB for system overhead, different instance type need different baseline memory, pls see ## Instance Types and Memory Limits for details)
#SBATCH --time=4:00:00          # Maximum runtime (HH:MM:SS)
#SBATCH --output=job_%j.out     # Standard output file (%j = job ID)
#SBATCH --error=job_%j.err      # Standard error file

# Sync the computing environment
source ~/.bashrc # Or simply expose the PATH, via: export PATH=$HOME/.pixi/bin:$PATH

# Test commands
echo "Hello from $(hostname)"
python -c "import sys; print(f'Hello from Python {sys.version_info.major}.{sys.version_info.minor}')"
Rscript -e "cat(paste('Hello from R', R.version.string, '\n'))"

Key points:

--cpus-per-task=2 gives you all 2 CPUs availabe on the compute instance; use --cpus-per-task=1 if your code does not involve multi-thread computing (or use any multi-thread software packages)
--mem=7GB leaves ~1GB headroom for VM system processes (requesting full GB may cause failures)
Use --constraint with | to draw from multiple AWS instance pools for faster scheduling

See this page for a list of available instance types and their pricing.

Step 2: Submit Your Job

sbatch test.sh

You will see: Submitted batch job 12345

Step 3: Monitor Your Job

# Check your job status
squeue -u $USER

# View detailed job information
scontrol show job 12345

Job states:

PD (Pending) - Waiting for resources
CF (Configuring) - Node is being configured
R (Running) - Job is running
CG (Completing) - Job is finishing up
CD (Completed) - Job finished successfully

Step 4: View Results

Upon completion, due to specifications in the job script above, SLURM creates output files in your directory:

.out file: Standard output (stdout) messages
.err file: Standard error (stderr) messages

cat test_12345.out
cat test_12345.err

Step 5: Check Resource Usage

After your job completes, use seff to see actual resource consumption:

seff 12345

Example output:

Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:05:00
CPU Efficiency: 99.30% of 00:05:00 core-walltime
Memory Utilized: 2.00 GB
Memory Efficiency: 28.57% of 7 GB (7.00 GB/node)

This helps you right-size future jobs. In this example, only 2 GB was used out of 7 GB requested. This is when you know that a smaller instance might be used next time for this job to save costs.

Instance Types and Memory Limits

Use --constraint to select instance types. Leave 1-2 GB headroom for system processes. For example:

Instance Type	CPUs	Total Memory	Recommended `--mem`
cpu2mem4	2	4 GB	3 GB
cpu2mem8	2	8 GB	7 GB
cpu4mem32	4	32 GB	30 GB
cpu8mem64	8	64 GB	62 GB
cpu16mem128	16	128 GB	126 GB

Note: If you omit --mem or use --mem=0, the default is 4GB. Please be careful and conservative (even stingy!) on memory usage to save unnecessary costs.

Job Arrays

Job arrays allow automatic parallelization of similar tasks. Multiple jobs can share compute instances efficiently. Please see the real data example for parallel analysis by gene.

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --constraint="cpu2mem16a|cpu2mem16b"
#SBATCH --time=1:00:00
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --mem=7GB
#SBATCH --array=1-10

export PATH=$HOME/.pixi/bin:$PATH

echo "Processing task $SLURM_ARRAY_TASK_ID"
python process_data.py --input data_${SLURM_ARRAY_TASK_ID}.csv

Multiple array jobs can share compute instances efficiently. For example, 4 jobs requesting --mem=7GB each can share 2 instances of cpu2mem16 (16GB per instance):

36266_0  ondemand  my_job  user  CF  0:03  ondemand-dy-cpu2mem16a-2
36266_1  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-2
36266_2  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-3
36266_3  ondemand  my_job  user  CF  0:02  ondemand-dy-cpu2mem16a-3

Important: Do not use --exclusive with job arrays. The --exclusive flag reserves the entire node, preventing efficient resource sharing.

Using Multiple Instance Types

The --constraint option accepts multiple instance types using the OR operator (|):

#SBATCH --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a"

How it works: Each instance type (e.g., cpu2mem16a, cpu2mem16b, cpu4mem32a) maps to a pool of AWS EC2 instances with the same service unit specs (2 CPUs, 16GB RAM) but different underlying hardware (e.g. different models of AMD or Intel processors). When you specify multiple types:

SLURM can draw from multiple AWS instance pools
If one pool is unavailable, jobs can still start on another
Jobs start faster because more resources are available

Why use this with job arrays: When submitting many jobs, AWS may not have enough instances of a single type available. Using multiple types increases the chance of getting resources quickly.

Important: Only combine multiple instance types when submitting job arrays or many similar jobs that can fully utilize the allocated resources. For a single job, specifying multiple types risks wasting resources. For example, in the setup --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a" if your job only needs 2 CPUs and 16GB memory but gets assigned to a cpu4mem32a instance, half the resources go unused.

GPU Jobs

GPU jobs require specifying a partition. Available GPU options:

A10G GPU (General purpose)

Interactive session:

srun --partition=gpu-a10g --time=24:00:00 --constraint=gpu4mem96 --pty /bin/bash

Batch job:

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu-a10g
#SBATCH --constraint=gpu1mem64
#SBATCH --time=4:00:00
#SBATCH --output=gpu_%j.out
#SBATCH --error=gpu_%j.err

export PATH=$HOME/.pixi/bin:$PATH
python train_model.py

H100 GPU (High performance)

Interactive session (~$6.88/hr):

srun --partition=gpu-h100 --time=24:00:00 --constraint=gpu1mem256 --pty /bin/bash

See Compute Pricing for GPU instance costs.

Frequently Used SLURM Commands

Job Submission

Command	Description
`sbatch script.sh`	Submit a batch job
`srun --pty bash`	Start interactive shell on compute node
`salloc -c 4`	Allocate resources without starting a shell

For interactive work, srun --pty bash is the simplest, which allocates resources and immediately opens a shell on the compute node. salloc is more advanced and allocates resources while keeping you on the login node, allowing you to run multiple srun commands against the same allocation next. To release those allocations, type exit or press Ctrl+D to end your session, or, run scancel <job_id> to release the allocation..

Job Monitoring

Command	Description
`squeue -u $USER`	Show your jobs
`squeue --me`	Show your jobs (shorthand)
`scontrol show job <job_id>`	Detailed job information
`seff <job_id>`	Resource usage after completion

Job Management

Command	Description
`scancel <job_id>`	Cancel a specific job
`scancel -u $USER`	Cancel all your jobs
`scancel -u $USER -t pending`	Cancel all pending jobs

Cluster Information

Command	Description
`sinfo`	Show partition and node status
`sinfo -o "%P %N %c %m %G %f"`	Show available instance types
`scontrol show node <node>`	Detailed node information

Job History

# Completed jobs with resource details
sacct -u $USER --format=JobID,JobName%30,State,Elapsed,AllocTRES%32

Troubleshooting

Job stuck in PENDING

Check the reason:

squeue --me --start

Common reasons:

Resources - Waiting for nodes to become available
Priority - Higher priority jobs are ahead
QOSGrpCpuLimit - Your group has reached its CPU limit

Job failed immediately

Check the error output:

cat your_job_<jobid>.err

Common issues:

Memory exceeded: Increase --mem or use larger instance
Time exceeded: Increase --time
Command not found: Check your environment setup (export PATH=...)

Job Management Tips

Job Naming and Tagging

Use these options to organize and track jobs:

Option	Purpose
`--job-name`	Stable, cost-relevant grouping
`--comment`	Flexible metadata (project, analysis type)
`--account`	Billing attribution

Check your account associations:

sacctmgr show user $USER withassoc

Cost-Saving

Match instance to workload: For example don’t request cpu16mem128 for a job that uses 2GB
Right-size your jobs: Related to above, use seff to check actual usage and reduce requested resources
Use job arrays: Multiple small jobs can share instances efficiently

For more optimization strategies, see Job Arrays & Cost Optimization.

Next Steps

Job Arrays & Cost Optimization: Advanced optimization
SLURM Reference: Complete command reference