SLURM Quick Start
Before You Begin
Ensure you have:
- SSH access to the HPC cluster (see Getting Started)
- Your software environment configured (see Software Setup)
Understanding Job Submission
On the HPC cluster, you do not run computationally intensive tasks directly on the login node. Instead, you submit jobs to SLURM, which allocates compute nodes and runs your tasks there.
Key concepts:
- Login node: Where you SSH into; for editing files and submitting jobs only
- Compute node: Where your jobs actually run; allocated by SLURM
- Job script: A bash script with SLURM directives that tells the scheduler what resources you need
Recommended tools:
- For text editing: Use VS Code to view and edit code on the login node
- For interactive computing: Use JupyterLab or RStudio Server on compute nodes
Your First Job
Step 1: Create a Job Script
Here is an example test.sh that makes efficient use of a cpu2mem8 instance (2 CPUs, 8 GB RAM):
#!/bin/bash
# Use bash to run this script (a comment must not share the shebang line)
#SBATCH --job-name=my_job # Job name for identification
#SBATCH --constraint="cpu2mem8a|cpu2mem8b" # Instance type (use | for multiple pools)
#SBATCH --nodes=1 # Request one compute node
#SBATCH --ntasks=1 # Run one task
#SBATCH --cpus-per-task=2 # Use all 2 CPUs on the node
#SBATCH --mem=7GB # Request 7GB; leave ~1GB for system overhead (the baseline varies by instance type; see Instance Types and Memory Limits below)
#SBATCH --time=4:00:00 # Maximum runtime (HH:MM:SS)
#SBATCH --output=job_%j.out # Standard output file (%j = job ID)
#SBATCH --error=job_%j.err # Standard error file
# Sync the computing environment
source ~/.bashrc # Or simply expose the PATH via: export PATH=$HOME/.pixi/bin:$PATH
# Test commands
echo "Hello from $(hostname)"
python -c "import sys; print(f'Hello from Python {sys.version_info.major}.{sys.version_info.minor}')"
Rscript -e "cat(paste('Hello from R', R.version.string, '\n'))"
Key points:
- `--cpus-per-task=2` gives you both CPUs on the node; use `--cpus-per-task=1` if your code is single-threaded
- `--mem=7GB` leaves ~1GB headroom for VM system processes (requesting the full 8GB may cause failures)
- Use `--constraint` with `|` to draw from multiple AWS instance pools for faster scheduling
See this page for a list of available instance types and their pricing.
Step 2: Submit Your Job
sbatch test.sh
You will see: Submitted batch job 12345
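If you need the job ID inside a script (for example, to monitor or cancel the job later), sbatch's --parsable option prints just the ID instead of the full message. A minimal sketch using the test.sh above:
# Capture the job ID at submission time (on a single cluster, --parsable prints only the ID)
JOB_ID=$(sbatch --parsable test.sh)
echo "Submitted job $JOB_ID"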
Step 3: Monitor Your Job
# Check your job status
squeue -u $USER
# View detailed job information
scontrol show job 12345
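If the default squeue columns truncate your job names, you can choose your own output format. The format codes below are standard squeue ones (%i job ID, %j name, %t state, %M elapsed time, %R reason or node list); adjust the widths to taste:
# Wider, custom columns for your own jobs
squeue --me -o "%.12i %.25j %.2t %.10M %R"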
Job states:
- `PD` (Pending) - Waiting for resources
- `CF` (Configuring) - Node is being configured
- `R` (Running) - Job is running
- `CG` (Completing) - Job is finishing up
- `CD` (Completed) - Job finished successfully
Step 4: View Results
Upon completion, SLURM writes the output files specified in the job script above to your working directory:
- `.out` file: Standard output (stdout) messages
- `.err` file: Standard error (stderr) messages
cat job_12345.out
cat job_12345.err
Step 5: Check Resource Usage
After your job completes, use seff to see actual resource consumption:
seff 12345
Example output:
Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:05:00
CPU Efficiency: 99.30% of 00:05:00 core-walltime
Memory Utilized: 2.00 GB
Memory Efficiency: 28.57% of 7 GB (7.00 GB/node)
This helps you right-size future jobs. In this example, only 2 GB of the 7 GB requested was used, so a smaller instance would likely suffice next time and save costs.
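If you want the raw numbers behind seff (for example, peak memory per job step), sacct reports them directly. A sketch using standard sacct fields and the job ID from the example above:
# Per-step accounting: state, wall time, CPU time, peak memory (MaxRSS), requested memory (ReqMem)
sacct -j 12345 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,ReqMem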
Instance Types and Memory Limits
Use --constraint to select instance types. Set --mem below the instance limit to leave headroom for system processes:
| Instance Type | CPUs | Available Memory | Recommended --mem |
|---|---|---|---|
| cpu2mem4 | 2 | 4 GB | 3 GB |
| cpu2mem8 | 2 | 8 GB | 7 GB |
| cpu4mem32 | 4 | 32 GB | 30 GB |
| cpu8mem64 | 8 | 64 GB | 60 GB |
| cpu16mem128 | 16 | 128 GB | 120 GB |
Note: If you omit --mem or use --mem=0, the default is 4GB. Be careful and conservative (even stingy!) with memory requests to avoid unnecessary costs.
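In practice, --constraint and --mem should be chosen as a pair from the table above. A minimal header sketch for a cpu4mem32 job (the a suffix denotes a specific instance pool, as in the earlier examples; check your cluster for the exact pool names):
#SBATCH --constraint="cpu4mem32a"   # 4 CPUs, 32 GB instance pool (name assumed from the examples above)
#SBATCH --cpus-per-task=4           # use all 4 CPUs on the node
#SBATCH --mem=30GB                  # stay below the 32 GB limit, per the table above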
Job Arrays
Job arrays allow automatic parallelization of similar tasks. Multiple jobs can share compute instances efficiently.
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --constraint="cpu2mem16a|cpu2mem16b"
#SBATCH --time=1:00:00
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --mem=7GB
#SBATCH --array=1-10
export PATH=$HOME/.pixi/bin:$PATH
echo "Processing task $SLURM_ARRAY_TASK_ID"
python process_data.py --input data_${SLURM_ARRAY_TASK_ID}.csv
Multiple array jobs can share compute instances efficiently. For example, 4 jobs requesting --mem=7GB each can share 2 instances of cpu2mem16 (16GB per instance):
36266_0 ondemand my_job user CF 0:03 ondemand-dy-cpu2mem16a-2
36266_1 ondemand my_job user CF 0:02 ondemand-dy-cpu2mem16a-2
36266_2 ondemand my_job user CF 0:02 ondemand-dy-cpu2mem16a-3
36266_3 ondemand my_job user CF 0:02 ondemand-dy-cpu2mem16a-3
Important: Do not use --exclusive with job arrays. The --exclusive flag reserves the entire node, preventing efficient resource sharing.
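A common pattern is to drive the array from a list of input files rather than numbering the files themselves. The sketch below assumes a hypothetical inputs.txt with one input path per line; each array task reads the line matching its index:
# Inside the array job script: look up this task's input (inputs.txt is a hypothetical manifest file)
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"
python process_data.py --input "$INPUT"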
Using Multiple Instance Types
The --constraint option accepts multiple instance types using the OR operator (|):
#SBATCH --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a"
How it works: Each instance type (e.g., cpu2mem16a, cpu2mem16b, cpu4mem32a) maps to a pool of AWS EC2 instances with the same service unit specs (2 CPUs, 16GB RAM) but different underlying hardware (e.g. different models of AMD or Intel processors). When you specify multiple types:
- SLURM can draw from multiple AWS instance pools
- If one pool is unavailable, jobs can still start on another
- Jobs start faster because more resources are available
Why use this with job arrays: When submitting many jobs, AWS may not have enough instances of a single type available. Using multiple types increases the chance of getting resources quickly.
Important: Only combine multiple instance types when submitting job arrays or many similar jobs that can fully utilize the allocated resources. For a single job, specifying multiple types risks wasting resources. For example, with --constraint="cpu2mem16a|cpu2mem16b|cpu4mem32a", if your job only needs 2 CPUs and 16GB of memory but is assigned to a cpu4mem32a instance, half of that instance's resources go unused.
GPU Jobs
GPU jobs require specifying a partition. Available GPU options:
A10G GPU (General purpose)
Interactive session:
srun --partition=gpu-a10g --time=24:00:00 --constraint=gpu4mem96 --pty /bin/bash
Batch job:
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu-a10g
#SBATCH --constraint=gpu1mem64
#SBATCH --time=4:00:00
#SBATCH --output=gpu_%j.out
#SBATCH --error=gpu_%j.err
export PATH=$HOME/.pixi/bin:$PATH
python train_model.py
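Before a long training run, it is worth confirming that the job actually sees a GPU. A short check you could add near the top of the batch script (the second line assumes your environment includes PyTorch; skip it otherwise):
nvidia-smi --query-gpu=name,memory.total --format=csv   # list the allocated GPU(s)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"   # assumes torch is installed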
H100 GPU (High performance)
Interactive session (~$6.88/hr):
srun --partition=gpu-h100 --time=24:00:00 --constraint=gpu1mem256 --pty /bin/bash
See Compute Pricing for GPU instance costs.
Frequently Used SLURM Commands
Job Submission
| Command | Description |
|---|---|
| `sbatch script.sh` | Submit a batch job |
| `srun --pty bash` | Start an interactive shell on a compute node |
| `salloc -c 4` | Allocate resources without starting a shell |
For interactive work, `srun --pty bash` is the simplest option: it allocates resources and immediately opens a shell on the compute node.
`salloc` is more advanced: it allocates resources while keeping you on the login node, so you can run multiple `srun` commands against the same allocation.
To release the allocation, type `exit` or press Ctrl+D to end your session, or run `scancel <job_id>`.
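As a concrete sketch of the salloc workflow (the analysis script name is just a placeholder):
# Reserve 2 CPUs and 7 GB for 2 hours; you remain on the login node
salloc --constraint="cpu2mem8a|cpu2mem8b" --cpus-per-task=2 --mem=7GB --time=2:00:00
# Commands launched with srun now run on the allocated compute node
srun hostname
srun python my_analysis.py   # placeholder script
# Release the allocation when finished
exit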
Job Monitoring
| Command | Description |
|---|---|
| `squeue -u $USER` | Show your jobs |
| `squeue --me` | Show your jobs (shorthand) |
| `scontrol show job <job_id>` | Detailed job information |
| `seff <job_id>` | Resource usage after completion |
Job Management
| Command | Description |
|---|---|
| `scancel <job_id>` | Cancel a specific job |
| `scancel -u $USER` | Cancel all your jobs |
| `scancel -u $USER -t pending` | Cancel all pending jobs |
Cluster Information
| Command | Description |
|---|---|
| `sinfo` | Show partition and node status |
| `sinfo -o "%P %N %c %m %G %f"` | Show available instance types |
| `scontrol show node <node>` | Detailed node information |
Job History
# Completed jobs with resource details
sacct -u $USER --format=JobID,JobName%30,State,Elapsed,AllocTRES%32
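By default sacct only shows recent jobs (typically those since the start of the current day); add an explicit time window to review an older period. The dates below are illustrative:
# Jobs from a specific date range
sacct -u $USER --starttime=2024-06-01 --endtime=2024-06-30 \
      --format=JobID,JobName%30,State,Elapsed,AllocTRES%32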
Troubleshooting
Job stuck in PENDING
Check the reason:
squeue --me --start
Common reasons:
- `Resources` - Waiting for nodes to become available
- `Priority` - Higher priority jobs are ahead
- `QOSGrpCpuLimit` - Your group has reached its CPU limit
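The reason code is also available per job via squeue's %r field or scontrol, which can be easier to read for a single job. For example:
# Show only pending jobs together with their reason codes
squeue --me --state=PENDING -o "%.12i %.25j %r"
# Or inspect one job in detail
scontrol show job 12345 | grep -i reason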
Job failed immediately
Check the error output:
cat your_job_<jobid>.err
Common issues:
- Memory exceeded: increase `--mem` or use a larger instance
- Time exceeded: increase `--time`
- Command not found: check your environment setup (`export PATH=...`)
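For memory-related failures specifically, sacct can confirm how much memory the job actually used versus what was requested (replace <jobid> with your job's ID):
# Compare peak memory (MaxRSS) against the request (ReqMem); an OUT_OF_MEMORY state indicates the job exceeded its memory limit
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem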
Job Management Tips
Job Naming and Tagging
Use these options to organize and track jobs:
| Option | Purpose |
|---|---|
| `--job-name` | Stable, cost-relevant grouping |
| `--comment` | Flexible metadata (project, analysis type) |
| `--account` | Billing attribution |
Check your account associations:
sacctmgr show user $USER withassoc
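These options can also be passed on the command line at submission time, which is convenient when reusing one script across projects. A sketch with illustrative values (the account must match one of your associations from the command above):
sbatch --job-name=align_rnaseq \
       --comment="projectX, QC pass 2" \
       --account=my_lab \
       test.sh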
Cost-Saving
- Match instance to workload: for example, don't request `cpu16mem128` for a job that uses 2GB
- Right-size your jobs: related to the above, use `seff` to check actual usage and reduce requested resources
- Use job arrays: multiple small jobs can share instances efficiently
For more optimization strategies, see Job Arrays & Cost Optimization.
Next Steps
- Job Arrays & Cost Optimization: Advanced optimization
- SLURM Reference: Complete command reference