For lab managers

Usage tracking, cost management, and administrative tools

HPC cluster usage stats

SLURM provides accounting for all jobs run on the HPC cluster, covering both the CPU and GPU clusters. The accounting data includes the job submitter, duration, CPU and memory utilization, and the job state (completed, failed, etc.).
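
For example, a quick way to pull these data points for recent jobs (the format fields below are standard sacct fields, and the 7-day window is only an illustration; drop -a to limit the output to your own jobs):

# Show submitter, duration, CPU/memory utilization, and state for jobs from the last 7 days
sacct -a -S now-7days --format=JobID,User,Elapsed,TotalCPU,MaxRSS,State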

Calculating total usage of the group

The following script calculates “CPU Hours” and “GPU Hours” for the previous month:

# Get dates of last month
lastmonth=$(date -d 'month ago' '+%Y-%m-01')
lastmonthend=$(date -d "$(date +%Y-%m-01) -1 day" +%Y-%m-%d)

# Get all CPU jobs (excludes failed jobs and GPU partition)
sacct -S $lastmonth -E $lastmonthend -a -X -T -P \
  --format=User,elapsedraw,cputimeraw,JobID,partition,state%20,nnodes,ncpus,start,end \
  | grep -v "FAILED" | grep -v "GEN-GPU" > report-cpu.txt

# Get all GPU jobs (GPU partition only, excludes failed jobs)
sacct -S $lastmonth -E $lastmonthend -a -X -T -P -r GEN-GPU \
  --format=User,elapsedraw,cputimeraw,JobID,partition,state%20,nnodes,ncpus,start,end \
  | grep -v "FAILED" > report-gpu.txt

The script above computes the past month’s usage, including all jobs (except “failed” jobs) that used CPU/GPU resources during that month. This includes jobs that started in a prior month or continue into a future month. For example, if a job ran from 7/25/2024 12:00 AM to 9/5/2024 8:00 PM, the August report will show 744 hours (24 hours * 31 days).
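
Because the reports are pipe-delimited (-P) and ElapsedRaw is the second field, one way to total the hours from the generated files is, for example:

# Sum elapsed seconds (field 2) and convert to hours; NR>1 skips the header line
awk -F'|' 'NR>1 {s+=$2} END {printf "%.0f CPU Hours\n", s/3600}' report-cpu.txt
awk -F'|' 'NR>1 {s+=$2} END {printf "%.0f GPU Hours\n", s/3600}' report-gpu.txt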

The CPU and GPU “hours” reported here are derived from each job’s elapsed duration, regardless of how many processing units were allocated. This differs from core-hours or GPU-hours (elapsed time multiplied by the number of allocated units). For example, both of the following jobs count as 1 CPU/GPU hour under the current calculation method (a sketch for computing core-hours instead follows the list):

  1. A job that runs from 8/1/2024 8:00 AM to 9:00 AM with two CPUs is reported as 1 CPU hour (3600 seconds), even though it consumed 2 core-hours (3600 seconds x 2 CPUs).
  2. A job that runs from 8/1/2024 8:00 AM to 9:00 AM with forty GPUs is reported as 1 GPU hour (3600 seconds), even though it consumed 40 GPU-hours (3600 seconds x 40 GPUs).
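
If actual core-hours are needed instead, the CPUTimeRAW column (elapsed seconds multiplied by allocated CPUs, the third field in the reports generated above) can be summed the same way; this is a sketch rather than part of the standard report, and per-GPU accounting would additionally require parsing AllocTRES:

# Sum CPU core-seconds (field 3) and convert to core-hours; NR>1 skips the header line
awk -F'|' 'NR>1 {s+=$3} END {printf "%.0f core-hours\n", s/3600}' report-cpu.txt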

Calculating CPU usage for a single user

To calculate CPU hours for a specific user:

  1. Determine the report range in YYYY-MM-DD format (e.g., 2024-08-01 to 2024-08-15)
  2. Run the following command (replace {UNI} with the user’s ID):
sacct -S 2024-08-01 -E 2024-08-15 -u {UNI} -X -T \
  -o elapsedraw,state%20,partition \
  | grep -v "FAILED" | grep -v "GEN-GPU" \
  | awk '{s+=$1} END {print s/3600}'
  3. Output provides the total hours. Round to the nearest hour.

Calculating GPU usage for a single user

To calculate GPU hours for a specific user:

  1. Determine the report range in YYYY-MM-DD format (e.g., 2024-08-01 to 2024-08-15)
  2. Run the following command (replace {UNI} with the user’s ID):
sacct -S 2024-08-01 -E 2024-08-15 -u {UNI} -X -T -r GEN-GPU \
  -o elapsedraw,state%20,partition \
  | grep -v "FAILED" \
  | awk '{s+=$1} END {print s/3600}'
  3. Output provides the total hours. Round to the nearest hour.

More details on SLURM job accounting

sacct - View job accounting data (docs)

Option                    Description
-A, --account=<list>      Filter by accounts (comma-separated list)
-X, --allocations         Show job allocations, but not job steps
-a, --allusers            Show jobs for all users
-E, --endtime=<time>      End of reporting period
-o, --format=<options>    Output format to display
-j, --jobs=<list>         Filter by job IDs (comma-separated list)
--name=<list>             Filter by job names (comma-separated list)
-N, --nodelist=<hosts>    Filter by host names (comma-separated list)
-r, --partition=<list>    Filter by partitions (comma-separated list)
-S, --starttime=<time>    Start of reporting period
-s, --state=<list>        Filter by states (comma-separated list)
-u, --user=<list>         Filter by users (comma-separated list)

Examples:

# View accounting data for specific job with custom format
sacct -j 111111 --format=jobid,jobname,submit,exitcode,elapsed,reqnodes,reqcpus,reqmem

# View compact accounting data for your own jobs for specified time range
sacct -X -S 2024-01-01 -E 2024-02-14

sacctmgr - View or modify account information (docs)

Common commands:

  • sacctmgr show associations - Show all associations
  • sacctmgr show user <UNI> - Show user details

Option                Description
cluster=<clusters>    Filter by clusters (e.g., condo, discovery)
format=<options>      Output format to display
user=<list>           Filter by users (comma-separated list)

Examples:

# View your own associations with custom format
sacctmgr show associations user=$UNI format=cluster,account,user,qos

sreport - Generate reports from accounting data (docs)

Common commands:

  • sreport cluster accountutilizationbyuser - Account utilization by user
  • sreport cluster userutilizationbyaccount - User utilization by account
  • sreport job sizesbyaccount - Job sizes by account
  • sreport user topusage - Top users by usage

Option                Description
-T, --tres=<list>     Resources to report (e.g., cpu, gpu, mem, billing, all)
clusters=<clusters>   Filter by clusters (e.g., condo, discovery)
end=<time>            End of reporting period
format=<options>      Output format to display
start=<time>          Start of reporting period
accounts=<list>       Filter by accounts (comma-separated list)
users=<list>          Filter by users (comma-separated list)
nodes=<hosts>         Filter by host names (job reports only)
partitions=<list>     Filter by partitions (job reports only)
printjobcount         Print number of jobs run instead of time used (job reports only)

Examples:

# Report account utilization for specified user and time range
sreport cluster accountutilizationbyuser start=2024-01-01 end=2024-02-14 users=$UNI

# Report account utilization by user for specified account and time range
sreport cluster userutilizationbyaccount start=2024-01-01 end=2024-02-14 accounts=$ACCOUNT

# Report job sizes for specified partition
sreport job sizesbyaccount partitions=epyc-64 printjobcount

# Report top users for specified account and time range
sreport user topusage start=2024-01-01 end=2024-07-14 accounts=$ACCOUNT

For HPC admins

SLURM QOS

Overview

SLURM QOS provides the ability to set limits and priorities.
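
For example, to see the priority and preemption settings of the existing QOS definitions (Priority, Preempt, and UsageFactor are standard sacctmgr QOS fields):

# List each QOS with its priority, preemption targets, and usage factor
sacctmgr show qos format=Name,Priority,Preempt,UsageFactor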

Precedence

The order in which limits are enforced is:

  1. Partition QOS limit
  2. Job QOS limit
  3. User association
  4. Account association(s), ascending the hierarchy
  5. Root/Cluster association
  6. Partition limit
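
To check whether a QOS is attached to a partition (the QoS and AllowQos fields appear in the scontrol partition records), a quick sketch:

# Show each partition's name along with any attached or allowed QOS
scontrol show partitions | grep -iE "PartitionName|qos"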

Global Limits

Limits imposed globally, at the cluster level, for all users.

See global limits:

sacctmgr show clusters format=Cluster,name,GrpJobs,GrpSubmit,GrpTRES,GrpTRESMins,GrpTRESRunMins,GrpWall,MaxJobs,MaxSubmit,MaxTRESMins,MaxTRES,MaxWall,QOS,Share

See user limits:

sacctmgr list user $UNI withassoc format=user,maxjobs,maxnodes,maxTRES,maxSubmit,maxwall,maxTRESmins
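
As a sketch, association limits can also be changed with sacctmgr modify (requires SLURM administrator privileges; the values here are purely illustrative):

# Example only: cap a user's concurrently running jobs and per-job wall time
sacctmgr modify user $UNI set MaxJobs=50 MaxWall=7-00:00:00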

QOS Limits

Sets specific privileges, as well as limits, for users and jobs. A QOS can be specified or linked per user, partition, or job.

See QOS limits:

sacctmgr list qos format=name,GrpTRES,GrpTRESMins,GrpTRESRunMin,GrpJobs,GrpSubmit,GrpWall,MaxTRES,MaxTRESPerNode,MaxTRESMins,MaxWall,MaxTRESPU,MaxJobsPU,MaxSubmitPU,MaxTRESPA,MaxJobsPA,MaxSubmitPA,MinTRES
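
A sketch of creating a QOS with per-user limits and attaching it to a user (the QOS name and values below are hypothetical):

# Hypothetical QOS capping each user at 4 GPUs and a 2-day wall time
sacctmgr add qos gpu-restricted
sacctmgr modify qos gpu-restricted set MaxTRESPerUser=gres/gpu=4 MaxWall=2-00:00:00
# Attach the QOS to a user's association
sacctmgr modify user $UNI set qos+=gpu-restricted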

Account Limits

Sets specific limits on SLURM accounts.

See account limits:

sacctmgr list account [AccountName] withassoc where user=[$UNI] format=account,GrpJobs,GrpNodes,GrpTRES,GrpMem,GrpSubmit,GrpWall,GrpTRESMins,MaxJobs,MaxNodes,MaxTRES,MaxSubmit,MaxWall,MaxTRESMin
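
A sketch of setting account limits (the values are placeholders; requires administrator privileges):

# Example only: limit an account's total allocated CPUs and concurrently running jobs
sacctmgr modify account [AccountName] set GrpTRES=cpu=1000 MaxJobs=200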

Partition Limits

See partition limits:

scontrol show partitions | egrep -ie "^P|Max"

By default, jobs cannot run for more than 12 days.
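
This default typically corresponds to the partitions' MaxTime value (shown by the command above); as a sketch, an administrator could adjust it with scontrol (the partition name is a placeholder):

# Example only: set a partition's maximum job run time to 12 days
scontrol update PartitionName=<partition> MaxTime=12-00:00:00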