Getting Started

This guide covers the essentials for using the AWS-based HPC system at CU Neurology.

Access

Please refer to our internal document, CU_Neurology_HPC_Info_2026.md (requires CUIMC VPN or on-campus connection), for information on HPC account access and a storage overview.

For off-campus access, first connect to the CUIMC VPN. After you set up your account according to CU_Neurology_HPC_Info_2026.md, you can use SSH to access the login node:

ssh <username>@<hpc_ip>

On Windows, you can use Windows Subsystem for Linux (WSL), which includes ssh and other Linux utilities.
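
If you connect often, you can optionally add a host alias to your SSH configuration so a short name works in place of the full address. A minimal sketch, assuming a hypothetical alias neuro-hpc and the same <username> and <hpc_ip> placeholders as above:

# Append a host alias to ~/.ssh/config (the file is created if it does not exist)
cat >> ~/.ssh/config <<'EOF'
Host neuro-hpc
    HostName <hpc_ip>
    User <username>
EOF

# Connect using the alias
ssh neuro-hpc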

Understanding the system

Resource overview

While AWS HPC theoretically offers near-infinite resources, each research group (lab) has its own cluster configured with default limits:

Resource     Limit
Partitions   CPU on-demand instance, CPU SPOT instance, GPU
CPU nodes    ~4,200 (on-demand + SPOT)
GPU nodes    ~25

Current status (Jan 15, 2026):

  • Only on-demand nodes available; SPOT instance support via MemVerge is currently being tested.
  • The WaveRider feature of MemVerge across compute nodes (on-demand or SPOT) is yet to be implemented.
  • Recommended concurrent jobs: fewer than 1,000. Submitting too many jobs may result in "Resource temporarily unavailable" errors; a quick way to count your own jobs is shown below.

Check partition and node availability:

sinfo
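
To stay within the recommended job count, you can check how many jobs you currently have pending or running. A minimal sketch using standard SLURM commands:

# List your own pending and running jobs
squeue -u $USER

# Count them (-h suppresses the header line)
squeue -u $USER -h | wc -l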

These limits serve as guardrails to control costs: without them, runaway jobs with access to unlimited computing resources can quickly become expensive. To adjust limits temporarily for a specific project, or permanently for the group, please contact the IT team.

Architecture

There are multiple HPC clusters in production. Each PI/group works within a dedicated HPC cluster on AWS. Each cluster has a “head node” for submitting jobs, which is operational 24/7. Compute nodes start as needed when jobs are submitted and power off when jobs complete.

Our HPC cluster uses SLURM (Simple Linux Utility for Resource Management), a widely used open-source workload manager. By default, SLURM allocates resources based on what you request in your job script (CPUs, memory, etc.), not necessarily an entire node. You can configure resource allocation using --constraint to specify instance types and --mem to request specific memory amounts. See SLURM Quick Start for details on configuring job resources.
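
To illustrate request-based allocation, here is a minimal sketch of a batch script; the job name, output file, and resource values are hypothetical and should be adjusted to your workload:

#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --output=example_%j.log   # %j expands to the job ID
#SBATCH --cpus-per-task=4         # request 4 CPU cores, not a whole node
#SBATCH --mem=8G                  # request 8 GB of memory

# Your actual computation goes here
echo "Running on $(hostname)"

Submit the script with sbatch, for example sbatch my_job.sh; SLURM allocates only the CPUs and memory requested in the header.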

When you log in to the HPC cluster, you are placed on the head node, which is shared with other users. This node is intended only for lightweight tasks such as navigating directories, viewing files, inspecting scripts, and submitting jobs. Running computationally intensive tasks, including copying large files, will slow down the system for everyone.

Important: DO NOT run heavy computation on the head node. Always submit jobs to the cluster.
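
If you need to test commands interactively, request an interactive session on a compute node instead of working on the head node. A minimal sketch; the resource values are illustrative:

# Request an interactive shell on a compute node with 2 CPUs and 4 GB of memory
srun --cpus-per-task=2 --mem=4G --pty bash

# Exit the shell when finished to release the node
exit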

The SLURM Job Manager

Most SLURM commands and workflows are the same as on our previous on-premises HPC, with the following key differences:

  • --constraint must match an available instance type; use | to separate multiple options
  • --mem must be below the instance memory limit to leave headroom for system processes (see the example after the table below)

Recommended memory settings by instance type:

Instance      Max --mem
cpu2mem4      3 GB
cpu2mem8      7 GB
cpu4mem32     30 GB
cpu8mem64     60 GB
cpu16mem128   120 GB
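
For example, the following #SBATCH lines accept either of two instance types and keep --mem at the smaller instance's recommended maximum. This is a sketch based on the instance names in the table above; adjust the constraint values to what is actually available on your cluster:

#SBATCH --constraint=cpu4mem32|cpu8mem64   # accept either instance type
#SBATCH --mem=30G                          # at the 30 GB recommended maximum for cpu4mem32

If you pass --constraint on the command line instead of in a script, quote the value (e.g. --constraint="cpu4mem32|cpu8mem64") so the shell does not interpret the | character.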

Documentation

Getting Started

  • SLURM Quick Start: Run your first job with step-by-step instructions and job templates

Software

As on our on-premises HPC, use module avail to list available software and module load to load a module. Contact the IT team if you need additional software installed or if you have trouble installing software under your local account. Alternatively, follow the customized software setup to install R, Python, and other packages on your own in custom paths without using module.
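
A minimal sketch of the module workflow; the module name and version (python/3.11) are illustrative, so check module avail for what is installed on your cluster:

# List all available software modules
module avail

# Load a specific module (name and version are hypothetical)
module load python/3.11

# Show currently loaded modules
module list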

SLURM Guides

Pricing

Interactive Development Environment (IDE)

Advanced Topics

Current Status

For news and known issues, visit our blog.