HPC Feature Wishlist, January 2026

As we continue to mature our AWS-based HPC system, we’ve compiled a wishlist of features and improvements we’d like to see implemented. Some are being actively worked on; others are longer-term goals.

We’ll update this list as features get implemented or priorities shift.

Cost Management

Auto-shutdown for idle resources. We need automatic shutdown of AWS instances when CPU usage drops below a threshold for extended periods. Idle compute nodes are expensive, and manual monitoring isn’t scalable. This is high priority because runaway costs from forgotten jobs are one of the biggest pain points.
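
As a rough illustration of what we have in mind, here is a minimal sketch of an idle-node reaper using boto3 and CloudWatch. The "hpc-compute*" Name tag, the 5% CPU threshold, and the one-hour look-back window are all placeholder assumptions; a production version would also need to confirm no SLURM jobs are still running on a node before stopping it.

    """Sketch of an idle-instance reaper. Tag name, threshold, and window are assumptions."""
    from datetime import datetime, timedelta, timezone
    import boto3

    IDLE_THRESHOLD_PCT = 5.0      # assumed CPU threshold
    IDLE_WINDOW_MINUTES = 60      # assumed look-back window

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    def average_cpu(instance_id: str) -> float:
        """Average CPUUtilization over the look-back window."""
        end = datetime.now(timezone.utc)
        start = end - timedelta(minutes=IDLE_WINDOW_MINUTES)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=start,
            EndTime=end,
            Period=300,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        return sum(p["Average"] for p in points) / len(points) if points else 100.0

    def reap_idle_instances() -> None:
        """Stop running compute nodes whose CPU stayed below the threshold."""
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:Name", "Values": ["hpc-compute*"]},   # hypothetical tag
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        for res in reservations:
            for inst in res["Instances"]:
                if average_cpu(inst["InstanceId"]) < IDLE_THRESHOLD_PCT:
                    ec2.stop_instances(InstanceIds=[inst["InstanceId"]])

    if __name__ == "__main__":
        reap_idle_instances()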

Cost tracking by user and project. We need better visibility into compute costs broken down by user, project, and job type. Currently it’s difficult to attribute costs accurately, which makes budgeting and accountability challenging, especially in collaborative work where a user in one lab may be running compute jobs on behalf of another lab.
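
If we tag instances consistently, the Cost Explorer API can already produce per-project breakdowns. A minimal sketch, assuming a hypothetical "Project" tag that has been activated as a cost allocation tag in the billing console:

    """Sketch of per-project cost reporting via Cost Explorer. The 'Project' tag is an assumption."""
    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    def monthly_costs_by_project(start: str, end: str) -> dict[str, float]:
        """Return unblended cost per Project tag value for a 'YYYY-MM-DD' date range."""
        costs: dict[str, float] = {}
        response = ce.get_cost_and_usage(
            TimePeriod={"Start": start, "End": end},
            Granularity="MONTHLY",
            Metrics=["UnblendedCost"],
            GroupBy=[{"Type": "TAG", "Key": "Project"}],   # hypothetical tag key
        )
        for period in response["ResultsByTime"]:
            for group in period["Groups"]:
                project = group["Keys"][0]                 # e.g. "Project$some-lab"
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                costs[project] = costs.get(project, 0.0) + amount
        return costs

    if __name__ == "__main__":
        print(monthly_costs_by_project("2026-01-01", "2026-02-01"))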

Resource Scaling and Queue Management

Dynamic memory scaling via MemVerge. We’re exploring integrating MemVerge with SLURM so that jobs running low on memory can automatically migrate to higher-memory instances rather than failing. This is particularly valuable for jobs with variable memory profiles that spike briefly and then return to baseline. Currently this “wave-ride” feature does not exist with SLURM, although it is available in the native MemVerge product.
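
Until that exists, a crude fallback is to detect OOM-killed jobs and resubmit them with a larger memory request. A minimal sketch, assuming sacct and sbatch are on the PATH; the doubling policy and the idea of rerunning from scratch (rather than migrating a live job, as MemVerge does) are simplifications, and the batch script path is a placeholder.

    """Sketch: resubmit OOM-killed SLURM jobs with a larger memory request."""
    import subprocess

    def job_state(job_id: str) -> tuple[str, int]:
        """Return (State, requested memory in MB) for a finished job via sacct."""
        out = subprocess.run(
            ["sacct", "-j", job_id, "-n", "-P", "-X", "--format=State,ReqMem"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        state, reqmem = out.split("|")[:2]
        reqmem = reqmem.rstrip("cn")  # older sacct appends c/n for per-core/per-node
        mb = int(float(reqmem[:-1]) * (1024 if reqmem.endswith("G") else 1))
        return state, mb

    def resubmit_if_oom(job_id: str, batch_script: str) -> None:
        """If the job hit OUT_OF_MEMORY, resubmit it asking for twice the memory."""
        state, reqmem_mb = job_state(job_id)
        if "OUT_OF_MEMORY" in state:
            subprocess.run(
                ["sbatch", f"--mem={reqmem_mb * 2}M", batch_script],  # placeholder policy
                check=True,
            )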

Smarter queue management and job requeuing. We need better SLURM requeuing logic for failed jobs, especially those that fail due to transient issues. Recovering from these failures currently requires too much manual intervention.
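
A sketch of the direction we have in mind: periodically scan recently finished jobs and requeue the ones whose failure state looks transient. The set of states treated as transient is an assumption that would need tuning, and requeuing only works for jobs submitted with --requeue (or with JobRequeue enabled cluster-wide).

    """Sketch: requeue recently failed jobs whose failure looks transient."""
    import subprocess
    from datetime import datetime, timedelta

    TRANSIENT_STATES = {"NODE_FAIL", "PREEMPTED"}   # assumed transient failure modes

    def recent_job_states(hours_back: int = 2) -> list[tuple[str, str]]:
        """Return (job_id, state) for job allocations started in the look-back window."""
        since = (datetime.now() - timedelta(hours=hours_back)).strftime("%Y-%m-%dT%H:%M:%S")
        out = subprocess.run(
            ["sacct", "-n", "-P", "-X", "-S", since, "--format=JobID,State"],
            capture_output=True, text=True, check=True,
        ).stdout
        pairs = []
        for line in out.splitlines():
            job_id, state = line.split("|")[:2]
            pairs.append((job_id, state.split()[0]))  # drop "CANCELLED by ..." suffix
        return pairs

    def requeue_transient_failures() -> None:
        """Requeue jobs that ended in a state we consider transient."""
        for job_id, state in recent_job_states():
            if state in TRANSIENT_STATES:
                subprocess.run(["scontrol", "requeue", job_id], check=True)

    if __name__ == "__main__":
        requeue_transient_failures()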

Instance type grouping. When mixing instance types in a single queue, SLURM defaults to the lowest common denominator for memory or CPU. We need better job-to-instance matching, ideally grouping jobs by memory requirements to avoid undersized allocations.
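
One way to approximate this without restructuring the queues entirely is a small submission wrapper that routes each job to a partition sized for its memory request. The partition names and memory cut-offs below are hypothetical.

    """Sketch: route a job to a memory-appropriate partition at submission time."""
    import subprocess

    # Hypothetical partitions backed by distinct instance types.
    PARTITIONS = [
        (64_000,  "standard"),   # up to 64 GB  -> general-purpose instances
        (256_000, "himem"),      # up to 256 GB -> memory-optimized instances
        (512_000, "xlmem"),      # up to 512 GB -> largest memory-optimized instances
    ]

    def pick_partition(mem_mb: int) -> str:
        """Smallest partition whose nodes can satisfy the job's memory request."""
        for limit_mb, name in PARTITIONS:
            if mem_mb <= limit_mb:
                return name
        raise ValueError(f"No partition can satisfy {mem_mb} MB")

    def submit(batch_script: str, mem_mb: int) -> None:
        """Submit with an explicit partition instead of relying on a mixed queue."""
        partition = pick_partition(mem_mb)
        subprocess.run(
            ["sbatch", f"--partition={partition}", f"--mem={mem_mb}M", batch_script],
            check=True,
        )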

Storage and I/O Performance

S3 random access performance. Operations requiring random access (e.g., tabix queries on indexed genomic files) are very slow when data lives on S3. Our current workaround is copying data to local VM storage, but this doesn’t scale for very large datasets. We’re exploring S3 Express One Zone or caching layers, though these add cost and complexity.
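
For reference, the current workaround looks roughly like the following: stage the compressed file and its index from S3 to local scratch once, then run random-access queries locally with pysam. Bucket, key, and the scratch path are placeholders.

    """Sketch of the 'copy locally, then random-access' workaround for tabix queries."""
    import os
    import boto3
    import pysam  # htslib bindings

    BUCKET = "example-genomics-bucket"          # placeholder
    KEY = "cohort/chr1.vcf.gz"                  # placeholder
    SCRATCH = "/scratch"                        # placeholder local NVMe path

    def stage_and_query(chrom: str, start: int, end: int) -> list[str]:
        """Download data + index once, then do fast local random access."""
        s3 = boto3.client("s3")
        local_vcf = os.path.join(SCRATCH, os.path.basename(KEY))
        if not os.path.exists(local_vcf):
            s3.download_file(BUCKET, KEY, local_vcf)
            s3.download_file(BUCKET, KEY + ".tbi", local_vcf + ".tbi")
        tbx = pysam.TabixFile(local_vcf)
        try:
            return list(tbx.fetch(chrom, start, end))
        finally:
            tbx.close()

    if __name__ == "__main__":
        records = stage_and_query("chr1", 1_000_000, 1_010_000)
        print(f"{len(records)} records in window")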

Scaling beyond current I/O limits. At around 1,600 concurrent jobs, we start seeing network and S3 overhead competing with I/O bandwidth. For scaling to 3,000+ jobs, we need to plan storage architecture more carefully. (Note: for current I/O bottleneck workarounds, see the cost optimization docs.)

Access and Authentication

VPN-free access via DUO. Currently VPN is required to access the HPC, which is cumbersome—especially for external collaborators and when VPN disconnects during long sessions. We’d like to explore DUO-based authentication as an alternative.

GPU Computing

GPU node memory limitations. Some GPU workflows (e.g., certain deep learning models) require more system RAM than is currently available on our GPU nodes. We’ve encountered cases that need roughly 300GB when only 256GB is available. We need to either provision larger GPU instances or restructure the workflows to reduce their memory footprint.
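
As a starting point for the first option, the EC2 API can enumerate GPU-equipped instance types with enough system RAM. A minimal sketch, assuming boto3 credentials and using the ~300GB figure above as the cut-off:

    """Sketch: list GPU instance types with at least ~300 GB of system RAM."""
    import boto3

    MIN_RAM_GIB = 300

    def large_memory_gpu_types() -> list[tuple[str, int, int]]:
        """Return (instance_type, ram_gib, gpu_count) for qualifying types."""
        ec2 = boto3.client("ec2")
        results = []
        paginator = ec2.get_paginator("describe_instance_types")
        for page in paginator.paginate():
            for itype in page["InstanceTypes"]:
                ram_gib = itype["MemoryInfo"]["SizeInMiB"] // 1024
                gpus = itype.get("GpuInfo", {}).get("Gpus", [])
                if gpus and ram_gib >= MIN_RAM_GIB:
                    gpu_count = sum(g["Count"] for g in gpus)
                    results.append((itype["InstanceType"], ram_gib, gpu_count))
        return sorted(results, key=lambda r: r[1])

    if __name__ == "__main__":
        for name, ram, gpus in large_memory_gpu_types():
            print(f"{name}: {ram} GiB RAM, {gpus} GPU(s)")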