Cloud computing setup: MemVerge + AWS
- Initial setup
- Notes to system admin: setup AWS bucket, MMCloud OpCenter and software installation
- Submit and manage interactive analysis
- Submit and manage batch jobs
- A tutorial on running simulations on the cloud
- Data transfer: AWS, HPC and synapse.org
Comparison to HPC from the computing perspective
The experience, solutions, and suggestions described here are based on our past year (Dec 2023 to Jan 2025) of using AWS for storage and computing, with MemVerge providing software to leverage SPOT instances robustly. This assessment examines computing, storage, and user experience perspectives to evaluate the feasibility of a complete HPC-to-cloud migration in academic settings, with the cloud serving as the main departmental computing infrastructure.
I use “mmcloud” to refer to MemVerge Cloud, which runs MemVerge software to leverage SPOT instances on AWS. While our HPC is free of charge (for now), which makes cost comparisons impossible from where I stand, it is important to note upfront that mmcloud computing can reduce computation costs in CPU hours by 30-50% compared to standard AWS quotes. These estimates are based on my experience over the past year and vary depending on the SPOT instance usage policy settings within mmcloud: essentially, the more flexible we are about waiting for SPOT instance availability, the more we save. Jobs with smaller memory requirements (<100GB) typically achieve greater cost savings. This advantage underlies the assessment that follows.
Immediate advantages of mmcloud over HPC
- CPU computing is consistently better than our current HPC system in terms of both capacity and overall computing speed. Even on basic R5 instances, single-core performance is very decent. With our current quota of 4000 cores across two US-based AWS regions (the current quota for my lab), we have substantially more computing capacity than our department HPC, available on demand. This quota can be further extended through requests to AWS.
- Jobs start quickly with essentially no queue, though mmcloud does take some time to shop for cheaper SPOT instances. Even so, wait times are significantly shorter than on traditional HPC systems.
- Initial resource allocation is less rigid. Unlike HPC, where CPU and memory must be set to the maximum (with jobs failing if they exceed those limits), mmcloud allows specifying minimum requirements with automatic migration to larger resources as needed. We can optionally set maximum limits to prevent unreasonable machine scaling, but the default approach prevents job failures due to resource constraints (see the sketch after this list).
- No walltime limits and reduced competition with other users allow for more flexible job scheduling.
- The mmcloud OpCenter interface is GUI-based, providing more user-friendly job management than the CLI-based tools on HPC. For example, failed jobs can be rerun with a simple mouse click, and WaveWatcher offers detailed execution tracking, including memory, CPU, and disk usage at any given point.
- Hardware configuration is highly customizable through the OpCenter GUI. Unlike HPC with fixed queues, we can create custom queue types on the fly (mimicking bigmem or highend CPU queues) and save these configurations as CLI templates for different tasks.
- mmcloud has a feature specifically for larger-memory jobs: even instances with 512GB of memory or more can be checkpointed successfully, because mmcloud does not rely on the 2-minute warning AWS gives before reclaiming an instance. Checkpoints are taken ahead of that warning, so only a small delta needs to be captured during the 2-minute window. However, I have not used this feature myself, as our jobs typically stay below 100GB.
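To make the floor-and-ceiling resource request mentioned above concrete, below is a minimal sketch of what such a submission could look like when scripted. The flag names and the min:max range syntax are assumptions for illustration, not verified float CLI usage; check the CLI documentation for your OpCenter version before relying on them.

```python
import subprocess

# Hedged sketch of submitting a job with a floor-and-ceiling resource request,
# mirroring the behaviour described above (start small, allow migration to a
# larger instance, optionally cap the maximum). The flag names and the
# "min:max" range syntax are assumptions, not verified float CLI usage.
cmd = [
    "float", "submit",
    "--job", "run_analysis.sh",  # job script to execute on the VM
    "--cpu", "4:16",             # request at least 4 vCPUs, allow up to 16
    "--mem", "18:64",            # request at least 18 GB, cap at 64 GB
]
subprocess.run(cmd, check=True)
```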
Additional advantages due to our customization for academic use
We’ve customized mmcloud to closely mirror HPC usage patterns while maintaining cloud advantages:
- We’ve implemented a centralized software management system on AWS EFS, managed by pixi (https://prefix.dev), that mirrors traditional HPC module systems. Rather than relying entirely on containers, which become costly when running thousands of jobs due to their size, we maintain a centralized software repository on EFS that VM instances running a bare minimal operating system can access, similar to HPC systems. Here, VMs function as compute nodes without locally installed software.
- Our wrapper scripts provide SLURM-style job submission (and even more powerful!) for both large-scale batch computing and interactive sessions. These scripts handle mmcloud resource management complexity (through reasonable defaults and integration with the software infrastructure above) while presenting familiar HPC-style interfaces; a minimal sketch of the idea follows this list.
- We’ve implemented prototype solutions for interactive computing environments including JupyterLab, VSCode, and RStudio to enable interactive data exploration on the cloud.
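For the wrapper-script idea above, here is a hypothetical, stripped-down sketch of how sbatch-like options could be translated into an mmcloud submission that also points jobs at the shared pixi-managed software tree on EFS. Our actual wrappers are more involved; the EFS path, environment variable, and float flags below are assumptions for illustration only.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a SLURM-style wrapper around `float submit`.

Illustrates translating familiar sbatch-like options into an mmcloud
submission and pointing jobs at a shared pixi-managed software tree on EFS.
All paths, flag names, and defaults are assumptions for illustration.
"""
import argparse
import subprocess

# Assumed mount point of the shared EFS software repository on each VM.
EFS_ENV = "/mnt/efs/shared/pixi-env"

def main():
    parser = argparse.ArgumentParser(description="sbatch-like front end for mmcloud")
    parser.add_argument("script", help="job script to run on the cloud VM")
    parser.add_argument("-c", "--cpus", default="2", help="vCPUs (min or min:max)")
    parser.add_argument("--mem", default="8", help="memory in GB (min or min:max)")
    parser.add_argument("--name", default="job", help="job name shown in OpCenter")
    args = parser.parse_args()

    # Build the float submission. Verify flag names against the float CLI
    # before relying on this sketch.
    cmd = [
        "float", "submit",
        "--name", args.name,
        "--job", args.script,
        "--cpu", args.cpus,
        "--mem", args.mem,
        # Hypothetical variable our job scripts could read to locate the
        # shared pixi environment (e.g., to run `pixi shell` from there)
        # instead of shipping a large container.
        "--env", f"SOFTWARE_DIR={EFS_ENV}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    main()
```

Usage would then look roughly like an sbatch call, e.g. `python cloud_sbatch.py -c 4:16 --mem 18:64 my_analysis.sh` (the script name here is hypothetical).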
Remaining Challenges of Computing on the Cloud
- Access to data stored on AWS S3 can be slow for operations requiring random access or frequent queries (e.g., using tabix for genomic data). It is not yet clear to me whether this can be improved; solutions may well exist that I am simply not aware of. A small sketch after this list illustrates why this access pattern is costly on S3.
- VM resources increase in powers of two (2, 4, 8, 16GB, etc.). This leads to inefficient resource allocation — for example, a job requiring 18GB memory must use a 32GB instance. While spot instance optimization might help mitigate costs, this remains a limitation compared to HPC’s granular resource allocation.
- During peak usage periods, spot instances may experience temporary “floating” status while competing with other AWS users nationwide. This resembles HPC queuing, though it can be mitigated by using on-demand instances for time-sensitive jobs at higher cost.
- The ease of running multiple jobs simultaneously through simple commands can lead to unexpected costs if not managed carefully. As PI, I must monitor trainee job submissions and warn about unexpectedly long-running jobs. The AWS research support team has been reasonable about providing refunds as credits for future compute, though building this trust requires ongoing relationship management.
- While Nextflow and other workflow systems exist for coordinating jobs across cloud nodes, we haven’t fully explored these capabilities compared to our mature HPC workflow implementations. Based on experience, significant customization effort will likely be needed despite the advertised Nextflow support on mmcloud.
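To illustrate the S3 random-access point in the first bullet above: every small read from S3 is an independent HTTPS ranged GET with its own round-trip latency, which is essentially what index-driven tools like tabix end up issuing as they seek around a file. A minimal sketch, assuming boto3 and placeholder bucket/key names:

```python
import time
import boto3

# Minimal illustration of why random access on S3 is slow: every small read
# is an independent ranged GET request with its own round-trip latency.
# Bucket and key names are placeholders; supply your own.
s3 = boto3.client("s3")
bucket, key = "my-bucket", "cohort/chr1.vcf.gz"

offsets = [0, 10_000_000, 250_000_000]  # arbitrary byte positions to sample
for start in offsets:
    t0 = time.time()
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{start + 65_535}",  # read a 64 KB chunk
    )
    chunk = resp["Body"].read()
    print(f"offset {start}: {len(chunk)} bytes in {time.time() - t0:.3f}s")
```

Sequential streaming of large objects is fast, but index-driven seeks multiply this per-request cost, which matches the behavior we observe with tabix; on an HPC parallel filesystem the same seeks are nearly free.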