Cloud computing setup: MemVerge + AWS

  1. Initial setup
  2. Submit and manage interactive analysis
  3. Submit and manage batch jobs
  4. Data transfer: AWS, HPC and synapse.org

Suggested organization of files in AWS buckets

We recommend that you create a personal folder in the bucket to store data specific to your own tasks (i.e., data not shared with others). For example, for user gaow on the cumc-gao bucket, the personal folder should be s3://cumc-gao/gaow
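
For example, a first upload with the AWS CLI creates the personal prefix automatically (S3 "folders" are just key prefixes; the file name below is only a placeholder):

    # Copy a file into your personal folder; the gaow/ prefix is created implicitly
    aws s3 cp results.tsv s3://cumc-gao/gaow/results.tsv

    # List the contents of your personal folder
    aws s3 ls s3://cumc-gao/gaow/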

FIXME: rewrite this section based on our discussion on May 17, 2024 at the lab meeting

HPC vs MMCloud, a personal view

Where MMCloud is better

  • Access to faster machines: compared to our HPC, computing speed is slightly to moderately better, even on R5 instances.
  • Overall less waiting in the queue: within the given quota, jobs can start very quickly.
  • Automatic CPU and memory migration: on HPC, CPU and memory must be set to the maximum up front, and any job exceeding that maximum fails for lack of resources. MMCloud is the opposite: we specify the minimum resource usage and the job auto-migrates to more resources as needed, so jobs will not fail due to lack of memory. (A maximum can optionally be set, in which case jobs over the maximum will fail; see the sketch after this list.)
  • No walltime limit, and in general no competition with other HPC users.
  • Easy to rerun failed jobs: clicking a button in OpCenter for a failed job reruns it.
  • Better tracking of job execution: the OpCenter interface is great, and WaveWatcher is particularly helpful.
  • Better control over the hardware: the OpCenter GUI helps configure machine types tailored to a batch of jobs, and the configuration can easily be saved as a CLI command for reuse. On HPC we only have fixed, preconfigured queues; here we can effectively create our own queue types, such as big-memory or high-end CPU queues, on the fly for different submissions.
  • Support team: multiple MemVerge engineers provide support on Slack and are very responsive.
  • No need for a VPN, which access to our HPC requires.
  • Available resources: with a 4,000-core quota in each of 2 regions, MMCloud gives us roughly 2.5x the capacity of our department HPC, and it gets jobs done fast. Although CUIT currently supports only one region, they are looking into supporting others.
  • Storage on AWS S3 has decent I/O performance, should be considered quite safe, and is accessible from anywhere.
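
To illustrate the minimum-resource model mentioned above, here is a sketch of a submission using the float CLI. The container image and job script names are placeholders, and the exact flag syntax may differ across OpCenter versions; check float submit -h on your OpCenter.

    # Submit with minimum resources; MMCloud auto-migrates the job to a
    # larger instance if it needs more CPU or memory.
    float submit -i docker.io/rocker/r-base -j run_analysis.sh \
      --cpu 4 --mem 16

    # Optionally set a ceiling (min:max); a job exceeding 64 GB will fail.
    float submit -i docker.io/rocker/r-base -j run_analysis.sh \
      --cpu 4:16 --mem 16:64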

Where MMCloud is less convenient

  • Because CPU and memory on VM instances scale in powers of two (2, 4, 8, 16, 32, ..., 2^n), migrating to a larger resource always doubles the resources (and the cost). In many cases this is unnecessary: a job that needs only 18 GB of memory must float from a 16 GB instance to a 32 GB one. On HPC we can request any memory size. Since we want this flexibility purely for cost reasons, optimizing spot instance allocation by cost may mitigate this limitation; that capability should already exist, but we have not explored it yet.
  • With spot instances there can be periods during the day when many other users across the country demand EC2, so many jobs end up “floating” temporarily at those rush hours; this is technically similar to being queued during execution (like having low job priority). The solution is to allow on-demand instances when jobs are time sensitive (see the sketch at the end of this list), or simply to wait.
  • Support for customized JupyterLab, VSCode, and RStudio setups is not yet good, but we are working on solutions.
  • Related to the above, everything on MMCloud must be containerized. This is not a big issue for power users but could be for others. My take is that MemVerge currently deals with this by offering services to build images upon customer request. That may work for 85% of use cases, but definitely not 100%, compared to an on-prem setup where our admins use a module system to provide software for average users. This is not a problem for my lab.
  • MMCloud VM instances work independently, in contrast to modern bioinformatics workflow systems (e.g., Nextflow, WDL, and SoS) that can coordinate dependent jobs across multiple nodes on HPC. A Nextflow-based solution exists at MemVerge, but we have not explored it yet.
  • Since it is easy to run many jobs, it is also easy to generate large bills. We still use the local HPC for smaller-scale analyses, to test and make sure everything works before sending jobs to the cloud.
  • Rush hours for spot availability are worst at the start of the business day in the morning and at the end of the business day in the afternoon; submitting in the late evening has been working well so far.
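
For the rush-hour problem above, a minimal sketch of requesting on-demand capacity for a time-sensitive job; the --vmPolicy key name is from our notes and should be verified against float submit -h (image and script are placeholders as before):

    # Time-sensitive job: use on-demand capacity so it is not reclaimed and
    # left "floating" during spot rush hours (this costs more than spot).
    float submit -i docker.io/rocker/r-base -j run_analysis.sh \
      --cpu 4 --mem 16 --vmPolicy [onDemand=true]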