Cloud computing setup: MemVerge + AWS

Setup AWS and MemVerge toolkit

Please refer to Section Appendix I: AWS and MemVerge Software Installation and Set Up if you would like to install the relevant AWS and MemVerge software packages on your computer. If you are a savvy Singularity or Docker user please refer to Section Appendix II: Running AWS and MemVerge Software through container images to set up the tools through containers.

You may be asked to upgrade MemVerge OpCenter from time to time. To do so please refer to Section Appendix III: Upgrade MemVerge OpCenter.

Submit batch computing jobs through MemVerge CLI

Login to MMCloud

Note: there is a firewall issue on our departmental HPC, so we cannot log in to float from the HPC. Please use your laptop or desktop to run the float commands.

This assumes that your admin has already created a MemVerge account for you, with a username and password provided. To log in,

float login -u <username> -a <op_center_ip_address>

Example:

float login -u gaow -a 54.81.85.209

Note: If you see an error during your runs like the token error below, you likely have not logged in to the appropriate OpCenter:

Error: Invalid token

Execute a simple command through pre-configured docker containers

Here is an example running a simple bash script, susie_example.sh, using the pecotmr_docker image available from our online docker image repositories. susie_example.sh has these contents (adapted from the example in ?susieR::susie in an R terminal):

#!/bin/bash
Rscript - << 'EOF'
# susie example
library(susieR)  # load susieR so that susie() below is available
set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
res = susie(X,y,L = 10)
print("Maximum posterior inclusions probability:")
print(max(res$pip))
saveRDS(res, "susie_example_result.rds")
EOF

The command below will submit this bash script to AWS, requesting 2 CPUs and 8 GB of memory.

float submit -i ghcr.io/cumc/pecotmr_docker -j susie_example.sh -c 2 -m 8

The job prints some quantities to the standard output stream, which you can track with the following command,

float log cat stdout.autosave -j <job_id>

where <job_id> is available via float list, which shows all jobs on the current MMCloud OpCenter. You can roughly think of the OpCenter as the “login node” of an HPC that manages job submission. Use float -h to learn more about job management with the float command.
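
Putting these commands together, a typical way to check on a job right after submission looks like this (all commands are described in this document; replace <job_id> with the ID shown by float list):

float list                                 # list jobs on the current OpCenter and note the job ID
float log -j <job_id> ls                   # see which log files exist for this job
float log cat stdout.autosave -j <job_id>  # follow the standard output stream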

Notice that even though this R script writes output to a file called susie_example_result.rds, you will not find that file after the job finishes: by default it is written somewhere inside the virtual machine (VM) instance that was fired up to run the computation, and that VM is deleted after the job is done. To keep the results we need to mount an AWS S3 bucket, where we keep the data, to the VM. This is introduced in the next section.

Submit multiple jobs for “embarrassingly parallel” processing

To submit multiple computing tasks all at once, we assume that:

  1. Each job is one line of bash command
  2. No dependencies between jobs — multiple jobs can be executed in parallel
  3. All these jobs use similar CPU and memory resources

Suppose you have a bioinformatics pipeline script in SoS language, a text file called my_pipeline.sos that reads like this:

[global]
parameter: gene = str
parameter: entrypoint= ""

[default]
output: f"{gene}.txt"
R: entrypoint=entrypoint, expand=True
  write("success", {_output:r})

Make sure my_pipeline.sos exists in your bucket, as that is how the VM will be able to access it.
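
For example, assuming you have already configured the aws CLI (see the data admin notes later in this document), one way to copy the pipeline script to the bucket is:

aws s3 cp my_pipeline.sos s3://<your-bucket-name>/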

Suppose you have 3 jobs to run in parallel, possibly using different docker images, like below. The <location> in the script is where your bucket will be mounted on your instance.

sos run /<location>/my_pipeline.sos --gene APOE4
sos run /<location>/my_pipeline.sos --gene BIN1
sos run /<location>/my_pipeline.sos --gene TREM2

Save these lines to run_my_job.sh and use mm_jobman.sh to submit them — mm_jobman.sh is a utility script that calls the aws and float executables to prepare and submit jobs. The script is available here on GitHub.

To submit,

mm_jobman.sh run_my_job.sh \
  -c 2 \
  -m 8 \
  --mountOpt mode=rw \
  --mount <your-bucket-name>:<directory-to-mount> \
  --download /home/jovyan/TEMP/data:<download-folder-in-bucket-1>/ \
             /home/jovyan/TEMP/resource:<download-folder-in-bucket-2>/ \
  --upload /home/jovyan/TEMP/output:<upload-folder-in-bucket-1>/ \
           /home/jovyan/TEMP/cache:<upload-folder-in-bucket-2>/ \
  --opcenter 54.81.85.209 \
  --image ghcr.io/cumc/pecotmr_docker:latest \
  --entrypoint "source /usr/local/bin/_activate_current_env.sh" \
  --env ENV_NAME=pecotmr \
  --job-size 3 \
  --parallel-commands 1 \
  --imageVolSize 10 \
  --dryrun

Keep in mind that the --dryrun parameter above will not actually run your command, but will print it out instead for debugging purposes.

There are many parameters you can use in your run. Call mm_jobman.sh --help for the comprehensive list of parameters, which is also shown below.

Options:
    -c <min>:<optional max>                   Specify the exact number of CPUs to use, or, with ':', the min and max of CPUs to use. Required.
    -m <min>:<optional max>                   Specify the exact amount of memory to use, or, with ':', the min and max of memory in GB. Required.
    --cwd <value>                             Define the working directory for the job (default: ~).
    --download <remote>:<local>               Download files/folders from S3. Format: <S3 path>:<local path> (optional).
    --upload <local>:<remote>                 Upload folders to S3. Format: <local path>:<S3 path> (optional).
    --download-include '<value>'              Use the include flag to include certain files for download (space-separated) (optional).
    --dryrun                                  Execute a dry run, printing commands without running them.
    --entrypoint '<command>'                  Set the initial command to run in the job script (required).
    --image <value>                           Specify the Docker image to use for the job (required).
    --job-size <value>                        Set the number of commands per job for creating virtual machines (required).
    --mount <bucket>:<local>                  Mount an S3 bucket to a local directory. Format: <bucket>:<local path> (optional).
    --mountOpt <value>                        Specify mount options for the bucket (required if --mount is used).
    --ebs-mount <folder>=<size>               Mount an EBS volume to a local directory. Format: <local path>=<size>. Size in GB (optional).
    --no-fail-fast                            Continue executing subsequent commands even if one fails.
    --opcenter <value>                        Provide the Opcenter address for the job (default: 23.22.157.8).
    --parallel-commands <value>               Set the number of commands to run in parallel (default: min number of CPUs).
    --min-cores-per-command <value>           Specify the minimum number of CPU cores required per command (optional).
    --min-mem-per-command <value>             Specify the minimum amount of memory in GB required per command (optional).
    --help                                    Show this help message.


Some extra points on the parameters:

  • For the --upload option, providing a trailing / for the <local> folder will copy the contents of the folder into the <remote> folder, while omitting the trailing / will copy the entire folder (see the sketch after this list). Remember not to put a trailing / at the end of the <remote> folder of your upload command.
  • Providing just one argument for --mountOpt will have all --mount buckets share the same mount options. Multiple --mountOpt arguments will set mount options for the respective buckets in order. Therefore, if multiple --mountOpt arguments are provided, the script will expect the same number of --mount options.
  • For the --download option, if your <remote> is intended to be a folder, add a trailing / at the end so that it is copied as a folder into your <local> directory. If no trailing / is provided, you will be copying a file and overwriting the file at <local>.
  • Each --download-include corresponds to its respective --download (the first --download-include applies to the first --download, etc.). It is up to the user to make sure the parameters are correct and to keep this matching in mind.
  • Since by default the working directory of the job is ~, you will need to provide the location of my_pipeline.sos relative to that. For example, if I submit the job and mount my bucket (where I have previously uploaded my_pipeline.sos) under <directory-to-mount>, I should expect my job script to reflect that. For example, sos run /<directory-to-mount>/my_pipeline.sos --gene APOE4
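
As a sketch of the --upload trailing-slash rule above (the folder names here are placeholders):

# trailing / on <local>: the contents of output are copied into <upload-folder-in-bucket>
--upload /home/jovyan/TEMP/output/:<upload-folder-in-bucket>
# no trailing / on <local>: the folder itself is copied, ending up as <upload-folder-in-bucket>/output
--upload /home/jovyan/TEMP/output:<upload-folder-in-bucket>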

For additional examples please check them out here on GitHub.

Monitoring jobs on MemVerge and AWS

After submitting jobs to the cloud through MemVerge OpCenters, we need to monitor the status and details of the jobs through the web interface of the OpCenters or using the CLI. So far we have 3 OpCenters for our lab (2 in east1 and 1 in west1):

east1:

  1. 54.81.85.209 (large, old-version OpCenter) - we highly recommend using the new OpCenter instead
  2. 23.22.157.8 (xlarge, new version opcenter)

west1:

  1. 54.176.176.205 (xlarge, new version opcenter)

To view the status of the job you just submitted, open your OpCenter (GUI interface) in a web browser by typing in its IP address, log in with your username and password, and click on Jobs in the left-hand panel. The status of submitted jobs can be: Executing, Completed, Cancelled, FailedToComplete, FailedToExecute, Floating, Initializing, Starting, Submitted, etc. We can also check the status using a CLI query,

float squeue

which should show the job ID. Then check log files generated for a job,

float log -j <jobID> ls #check log file name 
float log -j <jobID> cat <logfile> #check log file contents

Another way to get the logs of a job (e.g. with a script) is from the OpCenter itself. Logs of the job are stored under /mnt/memverge/slurm/work, with two levels of directories that correspond to the first two pairs of characters of the job ID. For example, for job ID “jkbzo4y7c529fiko0jius”, the contents are stored under /mnt/memverge/slurm/work/jk/bz/o4y7c529fiko0jius.
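
For example, a small shell sketch (run on the OpCenter host itself) to reconstruct that path from a job ID, following the two-character/two-character layout described above:

JOB_ID="jkbzo4y7c529fiko0jius"
LOG_DIR="/mnt/memverge/slurm/work/${JOB_ID:0:2}/${JOB_ID:2:2}/${JOB_ID:4}"
ls "$LOG_DIR"   # list the log files for this job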

It is also possible to use those IDs to save the log file contents via

float log download -j JOB_ID

Rerun failed jobs, and “resolve” failed jobs

  1. Cancel all or selected jobs
float scancel --filter "status=*" -f
# or
float scancel -f --filter user=* 

To suspend a job through the CLI, use float suspend -j <job_id>.

  2. Rerun FailedToComplete jobs since a given submission date

Unfortunately, MemVerge does not provide a way to rerun multiple jobs at once given a submission date or any other filter. As of now, the only way to rerun a job is through the GUI (one at a time) or with float rerun -j <JOB_ID>. Consider changing the -c and -m parameters to avoid floating, which is often the reason why these jobs failed to complete.
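
If you have many failed jobs, one rough workaround is to script the rerun yourself. The sketch below is an assumption, not a tested recipe: it assumes float squeue prints one job per line with the job ID in the first column, so verify the column layout on your OpCenter (and adjust the awk field) before using it.

# Hypothetical sketch: rerun every job currently reported as FailedToComplete
for job in $(float squeue | grep FailedToComplete | awk '{print $1}'); do
  float rerun -j "$job"
done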

How to allow for more jobs to run

Both AWS and the MemVerge OpCenter have a default limit on the number of jobs we can submit at a time. However, it is possible to increase the AWS quota by contacting AWS customer service and requesting a larger quota for Spot instances (which MMCloud uses). Currently, we have increased our AWS quota to 4000 CPUs per AWS region, and this can be raised even higher if we request again. Requests are usually approved within 24 hours.

Admins should also change the default maximum number of jobs of an OpCenter using the CLI; see the later section about OpCenter configuration via CLI.
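
One likely relevant setting is scheduler.jobExecutorLimit from the CLI configuration examples later in this document, for example:

float config set scheduler.jobExecutorLimit 900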

AWS space limit issue

When we submit too many jobs, each loading a large EBS volume, jobs submitted later may show:

2024-02-21T02:39:10.861: Failed to create float data volume, error: VolumeLimitExceeded: You have exceeded your maximum gp3 storage limit of 50 TiB in this region. Please contact AWS Support to request an Elastic Block Store service limit increase.

Although it is possible to ask AWS customer service to increase this limit, since EBS volumes should hold just temporary files, 50 TB is a decent limit. If your jobs use more than 50 TB of EBS volume on the fly to support the computation, it is advised to examine your jobs and decide whether this is truly necessary (likely not!).

Server internal error

You may get the error Error: server internal error. please contact the service owner. detail: json: cannot unmarshal string into Go struct field jobRest.Jobs.actualEmission of type float64.

The likely reason for this is that your local binary version is older than the OpCenter version. You can follow the lab wiki instructions to upgrade or re-install the float binary.

Interacting with data on AWS buckets

Check files generated:

aws s3 ls --human-readable s3://$your-bucket-name/$result_folder/ | head

Copy specific types of files (try not to do this; look into results directly on AWS instead), e.g. stderr files as below:

aws s3 cp s3://$your-bucket-name/$result_folder/ ./ --recursive --exclude "*" --include "*.stderr"

for example the above includes only stderr files.

Explore results on AWS directly

It is possible to use JupyterLab to directly explore the data generated on AWS. There is a utility script, jupyter_setup.sh, to launch a JupyterLab server, available here on GitHub.

  1. Run bash jupyter_setup.sh.
  2. Enter username and password following the prompts.
  3. Wait for about 10 minutes to get a URL, then copy it into your browser to launch an interactive Jupyter notebook.

You can track the job being initialized and executed on the OpCenter it was submitted to. When the status changes to Executing you should see the URL.

Data transfer from HPC to Synapse

The assumption is that the user has access to Synapse. Synapse provides APIs to store or access your data from the web or programmatically via one of their supported analytical clients (command line, Python, or R). We recommend transferring data programmatically.

To make analysis outputs and results accessible to collaborators, we may upload some of them to Synapse.

On your HPC, log in with your Synapse credentials. Follow the instructions on the Synapse website to create an auth token and prepare a .synapseConfig file in your home folder.

###########################
# Login Credentials       #
###########################

## Used for logging in to Synapse. See https://python-docs.synapse.org/tutorials/authentication/
## for information on retrieving an auth token.
[authentication]
username = xxx
authtoken = xxx

The authtoken is not the password you use; it should be created following the instructions above.

Upload data

synapse add --parentid synxxxxxxxx <file>
  • --parentid is the Synapse ID of the folder that will store the file

For example, to upload all .rds files in the current directory:

for i in *.rds; do synapse add --parentId syn54079330 $i; done

Or in parallel

cat sample_data | xargs -I {} -P 10 synapse add --parentid synxxxxxxxx {}
  • sample_data is a text file listing all the files to be uploaded, one per line
  • xargs turns each line into an argument of the synapse add command
  • -I {} sets the placeholder that is replaced by each file name
  • -P sets the number of parallel uploads

Download data to HPC

synapse login
synapse get -r SynID

Data transfer from AWS to Synapse

Currently, when attempting to transfer data from an S3 bucket to Synapse, there are different methods available:

First, download the data to a local machine, then upload it to Synapse using the Command Line Interface (CLI). Alternatively, launch a Jupyter notebook via the OpCenter, employing either a bash notebook or a terminal kernel for the upload process. However, this approach has its drawbacks: the terminal kernel tends to be slow and prone to freezing, and using a bash notebook for large tasks (such as uploading over 30,000 files) could lead to crashes due to interactive messages during the upload process.
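
For the first method, a minimal sketch using commands that appear elsewhere in this document (the bucket, folder, and Synapse IDs are placeholders):

# Stage the results locally, then push them to Synapse one file at a time
aws s3 cp s3://$your-bucket-name/$result_folder/ ./synapse_staging --recursive
cd synapse_staging
for f in *; do synapse add --parentid synxxxxxxxx "$f"; done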

There’s also the possibility of directly linking the S3 bucket with Synapse by granting it ‘ListObject’ permissions, as outlined in the Synapse documentation.

FIXME: working on a more efficient way of uploading data from an AWS S3 bucket to Synapse

Notes for Data Admin

This section is only relevant to data admins in charge of transferring data between the local computing environment and AWS buckets. If you are a user analyzing data that already exists on an AWS bucket you can skip this.

First time: configure your pre-existing account on project AWS bucket

This assumes that your admin has already created a user account for you, with an Access Key ID and Secret Access Key for your access to the project AWS bucket. To configure your account from the command terminal,

aws configure

You will be prompted to provide the following information:

AWS Access Key ID [None]:
AWS Secret Access Key [None]: 
Default region name [None]: us-east-1
Default output format [None]: yaml

The first two pieces of info should be available from your admin.

Upload data to pre-existing AWS bucket

This assumes that your admin has already created a storage bucket on AWS, and that you can access it. In this documentation the pre-existing bucket is called cumc-gao. You should have your username from your admin.

To copy a file, e.g. $DATA_FILE, to the bucket,

aws s3 cp $DATA_FILE s3://$S3_BUCKET/ 

Example:

touch test_wiki.txt
aws s3 cp test_wiki.txt s3://cumc-gao/

To copy a folder, e.g. $DATA_DIR, to the bucket (note the --recursive flag for folders),

aws s3 cp $DATA_DIR s3://$S3_BUCKET/ --recursive

Example:

mkdir test_wiki
mv test_wiki.txt test_wiki
aws s3 cp test_wiki s3://cumc-gao/ --recursive
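
If you re-upload the same folder repeatedly, aws s3 sync, which only transfers new or changed files, may be a convenient alternative:

aws s3 sync test_wiki s3://cumc-gao/test_wiki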

Once you have completed uploading files to the AWS bucket, you are ready to run your analysis through MemVerge.

Download data from pre-existing AWS bucket

After your analysis is done, it is possible to retrieve the results stored on S3 to your local machine, simply by reversing the command for Upload discussed in the previous section. For example,

aws s3 cp s3://$S3_BUCKET/$DATA_DIR/output output --recursive

Data transfer within AWS bucket

To copy a data directory between different buckets,

aws s3 cp s3://source-bucket-name/source-folder s3://destination-bucket-name/destination-folder --recursive

To copy a data directory within the same bucket,

aws s3 cp s3://my-bucket/source-folder s3://my-bucket/destination-folder --recursive

Remove data from AWS buckets

Warning: think twice before you run it!

FIXME: on our end we should set it up such that a user can only remove files they created, not from other people

aws s3 rm s3://$S3_BUCKET/$DATA_DIR/cache --recursive

Suggested organization of files in AWS buckets

We recommend that you create a personal folder on the bucket to save data specific to your own tasks (that are not shared with others). For example, for user gaow on the cumc-gao bucket, the personal folder should be s3://cumc-gao/gaow.

FIXME: provide a command to show users how to create a personal folder
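
In the meantime, one possible approach is sketched below; note that S3 has no true folders, so a prefix simply appears once an object exists under it:

# Create an empty object whose key ends in "/" so the prefix shows up as a folder in the console
aws s3api put-object --bucket cumc-gao --key gaow/
# Or simply copy any file under the prefix; the "folder" appears automatically
aws s3 cp test_wiki.txt s3://cumc-gao/gaow/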

Interacting with your data

These instructions provide a step-by-step guide to access AWS S3 through the web browser in a more organized and clear manner.

  • Open your web browser and navigate to the AWS Management Console.

  • Sign in to the AWS Management Console using your AWS account credentials:
    • Enter your 12-digit AWS Account ID (you can find this in your account details).
    • Enter your IAM user name (you can find this in your account details).
    • Enter your password (If you don’t have a password, please contact your aws administrator to create one for you).
  • In the AWS Management Console, locate the search bar and type “S3,” then select “S3” from the search results.

  • In the Amazon S3 dashboard, click on the S3 bucket named “statfungen” to access your data.

  • You can now easily interact with your datasets within the selected S3 bucket.

Notes for System Admin

This section is only relevant to system admins. If you are a user you can skip this

Setting Up Your IAM User and Account

This is a one-time job for the system admin, done through the GUI.

FIXME: the approach below will give everyone in the group full access to the whole bucket, so everyone can read and edit others’ files; that is convenient but also dangerous. Need to manage it better as a next step

  • Log into AWS Console:
    • Navigate to AWS Console.
    • Sign up for a root AWS account if you’re new, else log in.
  • Search for IAM:
    • After logging in, search for “IAM” using the top search bar.
  • Create Group
    • Click “User groups” on the left.
    • Attach the “AmazonS3FullAccess” policy to this group.
    • Add Users to this group.
  • Add user and set up access key
    • GUI, or maybe for the root user (first time setting up the access key)
      • Add User:
        • Click “Users” on the left and then click “Create user” on the right.
        • Click “Next” following instructions.
      • Manage Access Keys:
        • Find “Security recommendations” on the IAM dashboard.
        • Click “Manage access keys”.
      • Create an Access Key:
        • Go to the “Access keys” section.
        • Select “Create access key”.
      • Retrieve Your Access Key and Secret Access Key:
        • A dialogue will show your Access Key ID and Secret Access Key.
        • Check the box, then click “Next”.
        • Download a copy of these keys for safekeeping.
    • CLI (change to root access)
       aws iam create-user --user-name YourUserName
       aws iam add-user-to-group --user-name YourUserName --group-name Gao-lab
       aws iam create-access-key --user-name YourUserName
       # create password 
       aws iam create-login-profile --user-name YourUserName --password NEW_PASSWORD
      

      Copy these keys for safekeeping.

  • Configure AWS CLI:
    • Run the following in your terminal:
      aws configure
      
    • Provide:
      • Your Access Key ID and Secret Access Key.
      • Region: us-east-1.
      • Output format (e.g., yaml).

Create project S3 Bucket

To create an S3 bucket, ensure your $S3_BUCKET name is globally unique and in lowercase. For example:

aws s3api create-bucket --bucket $S3_BUCKET --region $AWS_REGION

Example:

aws s3api create-bucket --bucket cumc-gao --region us-east-1
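
Note that for regions other than us-east-1, aws s3api create-bucket requires an explicit location constraint, for example:

aws s3api create-bucket --bucket $S3_BUCKET --region us-west-1 \
  --create-bucket-configuration LocationConstraint=us-west-1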

MemVerge account management

First, login as admin,

float login -u <admin username> -p <admin passwd> -a <op_center_ip_address>

Then create a new user for example tom,

float user add tom

Setup MemVerge OpCenter for project

FIXME: add how to set up an OpCenter here; step 1 in this GUI tutorial is nice, but it would be better to have CLI instructions

FIXME: [Gao] I don’t understand the concept and relevance of OpCenter – so when you describe how to set it up here, please also give a bit of background and motivation so I (and other future admin) understand it

Appendix I: AWS and MemVerge Software Installation and Set Up

You can safely skip this section if you use the docker/singularity image provided by MemVerge, as detailed in the first section of this document. Here we keep these contents as an Appendix for book-keeping purposes.

AWS CLI tools

(Linux/Windows/Mac) https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

  • If you are a Mac user, you can use the commands below to install the AWS CLI tools.
       curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
       sudo installer -pkg AWSCLIV2.pkg -target /
    
  • If you are an HPC/Linux user, you can use the commands below to install the AWS CLI tools (also add export PATH=$PATH:/home/<UNI>/.local/bin to your ~/.bashrc and don’t forget to source it).
       curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
       unzip awscliv2.zip
       ./aws/install -i /home/<UNI>/.local/bin/aws-cli -b /home/<UNI>/.local/bin
    
  • If you are a Windows user, you can open cmd as administrator and use the command below to install the AWS CLI tools.
       msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi
    

Check if it was installed successfully with

   which aws
   aws --version

MemVerge float for job submission

Download Float from the Operation Center

  • For Linux users
       wget https://<op_center_ip_address>/float --no-check-certificate
       # Example using an IP address:
       wget https://23.22.157.8/float --no-check-certificate
    
  • For Mac Intel chip users
       wget https://<op_center_ip_address>/float.darwin_amd64 --no-check-certificate
       # Example using an IP address:
       wget https://23.22.157.8/float.darwin_amd64 --no-check-certificate
    
  • For Mac Apple Silicon M chip users
       wget https://<op_center_ip_address>/float.darwin_arm64 --no-check-certificate
       # Example using an IP address:
       wget https://23.22.157.8/float.darwin_arm64 --no-check-certificate
    
  • For Windows users, the float binaries above are not compatible, so you need to access https://23.22.157.8 and manually download the version of the tool specifically for Windows.

Or you can open the MMCloud OpCenter in a browser and download it through the GUI.

Move and Make It Executable

  • For Mac and Linux users with sudo access, replace <float_binary> below with what you just downloaded,
       sudo mv <float_binary> /usr/local/bin/float
       sudo chmod +x /usr/local/bin/float
    
  • For users without sudo, you can either use alias float=/path/to/float_binary/float or add export PATH=$PATH:<PATH> to your ~/.bashrc, where <PATH> is the path to the float executable. Don’t forget to source it afterwards. Then,
      chmod +x <PATH>/float 
    
  • For Windows users: Files located in C:\Windows\System32 are automatically included in the system’s PATH environment variable on Windows. This means that any executable file in this directory can be run from any location in the Command Prompt without specifying the full path to the executable. The System32 directory is a crucial part of the Windows operating system, containing many of its core files and utilities. So, if float.exe is in this directory, you can run it from anywhere in the Command Prompt by just typing float.

Addressing Mac Security Settings

Optional: For Mac Users

If you are using a Mac, float might be blocked due to your security settings. Follow these steps to address it:

  • Open ‘System Preferences’.
  • Navigate to ‘Privacy & Security’.
  • Under the ‘Security’ tab, you’ll see a message about Float being blocked. Click on ‘Allow Anyway’.

Rename Float to Avoid Name Conflicts

Optional for Mac users, but this would change the float command name in the downstream instructions and future work.

If there’s an existing application or command named float, rename the downloaded float into mmfloat to avoid conflicts:

mv /usr/local/bin/float /usr/local/bin/mmfloat

Appendix II: Running AWS and MemVerge Software through container images

We recommend using singularity to run MemVerge and AWS tools. On conventional Linux-based HPC, singularity should be available. If you want to manage your computing from a local computer, here is a setup guide for Mac users.

If you prefer not to use singularity, or if the commands below do not work for you, please find the installation instructions for aws and float at the end of this document.

First, download the singularity image that contains MemVerge and AWS related tools:

singularity pull mmc_utils.sif docker://ashleytung148/mmc-utils:latest

When an update for this image is available, you can simply delete it and pull again:

rm -f mmc_utils.sif
singularity pull mmc_utils.sif docker://ashleytung148/mmc-utils:latest

Here are some command line aliases you may want to set in your bash configuration, as shortcuts to use the tools from mmc_utils.sif:

FIXME: test and improve this. for example Mac users may have different alias depending on how they installed singularity

alias aws="singularity exec /path/to/mmc_utils.sif aws"
alias float="singularity exec /path/to/mmc_utils.sif float"
alias mm_jobman.sh="singularity exec /path/to/mmc_utils.sif mm_jobman.sh"

where /path/to/mmc_utils.sif is where you save the mmc_utils.sif file.

If you are unable to run the float CLI through this method, you may install it from the https://<op_center_ip_address> site. You should be able to see options for Windows, Linux, and Mac installations at the bottom. Please contact your admin if you do not know your OpCenter IP address.

Appendix III: Upgrade MemVerge OpCenter

  • float release ls (check the versions that are available)
  • float release upgrade (upgrade to latest)
  • wait for 1-2 mins
  • float login (to login again)
  • float release sync (upgrade the local float binary. You can skip this if you use the latest containers provided by MemVerge, see Appendix II. You may get a permission denied error; if so, please use sudo)
  • float release migrate --dbPath /mnt/memverge/data/opcenter (this is a one time upgrade of the backend DB)
  • done!

Configure OpCenter

We can configure the OpCenter in two ways:

  1. Using the GUI interface: admins can change the settings of the OpCenter, for example to expand the instance types and allow more retries on spot instances before jobs fail.

Configure using GUI

  2. Using CLI configuration commands (recommended). Here are some examples:

float config set cloud.createVMPolicy spotOnly
float config set cloud.createVMRetryInterval 5m0s
float config set cloud.createVMRetryLimit "-1"
float config set migrate.cpuLowerBoundDuration 10m0s
float config set migrate.cpuLowerBoundRatio 1
float config set migrate.cpuLowerLimit 8
float config set migrate.cpuMigrateStep 25
float config set migrate.cpuUpperBoundDuration 10m0s
float config set migrate.cpuUpperBoundRatio 99
float config set migrate.memDisable false
float config set migrate.memLowerBoundDuration 10m0s
float config set migrate.memLowerBoundRatio 1
float config set migrate.memLowerLimit 8
float config set migrate.memMigrateStep 25
float config set migrate.memUpperBoundDuration 10m0s
float config set provider.allowList "r5*,r6*,r7*"
float config set scheduler.jobExecutorLimit 900
float config set cloud.handleRebalanceMemThreshold 128G

Some of the lines may produce an error if you run them on a Mac without quoting, for example float config set cloud.createVMRetryLimit -1 (Error: unknown shorthand flag: ‘1’ in -1) and float config set provider.allowList r5*,r6* (zsh: no matches found: r5*,r6*).

Enclosing the value in quotes works, e.g. float config set provider.allowList "r5*,r6*". Make sure that each line has been configured without any error.

Clean up OpCenter space (recommended to contact a MemVerge person first)

Sometimes we may need to free up space by deleting older builds on the root volume. The admin of the OpCenter instance should have the SSH key to be able to ssh into the OpCenter and clean it up.

  1. Log in as the ec2-user account created for the admin, using ssh -i with the .pem key.
  2. Check the size of the log files with du -lh -d 1 and how much space is used on the system with df -h.
  3. Run sudo podman image prune -a -f to clean up some space.

Known issues

This section documents some of the known problems / possible improvements we can make down the road.

Better streamlined workflows

Currently we support only embarrassingly parallel jobs running the same docker image. This is done via the mm_jobman.sh shell script, which is a simple wrapper around float for job submission. Down the road we plan to use nextflow to manage other commands, including those written in SoS. In the context of the FunGen-xQTL protocol, for example, the mini-protocols can be implemented using nextflow whereas the modules can be implemented in any language, including SoS.

Current status summary: comparison with HPC

Goal

Gao’s (and possibly my lab’s) subjective view on mmcloud vs our department HPC to identify what we like about mmcloud compared to HPC, and what we believe should be improved. Computing cost is not taken into consideration here.

Where mmcloud is better

  • Access to faster machines: compared to our HPC, the computing speed is slightly to moderately better even on R5 instances.
  • Overall less wait in the queue: under given quota the jobs can start very quickly.
  • Automatic memory and CPU migration: on HPC the CPU and memory must be set to the maximum, and anything beyond that maximum will fail to complete due to lack of resources. On mmcloud it is the opposite: we specify the minimum resource usage and it auto-migrates to more resources as needed, so jobs will not fail due to lack of memory (although we can optionally set a maximum, in which case jobs exceeding it will fail).
  • No walltime limit. And in general, no competition with other HPC users
  • Easy to rerun failed jobs: clicking a button on the OpCenter for a failed job will rerun it.
  • Better tracking of job execution: the OpCenter interface is great. WaveWatcher is particularly helpful.
  • Better control over the hardware: the OpCenter GUI can help configure the machine types tailored to a batch of jobs, and the configuration can easily be saved in the CLI for reuse. On HPC we only have fixed, preconfigured queues. Here we can pretty much make our own queue types on the fly for different submissions, such as bigmem or high-end CPU queues.
  • Support team: multiple engineers provide support on Slack and are very responsive.
  • No need to use a VPN (which is required to access our HPC).
  • Available resources: currently, with a 4000-core quota in each of 2 regions, mmcloud is about 2.5x the size of our department HPC. It gets jobs done fast. Although CUIT currently only supports 1 region, they are looking into supporting others.
  • Storage on AWS S3 has decent I/O performance, should be considered quite safe, and is accessible from everywhere.

Where mmcloud is less convenient

  • Because CPU and memory on VMs increase by powers of two (2, 4, 8, 16, 32, …, 2^n), migrating to a larger instance always doubles the resources (and the cost). In many cases this is unnecessary: for example, a job that only needs 18GB of memory would have to float from 16GB to 32GB. On HPC we can specify any memory size. Since we desire this flexibility purely for cost considerations, optimizing spot instance allocation by cost may help mitigate the limitation; this feature should already exist but we have not explored it yet.
  • For spot instances there can be certain periods during the day when many other users across the country require EC2, so lots of jobs end up “floating” temporarily at those rush hours – this is technically similar to being queued during execution (like “low job priority”). The solution is to also allow on-demand instances if the jobs are time sensitive, or just wait.
  • Currently there is not yet good support for customized JupyterLab, VSCode, and RStudio solutions, but we are working on these.
  • Related to the above, everything on mmcloud must be containerized. This is not a big issue for power users but could be an issue for others. My take is that MemVerge currently deals with this by offering their manpower to build images upon customer request. This may work for 85% of the use cases but definitely not 100%, compared to an on-prem setup where our admins use a module system to provide software for average users. This is not a problem for my lab.
  • mmcloud VM instances work independently, as opposed to modern bioinformatics workflow systems (e.g. nextflow, WDL, and SoS) which can coordinate dependent jobs across multiple nodes on HPC. A solution to this using nextflow exists at MemVerge, but we have not explored it yet.
  • Since it is easy to run many jobs, it is also easy to generate large bills. We still use the local HPC for smaller-scale analyses, to test and make sure everything works before sending jobs to the cloud.
  • Spot availability fluctuates over the day: the start of the business day in the AM and the end of the business day in the PM are the worst times, when many jobs end up “floating” temporarily. Submitting in late evenings seems to be working well so far.