Cloud computing setup: MemVerge + AWS

Set up AWS and MemVerge toolkit

Please refer to Appendix I: AWS and MemVerge Software Installation and Set Up if you would like to install the relevant AWS and MemVerge software packages on your computer. If you are a savvy Singularity or Docker user, please refer to Appendix II: Running AWS and MemVerge Software through container images to set up the tools through containers.

You may be asked to upgrade the MemVerge OpCenter from time to time. To do so, please refer to Appendix III: Upgrade MemVerge OpCenter.

Submit computing jobs through MemVerge CLI

First time user: configure your pre-existing MemVerge account

Note: a firewall issue on our departmental HPC prevents logging in to the OpCenter with float from the HPC. Please run the float commands from your laptop or desktop instead.

This assumes that your admin has already created a MemVerge account for you and provided a username and password. To log in,

float login -u <username> -a <op_center_ip_address>

Example:

float login -u gaow -a 54.81.85.209
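
Once logged in, you can verify that the connection works by listing your jobs (see Monitor jobs below); an empty list simply means nothing has been submitted yet:

float squeue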

Execute a simple command through pre-configured docker containers

Here is an example running a simple bash script susie_example.sh using the pecotmr_docker image available from our online docker image repository. susie_example.sh has the following contents (adapted from the example shown by ?susieR::susie in an R terminal):

#!/bin/bash
# susie_example.sh

Rscript - << 'EOF'
# susie example
library(susieR)  # provides susie()
set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
res = susie(X,y,L = 10)
saveRDS(res, "susie_example_result.rds")
EOF
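
If R and the susieR package are installed on your own machine, you can optionally sanity-check the script locally before submitting it:

chmod +x susie_example.sh
./susie_example.sh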

The command below will submit this bash script to AWS, requesting 2 CPUs, 8 GB of memory, and 10 GB of EBS storage mounted on /home/jovyan/AWS/data via the --dataVolume parameter.

float submit -i ghcr.io/cumc/pecotmr_docker -j susie_example.sh -c 2 -m 8 --dataVolume "[size=10]:/home/jovyan/AWS/data"

Submit multiple jobs for “embarrassingly parallel” processing

To submit multiple jobs all at once, we assume that:

  1. Each job is one line of bash command
  2. The jobs can be executed in parallel
  3. All jobs use the same CPU and memory resources

Suppose you have a bioinformatics pipeline script written in the SoS language, a text file called my_pipeline.sos that reads like this:

[global]
parameter: gene = str
parameter: entrypoint = ""

[default]
output: f"{gene}.txt"
R: entrypoint=entrypoint, expand=True
  write("success", {_output:r})

Make sure my_pipeline.sos exists in your bucket, as that is how the VM will be able to access it.
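
If the script is not in the bucket yet, you can upload it with the AWS CLI (see the Notes for Data Admin section below for configuring aws; <your-bucket-name> is a placeholder for your bucket):

aws s3 cp my_pipeline.sos s3://<your-bucket-name>/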

Suppose you have 3 jobs to run in parallel, possibly using different docker images, as below. The <location> in the script is where your bucket will be mounted on the instance.

sos run /<location>/my_pipeline.sos --gene APOE4
sos run /<location>/my_pipeline.sos --gene BIN1
sos run /<location>/my_pipeline.sos --gene TREM2

Save these lines to run_my_job.sh and use mm_jobman.sh to submit them. mm_jobman.sh is a utility script included in the default executable PATH of the mmc_utils.sif image; if you don’t use the mmc_utils.sif image, you can obtain the script here on GitHub.

mm_jobman.sh run_my_job.sh \
  -c 2 \
  -m 8 \
  --mountOpt mode=rw \
  --mount <your-bucket-name>:<directory-to-mount> \
  --download <download-folder-in-bucket-1>/:/home/jovyan/TEMP/data \
             <download-folder-in-bucket-2>/:/home/jovyan/TEMP/resource \
  --upload /home/jovyan/TEMP/output:<upload-folder-in-bucket-1> \
           /home/jovyan/TEMP/cache:<upload-folder-in-bucket-2> \
  --opcenter 54.81.85.209 \
  --image ghcr.io/cumc/pecotmr_docker:latest \
  --entrypoint "source /usr/local/bin/_activate_current_env.sh" \
  --env ENV_NAME=pecotmr \
  --job-size 3 \
  --parallel-commands 1 \
  --imageVolSize 10 \
  --dryrun

Keep in mind that the --dryrun parameter above will not actually run your command; it prints the generated command for debugging instead. Remove --dryrun when you are ready to submit for real.

There are many parameters you can use in your run. Call mm_jobman.sh --help for the comprehensive list of parameters, which is also shown below.

Options:
  -c <value>                   Specify the number of CPUs to use (default and recommended for AWS Spot Instances: 2).
  -m <value>                   Set the amount of memory in GB (default: 16).
  --cwd <value>                Define the working directory for the job (default: ~).
  --download <remote>:<local>  Download files/folders from S3. Format: <S3 path>:<local path> (optional).
  --upload <local>:<remote>    Upload folders to S3. Format: <local path>:<S3 path> (optional).
  --recursive <true>           Use the recursive flag for downloading (optional).
  --download-include '<value>' Use the include flag to include certain files for download (space-separated) (optional).
  --dryrun                     Execute a dry run, printing commands without running them.
  --entrypoint '<command>'     Set the initial command to run in the job script (required).
  --env <key>=<val>            Set environmental variables for the job in the format KEY=VALUE (optional).
  --image <value>              Specify the Docker image to use for the job (required).
  --imageVolSize <value>       Define the size of the image volume in GB (depends on the size of input image).
  --job-size <value>           Set the number of commands per job for creating virtual machines (default: 2).
  --mount <bucket>:<local>     Mount an S3 bucket to a local directory. Format: <bucket>:<local path> (optional).
  --mountOpt <value>           Specify mount options for the bucket (required).
  --volMount <size>:<folder>   Mount a volume under a directory. Size in GB (optional).
  --no-fail-fast               Continue executing subsequent commands even if one fails.
  --opcenter <value>           Provide the Opcenter address for the job (required).
  --parallel-commands <value>  Set the number of commands to run in parallel (default: number of CPUs).
  --help                       Show this help message.


Some extra points on the parameters:

  • For the --upload option, providing a trailing / for the <local> folder will copy the contents of the folder into the <remote> folder. Not having a trailing / will copy the entire folder. Remember not to put a trailing / at the end of the <remote> folder of your upload command (see the sketch after this list).
  • Providing just one argument for --mountOpt will have all --mount buckets share the same mount options. Multiple --mountOpt arguments will set mount options for the respective buckets in order. Therefore, if multiple --mountOpt arguments are provided, the script will expect the same number of --mount options.
  • For the --download option, if your <remote> is intended to be a folder, add a trailing / at the end. This makes the folder get copied as a folder into your <local> directory. If no trailing / is provided, you will be copying a single file and overwriting the file at <local>.
  • Each --download-include corresponds to its respective --download (the first --download-include applies to the first --download, and so on). It is up to the user to make sure the parameters match up correctly. The same goes for --recursive.
  • Since by default the working directory of the job is ~, you will need to provide the location of my_pipeline.sos relative to that. For example, if I submit the job and mount my bucket (where I have previously uploaded my_pipeline.sos) under <directory-to-mount>, my job script should reflect that: sos run /<directory-to-mount>/my_pipeline.sos --gene APOE4
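
To make the trailing-slash rules above concrete, here is a hedged sketch of the four cases; all bucket paths and local folders below are hypothetical placeholders:

--upload /home/jovyan/TEMP/output/:results              # contents of output/ land inside results
--upload /home/jovyan/TEMP/output:results               # the output folder itself is copied, giving results/output
--download data/:/home/jovyan/TEMP/data                 # the data folder is copied as a folder under the local path
--download data/input.txt:/home/jovyan/TEMP/input.txt   # a single file, overwriting the local file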

For additional examples please check them out here on GitHub.

Monitor jobs

After the job is submitted, you can check its status using:

float squeue

which should show the job ID. To check the log files generated for a job,

float log -j <jobID> ls              # list log file names
float log -j <jobID> cat <logfile>   # show log file contents
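
For a long-running job, a simple shell loop using only the float squeue command shown above will poll the status every minute (press Ctrl-C to stop):

while true; do float squeue; sleep 60; done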

Notes for Data Admin

This section is only relevant to data admins in charge of transferring data between the local computing environment and AWS buckets. If you are a user analyzing data that already exists in an AWS bucket, you can skip this.

First time: configure your pre-existing account on project AWS bucket

This assumes that your admin has already created a user account on the project bucket for you, with an Access Key ID and Secret Access Key granting you access to the project AWS bucket. To configure your account from the command terminal,

aws configure

You will be prompted to provide this information:

AWS Access Key ID [None]:
AWS Secret Access Key [None]: 
Default region name [None]: us-east-1
Default output format [None]: yaml

The first two pieces of information should be available from your admin.
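
Once configured, you can confirm that your credentials work with the standard AWS CLI identity check, which prints the account and user associated with your keys:

aws sts get-caller-identity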

Upload data to pre-existing AWS bucket

This assumes that your admin has already created a storage bucket on AWS, and that you can access it. In this documentation the pre-existing bucket is called cumc-gao. You should have received your username from your admin.

To copy a file, e.g. $DATA_FILE, to the bucket,

aws s3 cp $DATA_FILE s3://$S3_BUCKET/ 

Example:

touch test_wiki.txt
aws s3 cp test_wiki.txt s3://cumc-gao/

To copy a folder $DATA_DIR to the bucket, add the --recursive flag (aws s3 cp will not copy a directory without it). Note that this copies the folder’s contents into the destination, so append the folder name to the destination path if you want to keep the folder itself:

aws s3 cp $DATA_DIR s3://$S3_BUCKET/ --recursive

Example:

mkdir test_wiki
mv test_wiki.txt test_wiki
aws s3 cp test_wiki s3://cumc-gao/ --recursive
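
To verify the upload, you can list the bucket contents; --recursive also descends into folders:

aws s3 ls s3://cumc-gao/ --recursive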

Once you have finished uploading files to the AWS bucket, you are ready to run your analysis through MemVerge.

Download data from pre-existing AWS bucket

After your analysis is done, you can retrieve the results stored on S3 to your local machine simply by reversing the Upload command discussed in the previous section. For example,

aws s3 cp s3://$S3_BUCKET/$DATA_DIR/output output --recursive
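
Alternatively, aws s3 sync transfers only files that differ between the bucket and your local copy, which is convenient when retrieving results repeatedly:

aws s3 sync s3://$S3_BUCKET/$DATA_DIR/output output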

Remove data from AWS buckets

Warning: think twice before you run it!

FIXME: on our end we should set it up such that a user can only remove files they created, not files created by other people

aws s3 rm s3://$S3_BUCKET/$DATA_DIR/cache --recursive

Suggested organization of files in AWS buckets

We recommend that you create a personal folder on the bucket to save data specific to your own tasks (that are not shared with others). For example, for user gaow on the cumc-gao bucket, the personal folder should be s3://cumc-gao/gaow

FIXME: provide a command to show users how to create a personal folder
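
In the meantime, a possible approach: S3 has no true folders, but creating a zero-byte object whose key ends in / makes the folder appear in the S3 console. For example, user gaow could run:

aws s3api put-object --bucket cumc-gao --key gaow/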

Interacting with your data

These instructions provide a clear, step-by-step guide to accessing AWS S3 through the web browser.

  • Open your web browser and navigate to the AWS Management Console.

  • Sign in to the AWS Management Console using your AWS account credentials:
    • Enter your 12-digit AWS Account ID (you can find this in your account details).
    • Enter your IAM user name (you can find this in your account details).
    • Enter your password (if you don’t have a password, please contact your AWS administrator to create one for you).
  • In the AWS Management Console, locate the search bar and type “S3,” then select “S3” from the search results.

  • In the Amazon S3 dashboard, click on the S3 bucket named “statfungen” to access your data.

  • You can now easily interact with your datasets within the selected S3 bucket.

Notes for System Admin

This section is only relevant to system admins. If you are a user, you can skip it.

Setting Up Your IAM User and Account

This is a one-time job for the system admin, done through the GUI.

FIXME: the approach below gives everyone in the group full access to the whole bucket, so everyone can read and edit others’ files. That is convenient but also dangerous; we need to manage it better as a next step

  • Log into AWS Console:
    • Navigate to AWS Console.
    • Sign up for a root AWS account if you’re new, else log in.
  • Search for IAM:
    • After logging in, search for “IAM” using the top search bar.
  • Create Group:
    • Click “User groups” on the left.
    • Attach the “AmazonS3FullAccess” policy to this group.
    • Add users to this group.
  • Add user and set up access key
    • GUI (or, for the root user, the first time the access key is set up)
      • Add User:
        • Click “Users” on the left and then click “Create user” on the right.
        • Click “Next” and follow the instructions.
      • Manage Access Keys:
        • Find “Security recommendations” on the IAM dashboard.
        • Click “Manage access keys”.
      • Create an Access Key:
        • Go to the “Access keys” section.
        • Select “Create access key”.
      • Retrieve Your Access Key and Secret Access Key:
        • A dialogue will show your Access Key ID and Secret Access Key.
        • Check the box, then click “Next”.
        • Download a copy of these keys for safekeeping.
    • CLI (using root credentials)
       aws iam create-user --user-name YourUserName
       aws iam add-user-to-group --user-name YourUserName --group-name Gao-lab
       aws iam create-access-key --user-name YourUserName
       # create a console password for the new user
       aws iam create-login-profile --user-name YourUserName --password NEW_PASSWORD

      The create-access-key command prints the Access Key ID and Secret Access Key; copy these keys for safekeeping.

  • Configure AWS CLI:
    • Run the following in your terminal:
      aws configure
      
    • Provide:
      • Your Access Key ID and Secret Access Key.
      • Region: us-east-1.
      • Output format (e.g., yaml).

Create project S3 Bucket

To create an S3 bucket, ensure your $S3_BUCKET name is globally unique and in lowercase. For example:

aws s3api create-bucket --bucket $S3_BUCKET --region $AWS_REGION

Example:

aws s3api create-bucket --bucket cumc-gao --region us-east-1
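
Note that us-east-1 is a special case; for any other region, the AWS CLI requires an explicit location constraint, for example:

aws s3api create-bucket --bucket $S3_BUCKET --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2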

MemVerge account management

First, login as admin,

float login -u <admin username> -p <admin passwd> -a <op_center_ip_address>

Then create a new user, for example tom,

float user add tom

Set up MemVerge OpCenter for project

FIXME: add how to set up an OpCenter here; step 1 in this GUI tutorial is nice, but it would be better to have CLI instructions

FIXME: [Gao] I don’t understand the concept and relevance of OpCenter, so when you describe how to set it up here, please also give a bit of background and motivation so that I (and other future admins) understand it

Appendix I: AWS and MemVerge Software Installation and Set Up

You can safely skip this section if you use the docker/singularity image provided by MemVerge, as detailed in the first section of this document. We keep these contents as an Appendix for book-keeping purposes.

AWS CLI tools

(Linux/Windows/Mac) https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

  • If you are a Mac user, you can use the commands below to install the AWS CLI tools.
       curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
       sudo installer -pkg AWSCLIV2.pkg -target /
    
  • If you are an HPC/Linux user, you can use the commands below to install the AWS CLI tools (also add export PATH=$PATH:/home/<UNI>/.local/bin to your ~/.bashrc and don’t forget to source it).
       curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
       unzip awscliv2.zip
       ./aws/install -i /home/<UNI>/.local/bin/aws-cli -b /home/<UNI>/.local/bin
    
  • If you are a Windows user, you can open cmd as administrator and use the command below to install the AWS CLI tools.
       msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi
    

Check that it was installed successfully with

   which aws
   aws --version

MemVerge float for job submission

Download Float from the Operation Center

  • For Linux users
       wget https://<op_center_ip_address>/float --no-check-certificate
       # Example using an IP address:
       wget https://54.81.85.209/float --no-check-certificate
    
  • For Mac Intel chip users
       wget https://<op_center_ip_address>/float.darwin_amd64 --no-check-certificate
       # Example using an IP address:
       wget https://54.81.85.209/float.darwin_amd64 --no-check-certificate
    
  • For Mac Apple Silicon (M chip) users
       wget https://<op_center_ip_address>/float.darwin_arm64 --no-check-certificate
       # Example using an IP address:
       wget https://54.81.85.209/float.darwin_arm64 --no-check-certificate
    
  • For Windows users, the float binaries above are not compatible, so you need to access https://54.81.85.209 and manually download the version of the tool specifically for Windows.

Alternatively, you can open the MMCloud OpCenter and download it through the GUI.

Move and Make It Executable

  • For Mac and Linux users with sudo access, replace <float_binary> below with the file you just downloaded,
       sudo mv <float_binary> /usr/local/bin/float
       sudo chmod +x /usr/local/bin/float
    
  • For users without sudo, you can either use alias float=/path/to/float_binary/float or add export PATH=$PATH:<PATH> to your ~/.bashrc, where <PATH> is the path to the float executable (see the sketch after this list). Don’t forget to source it afterwards. Then,
      chmod +x <PATH>/float 
    
  • For Windows users: files located in C:\Windows\System32 are automatically included in the system’s PATH environment variable, so any executable in this directory can be run from any location in the Command Prompt without specifying its full path. If float.exe is in this directory, you can run it from anywhere by just typing float.
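
For the no-sudo case above, a minimal sketch of the ~/.bashrc additions, assuming you saved the binary under /path/to/float_binary:

export PATH=$PATH:/path/to/float_binary    # make float discoverable
alias float=/path/to/float_binary/float    # or, equivalently, an alias

Run source ~/.bashrc afterwards to pick up the change.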

Addressing Mac Security Settings

Optional: For Mac Users

If you are using a Mac, float might be blocked due to your security settings. Follow these steps to address it:

  • Open ‘System Preferences’.
  • Navigate to ‘Privacy & Security’.
  • Under the ‘Security’ tab, you’ll see a message about float being blocked. Click ‘Allow Anyway’.

Rename Float to Avoid Name Conflicts

Optional for Mac users; note that renaming changes the float command name in all downstream instructions and future work.

If there’s an existing application or command named float, rename the downloaded float to mmfloat to avoid conflicts:

mv /usr/local/bin/float /usr/local/bin/mmfloat

Appendix II: Running AWS and MemVerge Software through container images

We recommend using singularity to run MemVerge and AWS tools. On conventional Linux-based HPC, singularity should be available. If you want to manage your computing from a local computer, here is a setup guide for Mac users.

If you prefer not to use singularity, or if the commands below do not work for you, please find the installation instructions for aws and float at the end of this document.

First, download the singularity image that contains MemVerge and AWS related tools:

singularity pull mmc_utils.sif docker://ashleytung148/mmc-utils:latest

When an update for this image is available, you can simply delete it and pull again:

rm -f mmc_utils.sif
singularity pull mmc_utils.sif docker://ashleytung148/mmc-utils:latest

Here are some command line aliases you may want to set in your bash configuration as shortcuts to use tools from mmc_utils.sif:

FIXME: test and improve this. for example Mac users may have different alias depending on how they installed singularity

alias aws="singularity exec /path/to/mmc_utils.sif aws"
alias float="singularity exec /path/to/mmc_utils.sif float"
alias mm_jobman.sh="singularity exec /path/to/mmc_utils.sif mm_jobman.sh"

where /path/to/mmc_utils.sif is where you save the mmc_utils.sif file.

If you are unable to run the float CLI through this method, you may install it from the https://<op_center_ip_address> site. You should see download options for Windows, Linux, and Mac at the bottom. Please contact your admin if you do not know your OpCenter IP address.

Appendix III: Upgrade MemVerge OpCenter

  • float release ls (check the versions that are available)
  • float release upgrade (upgrade to the latest)
  • wait for 1-2 minutes
  • float login (to log in again)
  • float release sync (upgrade the local float binary; you can skip this if you use the latest containers provided by MemVerge, see Appendix II. You may get a permission denied error; if so, please use sudo)
  • float release migrate --dbPath /mnt/memverge/data/opcenter (this is a one-time upgrade of the backend DB)
  • done!

Known issues

This section documents some of the known problems / possible improvements we can make down the road.

Better streamlined workflows

Currently we support only embarrassing parallel jobs running the same docker image. This is done via the mm_jobman.sh shell script which is a simple wrapper to float for job submission. Down the road we plan to use nextflow to manage other commands include those written in SoS. In the context of the FunGen-xQTL protocol for example, the mini-protocols can be implemented using nextflow whereas the modules can be implemented using any language including SoS.