Submit batch computing jobs to MMCloud

Login to MMCloud

Note: there is a firewall issue on our departmental HPC, so we cannot log in to float from the HPC. Please use your laptop or desktop to run the float commands.

This assumes that your admin has already created an MMCloud account for you, with a username and password provided. To log in,

float login -u <username> -a <op_center_ip_address>

Example:

float login -u gaow -a 54.81.85.209

Note: If you see an error like the token error below during your runs, you likely have not logged in to the appropriate OpCenter:

Error: Invalid token

Execute a simple command through pre-configured docker containers

Here is an example running a simple bash script susie_example.sh using the pecotmr_docker image available from our online Docker image repository. The script susie_example.sh has these contents (adapted from running ?susieR::susie in an R terminal):

#!/bin/bash
Rscript - << 'EOF'
# susie example
library(susieR)
set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
res = susie(X,y,L = 10)
print("Maximum posterior inclusion probability:")
print(max(res$pip))
saveRDS(res, "susie_example_result.rds")
EOF

The command below will submit this bash script to AWS, requesting 2 CPUs and 8 GB of memory.

float submit -i ghcr.io/cumc/pecotmr_docker -j susie_example.sh -c 2 -m 8

It will print some quantities to the standard output stream, which you can track with the following command:

float log cat stdout.autosave -j <job_id>

where <job_id> is available via float list, which shows all jobs on the current MMCloud OpCenter. You can roughly think of the OpCenter as the “login node” of an HPC that manages job submission. Please use float -h to learn more about job management with the float command.
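Putting the pieces together, a typical sequence after submitting the susie example looks like the following; the job ID is a placeholder you take from the float list output:

float submit -i ghcr.io/cumc/pecotmr_docker -j susie_example.sh -c 2 -m 8
# list all jobs on the current OpCenter and note the ID of the job you just submitted
float list
# follow the auto-saved standard output of that job
float log cat stdout.autosave -j <job_id>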

Notice that even though this R script writes its output to a file called susie_example_result.rds, you will not find that file after the job finishes, because by default it is written somewhere inside the virtual machine (VM) instance that was started to run the computation, and that instance is deleted after the job is done. To keep the results we need to mount an AWS S3 bucket, where we keep our data, to the VM. This will be introduced in the next section.

Submit multiple jobs for “embarrassingly parallel” processing

To submit multiple computing tasks all at once, we assume that:

  1. Each job is a single line of bash command
  2. There are no dependencies between jobs, so multiple jobs can be executed in parallel
  3. All of these jobs use similar CPU and memory resources

Example: running multiple SoS pipelines in parallel

Suppose you have a bioinformatics pipeline script in the SoS language, a text file called my_pipeline.sos that reads like this:

[global]
parameter: gene = str
parameter: entrypoint = ""

[default]
output: f"{gene}.txt"
R: entrypoint=entrypoint, expand=True
  write("success", {_output:r})

Make sure my_pipeline.sos exists in your bucket, as that is how the VM will be able to access it.
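For example, you could upload the script to the root of your bucket with the AWS CLI (the bucket name below is a placeholder); it will then be visible under the mount point once the bucket is mounted:

# upload the pipeline script to the root of your S3 bucket (placeholder bucket name)
aws s3 cp my_pipeline.sos s3://<your-bucket-name>/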

Suppose you have 3 jobs to run in parallel, possibly using different docker images, like below. The <location> in the script will be where your bucket will be mounted on your instance.

sos run /<location>/my_pipeline.sos --gene APOE4
sos run /<location>/my_pipeline.sos --gene BIN1
sos run /<location>/my_pipeline.sos --gene TREM2

You save these lines to run_my_job.sh and use mm_jobman.sh to submit them. mm_jobman.sh is a utility script that calls the aws and float executables to prepare and submit jobs; the script is available here on GitHub.
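For example, one way to create run_my_job.sh from the three commands above (assuming your bucket will be mounted at <directory-to-mount>, matching the submission command below):

cat > run_my_job.sh << 'EOF'
sos run /<directory-to-mount>/my_pipeline.sos --gene APOE4
sos run /<directory-to-mount>/my_pipeline.sos --gene BIN1
sos run /<directory-to-mount>/my_pipeline.sos --gene TREM2
EOF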

To submit,

mm_jobman.sh run_my_job.sh \
  -c 2 \
  -m 8 \
  --mountOpt mode=rw \
  --mount <your-bucket-name>:<directory-to-mount> \
  --download /home/jovyan/TEMP/data:<download-folder-in-bucket-1>/ \
             /home/jovyan/TEMP/resource:<download-folder-in-bucket-2>/ \
  --upload /home/jovyan/TEMP/output:<upload-folder-in-bucket-1>/ \
           /home/jovyan/TEMP/cache:<upload-folder-in-bucket-2>/ \
  --opcenter 54.81.85.209 \
  --image ghcr.io/cumc/pecotmr_docker:latest \
  --entrypoint "source /usr/local/bin/_activate_current_env.sh" \
  --env ENV_NAME=pecotmr \
  --job-size 3 \
  --parallel-commands 1 \
  --imageVolSize 10 \
  --dryrun

Keep in mind that the --dryrun parameter above will not actually run your commands but will print them out instead for debugging purposes; remove it when you are ready to submit.

There are many parameters you can use in your run. You can call mm_jobman.sh --help to see the comprehensive list of parameters, which is also shown below.

Options:
    -c <min>:<optional max>                   Specify the exact number of CPUs to use, or, with ':', the min and max of CPUs to use. Required.
    -m <min>:<optional max>                   Specify the exact amount of memory to use, or, with ':', the min and max of memory in GB. Required.
    --cwd <value>                             Define the working directory for the job (default: ~).
    --download <remote>:<local>               Download files/folders from S3. Format: <S3 path>:<local path> (optional).
    --upload <local>:<remote>                 Upload folders to S3. Format: <local path>:<S3 path> (optional).
    --download-include '<value>'              Use the include flag to include certain files for download (space-separated) (optional).
    --dryrun                                  Execute a dry run, printing commands without running them.
    --entrypoint '<command>'                  Set the initial command to run in the job script (required).
    --image <value>                           Specify the Docker image to use for the job (required).
    --job-size <value>                        Set the number of commands per job for creating virtual machines (required).
    --mount <bucket>:<local>                  Mount an S3 bucket to a local directory. Format: <bucket>:<local path> (optional).
    --mountOpt <value>                        Specify mount options for the bucket (required if --mount is used).
    --ebs-mount <folder>=<size>               Mount an EBS volume to a local directory. Format: <local path>=<size>. Size in GB (optional).
    --no-fail-fast                            Continue executing subsequent commands even if one fails.
    --opcenter <value>                        Provide the Opcenter address for the job (default: 23.22.157.8).
    --parallel-commands <value>               Set the number of commands to run in parallel (default: min number of CPUs).
    --min-cores-per-command <value>           Specify the minimum number of CPU cores required per command (optional).
    --min-mem-per-command <value>             Specify the minimum amount of memory in GB required per command (optional).
    --help                                    Show this help message.


Some extra points on the parameters:

  • For the --upload option, providing a trailing / for the <local> folder will copy the contents of the folder into the <remote> folder, while omitting the trailing / will copy the entire folder. Remember not to put a trailing / at the end of the <remote> folder of your upload command (see the example after this list).
  • Providing just one argument for --mountOpt will have all --mount buckets share the same mount options. Multiple --mountOpt arguments will set mount options for the respective buckets in order. Therefore, if multiple --mountOpt arguments are provided, the script will expect the same number of --mount options.
  • For the --download option, if your <remote> is intended to be a folder, add a trailing / at the end. This way the folder is copied as a folder into your <local> directory. If no trailing / is provided, you will be copying a single file and overwriting the file at <local>.
  • Each --download-include will correspond to their respective --download (first --download-include will be for the first --download, etc). It is up to the user to make sure their parameters are correct and to make note of this respective matching.
  • Since by default the working directory of the job is ~, you will need to provide the location of my_pipeline.sos relative to that. For example, if I submit the job and mount my bucket (where I have previously uploaded my_pipeline.sos) under <directory-to-mount>, my job script should reflect that, e.g. sos run /<directory-to-mount>/my_pipeline.sos --gene APOE4.
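To illustrate the trailing-slash and --mountOpt rules above, here is a hypothetical variation of the earlier submission (bucket and folder names are placeholders): two buckets are mounted with one --mountOpt each, matched in order, and the trailing / on the local upload folder copies its contents, rather than the folder itself, into the bucket.

mm_jobman.sh run_my_job.sh \
  -c 2 \
  -m 8 \
  --mount <data-bucket>:<data-mount-dir> \
  --mount <output-bucket>:<output-mount-dir> \
  --mountOpt mode=rw \
  --mountOpt mode=rw \
  --upload /home/jovyan/TEMP/output/:<upload-folder-in-output-bucket> \
  --opcenter 54.81.85.209 \
  --image ghcr.io/cumc/pecotmr_docker:latest \
  --entrypoint "source /usr/local/bin/_activate_current_env.sh" \
  --env ENV_NAME=pecotmr \
  --job-size 3 \
  --dryrun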

For additional examples please check them out here on GitHub.

Monitoring jobs on MMCloud

After submitting jobs to the cloud through an MMCloud OpCenter, we need to monitor the status and details of the jobs, either through the web interface of the OpCenter or using the CLI. So far we have 3 OpCenters for our lab (2 in east1 and 1 in west1):

east1:

  1. 54.81.85.209 (large, old version opcenter) - We highly recommend using the new opcenter
  2. 23.22.157.8 (xlarge, new version opcenter)

west1:

  1. 54.176.176.205 (xlarge, new version opcenter)

FIXME: this is no longer relevant as we have different opcenters now. we need to change it but let’s hold on because i may have a better way to do this

To view the status of the job you just submitted, open the web interface (GUI) of your OpCenter by typing its IP address into a browser, log in with your username and password, and click on Jobs in the left-hand panel. The status of a submitted job can be: Executing, Completed, Cancelled, FailedToComplete, FailedToExecute, Floating, Initializing, Starting, Submitted, etc. We can also check the status using a CLI query

float squeue

which should show the job ID. Then check the log files generated for a job,

float log -j <jobID> ls #check log file name 
float log -j <jobID> cat <logfile> #check log file contents

Another way to get the logs of a job (for example, from a script) is from the OpCenter itself: logs are stored under /mnt/memverge/slurm/work, in two levels of directories that correspond to the first two pairs of characters of the job ID. For example, for job ID “jkbzo4y7c529fiko0jius”, the contents are stored under /mnt/memverge/slurm/work/jk/bz/o4y7c529fiko0jius.
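As a small sketch, assuming you are on the OpCenter host and the directory layout described above, the log directory can be derived from a job ID like this:

# derive the on-OpCenter log directory from a job ID,
# following the /mnt/memverge/slurm/work/<aa>/<bb>/<rest> layout described above
job_id="jkbzo4y7c529fiko0jius"   # example job ID from the text
log_dir="/mnt/memverge/slurm/work/${job_id:0:2}/${job_id:2:2}/${job_id:4}"
ls "$log_dir"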

It is also possible to use those IDs to save the log file contents via

float log download -j JOB_ID

Cancel and rerun failed jobs

  1. Cancel all or selected jobs
float scancel --filter "status=*" -f
# or
float scancel -f --filter user=* 

To suspend a job through the CLI, use float suspend -j <job_id>.

  2. Rerun FailedToComplete jobs since a given submission date

Unfortunately, MemVerge currently does not have a way to rerun multiple jobs at once given a submission date or any other filter. As of now, the only ways to rerun a job are through the GUI (one at a time) or with float rerun -j <JOB_ID> on the CLI. Consider changing the -c and -m parameters when rerunning, to avoid floating, which is often the reason these jobs failed to complete.
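If you have collected the IDs of the failed jobs (for example from the GUI or from float squeue), a minimal bash sketch to rerun them one at a time could look like this; the job IDs are placeholders:

# placeholder job IDs collected from the GUI or `float squeue`
failed_jobs=(
  "<job_id_1>"
  "<job_id_2>"
)

for job_id in "${failed_jobs[@]}"; do
  # rerun each failed job; consider adjusting resources as noted above before rerunning
  float rerun -j "$job_id"
done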

Trouble-shooting

This section documents frequently encountered issues and solutions.

Server internal error

You may see the following error: Error: server internal error. please contact the service owner. detail: json: cannot unmarshal string into Go struct field jobRest.Jobs.actualEmission of type float64.

The likely reason for this is that your local float binary version is older than the OpCenter version. You can follow the lab wiki instructions to upgrade or re-install the float binary.

Failure to Connect to MMCloud Due to Network Issues

It is known that:

  1. float cannot connect to MMCloud from Columbia Neurology HPC.
  2. float cannot connect to MMCloud from some VPN networks at certain institutes, although it can connect if you are on the CUMC VPN.

When these issues occur, please consider switching to a different network and try again.

Known issues

This section documents some of the known problems / possible improvements we can make down the road.

Better streamlined workflows

Currently we support only embarrassingly parallel jobs running the same docker image. This is done via the mm_jobman.sh shell script, which is a simple wrapper around float for job submission. Down the road we plan to use nextflow to manage other commands, including those written in SoS. In the context of the FunGen-xQTL protocol, for example, the mini-protocols can be implemented using nextflow whereas the modules can be implemented using any language, including SoS.