Example: Simulation Study on MMCloud in R

Written by Alex McCreight, May 2024

Here we explain the logic for designing the Rscript that will be submitted to MMCloud, the cloud service we use in the lab.

The first step in designing your R script is to load the libraries, data, and external functions that the script uses. Please note that all data and external functions must be uploaded to an S3 bucket, and all libraries must be included in the Docker image. For this example, we will create simulation_script.R from scratch and build it up step by step. We start by setting the path to our S3 bucket, which we mount at /home/apm2217/data, and then read in the data, functions, and libraries.

# Set Path to Mounted S3 Bucket
data_path <- "/home/apm2217/data/"

# Read-in functions/script(s) from mounted bucket

# Read-in data from mounted bucket
X <- readRDS(paste0(data_path, "X20"))

# Read-in Libraries (must be included in your docker image)
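
The loading steps above can be sketched end to end as follows. This is only an illustration: a temporary directory stands in for the mounted bucket, and the file helpers.R and the function center_columns() are hypothetical placeholders, not part of the actual project.

```r
# Stand-in for the mounted bucket; on MMCloud this would be "/home/apm2217/data/"
data_path <- paste0(tempdir(), "/")

# Create placeholder files so the sketch runs anywhere (hypothetical contents)
saveRDS(matrix(rnorm(20), nrow = 5), paste0(data_path, "X20"))
writeLines("center_columns <- function(X) scale(X, center = TRUE, scale = FALSE)",
           paste0(data_path, "helpers.R"))

# Read in functions/script(s) from the mounted bucket
source(paste0(data_path, "helpers.R"))

# Read in data from the mounted bucket
X <- readRDS(paste0(data_path, "X20"))

# Load libraries (must be installed in the Docker image); stats ships with R
library(stats)

# Use the sourced helper on the loaded data
X_centered <- center_columns(X)
```

Building every path from data_path means only one line changes if the mount point changes.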

Important: Accessing Non-Exported Functions from R Packages

When using external R scripts stored in a mounted bucket, you may find that certain functions from other packages are not recognized. This usually happens because the functions are not exported in the NAMESPACE of the package they come from. For example, susie-ash.R, which extends the susie() function from susieR, calls functions that susieR does not export, so they are not accessible when the script is executed in the cloud. To use non-exported functions in your scripts, you MUST reference them explicitly with the ::: operator, which accesses non-exported functions directly from the package.

For example, susie-ash.R uses the compute_colstats() function from susieR, which is NOT listed as an export in susieR's NAMESPACE file. So, to use this function in my susie-ash.R script, I must call it as susieR:::compute_colstats(). If you are uncertain whether a function is exported, you can check the NAMESPACE file in the package’s GitHub repository and search for the function name to see whether it is included in the exports.
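
If the package is installed locally, the same check can also be done from R rather than on GitHub. The sketch below uses the stats package as a stand-in so that it runs without susieR; is_exported() is a hypothetical helper name, not a function from any package.

```r
# Returns TRUE if `fn` is listed among the exports of package `pkg`
is_exported <- function(fn, pkg) {
  fn %in% getNamespaceExports(pkg)
}

# median() is exported, so stats::median() or plain median() works
median_exported <- is_exported("median", "stats")

# Pillai() is internal to stats, so it must be called as stats:::Pillai()
pillai_exported <- is_exported("Pillai", "stats")

print(c(median = median_exported, Pillai = pillai_exported))

# The analogous check for the example in the text would be
# is_exported("compute_colstats", "susieR"), assuming susieR is installed.
```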

Now, in this example simulation_script.R, I include functions for data generation (using real data), running methods, computing metrics, and a final one which will run all of these together. For this example, we have seven input parameters: num_simulations, total_heritability, sparse_effects, nonsparse_coverage, theta_beta_ratio, L, and threshold. We first set these equal to default parameters which makes it possible to interactively run this function on your own machine to help you test and tweak the function as needed. Below the default parameter initialization, we include an option to run through a set of parameters to be specified in the next section.

After choosing parameters, either default or prespecified, the simulation() function runs the previously defined data generation, method, and scoring functions and saves the results. Note that before you save your results, you need to point the output directory at a mounted bucket. For this simulation, we will save a series of .Rds files in /home/apm2217/output/.

# Function to generate data using real data imported from the S3 bucket
generate_data <- function(...) {
  # (body omitted)
}

# Function that runs various methods on the generated data and computes metrics
method_and_score <- function(...) {
  # (body omitted)
}

# The main simulation function that combines the data generation and method/scoring function
simulation <-  function(num_simulations = NULL, total_heritability = NULL, sparse_effects = NULL, nonsparse_coverage = NULL, theta_beta_ratio = NULL, L = NULL, threshold = NULL){

# Set default parameters
num_simulations = 2
total_heritability = 0.5
sparse_effects = 3
nonsparse_coverage = 0.01
theta_beta_ratio = 1.4
L = 10
threshold = 0.95

# If provided, override the defaults with command-line arguments of the form name=value
for (arg in commandArgs(trailingOnly = TRUE)) {
  eval(parse(text = arg))
}

#### {running data generation, methods, scoring multiple iterations} ####

# Set Path to Mounted S3 Bucket Output Folder
output_dir <- "/home/apm2217/output"

# Save simulation results to a list
simulation_results <- list(
  avg_metrics = avg_metrics,
  all_metrics = all_metrics,
  all_betas = all_betas,
  all_thetas = all_thetas,
  all_susie_outputs = all_susie_outputs,
  all_susie_ash_outputs = all_susie_ash_outputs,
  all_seeds = all_seeds
)

# Create dynamic file names for each simulation scenario
file_name <- paste0("numIter", num_simulations,
                    "_totHeritability", total_heritability,
                    "_sparseEffect", sparse_effects,
                    "_nonsparse", nonsparse_coverage,
                    "_ratio", theta_beta_ratio,
                    "_L", L,
                    ".Rds")

# Save the simulation results to the output directory using dynamic file name
saveRDS(simulation_results, file.path(output_dir, file_name))

  # Return all results
  return(simulation_results)
}
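
As an illustration of the "defaults first, then command-line overrides" logic, here is a minimal sketch that parses name=value strings without evaluating them as code. The helper parse_params() is hypothetical and is not part of simulation_script.R; it assumes every parameter is numeric, as in this example.

```r
# Hypothetical helper: start from the defaults and override them with any
# "name=value" strings, rejecting malformed or unknown arguments.
parse_params <- function(args, defaults) {
  params <- defaults
  for (arg in args) {
    parts <- strsplit(arg, "=", fixed = TRUE)[[1]]
    if (length(parts) != 2 || !(parts[1] %in% names(defaults))) {
      stop("Unrecognized argument: ", arg)
    }
    params[[parts[1]]] <- as.numeric(parts[2])
  }
  params
}

defaults <- list(num_simulations = 2, total_heritability = 0.5,
                 sparse_effects = 3, nonsparse_coverage = 0.01,
                 theta_beta_ratio = 1.4, L = 10, threshold = 0.95)

# In the real script this vector would come from commandArgs(trailingOnly = TRUE)
example_args <- c("num_simulations=5", "nonsparse_coverage=0.05", "L=10")
params <- parse_params(example_args, defaults)
str(params)
```

With no arguments, parse_params() simply returns the defaults, which is what makes interactive testing on your own machine possible.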

Creating a commands_to_submit.txt file using R

Now that simulation_script.R is set up, we can define the simulation scenarios that we would like to investigate. We will write an R script, commands_to_submit.R, that you run on your local machine to produce the commands_to_submit.txt file. First, you must define your parameter grid. expand.grid() takes one or more values for each parameter and returns a data frame containing every combination of those values. See the example below.

# Define parameter grid
parameter_grid <- expand.grid(
  num_simulations = c(5),
  total_heritability = c(0.50),
  sparse_effects = c(2),
  nonsparse_coverage = c(0.01, 0.05),
  theta_beta_ratio = c(1.4, 3),
  L = c(10),
  threshold = c(0.90),
  stringsAsFactors = FALSE
)

For this example, we specify a single value for num_simulations, total_heritability, sparse_effects, L, and threshold. For nonsparse_coverage and theta_beta_ratio we specify two values each. Thus, there are 2 * 2 = 4 combinations, covering every possible simulation scenario.

  1. num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.01 theta_beta_ratio=1.4 L=10 threshold=0.9
  2. num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.05 theta_beta_ratio=1.4 L=10 threshold=0.9
  3. num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.01 theta_beta_ratio=3 L=10 threshold=0.9
  4. num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.05 theta_beta_ratio=3 L=10 threshold=0.9
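
The count of scenarios can be checked directly in R. This short sketch uses a reduced grid with only three of the parameters; the number of rows equals the product of the number of values supplied for each parameter, and the parameter listed first varies fastest.

```r
# Reduced grid: two parameters with two values each, one with a single value
parameter_grid <- expand.grid(
  nonsparse_coverage = c(0.01, 0.05),
  theta_beta_ratio = c(1.4, 3),
  L = c(10),
  stringsAsFactors = FALSE
)

# Expected number of scenarios: product of the number of values per parameter
n_expected <- prod(c(2, 2, 1))
stopifnot(nrow(parameter_grid) == n_expected)

print(parameter_grid)
```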

After we specify the parameter grid, we create the commands_to_submit.txt file and begin adding to it. We loop over each row of the grid and paste everything together in the proper format. Note that when you submit this job to the cloud, you will upload the simulation_script.R that we created in the previous section to your S3 data bucket. Here we call our script using the path to the mounted bucket that contains it: /home/apm2217/data/simulation_script.R.

# Create the commands_to_submit.txt file
commands_file <- "commands_to_submit.txt"
file_conn <- file(commands_file, open = "w")

# Iterate over each row of the parameter grid and write commands to the file
for (i in seq_len(nrow(parameter_grid))) {
  params <- parameter_grid[i, ]

  # Extract parameter values (use [[ ]] to get plain values, not one-column data frames)
  num_simulations <- params[["num_simulations"]]
  total_heritability <- params[["total_heritability"]]
  sparse_effects <- params[["sparse_effects"]]
  nonsparse_coverage <- params[["nonsparse_coverage"]]
  theta_beta_ratio <- params[["theta_beta_ratio"]]
  L <- params[["L"]]
  threshold <- params[["threshold"]]

  # Create the command
  command <- paste0("Rscript /home/apm2217/data/simulation_script.R",
                    " num_simulations=",num_simulations,
                    " total_heritability=", total_heritability,
                    " sparse_effects=", sparse_effects,
                    " nonsparse_coverage=", nonsparse_coverage,
                    " theta_beta_ratio=", theta_beta_ratio,
                    " L=", L,
                    " threshold=", threshold)

  # Write the command to the file
  writeLines(command, file_conn)
}

# Close the file connection
close(file_conn)

cat("Commands file 'commands_to_submit.txt' created successfully.\n")

Run this script on your local machine and take note of where the resulting commands_to_submit.txt file is saved. Later, when we run the main script that submits the job to the cloud, you must include the local path to commands_to_submit.txt. Below is what the commands_to_submit.txt file looks like for this example.

Rscript /home/apm2217/data/simulation_script.R num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.01 theta_beta_ratio=1.4 L=10 threshold=0.9
Rscript /home/apm2217/data/simulation_script.R num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.05 theta_beta_ratio=1.4 L=10 threshold=0.9
Rscript /home/apm2217/data/simulation_script.R num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.01 theta_beta_ratio=3 L=10 threshold=0.9
Rscript /home/apm2217/data/simulation_script.R num_simulations=5 total_heritability=0.5 sparse_effects=2 nonsparse_coverage=0.05 theta_beta_ratio=3 L=10 threshold=0.9
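
The same lines can also be produced without naming each parameter by hand. The sketch below is one possible alternative, not the script used above; build_commands() is a hypothetical helper that builds one command per grid row from the column names, so adding a parameter to the grid requires no other change.

```r
# Hypothetical helper: one "Rscript <script> name=value ..." line per grid row
build_commands <- function(grid, script_path) {
  # For each column, build a vector of "name=value" strings (one per row)
  pairs <- lapply(names(grid), function(nm) paste0(nm, "=", grid[[nm]]))
  # Paste the columns together row by row, separated by single spaces
  paste("Rscript", script_path, do.call(paste, pairs))
}

parameter_grid <- expand.grid(
  num_simulations = c(5),
  total_heritability = c(0.50),
  sparse_effects = c(2),
  nonsparse_coverage = c(0.01, 0.05),
  theta_beta_ratio = c(1.4, 3),
  L = c(10),
  threshold = c(0.90),
  stringsAsFactors = FALSE
)

commands <- build_commands(parameter_grid, "/home/apm2217/data/simulation_script.R")

# Written to a temporary file here; use "commands_to_submit.txt" in practice
commands_file <- file.path(tempdir(), "commands_to_submit.txt")
writeLines(commands, commands_file)
```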

Important: Formatting

When building the commands with the paste0() function, you must capitalize “Rscript” exactly as shown, and you cannot put spaces between the parameter name, the equal sign, and the parameter value. Either of these errors will cause the job to fail.
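
Both rules can be checked before submitting. The following is a minimal sketch of such a check; is_valid_command() and its regular expression are hypothetical and only cover the two rules described here.

```r
# Hypothetical check: a line must start with "Rscript", followed by a script
# path and one or more name=value pairs with no spaces around "="
is_valid_command <- function(lines) {
  pattern <- "^Rscript [^ ]+( [A-Za-z_.][A-Za-z0-9_.]*=[^ =]+)+$"
  grepl(pattern, lines)
}

example_lines <- c(
  "Rscript /home/apm2217/data/simulation_script.R num_simulations=5 L=10",
  "rscript /home/apm2217/data/simulation_script.R num_simulations=5 L=10",
  "Rscript /home/apm2217/data/simulation_script.R num_simulations = 5 L=10"
)

# One logical value per line; FALSE flags a line that breaks a rule
valid <- is_valid_command(example_lines)
print(valid)
```

In practice you could run this on readLines("commands_to_submit.txt") and stop if any line is flagged.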

Combining the Rscript, commands_to_submit.txt, and mm_jobman.sh into the final script

Finally, once you have created your R script and commands_to_submit.txt, you are ready to write the script that you will submit to the cloud. Here, we will call this script susie_ash_example.sh. For reproducibility, you should include a username argument that is used when you mount your files. Notice that two local files are referenced in this script, ./src/mm_jobman.sh and ./commands_to_submit.txt, so you must provide the paths to these files on your local machine. You can download the mm_jobman.sh script from GitHub, and we walked through how to make commands_to_submit.txt above.

For this example, I have already uploaded my data and external functions to my own S3 bucket, statfungen/ftp_fgc_xqtl/interactive_analysis/$username/susie-ash-data, and I mount this bucket to the /home/apm2217/data folder that the R script above reads from. You do not need to create the output folder in your bucket before you submit your job; it is created automatically. Recall that the R script above saves its outputs to /home/apm2217/output.

./src/mm_jobman.sh \
 ./commands_to_submit.txt \
 -c 2 -m 16 \
 --job-size 4 \
 --parallel-commands 4 \
 --mount statfungen/ftp_fgc_xqtl/interactive_analysis/$username/susie-ash-data:/home/$username/data \
 --mount statfungen/ftp_fgc_xqtl/analysis_result/interactive_analysis/$username/susie-ash-output:/home/$username/output \
 --mountOpt "mode=r" "mode=rw" \
 --cwd "/home/$username/data" \
 --image ghcr.io/cumc/pecotmr_docker:latest \
 --entrypoint "source /usr/local/bin/_activate_current_env.sh" \
 --env ENV_NAME=pecotmr \
 --imageVolSize 10

Once susie_ash_example.sh is created, change to the directory on your local machine from which the relative paths to mm_jobman.sh and commands_to_submit.txt resolve. From that directory, run the script with bash in your terminal. Please ensure that you have already logged into your float account and opcenter.

bash susie_ash_example.sh