Notes for System Admin

This section is only relevant to system admins. If you are a user you can ignore this page.

AWS setup

Setting Up Your IAM User and Account

This is a one-time job for the system admin, done through the GUI.

FIXME: the approach below gives everyone in the group full access to the whole bucket, so everyone can read and edit others’ files. That is convenient but also dangerous. Access needs to be managed better as a next step.

  • Log into AWS Console:
    • Navigate to AWS Console.
    • Sign up for a root AWS account if you’re new, else log in.
  • Search for IAM:
    • After logging in, search for “IAM” using the top search bar.
  • Create a group:
    • Click “User groups” on the left.
    • Attach the “AmazonS3FullAccess” policy to this group.
    • Add users to this group.
  • Add user and set up access key
    • GUI (or, for the root user, when setting up the access key for the first time)
      • Add User:
        • Click “Users” on the left and then click “Create user” on the right.
        • Click “Next” following instructions.
      • Manage Access Keys:
        • Find “Security recommendations” on the IAM dashboard.
        • Click “Manage access keys”.
      • Create an Access Key:
        • Go to the “Access keys” section.
        • Select “Create access key”.
      • Retrieve Your Access Key and Secret Access Key:
        • A dialogue will show your Access Key ID and Secret Access Key.
        • Check the box, then click “Next”.
        • Download a copy of these keys for safekeeping.
    • CLI (change to root access)
       aws iam create-user --user-name YourUserName
       aws iam add-user-to-group --user-name YourUserName --group-name Gao-lab
       aws iam create-access-key --user-name YourUserName
       # create password 
       aws iam create-login-profile --user-name YourUserName --password NEW_PASSWORD
      

      Copy the access key ID and secret access key from the output for safekeeping.

  • Configure AWS CLI:
    • Run the following in your terminal:
      aws configure
      
    • Provide:
      • Your Access Key ID and Secret Access Key.
      • Region: us-east-1.
      • Output format (e.g., yaml).
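The group steps above are described for the console only. A minimal CLI sketch of the same steps follows, assuming the group name Gao-lab from the CLI example above and the AWS-managed AmazonS3FullAccess policy. The run helper and the DRY_RUN variable are hypothetical conveniences for this sketch, not part of the AWS CLI. By default nothing is executed and the commands are only printed.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the group, attach the S3 policy, and add one user to it.
# Arguments: group name, user name (the user must already exist).
setup_group() {
  group="$1"
  user="$2"
  run aws iam create-group --group-name "$group"
  run aws iam attach-group-policy --group-name "$group" \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
  run aws iam add-user-to-group --user-name "$user" --group-name "$group"
}

# Example (dry run): setup_group Gao-lab YourUserName
```

Review the printed commands first, then rerun with DRY_RUN=0 from an account that has IAM admin rights.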

Create project S3 Bucket

To create an S3 bucket, ensure your $S3_BUCKET name is globally unique and in lowercase. For example:

aws s3api create-bucket --bucket $S3_BUCKET --region $AWS_REGION

Example:

aws s3api create-bucket --bucket cumc-gao --region us-east-1
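As a rough sketch, the naming rule can be checked before calling AWS. The helper names valid_bucket_name and create_bucket_cmd are hypothetical. The check covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit), since global uniqueness can only be verified by AWS itself. The sketch also reflects that regions other than us-east-1 need an explicit LocationConstraint.

```shell
# Return 0 if the name follows the basic S3 bucket naming rules.
valid_bucket_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# Print (not run) the create-bucket command for a bucket and region.
create_bucket_cmd() {
  bucket="$1"
  region="$2"
  if ! valid_bucket_name "$bucket"; then
    echo "invalid bucket name: $bucket" >&2
    return 1
  fi
  if [ "$region" = "us-east-1" ]; then
    echo "aws s3api create-bucket --bucket $bucket --region $region"
  else
    # Outside us-east-1 the region must also be given as a LocationConstraint.
    echo "aws s3api create-bucket --bucket $bucket --region $region --create-bucket-configuration LocationConstraint=$region"
  fi
}

# Example: create_bucket_cmd "$S3_BUCKET" "$AWS_REGION"
```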

AWS storage quota increase request

When we submit too many jobs that each load a large EBS volume, jobs submitted later may fail with:

2024-02-21T02:39:10.861: Failed to create float data volume, error: VolumeLimitExceeded: You have exceeded your maximum gp3 storage limit of 50 TiB in this region. Please contact AWS Support to request an Elastic Block Store service limit increase.

Although it is possible to ask AWS customer service to increase this limit, 50 TiB is a reasonable limit because EBS volumes should hold only temporary files. If your jobs use more than 50 TiB of EBS volume on the fly to support the computation, examine your jobs and decide whether this is truly necessary (likely not!).
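To see how close we are to the limit, one option is to total the sizes of the existing gp3 volumes. The sketch below is an assumption about how this could be done, not an established procedure. The helper names sum_gib and pct_of_limit are hypothetical, and the total only approximates what AWS counts against the quota.

```shell
# Sum whitespace-separated volume sizes (in GiB) read from stdin.
sum_gib() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# Print the percentage of a TiB limit used by a GiB total (1 TiB = 1024 GiB).
# Arguments: used GiB, limit in TiB.
pct_of_limit() {
  awk -v used="$1" -v limit_tib="$2" \
    'BEGIN { printf "%.1f\n", 100 * used / (limit_tib * 1024) }'
}

# Example with live data (requires configured AWS credentials):
#   used=$(aws ec2 describe-volumes --region us-east-1 \
#            --filters Name=volume-type,Values=gp3 \
#            --query 'Volumes[].Size' --output text | sum_gib)
#   pct_of_limit "$used" 50
```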

MMCloud setup

MMCloud account management

First, log in as admin:

float login -u <admin username> -p <admin passwd> -a <op_center_ip_address>

Then create a new user, for example tom:

float user add tom

Setup MMCloud OpCenter

MMCloud OpCenter is analogous to the login node on an HPC cluster.

As of May 2024, OpCenters are created and managed by the MemVerge support team. We no longer need to worry about setting them up ourselves.

Upgrade MMCloud OpCenter

You may be asked to upgrade MMCloud OpCenter from time to time. To do so,

  • float release ls (check which versions are available)
  • float release upgrade (upgrade to the latest version)
  • Wait 1-2 minutes.
  • float login (log in again)
  • float release sync (upgrade the local float binary. You can skip this if you use the latest containers provided by MemVerge, see Appendix II. If you get a permission denied error, use sudo.)
  • float release migrate --dbPath /mnt/memverge/data/opcenter (this is a one-time upgrade of the backend DB)
  • Done!
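The steps above can be collected into a small script. This is only a sketch: the run helper, the DRY_RUN variable and the upgrade_opcenter function are hypothetical names, and the 120-second wait is one reading of "1-2 minutes". By default the commands are printed rather than executed, so the sequence can be reviewed first.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run the upgrade steps in order.
upgrade_opcenter() {
  run float release ls          # check the available versions
  run float release upgrade     # upgrade to the latest version
  run sleep 120                 # give the OpCenter time to restart
  run float login               # log in again
  run float release sync        # optional: upgrade the local float binary
  run float release migrate --dbPath /mnt/memverge/data/opcenter  # one-time DB upgrade
}

# Example (dry run): upgrade_opcenter
```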

Configure OpCenter

We can configure the OpCenter in two ways:

  1. Using the GUI

Through the GUI, admins can change OpCenter settings, for example to expand the allowed instance types or to allow more retries on spot instances before a job fails.

Configure using GUI

  2. Using CLI configuration commands (recommended). You must be logged in as the admin user. Below are the configurations for the batch (3.82.198.55) and interactive (44.222.241.133) OpCenters.

For the 3.82.198.55 OpCenter:

float config set cloud.createVMPolicy spotFirst
float config set cloud.createVMRetryInterval 5m0s
float config set cloud.createVMRetryLimit "10"
float config set migrate.cpuDisable true
float config set migrate.cpuLowerBoundDuration 10m0s
float config set migrate.cpuLowerBoundRatio 1
float config set migrate.cpuLowerLimit 8
float config set migrate.cpuMigrateStep 10
float config set migrate.cpuUpperBoundDuration 10m0s
float config set migrate.cpuUpperBoundRatio 99
float config set migrate.memDisable false
float config set migrate.memLimit 100
float config set migrate.memLowerBoundDuration 10m0s
float config set migrate.memLowerBoundRatio 5
float config set migrate.memLowerLimit 8
float config set migrate.memMigrateStep 10
float config set migrate.memUpperBoundDuration 1m0s
float config set migrate.memUpperBoundRatio 90
float config set provider.allowList "*"
float config set provider.denyList "t*"
float config set scheduler.jobExecutorLimit 900
float config set cloud.handleRebalanceMemThreshold 128G
float config set cloud.swapFileSize 12G

For the 44.222.241.133 OpCenter:

float config set cloud.createVMPolicy spotFirst
float config set cloud.createVMRetryInterval 5m0s
float config set cloud.createVMRetryLimit "10"
float config set migrate.cpuDisable true
float config set migrate.cpuLowerBoundDuration 10m0s
float config set migrate.cpuLowerBoundRatio 1
float config set migrate.cpuLowerLimit 8
float config set migrate.cpuMigrateStep 10
float config set migrate.cpuUpperBoundDuration 10m0s
float config set migrate.cpuUpperBoundRatio 99
float config set migrate.memDisable false
float config set migrate.memLimit 100
float config set migrate.memLowerBoundDuration 10m0s
float config set migrate.memLowerBoundRatio 1
float config set migrate.memLowerLimit 8
float config set migrate.memMigrateStep 10
float config set migrate.memUpperBoundDuration 10m0s
float config set migrate.memUpperBoundRatio 90
float config set provider.allowList "*"
float config set provider.denyList "t*"
float config set scheduler.jobExecutorLimit 900
float config set cloud.handleRebalanceMemThreshold 128G
float config set cloud.swapFileSize 12G

Some of these commands may fail on a Mac. For example, float config set cloud.createVMRetryLimit -1 fails with "Error: unknown shorthand flag: '1' in -1", and float config set provider.allowList r5*,r6* fails with "zsh: no matches found: r5*,r6*". Enclosing the value in quotes works for the latter, as in float config set provider.allowList "r5*,r6*". When you run the float commands above, make sure that each line has been applied without any error.
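One way to make sure no setting is silently skipped is to apply the settings from a list and stop at the first failure. The sketch below assumes only the float config set command used above. The apply_config function and the FLOAT variable are hypothetical names introduced for illustration. Values are always passed quoted, so patterns such as t* reach float unchanged.

```shell
# Command used to talk to the OpCenter; can be overridden for testing.
FLOAT="${FLOAT:-float}"

# Read "key value" pairs from stdin and apply each one, stopping at
# the first failure and reporting which setting failed.
apply_config() {
  while read -r key value; do
    # Skip blank lines and comment lines.
    case "$key" in
      ''|'#'*) continue ;;
    esac
    # stdin is redirected so the command cannot consume the settings list.
    if ! "$FLOAT" config set "$key" "$value" </dev/null; then
      echo "FAILED: $key $value" >&2
      return 1
    fi
  done
}

# Example (the quoted 'EOF' keeps the shell from expanding t*):
#   apply_config <<'EOF'
#   cloud.createVMPolicy spotFirst
#   provider.denyList t*
#   EOF
```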

Additional settings

  • Gateway setup
  • Security group for port 8888 created in the AWS console

Clean up OpCenter space (recommended to contact the MemVerge support team first)

Sometimes we may need to free up space by deleting older builds on the root volume. The admin of the OpCenter instance should have the SSH credentials needed to ssh into the OpCenter and clean it up.

  1. Log in with ssh -i as ec2-user, using the identity created by the admin. This requires the pem key.
  2. Check the size of the log files with du -lh -d 1 and the overall disk usage with df -h.
  3. Run sudo podman image prune -a -f to free up space.
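Before pruning, it can help to check whether the root volume is actually close to full. The following is a minimal sketch. The helper names used_pct and needs_cleanup are hypothetical, and the 80% default threshold is an arbitrary assumption rather than a MemVerge recommendation.

```shell
# Print the percentage of space used on the filesystem holding a path.
used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above a threshold (default 80%).
# Arguments: path, optional threshold in percent.
needs_cleanup() {
  [ "$(used_pct "$1")" -ge "${2:-80}" ]
}

# Example on the OpCenter:
#   needs_cleanup / 80 && sudo podman image prune -a -f
```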

Compute resource quota increase request

Both AWS and the MMCloud OpCenter have a default limit on the number of jobs we can submit at a time. The AWS limit can be raised by contacting AWS customer service and requesting a higher quota for Spot instances (which MMCloud uses). Currently, our AWS quota has been increased to 4000 CPUs per AWS region, and it can be raised further if we request again. Requests are usually approved within 24 hours.

Admins should also raise the default maximum number of jobs of an OpCenter using the CLI. See the section on configuring the OpCenter via the CLI.

Install OEM packages

Admins can install packages by running the command below:

bash mm_jobman.sh --shared-admin ...

This uses the --shared-admin mode instead of the --oem-packages and --mount-packages modes. Other parameters are the same as for running an interactive session.

This will start a tmate session, where the admin can paste the tmate URL into the browser and run bash commands to update packages. IMPORTANT NOTE: Any changes made to the packages will affect ALL users of the batch setup. To install a package, run:

pixi global install <PACKAGE>

To install R packages:

pixi global install --environment r-base <PACKAGE>

To install Python packages:

pixi global install --environment python <PACKAGE>

To update an installed package, specify the version you want to update to:

pixi global install --environment r-base <PACKAGE>=<VERSION>
pixi global install --environment python <PACKAGE>=<VERSION>