Notes for System Admin
This section is only relevant to system admins. If you are a user you can ignore this page.
AWS setup
Setting Up Your IAM User and Account
This is a one-time job for the system admin, done through the GUI.
FIXME: the approach below gives everyone in the group full access to the whole bucket, so everyone can read and edit other users’ files. That is convenient but also dangerous, and access needs to be managed better as a next step.
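One possible way to address the FIXME above is to replace “AmazonS3FullAccess” with a policy that confines each IAM user to a bucket prefix named after them. The sketch below only prints such a policy document. The function name make_user_prefix_policy and the one-prefix-per-user layout are assumptions made for illustration, not the current setup; ${aws:username} is the IAM policy variable that resolves to the calling user's name.

```shell
#!/bin/sh
# Hypothetical helper: print an IAM policy that limits each user to
# s3://<bucket>/<their IAM user name>/ (an assumed layout, not the current one).
make_user_prefix_policy() (
  bucket="$1"
  # \$ keeps ${aws:username} literal so that IAM, not the shell, expands it.
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOwnPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}",
      "Condition": {"StringLike": {"s3:prefix": ["\${aws:username}/*"]}}
    },
    {
      "Sid": "ReadWriteOwnPrefix",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${bucket}/\${aws:username}/*"
    }
  ]
}
EOF
)

# Print the policy for the example bucket used on this page.
make_user_prefix_policy cumc-gao
```

The printed document could then be saved to a file and attached to the group in place of the full-access policy; whether a per-user prefix fits the lab's data layout is a decision for the admin.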
- Log into AWS Console:
- Navigate to AWS Console.
- Sign up for a root AWS account if you’re new, else log in.
- Search for IAM:
- After logging in, search for “IAM” using the top search bar.
- Create Group
- Click “User groups” on the left.
- Attach “AmazonS3FullAccess” to this group
- Add Users to this group.
- Add user and set up access key
- GUI (or possibly as the root user, when setting up the access key for the first time)
- Add User:
- Click “Users” on the left and then click “Create user” on the right.
- Click “Next” following instructions.
- Manage Access Keys:
- Find “Security recommendations” on the IAM dashboard.
- Click “Manage access keys”.
- Create an Access Key:
- Go to the “Access keys” section.
- Select “Create access key”.
- Retrieve Your Access Key and Secret Access Key:
- A dialogue will show your Access Key ID and Secret Access Key.
- Check the box, then click “Next”.
- Download a copy of these keys for safekeeping.
- CLI (change to root access)
aws iam create-user --user-name YourUserName
aws iam add-user-to-group --user-name YourUserName --group-name Gao-lab
aws iam create-access-key --user-name YourUserName
# create password
aws iam create-login-profile --user-name YourUserName --password NEW_PASSWORD
Copy these keys for safekeeping.
- Configure AWS CLI:
- Run the following in your terminal:
aws configure
- Provide:
- Your Access Key ID and Secret Access Key.
- Region: us-east-1.
- Output format (e.g., yaml).
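The CLI user-creation commands above can be wrapped in a small script when several users need to be onboarded. This is a minimal sketch: the helper names run and onboard_user and the DRY_RUN switch are assumptions introduced here. DRY_RUN defaults to 1, so the commands are only printed. The create-login-profile step is left out because it takes a password on the command line.

```shell
#!/bin/sh
# Sketch: onboard one IAM user with the same aws iam commands as the CLI steps.
# DRY_RUN (hypothetical switch) defaults to 1: print the commands, run nothing.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"   # show the command only
  else
    "$@"        # execute it for real (needs admin credentials)
  fi
}

onboard_user() {
  # $1: IAM user name, $2: IAM group name; stops at the first failing step
  run aws iam create-user --user-name "$1" &&
  run aws iam add-user-to-group --user-name "$1" --group-name "$2" &&
  run aws iam create-access-key --user-name "$1"
}

onboard_user YourUserName Gao-lab
```

Running the script with DRY_RUN=0 executes the three commands in order; the access key printed by the last one should be copied for safekeeping, as noted above.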
Create project S3 Bucket
To create an S3 bucket, ensure your $S3_BUCKET name is globally unique and in lowercase. For example:
aws s3api create-bucket --bucket $S3_BUCKET --region $AWS_REGION
Example:
aws s3api create-bucket --bucket cumc-gao --region us-east-1
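Before calling create-bucket, the name can be checked locally against the basic naming rules. The sketch below covers only a subset of them (length of 3 to 63 characters; lowercase letters, digits, dots and hyphens; a letter or digit at both ends). The function name is_valid_bucket_name is an assumption, and global uniqueness can only be confirmed by AWS itself.

```shell
#!/bin/sh
# Hypothetical pre-check for an S3 bucket name; covers only part of the AWS rules.
is_valid_bucket_name() {
  # Length must be between 3 and 63 characters.
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 63 ] || return 1
  # Lowercase letters, digits, dots, hyphens; letter or digit at both ends.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]*[a-z0-9]$'
}

S3_BUCKET=cumc-gao
if is_valid_bucket_name "$S3_BUCKET"; then
  echo "ok: $S3_BUCKET"
else
  echo "invalid bucket name: $S3_BUCKET"
fi
```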
AWS storage quota increase request
When we submit too many jobs that each load a large EBS volume, jobs submitted later may fail with:
2024-02-21T02:39:10.861: Failed to create float data volume, error: VolumeLimitExceeded: You have exceeded your maximum gp3 storage limit of 50 TiB in this region. Please contact AWS Support to request an Elastic Block Store service limit increase.
Although it is possible to ask AWS customer service to increase this limit, EBS volumes should hold only temporary files, so 50 TiB is a decent limit. If your jobs use more than 50 TiB of EBS volume on the fly to support the computing, examine your jobs and decide whether this is truly necessary (likely not!).
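To see how close the region is to the limit before submitting a large batch, the total provisioned gp3 storage can be compared with the 50 TiB figure from the error message. A minimal sketch follows; the function name gp3_usage_percent is an assumption, and the aws query is left commented out because it needs credentials. It uses JSON output so that the sum is computed over all result pages.

```shell
#!/bin/sh
GP3_LIMIT_TIB=50   # limit quoted in the error message above

gp3_usage_percent() {
  # $1: total provisioned gp3 storage in GiB; prints the used share in percent
  awk -v used="$1" -v limit_tib="$GP3_LIMIT_TIB" \
    'BEGIN { printf "%d\n", used / (limit_tib * 1024) * 100 }'
}

# Total size in GiB of all gp3 volumes in the configured region:
# total_gib=$(aws ec2 describe-volumes --filters Name=volume-type,Values=gp3 \
#   --query 'sum(Volumes[].Size)' --output json)
# gp3_usage_percent "$total_gib"

# Worked example: 25600 GiB provisioned is half of 50 TiB.
gp3_usage_percent 25600
```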
MMCloud setup
MMCloud account management
First, log in as admin:
float login -u <admin username> -p <admin passwd> -a /<op_center_ip_address>
Then create a new user, for example tom:
float user add tom
Setup MMCloud OpCenter
MMCloud OpCenter is analogous to the login node on an HPC.
As of May 2024, OpCenters are created and managed by the MemVerge support team. We no longer need to worry about setting them up ourselves.
Upgrade MMCloud OpCenter
You may be asked to upgrade MMCloud OpCenter from time to time. To do so:
- Check the version that's available:
float release ls
- Upgrade to the latest version:
float release upgrade
- Wait for 1-2 mins.
- Log in again:
float login
- Upgrade the local float binary. You can skip this if you use the latest containers provided by MemVerge, see Appendix II. You may get a permission denied error; if so, please use sudo.
float release sync
- Run the one-time upgrade of the backend DB:
float release migrate --dbPath /mnt/memverge/data/opcenter
- Done!
Configure OpCenter
We can configure the OpCenter in two ways:
- Using the GUI
Using the GUI, admins can change OpCenter settings, for example to expand the allowed instance types or to allow more retries on Spot instances before jobs fail.
- Using CLI configuration commands (recommended). You must be logged in as the admin user. Below are the configurations for the batch (3.82.198.55) and interactive (44.222.241.133) OpCenters.
For the 3.82.198.55 OpCenter:
float config set cloud.createVMPolicy spotFirst
float config set cloud.createVMRetryInterval 5m0s
float config set cloud.createVMRetryLimit "10"
float config set migrate.cpuDisable true
float config set migrate.cpuLowerBoundDuration 10m0s
float config set migrate.cpuLowerBoundRatio 1
float config set migrate.cpuLowerLimit 8
float config set migrate.cpuMigrateStep 10
float config set migrate.cpuUpperBoundDuration 10m0s
float config set migrate.cpuUpperBoundRatio 99
float config set migrate.memDisable false
float config set migrate.memLimit 100
float config set migrate.memLowerBoundDuration 10m0s
float config set migrate.memLowerBoundRatio 5
float config set migrate.memLowerLimit 8
float config set migrate.memMigrateStep 10
float config set migrate.memUpperBoundDuration 1m0s
float config set migrate.memUpperBoundRatio 90
float config set provider.allowList "*"
float config set provider.denyList "t*"
float config set scheduler.jobExecutorLimit 900
float config set cloud.handleRebalanceMemThreshold 128G
float config set cloud.swapFileSize 12G
For the 44.222.241.133 OpCenter:
float config set cloud.createVMPolicy spotFirst
float config set cloud.createVMRetryInterval 5m0s
float config set cloud.createVMRetryLimit "10"
float config set migrate.cpuDisable true
float config set migrate.cpuLowerBoundDuration 10m0s
float config set migrate.cpuLowerBoundRatio 1
float config set migrate.cpuLowerLimit 8
float config set migrate.cpuMigrateStep 10
float config set migrate.cpuUpperBoundDuration 10m0s
float config set migrate.cpuUpperBoundRatio 99
float config set migrate.memDisable false
float config set migrate.memLimit 100
float config set migrate.memLowerBoundDuration 10m0s
float config set migrate.memLowerBoundRatio 1
float config set migrate.memLowerLimit 8
float config set migrate.memMigrateStep 10
float config set migrate.memUpperBoundDuration 10m0s
float config set migrate.memUpperBoundRatio 90
float config set provider.allowList "*"
float config set provider.denyList "t*"
float config set scheduler.jobExecutorLimit 900
float config set cloud.handleRebalanceMemThreshold 128G
float config set cloud.swapFileSize 12G
Some of the lines may result in an error if you use a Mac. For example,
float config set cloud.createVMRetryLimit -1
fails with "Error: unknown shorthand flag: '1' in -1", and
float config set provider.allowList r5*,r6*
fails with "zsh: no matches found: r5*,r6*". Enclosing the value in quotes works:
float config set provider.allowList "r5*,r6*"
Make sure that each line has been configured without any error when you run the float commands above.
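One way to make sure every setting was applied is to feed the key/value pairs through a loop that quotes each value and reports any failing line. This is a sketch under assumptions: the function name apply_float_config and the FLOAT variable are introduced here, and FLOAT defaults to "echo float" so the commands are only printed; set FLOAT=float to apply them for real.

```shell
#!/bin/sh
# Assumption: FLOAT selects the command to run; the default only prints.
FLOAT="${FLOAT:-echo float}"

apply_float_config() {
  # Reads "key value" lines from stdin and returns the number of failures.
  failed=0
  while read -r key value; do
    # Skip blank lines and comments.
    case "$key" in ''|'#'*) continue ;; esac
    # Quoting "$value" stops the shell from expanding globs such as r5*,r6*.
    # </dev/null keeps the command from consuming the remaining input lines.
    if ! $FLOAT config set "$key" "$value" </dev/null; then
      echo "FAILED: $key $value" >&2
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}

# A few of the settings listed above, as an example.
apply_float_config <<'EOF'
cloud.createVMPolicy spotFirst
cloud.createVMRetryInterval 5m0s
provider.allowList *
provider.denyList t*
EOF
```

The return status is zero only when every line succeeded, so the call can be followed by a check such as `|| echo "some settings were not applied"`.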
Additional settings
- Gateway setup
- Security group for port 8888 created in the AWS console
Clean up OpCenter space (recommended to contact MemVerge support team first)
Sometimes we may need to free up some space by deleting older builds on the root volume. The admin of the OpCenter instance should have the ssh phrase to be able to ssh into the OpCenter and clean it up.
- Log in with ssh -i as the ec2-user ID created as an admin. This needs the pem key.
- Check the log files with
du -lh -d 1
and how much space is used on the system with
df -h
- Run
sudo podman image prune -a -f
to clean up some space
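The disk check in the steps above can be turned into a quick test of whether a cleanup is worth doing. A minimal sketch: the function name root_usage_percent and the 80% threshold are assumptions chosen for illustration, and the script only prints the prune command rather than running it.

```shell
#!/bin/sh
root_usage_percent() {
  # With POSIX output (df -P), the fifth column of the second line is the
  # used percentage of the root filesystem, e.g. "42%".
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

THRESHOLD=80   # assumption: an arbitrary cut-off, adjust as needed
used="$(root_usage_percent)"
if [ "$used" -ge "$THRESHOLD" ]; then
  echo "root volume is ${used}% full; consider: sudo podman image prune -a -f"
else
  echo "root volume is ${used}% full; no cleanup needed"
fi
```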
Compute resource quota increase request
Both AWS and the MMCloud OpCenter have a default limit on the number of jobs we can submit at a time. It is possible to raise the AWS limit by contacting AWS customer service and requesting a higher quota for Spot instances (which MMCloud uses). Currently, we have increased our AWS quota to 4000 CPUs per AWS region, and this can be raised further if we request again. Requests are usually approved within 24 hours.
Admins should also change the default maximum number of jobs of an OpCenter using the CLI. See the section on OpCenter configuration via CLI.
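The two limits interact: the number of jobs that can run at once is bounded both by the CPU quota divided by the CPUs each job requests and by the OpCenter job limit. The arithmetic can be sketched as below, assuming that every job requests the same number of CPUs and that scheduler.jobExecutorLimit (900 in the configuration on this page) is the OpCenter setting in question. The function name effective_job_limit is hypothetical.

```shell
#!/bin/sh
effective_job_limit() {
  # $1: CPU quota for the region
  # $2: CPUs requested per job (assumed identical for all jobs)
  # $3: OpCenter job limit (assumed to be scheduler.jobExecutorLimit)
  by_quota=$(( $1 / $2 ))
  if [ "$by_quota" -lt "$3" ]; then
    echo "$by_quota"   # the AWS quota is the binding limit
  else
    echo "$3"          # the OpCenter limit is the binding limit
  fi
}

effective_job_limit 4000 2 900   # quota allows 2000 jobs, so the OpCenter limit binds
effective_job_limit 4000 8 900   # quota allows 500 jobs, so the AWS quota binds
```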
Install OEM packages
Admins can install packages by running the command below:
bash mm_jobman.sh --shared-admin ...
using the --shared-admin mode instead of the --oem-packages and --mount-packages modes. Other parameters are the same as for running an interactive session.
This will start a tmate session, where the admin can paste the tmate URL into the browser and run bash commands to update packages. IMPORTANT NOTE: Any changes made to the packages will affect the packages used by ALL users of the batch setup. To install a package, run the command
pixi global install <PACKAGE>
To install R packages
pixi global install --environment r-base <PACKAGE>
To install python packages
pixi global install --environment python <PACKAGE>
To update installed packages, specify the version you intend to update to:
pixi global install --environment r-base <PACKAGE>=<VERSION>
pixi global install --environment python <PACKAGE>=<VERSION>