Data transfer: HPC, AWS and synapse.org

Browse data on AWS buckets

To access AWS S3 through the web browser,

  • Open your web browser and navigate to the AWS Management Console.

  • Sign in to the AWS Management Console using your AWS account credentials:
    • Enter your 12-digit AWS Account ID (you can find this in your account details).
    • Enter your IAM user name (you can find this in your account details).
    • Enter your password (if you don’t have a password, please contact your AWS administrator to create one for you).
  • In the AWS Management Console, locate the search bar and type “S3,” then select “S3” from the search results.

  • In the Amazon S3 dashboard, click on the S3 bucket named “statfungen” to access your data.

  • You can now easily interact with your datasets within the selected S3 bucket.
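
If you prefer the command line and have the AWS CLI configured (see the next section), you can also browse the same bucket from a terminal; a minimal sketch using the statfungen bucket named above:

aws s3 ls s3://statfungen/ --human-readable | head   # list the first few objects at the top level of the bucket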

Data transfer from HPC (or any local computer) to AWS

This section is only relevant to data admins in charge of transferring data between a local computing environment and AWS buckets. If you are a user analyzing data that already exists on an AWS bucket, you can skip this.

First time: configure your pre-existing account on the project AWS bucket

This assumes that your admin has already created a user account for you, with an Access Key ID and a Secret Access Key for your access to the project AWS bucket. To configure your account from the command terminal, run

aws configure

You will be prompted to provide the following information:

AWS Access Key ID [None]:
AWS Secret Access Key [None]: 
Default region name [None]: us-east-1
Default output format [None]: yaml

The first two pieces of information should be provided by your admin.
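
To check that the credentials were picked up correctly, you can run the standard AWS CLI commands below (a quick sanity check, not part of the required setup):

aws configure list              # show which access key, region and output format are in effect
aws sts get-caller-identity     # confirm that AWS accepts these credentials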

Upload data to a pre-existing AWS bucket

This assumes that your admin has already created a storage bucket on AWS and that you can access it. In this documentation the pre-existing bucket is called cumc-gao. You should have received your username from your admin.

To copy a file, e.g. $DATA_FILE, to the bucket,

aws s3 cp $DATA_FILE s3://$S3_BUCKET/ 

Example:

touch test_wiki.txt
aws s3 cp test_wiki.txt s3://cumc-gao/

To copy a folder, e.g. $DATA_DIR, to the bucket (note that directories require the --recursive flag),

aws s3 cp $DATA_DIR s3://$S3_BUCKET/ --recursive

Example:

mkdir test_wiki
mv test_wiki.txt test_wiki
aws s3 cp test_wiki s3://cumc-gao/ --recursive
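
To confirm that the folder arrived, you can list the new prefix (same example bucket as above):

aws s3 ls s3://cumc-gao/test_wiki/ --human-readable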

Once you have finished uploading files to the AWS bucket, you are ready to run your analysis through MemVerge.

Download data from a pre-existing AWS bucket

After your analysis is done, you can retrieve the results stored on S3 to your local machine simply by reversing the upload command discussed in the previous section. For example,

aws s3 cp s3://$S3_BUCKET/$DATA_DIR/output output --recursive
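
If you need to pull the same results repeatedly, aws s3 sync is a convenient alternative to cp --recursive because it only copies files that are new or have changed; a sketch with the same placeholders:

aws s3 sync s3://$S3_BUCKET/$DATA_DIR/output output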

More examples

Check files generated:

aws s3 ls --human-readable s3://$S3_BUCKET/$result_folder/ | head

Copy specific types of files (avoid this when possible; it is usually better to inspect results directly on AWS), e.g. the stderr files below:

aws s3 cp s3://$S3_BUCKET/$result_folder/ ./ --recursive --exclude "*" --include "*.stderr"

The --exclude/--include combination above downloads only the .stderr files.

Data transfer between AWS buckets

To copy a data directory between different buckets,

aws s3 cp s3://source-bucket-name/source-folder s3://destination-bucket-name/destination-folder --recursive

To copy a data directory within the same bucket,

aws s3 cp s3://my-bucket/source-folder s3://my-bucket/destination-folder --recursive
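
The sync behavior mentioned earlier also works between buckets, if you only want to copy objects that are new or have changed at the destination (sketch):

aws s3 sync s3://source-bucket-name/source-folder s3://destination-bucket-name/destination-folder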

Remove data from AWS buckets

Warning: think twice before you run it!

FIXME: on our end we should set it up such that a user can only remove files they created, not files created by other people

aws s3 rm s3://$S3_BUCKET/$DATA_DIR/cache --recursive
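
To preview exactly which objects would be deleted without removing anything, you can first add the --dryrun flag:

aws s3 rm s3://$S3_BUCKET/$DATA_DIR/cache --recursive --dryrun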

Data transfer between HPC and synapse.org

The assumption is that the user has access to Synapse. Synapse provides APIs to store or access your data from the web or programmatically via one of their supported analytical clients (command line, Python or R). We recommend transferring data programmatically.

To make analysis outputs and results accessible to collaborators, we may upload some of them to Synapse.

On your HPC, log in with your Synapse credentials. Follow the instructions on the Synapse website to create an auth token and prepare a .synapseConfig file in your home folder.

###########################
# Login Credentials       #
###########################

## Used for logging in to Synapse. See https://python-docs.synapse.org/tutorials/authentication/
## for information on retrieving an auth token.
[authentication]
username = xxx
authtoken = xxx

The authtoken is not the password you log in with; it should be created following the instructions above.
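
Since .synapseConfig contains your auth token, it is a good idea to make the file readable only by you (a general precaution, not a Synapse requirement):

chmod 600 ~/.synapseConfig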

Upload data

synapse add --parentid synxxxxxxxx <file>
  • --parentid is the Synapse ID of the folder that will store the file

For example, to upload all .rds files in the current directory:

for i in *.rds; do synapse add --parentid syn54079330 $i; done

Or in parallel

cat sample_data | xargs -I {} -P 10 synapse add --parentid synxxxxxxxx {}
  • sample_data is a file listing all the files that need to be uploaded
  • xargs turns each line of that list into an argument of synapse add
  • -I {} defines the placeholder that is replaced by each file name
  • -P sets the number of parallel uploads
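
One simple way to prepare the sample_data list is to write the names of the files to upload into it beforehand, e.g. (assuming the outputs are .rds files in the current directory):

ls *.rds > sample_data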

Download data to HPC

synapse login
synapse get -r SynID
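
synapse get also accepts a --downloadLocation option if you want the files placed somewhere other than the current directory, e.g.:

synapse get -r SynID --downloadLocation ./results/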

Data transfer between AWS and Synapse

Currently, there are a few methods available for transferring data from an S3 bucket to Synapse:

First, download the data to a local machine, then upload it to Synapse using the Command Line Interface (CLI). Alternatively, launch a Jupyter notebook via Opcenter, employing either a bash notebook or a terminal kernel for the upload process. However, this approach has its drawbacks: the terminal kernel tends to be slow and prone to freezing, and utilizing a bash notebook for large tasks (such as uploading over 30,000 files) could lead to crashes due to interactive messages during the upload process.
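
For the first method, here is a minimal sketch that chains the commands already introduced above (bucket name, result folder, and Synapse ID are placeholders):

# 1. pull the results from S3 to the HPC or local machine
aws s3 cp s3://$S3_BUCKET/$result_folder/ ./results/ --recursive
# 2. push them to a Synapse folder, 10 uploads in parallel
find ./results/ -type f | xargs -I {} -P 10 synapse add --parentid synxxxxxxxx {}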

There is also the possibility of directly linking the S3 bucket with Synapse by granting it ‘ListObject’ permissions, as outlined in the Synapse documentation.

FIXME: we are still working on a more efficient way of uploading data from an AWS S3 bucket to Synapse