Quirky Queries & Awesome Answers

A conscientious bioinformatician from Isabelle’s group started this list to summarize the computational questions they posed to various people on the PH 19 West Wing, together with the answers they received. The list was later circulated among other new trainees as “an extremely helpful document”. Here I repost some of it on the lab wiki to encourage others to expand it (by sending a pull request to this document).

Questions Posed To Hao

What if Jupyter kernels keep dying?

This has happened to us several times, and the solution on this ticket helps.

Questions Posed To Bale

How do I check the status of a jupyter_columbia.sh job I have submitted?

Use qstat

How do I remove a job from the queue if I am not using it anymore?

First, find the job number by running the following:

qstat

Then when you know the job number:

qdel my_job_number

I ran the jupyter_columbia.sh file and for some reason it doesn’t seem to work. What should I do?

Use qdel to remove any jobs you no longer need, and then re-submit the jupyter_columbia.sh file via qsub.

Questions Posed To Francesca

How do I open R on the HPC?

If you don’t know which versions of R are available, type the following:

module spider R

Then you need to load the version of R you prefer:

module load R/Version_You_Prefer

To start R from the Linux command line, simply type R:

R

You can also set up the JupyterLab server via jupyter_columbia.sh and start a notebook with R kernel.

What do you type to quit out of R in bash?

q()

When installing R packages, do I say yes when asked to add them to a personal library?

Yes. You need to add new packages you install to a personal library.

I have trouble debugging. Other than asking you, is there something I can use for debugging?

ChatGPT: https://openai.com/blog/chatgpt/

Questions Posed To Isabelle

How do I view the HPC documents/folders with a graphical interface?

Open Finder on Mac. Then click Go on the top bar and then “Connect to Server”. Enter smb://prometheus.neuro.columbia.edu/csg and then type in your UNI and password. This should allow you to browse the server graphically. This tip was also mentioned by Thashi. If this does not work, ask Jose to troubleshoot.

Questions Posed To Tian Yi (Amanda)

I messed up my conda environment. What should I do?

Delete your miniconda files using the following command (modify the path if necessary):

rm -r miniconda3

Then follow the instructions on this page to reinstall: https://wanggroup.org/orientation/jupyter-setup.html

Questions Posed To Rui (pronounced Rae) Dong

How to monitor memory usage for computing tasks?

This works on both local computers and on our (or any) clusters.

First, find the monitor.py file (that Gao wrote in 2017) in the folder /misc/monitor/ in this lab-wiki. Under the same folder, you will see a readme file that begins with this brief introduction: “It is a Python wrapper to system command calls that recursively checks memory usage at specified intervals (default every 1 second) and report the peak usage at the end of execution”.

Second, download/save this file to your local computer or the cluster (remember the path).

Third, read the readme file in misc/monitor/ again, and test on your local computer or the cluster by running:

path_to_this_monitor.py_file/monitor.py R --slave -e "x = runif(1E5)"

If everything is correct, you will see something like:

time elapsed: 4.71s
peak first occurred: 0.00s
peak last occurred: 4.18s
max vms_memory: 0.10GB
max rss_memory: 0.05GB
memory check interval: 1s
return code: 0

Finally, if you want to test the memory usage of a program, for example in R, simply put monitor.py at the beginning of the command:

path_to_this_monitor.py_file/monitor.py Rscript xxx.R

It also works in SoS:

path_to_this_monitor.py_file/monitor.py sos run xxxx

Note: monitor.py itself recursively checks memory usage while your original code runs, so it is only recommended to run it once to measure the memory usage and then adjust the memory you request. When you submit the full set of computing jobs to the cluster/computer (after estimating memory usage with monitor.py), you should no longer include monitor.py in front of your command. Otherwise it is a waste of computing resources and time.
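For a rough cross-check of peak resident memory without monitor.py, a minimal standard-library sketch is shown below. It is not how monitor.py works. Instead of polling at intervals, it asks the operating system for the peak RSS of finished child processes via resource.getrusage, so it only runs on Unix-like systems and reports a single number after the command exits. The function name run_and_report is hypothetical.

```python
import resource
import subprocess
import sys
import time


def run_and_report(cmd):
    """Run cmd (a list of strings); return elapsed time, peak RSS and return code."""
    start = time.time()
    proc = subprocess.run(cmd)
    elapsed = time.time() - start
    # Peak resident set size over all child processes waited for so far.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return {
        "elapsed_s": elapsed,
        "peak_rss_gb": peak_bytes / 1024**3,
        "return_code": proc.returncode,
    }


if __name__ == "__main__":
    # Example: a child Python process that requests a 50 MB buffer.
    print(run_and_report([sys.executable, "-c", "x = bytearray(50_000_000)"]))
```

Because RUSAGE_CHILDREN tracks the maximum over every child the calling process has waited for, it is safest to measure one command per Python session.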

I do not want to type in the csg login and password every time I log in. Is there a way around this?

Yes! First do the following:

cd ~

Then,

mkdir -p .ssh
chmod 700 .ssh
cd .ssh
touch config
chmod 600 config

Once that is done:

nano ~/.ssh/config

Add the following to the file:

Host Name_You_Wish_To_Use_To_Refer_To_This
    HostName csglogin.neuro.columbia.edu
    User MY_UNI

Make sure to replace the Name_You_Wish_To_Use_To_Refer_To_This with something representative of the login for you.

Once that is done make sure to save your file.
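The directory, permission, and config steps above can also be scripted. The sketch below is one way to do it in Python, assuming you prefer not to edit the file by hand. The function name add_host_entry is hypothetical, and the duplicate check is deliberately simple (it only looks for an identical "Host alias" line).

```python
import os
from pathlib import Path


def add_host_entry(config_path, alias, hostname, user):
    """Append a Host block to an ssh config file unless the alias already exists."""
    config_path = Path(config_path)
    # Equivalent of: mkdir -p .ssh && chmod 700 .ssh
    config_path.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(config_path.parent, 0o700)

    existing = config_path.read_text() if config_path.exists() else ""
    if any(line.split() == ["Host", alias] for line in existing.splitlines()):
        return False  # alias already configured, leave the file untouched

    block = f"Host {alias}\n    HostName {hostname}\n    User {user}\n"
    with open(config_path, "a") as fh:
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(block)
    # Equivalent of: chmod 600 config
    os.chmod(config_path, 0o600)
    return True


if __name__ == "__main__":
    # "csg" is a placeholder alias; replace MY_UNI with your own UNI.
    add_host_entry(Path.home() / ".ssh" / "config",
                   "csg", "csglogin.neuro.columbia.edu", "MY_UNI")
```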

Then type in:

ssh Name_You_Wish_To_Use_To_Refer_To_This

You may need to use your password this time.

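  • You will need to scp your public key (id_rsa.pub, found in ~/.ssh on your local computer; generate one with ssh-keygen if it does not exist) to your home folder on the cluster.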

Once that is done log in to the HPC again and then do the following in your home folder:

mkdir -p .ssh
cd .ssh
nano authorized_keys

Add one space to the authorized_keys file and then save it and close.

cd ..
cat id_rsa.pub >> ~/.ssh/authorized_keys

This should allow you to log in to the HPC without a password using:

ssh Name_You_Wish_To_Use_To_Refer_To_This
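The key-installation step on the cluster can be sketched in Python as well. This is an illustration under assumptions, not part of the original instructions. The function name authorize_key is hypothetical, and the permissions it sets (700 on the directory, 600 on the file) are the ones sshd commonly expects. It also avoids appending the same key twice.

```python
import os
from pathlib import Path


def authorize_key(pub_key_path, ssh_dir):
    """Append a public key to ssh_dir/authorized_keys unless it is already listed."""
    ssh_dir = Path(ssh_dir)
    ssh_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(ssh_dir, 0o700)

    key = Path(pub_key_path).read_text().strip()
    auth_file = ssh_dir / "authorized_keys"
    text = auth_file.read_text() if auth_file.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False  # key already authorized

    with open(auth_file, "a") as fh:
        # Make sure the new key starts on its own line.
        if text and not text.endswith("\n"):
            fh.write("\n")
        fh.write(key + "\n")
    os.chmod(auth_file, 0o600)
    return True


if __name__ == "__main__":
    # Assumes id_rsa.pub was copied to your home folder, as described above.
    authorize_key(Path.home() / "id_rsa.pub", Path.home() / ".ssh")
```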

How can I automatically clean up my jupyter-notebook-####### files?

Open your bashrc

nano ~/.bashrc

Then add this to your bashrc:

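# Clean up leftover jupyter-notebook files each time a new shell starts
# (~/.bashrc runs at shell start-up; put these lines in ~/.bash_logout to clean up at logout instead)

rm -f jupyter-notebook-*.login_info
rm -f jupyter-notebook-*.out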

Run source ~/.bashrc if you want the change to take effect in your current session.
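If you would rather preview what will be deleted before removing anything, the same cleanup can be sketched in Python. The function name clean_jupyter_files and its dry_run parameter are hypothetical. The file patterns are the two used in the rm commands above.

```python
from pathlib import Path

# Same patterns as the rm commands above.
PATTERNS = ("jupyter-notebook-*.login_info", "jupyter-notebook-*.out")


def clean_jupyter_files(directory, dry_run=True):
    """Return the matching file names; delete them only when dry_run is False."""
    matched = []
    for pattern in PATTERNS:
        for path in sorted(Path(directory).glob(pattern)):
            if path.is_file():
                if not dry_run:
                    path.unlink()
                matched.append(path.name)
    return matched


if __name__ == "__main__":
    # Preview first, then call again with dry_run=False to actually delete.
    print(clean_jupyter_files(Path.home()))
```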

Questions Posed To Yasmin

I do not know how to use Docker. What are some resources I can look at to learn how to use Docker with R?

I found the two links below useful:

  • https://www.r-bloggers.com/2019/02/running-your-r-script-in-docker/

  • https://colinfay.me/docker-r-reproducibility/

What is a command I can use to get into the bashrc file?

nano ~/.bashrc 

I am having difficulty remembering all the commands. What should I do?

Make a cheat sheet that summarizes all the commands you feel you need to know given the projects you are doing. Here are some frequently used commands: https://wanggroup.org/computing_tutorial/shell-must-know

Is there a command I can use to cd into my home folder?

cd ~

Is there a summary sheet of the git commands I need when backing up my work to GitHub?

Yes, here is the link: https://jdsalaro.com/files/0001/gittimeworkflow.png

I cannot log in to my CUMC email

Ask CUMC IT for help.

Questions Posed To Gao

I have the Columbia University Duo but when I try to login to the VPN/HPC it doesn’t work. What should I do?

There are multiple Duos. You may need the CUMC Duo in addition to the CU Duo.

What programming languages do you use and what language is preferred?

Short answer: R (45%), Python (45%), and others (10%) mainly including C++ (typically with R through Rcpp), BASH and SQLite. On top of these, we use DSC to perform simulation studies and SoS to write analysis pipelines.

Long answer: the choice of language typically depends on the nature of the project and what has already been done for the project. More often than not, a project involving both numerical simulation studies and real-world data analysis might require using a combination of programming languages for different tasks. That is where DSC and SoS can be very useful in “gluing” pieces of code in different languages together.

If an old project already has a code base in some language and your new project is an extension of the old one, then at least at the beginning it is often more beneficial to reuse the existing code, thus sticking to whichever language the old project is written in. By the same token, R might be more suitable for programming statistical methods that use R packages written by others to accomplish some functionality in your method. Python might be great for machine learning tasks because of the well-established scikit-learn and tensorflow libraries.

In my view no one language is preferred over another, although a researcher might be more comfortable writing in one language than another. Generally speaking, R is a great language for prototyping statistical model implementations, making simple plots, and analyzing small-scale data sets. Python is great for machine learning, data wrangling and munging, bioinformatics (many bioinformatics libraries are written in Python), making complicated plots (with matplotlib), and high performance computing (e.g., leveraging GPUs and HDFS). C++ nowadays is most suitable for optimizing pieces of code already prototyped in R or Python, and it is easiest to write them as R extensions using the Rcpp package: since we often implement new statistical methods in R, it is great to speed them up with C++ extensions. A handful of BASH commands such as sed and awk are the most useful for quick data wrangling tasks and can be a lot easier to write than R or Python code when applied to the problems they fit.

This is why we promote the idea of breaking up your project into smaller problems, writing “modules” for each problem, and gluing them together using tools such as DSC and SoS pipelines. In my view, putting together casually written scripts in different languages into rigorous pipelines works much better than writing a sophisticated software program in one language.

I want to add a caution about the R language: R has some annoying pitfalls that do not trigger an error message when you fall into them, but rather fall back to some default behavior, which might lead to seemingly very strange errors down the road. Hidden errors are always frustrating. For a quick example, in R the behavior below can be confusing (you can try print(a$k) to figure out why):

> a = list(); if (a$k) print(9)
Error in if (a$k) print(9) : argument is of length zero

whereas in Python the error behavior and message are much clearer:

>>> a = dict()
>>> if (a['k']): print(9)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'k'
>>>
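The contrast can be reproduced within Python itself, as a small illustration. In R, a$k on a list without an element k quietly returns NULL, and the error only surfaces later inside if(). Python's dict.get behaves in a similarly silent way, while square-bracket indexing fails right at the lookup. The variable names below are only for illustration.

```python
a = dict()

# Silent fallback, similar in spirit to a$k returning NULL in R:
value = a.get("k")      # no error is raised; value is None
if value:
    print(9)            # never reached, and nothing tells you the key was missing

# Explicit lookup fails exactly where the mistake is:
try:
    a["k"]
except KeyError as err:
    message = f"missing key: {err}"
    print(message)
```

Preferring a["k"] over a.get("k") when a key is required keeps the failure close to its cause.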

I also want to point out that in bioinformatics it almost always works better to use a specialized tool to process a specific type of data than to write your own scripts. For example, vcftools (BASH) or cyvcf2 (Python) should be used to work with the VCF genotype file format, rather than parsing files line by line. bedtools should be used to query genomic regions rather than generic database tools such as SQLite. Specialized tools are less prone to coding errors, and typically run faster than your own code.

To summarize, I believe in the statement that “a good programmer writes good code, but a great programmer reuses good code”. When you start a project, please try to think ahead to the various stages of the project, and look for existing resources that you might be able to reuse for each stage. Then take these factors into consideration before choosing a language. But don’t get too fixated on that decision, for two reasons. First, you can always prototype in whatever language you like, then convert the code to other languages if necessary (for performance, for compatibility with other packages, etc.). In my opinion it is a lot easier to translate between languages than to write new code. Second, with DSC and SoS you can always switch between languages for different tasks, as long as you “modularize” your project well enough! DSC and SoS are tools developed in house and thus might lack the extensive documentation and community support that other languages have. But you can ask for help from others in the group who have more experience with these tools.

How about IDE choice: text editor, JupyterLab or Rstudio?

I personally use vscode with vim key bindings. Before finally settling on vscode, I used vim, emacs, atom and sublime. I use Jupyter for interactive analysis and bioinformatics pipeline development.

Rstudio: if you use R for 90% of your research then Rstudio is perhaps the way to go.

Jupyter: if you work with many languages then JupyterLab is perhaps better because with multiple kernels you can develop code for different languages in a similar environment. With the SoS kernel in JupyterLab, you can even switch between languages and transfer data among them in the same notebook. More importantly, we promote the use of pipelines in data analysis. Starting with interactive code in Jupyter and later modifying it in place into pipelines written in the SoS language is very straightforward. In fact, such in-place modification is one of the major design features of SoS. In this context it makes much more sense to use JupyterLab as your IDE.

Rstudio + Jupyter: you can write interactive analysis codes for data exploration in R, and write SoS pipelines in Jupyter notebooks. If you develop R codes interactively and later want to make them pipelines, then you’ll have to copy codes from Rmarkdown and paste them to another Jupyter file. This is extra work and is error prone — just be warned!

Although I recommend SoS Notebook and SoS Workflow as the primary tools for daily computational research, I acknowledge there are limitations to using IPython notebooks in general for interactive analysis (cf. this presentation). However, most of these issues can be avoided if you recognize them and develop good habits in using notebooks. Additionally, these limitations do not apply when you use notebooks to develop SoS Workflows, and I always prefer writing small workflows over interactive notebooks. As you have hopefully learned from the above tasks, it is almost trivial to turn an interactive SoS notebook into an SoS workflow.

I use obsidian with vim key binding, and push all my notes to github — because I like Markdown files and github!