Frequently Asked Questions

Various frequently asked computing questions and tips, some beyond HPC

Jupyter Kernels Keep Dying?

This usually indicates memory issues. Try requesting more memory:

sbatch --constraint="cpu4mem32a" jupyter_cumc_cloud.sh

If the problem persists, see this Jupyter issue for additional solutions.

How Can I Clean Up Old Jupyter Log Files?

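Add this to your ~/.bashrc:

# Clean up jupyter-notebook files each time a new shell starts
rm -f jupyter-notebook-*.login_info jupyter-notebook-*.out 2>/dev/null

Then run source ~/.bashrc. Note that ~/.bashrc runs when a shell starts, not at logout, and the file patterns are relative to the current directory (normally your home directory at login). To clean up at logout instead, put the same line in ~/.bash_logout.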
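The command above removes every matching file, including the login_info file of a Jupyter job that is still running. If that is a concern, an age-based cleanup is one alternative. The sketch below is not part of the lab's setup: the function name clean_jupyter_logs and the 7-day default are hypothetical, and it assumes Bash with a find that supports -maxdepth and -delete (GNU find does).

```shell
# Hypothetical helper: remove Jupyter log files older than a given number of days.
# Usage: clean_jupyter_logs [directory] [days]
clean_jupyter_logs() {
    local dir="${1:-$HOME}"
    local days="${2:-7}"
    # -maxdepth 1 keeps the search out of subdirectories.
    # -mtime +N matches files whose age, counted in whole days, exceeds N.
    find "$dir" -maxdepth 1 -type f \
        \( -name 'jupyter-notebook-*.login_info' -o -name 'jupyter-notebook-*.out' \) \
        -mtime +"$days" -delete 2>/dev/null
}
```

Defining the function in ~/.bashrc and calling clean_jupyter_logs there keeps recent files, so a session started within the last week is left alone.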

I Have CUMC Duo But VPN/HPC Login Doesn’t Work?

Columbia University has two separate Duo accounts: the general CU Duo and the CUMC Duo. You need the CUMC Duo in addition to the general CU Duo. Contact CUMC IT if you need to set up the CUMC Duo.

How Do I Set Up Passwordless SSH Login?

First, set up your SSH config on your local machine:

cd ~
mkdir -p .ssh
chmod 700 .ssh
cd .ssh
touch config
chmod 600 config

Edit the config file:

nano ~/.ssh/config

Add an entry for the HPC:

Host hpc
    HostName <hpc_ip>
    User <username>

Replace hpc with any name you prefer, <hpc_ip> with the login node IP, and <username> with your UNI.
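Editing the file by hand is all that is needed. If you script your machine setup, the same entry can be appended so that running the script twice does not duplicate it. A minimal sketch, assuming Bash and that the directory holding the config file already exists; the function name add_ssh_host is hypothetical, and the values in the usage comment are the same placeholders as above.

```shell
# Hypothetical helper: append a Host entry to an SSH config file unless
# an entry with the same name is already there.
# Usage: add_ssh_host <name> <hostname> <user> [config_file]
add_ssh_host() {
    local name="$1" hostname="$2" user="$3"
    local config="${4:-$HOME/.ssh/config}"

    touch "$config"
    chmod 600 "$config"

    # Look for the name among the patterns of any existing "Host" line.
    if awk -v a="$name" '
            $1 == "Host" { for (i = 2; i <= NF; i++) if ($i == a) found = 1 }
            END { exit !found }' "$config"; then
        echo "Host '$name' already present in $config; nothing added."
        return 0
    fi

    # Separate the new entry from earlier content with a blank line.
    if [ -s "$config" ]; then
        printf '\n' >> "$config"
    fi
    printf 'Host %s\n    HostName %s\n    User %s\n' "$name" "$hostname" "$user" >> "$config"
}

# Usage (replace the placeholders with your own values):
# add_ssh_host hpc <hpc_ip> <username>
```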

Now you can connect with:

ssh hpc

You may need to enter your password this time.

Next, set up key-based authentication. On your local machine, generate a key if you don’t have one:

ssh-keygen -t rsa

Copy your public key to the HPC:

scp ~/.ssh/id_rsa.pub hpc:~/

Log in to the HPC and add the key to authorized_keys:

ssh hpc
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
rm ~/id_rsa.pub

Now you can log in without a password:

ssh hpc
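Where it is installed, ssh-copy-id -i ~/.ssh/id_rsa.pub hpc performs the copy-and-append steps in one command. If you follow the manual steps instead, note that repeating them appends the same key again. The sketch below shows one way to make the server-side step safe to repeat; the function name install_pubkey is hypothetical, and the code assumes Bash on the HPC login node.

```shell
# Hypothetical helper: add a public key to authorized_keys only if that
# exact line is not already present, using the same restrictive
# permissions as the manual steps.
# Usage: install_pubkey <public_key_file> [ssh_dir]
install_pubkey() {
    local pubkey_file="$1"
    local ssh_dir="${2:-$HOME/.ssh}"
    local auth_keys="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_keys"
    chmod 600 "$auth_keys"

    # -F: fixed string, -x: whole-line match, -f: take the pattern from the key file.
    if ! grep -qxF -f "$pubkey_file" "$auth_keys"; then
        cat "$pubkey_file" >> "$auth_keys"
    fi
}

# On the HPC, after copying the key over with scp:
# install_pubkey ~/id_rsa.pub && rm ~/id_rsa.pub
```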

IDE Choice: Text Editor, JupyterLab, or RStudio?

RStudio: If you use R for most of your research (say 90%), RStudio is probably the way to go. See RStudio Server on HPC.

JupyterLab: If you work with multiple languages, JupyterLab is probably the better choice, because its multiple kernels let you develop code in different languages within one environment. With the SoS kernel in JupyterLab, you can even switch between languages and transfer data among them in the same notebook. See JupyterLab on HPC.

VS Code: Recommended for viewing and editing text files on the HPC. It connects to the cluster using the Remote-SSH extension. See VS Code on HPC.

You can also combine approaches: write interactive analysis code for data exploration in R or RStudio, and write SoS pipelines in Jupyter notebooks.

Our lab recommends SoS Notebook and SoS Workflow as primary tools for daily computational research. However, this workflow language might not be the most popular nowadays, and there are known limitations to using IPython notebooks for interactive analysis. Still, these limitations do not apply when using notebooks to develop SoS Workflows, and most issues can be avoided by developing good computing habits suited to your project's needs.

What Programming Languages Do We Use for Computational Biology Projects?

Short answer: in our lab, roughly R (45%), Python (45%), and others (10%), including C++ (typically with R through Rcpp), BASH, and SQLite. On top of these, we use DSC for simulation studies and SoS for analysis pipelines.

The choice of language typically depends on the nature of the project and what has already been done. A project involving both numerical simulation studies and real-world data analysis might require a combination of languages for different tasks. DSC and SoS can be very useful in “gluing” pieces of code in different languages together.

Generally speaking:

  • R is great for prototyping statistical methods, making simple plots, and analyzing small-scale datasets
  • Python is great for machine learning, data wrangling, bioinformatics (many libraries available), making complex plots (matplotlib), and high-performance computing (GPU, HDFS)
  • C++ is most suitable for optimizing code already prototyped in R or Python, typically as R extensions using Rcpp
  • BASH commands (sed, awk, etc.) are useful for quick data wrangling and can be easier than R or Python for certain tasks
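
As an illustration of the last point, here is a small sketch of the kind of quick wrangling that awk and sed handle in one line each. The table is made up for the example (the column names and values do not come from any real dataset), and the 0.05 threshold is arbitrary.

```shell
# Made-up tab-separated table, purely for illustration.
toy=$(mktemp)
printf 'chrom\tpos\tpvalue\nchr1\t1000\t0.20\nchr2\t2500\t0.01\nchr2\t4000\t0.03\n' > "$toy"

# awk: keep the header and the rows whose third column is below 0.05.
significant=$(awk -F'\t' 'NR == 1 || $3 < 0.05' "$toy")

# awk: count data rows per chromosome (skipping the header), sorted by name.
counts=$(awk -F'\t' 'NR > 1 { n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$toy" | sort)

# sed: strip the "chr" prefix on every line except the header (line 1).
stripped=$(sed '1!s/^chr//' "$toy")

echo "$significant"
echo "$counts"
echo "$stripped"
rm -f "$toy"
```

Each of these would take several lines in R or Python once file reading and writing are included, which is where the shell tools tend to pay off.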

In bioinformatics, it almost always works better to use specialized tools for specific data types than to write your own scripts. For example, use vcftools or cyvcf2 for VCF files rather than parsing them line by line. Use bedtools for genomic region queries rather than generic database tools.

To summarize: “A good programmer writes good code, but a great programmer reuses good code.” When starting a project, think ahead to various stages and look for existing resources you can reuse. With DSC and SoS you can switch between languages for different tasks, as long as you modularize your project well.

Obsidian works well for old-school markdown-based notes that can be synced across devices (at a small cost). Notion is a popular choice among many lab members and has good AI features (at a modest cost).