Daily research practice

This document describes reproducible research practice in the lab. A presentation based on an earlier (now outdated) version of this document is available here.

I initially started this document exclusively for master's and junior PhD students, but it turns out that the entire lab can benefit a lot from working under a unified style that facilitates collaboration between projects.

Why and how to organize your computational research

Please read this paper and this paper for the motivation behind, and general guidelines for, keeping your computational work organized.

Suggested organization of a git repository for a project

We use git repositories hosted on github.com to organize our research (if you are not familiar with git, please check out this 5-minute git tutorial). We recommend the following organization for a project repository:

- analysis # interactive analysis notebooks
- code
   - dsc # statistical benchmarking (using the DSC language, https://stephenslab.github.io/dsc-wiki)
   - python # Python utility scripts
   - R # R utility scripts
- workflows # SoS pipeline notebooks (using the SoS language, https://vatlab.github.io/sos-docs)
- data # meta-data for data on cluster or cloud, and a few essential and small (<5MB) data-sets
- output # output folder should never be uploaded to github
- writeup # To document what you have learned, the list of papers you plan to read, links to DOCX, PPTX and PDF files on Cloud, etc
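
The layout above can be created with a short script. The following is a minimal sketch in Python, not an official lab tool: the function names `scaffold` and `oversized_data_files`, the `.gitkeep` placeholder files (a common convention, since git does not track empty folders), and the reading of 5MB as 5 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Folder layout from the list above; the short descriptions are paraphrased.
LAYOUT = {
    "analysis": "interactive analysis notebooks",
    "code/dsc": "statistical benchmarking (DSC)",
    "code/python": "Python utility scripts",
    "code/R": "R utility scripts",
    "workflows": "SoS pipeline notebooks",
    "data": "meta-data and small (<5MB) data-sets",
    "output": "generated results, never uploaded to github",
    "writeup": "notes, reading lists, links to documents",
}

# Assumption: "5MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_DATA_BYTES = 5 * 1024 * 1024


def scaffold(root):
    """Create the suggested folders under `root` and make git ignore output/."""
    root = Path(root)
    for folder in LAYOUT:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        if folder != "output":
            # Placeholder so that the otherwise empty folder can be committed.
            (path / ".gitkeep").touch()
    gitignore = root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if "output/" not in existing:
        gitignore.write_text("\n".join(existing + ["output/"]) + "\n")
    return root


def oversized_data_files(root, limit=MAX_DATA_BYTES):
    """Return files under data/ that are larger than `limit` bytes."""
    data_dir = Path(root) / "data"
    return sorted(
        p for p in data_dir.rglob("*") if p.is_file() and p.stat().st_size > limit
    )
```

Calling `scaffold("my-project")` (a hypothetical folder name) from the directory that should contain the repository creates the folders, and `oversized_data_files("my-project")` can be run before a commit to spot data files that belong on the cluster or cloud rather than on github.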

Suggested format of lab notebooks

Regardless of your focus (methods development or applied data analysis), all computational experiments in your daily research must be documented, well organized, version controlled (using git), and always ready to be communicated to others.

Most people in the lab use either IPython Notebook (also known as Jupyter Notebook) or Rmarkdown, or both. We highly recommend Jupyter Notebooks over Rmarkdown (for reasons to be elaborated later) unless you have a strong preference for the latter.

In a lab notebook you should document your research clearly, communicating your results along with the code that generated them in a single self-contained document for a specific task. In particular, you should write notes in Markdown chunks between code chunks to explain what the code does and why. Code comments are different from such scientific narrative and cannot replace it.

We suggest the following organization of a lab notebook:

  1. Title, and in the same notebook cell a brief one sentence summary of the goal to achieve with the notebook.
  2. Motivation or Aims: break the problem under investigation down into actionable items.
  3. Methods overview: a high-level description of methods used to solve the problem.
  4. Main conclusions (not applicable to a pure workflow notebook): take-home messages from your investigations.
  5. Data input and output (if applicable): describe data used and generated from the notebook.
  6. Key parameters (if applicable): explain model or algorithm parameters that are crucial to the analysis (no need to explain all parameters).
  7. The rest of the notebook: multiple sections of detailed steps, with interactive code / workflows and narrative, as well as diagnostic summary statistics, plots and tables at each step.
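
The organization above can be turned into a reusable empty notebook. The sketch below writes a skeleton `.ipynb` file using only the Python standard library. It assumes the Jupyter notebook JSON format, version 4.4; the function names, the hint text in each cell, and the "Step 1" placeholder section are hypothetical and only illustrate the idea.

```python
import json
from pathlib import Path

# Section headings taken from items 2 to 6 of the list above; hints are paraphrased.
SECTIONS = [
    ("Motivation or Aims", "Break the problem down into actionable items."),
    ("Methods overview", "High-level description of the methods used."),
    ("Main conclusions", "Take-home messages (omit for a pure workflow notebook)."),
    ("Data input and output", "Data used and generated by this notebook."),
    ("Key parameters", "Only the parameters that are crucial to the analysis."),
]


def markdown_cell(text):
    """Build a Markdown cell in the notebook v4 JSON layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(text=""):
    """Build an unexecuted code cell in the notebook v4 JSON layout."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": text,
    }


def notebook_skeleton(title, summary):
    """Return a notebook dict with the title and summary in the same first cell."""
    cells = [markdown_cell(f"# {title}\n\n{summary}")]
    cells += [markdown_cell(f"## {heading}\n\n{hint}") for heading, hint in SECTIONS]
    # Item 7: detailed steps, each with narrative followed by code.
    cells.append(markdown_cell("## Step 1\n\nNarrative for the first detailed step."))
    cells.append(code_cell())
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def write_notebook(path, title, summary):
    """Write the skeleton to `path` as JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(notebook_skeleton(title, summary), indent=1) + "\n")
    return path
```

For example, `write_notebook("analysis/20240101_example.ipynb", "Example analysis", "One sentence stating the goal.")` (a hypothetical file name) would create a notebook that can be opened in JupyterLab and filled in. No kernel is set in the metadata, so JupyterLab will ask which kernel to use when the notebook is first opened.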

Usually, a notebook should have a single goal, although it can contain multiple tasks. It usually represents no more than a week's worth of work (40 hours).

Your computational experiments should evolve over time, even during the course of a day:

  1. You can start off with unpolished code and unpolished narrative for an analysis.
  2. As you improve your experiments, you continue to polish the narrative and the organization of the analysis code. Iteratively improving the notebook can help crystallize your ideas and reduce errors (and make your analysis presentable to other readers, including yourself 6 months from now!).
  3. If you find yourself repeatedly using some of the code you developed, you can convert it into functions or mini bioinformatics pipelines inside a notebook (see the next section for more details), and call them repeatedly in the same notebook to run the experiments.
  4. If you deem these functions or pipelines useful for other projects, you should turn them into a utility library / package or into formal bioinformatics workflows, with a minimal working example (MWE) data-set to help test and improve them in the future.

You should organize all notebooks in a github repository into a research website. The next section discusses some tools for doing that.

Software for lab notebooks

JupyterLab and Jupyter Book

An important reason we recommend JupyterLab over Rstudio is that our work typically involves using multiple programming languages to perform both interactive data analysis and batch execution of bioinformatics workflows. For our daily research computing we use the SoS suite, which is both a multi-language notebook for interactive analysis and a workflow system for batch data analysis. SoS uses JupyterLab as its IDE.

To create a website from Jupyter notebooks I recommend the program Jupyter Book, which includes a tutorial to publish a website to github.io. Here is an example Jupyter Book based website created by our student for a project.

I used to build project websites using a small program I wrote. See for example this repository (and the HTML version). The program still works, but I no longer maintain it, so please use Jupyter Book instead.

In your daily research you will be expected to use SoS Notebook to analyze data, document your workflows using the suggested analysis report format, and make them available as websites to share with our colleagues.

Rstudio and workflowr

For R users who write R code and Rmarkdown documents, Rstudio with workflowr is a good alternative to JupyterLab for interactive data analysis. Here are some excellent workflowr based research websites from our colleagues:

  • https://brimittleman.github.io/apaQTL
  • https://brimittleman.github.io/threeprimeseq
  • https://lsun.github.io/truncash
  • https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html

Currently workflowr is actively maintained and evolves with Rstudio. Still, Rstudio with workflowr is not recommended if your work often involves analyzing data with bioinformatics pipelines developed by the group. The excellent research website examples above remain relevant regardless of the lab notebook software used.

Daily pull request(s)

Since all of our computational work happens on github, lab members are required to fork project repositories and send a pull request (PR) on a daily basis to communicate with me about their work.

Why daily PR

  • A daily PR is NOT a “progress monitor”.
  • A daily PR makes my workload reasonable: every day I can spend a small amount of time quickly reviewing a limited amount of work (what one can do during a single day), in order to keep up with your pace and provide timely feedback.
  • A daily PR makes our in-person meetings more efficient: when we meet we can focus on hard problems we did not resolve on github or slack, or on learning from each other.
  • With your lab notebook written in the suggested format (above), sending it in via a daily PR will help me review and clarify the goals and action items for your work. Goal setting and task clarification are crucial to the success of projects.

What if I don’t have updates in terms of code and analysis?

Even if you spent your day reading literature and took your own notes elsewhere, you can always use a dedicated lab notebook in your project repository, e.g. under the writeup folder, to keep a list of relevant literature for your project and add some high-level comments on each of the papers. This type of PR can also inform and educate me about your literature research.

Here is an example of daily pull requests from Hao Sun on the Brain xQTL project.

Github issues

Instant messaging via slack is a great way to handle quick Q&A between teammates on a project. However, when nobody on the team has an obvious solution to a problem at the moment, it is best to use github Issues to keep track of all discussions up to that point, so that they can be revisited in later attempts to solve the problem. Even if the problem has already been solved for the time being, it is still useful to document on github the complete thread of discussion for issues with non-trivial and/or suboptimal solutions. To elaborate: