Daily research practice

This document describes reproducible research practice in the lab. A presentation version of this document (possibly outdated) is available here.

Why and how to organize your computational research

Please read this paper and this paper for the motivation and general guidelines for keeping your computational work organized.

Suggested organization of a git repository for a project

We use git repositories hosted on github.com to organize our research (if you are not familiar with git, please check out this 5-minute git tutorial). We recommend the following organization for a project repository:

project_root
- analysis # interactive analysis notebooks
- code
   - dsc # statistical benchmarking (using the DSC language, https://stephenslab.github.io/dsc-wiki)
   - python # Python utility scripts
   - R # R utility scripts
- workflows # SoS pipeline notebooks (using the SoS language, https://vatlab.github.io/sos-docs)
- data # only those essential and <5MB should be uploaded to github
- output # output folder should never be uploaded to github
- writeup # Random PDF or Docx write-ups
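
The layout above can be set up with a few lines of Python. The following is only a minimal sketch, not a lab-provided tool: the function names `scaffold_project` and `oversized_data_files` are hypothetical, the `.gitignore` entry is one possible way to keep `output` out of github, and the "<5MB" rule is read here as 5 MiB, which is an assumption.

```python
from pathlib import Path

# Folder names are taken from the suggested layout above.
LAYOUT = [
    "analysis",
    "code/dsc",
    "code/python",
    "code/R",
    "workflows",
    "data",
    "output",
    "writeup",
]

# Assumption: the "<5MB" rule is interpreted as 5 MiB.
MAX_DATA_BYTES = 5 * 1024 * 1024


def scaffold_project(root):
    """Create the suggested folders and a .gitignore that excludes output/."""
    root = Path(root)
    for sub in LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)
    gitignore = root / ".gitignore"
    if not gitignore.exists():  # do not overwrite an existing file
        gitignore.write_text("output/\n")
    return root


def oversized_data_files(root, limit=MAX_DATA_BYTES):
    """Return files under data/ whose size is at or above the limit."""
    data_dir = Path(root) / "data"
    return sorted(
        p for p in data_dir.rglob("*") if p.is_file() and p.stat().st_size >= limit
    )
```

Calling `scaffold_project("my_project")` creates the folders, and `oversized_data_files("my_project")` lists data files that should probably stay out of the repository. Note that git does not track empty directories, so the folders only appear on github once they contain files.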

Suggested format of lab notebooks

Regardless of your focus (methods development or applied data analysis), all computational experiments in your daily research must be documented, well organized, version controlled (using git), and always ready to be communicated to others.

Most people in the lab use either Jupyter Notebook (formerly known as IPython Notebook) or Rmarkdown, or both. We highly recommend Jupyter Notebooks over Rmarkdown (for reasons explained later) unless you strongly prefer the latter.

In a lab notebook you should document your research clearly, communicating your results along with the code that generated them in a single self-contained document for a specific task. In particular, you should write notes in Markdown chunks between code chunks to explain what the code does and why. Code comments are different from such scientific narrative and cannot replace it.

We suggest the following organization of a lab notebook:

  1. Title, and in the same notebook cell a brief one-sentence summary of what the notebook is about.
  2. Motivation or Aims: describe the problem under investigation.
  3. Methods overview: a high-level description of methods used to solve the problem.
  4. Main conclusions (not applicable to a pure workflow notebook): take-home messages from your investigations.
  5. Data input and output (if applicable): describe the data used and generated by the notebook.
  6. Key parameters (if applicable): explain model or algorithm parameters that are crucial to the analysis (no need to explain all parameters).
  7. The rest of the notebook: multiple sections of detailed steps, with interactive code or workflows and narratives, as well as diagnostic summary statistics, plots and tables at each step.
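
The organization above can be turned into an empty notebook template. The following is a minimal sketch that writes the Jupyter notebook JSON format (version 4) directly with the standard library; the function names, the hint texts, and the "Analysis" heading for the last section are illustrative assumptions, not a lab standard. No kernel is recorded, so Jupyter will ask you to choose one when the file is first opened.

```python
import json

# Headings follow the suggested organization; hint texts are placeholders.
SECTIONS = [
    ("Motivation", "Describe the problem under investigation."),
    ("Methods overview", "High-level description of the methods used."),
    ("Main conclusions", "Take-home messages (skip for pure workflow notebooks)."),
    ("Data input and output", "Data used and generated by this notebook."),
    ("Key parameters", "Parameters that are crucial to the analysis."),
]


def markdown_cell(text):
    """Build one Markdown cell in notebook format version 4."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def notebook_skeleton(title, summary):
    """Return a notebook dict with the title cell and one cell per section."""
    cells = [markdown_cell(f"# {title}\n\n{summary}")]
    for heading, hint in SECTIONS:
        cells.append(markdown_cell(f"## {heading}\n\n{hint}"))
    # Hypothetical heading for the detailed, step-by-step part.
    cells.append(markdown_cell("## Analysis\n\nDetailed steps, code and diagnostics."))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def write_notebook(path, title, summary):
    """Save the skeleton as an .ipynb file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(notebook_skeleton(title, summary), fh, indent=1)
```

For example, `write_notebook("analysis/20240101_example.ipynb", "Example analysis", "One-sentence summary.")` produces a notebook with the seven parts already in place; the file name here is made up.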

Your computational experiments should evolve over time, even during the course of a day:

  1. You can start off with unpolished code and unpolished narrative for an analysis.
  2. As you improve your experiments, you continue to polish the narrative and the organization of the analysis code. Iteratively improving the notebook can help crystallize your ideas and reduce errors (and make your analysis presentable to other readers, including yourself six months from now!).
  3. If you find yourself repeatedly using some of the code you developed, you can convert it into functions or mini bioinformatics pipelines inside a notebook (see the next section for more details), and call them repeatedly in the same notebook to run the experiments.
  4. If you deem these functions or pipelines useful for other projects, you should turn them into a utility library / package or formal bioinformatics workflows, with a minimal working example (MWE) data-set to help test and improve them in the future.
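
Steps 3 and 4 can be illustrated with a toy case. The sketch below is purely illustrative: `standardize`, `MWE_VALUES` and `check_mwe` are hypothetical names, and the computation is a stand-in for whatever code you find yourself repeating. The point is the pattern: a function with a clear contract, plus a tiny data-set whose expected result can be verified by hand.

```python
import math
import statistics


def standardize(values):
    """Center values to mean 0 and scale to unit sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("values are constant")
    return [(v - mean) / sd for v in values]


# Minimal working example (MWE) data-set: small enough to check by hand.
# Mean is 2 and sample standard deviation is 1, so the result is -1, 0, 1.
MWE_VALUES = [1.0, 2.0, 3.0]
MWE_EXPECTED = [-1.0, 0.0, 1.0]


def check_mwe():
    """Run the function on the MWE data and compare with the expected result."""
    result = standardize(MWE_VALUES)
    for got, want in zip(result, MWE_EXPECTED):
        assert math.isclose(got, want, abs_tol=1e-12)
    return result
```

Keeping the MWE next to the function means anyone who later changes the function can rerun `check_mwe()` to see whether its behavior changed.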

You should organize all notebooks in a github repository into a research website. The next section discusses some tools for doing that.

Software for lab notebooks

JupyterLab and Jupyter Book

An important reason we recommend JupyterLab over Rstudio is that our work typically involves using multiple programming languages to perform both interactive data analysis and batch execution of bioinformatics workflows. For our daily research computing we use the SoS suite, which is both a multi-language notebook for interactive analysis and a workflow system for batch data analysis. SoS uses JupyterLab as its IDE.

To create a website from Jupyter notebooks I recommend the program Jupyter Book, which includes a tutorial on publishing a website to github.io. Here is an example Jupyter Book-based website created by our student for a project.

I used to build project websites using a small program I wrote. See for example this repository (and the HTML version). The program is still usable but I no longer maintain it.

In your daily research you will be expected to use SoS Notebook to analyze data, document your workflows using the suggested analysis report format, and make them available as websites to share with our colleagues.

Rstudio and workflowr

For R users who write R code and Rmarkdown documents, Rstudio with workflowr is a good alternative to JupyterLab for interactive data analysis. Here are some excellent workflowr-based research websites from our colleagues:

  • https://brimittleman.github.io/apaQTL
  • https://brimittleman.github.io/threeprimeseq
  • https://lsun.github.io/truncash
  • https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html

Currently workflowr is actively maintained and evolves with Rstudio. Still, Rstudio with workflowr is not recommended if your work often involves analyzing data with bioinformatics pipelines developed by the group. The research websites listed above remain excellent examples regardless of the lab notebook software used.

Github issues

Instant messaging via slack is a great way to handle quick Q&A between teammates on a project. However, when nobody on the team has an obvious solution to a problem at the moment, it is best to use github Issues to keep track of all discussions up to that point, so that they can be revisited in later attempts to solve the problem. Even if a problem has already been solved for the time being, it is still useful to document on github the complete thread of discussion for issues with non-trivial and/or suboptimal solutions.

Daily pull request(s)

Since all of our computational work happens on github, lab members are required to fork project repositories and send in a pull request (PR) on a daily basis to communicate with me about their work. The daily PR is not meant as a “progress monitor”, but rather as a way to keep my workload reasonable: it ensures that every day I can spend a small amount of time quickly reviewing a limited amount of work (what one can do in a single day), so that I can keep up with your pace and provide timely feedback. It would be challenging for me in terms of time and bandwidth if, instead of sending in daily PRs, you communicated a week’s worth of work all at once and asked for my feedback. Even if you spent your day reading literature and did not perform any computational experiments, you can use a dedicated lab notebook in your project repository to keep track of literature relevant to your project, add the papers you read to that notebook, and then send in a PR so that I can also be informed and learn from your literature research.

Here is an example of daily pull requests from Hao Sun on the Brain xQTL project.