Daily research practice

This document discusses reproducible research practice in the lab. A presentation (based on an outdated version of this document) is available here.

I initially started this document exclusively for master's and junior PhD students, but it turns out that the entire lab can benefit from working under a unified style that facilitates collaboration across projects.

Why and how to organize your computational research

Please read this paper and this paper for motivation and general guidelines on keeping your computational work organized.

Suggested organization of a git repository for a project

We use git repositories hosted on github.com to organize our research (if you are not familiar with git, please check out this 5-minute git tutorial). We recommend the following organization for a project repository:

project_root
- analysis # interactive analysis notebooks
- code
   - dsc # statistical benchmarking (using the DSC language, https://stephenslab.github.io/dsc-wiki)
   - python # Python utility scripts
   - R # R utility scripts
- workflows # SoS pipeline notebooks (using the SoS language, https://vatlab.github.io/sos-docs)
- data # metadata for data on cluster or cloud, plus a few essential and small (<5MB) data sets
- input # the input folder should never be uploaded to GitHub; exclude it via the `.gitignore` file
- output # the output folder should never be uploaded to GitHub; exclude it via the `.gitignore` file
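As a concrete illustration, a new project repository following this layout can be scaffolded from the command line roughly as sketched below (the folder names simply mirror the suggested structure above, and `.gitignore` keeps the input and output folders off GitHub):

    # Sketch: scaffold the suggested project layout (folder names mirror the structure above)
    mkdir -p analysis code/dsc code/python code/R workflows data input output

    # Keep input/ and output/ off GitHub by listing them in .gitignore
    printf 'input/\noutput/\n' > .gitignore

    git init
    git add .gitignore
    git commit -m "Initialize project skeleton"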

Additionally we should use the following features on GitHub:

  • “Discussions”: you can enable it under the repository's Settings -> Features -> Discussions. Use it to summarize meeting notes, document what you have learned, list papers you plan to read, link to DOCX, PPTX and PDF files on the cloud, etc.
  • “Issues”: use this to keep track of small blockers and problems, especially technical ones, to have a more focused discussion with your collaborators.

Suggested format of lab notebooks

Regardless of your focus (methods development or applied data analysis), all computational experiments in your daily research must be documented, well organized, version controlled (using git), and always ready to be communicated to others.

Most people in the lab use Jupyter Notebook (formerly known as IPython Notebook), Rmarkdown, or both. We highly recommend Jupyter Notebooks over Rmarkdown (for reasons elaborated later) unless you have a strong preference for the latter.

In a lab notebook you should document your research clearly, communicating your results along with the code that generated them in a single self-contained document for a specific task. In particular, write notes in Markdown cells between code cells to explain what you are doing with that code. Code comments are different from, and cannot replace, such scientific narrative in a notebook.

We suggest the following organization of a lab notebook:

  1. Title, and in the same notebook cell, a brief one-sentence summary of the goal of the notebook.
  2. Motivation or Aims: describe the problem under investigation and break it into actionable items.
  3. Methods overview: a high-level description of methods used to solve the problem.
  4. Main conclusions (not applicable to a pure workflow notebook): take home message from your investigations.
  5. Data input and output (if applicable): describe data used and generated from the notebook.
  6. Key parameters (if applicable): explain model or algorithm parameters that are crucial to the analysis (no need to explain all parameters).
  7. The rest of the notebook: multiple sections of detailed steps, with interactive code / workflows and narratives, as well as diagnostic summary statistics, plots and tables at each step.

A notebook should usually have a single goal, although it can contain multiple tasks, and should typically represent no more than a week's worth of work (40 hrs).

Your computational experiments should evolve over time, even during the course of a day:

  1. You can start off with unpolished code and unpolished narrative for an analysis.
  2. As you improve your experiments, you continue to polish the narrative and organization of the analysis code. Iteratively improving the notebook helps crystallize your ideas and reduce errors (and makes your analysis presentable to other readers, including yourself six months in the future!).
  3. If you find yourself repeatedly using some of the code you developed, you can convert it into functions or mini bioinformatics pipelines inside a notebook (see the next section for more details), and call them repeatedly in the same notebook to run the experiments.
  4. If you deem these functions or pipelines useful for other projects, you should turn them into a utility library / package or formal bioinformatics workflows, with a minimal working example (MWE) data set to help test and improve them in the future.

You should organize all notebooks in a GitHub repository into a research website. The next section discusses some tools to implement that.

Software for lab notebooks

JupyterLab and Jupyter Book

An important reason we recommend JupyterLab over RStudio is that our work typically involves multiple programming languages for both interactive data analysis and batch execution of bioinformatics workflows. We use the SoS suite, which provides both a multi-language notebook for interactive analysis and a workflow system for batch data analysis, for our daily research computing. SoS uses JupyterLab as its IDE.
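If you need to set up this environment yourself, the installation typically looks roughly like the sketch below (it assumes a working Python/pip setup; please treat it as an outline and consult the SoS documentation for the authoritative instructions):

    # Sketch: install JupyterLab with SoS Notebook and register the SoS kernel
    # (assumes a working Python / pip environment; see the SoS documentation for details)
    pip install jupyterlab sos sos-notebook
    python -m sos_notebook.install   # registers the SoS kernel with Jupyter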

To create a website from Jupyter notebooks, I recommend Jupyter Book, whose documentation includes a tutorial on publishing a website to github.io. Here is an example Jupyter Book based website created by one of our students for a project.
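For reference, a typical build-and-publish cycle with Jupyter Book looks roughly like the sketch below (mybook/ is a placeholder for your book folder, and ghp-import is the helper described in the Jupyter Book tutorial for publishing to github.io):

    # Sketch: build the book and publish the generated HTML to GitHub Pages
    # (mybook/ is a placeholder for your Jupyter Book project folder)
    pip install jupyter-book ghp-import
    jupyter-book build mybook/
    ghp-import -n -p -f mybook/_build/html   # push to the gh-pages branch served at github.io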

I used to build project websites using a small program I wrote; see for example this repository (and the HTML version). The program is still usable but I no longer maintain it, so please don’t use it any more: use Jupyter Book instead.

In your daily research you will be expected to use SoS Notebook to analyze data, document your workflows in the suggested analysis report format, and make them available as websites to share with colleagues.

Rstudio and workflowr

For R users who write R code and Rmarkdown documents, RStudio with workflowr is a good alternative to JupyterLab for interactive data analysis. Here are some excellent workflowr-based research websites from our colleagues:

  • https://brimittleman.github.io/apaQTL
  • https://brimittleman.github.io/threeprimeseq
  • https://lsun.github.io/truncash
  • https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html

workflowr is currently actively maintained and evolves alongside RStudio. Still, RStudio with workflowr is not recommended if your work often involves analyzing data with bioinformatics pipelines developed by the group, although the research website examples above remain relevant regardless of the lab notebook software you use.

Frequent pull requests

As all of our computational work happens on GitHub, it is essential for lab members to fork project repositories and send frequent pull requests (PRs) to communicate progress; a typical command-line workflow is sketched after the list below. Lab members should

  1. aim for a minimum of two PRs a week.
  2. send all content to be discussed during an in-person meeting via PR beforehand.
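For reference, a typical fork-and-PR cycle from the command line might look like the sketch below (repository, branch and file names in angle brackets are placeholders, and the last step uses the GitHub CLI, gh; you can equally open the PR through the web interface):

    # Sketch: fork-based PR workflow (all names in angle brackets are placeholders)
    git clone https://github.com/<your-username>/<project>.git
    cd <project>
    git checkout -b weekly-update                  # a topic branch for this batch of work
    # ... edit notebooks, then stage and commit them ...
    git add analysis/<my_notebook>.ipynb
    git commit -m "Add exploratory analysis notebook"
    git push origin weekly-update
    gh pr create --title "[WIP] Exploratory analysis" --body "Brief summary of what this PR contains"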

Why are frequent PRs important?

  • PRs are not just a progress tracker; they are also an opportunity to develop communication skills by carefully editing notebooks to ensure clear communication.
  • Submitting lab notebooks in the suggested format via PRs allows me to clarify goals and action items for your work in an organized way (by commenting on the PR). Goal-setting and task clarification are crucial to project success, and PRs help steer that.
  • Frequent PRs help manage my workload, allowing me to quickly review a limited amount of work each day, keep up with your pace, and provide timely feedback.
  • PRs from the past week also make our weekly in-person meetings more efficient, allowing us to focus on unresolved problems and on learning from each other.

What if I don’t have enough updates in terms of completed code and analysis?

You can always send a PR with [WIP] (Work In Progress) in its title. I will glance through it without merging; once you have pushed your remaining changes and removed the [WIP] tag, I will fully review and merge it.
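If you use the GitHub CLI, removing the [WIP] tag once the work is ready can also be done from the command line (a sketch; the PR number and title are placeholders):

    # Sketch: retitle a work-in-progress PR once it is ready for full review
    gh pr edit <PR-number> --title "Exploratory analysis"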

Even if you spent your day reading literature and taking notes elsewhere, you can always use a dedicated lab notebook in your project repository, e.g. under a writeup folder, to keep a list of literature relevant to your project along with high-level comments on each paper. This type of PR also informs and educates me about your literature research.

Here is an example of (mostly) daily pull requests from Hao on the Brain xQTL project. As another example, Xuewei has a nice weekly update routine, with extremely well documented methodology, experiments and analysis, as well as GitHub Issues discussions, in this private repo.

GitHub Issues

Instant messaging via Slack is a great way to handle quick Q&A between teammates on a project. However, when nobody on the team has an obvious solution to a problem at the moment, it is best to use GitHub Issues to keep track of all discussion up to that point, so that it can be revisited in later attempts at a solution. Even if a problem has been solved for the time being, it is also useful to document the complete discussion thread on GitHub for issues with non-trivial and/or suboptimal solutions. To elaborate: