Download data from dbGaP
Brief documentation on how to access the dbGaP website and how to download data.
Contributors
Gao Wang, Dongyue Xie
Data directory
Files downloaded from dbGaP need to be decrypted and extracted into usable formats. Files are first downloaded to a temporary directory, then processed and saved to a more permanent location. On RCC Midway, for example, we therefore suggest downloading data to a temporary folder under /scratch/midway2/$USER, then extracting them to the shared computational space /project2/mstephens/ for sharing with the group. This documentation assumes this workflow.
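The two locations in this workflow can be prepared up front. The following is a minimal sketch; the helper name setup_dbgap_dirs and the subfolder names in the example call are assumptions for illustration, not part of any tool.

```shell
# Hypothetical helper: create the temporary download directory and the
# permanent destination directory if they do not exist yet.
setup_dbgap_dirs() {
    download_dir="$1"   # e.g. a folder under /scratch/midway2/$USER
    dest_dir="$2"       # e.g. a folder under /project2/mstephens/
    mkdir -p "$download_dir" "$dest_dir" || return 1
    echo "download: $download_dir"
    echo "destination: $dest_dir"
}

# Example call (the subfolder names are assumptions):
# setup_dbgap_dirs "/scratch/midway2/$USER/dbGaP" "/project2/mstephens/$USER/dbGaP"
```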
dbGaP access
- Visit dbGaP, click on Controlled Access Data, then log in with your eRA Commons account. You’ll see the datasets previously applied for and approved under the tab Authorized Access -> My Requests.
- In the column Actions, click Request Files for the dataset you want to download. You should see two possibilities in two tabs:
  - “Phenotype and Genotype files”
  - “SRA data (reads and reference alignments)”
Download “Phenotype and Genotype files”
Click on the tab Phenotype and Genotype files, then click / select data file names. After the files you need are selected, create the data download request. You should be directed to a download instruction page. You should also receive an email confirmation which contains a link to this page.
For example, to download the V8 eQTL release, use the Request for Exchange Area data, then click the + sign to unfold the directory and select the data to download, in my case:
gtex/exchange/GTEx_phs000424/exchange/analysis_releases/GTEx_Analysis_2017-06-05_v8/eqtl
The download instruction page offers several options, but I find it most straightforward to use the Linux command-line tool ascp.
ascp can be downloaded for Linux and then installed by running the installer with bash. The installer requires root, which we don’t have on RCC Midway. But it is easy to hack the installer: open it with a text editor, e.g. vim, edit the first 10 lines of code to remove the root requirement and change the installation path, then save and run it with bash to install. You should find two files under a bin folder in the path you set in the installer: 1) the executable ascp and 2) the license file aspera-license. If you don’t know how to hack it, you can use a version I installed on Midway, /project2/mstephens/software/ascp (the license file sits in the same directory as ascp). Additionally you need an SSH key, which comes with the installation; again, I provide it at /project2/mstephens/software/asperaweb_id_dsa.openssh. The ascp executable alone will not work without these additional files.
Now you can run the download command. On the instruction page the command might look like:
"%ASPERA_CONNECT_DIR%\bin\ascp" -QTr -l 300M -k 1 -i "%ASPERA_CONNECT_DIR%\etc\asperaweb_id_dsa.openssh" -W A7A8C74EB00A14C92826A8EC64785904771847404B340486B977DD32F5F3701B399760E2328AA4F43ADB2A6E66A2EFA22D dbtest@gap-upload.ncbi.nlm.nih.gov:data/instant/gaowang/72550 .
but with ascp set up as discussed above, you should change it to:
/project2/mstephens/software/ascp -QTr -l 300M -k 1 -i /project2/mstephens/software/asperaweb_id_dsa.openssh -W A7A8C74EB00A14C92826A8EC64785904771847404B340486B977DD32F5F3701B399760E2328AA4F43ADB2A6E66A2EFA22D dbtest@gap-upload.ncbi.nlm.nih.gov:data/instant/gaowang/72550 /scratch/midway2/dyxie/dbGaP/
where in this example /scratch/midway2/dyxie/dbGaP/ is the download directory you set up earlier. You will then find your downloaded files in a folder 72550 under that directory, as implied by the commands above.
Note: the path to asperaweb_id_dsa.openssh must be absolute, not relative such as ./ (that won’t work!). Ask Kushal, he learned the lesson the hard way.
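The requirements above (an absolute key path, and the license file sitting next to the executable) can be checked before starting a long transfer. This is only a sketch: check_ascp_setup is a hypothetical helper name, and the checks cover just what this document states.

```shell
# Hypothetical pre-flight check for the ascp setup described above.
# Prints a message and returns non-zero on the first problem found.
check_ascp_setup() {
    ascp_bin="$1"   # path to the ascp executable
    key_file="$2"   # path to asperaweb_id_dsa.openssh
    case "$key_file" in
        /*) ;;      # absolute path: fine
        *)  echo "key path must be absolute: $key_file"; return 1 ;;
    esac
    [ -x "$ascp_bin" ] || { echo "ascp not executable: $ascp_bin"; return 1; }
    [ -f "$key_file" ] || { echo "key file not found: $key_file"; return 1; }
    # The license file is expected in the same directory as ascp.
    [ -f "$(dirname "$ascp_bin")/aspera-license" ] || {
        echo "aspera-license not found next to ascp"; return 1;
    }
    echo "ok"
}

# Example call with the shared copies mentioned above:
# check_ascp_setup /project2/mstephens/software/ascp \
#     /project2/mstephens/software/asperaweb_id_dsa.openssh
```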
The downloaded data have the .ncbi_enc file extension. Next, we need sra-toolkit to decrypt the data.
Install sra-toolkit
If you use Anaconda on midway2, it does not allow users to install sra-toolkit, so we need to install conda ourselves:
First download and install Miniconda3 if you haven’t done so before. Replace %Download address% with the download link, which you can find on the Miniconda download page:
wget %Download address%
Then run
bash Miniconda3-latest-Linux-x86_64.sh -bfp /scratch/midway2/$USER/miniconda3
echo "export PATH=/scratch/midway2/$USER/miniconda3/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
Check that the right conda is being used:
which conda
Then the sra tool can be installed via:
conda install -c bioconda sra-tools
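After installation, it may help to confirm that the commands used later in this document are actually on your PATH. The sketch below uses a hypothetical helper, missing_tools. Note that samtools, used in the last section, is not part of sra-tools and would need to be installed separately.

```shell
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used in this document (uncomment to check; no output means all found):
# missing_tools vdb-config vdb-decrypt prefetch sam-dump samtools
```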
Set dbGaP repository key for the project
Decryption also requires a dbGaP repository key. The key is provided in a file with the suffix “.ngc”. It can be obtained from two places in the PI’s dbGaP account.
- The first place is the project page under the “My Projects” tab, through a link named “get dbGaP repository key” in the “Actions” column. The key downloaded from here is valid for all data downloaded under the project.
- The second place is the download page under the “Downloads” tab, through a link named “get dbGaP repository key” in the “Actions” column.
I got the key from the first place. Then configure the key by running the command
vdb-config --import ~/prj_3253_D17102.ngc
After configuration, manually change the mkfg file:
vim ~/.ncbi/user-settings.mkfg
Then change the root to the folder where you store the downloaded data. For example, here, the download directory is /scratch/midway2/dyxie/dbGaP/. See below the change I made to point to that directory.
15c17
< /repository/user/protected/dbGaP-3253/root = "/home/dyxie/ncbi/dbGaP-3253"
---
> /repository/user/protected/dbGaP-3253/root = "/scratch/midway2/dyxie/dbGaP/"
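The same one-line change could also be made without opening an editor. The following is a sketch only: set_dbgap_root is a hypothetical helper, and it assumes the root line has exactly the form shown in the diff above and that the new path contains none of the characters |, & or \.

```shell
# Hypothetical helper: point the dbGaP project root in an mkfg file to a new
# directory. A backup of the original file is kept with a .bak suffix.
set_dbgap_root() {
    mkfg_file="$1"    # e.g. ~/.ncbi/user-settings.mkfg
    project_id="$2"   # e.g. 3253
    new_root="$3"     # e.g. /scratch/midway2/dyxie/dbGaP/
    root_key="/repository/user/protected/dbGaP-${project_id}/root"
    cp "$mkfg_file" "${mkfg_file}.bak" || return 1
    # Rewrite only the line that sets the project root; copy all others as-is.
    sed "s|^${root_key} = \".*\"|${root_key} = \"${new_root}\"|" \
        "${mkfg_file}.bak" > "$mkfg_file"
}

# Example call matching the diff above:
# set_dbgap_root ~/.ncbi/user-settings.mkfg 3253 /scratch/midway2/dyxie/dbGaP/
```

Comparing the result against the .bak copy with diff should show a single changed line.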
Finally, run
cd /path/to/downloaded/data # in the above setting it is: /scratch/midway2/dyxie/dbGaP/72550
vdb-decrypt --decrypt-sra-files . /path/to/destination -v # we can use somewhere under /project2/mstephens for the output destination directory
Then all the *.ncbi_enc files under /path/to/downloaded/data will be decrypted and the original *.ncbi_enc files will be automatically removed.
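One way to confirm that decryption went through is to count the files that still carry the .ncbi_enc extension. This is a minimal sketch with a hypothetical helper name, count_encrypted.

```shell
# Count files that still carry the .ncbi_enc extension under a directory
# (searched recursively).
count_encrypted() {
    find "$1" -type f -name '*.ncbi_enc' | wc -l | tr -d ' '
}

# Example: a result of 0 suggests nothing is left to decrypt.
# count_encrypted /path/to/downloaded/data
```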
Download “SRA data (reads and reference alignments)”
Before this, please follow the instructions above in the sections “Install sra-toolkit” and “Set dbGaP repository key for the project”.
Then, click ‘SRA RUN selector’, then check ‘data-type(Run)’ in the Filters List on the left side. Then check rna.
Select all the SRA files you want to download and click ‘Selected’ in the Select section. Now download the Accession List. The downloaded list is named SRR_Acc_List.txt and contains a list of SRR IDs.
To download these files and convert them to BAM files, run the following commands, replacing xyz with the SRR ID.
cd /path/to/download/dir
prefetch xyz
sam-dump xyz.sra | samtools view -Sb - > /path/to/destination/xyz.bam
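Since SRR_Acc_List.txt usually holds many IDs, the step above can be repeated in a loop. The sketch below is an assumption-laden illustration: read_accessions and convert_accessions are hypothetical helper names, and the loop simply reuses the prefetch and sam-dump commands exactly as given above, so it expects xyz.sra to be reachable from the download directory in the same way.

```shell
# Print one accession ID per line, dropping blank lines and any Windows
# carriage returns that may be present in the downloaded list.
read_accessions() {
    tr -d '\r' < "$1" | grep -v '^[[:space:]]*$'
}

# Hypothetical wrapper: repeat the prefetch / sam-dump step for every ID.
# Run it from the download directory, as in the commands above.
convert_accessions() {
    acc_list="$1"    # e.g. SRR_Acc_List.txt
    dest_dir="$2"    # e.g. /path/to/destination
    read_accessions "$acc_list" | while read -r srr_id; do
        # stdin is redirected so the tools cannot consume the ID list
        prefetch "$srr_id" < /dev/null || {
            echo "prefetch failed: $srr_id" >&2; continue;
        }
        sam-dump "$srr_id.sra" < /dev/null \
            | samtools view -Sb - > "$dest_dir/$srr_id.bam"
    done
}

# Example call:
# cd /path/to/download/dir && convert_accessions SRR_Acc_List.txt /path/to/destination
```

A failed prefetch is reported and skipped rather than aborting the whole list, so one bad ID does not block the remaining downloads.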