Download data from dbGaP
Brief documentation on how to access the dbGaP website and download data.
Gao Wang, Dongyue Xie
Files downloaded from dbGaP need to be decrypted and extracted to usable formats. Files are first downloaded to a temporary directory, then processed and saved to a more permanent location. On RCC Midway, for example, we suggest that you download data to a temporary folder under
/scratch/midway2/$USER then extract them to the shared computational space
/project2/mstephens/ for sharing with the group. This documentation assumes this workflow.
Visit dbGaP, click on
Controlled Access Data, then log in with your eRA Commons account. You’ll see the datasets previously applied for and approved under the tab
Authorized Access -> My Requests.
Click
Request Files for the dataset you want to download. You should see two possibilities in two tabs:
- “Phenotype and Genotype files”
- “SRA data (reads and reference alignments)”
Download “Phenotype and Genotype files”
Click on the tab
Phenotype and Genotype files, then click / select data file names. After files you need are selected, create the data download request. You should be directed to a download instruction page. You should also receive an email confirmation which contains a link to this page.
For example, to download the V8 eQTL release, use the request for
Exchange Area data, then click the
+ sign to unfold the directory and select the data to download.
The download instruction page offers several options, but I find it most straightforward to use the Linux command-line tool
ascp.
ascp can be downloaded for Linux and then installed using the
bash command. The installer requires root, which we don’t have on RCC Midway. But it is easy to hack the installer: just open it with a text editor, e.g.
vim, edit the first 10 lines of code to remove the root requirement and change the installation path, then save and run it with
bash to install. You should find two files under a
bin folder in the path you just edited into the installer: 1) the executable
ascp and 2) the license file
aspera-license. If you don’t know how to hack it, you can use a version I put on Midway at
/project2/mstephens/software/ascp (the license file sits in the same directory as
ascp). Additionally you need an ssh key, which comes with the installation; again I provide a copy on Midway (its path appears after -i in the adapted command below). The
ascp executable alone, without these additional files, is not going to work.
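Since all three pieces have to be present, a quick pre-flight check can save a failed transfer. The following is a minimal sketch: check_ascp_setup is a hypothetical helper name, and it assumes (as described above) that aspera-license sits in the same directory as the ascp executable, while the ssh key path is passed in explicitly.

```shell
# Sketch only: check_ascp_setup is a hypothetical helper, not part of Aspera.
# Usage: check_ascp_setup /path/to/ascp /path/to/asperaweb_id_dsa.openssh
check_ascp_setup() {
    ascp_bin=$1                        # path to the ascp executable
    ascp_key=$2                        # path to the ssh key file
    ascp_dir=$(dirname "$ascp_bin")    # the license is expected next to ascp
    missing=0

    # The executable must exist and be runnable.
    [ -x "$ascp_bin" ] || { echo "missing or not executable: $ascp_bin" >&2; missing=1; }
    # The license file must sit in the same directory as ascp.
    [ -f "$ascp_dir/aspera-license" ] || { echo "missing license: $ascp_dir/aspera-license" >&2; missing=1; }
    # The ssh key must be readable.
    [ -r "$ascp_key" ] || { echo "missing ssh key: $ascp_key" >&2; missing=1; }

    return "$missing"
}
```

The function returns 0 when everything is in place and 1 otherwise, printing one line per missing file to standard error.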
Now you can run the download command. On the instruction page the command might look like:
"%ASPERA_CONNECT_DIR%\bin\ascp" -QTr -l 300M -k 1 -i "%ASPERA_CONNECT_DIR%\etc\asperaweb_id_dsa.openssh" -W A7A8C74EB00A14C92826A8EC64785904771847404B340486B977DD32F5F3701B399760E2328AA4F43ADB2A6E66A2EFA22D email@example.com:data/instant/gaowang/72550 .
but in our case, with
ascp set up as previously discussed, you should change it to:
/project2/mstephens/software/ascp -QTr -l 300M -k 1 -i /project2/mstephens/software/asperaweb_id_dsa.openssh -W A7A8C74EB00A14C92826A8EC64785904771847404B340486B977DD32F5F3701B399760E2328AA4F43ADB2A6E66A2EFA22D firstname.lastname@example.org:data/instant/gaowang/72550 /scratch/midway2/dyxie/dbGaP/
where in this example
/scratch/midway2/dyxie/dbGaP/ is the download directory you set up previously. You will then find your downloaded files in a folder
72550 under that directory, as implied by the commands above.
Note: we need the absolute path to
asperaweb_id_dsa.openssh, not a relative path such as
./ – that won’t work! Ask Kushal, he learned the lesson the hard way.
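To guard against the relative-path mistake, the command can be assembled by a small wrapper that refuses a key path not starting with "/". This is only a sketch: build_ascp_cmd is a hypothetical helper name, the flags are copied from the command above, and the token and remote source must still be taken from your own instruction page.

```shell
# Sketch only: build_ascp_cmd is a hypothetical helper, not part of Aspera.
# It prints the ascp command line so it can be inspected before running.
build_ascp_cmd() {
    ascp_bin=$1      # path to the ascp executable
    ascp_key=$2      # ssh key file; must be an absolute path
    ascp_token=$3    # the value following -W on the instruction page
    ascp_remote=$4   # remote source, copied from the instruction page
    ascp_dest=$5     # local download directory

    case $ascp_key in
        /*) ;;   # absolute path: accepted
        *)  echo "error: key path must be absolute: $ascp_key" >&2
            return 1 ;;
    esac

    printf '%s -QTr -l 300M -k 1 -i %s -W %s %s %s\n' \
        "$ascp_bin" "$ascp_key" "$ascp_token" "$ascp_remote" "$ascp_dest"
}

# Example call (TOKEN and REMOTE are placeholders for values from your page):
# build_ascp_cmd /project2/mstephens/software/ascp \
#     /project2/mstephens/software/asperaweb_id_dsa.openssh \
#     "$TOKEN" "$REMOTE" "/scratch/midway2/$USER/dbGaP/"
```

Printing the command instead of running it keeps the sketch safe to try; once the output looks right it can be pasted into the shell.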
The downloaded data have
.ncbi_enc file extensions. We need
sra-toolkit to decode the data next.
Install sra-toolkit
If you use Anaconda on midway2, it does not allow users to install
sra-toolkit. So we need to install conda ourselves:
First download and install
Miniconda3 if you haven’t done that before. Replace
%Download address% with the download link of the Miniconda3 Linux installer:
wget %Download address%
bash Miniconda3-latest-Linux-x86_64.sh -bfp /scratch/midway2/$USER/miniconda3
echo "export PATH=/scratch/midway2/$USER/miniconda3/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
Check that you are using the right conda, i.e. the one just installed under /scratch/midway2/$USER/miniconda3.
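One way to do this check is to ask the shell which conda it would run and compare that with the install prefix. The sketch below assumes conda is found through PATH as set up above (not through a shell function); check_conda is a hypothetical helper name.

```shell
# Sketch only: check_conda is a hypothetical helper.
# It succeeds when the conda found on PATH lives under the given prefix.
check_conda() {
    expected_prefix=$1
    # command -v prints the path of the executable the shell would run.
    conda_path=$(command -v conda) || { echo "conda not found on PATH" >&2; return 1; }
    case $conda_path in
        "$expected_prefix"/*) echo "OK: $conda_path" ;;
        *) echo "unexpected conda: $conda_path" >&2
           return 1 ;;
    esac
}

# Example call for the install location used in this document:
# check_conda "/scratch/midway2/$USER/miniconda3"
```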
Then sra-tools can be installed via:
conda install -c bioconda sra-tools
Set dbGaP repository key for the project
In order to decrypt, a dbGaP repository key is also required. The key is provided in a file with the suffix “.ngc”. It can be obtained from two places in the PI’s dbGaP account.
- The first place is the project page under the “My Projects” tab, through a link named “get dbGaP repository key” in the “Actions” column. The key downloaded from here is valid for all downloaded data under the project.
- The second place is the download page under the “Downloads” tab, through a link named “get dbGaP repository key” in the “Actions” column.
I got the key from the first place. Then configure the key by running the command
vdb-config --import ~/prj_3253_D17102.ngc
After configuration, manually edit the configuration file that vdb-config updated
and change the root to the folder where you store the data. For example, here the download directory is
/scratch/midway2/dyxie/dbGaP/. See below the change I made to point to that directory.
15c17
< /repository/user/protected/dbGaP-3253/root = "/home/dyxie/ncbi/dbGaP-3253"
---
> /repository/user/protected/dbGaP-3253/root = "/scratch/midway2/dyxie/dbGaP/"
cd /path/to/downloaded/data # in the above setting it is: /scratch/midway2/dyxie/dbGaP/72550
vdb-decrypt --decrypt-sra-files . /path/to/destination -v # we can use somewhere under /project2/mstephens for the output destination directory
Then all the
*.ncbi_enc files under
/path/to/downloaded/data will be decrypted and the original
*.ncbi_enc files will be automatically removed.
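A quick way to see whether any encrypted files are left is to count the files that still carry the .ncbi_enc extension. This is a minimal sketch using only find and wc; count_encrypted is a hypothetical helper name, and the count would be off for file names containing newlines.

```shell
# Sketch only: count_encrypted is a hypothetical helper.
# Prints how many *.ncbi_enc files exist under the given directory.
count_encrypted() {
    # tr strips the padding some wc implementations add around the number.
    find "$1" -type f -name '*.ncbi_enc' | wc -l | tr -d ' '
}

# Example call:
# count_encrypted /path/to/downloaded/data
```

A result of 0 means no files with the encrypted extension remain in that directory tree.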
Download “SRA data (reads and reference alignments)”
Before this, please follow instructions above in sections “Install
sra-toolkit” and “Set dbGaP repository key for the project”.
Then, click ‘SRA RUN selector’, then check ‘data-type(Run)’ from the Filters List on the left side. Then check
Select all the SRA files you want to download and click ‘Selected’ in the Select section. Now download the
Accession List. The downloaded list is named
SRR_Acc_List.txt, which contains a list of SRR IDs.
To download these files and convert them to BAM files, run the following commands, replacing
xyz with the ID.
cd /path/to/download/dir
prefetch xyz
sam-dump xyz.sra | samtools view -Sb - > /path/to/destination/xyz.bam # "-" makes samtools read the SAM stream from stdin
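With many accessions, the same commands can be applied to every ID in SRR_Acc_List.txt. The loop below is a sketch under stated assumptions: fetch_all and DRY_RUN are hypothetical names, the list holds one SRR ID per line, and the sketch passes "-" to samtools so it reads from standard input. By default it only prints the commands; set DRY_RUN=0 to run them.

```shell
# Sketch only: fetch_all and DRY_RUN are hypothetical names.
# Usage: fetch_all SRR_Acc_List.txt /path/to/destination
fetch_all() {
    acc_list=$1   # accession list, one SRR ID per line
    bam_dest=$2   # directory that receives the BAM files
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc=$(printf '%s' "$acc" | tr -d '\r')   # tolerate Windows line endings
        if [ -z "$acc" ]; then
            continue                             # skip blank lines
        fi
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Default: only show what would be run.
            echo "prefetch $acc && sam-dump $acc.sra | samtools view -Sb - > $bam_dest/$acc.bam"
        else
            # </dev/null keeps the tools from consuming the accession list.
            prefetch "$acc" </dev/null &&
                sam-dump "$acc.sra" </dev/null | samtools view -Sb - > "$bam_dest/$acc.bam"
        fi
    done < "$acc_list"
}
```

Run it from the download directory, as in the commands above, so that the xyz.sra paths resolve the same way.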