Diving into the SRA abyss
Access to data remains a major hurdle for reproducibility in science
despite the increased availability of large-scale repositories for
online data storage, and even journal policies that require data archiving.
In this post I’ll try to make an argument for archiving data in the SRA through
demonstration of the
sratools wrapper from the
which makes it easy and elegant to download data from the SRA
when working in Python/Jupyter.
There are many reasons not to upload data to the SRA and to instead dump it on a generic archive like DRYAD, chief among them being that uploading data to SRA is super time-consuming and difficult. It requires entering pages upon pages of meta-data by hand into arcane web forms or spread-sheets to define numerous objects that are continually referenced by redundant names or prefix tags (e.g., SUB, SAMN, SRX, SRP, PRJNA, BioSamples and BioProjects), and which have a relational structure that defies understanding (e.g., 1 SRA can have 4 SRXs which produce data for 96 SRRs from 96 SRSs for 4 SRPs; See table below, which I reference frequently when trying to understand this stuff.) And so it should be no surprise that people often forgo data archiving.
|SRX||Experiment||Metadata about library, platform, selection.|
|SRR||Run||The actual sequence data for an experiment.|
|SRP||Study||Metadata about project (BioProject).|
|SRS||Sample||Metadata about the physical Sample (BioSample)|
|SRZ||Analysis||Mapped/aligned reads file (BAM) & metadata.|
|SRA||Submission||Metadata of other 5 linked objects.|
Why use the SRA?
The distinguishing advantage of using the SRA, however, is that grouping data with its associated meta-data allows later users to write scripts to access the data and properly format it for downstream analyses based on the information encoded in the meta-data (e.g., replicate samples are present from multiple lanes of sequencing, or paired reads need to be split into separate files). The idea of reusing genomic data seems obvious, but currently seems less appreciated than for previous types of data like Sanger sequences, which have been fairly uniformly archived in Genbank. Perhaps this is because data generation has been highly incentivized over data reuse in recent years.
For the purpose of creating reproducible code for a published study, then, the ideal example would include a single code block written at the beginning of a document to query and download all of the sequence data for a project while also providing information about the format of the data. This would be a huge advance over what is commonly available today, which is typically just a verbal instruction to “go get the raw data” from an accession ID before you start. Instead, reproducible code should itself include the necessary code (and software) to download the data and make it ready for downstream analysis.
What about sra-tools?
For several years there has been a proprietary tool developed by NCBI for this purpose, called sra-tools. Historically, I avoided using it because it was atrociously difficult to access and install, and thus I didn’t want to have to provide complex installation instructions at the beginning of my own documents just so users could download data from a link. But, now that it’s 2017, like most good software the sra-tools package has become available through conda, making it easy enough for anyone to install and use. To install the sra-tools along with some associated tools for searching genome databases (called entrez-tools), simply run the following command with conda.
>>> conda install -c bioconda sra-tools >>> conda install -c bioconda entrez-direct
The key program in
sra-tools to download data from SRA is called
And the main tool in
entrez-tools is called
esearch, which queries meta-data
from NCBI. If you wanted to get the data from a single Run Accession (SRR)
you can download its fastq data with
fastq-dump and the ID:
>>> fastq-dump SRR1754715
If you want to download data for an entire project, however, you need to
do some complex bash scripting. For example, below we query the “Study
accession” (SRP) to extract its metadata and from that
we extract info using
cut to grab just the
“Run accessions (SRRs)”. It is a bit convoluted just to get the SRRs
associated with an SRP.
>>> esearch -db sra -query SRP021469 | efetch --format runinfo | cut -d ',' -f 1
1 Run SRR1754715 SRR1754720 SRR1754721 SRR1754722 SRR1754723 SRR1754724 SRR1754725 SRR1754726 SRR1754727 SRR1754728 SRR1754729 SRR1754730 SRR1754731
To pass the resulting SRRs to the
fastq-dump command to be downloaded
you could write something like the following, which I’ve broken
down into smaller bits so you can see what each call is doing
before piping to the next:
>>> esearch -db sra -query SRP021469 | \ ## search SRA for SRP021469 efetch --format runinfo | \ ## parse the runinfo cut -d ',' -f 1 | \ ## split on ',' & grab 1st thing grep SRR | \ ## grab lines starting with 'SRR' xargs fastq-dump --gzip \ ## run fastq-dump on all SRR #s --outdir fastqs/ \ ## gzip compress & put into dir
By default this will name files according to their SRR accession IDs. To rename them you can pass in other names, or grab them from the SRP metadata, which often includes the taxonomic names and collection IDs. For example, to rename the files based on their original collection IDs we’ll grab the 30th element from runinfo. The command below shows what this looks like for the example SRP accession.
>>> esearch -db sra -query SRP021469 | efetch --format runinfo | cut -d ',' -f 30
The following is printed to stdout:
30 SampleName 29154_superba 30556_thamno 41954_cyathophylloides 41478_cyathophylloides 39618_rex 40578_rex 38362_rex 35855_rex 33588_przewalskii 33413_thamno 32082_przewalskii 30686_cyathophylla 35236_rex
OK, so now we want to download the sequence data and rename the files
according to the
SampleName info that we parse from the metadata.
I broke the command into two separate calls to simplify it a bit.
We will pass the
--accession flag to
tells it we want to rename the file and as an argument to it
we provide the
## store elements 1 & 30 into a string/list >>> vars=$(esearch -db sra -query SRP021469 | efetch --format runinfo | cut -d ',' -f 1,30 | grep SRR) ## pass each pair of elements to fastq-dump >>> for var in $vars do SRR=$(echo $var | cut -d ',' -f 1); ACC=$(echo $var | cut -d ',' -f 2); fastq-dump $SRR --accession $ACC --outdir fastqs/ --gzip; done
We now have the resulting data files, success!
>>> ls -l fastqs/
-rw-rw-r-- 1 deren deren 42M Feb 17 20:00 29154_superba.fastq.gz -rw-rw-r-- 1 deren deren 86M Feb 17 20:00 30556_thamno.fastq.gz -rw-rw-r-- 1 deren deren 136M Feb 17 20:01 41954_cyathophylloides.fastq.gz -rw-rw-r-- 1 deren deren 128M Feb 17 20:02 41478_cyathophylloides.fastq.gz -rw-rw-r-- 1 deren deren 50M Feb 17 20:03 39618_rex.fastq.gz -rw-rw-r-- 1 deren deren 100M Feb 17 20:04 40578_rex.fastq.gz -rw-rw-r-- 1 deren deren 82M Feb 17 20:05 38362_rex.fastq.gz -rw-rw-r-- 1 deren deren 84M Feb 17 20:06 35855_rex.fastq.gz -rw-rw-r-- 1 deren deren 61M Feb 17 20:06 33588_przewalskii.fastq.gz -rw-rw-r-- 1 deren deren 40M Feb 17 20:06 33413_thamno.fastq.gz -rw-rw-r-- 1 deren deren 58M Feb 17 20:07 32082_przewalskii.fastq.gz -rw-rw-r-- 1 deren deren 80M Feb 17 20:08 30686_cyathophylla.fastq.gz -rw-rw-r-- 1 deren deren 108M Feb 17 20:09 35236_rex.fastq.gz
Problems with this approach
It’s not very clear/readable
If your goal is simply to attain a few fastq files it really seems like overkill to have to install a large suite of tools and then write a large complicated script that pipes the tools together just to download a few files from online. A simple
wget command could likely this in one line. While it is nice to have
the meta-data available, this approach does not make the meta-data
very easy to read.
The download location is hard-coded
A very annoying aspect of the sra-tools is that it downloads files in
format first and then converts from that format into
The reason for this is to avoid errors in the download if it interrupted.
But crazy enough it actually downloads these large sra files into
a new directory that it creates in
/home/user/ncbi/ without telling you that it’s
doing it, and then it leaves these multiple GB size files behind in that
directory. Crazy. This is most troublesome when working on HPC machines where your
home directory may not even have sufficient space to temporarily hold these files.
Changing the location to another directory, like a scratch space, is not at all
easy, and requires using another tool from sra-tools called
It’s ugly, really ugly, and one of the reasons I became interested in
developing a better way.
A simpler (Pythonic) way?
I do almost all of my work these days in Jupyter notebooks, and so I aimed to write a simple wrapper around sra-tools + entez-tools that would function in a Pythonic way, and overcome some of the problems listed above. This tool is available through conda and distributed with the ipyrad analysis toolkit.
## install ipyrad and associated tools >>> conda install -c ipyrad ipyrad >>> conda install -c eaton-lab toytree
First import the ipyrad analysis tools (renamed as
ipa) and then initiate
an sratools object with a Study accession ID (SRP).
You can also provide an argument for “workdir” which is the location
where your fastq files (and temporary .sra files) will be downloaded to,
and which will be created if it doesn’t yet exist.
## import ipyrad analysis tools >>> import ipyrad.analysis as ipa ## initalize an sratools object >>> sra = ipa.sratools(accession="SRP021469", workdir="rawdata")
By providing just the SRP ID you can now query information about the study and have it returned as a nice DataFrame. Below I request fields 1,4,6,29, and 30. The result is returned as a Pandas DataFrame object which is easy to read and manipulate. From these fields you can see the Run IDs, the number of reads (spots), the fact that the data are single-end (i.e., no “spots with mates”), the ScientificNames and the SampleName provided by the study authors.
## view meta-data for all fields or a subset of fields >>> sra.fetch_runinfo((1,4,6,29,30))
Run spots spots_with_mates ScientificName SampleName 0 SRR1754715 696994 0 Pedicularis superba 29154_superba 1 SRR1754720 1452316 0 Pedicularis thamnophila 30556_thamno 2 SRR1754721 2199613 0 Pedicularis cyathophylloides 41954_cyathophylloides 3 SRR1754722 2199740 0 Pedicularis cyathophylloides 41478_cyathophylloides 4 SRR1754723 822263 0 Pedicularis rex 39618_rex 5 SRR1754724 1707942 0 Pedicularis rex 40578_rex 6 SRR1754725 1391175 0 Pedicularis rex 38362_rex 7 SRR1754726 1409843 0 Pedicularis rex 35855_rex 8 SRR1754727 1002923 0 Pedicularis przewalskii 33588_przewalskii 9 SRR1754728 636625 0 Pedicularis thamnophila 33413_thamno 10 SRR1754729 964244 0 Pedicularis przewalskii 32082_przewalskii 11 SRR1754730 1253109 0 Pedicularis cyathophylla 30686_cyathophylla 12 SRR1754731 1803858 0 Pedicularis rex 35236_rex
To download the data now you simply need to use the
As with all other tools in the
ipyrad.analysis toolkit existing files
with the same name will not be overwritten unless you use the
argument, and the work can be optionally parallelized by providing an
ipyparallel Client object as an argument (
ipyrad docs for more details).
If the data are paired-end they will be automatically split into separate files
for R1 and R2. You can indicate the field you wish to use for file
names with the
name_fields= argument, or provide multiple fields to name
files with multiple values. For example, to name files by their ScientificName
+ Run ID separated by an underscore we would run the following, which will
print a progress bar while it runs:
>>> sra.run(name_fields=(30,1), name_separator="_")
[####################] 100% Downloading fastq files | 0:02:53 | 13 fastq files downloaded to /home/deren/rawdata
And you can see the files were written to the proper location and renamed as we wished.
>>> ls -lthr rawdata/
total 1.1G -rw-rw-r-- 1 deren deren 42M Aug 28 19:10 29154_superba_SRR1754715.fastq.gz -rw-rw-r-- 1 deren deren 51M Aug 28 19:10 39618_rex_SRR1754723.fastq.gz -rw-rw-r-- 1 deren deren 61M Aug 28 19:10 33588_przewalskii_SRR1754727.fastq.gz -rw-rw-r-- 1 deren deren 85M Aug 28 19:11 35855_rex_SRR1754726.fastq.gz -rw-rw-r-- 1 deren deren 59M Aug 28 19:11 32082_przewalskii_SRR1754729.fastq.gz -rw-rw-r-- 1 deren deren 41M Aug 28 19:11 33413_thamno_SRR1754728.fastq.gz -rw-rw-r-- 1 deren deren 129M Aug 28 19:11 41478_cyathophylloides_SRR1754722.fastq.gz -rw-rw-r-- 1 deren deren 101M Aug 28 19:12 40578_rex_SRR1754724.fastq.gz -rw-rw-r-- 1 deren deren 80M Aug 28 19:12 30686_cyathophylla_SRR1754730.fastq.gz -rw-rw-r-- 1 deren deren 86M Aug 28 19:12 30556_thamno_SRR1754720.fastq.gz -rw-rw-r-- 1 deren deren 109M Aug 28 19:12 35236_rex_SRR1754731.fastq.gz -rw-rw-r-- 1 deren deren 82M Aug 28 19:12 38362_rex_SRR1754725.fastq.gz -rw-rw-r-- 1 deren deren 136M Aug 28 19:12 41954_cyathophylloides_SRR1754721.fastq.gz
Et voilà. A huge benefit of this approach is actually hidden under the hood,
which is that it automatically temporarily sets the location for the
files to be downloaded to the “workdir” location, and we remove them right
after they are converted into .fastq files. Therefore you will never run
into problems with hidden sra files filling up your drive space, as is often
the case with sra-tools.
Another strength of this approach is that you can clearly see and show the full meta-data from which the files are being drawn, nicely displayed as a Pandas DataFrame. This is a great thing to stick right at the beginning of a reproducible science document, like a Jupyter-notebook, to show exactly where your data came from and how you renamed the files.