Accessing sequence data for reproducible science

Accessing sequence data for reproducible science

Diving into the SRA abyss

Access to data remains a major hurdle for reproducibility in science despite the increased availability of large-scale repositories for online data storage, and even journal policies that require data archiving. In this post I’ll try to make an argument for archiving data in the SRA through demonstration of the sratools wrapper from the ipyrad.analysis toolkit, which makes it easy and elegant to download data from the SRA when working in Python/Jupyter.

There are many reasons not to upload data to the SRA and to instead dump it on a generic archive like DRYAD, chief among them being that uploading data to SRA is super time-consuming and difficult. It requires entering pages upon pages of meta-data by hand into arcane web forms or spread-sheets to define numerous objects that are continually referenced by redundant names or prefix tags (e.g., SUB, SAMN, SRX, SRP, PRJNA, BioSamples and BioProjects), and which have a relational structure that defies understanding (e.g., 1 SRA can have 4 SRXs which produce data for 96 SRRs from 96 SRSs for 4 SRPs; See table below, which I reference frequently when trying to understand this stuff.) And so it should be no surprise that people often forgo data archiving.

Prefix Accession Name Definition
SRX Experiment Metadata about library, platform, selection.
SRR Run The actual sequence data for an experiment.
SRP Study Metadata about project (BioProject).
SRS Sample Metadata about the physical Sample (BioSample)
SRZ Analysis Mapped/aligned reads file (BAM) & metadata.
SRA Submission Metadata of other 5 linked objects.

Why use the SRA?

The distinguishing advantage of using the SRA, however, is that grouping data with its associated meta-data allows later users to write scripts to access the data and properly format it for downstream analyses based on the information encoded in the meta-data (e.g., replicate samples are present from multiple lanes of sequencing, or paired reads need to be split into separate files). The idea of reusing genomic data seems obvious, but currently seems less appreciated than for previous types of data like Sanger sequences, which have been fairly uniformly archived in Genbank. Perhaps this is because data generation has been highly incentivized over data reuse in recent years.

For the purpose of creating reproducible code for a published study, then, the ideal example would include a single code block written at the beginning of a document to query and download all of the sequence data for a project while also providing information about the format of the data. This would be a huge advance over what is commonly available today, which is typically just a verbal instruction to “go get the raw data” from an accession ID before you start. Instead, reproducible code should itself include the necessary code (and software) to download the data and make it ready for downstream analysis.

What about sra-tools?

For several years there has been a proprietary tool developed by NCBI for this purpose, called sra-tools. Historically, I avoided using it because it was atrociously difficult to access and install, and thus I didn’t want to have to provide complex installation instructions at the beginning of my own documents just so users could download data from a link. But, now that it’s 2017, like most good software the sra-tools package has become available through conda, making it easy enough for anyone to install and use. To install the sra-tools along with some associated tools for searching genome databases (called entrez-tools), simply run the following command with conda.

>>> conda install -c bioconda sra-tools
>>> conda install -c bioconda entrez-direct

The key program in sra-tools to download data from SRA is called fastq-dump. And the main tool in entrez-tools is called esearch, which queries meta-data from NCBI. If you wanted to get the data from a single Run Accession (SRR) you can download its fastq data with fastq-dump and the ID:

>>> fastq-dump SRR1754715

If you want to download data for an entire project, however, you need to do some complex bash scripting. For example, below we query the “Study accession” (SRP) to extract its metadata and from that we extract info using efetch and cut to grab just the “Run accessions (SRRs)”. It is a bit convoluted just to get the SRRs associated with an SRP.

>>> esearch -db sra -query SRP021469 | efetch --format runinfo | cut -d ',' -f 1

To pass the resulting SRRs to the fastq-dump command to be downloaded you could write something like the following, which I’ve broken down into smaller bits so you can see what each call is doing before piping to the next:

>>> esearch -db sra -query SRP021469 |      \  ## search SRA for SRP021469
      efetch --format runinfo |             \  ## parse the runinfo
      cut -d ',' -f 1 |                     \  ## split on ',' & grab 1st thing
      grep SRR |                            \  ## grab lines starting with 'SRR'
      xargs fastq-dump --gzip               \  ## run fastq-dump on all SRR #s
                       --outdir fastqs/     \  ## gzip compress & put into dir


By default this will name files according to their SRR accession IDs. To rename them you can pass in other names, or grab them from the SRP metadata, which often includes the taxonomic names and collection IDs. For example, to rename the files based on their original collection IDs we’ll grab the 30th element from runinfo. The command below shows what this looks like for the example SRP accession.

>>> esearch -db sra -query SRP021469 | efetch --format runinfo | cut -d ',' -f 30

The following is printed to stdout:


OK, so now we want to download the sequence data and rename the files according to the SampleName info that we parse from the metadata. I broke the command into two separate calls to simplify it a bit. We will pass the --accession flag to fastq-dump which tells it we want to rename the file and as an argument to it we provide the SampleName.

## store elements 1 & 30 into a string/list
>>> vars=$(esearch -db sra -query SRP021469 |  
           efetch --format runinfo |           
           cut -d ',' -f 1,30 |                  
           grep SRR)         

## pass each pair of elements to fastq-dump
>>> for var in $vars
        SRR=$(echo $var | cut -d ',' -f 1);
        ACC=$(echo $var | cut -d ',' -f 2);
        fastq-dump $SRR --accession $ACC --outdir fastqs/ --gzip;

We now have the resulting data files, success!

>>> ls -l fastqs/
-rw-rw-r-- 1 deren deren  42M Feb 17 20:00 29154_superba.fastq.gz
-rw-rw-r-- 1 deren deren  86M Feb 17 20:00 30556_thamno.fastq.gz
-rw-rw-r-- 1 deren deren 136M Feb 17 20:01 41954_cyathophylloides.fastq.gz
-rw-rw-r-- 1 deren deren 128M Feb 17 20:02 41478_cyathophylloides.fastq.gz
-rw-rw-r-- 1 deren deren  50M Feb 17 20:03 39618_rex.fastq.gz
-rw-rw-r-- 1 deren deren 100M Feb 17 20:04 40578_rex.fastq.gz
-rw-rw-r-- 1 deren deren  82M Feb 17 20:05 38362_rex.fastq.gz
-rw-rw-r-- 1 deren deren  84M Feb 17 20:06 35855_rex.fastq.gz
-rw-rw-r-- 1 deren deren  61M Feb 17 20:06 33588_przewalskii.fastq.gz
-rw-rw-r-- 1 deren deren  40M Feb 17 20:06 33413_thamno.fastq.gz
-rw-rw-r-- 1 deren deren  58M Feb 17 20:07 32082_przewalskii.fastq.gz
-rw-rw-r-- 1 deren deren  80M Feb 17 20:08 30686_cyathophylla.fastq.gz
-rw-rw-r-- 1 deren deren 108M Feb 17 20:09 35236_rex.fastq.gz

Problems with this approach

It’s not very clear/readable
If your goal is simply to attain a few fastq files it really seems like overkill to have to install a large suite of tools and then write a large complicated script that pipes the tools together just to download a few files from online. A simple wget command could likely this in one line. While it is nice to have the meta-data available, this approach does not make the meta-data very easy to read.

The download location is hard-coded
A very annoying aspect of the sra-tools is that it downloads files in .sra format first and then converts from that format into .fastq fiels. The reason for this is to avoid errors in the download if it interrupted. But crazy enough it actually downloads these large sra files into a new directory that it creates in /home/user/ncbi/ without telling you that it’s doing it, and then it leaves these multiple GB size files behind in that directory. Crazy. This is most troublesome when working on HPC machines where your home directory may not even have sufficient space to temporarily hold these files. Changing the location to another directory, like a scratch space, is not at all easy, and requires using another tool from sra-tools called vdb-config. It’s ugly, really ugly, and one of the reasons I became interested in developing a better way.

A simpler (Pythonic) way?

I do almost all of my work these days in Jupyter notebooks, and so I aimed to write a simple wrapper around sra-tools + entez-tools that would function in a Pythonic way, and overcome some of the problems listed above. This tool is available through conda and distributed with the ipyrad analysis toolkit.

## install ipyrad and associated tools
>>> conda install -c ipyrad ipyrad
>>> conda install -c eaton-lab toytree

First import the ipyrad analysis tools (renamed as ipa) and then initiate an sratools object with a Study accession ID (SRP). You can also provide an argument for “workdir” which is the location where your fastq files (and temporary .sra files) will be downloaded to, and which will be created if it doesn’t yet exist.

## import ipyrad analysis tools
>>> import ipyrad.analysis as ipa

## initalize an sratools object
>>> sra = ipa.sratools(accession="SRP021469", workdir="rawdata")

By providing just the SRP ID you can now query information about the study and have it returned as a nice DataFrame. Below I request fields 1,4,6,29, and 30. The result is returned as a Pandas DataFrame object which is easy to read and manipulate. From these fields you can see the Run IDs, the number of reads (spots), the fact that the data are single-end (i.e., no “spots with mates”), the ScientificNames and the SampleName provided by the study authors.

## view meta-data for all fields or a subset of fields
>>> sra.fetch_runinfo((1,4,6,29,30))
           Run    spots spots_with_mates                ScientificName              SampleName  
0   SRR1754715   696994                0           Pedicularis superba           29154_superba  
1   SRR1754720  1452316                0       Pedicularis thamnophila            30556_thamno  
2   SRR1754721  2199613                0  Pedicularis cyathophylloides  41954_cyathophylloides  
3   SRR1754722  2199740                0  Pedicularis cyathophylloides  41478_cyathophylloides  
4   SRR1754723   822263                0               Pedicularis rex               39618_rex  
5   SRR1754724  1707942                0               Pedicularis rex               40578_rex  
6   SRR1754725  1391175                0               Pedicularis rex               38362_rex  
7   SRR1754726  1409843                0               Pedicularis rex               35855_rex  
8   SRR1754727  1002923                0       Pedicularis przewalskii       33588_przewalskii  
9   SRR1754728   636625                0       Pedicularis thamnophila            33413_thamno  
10  SRR1754729   964244                0       Pedicularis przewalskii       32082_przewalskii  
11  SRR1754730  1253109                0      Pedicularis cyathophylla      30686_cyathophylla  
12  SRR1754731  1803858                0               Pedicularis rex               35236_rex

To download the data now you simply need to use the .run() command. As with all other tools in the ipyrad.analysis toolkit existing files with the same name will not be overwritten unless you use the force=True argument, and the work can be optionally parallelized by providing an ipyparallel Client object as an argument (ipyclient=; see ipyrad docs for more details). If the data are paired-end they will be automatically split into separate files for R1 and R2. You can indicate the field you wish to use for file names with the name_fields= argument, or provide multiple fields to name files with multiple values. For example, to name files by their ScientificName + Run ID separated by an underscore we would run the following, which will print a progress bar while it runs:

>>>,1), name_separator="_")
[####################] 100%  Downloading fastq files | 0:02:53 |  
13 fastq files downloaded to /home/deren/rawdata

And you can see the files were written to the proper location and renamed as we wished.

>>> ls -lthr rawdata/
total 1.1G
-rw-rw-r-- 1 deren deren  42M Aug 28 19:10 29154_superba_SRR1754715.fastq.gz
-rw-rw-r-- 1 deren deren  51M Aug 28 19:10 39618_rex_SRR1754723.fastq.gz
-rw-rw-r-- 1 deren deren  61M Aug 28 19:10 33588_przewalskii_SRR1754727.fastq.gz
-rw-rw-r-- 1 deren deren  85M Aug 28 19:11 35855_rex_SRR1754726.fastq.gz
-rw-rw-r-- 1 deren deren  59M Aug 28 19:11 32082_przewalskii_SRR1754729.fastq.gz
-rw-rw-r-- 1 deren deren  41M Aug 28 19:11 33413_thamno_SRR1754728.fastq.gz
-rw-rw-r-- 1 deren deren 129M Aug 28 19:11 41478_cyathophylloides_SRR1754722.fastq.gz
-rw-rw-r-- 1 deren deren 101M Aug 28 19:12 40578_rex_SRR1754724.fastq.gz
-rw-rw-r-- 1 deren deren  80M Aug 28 19:12 30686_cyathophylla_SRR1754730.fastq.gz
-rw-rw-r-- 1 deren deren  86M Aug 28 19:12 30556_thamno_SRR1754720.fastq.gz
-rw-rw-r-- 1 deren deren 109M Aug 28 19:12 35236_rex_SRR1754731.fastq.gz
-rw-rw-r-- 1 deren deren  82M Aug 28 19:12 38362_rex_SRR1754725.fastq.gz
-rw-rw-r-- 1 deren deren 136M Aug 28 19:12 41954_cyathophylloides_SRR1754721.fastq.gz

Et voilà. A huge benefit of this approach is actually hidden under the hood, which is that it automatically temporarily sets the location for the .sra files to be downloaded to the “workdir” location, and we remove them right after they are converted into .fastq files. Therefore you will never run into problems with hidden sra files filling up your drive space, as is often the case with sra-tools.

Another strength of this approach is that you can clearly see and show the full meta-data from which the files are being drawn, nicely displayed as a Pandas DataFrame. This is a great thing to stick right at the beginning of a reproducible science document, like a Jupyter-notebook, to show exactly where your data came from and how you renamed the files.