ipyrad -- interactive assembly and analysis of RAD-seq data

And, a primer on Python, Jupyter, and reproducible science

Skills in this course:

  • introduction to RAD-seq assembly
  • ipyrad command line (CLI)
  • ipyrad Python code (API)
  • introduction to Jupyter
  • introduction to parallel computing in Python

Introduction to RAD-seq assembly

  • Short reads (usually 50-150 bp), single-end or paired-end.
  • Loci usually align perfectly, rather than being tiled into contigs.
  • SNP data, including full sequence data.
  • Usually ~1e3 - 1e6 loci.
  • SNPs are phased within loci, but not between loci.
  • Anonymous (denovo) or spatially located (reference-mapped).

Available assembly software

  1. Standard reference-mapping approaches (BWA + Picard + GATK + ...)
  2. STACKS
  3. pyRAD
  4. TASSEL-UNEAK
  5. ipyrad

Advantages to using ipyrad over the other methods:

  1. Provides denovo, reference, and denovo-reference hybrid assembly methods
  2. Includes alignment steps to allow for indel variation
  3. Fast and massively parallelizable (hundreds/thousands of cores)
  4. Low memory footprint, e.g., compared to STACKS.
  5. Branching methods support reproducibility and exploring parameter settings
  6. Python API supports integration with Jupyter and scripting.

ipyrad online documentation

The ipyrad command-line (CLI)

An introduction to the ipyrad setup and parameter settings.

In [3]:
%%bash

ipyrad -n tutorial
  New file 'params-tutorial.txt' created in /home/deren/websites/eaton-lab/slides/MBL

In [6]:
%%bash

cat params-tutorial.txt
------- ipyrad params file (v.0.7.5)--------------------------------------------
tutorial                       ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20                         ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, v                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
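
Each line of the params file printed above follows the layout `value ## [index] [name]: description`. The sketch below reads that layout into a Python dictionary so that settings can be inspected from a script. It uses only the standard library and is not part of ipyrad: `parse_params` is a hypothetical helper written for illustration, and it assumes every parameter line contains `##` followed by the bracketed index and name, as in the v.0.7.5 output shown here.

```python
import re

# Matches the comment half of a params line, e.g.
# "[14] [clust_threshold]: Clustering threshold for de novo assembly"
_COMMENT = re.compile(r"\[(\d+)\]\s*\[(\w+)\]:\s*(.*)")


def parse_params(text):
    """Return {parameter_name: value_string} from the text of a params file.

    Hypothetical helper for illustration; values are kept as raw strings.
    """
    params = {}
    for line in text.splitlines():
        if "##" not in line:
            continue  # header line or blank line
        value, comment = line.split("##", 1)
        match = _COMMENT.match(comment.strip())
        if match is None:
            continue  # a comment that does not follow the expected layout
        params[match.group(2)] = value.strip()
    return params


# A few lines copied from the params file shown above.
example = """\
------- ipyrad params file (v.0.7.5)--------------------------------------------
tutorial                       ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
"""

params = parse_params(example)
print(params["assembly_name"])
print(float(params["clust_threshold"]))
```

Unset parameters, such as `raw_fastq_path` in a freshly created file, come back as empty strings. Values are left as strings because some parameters hold comma-separated lists and others hold numbers or paths.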

Phylogenetic gene/species tree inference

Liu et al. 2015