How can we most accurately reconstruct the
evolutionary history of organisms from their genomes?

How can genomics (i.e., evolutionary inference)
be used to reconstruct historical
ecological interactions among species?

Phylogenomic sampling


Characterize evolutionary history from a subset of sampled genomes (individuals).



Phylogenomic sampling


Characterize whole genomes from a subset of sequenced markers.


Genealogical variation


It is important to examine evolutionary history across the entire genome.

Historical introgression/admixture


It is important to examine evolutionary history across the entire genome.

How can we most accurately reconstruct the
evolutionary history of organisms from their genomes?

Software development

Phylogenomic inference methods

Missing data in RAD-seq and other methods

ipyrad-analysis toolkit

Filter or impute missing data; easily distribute massively parallel jobs.

 
  import ipyrad.analysis as ipa

  # initiate an analysis tool with arguments
  tool = ipa.pca(data=data, ...)

  # run job (distribute in parallel)
  tool.run()

  # examine results
  ...
                    

 
  tool = ipa.pca(data=data, imap=imap, minmap=minmap, mincov=0.5)
  tool.run(nreplicates=20, seed=123)
  tool.draw(2, 3);
                    

Window_extracter: extract, filter, format.

Reference mapped RAD loci can be "spatially binned" to form larger loci.

 
  import ipyrad.analysis as ipa

  # initiate an analysis tool with arguments
  tool = ipa.window_extracter(
      data=data,
      scaffold_idx=0, 
      start=0, 
      end=1000000,
  )

  # writes a phylip file
  tool.run()
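
The idea of spatial binning can be illustrated in a few lines of plain Python. This is only a sketch of the concept, not how window_extracter works internally: the function name `bin_loci`, the `(scaffold, start, end)` tuple layout, and the rule of assigning each locus to a window by its start coordinate are all assumptions made for illustration.

```python
from collections import defaultdict


def bin_loci(loci, window_size):
    """Group reference-mapped loci into fixed-size windows per scaffold.

    loci: iterable of (scaffold_idx, start, end) tuples (hypothetical layout).
    Returns a dict mapping (scaffold_idx, window_idx) to a list of (start, end).
    """
    bins = defaultdict(list)
    for scaffold, start, end in loci:
        # assumption: a locus belongs to the window containing its start
        bins[(scaffold, start // window_size)].append((start, end))
    return dict(bins)


# toy coordinates, invented for the example
loci = [
    (0, 100, 250),
    (0, 900_000, 900_150),
    (0, 1_200_000, 1_200_140),
    (1, 50, 200),
]
windows = bin_loci(loci, window_size=1_000_000)
for key in sorted(windows):
    print(key, windows[key])
```

Each window then holds several short loci that could be concatenated into one larger alignment.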
                    

Herbicide resistance among Amaranthus species.


Phylogenomic methods: gene trees

Gene tree inference


Faces two distinct problems:
(1) individual loci (e.g., RAD tags) are often insufficiently informative;
(2) most "loci" do not contain a single gene tree.

But is it a good idea to concat-alesce?

Data from different parts of the genome have different coalescent histories because recombination gives them different sampled ancestors. Nearby (linked) regions share more ancestors and are therefore correlated, but this correlation also decays over time. How large are real loci that share a single gene tree?

We can investigate this question using simulations (e.g., msprime and ipcoal).
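
A rough expectation to compare simulations against can be worked out by hand. The sketch below assumes a single panmictic diploid population of constant size, where the expected total tree length for n samples is 4Ne(1 + 1/2 + ... + 1/(n-1)) generations, and multiplies it by the per-site recombination rate and the locus length. The function name `expected_breakpoints` is hypothetical. The values Ne=5e5, recomb=1e-9, and 500 sites are borrowed from the ipcoal example in this deck.

```python
def expected_breakpoints(nsamples, Ne, recomb, nsites):
    """Expected number of recombination events in the history of a sample.

    Assumes one constant-size diploid population, so the expected total
    tree length is 4 * Ne * sum(1 / i for i in 1..n-1) generations.
    """
    harmonic = sum(1.0 / i for i in range(1, nsamples))
    total_length = 4.0 * Ne * harmonic  # in generations
    return recomb * nsites * total_length


for n in (2, 4, 8):
    events = expected_breakpoints(n, Ne=5e5, recomb=1e-9, nsites=500)
    print(n, round(events, 2))
```

Under these assumptions, two sampled genomes already give an expectation of about one recombination event within a 500 bp locus. Not every event changes the tree topology, and a species tree adds deeper branches than this single-population approximation, so treat the number as an order-of-magnitude guide only.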

ipcoal: simulate genealogies and loci in spp. trees

At any reasonable phylogenetic scale the size of non-recombined loci is small!

 
  import ipcoal

  # simulate data with demographic parameters
  model = ipcoal.Model(tree=newick, Ne=5e5, mut=1e-8, recomb=1e-9)

  # simulate n loci of a given length
  model.sim_loci(nloci=1, nsites=500)
                    

Phylogenomic methods: SNPs

The Phylogenetic invariants framework


Phylogenetic invariants are SNP patterns that are expected to occur in equal proportions given the existence of a split/edge on a tree (e.g., AAGG = GGAA). Some invariant patterns have been of particular interest as a metric for quantifying admixture (e.g., BABA, ABBA).
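
To make the ABBA/BABA patterns concrete, here is a minimal sketch that counts them in a four-taxon alignment and computes the D-statistic, D = (ABBA - BABA) / (ABBA + BABA), following Durand et al. (2011). The function name `d_statistic` and the toy sequences are invented for illustration. The taxa are assumed to be ordered P1, P2, P3, outgroup, with the outgroup carrying the ancestral allele.

```python
def d_statistic(p1, p2, p3, out):
    """Count ABBA and BABA site patterns and return (abba, baba, D)."""
    valid = set("ACGT")
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, out):
        # skip sites with missing or ambiguous bases
        if not {a, b, c, o} <= valid:
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    total = abba + baba
    d = (abba - baba) / total if total else float("nan")
    return abba, baba, d


# toy alignment: two ABBA sites and one BABA site
abba, baba, d = d_statistic(
    p1="AAGTCC",
    p2="GAGTTA",
    p3="GAATTC",
    out="AAGCCA",
)
print(abba, baba, round(d, 3))
```

With no admixture the two patterns are expected in equal proportions, so D is expected to be near zero. A consistent excess of one pattern is the signal of interest.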

Identifying invariant patterns for a tree can be difficult, but is easy for small trees. Most invariant methods focus on 4-taxon trees.

The SVDquartets species tree inference method organizes SNP patterns into a matrix where algebraic symmetries make it easy to identify the correct unrooted topology for any four taxa.
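
The matrix idea can be sketched with NumPy. The example computes exact expected site-pattern probabilities for the tree ((A,B),(C,D)) under a simple Jukes-Cantor-style model, flattens them into a 16 x 16 matrix for each of the three possible splits, and scores each split by how far the matrix is from rank 10. Several things here are assumptions rather than the SVDquartets implementation: the toy model uses one fixed tree with no coalescent variation, the helper names (`jc_matrix`, `pattern_probs`, `svd_score`) and substitution probabilities are invented, and the rank-10 cutoff reflects my understanding of the published score.

```python
import numpy as np


def jc_matrix(p):
    """4x4 transition matrix where a base changes with total probability p."""
    return np.full((4, 4), p / 3) + np.eye(4) * (1 - p - p / 3)


def pattern_probs(p_tip=0.1, p_internal=0.2):
    """Expected pattern tensor P[a, b, c, d] for the tree ((A,B),(C,D))."""
    tip = jc_matrix(p_tip)
    internal = jc_matrix(p_internal)
    pi = np.full(4, 0.25)
    # x is the state at the ancestor of A and B, y at the ancestor of C and D
    return np.einsum("x,xa,xb,xy,yc,yd->abcd", pi, tip, tip, internal, tip, tip)


def svd_score(tensor, axes):
    """Distance of the 16x16 flattening from the nearest rank-10 matrix."""
    flat = np.transpose(tensor, axes).reshape(16, 16)
    singular_values = np.linalg.svd(flat, compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[10:] ** 2)))


splits = {"AB|CD": (0, 1, 2, 3), "AC|BD": (0, 2, 1, 3), "AD|BC": (0, 3, 1, 2)}
tensor = pattern_probs()
scores = {name: svd_score(tensor, axes) for name, axes in splits.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

The split with the lowest score is taken as the inferred unrooted topology. With real SNP counts the score of the true split would be small but not zero because of sampling noise.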

SVDquartets reduces SNP matrices to categorical results


Shortcoming of the SVDquartets approach

By reducing quantitative SNP frequency data to a categorical result, information relevant for inferring admixture is lost (e.g., ABBA-BABA; Durand et al. 2011).

Moving beyond ABBA-BABA

Higher-level invariants (e.g., 5-taxon patterns) also exist and can provide even more information about admixture (Eaton et al. 2012, 2015). A difficulty in applying 4- or 5-taxon tests to larger trees, however, is that summarizing the results of many non-independent tests has not been automated.

Stacked matrices for inferring admixture edges


Stacked count matrices

Unique fingerprint for different admixture scenarios
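
One way to picture a stacked count matrix is sketched below: the 256 possible site patterns for four taxa are counted into a 4 x 4 x 4 x 4 array, which is then flattened into one 16 x 16 matrix per split and stacked. This is an illustrative assumption about the data layout, not simcat's actual code. The function name `stacked_counts`, the ordering of the three arrangements, and the toy sequences are hypothetical.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def stacked_counts(seqs):
    """Return a (3, 16, 16) array of site-pattern counts for four taxa.

    seqs: four equal-length strings for taxa (a, b, c, d). Sites with
    any base outside A/C/G/T are skipped.
    """
    tensor = np.zeros((4, 4, 4, 4), dtype=int)
    for column in zip(*seqs):
        if all(base in BASES for base in column):
            tensor[tuple(BASES[base] for base in column)] += 1
    # rows/columns index the taxon pairs ab|cd, ac|bd, ad|bc
    arrangements = [(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2)]
    return np.stack([tensor.transpose(axes).reshape(16, 16) for axes in arrangements])


# toy data: one invariant site, one AACC site, one site with missing data
stack = stacked_counts(["AAG", "AAG", "ACG", "ACN"])
print(stack.shape, stack.sum(axis=(1, 2)))
```

The same SNPs land in different cells of each layer, so the three layers together record how the counts are distributed relative to each possible split.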

simcat: inference of admixture edges from machine
learning on phylogenetic invariants


ExtraTrees Classifier (scikit-learn) is trained on simulated SNP count data and invariant features extracted from stacked matrices: ABBA-BABA (Durand et al. 2011); Hils statistics (Kubatko and Chifman 2019).

The user inputs a species tree estimate to simcat, which tests all admixture edge placements and simulates over variable edge lengths and demographic parameters.
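
The train-on-simulations workflow can be sketched with scikit-learn's `ExtraTreesClassifier`. Everything about the data here is a stand-in: instead of coalescent simulations over a species tree, the toy draws multinomial counts of three site patterns (BBAA, ABBA, BABA) for two made-up classes, one with ABBA and BABA in equal proportions and one with an ABBA excess. The pattern probabilities, sample sizes, and variable names are assumptions for illustration and do not come from simcat.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
NSNPS = 2000  # SNPs per simulated dataset (toy value)
NREPS = 200   # simulated datasets per class (toy value)

# hypothetical pattern probabilities for (BBAA, ABBA, BABA)
probs_no_admix = [0.6, 0.2, 0.2]
probs_admix = [0.5, 0.3, 0.2]

counts = np.vstack([
    rng.multinomial(NSNPS, probs_no_admix, size=NREPS),
    rng.multinomial(NSNPS, probs_admix, size=NREPS),
])
features = counts / NSNPS          # relative SNP pattern frequencies
labels = np.repeat([0, 1], NREPS)  # 0 = no admixture, 1 = admixture

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In the approach described above, each class would instead correspond to one candidate admixture edge placement on the input species tree, and the features would come from the stacked matrices and extracted invariants.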


Patrick McKenzie
Eaton lab PhD student

simcat: machine learning on phylogenetic invariants

Dimensionality reduction (t-SNE) shows how simulations with different admixture edges fall into distinct clusters differentiated by their relative SNP count frequencies and extracted features. Variation in edge lengths and Ne have very little effect since the informative features are invariants.

simcat: machine learning on phylogenetic invariants

The classifier infers the correct placement of admixture edges with >95% accuracy using only 20K SNPs and <1 day of training for small trees (<= 8 taxa).

The strengths of this approach are both practical and theoretical


In practice: SNPs are easier to obtain than informative gene trees, and thousands of genome-wide SNPs are a better data type for detecting introgression than a few dozen or a few hundred gene trees, since introgression may be present in only a small fraction of the genome.

In theory: informative gene trees do not exist for distantly related taxa (concat-alescence), which can bias gene-tree-based methods. Only SNPs accurately reflect distinct genealogical histories.

Coming soon: Software tool for application to empirical data.

How can we use genomics to reconstruct historical
ecological interactions among species?

Floral diversity in Pedicularis

Pedicularis L. in China

Species rich:
>600 species worldwide, approximately 300 endemic to Hengduan.
We collected >60 species from 100 locations in 2018.

Morphologically diverse:
Spectacular floral diversity and abundant homoplasy; similar forms have evolved repeatedly.

Complex history of assembly:
Mountain uplift over millions of years, glacial cycles over thousands of years, and river and mountain barriers lead to constantly shuffling communities (and species interactions).

Reproductive interference

Negative fitness consequences imposed by one organism on another by disrupting successful reproduction.

Morphological terminology

The beak of the galea directs pollen placement and pickup

Elongate styles

Elongate styles have evolved multiple times (Ree 2005) and facilitate pollen competition among species (Tong and Huang 2016).

Reproductive character displacement

Hypothesis: Differences among populations (within species) are a result of interspecific interactions driving character displacement in local communities.

Experiment: Does style length variation affect gene flow?

Larger pollen grows faster and further, consistent with 10X higher migration from long to short style populations inferred from genomic data.

Acknowledgements

Richard Ree
Dave Boufford
Huang Shuang-Quan
De-Zhu Li
Patrick McKenzie
Jared Meek
Kasi Molina-Velez
Sandra Hoffberg
Isaac Overcast

NSF DEB
NSF DDIG
NSF EAPSI
Columbia Lenfest Award