Characterize evolutionary history from a subset of sampled genomes (individuals).
Characterize whole genomes from a subset of sequenced markers.
It is important to examine evolutionary history across the entire genome.
Introgression is common throughout the history of many lineages.
But there are many potential sources of error in these analyses...
Filter or impute missing data; easily distribute massively parallel jobs.
import ipyrad.analysis as ipa
# initiate an analysis tool with arguments
tool = ipa.pca(data=data, ...)
# run job (distribute in parallel)
tool.run()
# examine results
...
tool = ipa.pca(data=data, imap=imap, minmap=minmap, mincov=0.5)
tool.run(nreplicates=20, seed=123)
tool.draw(2, 3);
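The `mincov=0.5` argument above asks for SNPs that have data in at least half of the samples. The sketch below illustrates that kind of coverage filter using only the standard library. It is not ipyrad's implementation: the function name `filter_by_coverage`, the toy input, and the choice of `N` and `-` as missing-data codes are all assumptions made for illustration.

```python
def filter_by_coverage(snps, mincov=0.5, missing=("N", "-")):
    """Keep SNP columns where at least `mincov` of samples have a called base.

    `snps` maps sample name -> string of SNP calls (all the same length).
    Returns the filtered alignment and the indices of the retained sites.
    """
    names = sorted(snps)
    nsites = len(snps[names[0]])
    keep = []
    for site in range(nsites):
        # count samples with a called (non-missing) base at this site
        called = sum(snps[name][site] not in missing for name in names)
        if called / len(names) >= mincov:
            keep.append(site)
    filtered = {name: "".join(snps[name][i] for i in keep) for name in names}
    return filtered, keep


# hypothetical toy data: 4 samples x 4 SNPs
snps = {
    "a": "ACNT",
    "b": "ANNT",
    "c": "GCNN",
    "d": "GNNT",
}
filtered, kept = filter_by_coverage(snps, mincov=0.5)
```

Raising `mincov` trades the number of retained SNPs against the amount of missing data left to impute.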
Reference mapped RAD loci can be "spatially binned" to form larger loci.
import ipyrad.analysis as ipa
# initiate an analysis tool with arguments
tool = ipa.window_extracter(
    data=data,
    scaffold_idx=0,
    start=0,
    end=1000000,
)
# writes a phylip file
tool.run()
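To make the idea of "spatial binning" concrete, here is a minimal sketch that groups reference-mapped loci into fixed-size windows along each scaffold. It does not reproduce how `window_extracter` works internally. The function `bin_loci`, the `(scaffold, start, end)` tuple layout, and the example coordinates are hypothetical.

```python
from collections import defaultdict


def bin_loci(loci, window_size):
    """Group mapped loci into fixed-size windows along each scaffold.

    `loci` is an iterable of (scaffold, start, end) tuples. A locus is
    assigned to the window that contains its start position.
    """
    windows = defaultdict(list)
    for scaffold, start, end in loci:
        window_start = (start // window_size) * window_size
        windows[(scaffold, window_start)].append((start, end))
    return dict(windows)


# hypothetical mapped RAD loci
loci = [
    (0, 1200, 1450),
    (0, 480000, 480300),
    (0, 1000500, 1000800),
    (1, 700, 950),
]
windows = bin_loci(loci, window_size=1_000_000)
```

Each window's loci could then be concatenated into one larger alignment, which is the sense in which binned RAD loci "form larger loci".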
Faces two distinct problems:
(1) individual loci (e.g., RAD tags) are often insufficiently informative;
(2) most "loci" do not contain a single gene tree (even short RAD tags).
We have a recent preprint on this topic: phylogenetic half-life
At any reasonable phylogenetic scale the size of non-recombined loci is small!
import ipcoal
# simulate data with demographic parameters
model = ipcoal.Model(tree=newick, Ne=5e5, mut=1e-8, recomb=1e-9)
# simulate n loci of a given length
model.sim_loci(nloci=1, nsites=500)
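A back-of-the-envelope calculation shows why non-recombined loci are expected to be small with the parameters used above. The sketch below relies on a standard coalescent result: for n sampled gene copies in a single diploid population, the expected total branch length is 4Ne(1 + 1/2 + ... + 1/(n-1)) generations. It ignores the species tree structure entirely, and the sample size of 8 is an assumption, so treat the number as a rough guide rather than a prediction for any particular `ipcoal` simulation.

```python
def expected_total_tree_length(nsamples, Ne):
    """Expected total branch length (in generations) of a Kingman
    coalescent genealogy for `nsamples` gene copies in one diploid
    population of effective size `Ne`."""
    return 4 * Ne * sum(1 / i for i in range(1, nsamples))


def expected_recombination_events(nsamples, Ne, recomb, nsites):
    """Approximate expected number of recombination events in the
    history of a locus of `nsites` sites (per-site rate `recomb`)."""
    return recomb * nsites * expected_total_tree_length(nsamples, Ne)


# same Ne, recomb, and nsites as the ipcoal call above; nsamples=8 is assumed
events = expected_recombination_events(nsamples=8, Ne=5e5, recomb=1e-9, nsites=500)
```

Under these assumptions `events` is about 2.6, so even a 500 bp locus is expected to span more than one genealogy.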
Phylogenetic invariants are SNP patterns that are expected to occur in equal proportions given the existence of a split/edge on a tree (e.g., AAGG = GGAA). Some invariant patterns have been of particular interest as a metric for quantifying admixture (e.g., BABA, ABBA).
Identifying invariant patterns for a tree can be difficult, but is easy for small trees. Most invariant methods focus on 4-taxon trees.
The SVDquartets species tree inference method organizes SNP patterns into a matrix where algebraic symmetries make it easy to identify the correct unrooted topology for any four taxa.
Reducing the quantitative SNP frequency data to a categorical result (a single topology) discards information relevant for inferring admixture (e.g., ABBA-BABA; Durand et al. 2011).
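The ABBA-BABA test (Durand et al. 2011) compares the counts of two site patterns on a four-taxon tree (((P1, P2), P3), O) using D = (ABBA - BABA) / (ABBA + BABA). Without admixture the two patterns are expected in equal proportions, so D is expected to be near zero. The sketch below counts the patterns from aligned sequences. The function names and the toy sequences are hypothetical, and no significance test (e.g., a block jackknife) is included.

```python
from collections import Counter


def count_site_patterns(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites across four aligned sequences.

    The outgroup allele is treated as ancestral ('A'); sites with
    missing or ambiguous bases in any taxon are skipped.
    """
    counts = Counter()
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        if a == o and b == c and b != o:
            counts["ABBA"] += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            counts["BABA"] += 1  # P1 and P3 share the derived allele
    return counts


def d_statistic(counts):
    """Patterson's D from ABBA and BABA counts."""
    abba, baba = counts["ABBA"], counts["BABA"]
    if abba + baba == 0:
        return float("nan")
    return (abba - baba) / (abba + baba)


# hypothetical toy alignment (7 sites)
p1 = "AACGTTA"
p2 = "GACTTAA"
p3 = "GTCTCTA"
out = "AACGCAN"
counts = count_site_patterns(p1, p2, p3, out)
d = d_statistic(counts)
```

A positive D indicates an excess of allele sharing between P2 and P3, a negative D between P1 and P3.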
Higher-level invariants (e.g., 5-taxon patterns) also exist and can provide even more information about admixture (Eaton et al. 2012, 2015). A difficulty in applying 4- or 5-taxon tests to larger trees, however, is that summarizing the results of many non-independent tests has not been automated and quickly becomes unwieldy.
Unique fingerprint for different admixture scenarios
ExtraTrees Classifier (scikit-learn) is trained on simulated SNP count data and invariant features extracted from stacked matrices: ABBA-BABA (Durand et al. 2011); Hils statistics (Kubatko and Chifman 2019).
The user inputs a species tree estimate to simcat, which tests all admixture edge placements and simulates over variable edge lengths and demographic parameters.
Patrick McKenzie
Eaton lab PhD student
The classifier is able to infer the correct placement of admixture edges at >95% accuracy with only 20K SNPs and <1 day of training for smallish trees (<= 8 taxa).
In practice: SNPs are easier to obtain than informative gene trees. And millions of SNPs may be able to detect more subtle patterns than hundreds of genes.
In theory: informative gene trees do not exist for distantly related taxa (concatalescence). And phylogenetic invariants assume a general Markov model that does not require optimizing parameters (even branch lengths!). It is a topology test.
Coming soon: Software tool for application to empirical data.
Species rich:
>600 species worldwide, approximately 300 endemic to Hengduan.
We collected >100 species (>2100 specimens) from >300 locations in 2018-2019.
Morphologically diverse:
Spectacular floral diversity and abundant homoplasy;
similar forms have evolved repeatedly (Ree 2005)
Complex history of assembly:
Mountain uplift over millions of years, glacial cycles over thousands of years, and river and mountain barriers lead to constantly shuffling communities (and species interactions).
Negative fitness consequences imposed by one organism on another by disrupting successful reproduction.
Phenotypic overdispersion (limiting similarity) and phylogenetic randomness (homoplasy) in assembly of Pedicularis species into communities (Eaton & Ree 2012).
Divergent selection drives greater differences between populations in sympatry than in allopatry (e.g., benthic/limnetic sticklebacks) to reduce competition for limited resources.
The difficulty for Pedicularis is that there are so many species that each interacts with: who is the focal competitor? We need a community model of character displacement.
Does interspecific competition/interference drive floral divergence?
Is floral divergence associated with genetic divergence/speciation?
Elongate styles have evolved multiple times (Ree 2005) and facilitate pollen competition among species (Tong and Huang 2016).
Hypothesis: Differences among populations (within species) are a result of interspecific interactions driving character displacement in local communities.
110 individuals from 15 targeted locations.
RAD-seq (original) PstI enzyme, ~5M reads per sample;
ipyrad min50 assembly: 20K loci, 21% missing, 286K SNPs
Style length does not correlate strongly with genetic clades.
Migration estimates are highly asymmetric (migrate-n; 100 loci): the southernmost clade (Yunnan) is a sink of gene flow. This is a very different relationship from the one suggested by tree-based inference.
Lande (1976): Selection pulls the mean phenotype towards a local optimum, while Gene Flow homogenizes phenotypes among populations, and they evolve by stochastic Drift.
Felsenstein (2002): Eigen decomposition of the known migration matrix yields a transformation to get independent trait means (no covariances) and expected variances.
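To see why an eigen decomposition decouples the population means, consider a deliberately simplified discrete-generation model. It is not necessarily the exact formulation of Lande (1976) or Felsenstein (2002). The assumptions here are: a symmetric migration matrix M (which the asymmetric estimates above would violate), a common selection strength 0 < s <= 1 pulling each mean toward its local optimum, and independent drift of equal variance in every population. The symbols s, theta, sigma, V, Lambda, and y are introduced only for this illustration.

```latex
\begin{aligned}
\mathbf{z}_{t+1} &= (1-s)\,\mathbf{M}\,\mathbf{z}_t + s\,\boldsymbol{\theta} + \boldsymbol{\varepsilon}_t,
  \qquad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{I}) \\
\mathbf{M} &= \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\top},
  \qquad \mathbf{y}_t = \mathbf{V}^{\top}\mathbf{z}_t \\
y_{k,t+1} &= (1-s)\,\lambda_k\, y_{k,t} + s\,(\mathbf{V}^{\top}\boldsymbol{\theta})_k + \varepsilon'_{k,t},
  \qquad \boldsymbol{\varepsilon}'_t = \mathbf{V}^{\top}\boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{I}) \\
\operatorname{Var}(y_k) &= \frac{\sigma^2}{1 - (1-s)^2 \lambda_k^2}
\end{aligned}
```

Because V is orthogonal, each transformed mean y_k evolves as its own first-order autoregressive process: the components are uncorrelated with one another, and each has a stationary variance set by the corresponding eigenvalue of M.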
1. Focal phenotype (style length) measured across 15 populations.
2. Migration matrix estimated from RAD data to model expected covariation of focal phenotypes.
3. Local biotic variables: phenotypic and phylogenetic distance to other Pedicularis in each community. We will model local optima as a function of these measurable variables.
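One of the biotic variables listed above, phenotypic distance to co-occurring species, can be summarized as a nearest-neighbor distance. The sketch below shows one plausible way to compute it. The function name, the species labels, and the trait values are hypothetical placeholders, not measurements from this study.

```python
def nearest_neighbor_distance(focal, community, trait):
    """Smallest absolute trait difference between the focal species and
    any other species recorded in the same community."""
    others = [sp for sp in community if sp != focal]
    if not others:
        return None  # the focal species occurs alone
    return min(abs(trait[focal] - trait[sp]) for sp in others)


# hypothetical style lengths (arbitrary units) and one community list
trait = {"focal_sp": 12.0, "sp_b": 9.5, "sp_c": 20.0}
community = ["focal_sp", "sp_b", "sp_c"]
nnd = nearest_neighbor_distance("focal_sp", community, trait)
```

The same function could be applied with phylogenetic distances in place of trait differences to obtain the phylogenetic counterpart.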
Implementation: Bayesian hierarchical regression model in PyMC3. Fit residuals between observed and transformed trait means with biotic variables (allowing different slopes for different species).
The phylogenetic model is a poor fit to the data whereas the phenotypic nearest neighbor distance model fits well. Posterior distribution of model parameters:
P. cranolopha has a longer style when co-occurring with closer relatives; supports gametophytic "arms-race" hypothesis.
Larger pollen grows faster and further, consistent with 10X higher migration from long to short style populations inferred from genomic data.
Taxonomically challenging; split into species/subspecies based on style length, pubescence, and presence of a "forked beak".
Hybrid zones: contact between populations with "forked beak" and without.
The ipyrad-analysis toolkit makes it easy to deal with missing data, file formats, replication, and reproducibility.
SNP-based tree and network inference methods are the future. While genealogies are real, informative gene trees are not (except plastids).
In Pedicularis, species interactions drive reproductive character displacement, which likely accelerates floral evolution and diversification.
Richard Ree
Dave Boufford
Huang Shuang-Quan
De-Zhu Li
Patrick McKenzie
Jared Meek
Kasi Molina-Velez
Sandra Hoffberg
Isaac Overcast
NSF DEB
NSF DDIG
NSF EAPSI
Columbia Lenfest Award