Deconstructing spatial genealogical variation across genomes

Deren Eaton
Ecology, Evolution, and Environmental Biology, Columbia University

Genealogical variation

Genomes are composed of a mosaic of segments inherited from different ancestors,
each separated by past recombination events.

Consequently, genealogical relationships vary spatially across genomes.

Multispecies coalescent assumptions

The multispecies coalescent (MSC) describes the expected distribution of unlinked genealogies, as a function of demographic model parameters (N$_e$, $\tau$, topology).

The expected distribution of linked genealogical variation is poorly characterized.

What is the expectation for the distribution of
linked genealogical variation?

How does it relate to demographic model parameters?

Why care about local genealogical variation?

  • Subsampling unlinked loci effectively discards >99% of genomic information.
  • Ignoring linkage introduces bias (concatalescence; Gatesy 2013).
  • Local ancestry is informative about selection and introgression.

(Martin & Belleghem 2017)

  • We lack a null expectation for spatial genealogical variation.

Outline: Multispecies Sequentially Markov Coalescent

  • Background: SMC' waiting distances from Deng et al. (2021).
  • Introduce new analytical solutions for MS-SMC' waiting distances.
  • Validate estimates against stochastic coalescent simulations.
  • Likelihood framework for fitting MSC models from linked genealogies.
  • Future directions.

Sequentially Markov Coalescent (McVean and Cardin, 2005)

An approximation of the coalescent with recombination

Given a starting genealogy, the change to the next genealogy is modeled as a Markov process (a single transition), which enables a tractable likelihood framework.

Process: recombination occurs with uniform probability anywhere on a tree, at time t$_{1}$, creating a detached subtree that re-coalesces with an ancestral lineage above t$_{1}$.

SMC' is widely used in HMM methods

PSMC (Li & Durbin 2011) and MSMC (Schiffels & Durbin 2014) use pairwise coalescent times between sequential genealogies to infer changes in N$_e$ through time.

ARGweaver (Rasmussen et al. 2014) and ARGweaver-D (Hubisz & Siepel 2020) use an SMC'-based conditional sampling method to infer ARGs from sequence data.

Currently, we extract a fairly limited amount of
spatial information from genomes.

Categorical event outcomes under the SMC'

(a) no-change; (b-c) tree-change; and (d) topology-change.

(Deng et al. 2021)

Estimating waiting distances under the SMC'

Expected Tree and Topology Distances represent new spatial genetic information.

A multispecies extension to estimating waiting distances

Barriers to coalescence and variable N$_e$ among species tree intervals.

Patrick McKenzie
PhD student

Extending SMC' waiting distance estimation

Genealogy embedding table with piecewise-constant coalescence rates in
all intervals between coalescent events or population intervals.

MS-SMC' analytical solutions

\[ \mathbb{P}(\text{tree-unchanged} | \mathcal{S}, \mathcal{G}, b, t_r) = \int_{t_r}^{t^u_b} \frac{1}{2N(\tau)} e^{-\int_{t_r}^\tau \frac{A(s)}{2N(s)}ds} d\tau \]
\[ \mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b) = \frac{1}{t^u_b-t^l_b} \int_{t_b^l}^{t_b^u} \mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b,t)dt \]
\[ \mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G}) = \sum_{b \in \mathcal{G}} \left[\frac{t^u_b - t^l_b}{L(\mathcal{G})}\right] \mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b) \]
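
The second and third equations are averages: the first over recombination times drawn uniformly along one branch, the second over branches weighted by their share of the total tree length. The sketch below illustrates only that averaging structure, and it is not the ipcoal implementation. The function names are hypothetical, the per-time probability is passed in as an arbitrary callable, and the example branch values are made up.

```python
def average_over_branch(prob_given_time, t_lower, t_upper, n_steps=1000):
    """Mean of prob_given_time(t) for t uniform on [t_lower, t_upper].

    Uses the midpoint rule as a stand-in for the analytical integral.
    """
    width = (t_upper - t_lower) / n_steps
    total = sum(
        prob_given_time(t_lower + (i + 0.5) * width) for i in range(n_steps)
    )
    return total / n_steps


def average_over_tree(branches):
    """Branch-length-weighted mean of per-branch probabilities.

    branches: list of (t_lower, t_upper, prob_given_branch) tuples.
    """
    tree_length = sum(upper - lower for lower, upper, _ in branches)
    return sum(
        (upper - lower) / tree_length * prob for lower, upper, prob in branches
    )


# Hypothetical two-branch example; the probabilities are invented.
example_branches = [(0.0, 1.0, 0.5), (0.0, 3.0, 0.1)]
example_prob = average_over_tree(example_branches)
```

Longer branches contribute more because recombination falls uniformly along the total tree length, so a branch is hit in proportion to its length.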

Branch-specific SMC' probabilities

Unlike single-population models, which exhibit monotonic probabilities over the length of a branch, MSC models exhibit variable rates (both $k$ and N$_e$ can change).

Exponentially distributed waiting distances

Each event type has a per-site rate $\lambda$; the expected number of sites until that event is observed is $1/\lambda$.

\[ \lambda_{r} = L(\mathcal{G}) \times r \]
\[ \lambda_{n} = L(\mathcal{G}) \times r \times \mathbb{P}(\text{tree-unchanged} | \mathcal{S},\mathcal{G}) \]
\[ \lambda_{g} = L(\mathcal{G}) \times r \times \mathbb{P}(\text{tree-changed} | \mathcal{S},\mathcal{G}) \]
\[ \lambda_{t} = L(\mathcal{G}) \times r \times \mathbb{P}(\text{topology-changed} | \mathcal{S},\mathcal{G}) \]
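
As a rough illustration of how these rates translate into expected distances, the sketch below computes the four rates and their reciprocals. It assumes that L(G) is the summed branch length in generations, that r is a per-site per-generation recombination rate, and that tree-changed is the complement of tree-unchanged. The function name and all input values are hypothetical.

```python
def waiting_distance_rates(tree_length, recomb_rate, p_tree_unchanged,
                           p_topology_changed):
    """Return per-site event rates for each SMC' event category."""
    lambda_r = tree_length * recomb_rate
    return {
        "recombination": lambda_r,
        "no_change": lambda_r * p_tree_unchanged,
        # Assumes tree-changed is the complement of tree-unchanged.
        "tree_change": lambda_r * (1.0 - p_tree_unchanged),
        "topology_change": lambda_r * p_topology_changed,
    }


# Hypothetical inputs: 1e6 generations of total branch length, r = 2e-9.
rates = waiting_distance_rates(1e6, 2e-9, 0.4, 0.3)

# The mean of an exponential distribution is the reciprocal of its rate.
expected_sites = {event: 1.0 / rate for event, rate in rates.items()}
```

With these made-up inputs a recombination event is expected about every 500 sites, while a topology change is expected only about every 1,667 sites.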

Validation:

Analytical results match expectation of stochastic coalescent simulations.

In a single-population model (Deng et al. 2021), N$_e$ affects only edge lengths.

In an MSC model, N$_e$ affects the probability of tree/topology change as well as edge lengths.

Waiting distances vary more with N$_e$ in structured models, so they carry more information about it.

Likelihood framework

Given an observed or proposed ARG (genealogies and interval lengths),
get the waiting-distance rate for each genealogy ($\lambda_i$)...

... and calculate the likelihood of the MSC model ($\mathcal{S}$) from exponential probability densities.

\[ \log \mathcal{L}(\mathcal{S} | \Lambda_g, X_g) = \sum_i \log\left(\lambda_i e^{-\lambda_i x_i}\right) \]
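
A minimal sketch of this sum of exponential log-densities, assuming the rates $\lambda_i$ have already been computed for each genealogy under a candidate model and that $x_i$ are the observed interval lengths in sites. The function name is hypothetical, and this is not the ipcoal code.

```python
import math


def waiting_distance_log_likelihood(rates, distances):
    """Sum of log(lambda_i * exp(-lambda_i * x_i)) over all intervals."""
    if len(rates) != len(distances):
        raise ValueError("rates and distances must have the same length")
    # log(lam * exp(-lam * x)) simplifies to log(lam) - lam * x, which also
    # avoids underflow when lam * x is large.
    return sum(math.log(lam) - lam * x for lam, x in zip(rates, distances))
```

Candidate models can then be compared by recomputing the rates under each model and evaluating this function on the same observed distances.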

Likelihood surface: single N$_e$

Topology changes are more informative than tree changes; optima fall at the true simulated values.
Example: loci=50, length=0.1Mb, recomb=2e-9, samples-per-lineage=4.

Joint inference of multiple MSC model parameters

Metropolis-Hastings MCMC converges on the correct parameter values with increasing data.
Example: loci=50, length=0.1Mb, recomb=2e-9, samples-per-lineage=4.
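
To make the sampling idea concrete, here is a toy random-walk Metropolis-Hastings sampler. It is a deliberately simplified stand-in and not the method used in the talk: a single shared exponential rate takes the place of the MSC parameters, the prior is assumed flat on the log rate, and the data are simulated with a made-up true rate. All names are hypothetical.

```python
import math
import random


def log_likelihood(rate, distances):
    """Sum of exponential log-densities for one shared rate."""
    return sum(math.log(rate) - rate * x for x in distances)


def metropolis_hastings(distances, n_steps=5000, step_size=0.1,
                        start=0.01, seed=1):
    """Random-walk sampler on log(rate), assuming a flat prior on log(rate)."""
    rng = random.Random(seed)
    log_rate = math.log(start)
    current = log_likelihood(start, distances)
    samples = []
    for _ in range(n_steps):
        proposal = log_rate + rng.gauss(0.0, step_size)
        proposed = log_likelihood(math.exp(proposal), distances)
        # The proposal is symmetric, so the acceptance ratio reduces to the
        # likelihood ratio. 1 - random() lies in (0, 1], which keeps log finite.
        if math.log(1.0 - rng.random()) < proposed - current:
            log_rate, current = proposal, proposed
        samples.append(math.exp(log_rate))
    return samples


# Simulated waiting distances with a known (hypothetical) rate.
data_rng = random.Random(42)
true_rate = 0.002
distances = [data_rng.expovariate(true_rate) for _ in range(500)]

samples = metropolis_hastings(distances)
kept = samples[1000:]  # discard the first 1000 steps as burn-in
posterior_mean = sum(kept) / len(kept)
```

Increasing the number of simulated distances narrows the posterior around the true rate, which is the same qualitative behavior described for the joint MSC inference.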

Summary: Multispecies Sequentially Markov Coalescent

  • We extended the method of Deng et al. (2021) to MSC models.
  • Analytical solutions for E[waiting distance] to a tree or topology change.
  • Validated: accurate against stochastic coalescent simulations.
  • Waiting distances provide more information in MSC-type models than in a single-population coalescent.
  • We can estimate MSC model parameters from linked genome data!
  • Topology changes are easily detectable in sequence data.

Future directions

  • Manuscript on bioRxiv (hopefully within days!)
  • Implemented at https://github.com/eaton-lab/ipcoal/ (docs coming!)
  • Software to analyze genetic data is a future development
  • Extensions to Multispecies Network Coalescent.
  • and more...