Genomes are composed of a mosaic of segments inherited from different ancestors, each separated by past recombination events.

Consequently, genealogical relationships vary spatially across genomes.

The multispecies coalescent (MSC) describes the expected distribution
of *unlinked* genealogies,
as a function of demographic model parameters (N$_e$, $\tau$, topology).

The expected distribution of *linked*
genealogical variation is poorly characterized.

How does it relate to demographic model parameters?

- Subsampling unlinked loci effectively discards >99% of genomic information.
- Ignoring linkage introduces bias (*concatalescence*; Gatesy 2013).
- Local ancestry is informative about selection and introgression.

(Martin & Belleghem 2017)

- We lack a null expectation for spatial genealogical variation.

- Background: *SMC' waiting distances* from Deng et al. (2021).
- Introduce new analytical solutions for *MS-SMC' waiting distances*.
- Validate estimates against stochastic coalescent simulations.
- Likelihood framework for fitting MSC models from linked genealogies.
- Future directions.

*An approximation of the coalescent with recombination*

Given a starting genealogy, the change to the next genealogy is modeled as a Markov process (a single transition), which enables a tractable likelihood framework.

Process: recombination occurs with uniform probability anywhere on a tree (t$_{1}$), creating a detached subtree, which re-coalesces above t$_{1}$ with an ancestral lineage.
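As a rough illustration of the first step of this process, the sketch below draws a recombination point uniformly over total tree length: it picks a branch in proportion to its length, then a uniform time on that branch. The branch names, the `(t_low, t_up)` layout, and the function name are hypothetical and are not taken from ipcoal.

```python
import random


def sample_recombination_point(branches, rng):
    """Draw (branch, time) uniformly over the total length of a tree.

    branches: dict mapping a branch name to its (t_low, t_up) time bounds.
    This layout is an assumption made for illustration only.
    """
    names = list(branches)
    lengths = [branches[name][1] - branches[name][0] for name in names]
    # Longer branches are proportionally more likely to be hit.
    branch = rng.choices(names, weights=lengths, k=1)[0]
    t_low, t_up = branches[branch]
    # Given the branch, the time of the event is uniform along it.
    return branch, rng.uniform(t_low, t_up)


# Toy tree with two branches; "b" holds 3/4 of the total length.
toy_branches = {"a": (0.0, 100.0), "b": (0.0, 300.0)}
branch, t_1 = sample_recombination_point(toy_branches, random.Random(0))
```

Re-coalescence of the detached subtree is not simulated here; the sketch covers only where the recombination event falls.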

*PSMC* (Li & Durbin 2011) and *MSMC* (Schiffels & Durbin 2014)
use pairwise coalescent times between sequential genealogies to infer
changes in N$_e$ through time.

*ARGweaver* (Rasmussen et al. 2014) and *ARGweaver-D* (Hubisz & Siepel 2020)
use an SMC'-based conditional sampling method to infer ARGs from sequence data.

spatial information from genomes.

(a) no-change; (b-c) tree-change; and (d) topology-change.

(Deng et al. 2021)

*Expected Tree and Topology Distances represent new spatial genetic information.*

*Barriers to coalescence and variable N$_e$ among species tree intervals.*

Patrick McKenzie

PhD student

*Genealogy embedding table with piecewise constant coal rates in
all intervals between coal events or population intervals.*

\[
\mathbb{P}(\text{tree-unchanged} | \mathcal{S}, \mathcal{G}, b, t_r) =
\int_{t_r}^{t^u_b} \frac{1}{2N(\tau)} e^{-\int_{t_r}^\tau \frac{A(s)}{2N(s)}ds} d\tau
\]
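When $A(s)$ and $N(s)$ are piecewise constant, as in the genealogy embedding table, the integral above reduces to a sum of closed-form terms, one per interval. The sketch below evaluates it that way. The `(start, stop, A, N)` segment layout and the function name are assumptions for illustration, not the ipcoal implementation.

```python
import math


def prob_unchanged_given_time(t_r, segments):
    """Evaluate the integral above for piecewise-constant A and N.

    segments: ordered, contiguous (start, stop, A, N) tuples that cover
    the span from t_r up to the top of the branch. A must be positive.
    """
    total = 0.0
    survival = 1.0  # exp(-integral of A/2N) from t_r to the segment start
    for start, stop, a, n in segments:
        start = max(start, t_r)
        if stop <= start:
            continue  # segment lies entirely below the recombination time
        rate = a / (2.0 * n)
        decay = math.exp(-rate * (stop - start))
        # Closed form of (1/2N) * exp(-rate * x) integrated over the segment.
        total += survival * (1.0 - decay) / a
        survival *= decay
    return total


# Toy values: one interval, and the same interval split into two pieces.
one_piece = prob_unchanged_given_time(0.0, [(0.0, 500.0, 3, 1000.0)])
two_pieces = prob_unchanged_given_time(
    0.0, [(0.0, 200.0, 3, 1000.0), (200.0, 500.0, 3, 1000.0)]
)
```

Splitting an interval without changing its parameters should leave the result unchanged, which is a quick sanity check on the bookkeeping.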

\[
\mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b) =
\frac{1}{t^u_b-t^l_b} \int_{t_b^l}^{t_b^u}
\mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b,t)dt
\]

\[
\mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G}) =
\sum_{b \in \mathcal{G}}
\left[\frac{t^u_b - t^l_b}{L(\mathcal{G})}\right]
\mathbb{P}(\textrm{tree-unchanged} | \mathcal{S},\mathcal{G},b)
\]
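The two averaging steps above can be sketched numerically: a midpoint rule averages the time-conditional probability over a branch, and a length-weighted sum combines the branches. The function names and the `(t_low, t_up, probability)` tuple layout are hypothetical, and the midpoint rule is used here as a stand-in for the analytical solution.

```python
def prob_unchanged_on_branch(t_low, t_up, prob_given_time, steps=1000):
    """Average prob_given_time(t) over a branch with a midpoint rule."""
    width = (t_up - t_low) / steps
    midpoints = (t_low + (i + 0.5) * width for i in range(steps))
    integral = sum(prob_given_time(t) for t in midpoints) * width
    return integral / (t_up - t_low)


def prob_unchanged_on_tree(branches):
    """Weight each branch probability by its share of total tree length.

    branches: list of (t_low, t_up, probability) tuples.
    """
    total_length = sum(t_up - t_low for t_low, t_up, _ in branches)
    return sum(
        (t_up - t_low) / total_length * prob for t_low, t_up, prob in branches
    )


# Toy example: a made-up linear probability along one branch, then two branches.
branch_prob = prob_unchanged_on_branch(0.0, 100.0, lambda t: t / 100.0)
tree_prob = prob_unchanged_on_tree([(0.0, 100.0, 0.2), (0.0, 300.0, 0.6)])
```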

Unlike single-population models, which exhibit monotonic probabilities over the length of a branch, MSC models exhibit variable rates (both $k$ and N$_e$ can change).

*Waiting distances (number of sites until an event is observed) are exponentially distributed with rate $\lambda$; the expected distance is $1/\lambda$.*

\[ \lambda_{r} = L(\mathcal{G}) \times r \]

\[
\lambda_{n} = L(\mathcal{G}) \times r \times
\mathbb{P}(\text{tree-unchanged} | \mathcal{S},\mathcal{G})
\]

\[
\lambda_{g} = L(\mathcal{G}) \times r \times
\mathbb{P}(\text{tree-changed} | \mathcal{S},\mathcal{G})
\]

\[
\lambda_{t} = L(\mathcal{G}) \times r \times
\mathbb{P}(\text{topology-changed} | \mathcal{S},\mathcal{G})
\]
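A small sketch of how the four rates relate, using made-up toy values (only the recombination rate of 2e-9 echoes an example used elsewhere in these slides). It assumes that P(tree-changed) = 1 - P(tree-unchanged), and that each $\lambda$ is treated as an exponential rate so that the expected waiting distance in sites is $1/\lambda$. The function and key names are hypothetical.

```python
def waiting_distance_rates(tree_length, recomb_rate, p_unchanged, p_topology_changed):
    """Return the rate of each event type per site.

    tree_length: sum of branch lengths L(G), in generations.
    recomb_rate: r, per site per generation.
    Assumes P(tree-changed) = 1 - P(tree-unchanged).
    """
    lam_r = tree_length * recomb_rate
    return {
        "recombination": lam_r,
        "no_change": lam_r * p_unchanged,
        "tree_change": lam_r * (1.0 - p_unchanged),
        "topology_change": lam_r * p_topology_changed,
    }


# Toy inputs, not results from the talk.
rates = waiting_distance_rates(
    tree_length=1e5, recomb_rate=2e-9, p_unchanged=0.4, p_topology_changed=0.25
)
# Mean of an exponential distribution is the reciprocal of its rate.
expected_sites = {event: 1.0 / lam for event, lam in rates.items()}
```

Under these assumptions the no-change and tree-change rates sum to the recombination rate, and rarer events have longer expected waiting distances.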

*Analytical results match expectation of stochastic coalescent simulations.*

*In a single-population model (Deng et al. 2021), N$_e$ only affects edge lengths.*

*In an MSC model N$_e$ affects probability of tree/topology change as well as edge lengths.*

*Waiting distances vary with N$_e$ more in structured models: more information.*

*Given an observed/proposed ARG (genealogies and interval lengths),
get the waiting-distance rate for each interval ($\lambda_i$)...*

*... and calculate the likelihood of the MSC model ($\mathcal{S}$) from exponential probability densities.*

\[ \log \mathcal{L}(\mathcal{S} | \Lambda_g, X_g) = \sum_i \log \left( \lambda_i e^{-\lambda_i x_i} \right) \]
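The sum above is straightforward to compute once each interval has a rate $\lambda_i$ and an observed length $x_i$. A minimal sketch, with hypothetical names and toy numbers, using $\log(\lambda e^{-\lambda x}) = \log \lambda - \lambda x$:

```python
import math


def log_likelihood(rates, distances):
    """Sum of exponential log-densities, one term per genealogy interval."""
    return sum(math.log(lam) - lam * x for lam, x in zip(rates, distances))


# Toy rates (lambda_i) and interval lengths in sites (x_i); not real data.
toy_rates = [2e-4, 1e-4, 5e-5]
toy_distances = [4000.0, 12000.0, 15000.0]
loglik = log_likelihood(toy_rates, toy_distances)
```

Working in log space avoids underflow when many small densities are multiplied together.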

*Topology-changes are more informative than tree-changes; optima at true sim. values.
Example: loci=50, length=0.1Mb, recomb=2e-9, samples-per-lineage=4.*

*Metropolis-Hastings MCMC converges on the correct values with increasing data.*

Example: loci=50, length=0.1Mb, recomb=2e-9, samples-per-lineage=4.
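To make the sampling idea concrete, here is a deliberately simplified toy: a Metropolis-Hastings sampler for a single exponential rate shared by all intervals, fit to simulated waiting distances. This is not the sampler used for the results above, which fits MSC parameters through per-interval rates. The flat prior, the multiplicative proposal, and all names and values are assumptions for illustration.

```python
import math
import random


def log_likelihood(lam, distances):
    """Exponential log-likelihood with one shared rate."""
    return len(distances) * math.log(lam) - lam * sum(distances)


def metropolis_hastings(distances, start, n_steps=5000, step_sd=0.05, seed=1):
    """Sample a positive rate under a flat prior with a log-scale random walk."""
    rng = random.Random(seed)
    current = start
    current_ll = log_likelihood(current, distances)
    chain = []
    for _ in range(n_steps):
        proposal = current * math.exp(rng.gauss(0.0, step_sd))
        proposal_ll = log_likelihood(proposal, distances)
        # The multiplicative proposal is asymmetric, so the Hastings
        # correction adds log(proposal / current) to the log ratio.
        log_ratio = proposal_ll - current_ll + math.log(proposal / current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current, current_ll = proposal, proposal_ll
        chain.append(current)
    return chain


# Simulate toy waiting distances with a known rate, then try to recover it.
true_rate = 2e-4
data_rng = random.Random(42)
distances = [data_rng.expovariate(true_rate) for _ in range(5000)]
chain = metropolis_hastings(distances, start=1e-3)
kept = chain[1000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

With more simulated distances the posterior tightens around the true rate, which mirrors the qualitative behavior described above.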

- We extended the method of Deng et al. (2021) to MSC models.
- Analytical solutions for E[waiting distance] to tree or topology change.
- Validated: accurate against stochastic coalescent simulations.
- Waiting distances provide *more* information in MSC-type models than in a single-population coalescent.
- We can estimate MSC model parameters from *linked genome data!*
- Topology changes are easily detectable in sequence data.

- Manuscript on bioRxiv (hopefully within days!)
- Implemented at https://github.com/eaton-lab/ipcoal/ (docs coming!)
- Software to analyze genetic data is a future development
- Extensions to Multispecies Network Coalescent.
- and more...