Sequentially Markov Coalescent Likelihood¶
The Sequentially Markov Coalescent (SMC) can be used to calculate the likelihood of the waiting distance between recombination events across the length of a chromosome. The Multispecies Sequentially Markov Coalescent (MS-SMC) extends this calculation to the likelihood of the waiting distance between multiple types of recombination events across a chromosome, given a parameterized species tree. These event types include (1) any recombination event; (2) a recombination event causing a change to a coalescent time in a genealogy (tree-change); and (3) a recombination event causing a change to the topology of a genealogy (topology-change).
Here we demonstrate several methods in the ipcoal.smc
package for applying SMC calculations to linked genealogical trees in a tree sequence.
import ipcoal
from ipcoal.smc.src.utils import *
# dist, dist, tree, tree
dtree, dtopo, gidxs, tidxs = get_ms_smc_data_from_model(model)
# should return interval lengths too
S, G, I = get_test_data()
# ...
tree_spans, topo_spans, topo_idxs, gtrees = get_waiting_distance_data_from_model(model)
tldr;¶
The likelihood of the waiting distances in a treesequence can be calculated using get_smc_loglik
.
# example dataset
S, G, I, DISTS, GMASK, TMASK = ipcoal.smc.get_test_data(nloci=1, nsites=1e5)
How does the idxs arg work? Does it combine lengths together between the masks?
ipcoal.smc.get_ms_smc_loglik(S, G, I, recombination_rate=2e-9, lengths=X, event_type=0)
ipcoal.smc.get_ms_smc_loglik(S, G, I, recombination_rate=2e-9, lengths=X, event_type=0, idxs=GMASK)
ipcoal.smc.get_ms_smc_loglik(S, G, I, recombination_rate=2e-9, lengths=X, event_type=0, idxs=TMASK)
# get example dataset ipcoal.Model
model = ipcoal.msc.get_test_model()
# simulate an ARG
model.sim_trees(nloci=1, nsites=1e5)
# organize data into S, G, I, X
S = model.tree
G = model.df.genealogy
I = model.get_imap_dict()
X = model.df.nbps
ipcoal.smc.TreeEmbedding(S, G, I,
# get likelihood of G | S
ipcoal.smc.get_ms_smc_loglik(S, G, I, recombination_rate=2e-9, lengths=X, event_type=0)
1513.7757929783581
# get likelihood of G | S
ipcoal.smc.get_ms_smc_loglik(S, G, I, recombination_rate=2e-9, lengths=X, event_type=0)
1513.7757929783581
Test dataset¶
A MSC data set is composed of a species tree (demographic model), one or more genealogies that can be embedded in that species tree, and an imap dictionary that maps species tree tip names to genealogy tip names. We can generate a simple test dataset for studying the MSC model using msc.get_test_data
. This returns a species tree (S), a collection of embedded genealogies (G), and a mapping dictionary (I). These are described further below.
# get (S, G, I) for an example dataset
S, G, I = ipcoal.msc.get_test_data(nloci=100, nsites=1, seed=123)
Genealogy embedding table¶
The first step of a multispecies coalescent analysis involves decomposing the data (S, G, I) into a table. Let's start with our example test dataset. This includes a species tree (S) composing four lineages and collection of genealogies (G) each composing seven samples. Each genealogy has 3 samples from lineage A, 2 from lineage B, and 1 from C and D. This is described in the mapping dictionary (I), which maps species names to lists of genealogy tip names. We can pass these three variables to the function draw_embedded_genealogy
to create a visualization of the genealogy embedding for the first genealogy.
# draw the first genealogy embedded
ipcoal.draw.draw_embedded_genealogy(S, G[0], I);
In this case, the species tree contains all of the information to describe a demographic model. This includes the topology for lineages A, B, C, and D, their divergence times, and an effective population size parameter for each species tree interval. These data are stored in the ToyTree object as node data.
# show the species tree data
S.get_node_data(["idx", "name", "height", "Ne"])
idx | name | height | Ne | |
---|---|---|---|---|
0 | 0 | A | 0.0 | 100000 |
1 | 1 | B | 0.0 | 100000 |
2 | 2 | C | 0.0 | 100000 |
3 | 3 | D | 0.0 | 100000 |
4 | 4 | 200000.0 | 100000 | |
5 | 5 | 400000.0 | 100000 | |
6 | 6 | 600000.0 | 100000 |
The genealogy embedding table contains the information of how a genealogy fits into a species tree container. Our goal is to calculate the likelihood of observing this genealogy given the species tree parameters. Thus, we need to calculate the probability of coalescence at each unit of time. Fortunately, this problem is simplified by recognizing that these probabilities are constant within some intervals of time. In other words, the probability of coalescent is piece-wise constant. Knowing this, we can separate out these intervals where it is constant and compute the probabilities just within each one.
The probability of coalescence is calculated following the Kingman coalscent for k samples as $\frac{k(k - 1)}{4N}$. This is a product of the rate of coalescence given the effective population size in that interval $\frac{1}{2N}$, and the combinatorial number of ways that 2 samples could have coalesced given k samples in that interval $\frac{k(k-1)}{2}$. Thus, the probability of a coalescence event is greater when k is larger, and when $N_e$ is smaller.
# get embedding table for the first genealogy
ipcoal.msc.get_genealogy_embedding_table(S, G[0], I)
start | stop | st_node | neff | nedges | dist | gidx | edges | |
---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 8.348442e+04 | 0 | 100000.0 | 3 | 8.348442e+04 | 0 | [0, 1, 2] |
1 | 83484.416529 | 2.000000e+05 | 0 | 100000.0 | 2 | 1.165156e+05 | 0 | [2, 7] |
2 | 0.000000 | 6.205383e+04 | 1 | 100000.0 | 2 | 6.205383e+04 | 0 | [3, 4] |
3 | 62053.832461 | 2.000000e+05 | 1 | 100000.0 | 1 | 1.379462e+05 | 0 | [8] |
4 | 0.000000 | 4.000000e+05 | 2 | 100000.0 | 1 | 4.000000e+05 | 0 | [5] |
5 | 0.000000 | 6.000000e+05 | 3 | 100000.0 | 1 | 6.000000e+05 | 0 | [6] |
6 | 200000.000000 | 2.838717e+05 | 4 | 100000.0 | 3 | 8.387172e+04 | 0 | [2, 7, 8] |
7 | 283871.722216 | 4.000000e+05 | 4 | 100000.0 | 2 | 1.161283e+05 | 0 | [7, 9] |
8 | 400000.000000 | 4.257337e+05 | 5 | 100000.0 | 3 | 2.573368e+04 | 0 | [5, 7, 9] |
9 | 425733.683785 | 6.000000e+05 | 5 | 100000.0 | 2 | 1.742663e+05 | 0 | [5, 10] |
10 | 600000.000000 | 7.091705e+05 | 6 | 100000.0 | 3 | 1.091705e+05 | 0 | [5, 6, 10] |
11 | 709170.541694 | 7.423249e+05 | 6 | 100000.0 | 2 | 3.315435e+04 | 0 | [10, 11] |
12 | 742324.888479 | inf | 6 | 100000.0 | 1 | inf | 0 | [12] |
Each interval of the table delimits a gene tree node (coalescence) or species tree node (interval breakpoint). Each lists its start
and stop
time (in generations), the species tree node ID (st_node
), effective population size (neff
, i.e., $N_e$), number of genealogy edges in that interval (nedges
; i.e., $k$), and the interval length (stop
- start
). From this we have all the information needed to compute the probability of coalescence.
To compute the likelihood of our observation, we model each coalescent event as a exponential waiting time given the rate of coalescence. For each species tree interval, we also must compute the likelihood that no coalescence occurs during the remaining end of the interval. This set of calculations is the multispecies coalescent (MSC) likelihood. Below we describe some ways it is implemented.
Calculate MSC likelihood¶
The function get_msc_loglik
returns the summed log-likelihood of all trees in G, given the model S, and their mapping in I.
# log-likelihood of 100 unlinked genealogies
ipcoal.msc.get_msc_loglik(S, G, I)
7973.467153781378
Note, this function is generally intended for educational purposes. It is easy to use and understand as it takes the data objects (S, G, I) directly and returns a value. However, in practice users will want to use the much faster implementation below. For comparision, we show the speed for this implementation in the cell below.
%%timeit
ipcoal.msc.get_msc_loglik(S, G, I)
51.8 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Faster implementations¶
A much faster implementation is also provided in get_msc_loglik_from_embedding
. This makes use of the numba
library to perform just-in-time compilation. This is a workaround for writing very fast Python code that runs at a similar speed to compiled languages like C. To use this method you should first generate the genealogy embedding data as a numpy float array like below, and then pass the array to the likelihood function. You can see from the %%timeit
call below which measures the running time for the cell, that this implementation runs about >250X faster.
# get the embedding and encoding as separate arrays
a0, a1 = ipcoal.msc.get_genealogy_embedding_arrays(S, G, I)
# log-likelihood of 100 unlinked genealogies
ipcoal.msc.get_msc_loglik_from_embedding(a0)
7973.467153781378
%%timeit
ipcoal.msc.get_msc_loglik_from_embedding(a0)
165 μs ± 14.4 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Parameter optimization¶
The problem of parameter optimization is outside of the scope of this documentation, as there are many approaches. These typically fall into the categories of either Maximum Likelihood or Bayesian MCMC. In either case, we must propose a set of parameters, calculate the likelihood, and then propose a different set of parameters based on some proposal mechanism, and either accept or reject the new parameter set based on some acceptance criterion.
Note that some types of proposals are very easy to implement, such as proposing different Ne values which requires only changing their value in the embedding array; while other types of moves require greater effort, such as proposing a different divergence time value, which can require re-embedding the genealogies (i.e., re-creating the embedding arrays). Efficient algorithms can be written for the latter, but we have not put great effort into this yet.
# [todo] show example of changing Ne for optimization
Linked trees¶
While the MSC model was originally developed to describe an unlinked distribution of trees, it can also be applied to a sample of linked trees. In this case the key difference is that each tree is not treated equally. Instead, trees should be weighted relative to their contribution to the data being studied. ...
# [todo] show example using the dists argument of `get_msc_loglik_from_embedding`.
Utilities¶
Here we provide a short description of several functions available from the ipcoal.msc
subpackage.
get_test_data¶
This returns an example dataset composed of a species tree (ToyTree), gene tree (ToyTree), and imap dictionary (dict). The species tree describes a 4-tip imbalanced tree (((A,B),C),D);
with equal spaced internodes and a root height of 1e6 generations. Each species tree interval has Ne=1e5. Each genealogy samples 3 tips from 'A', 2 from 'B', and 1 from 'C' and 'D'.
# to get the example species tree, genealogy, and mapping data
ipcoal.msc.get_test_data()
(<toytree.ToyTree at 0x724a8c7a8770>, <toytree.ToyTree at 0x724a8cdad760>, {'A': ['0', '1', '2'], 'B': ['3', '4'], 'C': ['5'], 'D': ['6']})
You can optionally select the number of simulated genealogies using nloci
. By default their lengths are each 1, such that the simulated genealogies are unlinked, but you can set nsites
to make them into longer treesequences. Under the hood this will initialize an ipcoal.Model
object using the example species tree and all other arguments at defaults. It will then call Model.sim_trees
using the nloci
and nsites
args. This is primarily used for setting up simple tests.
# to simulate multiple loci, or longer loci, using the example dataset
ipcoal.msc.get_test_data(nloci=2, nsites=100, seed=123)
(<toytree.ToyTree at 0x724a8c88a9f0>, [<toytree.ToyTree at 0x724a8cdada00>, <toytree.ToyTree at 0x724a8c88ae10>], {'A': ['A_0', 'A_1', 'A_2'], 'B': ['B_0', 'B_1'], 'C': ['C_0'], 'D': ['D_0']})
get_test_model¶
This returns an ipcoal.Model
object initialized with the example 4-tip species tree model from get_test_data
. This is primarily used for testing.
ipcoal.msc.get_test_model()
<ipcoal.model.Model at 0x724a8c888e30>
get_embedded_genealogy¶
This is a convenience function to initialize an ipcoal.Model
as ipcoal.Model(tree=S, **kwargs)
where you can pass additional kwargs to initialize the model (such as nsamples). It then calls sim_trees
to simulate and return one genealogy and an imap dict. One example usage for this is that you have a species tree and just want to sample one or more random genealogies to embed and visualize in one simple step, like below.
# overrides Ne and nsamples to return a genealogy and imap from S
g, i = ipcoal.msc.get_embedded_genealogy(S, Ne=2e5, nsamples=2, seed_trees=1234)
# draw the genealogy embedded in S
ipcoal.draw.draw_embedded_genealogy(S, g, i);
get_genealogy_embedding_table¶
The genealogy embedding table is returned as a pandas.DataFrame
. This is easy to read and interpret. If you enter a single genealogy then the returned table will be composed only the intervals in that genealogy, but if you enter multiple genealogies then each is listed sequentially in the table, labeled by a different genealogy index (gidx column).
ipcoal.msc.get_genealogy_embedding_table(S, G[0], I).shape
(13, 8)
ipcoal.msc.get_genealogy_embedding_table(S, G, I).shape
(1300, 8)
get_genealogy_embedding_arrays¶
The genealogy embedding array is similar to the genealogy embedding table, just formatted differently. It is less user-friendly to read and interpret, but much faster for operations, and so this is what is generally used under the hood. It separates the first 7 columns of the genealogy embedding table from a second array which contains info on which edges are represented in each interval. The latter array is not needed for MSC calculations. In contrast to the 2D genealogy embedding table, the genealogy embedding array is 3D (ntrees, nrows, ncolumns), storing each genealogy as a separate 2D embedding array.
arr, _ = ipcoal.msc.get_genealogy_embedding_arrays(S, G, I)
arr.shape
(100, 13, 7)