The Model class¶
ipcoal is an object-oriented library for which the main object users interact with is the Model
class. To create a Model
you must provide at least one or more parameter arguments to ipcoal.Model()
which will return an object that you can store as a variable. This stores the information that defines the population demographic model and substitution model, as well as many other options that can modify how simulations are performed. Simulation methods can be called from this object, in addition to several other methods for visualization and analysis.
import ipcoal
import toytree
import msprime
# initialize a Model object
ipcoal.Model(Ne=1000, nsamples=10)
<ipcoal.model.Model at 0x7ad2b4304260>
Demography (tree)¶
The demographic structure of a population can be defined in ipcoal using the Model
class argument tree. The name of this argument is used to indicate that a tree object can be used to define the parameters of a demographic model. These parameters include the topology (relationships) among populations, the divergence times between populations, and the effective population sizes (Ne) of each population.
A simple Model¶
The simplest demographic model is of a single population with constant Ne, which has no topology or divergence times, and only one Ne parameter. This can be created easily be setting the tree argument to its default value of None. As shown in the second example, you will usually initialize a Model
with many additional arguments, which will be described throughout the rest of this document.
# a single population model with constant Ne
ipcoal.Model(tree=None, Ne=1000)
<ipcoal.model.Model at 0x7ad21e2a7e60>
# a single population model with constant Ne and other args
ipcoal.Model(Ne=1000, nsamples=5, mut=1e-8, recomb=1-9)
<ipcoal.model.Model at 0x7ad21e2f3890>
Using toytree¶
An example of a more complex model is a species tree topology with 10 lineages, which requires 9 divergence time parameters and 19 Ne parameters. Setting each of these manually, for example by referring to lineages by an integer label, can be very challenging and error-prone. Instead, the approach we use in ipcoal is to create demographic models using ToyTree
objects, where it is easy to set data values and visualize them to confirm their accuracy. See the toytree
library documentation to learn more about the features for creating and setting data on trees. Here we provide some examples. You can create a tree topology using one of several approaches in toytree
which return a ToyTree
class object. This includes the random module (toytree.rtree
) and the data parsing module (toytree.tree
). You can further modify a tree by explicitly setting node heights or edge lengths on it, or by using a number of tree modification functions (toytree.mod
), which can be used to do things like stretch or compress edge lengths.
# get an imbalanced tree with 3 tips
sptree = toytree.rtree.imbtree(ntips=3)
# or, get a random tree w/ equal spaced nodes and set root height
sptree = toytree.rtree.unittree(ntips=3, treeheight=1e5)
# or, define a custom tree from newick string
sptree = toytree.tree("((A:1,B:1):1,C:2);")
# set the branch lengths on a tree explicitly
sptree = sptree.set_node_data("height", {3: 5e4, 4: 1e5})
# or, set the branch lengths by stretching the tree
sptree = sptree.mod.edges_scale_to_root_height(1e5)
Labels and units¶
Once your species tree relationships and divergence times are set, you can visualize the tree using its .draw()
function call. The edge lengths of a species tree that will be used to define a demographic model should be units of generations. The tips of the tree do not have to align at the present. The tips of the tree will have name labels, whereas internal nodes do not by default, unless set by the user. All nodes have numeric labels which count up from 0 to nnodes-1, in a tips-first then post-order traversal order (see toytree idxorder documentation). The visualization below uses the special tree_style (ts) option type "p" to designate a suite of styling arguments that will show the internal numeric labels on nodes, in addition to the tip labels and the vertical scale bar. Here we are drawing the tree that was created above, which has a root height of 1e5 generations.
# draw the tree to examine the relationships and div times
sptree.draw(ts='p');
A complex model¶
Now that you are sure that the topology and divegence times on your tree are set correctly, you can enter this tree as an argument to initialize a Model
object with this defined demography. By default, this will sample one gene copy per lineage (tip) of the species tree, but this can be modified (see nsamples
).
# initialize a Model from a species tree
ipcoal.Model(tree=sptree, Ne=1000)
<ipcoal.model.Model at 0x7ad21def3860>
Effective population size (Ne)¶
The effective population size is the key parameter used in population demographic modeling. It determines the probability of coalescence and thus determines the distribution of coalesce times among gene copies in a population. Values of Ne can be set as a constant, or at different values for different populations and ancestral populations in a structured model, such as a species tree.
Ne values are diploid¶
By default, in ipcoal Ne represents the number of diploid individuals in a population. Thus, a value of Ne=1000 indicates that there are 2000 haploid gene copies. The coalescent is an approximation of the WrightFisher model, where the probability that any two gene copies share a common ancestor (i.e., coalesce) one generation ago is 1/2N. Because we can model this as an exponentially distributed waiting time, we know that the expected time until the first coalescent event is therefore 2N. Let's verify this by checking that genealogies simulated in a population with Ne=10K have an average time until first coalescence of 20K generations (i.e., 2 times Ne units of generations).
# simulate many independent trees with 2 samples from a population
NE = 10_000
mod = ipcoal.Model(Ne=NE, nsamples=2, seed_trees=123)
mod.sim_trees(10_000)
# get mean root node height of all trees
trees = [toytree.tree(i) for i in mod.df.genealogy]
avg_time = sum(i.treenode.height for i in trees) / len(trees)
# return result
f"Avg time to first coalescence = {avg_time:.2f} generation, or {avg_time/NE:.2f}Ne"
'Avg time to first coalescence = 20196.12 generation, or 2.02Ne'
Setting Ne¶
When initializing a Model
object you can set Ne using the Ne
argument. This accepts argumnets as an integer, float, dictionary, or None. Setting an integer or float will assign this value to all populations in the model. If a dictionary is entered as {lineage: Ne_value, ...} you can set different Ne values to each lineage, selected by name or tree node index. In this case you must enter a key for every population.
# set constant Ne to a single population
ipcoal.Model(Ne=1000)
<ipcoal.model.Model at 0x7ad21def3d10>
# set constant Ne to all Nodes on a species tree
ipcoal.Model(tree=sptree, Ne=1000)
<ipcoal.model.Model at 0x7ad21e5e1040>
# set variable Ne values to Nodes on a species tree
ipcoal.Model(tree=sptree, Ne={0: 1000, 1: 1000, 2: 1000, 3: 2000, 4: 2000})
<ipcoal.model.Model at 0x7ad21c66a900>
Setting Ne on trees¶
Finally, another way to set up complex demographic models is to set all data explicitly on the tree object. To use this option you must enter the Ne argument to ipcoal.Model
as None. Ne values will then be automatically extracted from the tree input, if present. An error will be raised if all nodes do not have an Ne values set. Note that if an Ne argument (besides None) is entered to ipcoal.Model
it will override any Ne values set on a tree. To set Ne data on a tree object we use the set_node_data
function in toytree
, which can be used to set any abitrary named feature on nodes of a tree. Here the feature name "Ne" is special, as ipcoal will look for and recognize it if present. This approach can be more convenient than entering Ne values as a dictionary, like above, because the set_node_data
function has some helpful features such as setting a default value to nodes that are not specified, and because the values set to nodes can be easily visualized on trees.
# set variable Ne to each Node
sptree = sptree.set_node_data("Ne", {0: 1e4, 1: 2e4, 2: 3e4, 3: 5e4, 4: 1e5})
# set variable Ne to some Nodes and others to a default
sptree = sptree.set_node_data("Ne", {0: 4e4, 1: 3e4}, default=1e4)
# visualize Ne as edge widths projected between size 2-10
sptree.draw('p', edge_widths=("Ne", 2, 10));
# finally, we can enter the tree to set Ne values in the Model
ipcoal.Model(tree=sptree)
<ipcoal.model.Model at 0x7ad21c6766f0>
Variable Ne within intervals¶
ipcoal is not currently designed to implement more complex scenarios, such as Ne values that change linearly, or exponentially, from the beginning to the end of an interval. These should be simple enough to implement, but our focus thus far has been primarily on species tree scenarios which have constant Ne within each interval. To set up more complex demographic models I would recommend using msprime
for now. If you are interested in helping to implement this please reach out.
Sampling (nsamples)¶
A key feature of the coalescent model, which involves studying the process of evolution backwards in time, is that we do not need to simulate the history of an entire population in order to study its evolution. Instead, we can examine only the history for a set of sampled gene copies in that population. In this way, the number of samples plays an important role in many characteristics of the coalescent. This can be set using the nsamples
argument to ipcoal.Model
. This can be entered as an integer, which sets the number of gene copies to sample from every population, or as a dict, where different numbers of gene copies can be sampled from each population. Here, a sample represents a haploid gene copy. Note that it is still possible to simulate diploid sequences in ipcoal by sampling an even number of samples in each population. The implementation of joining pairs of samples into diploids only takes place within downstream analyses, such as writing sequences. To demonstrate the nsamples
argument we show several examples of initialized Model
objects below, where we show the imap dictionary of each, which maps population names to lists of sample names.
ipcoal.Model(Ne=1000, nsamples=5).get_imap_dict()
{'p': ['p_0', 'p_1', 'p_2', 'p_3', 'p_4']}
ipcoal.Model(sptree, nsamples=2).get_imap_dict()
{'A': ['A_0', 'A_1'], 'B': ['B_0', 'B_1'], 'C': ['C_0', 'C_1']}
ipcoal.Model(sptree, nsamples={"A": 1, "B": 2, "C": 3}).get_imap_dict()
{'A': ['A_0'], 'B': ['B_0', 'B_1'], 'C': ['C_0', 'C_1', 'C_2']}
Finally, using visualizations we can validate the setting for different numbers of samples in each population.
mod = ipcoal.Model(sptree, nsamples={"A": 3, "B": 2, "C": 1}, seed_trees=123)
mod.sim_trees(1)
mod.draw_demography(0);
Gene flow (admixture_edges)¶
In both ipcoal and toytree you can define an admixture event with one or more 4-part tuples. This designates the (source, destination, time, rate/prop). If the time is a single value then an admixture event occurs by transferring a proportion of one population to another; whereas if time is entered as a tuple of (time_min, time_max) it occurs as a migration rate over this period of time. In simulations you will need to provide all four arguments, but for tree drawings you only need to provide the first two and the latter two will be automated to make it look nice. So let’s try first to draw an admixture edge on a tree.
This is done by supplying the event tuple to the admixture_edges
tree drawing argument. Here the source and destination refer to the index (idx) numeric labels of each node (which you find by drawing the tree). You can see that the admixture edge is not simply a horizontal line, but it follows the edges towards the tips of the source node and towards the ancestors from the destination node. This is to make the direction of inheritance of the introgression clear. In this case, a tuple of (1, 2) indicates that introgression occurs from 1 to 2 backwards in time; meaning that some alleles that arose in the ancestor of node 2 have introgressed into lineage 1 at the present.
# draw the admixture_edges scenario to validate its set up
sptree.draw(ts='p', admixture_edges=[(1, 2)]);
Admixture¶
Now let’s define an admixture event to a Model
object. The proportion here is set to 0.1 (10% admixture). The timing of the event can be entered either as a integer, representing the time in generations at which it occurs, or as a float, in which case the timing will be modeled as occuring at the set percentage of the way back in time along the time at which the two edges both existed. The latter argument is useful for some automation scenarios so you do not need to find when two edges coexisted in time. Both are demonstrated below. In the first case I added a red line at t=20K gens to highlight that the coalescent times of introgressed trees occur at least 20K generations back, as expected.
# admixture of 10% from 1->2 at time=20K
amodel = ipcoal.Model(sptree, Ne=1e4, admixture_edges=[(1, 2, 20000, 0.1)], seed_trees=123)
# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)
# visualize discordance as a cloud tree
canvas, axes, marks = amodel.draw_cloud_tree(scale_bar=True);
# highlighting admixture timing at t=20K gens
axes.hlines(20000, style={"stroke": "red", "stroke-dasharray": "5,2"});
Here is a similar scenario but we have set the timing of the introgression to occur further back in time, at 50% of the way back along the interval during which edges 1 and 2 both exist.
# admixture of 10% from 1->2 at time=50% of way back on shared history
amodel = ipcoal.Model(sptree, Ne=1e4, admixture_edges=[(1, 2, 0.5, 0.1)], seed_trees=123)
# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)
# visualize discordance as a cloud tree
amodel.draw_cloud_tree(scale_bar=True);
Migration¶
Next is an example where we model a migration instead of an admixture event. The difference is that instead of an instantaneous mixing event, we model a migration rate from one population to another over a period of time. Similar to above, the period of time can be set as a distinct start and end time as integer generations, or it can be set as a float between (0-1) to indicate the proportion of time along their shared time interval. Both are demonstrated below.
# migration at rate 1e-5 from 1->2 from time 10K-20K gens ago
amodel = ipcoal.Model(sptree, Ne=1e4, admixture_edges=[(1, 2, (10000, 20000), 2e-5)], seed_trees=123)
# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)
# visualize discordance as a cloud tree
amodel.draw_cloud_tree(scale_bar=True);
# migration at rate 2e-5 from 1->2 from time 10%-50% of their time
amodel = ipcoal.Model(sptree, Ne=1e4, admixture_edges=[(1, 2, (0.1, 0.5), 2e-5)], seed_trees=123)
# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)
# visualize discordance as a cloud tree
amodel.draw_cloud_tree(scale_bar=True);
Finally, you can set more than one migration or admixture event at a time by simply entering them as multiple tuples. In this example we set bi-directional migration between B and C.
# migration at rate 1e-5 from 1<->2 from time 0%-50% of their time
amodel = ipcoal.Model(
sptree, Ne=1e4, seed_trees=123,
admixture_edges=[
(1, 2, (0.0, 0.5), 1e-5),
(2, 1, (0.0, 0.5), 1e-5),
])
# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)
# visualize discordance as a cloud tree
c, a, m = amodel.draw_cloud_tree(scale_bar=True);
mut (mutation rate)¶
The mutation rate is defined in units of mutations per site per generation. This can be entered as a float value or as a Map which sets a different rate for different intervals of a genome. The mutation rate defines the probability of mutation events over the length of a simulation. The outcome of these events is a consequence of the substitution model, which defines the number of states and their transition rates.
model = ipcoal.Model(Ne=1000, nsamples=5, mut=2e-8, recomb=0, seed_trees=123, seed_mutations=123, store_tree_sequences=True)
model.sim_loci(nloci=1, nsites=10000)
model.draw_genealogy(0, show_substitutions=True, layout='d');
Next we can see that by cranking up the mutation rate by an order of magnitude the number of mutations on each branch is much greater.
model = ipcoal.Model(Ne=1000, nsamples=5, mut=2e-7, recomb=0, seed_trees=123, seed_mutations=123, store_tree_sequences=True)
model.sim_loci(nloci=1, nsites=10000)
model.draw_genealogy(0, show_substitutions=True, layout='d');
Mutation RateMap¶
The mutation rate can also be set to vary over different intervals of a chromosome by using an msprime.RateMap
object to define the mutation rate. This take N args to define the start and end positions of intervals, and N-1 rate args. In the example below we simulate a 30Kb locus with zero mutations in the first 1/3, a low mutation rate in the middle 1/3, and high mutation rate on the last 1/3. You can see that conditional on the different lengths of the intervals covered by each genealogy in the simulated tree sequence below, many more mutations occurred on the genealogies associated with the latter half of the chromosome.
# define a mutation Map over 3 intervals from 0-30K
rate_map = msprime.RateMap(
position=[0, 10_000, 20_000, 30_000],
rate=[0, 1e-9, 1e-7],
)
# set the rate_map for mutations
model = ipcoal.Model(nsamples=10, Ne=10000, mut=rate_map, store_tree_sequences=True, seed_trees=123, seed_mutations=123)
model.sim_loci(1, 30_000)
model.draw_tree_sequence(width=700);
subst_model¶
The subst_model
arg defines the msprime.MutationModel
underlying the simulation of mutations. See the msprime/tskit mutation simulation documentation for a detailed description. Here we provide some simple examples. The default mutation model is Jukes-Cantor 1969 with equal rates and equilibrium frequencies for A,C,G,T. More complex models such as HKY and GTR can be created as custom parameterized models using msprime.MutationModel
as shown below. Similarly, other models can be defined that use a custom set of states and transition rates, such as the pam and binary models shown below.
# the default subst model is "JC69"
model = ipcoal.Model(Ne=1000, subst_model="JC69")
print(f"states: {model.subst_model.alleles}")
print(f"root distribution: {model.subst_model.root_distribution}")
print(f"transition matrix:\n{model.subst_model.transition_matrix}")
states: ['A', 'C', 'G', 'T'] root distribution: [0.25 0.25 0.25 0.25] transition matrix: [[0. 0.33333333 0.33333333 0.33333333] [0.33333333 0. 0.33333333 0.33333333] [0.33333333 0.33333333 0. 0.33333333] [0.33333333 0.33333333 0.33333333 0. ]]
Other simple models that can be set by name include "binary", "pam", "infinite_alleles", and "blosum62".
# a simple binary model
model = ipcoal.Model(Ne=1000, subst_model="binary")
print(f"states: {model.subst_model.alleles}")
print(f"root distribution: {model.subst_model.root_distribution}")
print(f"transition matrix:\n{model.subst_model.transition_matrix}")
states: ['0', '1'] root distribution: [1. 0.] transition matrix: [[0. 1.] [1. 0.]]
Using a more complex substitution model (GTR) requires settings parameters of the model using msprime.HKY
, msprime.GTR
, or msprime.MutationModel
classes.
GTR = msprime.GTR(
relative_rates=[0.2, 0.2, 0.3, 0.2, 0.2, 0.3], # {AC}, {AG}, {AT}, {CG}, {CT}, {GT}
equilibrium_frequencies=[0.3,0.2,0.2,0.3], # A, C, G, T
state_independent=False,
)
model = ipcoal.Model(Ne=10000, mut=1e-6, nsamples=5, subst_model=GTR, seed_trees=123, seed_mutations=123)
model.sim_loci(1, 100)
model.draw_seqview(scrollable=True, show_text=True);
recomb (recombination rate)¶
The recombination rate can be set as a float value or a RateMap similar to the mutation rate. It affects the per-site per-generation rate of recombination. To suppress recombination so that all sites within a locus share the same genealogical history you can set recomb=0
. The absolute value of the recombination rate is often examined relative to the mutation rate.
# suppress recombination
model = ipcoal.Model(Ne=1000, nsamples=5, recomb=0)
# set a relatively high rate of recombination (equal to mut)
model = ipcoal.Model(Ne=1000, nsamples=5, recomb=1e-8, mut=1e-8)
Recombination RateMap¶
Recombination can be set as a Map that varies over different regions of a chromosome. This can be used to represent an inversion where recombination is suppressed, or variation in recombination associated with hotpots or centromeres. In the example below we suppress recombination over the middle 1/3 of the chromosome. As a consequence, you can see that a single tree represents the ancestry over the entire center of the chromosome and accumulates many substitutions to support it.
# define recomb as a Map over 3 intervals from 0-30K
rate_map = msprime.RateMap(
position=[0, 10_000, 20_000, 30_000],
rate=[1e-8, 0., 1e-8],
)
# set the rate_map for recomb
model = ipcoal.Model(nsamples=8, Ne=10_000, recomb=rate_map, store_tree_sequences=True, seed_trees=123, seed_mutations=123)
# when recomb is a RateMap the nsites arg should be None (length is defined by the Map)
model.sim_loci(1, None)
model.draw_tree_sequence(width=800, start=0, max_trees=1000, tip_labels=False, scrollable=True);
seed_trees¶
This sets a numpy random number generator (RNG) to be used to stochastically sample coalescent times during a simulation.
seed_mutations¶
This sets a numpy random number generator (RNG) to be used to stochastically simulate the mutation process during a simulation.
store_tree_sequences¶
By default ipcoal does not store TreeSequence objects. These are very data rich objects that consume a lot of memory, and so if you don't plan to interact with them there is no need to save it. Instead, ipcoal.Model objects always parse a summary of the simulation that is stored in its .df and .seqs attributes. However, if you choose to store the tree sequences it enables further types of analyses and visualizations in ipcoal or by interacting with TreeSequences from within the Model's ts_dict
dictionary.
# default: does not store full TreeSequence objects
model = ipcoal.Model(Ne=10000)
model.sim_trees(2)
model.ts_dict
{}
# set True: stores TreeSequences in .ts_dict
model = ipcoal.Model(Ne=10000, store_tree_sequences=True)
model.sim_trees(2)
model.ts_dict
{0: <tskit.trees.TreeSequence at 0x7ad21c3cbb00>, 1: <tskit.trees.TreeSequence at 0x7ad21c2f8320>}
**kwargs¶
Most other arguments supported by msprime.sim_ancestry
or msprime.sim_mutations
are supported by the Model class. Examples include ancestry_model
, discrete_genomes
, record_full_arg
, etc.