Quick Guide¶

Welcome to the quick guide tutorial for ipcoal. This page is intended to introduce major concepts of coalescent simulations and to provide a concise overview of several types of statistical evolutionary analyses that can be performed in ipcoal. This documentation demonstrates ipcoal through a series of examples that combine analyses alongside visualizations, in the style of a jupyter notebook, where Python code can be executed interactively. The Python package toytree is installed alongside ipcoal and should typically be imported with it, like below, as the two are intended to work hand-in-hand.

Follow along in this guide to learn how to set up a parameterized demographic model; simulate coalescent genealogies and sequences; calculate likelihoods of observed data; apply phylogenetic inference tools to sequences; and compare inferred gene trees to simulated genealogies. Explore the documentation to learn more details about each of these steps and explore other features of ipcoal.

In [1]:

Copied!

import ipcoal
import toytree
import ipcoal
import toytree

The Model class¶

A core concept in evolutionary genetics is the use of population demographic models to represent a set of simplifying assumptions about how gene copies are replicated and inherited from one generation to the next. The main object in ipcoal is the ipcoal.Model class object, which represents a parameterized population demographic model (see Model class). This object is initialized with a number of parameter arguments that define not only the demographic history, but also the model of substitution, and some other more detailed optional arguments about how simulations will be performed. Once the object is initialized, it can be used to call a number of associated methods, and to access information about the model and simulation results from a number of associated attributes.

In [2]:

Copied!

# example initialization of a Model object
ipcoal.Model(Ne=1000, nsamples=5)
# example initialization of a Model object
ipcoal.Model(Ne=1000, nsamples=5)

Out[2]:

<ipcoal.model.Model at 0x7f8ee7f5ce30>

Setting Demography¶

The tree argument to the Model class is used to set up a demographic model. The simplest model is a single panmictic population with constant diploid effective population size (Ne). In ipcoal this can be initiated as a Model class with the default option of tree=None. In this case you only need to enter the Ne parameter, and set nsamples to designate the number of gene copies to simulate a genealogical history for. Throughout this documentation we will often set the seed_trees argument as well to set a random seed generator to ensure the same results each time.

In the example below we initialize a Model and call two additional methods from this object. The key concept to understand here is that the genealogy (shown with green edges) represents one randomly sampled history for a set of six gene copies from this population. The visualization, which shows a gene tree embedded in a grey rectangle, represents an embedding of the genealogy within the demographic model (i.e., the model is a container within which the genealogy must fit). In this example the gene copies coalesce to a common ancestor a little over 2000 generations ago.

In [3]:

Copied!





# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=6, seed_trees=123)

# simulate one genealogical tree under this model
model.sim_trees(nloci=1, nsites=1)

# draw the first genealogy embedded in the demographic model
model.draw_demography(idx=0);
# setup a single-population demographic model
model = ipcoal.Model(Ne=1000, nsamples=6, seed_trees=123)

# simulate one genealogical tree under this model
model.sim_trees(nloci=1, nsites=1)

# draw the first genealogy embedded in the demographic model
model.draw_demography(idx=0);

As a slightly more complex example, we can next set up a simulation that involves population structure in the form of a species tree. Here, the toytree library becomes very helpful for creating or parsing tree data to create a ToyTree data object as input. In this example we create a random 4-tip tree with a root height of 1e5 generations.

In [4]:

Copied!

# define a species tree
sptree = toytree.rtree.unittree(ntips=4, treeheight=1e5, seed=333)

# draw the tree using width to show Ne
sptree.draw(ts='p');
# define a species tree
sptree = toytree.rtree.unittree(ntips=4, treeheight=1e5, seed=333)

# draw the tree using width to show Ne
sptree.draw(ts='p');

Let's now set up a Model that uses the above tree to define a species tree scenario. Here we enter this tree object as the tree argument, from which the divergence times between lineages will be parsed. We can set an Ne value to all branches, or set different Ne values to each. We can sample the same number of gene copies from each population, or set different numbers for each. The nsamples arg here is entered as a dict where the keys correspond to node indices or names from the tree shown above. Finally, we can simulate trees and draw the demography.

You can hover your cursor over the visualization below to view additional information about each population interval in the demographic model. This will list values for interval's index ID, Ne value, and length in units of generations (t$_g$) and coalescent units (t$_c$). Here we can see by embedding the simulated genealogy into the species tree that incomplete lineages sorting occurred in several species tree intervals.

In [5]:

Copied!





# set up a demographic model using the species tree
model = ipcoal.Model(tree=sptree, Ne=3e4, nsamples={0: 3, 1: 2, 2: 2, 3: 3}, seed_trees=123)

# simulate one genealogy
model.sim_trees(1)

# draw the genealogy embedded in the species tree model
model.draw_demography(idx=0);
# set up a demographic model using the species tree
model = ipcoal.Model(tree=sptree, Ne=3e4, nsamples={0: 3, 1: 2, 2: 2, 3: 3}, seed_trees=123)

# simulate one genealogy
model.sim_trees(1)

# draw the genealogy embedded in the species tree model
model.draw_demography(idx=0);

Simulation¶

An ipcoal.Model object has three methods for coalescent simulation, sim_trees(), sim_loci(), and sim_snps. Each of these serves a different purpose, and accepts a number of arguments to modify its behavior. Under the hood, they represent different algorithms to make one or more function calls to msprime and store a summary of the results, and/or the full results as one or more tskit.TreeSequence objects. In many cases, the summarized results of ipcoal simulations (described below) are intended to provide a simpler minimalist interface to working with tree sequence simulations.

sim_trees: The sim_trees function is the fastest method. It generates only coalescent trees as a result, and does not perform mutations. It takes two arguments, nloci and nsites. In ipcoal we always treat loci as being independent of one another. You can think of them as separate chromosomes. The length of each locus is represented by some number of sites. Recombination can occur within a locus giving rise to multiple genealogical histories (See Linked Genealogies). This function is generally used to sample distributions of genealogies given the demographic model for use in investigating methods related to the coalescent.

In [6]:

Copied!

# simulate a single genealogy (i.e., for 1 locus at one site)
model.sim_trees()
# simulate a single genealogy (i.e., for 1 locus at one site)
model.sim_trees()

In [7]:

Copied!

# simulate 10 independent loci each containing one genealogy
model.sim_trees(nloci=10)
# simulate 10 independent loci each containing one genealogy
model.sim_trees(nloci=10)

In [8]:

Copied!

# simulate 1 locus of len=10000. May contain multiple trees if recomb.
model.sim_trees(nloci=1, nsites=1e4)
# simulate 1 locus of len=10000. May contain multiple trees if recomb.
model.sim_trees(nloci=1, nsites=1e4)

sim_loci: The sim_loci function is a simple extension of the sim_trees function that adds mutations to the simulated trees using the mutation rate and substitution model stored to the Model object. This method is used to generate sequence data of the requested locus length, including invariant sites, and is useful for investigating methods related to phylogenetic inference.

In [9]:

Copied!

# simulate 2 loci each 100bp in length
model.sim_loci(nloci=2, nsites=100)
# simulate 2 loci each 100bp in length
model.sim_loci(nloci=2, nsites=100)

sim_snps: The sim_snps function is the most complex. It is used to generate the requested number of unlinked SNPs. In contrast to the previous function, this one conditions the simulations on the observation of variation. There are several options that can be implemented to affect how the conditioning works, which can be toggled to trade-off potential biases versus speed. Note that the speed of this function can be very slow if both the mutation rate and coalescent times of your trees are very small. This is useful for testing many types of methods that analyze SNP data.

In [10]:

Copied!

# simulate 5 unlinked SNPs
model.sim_snps(nsnps=5)
# simulate 5 unlinked SNPs
model.sim_snps(nsnps=5)

Tree Sequences and ARGs¶

An important distinction that we highlight in ipcoal is the difference between linked and unlinked genealogical data. Unlinked data represents independent draws from a distribution, whereas linked data represents correlated draws, meaning that the next data point is influenced by the previous one. In the context of a genome, we expect that regions located on different chromosomes are independent of each other, whereas sites that are located close together on the same chromosome are not. In the context of the coalescent, the correlation among nearby regions of the genome represents that one or more samples share the same ancestors in both regions. Recombination between regions causes this similarity to decay since it has the effect of causing different genomic regions to trace back to different ancestors.

A simulated locus represents the genealogical history for the set of gene copies over the requested length of sites. The samples may share a single genealogy over the entire locus, or if recombination has occurred at a position in this locus throughout their history, then the gene copies will trace back >1 linked genealogical histories. In that case, each genealogy will have a corresponding start and end position on the chromosome/locus. This data structure, composed of an ordered set of trees and their interval lengths, is termed an ancestral recombination graph (ARG), and can also be represented as a tree sequence (the term and structure used by msprime/tskit). Below is a visualization of an example tree sequence composed of a 10Kb locus in which the history for this set of samples is distributed across five different genealogies, each separated by recombination events that occurred in their history.

In [11]:

Copied!





# simulate 1 locus w/ recombination to create a TreeSequence
model = ipcoal.Model(tree=sptree, Ne=2e4, nsamples=2, seed_trees=123, store_tree_sequences=True)
model.sim_loci(nloci=2, nsites=10000)
model.draw_tree_sequence(width=700);
# simulate 1 locus w/ recombination to create a TreeSequence
model = ipcoal.Model(tree=sptree, Ne=2e4, nsamples=2, seed_trees=123, store_tree_sequences=True)
model.sim_loci(nloci=2, nsites=10000)
model.draw_tree_sequence(width=700);

There are instances where you may be interested in simulating linked data to study the effect of recombination, or alternatively, we may sometimes wish to simulate unlinked data exclude it. Many population genetic and phylogenetic inference tools assume that data are unlinked. One useful application of ipcoal is to generate linked and unlinked datasets to explore the effect of linkage on analytical results.

The technical way in which this is implemented in ipcoal is to simulate each locus as a distinct TreeSequence object. This can be seen in the code block below, where we use the optional argument store_tree_sequences=True to store the simulated TreeSequence objects to the Model. These can then be accessed from the .ts_dict attribute. Any two trees drawn from different TreeSequence objects (i.e., different loci) are independent. Any two trees drawn from the same TreeSequence (same locus) may be linked, depending on many factors of the demographic simulation.

In [12]:

Copied!





# simulating 5 loci generates 5 independent TreeSequences
model = ipcoal.Model(tree=sptree, Ne=2e4, store_tree_sequences=True)
model.sim_loci(nloci=5, nsites=1000)
model.ts_dict
# simulating 5 loci generates 5 independent TreeSequences
model = ipcoal.Model(tree=sptree, Ne=2e4, store_tree_sequences=True)
model.sim_loci(nloci=5, nsites=1000)
model.ts_dict

Out[12]:

{0: <tskit.trees.TreeSequence at 0x7f8ee0c29b50>,
 1: <tskit.trees.TreeSequence at 0x7f8ee0b8a660>,
 2: <tskit.trees.TreeSequence at 0x7f8ee0aa2f00>,
 3: <tskit.trees.TreeSequence at 0x7f8ee0b500e0>,
 4: <tskit.trees.TreeSequence at 0x7f8f783f7c80>}

Does each locus contain only one genealogy, or can it contain multiple?

If the recombination rate is >0 then a single locus extending over more than one site can represent multiple coalescent genealogies, each separated by a recombination event that occurred in the history of the samples at that locus.

Simulation results¶

As we saw above, it is possible to store and access TreeSequence objects as the result of simulation, from which a huge wealth of information can be extracted. However, in many cases it may be preferable to discard these and instead analyze a summary of the simulation, in terms of the trees and sequences that it produces. This is the default behavior of ipcoal, where calling one of the three simulation functions will generate two results stored to the Model object, a .df dataframe and a .seqs numpy array.

dataframe (Model.df)¶

All simulation functions generate a .df dataframe as a result. This is a pandas DataFrame object that can be used to access the results. What does this table show? You can see in the example table from a simulation of unlinked data below that 10 different loci are represented, numbered 0-9 in the “locus” column. Each locus is represented by only a single site, stretching from start=0 to end=1. Each is 1bp in length and contains no SNPs since we have not simulated sequence data yet, only genealogies. A column labeling the tree index (tidx) for each row shows that all are labeled 0, meaning that each genealogy is the first (and here only) tree simulated in that locus. The final column, genealogy, contains newick strings.

In [13]:

Copied!





# simulate unlinked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=10, nsites=1)
mod.df
# simulate unlinked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=10, nsites=1)
mod.df

Out[13]:

	locus	end	nbps	genealogy
0	0	1	1	(p_1:287537.351532411179...
1	1	1	1	(p_3:110157.870294456748...
2	2	1	1	((p_0:1313.6091112858539...
3	3	1	1	(p_0:314684.858664838480...
4	4	1	1	(p_1:147864.053017845842...
5	5	1	1	(p_0:93326.6807769358565...
6	6	1	1	(p_3:143619.984664270363...
7	7	1	1	((p_2:70699.778029327673...
8	8	1	1	((p_0:14776.914259654118...
9	9	1	1	((p_1:17689.798992805852...

Now look at an example linked data results table. In contrast to the previous table we see that multiple rows correspond to each locus ID. The first genealogy stretches from position 0 (start=0) to position 3380 (end=3380) and it is 3380bp in length. Following down the table we can see that recombination has broken locus 0 into many smaller intervals, each represented by a different sized chunk of the locus, and by a slightly different genealogy. Each genealogy has a different tree index (tidx) indicating their order along the locus.

In [14]:

Copied!





# simulate linked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=2, nsites=1e4)
mod.df
# simulate linked genealogies
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123)
mod.sim_loci(nloci=2, nsites=1e4)
mod.df

Out[14]:

	locus	start	end	nbps	nsnps	tidx	genealogy
0	0	0	3380	3380	23	0	((p_1:12194.366367953536...
1	0	3380	4888	1508	11	1	((p_1:12194.366367953536...
2	0	4888	6243	1355	11	2	((p_1:12194.366367953536...
3	0	6243	6811	568	4	3	(p_4:504559.306961862370...
4	0	6811	6910	99	0	4	((p_1:12194.366367953536...
5	0	6910	6971	61	1	5	((p_1:12194.366367953536...
6	0	6971	8463	1492	29	6	((p_1:12194.366367953536...
7	0	8463	10000	1537	12	7	((p_1:12194.366367953536...
8	1	0	7642	7642	22	0	((p_2:13777.964596574553...
9	1	7642	7935	293	2	1	((p_0:25416.063867531960...
10	1	7935	10000	2065	11	2	((p_0:25416.063867531960...

sequences (Model.seqs)¶

The other major result of simulations is sequences. Sequences are simulated by applying a mutation process over the edges of the simulated genealogies (using msprime.sim_mutations under the hood). These are stored as a numpy array in the .seqs attribute of a Model class object. Users can optionally interact with this array directly, but most often, they will want to use the write functions to write data to a number of different formats (see next section). The sequences are stored in an array of shape (nloci, nsamples, nsites) as uint8 integers where each integer corresponds to a character state in the .alleles dict, which is defined by the substitution model set during Model initialization.

In [15]:

Copied!





# simulate 1 locus for 5 samples at 50 sites
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123, subst_model="JC69")
mod.sim_loci(nloci=1, nsites=50)
mod.seqs.shape
# simulate 1 locus for 5 samples at 50 sites
mod = ipcoal.Model(Ne=1e5, nsamples=5, seed_trees=123, subst_model="JC69")
mod.sim_loci(nloci=1, nsites=50)
mod.seqs.shape

Out[15]:

(1, 5, 50)

In [16]:

Copied!

# sequences are stored as a 3D array
mod.seqs
# sequences are stored as a 3D array
mod.seqs

Out[16]:

array([[[0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 3, 3, 1, 3, 3, 2, 0, 3, 0, 2, 2, 3, 0, 3, 2, 0, 0,
         1, 2, 0, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 1, 3, 3, 1, 2, 0, 2, 0,
         3, 0, 1, 0, 0, 1, 1, 0]]], dtype=uint8)

In [17]:

Copied!

# the mapping of ints to alleles
mod.alleles
# the mapping of ints to alleles
mod.alleles

Out[17]:

{0: 'A', 1: 'C', 2: 'G', 3: 'T'}

Writing data¶

Simulated data can be saved to disk in a number of ways. The dataframe result can be saved as a CSV using pandas operations, like below. Sequence data can be similarly saved in a tabular format, but in most cases the reason for writing sequences will be to format them for an analysis tool. (Note: see Phylogenetic Inference methods section below for analyzing sequences without having to write sequence files). A number of methods are available from Model objects for writing sequences, including write_vcf, write_concat_to_phylip, and others methods you can explore which similarly start with write_.

In [18]:

Copied!

# simulate sequences for 10 samples in population 
model = ipcoal.Model(sptree, Ne=1e5, nsamples=10, seed_trees=123, seed_mutations=123)
model.sim_loci(nloci=100, nsites=100)
# simulate sequences for 10 samples in population 
model = ipcoal.Model(sptree, Ne=1e5, nsamples=10, seed_trees=123, seed_mutations=123)
model.sim_loci(nloci=100, nsites=100)

In [19]:

Copied!

# write .df dataframe to disk
model.df.to_csv("/tmp/test.csv")
# write .df dataframe to disk
model.df.to_csv("/tmp/test.csv")

In [20]:

Copied!

# write the full sequences to a file
model.write_concat_to_phylip(name="test", outdir="/tmp", diploid=False)
# write the full sequences to a file
model.write_concat_to_phylip(name="test", outdir="/tmp", diploid=False)

wrote concat locus (40 x 10000bp) to /tmp/test.phy

In [21]:

Copied!

# write only the SNPs data to file
model.write_vcf(name="test", outdir="/tmp", diploid=True)
# write only the SNPs data to file
model.write_vcf(name="test", outdir="/tmp", diploid=True)

wrote 309 SNPs across 96 linkage blocks to /tmp/test.vcf

In [22]:

Copied!

# get SNPs VCF as a dataframe
vcf = model.write_vcf(diploid=True)
vcf.head(10)
# get SNPs VCF as a dataframe
vcf = model.write_vcf(diploid=True)
vcf.head(10)

Out[22]:

	CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	r0_0	...	r2_0	r2_1	r2_2	r2_3	r2_4	r3_0	r3_1	r3_2	r3_3	r3_4
0	1	69	.	A	G	99	PASS	.	GT	1\|0	...	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1
1	1	76	.	T	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
2	2	4	.	A	C	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|1	1\|0	1\|0	1\|1	1\|1	1\|1	1\|1	1\|1
3	2	17	.	T	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
4	2	71	.	G	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
5	2	84	.	A	G	99	PASS	.	GT	1\|0	...	1\|1	1\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|1	0\|0
6	2	96	.	C	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
7	3	20	.	G	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|0	0\|1	0\|1	0\|0
8	3	21	.	A	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|0	0\|1	0\|1	0\|0
9	3	42	.	A	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0

10 rows × 29 columns

Visualization¶

As you have already seen, ipcoal includes a number of methods for visualization. Examples are listed below.

Demography: The Model.draw_demography method can be used to visualize the demographic model as a container in which genealogies can be selected to be embedded. There are many styling options for this method that can be used to modify its look. This provides a useful tool for validating demographic models, for teaching, and for gaining an intuitive understanding of the effects of your model parameters.

In [23]:

Copied!

model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_trees(1)
model.draw_demography(0);
model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_trees(1)
model.draw_demography(0);

Genealogies: The Model.draw_genealogies function is a conventient tool for plotting a grid of multiple genealogies side by side on the same axis to compare their differences. It takes a number of options supported by the toytree.Multitree.draw function, which it calls under the hood.

In [24]:

Copied!

model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_trees(10)
model.draw_genealogies(scale_bar=1e4, shared_axes=True);
model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_trees(10)
model.draw_genealogies(scale_bar=1e4, shared_axes=True);

Genealogy: The Model.draw_genealogy function is useful for selecting and plotting a single genealogy. It takes many optional styling arguments that are supported by toytree.ToyTree.draw. In addition, if you selected to store_tree_sequences=True during Model init then the placement of mutations on the edges of the genealogy can be shown as well by seting show_substitutions=True. The idx argument is used to select genealogies by index from the Model.df dataframe.

In [25]:

Copied!

model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=666, store_tree_sequences=True)
model.sim_loci(1, 1e4)
model.draw_genealogy(idx=0, show_substitutions=True);
model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=666, store_tree_sequences=True)
model.sim_loci(1, 1e4)
model.draw_genealogy(idx=0, show_substitutions=True);

TreeSequences: The Model.draw_tree_sequence function can be used to draw a tree sequence as an ordered set of trees and their interval lengths.

In [26]:

Copied!





# simulate 1 locus w/ recombination to create a TreeSequence
model = ipcoal.Model(tree=sptree, Ne=2e4, nsamples=2, seed_trees=123, store_tree_sequences=1)
model.sim_loci(nloci=2, nsites=10000)
model.draw_tree_sequence(width=700);
# simulate 1 locus w/ recombination to create a TreeSequence
model = ipcoal.Model(tree=sptree, Ne=2e4, nsamples=2, seed_trees=123, store_tree_sequences=1)
model.sim_loci(nloci=2, nsites=10000)
model.draw_tree_sequence(width=700);

Sequences: Visualizing sequences is not hugely useful, but as a teaching and validation tool it is nice to have the option.

In [27]:

Copied!

model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_snps(10)
model.draw_seqview(width=250, show_text=True);
model = ipcoal.Model(sptree, Ne=1e4, nsamples=2, seed_trees=123)
model.sim_snps(10)
model.draw_seqview(width=250, show_text=True);

Phylogenetic Inference¶

The ipcoal.phylo module contains a number of functions for efficiently performing phylogenetic inference on simulated sequences. This includes functions that call raxml-ng for inference of gene trees from individual loci, sliding windows, or concatenated sequences of multiple loci. This module also includes functions for inferring species tree or networks from distributions of trees using astral or SNaQ. (These tools can be easily installed using conda.) In the example below we call infer_raxml_ng_tree to infer a gene tree from sequences of concatenated loci. Here we use the arguments idxs to select all or a subset of loci, and the argument diploid to specify that pairs of haploid samples should be joined together to form diploid sequences (reducing the number of samples in half). This efficient approach to simulating and testing phylogenetic inference methods makes it easy to explore the effects of concatenation/recombination, substitution models/homoplasy, demographic model parameters, phasing/diploid sequences, and many other factors on the accuracy of gene tree inference. See the documentation section on Phylogenetic Inference for more examples.

In [28]:

Copied!

# simulate many loci under the species tree model
mod = ipcoal.Model(sptree, Ne=1e5, nsamples=4)
mod.sim_loci(nloci=500, nsites=1000)
# simulate many loci under the species tree model
mod = ipcoal.Model(sptree, Ne=1e5, nsamples=4)
mod.sim_loci(nloci=500, nsites=1000)

In [29]:

Copied!

# infer a gene tree for all 500 loci concatenated
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(mod, idxs=None, diploid=True)
concat_tree.draw();
# infer a gene tree for all 500 loci concatenated
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(mod, idxs=None, diploid=True)
concat_tree.draw();

In [30]:

Copied!

# infer a gene tree from the first 10 loci concatenated
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(mod, idxs=range(10), diploid=True)
concat_tree.draw();
# infer a gene tree from the first 10 loci concatenated
concat_tree = ipcoal.phylo.infer_raxml_ng_tree(mod, idxs=range(10), diploid=True)
concat_tree.draw();

Coalescent likelihoods¶

As a statistical model for genealogical ancestry the coalescent provides two key purposes: First, it serves as a generative model to simulate genealogies as random variables given a population demographic model. We have seen many examples of this already. It provides the basis for us to study genetic variation as the mutational process integrated over a large amount of genealogical variation. The second utility of the coalescent model is as an analytical tool for evaluating observations. We can ask how likely is this model to have produced this genealogy given the model parameters? The likelihood calculation associated with this question is implemented in ipcoal in two subpackages, ipcoal.msc and ipcoal.smc, for analyses under the multispecies coalescent and sequentially Markov coalescent, respectively. Here we show a brief example. See the Likelihood section for more details.

We need three things to calculate a likelihood, a species tree with Ne values set to each edge, a gene tree, and an imap dictionary (S, G, and I). Here we fetch an example dataset using the ipcoal.msc.get_test_data function, which will return a species tree, 100 simulated genealogies, and a dict mapping species tree tip labels to genealogy tip labels.

In [31]:

Copied!

# get the example msc dataset (species tree, genealogies, imap)
S, G, I = ipcoal.msc.get_test_data(nloci=100, seed=123)

# draw the first genealogy embedded in the species tree
ipcoal.draw.draw_embedded_genealogy(S, G[0], I, container_blend=True);
# get the example msc dataset (species tree, genealogies, imap)
S, G, I = ipcoal.msc.get_test_data(nloci=100, seed=123)

# draw the first genealogy embedded in the species tree
ipcoal.draw.draw_embedded_genealogy(S, G[0], I, container_blend=True);

In [32]:

Copied!

# the imap dictionary is simply a mapping of species names to genealogy tip names
I
# the imap dictionary is simply a mapping of species names to genealogy tip names
I

Out[32]:

{'A': ['A_0', 'A_1', 'A_2'], 'B': ['B_0', 'B_1'], 'C': ['C_0'], 'D': ['D_0']}

In [33]:

Copied!

# calculate the loglikelihood of the first genealogy given the species tree
ipcoal.msc.get_msc_loglik(S, G[0], I)
# calculate the loglikelihood of the first genealogy given the species tree
ipcoal.msc.get_msc_loglik(S, G[0], I)

Out[33]:

80.28093312011262

In [34]:

Copied!

# calculate the sum loglikelihood of all genealogies given the species tree
ipcoal.msc.get_msc_loglik(S, G, I)
# calculate the sum loglikelihood of all genealogies given the species tree
ipcoal.msc.get_msc_loglik(S, G, I)

Out[34]:

7973.467153781378

We can evaluate this dataset under several different model parameter settings to find the parameters that optimize the log likelihood. This is meant as a simple demonstration, see the Likelihood section for ways to write this to run much faster. Here we can see that the true model parameters, constant Ne=1e5, has the lowest negative sum log-likelihood score, showing that the data (coalescent times) are best fit by this model.

In [35]:

Copied!





import pandas as pd

# create a dataframe to fill over multiple Ne values
ne_values = [1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6]
loglik_data = pd.DataFrame(
    index=range(len(ne_values)), 
    columns=["Ne", "sum_loglik", "true_model"]
)

# iterate over Ne values to calculate sum loglik
for idx, val in enumerate(ne_values):
    
    # set the species tree Ne value
    S.set_node_data("Ne", default=val, inplace=True)

    # store calculated sum loglik given this Ne
    loglik_data.loc[idx] = val, ipcoal.msc.get_msc_loglik(S, G, I), val==1e5

# show result as ints for readability
loglik_data.astype(int)
import pandas as pd

# create a dataframe to fill over multiple Ne values
ne_values = [1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6]
loglik_data = pd.DataFrame(
    index=range(len(ne_values)), 
    columns=["Ne", "sum_loglik", "true_model"]
)

# iterate over Ne values to calculate sum loglik
for idx, val in enumerate(ne_values):
    
    # set the species tree Ne value
    S.set_node_data("Ne", default=val, inplace=True)

    # store calculated sum loglik given this Ne
    loglik_data.loc[idx] = val, ipcoal.msc.get_msc_loglik(S, G, I), val==1e5

# show result as ints for readability
loglik_data.astype(int)

Out[35]:

	Ne	sum_loglik	true_model
0	1000	69542	0
1	5000	18522	0
2	10000	12440	0
3	50000	8207	0
4	100000	7973	1
5	500000	8419	0
6	1000000	8770	0
7	5000000	9683	0

	CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	r0_0	...	r2_0	r2_1	r2_2	r2_3	r2_4	r3_0	r3_1	r3_2	r3_3	r3_4
0	1	69	.	A	G	99	PASS	.	GT	1\|0	...	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1	1\|1
1	1	76	.	T	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
2	2	4	.	A	C	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|1	1\|0	1\|0	1\|1	1\|1	1\|1	1\|1	1\|1
3	2	17	.	T	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
4	2	71	.	G	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
5	2	84	.	A	G	99	PASS	.	GT	1\|0	...	1\|1	1\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|1	0\|0
6	2	96	.	C	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
7	3	20	.	G	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|0	0\|1	0\|1	0\|0
8	3	21	.	A	T	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|0	0\|1	0\|1	0\|0
9	3	42	.	A	G	99	PASS	.	GT	0\|0	...	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0