Simulation¶

An ipcoal.Model object has three methods for coalescent simulation, sim_trees, sim_loci, and sim_snps. Each of these serves a different purpose, and accepts a number of arguments to modify its behavior. Under the hood, they represent different algorithms to make one or more function calls to msprime and store a summary of the results, and/or the full results as one or more tskit.TreeSequence objects. These methods were introduced briefly in the Quick Guide. Here we provide a more detailed description of their options.

In [1]:

Copied!

import ipcoal
import toytree
import ipcoal
import toytree

Demographic Model¶

Throughout this page we will reuse the simple demographic Model object below, describing a species tree with 5 lineages, a root height of 1e6 generations, Ne=1e5 for each lineage, and two sampled genomes per lineage.

In [2]:

Copied!





# define a 5-tip species tree demographic Model
sptree = toytree.rtree.unittree(ntips=5, treeheight=1e6, seed=123)
model = ipcoal.Model(sptree, nsamples=2, Ne=1e5, seed_trees=123, seed_mutations=123, store_tree_sequences=True)

# draw an example embedded genealogy
model.sim_trees(1)
model.draw_demography(0);
# define a 5-tip species tree demographic Model
sptree = toytree.rtree.unittree(ntips=5, treeheight=1e6, seed=123)
model = ipcoal.Model(sptree, nsamples=2, Ne=1e5, seed_trees=123, seed_mutations=123, store_tree_sequences=True)

# draw an example embedded genealogy
model.sim_trees(1)
model.draw_demography(0);

Simulation args¶

nloci¶

The two core arguments to the simulations function are nloci and nsites. The first describes the number of independent TreeSequences to simulate, the latter describes the length of each TreeSequence in sites. By default these should be int values, but can be entered using scientific notation to designate large values (e.g., 1e6). Note: If you do actually want to allow simulation events to occur at non int intervals you can set the argument discrete_genome=False during Model initialization.

In [3]:

Copied!

# simulate a single genealogy (i.e., for 1 locus at one site)
model.sim_trees()
model.df
# simulate a single genealogy (i.e., for 1 locus at one site)
model.sim_trees()
model.df

Out[3]:

	locus	start	end	nbps	nsnps	tidx	genealogy
0	0	0	1	1	0	0	(((r3_0:62053.8324608124...

In [4]:

Copied!

# simulate five independent TreeSequences (each 1 site)
model.sim_trees(nloci=5)
model.df
# simulate five independent TreeSequences (each 1 site)
model.sim_trees(nloci=5)
model.df

Out[4]:

	locus	end	nbps	genealogy
0	0	1	1	(((r3_0:62053.8324608124...
1	1	1	1	(((r2_0:103037.424419652...
2	2	1	1	(((r2_0:11784.2776233241...
3	3	1	1	(((r3_0:263096.624715134...
4	4	1	1	(((r1_0:668765.649524524...

nsites¶

The nsites argument sets the length of simulated TreeSequences, which can be thought of as the length of independent chromosomes. This can be set to small or large values >=1. In fact, setting nsites=1 can often be useful to simulate many independent loci that do not include recombination. This type of distribution -- a collection of unlinked genealogies -- is the typical assumption of the multispecies coalescent. By contrast, longer sequences can contain recombination such that they compose multiple linked genealogies.

In [5]:

Copied!

# simulate 10 independent loci each containing one genealogy
model.sim_trees(nloci=10)
model.df
# simulate 10 independent loci each containing one genealogy
model.sim_trees(nloci=10)
model.df

Out[5]:

	locus	end	nbps	genealogy
0	0	1	1	(((r3_0:62053.8324608124...
1	1	1	1	(((r2_0:103037.424419652...
2	2	1	1	(((r2_0:11784.2776233241...
3	3	1	1	(((r3_0:263096.624715134...
4	4	1	1	(((r1_0:668765.649524524...
5	5	1	1	(((r3_0:40820.5599872290...
6	6	1	1	(((r2_0:217213.222983922...
7	7	1	1	((r3_0:174821.2644626926...
8	8	1	1	(((r4_0:153197.298592874...
9	9	1	1	(((r2_0:144026.443506229...

Wwhen nsites is greater than 1 and recomb is nonzero, there is an opportunity for recombination to break a locus into separate intervals with different genealogical histories, as in the example 10Kb locus below.

In [6]:

Copied!

# simulate 1 locus of len=1000. May contain multiple trees if recomb.
model.sim_trees(nloci=1, nsites=1_000)
model.df
# simulate 1 locus of len=1000. May contain multiple trees if recomb.
model.sim_trees(nloci=1, nsites=1_000)
model.df

Out[6]:

	start	end	nbps	tidx	genealogy
0	0	552	552	0	(((r4_0:52913.7662955324...
1	552	636	84	1	(((r4_0:52913.7662955324...
2	636	1000	364	2	(((r4_0:52913.7662955324...

precision¶

The precision arg can be used to set the number of floating points used to store the height nodes when written to newick format. This is generally not very useful and users should leave it at its default value (14) to store high precision node heights. However, in some cases, such as very large trees, it may be useful to store less precise node heights in order to reduce the size of stored data.

In [7]:

Copied!

# a genealogy stored with the default precision=14
model.sim_trees(1, precision=14)
model.df.genealogy[0]
# a genealogy stored with the default precision=14
model.sim_trees(1, precision=14)
model.df.genealogy[0]

Out[7]:

'(((r3_0:62053.83246081245306,r3_1:62053.83246081245306):898951.07965969515499,(r4_0:72932.80428663847852,r4_1:72932.80428663847852):888072.10783386905678):278636.08353009587154,((r2_0:152989.70899227279006,r2_1:152989.70899227279006):836161.27246919600293,(r0_0:656659.41277821129188,(r0_1:386750.85023718787124,(r1_0:75788.65763844538014,r1_1:75788.65763844538014):310962.19259874249110):269908.56254102342064):332491.56868325744290):250490.01418913470116);'

In [8]:

Copied!

# a genealogy stored with the precision=0 (int values)
model.sim_trees(1, precision=0)
model.df.genealogy[0]
# a genealogy stored with the precision=0 (int values)
model.sim_trees(1, precision=0)
model.df.genealogy[0]

Out[8]:

'(((r3_0:62054,r3_1:62054):898951,(r4_0:72933,r4_1:72933):888072):278636,((r2_0:152990,r2_1:152990):836161,(r0_0:656659,(r0_1:386751,(r1_0:75789,r1_1:75789):310962):269909):332492):250490);'

nproc¶

When simulating trees there is an option to parallelize the simulation using multiple processors on a CPU. For very large simulations this can provide a reasonable speed improvement. This is currently only implemented for sim_trees. Note that this retains the repeatability of results using random seeds.

In [9]:

Copied!

%%timeit
model.sim_trees(nloci=1000, nsites=100, nproc=4)
%%timeit
model.sim_trees(nloci=1000, nsites=100, nproc=4)

2.29 s ± 95.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]:

Copied!

%%timeit
model.sim_trees(nloci=1000, nsites=100, nproc=1)
%%timeit
model.sim_trees(nloci=1000, nsites=100, nproc=1)

4.09 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We can validate that the simulation returns the exact same trees regardless of the number of cores used for parallelization.

In [11]:

Copied!





# simulate trees using proc=2 and proc=4 and compare the results
model.sim_trees(10, 10, nproc=2)
nwk_proc2 = model.df.genealogy[3]

model.sim_trees(10, 10, nproc=4)
nwk_proc4 = model.df.genealogy[3]

assert nwk_proc2 == nwk_proc4
# simulate trees using proc=2 and proc=4 and compare the results
model.sim_trees(10, 10, nproc=2)
nwk_proc2 = model.df.genealogy[3]

model.sim_trees(10, 10, nproc=4)
nwk_proc4 = model.df.genealogy[3]

assert nwk_proc2 == nwk_proc4

sim_trees¶

The sim_trees function is the fastest simulation method available. It generates only coalescent trees as a result, and does not perform mutations.

In [12]:

Copied!

model.sim_trees(5, 10)
model.df
model.sim_trees(5, 10)
model.df

Out[12]:

	locus	end	nbps	genealogy
0	0	10	10	(((r2_0:264683.059124009...
1	1	10	10	(((r2_0:239512.323652444...
2	2	10	10	(((r3_0:70368.4285205892...
3	3	10	10	(((r4_0:174626.986084489...
4	4	10	10	((r4_0:479240.4714488410...

sim_loci¶

The sim_loci function is a simple extension of the sim_trees function that adds mutations to the simulated trees using the mutation rate and substitution model stored to the Model object. This method generate sequences, including invariant sites, that can be written using the Model.write methods, and is useful for investigating methods related to phylogenetic inference. Note that simulating sequences also involves simulated genealogies, and so this includes all information that is produced by sim_trees in addition to more information.

In [13]:

Copied!

model.sim_loci(5, 10)
model.df
model.sim_loci(5, 10)
model.df

Out[13]:

	locus	end	nbps	nsnps	genealogy
0	0	10	10	0	(((r2_0:264683.059124009...
1	1	10	10	0	(((r2_0:239512.323652444...
2	2	10	10	1	(((r3_0:70368.4285205892...
3	3	10	10	0	(((r4_0:174626.986084489...
4	4	10	10	0	((r4_0:479240.4714488410...

sim_snps¶

The sim_snps function is the most complex. It is used to generate the requested number of unlinked SNPs. In contrast to the previous function, this one conditions the simulations on the observation of variation. In other words, it will generate a TreeSequence, apply mutations to it, and if no mutations are observed, it samples a new TreeSequence (unless the repeat_on_trees arg is used; see below). This method includes several options that affect how the conditioning works, which can be toggled to trade-off potential biases versus speed. Note that the speed of this function can be very slow if both the mutation rate and coalescent times of your trees are very small. This is useful for testing many types of methods that analyze SNP data.

nsnps¶

The purpose of this method is to sample unlinked snps, and so it only takes one argument, nsnps, to designate the size of the simulation, instead of nloci and nsites. Each simulated SNP will necessary occur on a different unlinked genealogy.

In [14]:

Copied!

# simulate 5 unlinked SNPs
model.sim_snps(nsnps=5)
model.df
# simulate 5 unlinked SNPs
model.sim_snps(nsnps=5)
model.df

Out[14]:

	locus	end	nbps	nsnps	genealogy
0	0	1	1	1	(((r2_0:29637.2437810995...
1	1	1	1	1	(r4_1:1923244.1055243662...
2	2	1	1	1	(((r3_0:197589.519103828...
3	3	1	1	1	(((r2_0:522052.928222848...
4	4	1	1	1	(((r3_0:52008.4828349061...

You can visualize the SNPs or write them to a file the same as other simulated sequence data.

In [15]:

Copied!

# write as VCF
model.write_vcf()
# write as VCF
model.write_vcf()

Out[15]:

	CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	r0_0	r0_1	r1_0	r1_1	r2_0	r2_1	r3_0	r3_1	r4_0	r4_1
0	1	1	.	G	T	99	PASS	.	GT	1\|1	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
1	2	1	.	G	C	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	1\|1	0\|0
2	3	1	.	A	C	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	1\|1	1\|1
3	4	1	.	A	T	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1
4	5	1	.	A	T	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	0\|0	0\|0

In [16]:

Copied!

# visualize SNPs matrix
model.draw_seqview(width=150, show_text=True);
# visualize SNPs matrix
model.draw_seqview(width=150, show_text=True);

Conditioning on SNPs¶

When conditioning the simulation of SNPs, we must designate what we actually mean by this. Are we conditioning on the fact that samples must exhibit variation at the present, or that variation among samples existed at some point during the simulation? What if the a variant arose but then a subsequence mutation changed it back to its initial state such that variation existed for some time, but then was lost? What if multiple mutations occurred at a site so that it changed two or three times? We can simulate SNPs in ways that include or exclude these different scenarios by specifying the options to min_mutations, max_mutations, min_alleles, and max_alleles.

min and max alleles¶

By default min_alleles=2, and max_alleles=None. This means that a SNP must exhibit variation at the present to be kept (at least two alleles). However, multiple mutations can occur at the simulated site, we set no upper limit on the number of alleles. So we could observe a tri-allelic site. To demonstrate this, we can set min_alleles=3 in the simulation below so that we only keep at least tri-allelic sites, where at least two mutations occurred. This simultion will surely take longer than the one above, since it is conditioning one a more rare scenario, however, since the Model describes a very deep simulation, many mutations occur giving rise to these results pretty quickly.

In [17]:

Copied!

# simulate tri-allelic SNPs
model.sim_snps(nsnps=5, min_alleles=3, max_alleles=None)
model.draw_seqview(width=150, show_text=True);
# simulate tri-allelic SNPs
model.sim_snps(nsnps=5, min_alleles=3, max_alleles=None)
model.draw_seqview(width=150, show_text=True);

By contrast, we may want to exclude any alleles that are not bi-allelic. This can be done by setting max_alleles=2.

In [18]:

Copied!

# simulate only bi-allelic SNPs
model.sim_snps(nsnps=5, min_alleles=2, max_alleles=2)
model.draw_seqview(width=150, show_text=True);
# simulate only bi-allelic SNPs
model.sim_snps(nsnps=5, min_alleles=2, max_alleles=2)
model.draw_seqview(width=150, show_text=True);

min and max mutations¶

Next, we can specify our conditioning in even more detail. For example, we may wish to simulate bi-allelic SNPs, but to exclude any that may represent a case in which an allele mutated to a different allele and then mutated back. Or where two different samples mutated the ancestral allele to the same derived allele (homoplasy). This can be set using min_mutation and max_mutations.

In [19]:

Copied!

model.sim_snps(nsnps=5, min_mutations=1, max_mutations=1)
model.draw_seqview(width=150, show_text=True);
model.sim_snps(nsnps=5, min_mutations=1, max_mutations=1)
model.draw_seqview(width=150, show_text=True);

For a more strange scenario, we could select only sites that are bi-allelic, and where more than one mutation occurred. This would require that two samples mutated to the same derived allele (homoplasy). You could imagine other similar scenarios that toggle the min and max args to achieve specific aims.

In [20]:

Copied!

model.sim_snps(nsnps=5, min_alleles=2, max_alleles=2, min_mutations=2, max_mutations=2)
model.draw_seqview(width=150, show_text=True);
model.sim_snps(nsnps=5, min_alleles=2, max_alleles=2, min_mutations=2, max_mutations=2)
model.draw_seqview(width=150, show_text=True);

We can examine these simulations in more detail by accessing the TreeSequence objects from the Model.tsdict attribute, or by using visualization methods in ipcoal to draw mutations that arose during the simulation. For example, the drawing below shows the first SNP from above which differs in samples "r4_0" and "r4_1" from the other samples. This actually arose because all other samples contain the derived allele which arose convergently in both the "r3" lineage and in the ancestor of the "(r1,r2)" lineage. Interesting.

In [21]:

Copied!

# draw the genealogy and mutations of the first SNP
model.draw_tree_sequence(idx=0);
# draw the genealogy and mutations of the first SNP
model.draw_tree_sequence(idx=0);

get_tree_sequence¶

This function returns a mutated TreeSequence given the Model parameter settings. This is very similar to calling sim_loci(1, nsites), but it does not spend the extra time to parse the newick tree and sequences to populate the Model.df and Model.seqs attributes. Thus, it is just a shortcut to calling msprime.sim_ancestry and msprime.sim_mutations given the parameter settings in the Model object. This can be useful and provide a great speedup when these summaries of the data are not needed. Note that this advances the seed_trees and seed_mutations RNGs of the Model object. It can be called many times given an initialized Model to return different TreeSequences starting from the initial seeds.

In [22]:

Copied!

# get a mutated tree sequence and advance the RNGs 
ts = model.get_tree_sequence(nsites=100)
repr(ts)
# get a mutated tree sequence and advance the RNGs 
ts = model.get_tree_sequence(nsites=100)
repr(ts)

Out[22]:

'<tskit.trees.TreeSequence object at 0x728b233ed070>'

	CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	r0_0	r0_1	r1_0	r1_1	r2_0	r2_1	r3_0	r3_1	r4_0	r4_1
0	1	1	.	G	T	99	PASS	.	GT	1\|1	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0
1	2	1	.	G	C	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	1\|1	0\|0
2	3	1	.	A	C	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	1\|1	1\|1
3	4	1	.	A	T	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1
4	5	1	.	A	T	99	PASS	.	GT	0\|0	0\|0	0\|0	0\|0	0\|0	0\|0	1\|1	1\|1	0\|0	0\|0