Gene tree inference¶
Following the terminology used in ipcoal, gene trees refer to inferred hypotheses for the true genealogical relationships shared by a set of sampled genomes. Gene trees can be inferred from the sequence that evolve along the edges of genealogies. Our purpose here is to infer gene trees and compare them to the true genealogies.
The ipcoal.phylo
module can be used to implement common gene tree inference methods to simulated sequences under a variety of approaches to investigate sources of phylogenetic error. Some examples might include investigating how different demographic and simulation parameters affect sequence variation; how substitution models affect gene tree inference; how phased haploid versus diploid (ambiguity character) sequences affect accuracy; how separating sequences between recombination events versus concatenating them affects gene tree error; and many more.
import ipcoal
import toytree
import msprime
import numpy as np
# ipcoal.phylo.infer_raxml_ng_tree
# ipcoal.phylo.infer_raxml_ng_trees
tldr;¶
Use ipcoal.phylo.infer_raxml_ng_tree
to infer a gene tree from sequence data, or ipcoal.phylo.infer_raxml_ng_trees
to return a dataframe with a collection of inferred trees from a set of sequences. These each take an ipcoal.Model
class as input, and can select one or more loci to analyze individually, or as a concatenated sequence. In this example we infer gene trees for diploid sequences from 5 loci first as one large concatenated locus, and then also individually as separate loci.
# simulate sequences for two populations
model = ipcoal.Model("(a:50000,b:50000);", Ne=20000, nsamples=8, recomb=0, seed_mutations=123, seed_trees=123)
model.sim_loci(nloci=10, nsites=10000)
# infer a gene tree for loci 0-5 concatenated and draw the gtree
gtree = ipcoal.phylo.infer_raxml_ng_tree(model, idxs=[0, 1, 2, 3, 4], diploid=True, nboots=100)
gtree.mod.root_on_minimal_ancestor_deviation(inplace=True)
gtree.draw(ts='b', scale_bar=True, label="concat loci 0-5");
# infer gene trees separately for loci 0-5 as a dataframe
gtdf = ipcoal.phylo.infer_raxml_ng_trees(model, idxs=range(5), diploid=True, nboots=100)
gtdf
locus | start | end | nbps | nsnps | gene_tree | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 10000 | 10000 | 74 | ((b_3:1e-06,(b_0:1e-06,b... |
1 | 1 | 0 | 10000 | 10000 | 34 | (a_3:1e-06,a_0:1e-06,((a... |
2 | 2 | 0 | 10000 | 10000 | 24 | (b_1:1e-06,b_0:1e-06,(b_... |
3 | 3 | 0 | 10000 | 10000 | 38 | ((b_0:1e-06,(b_3:0.0003,... |
4 | 4 | 0 | 10000 | 10000 | 40 | (b_2:1e-06,b_1:0.000899,... |
Methods¶
infer_raxml_ng_tree_from_alignment¶
The function infer_raxml_ng_tree_from_alignment
takes an alignment str
as input and returns a ToyTree
inferred by raxml-ng
. You can toggle many arguments to raxml-ng
using this function, but not all. It's primary utility is to provide a convenient shortcut to calling raxml-ng
from Python which can also clean up the temporary files for you. This is the simplest of the three methods shown here. The examle below provides a phylip-formatted alignment as input and draws the inferred gene tree.
# a phylip string
phy = """\
4 10
A ACAACCGGTT
B ACATCCGGAA
C CCATCCGGAT
D AACACCGGTT
"""
# call raxml-ng on phylip str and draw result ToyTree
tree = ipcoal.phylo.infer_raxml_ng_tree_from_alignment(phy)
tree.draw();
infer_raxml_ng_tree¶
The function infer_raxml_ng_tree
provides quite a bit more functionality than the one above. This takes an ipcoal.Model
object as input and provides a number of arguments for extracting one or more simulated loci from the object and optionally concatenating them (using the idxs
arg), and an option to join haploid sequences into pairs as diploid sequences (using diploid=True
). You can also toggle options to raxml-ng
, including fine-tuning the parallelization. Under the hood, this function writes a phylip formatted sequence file to a temporary directory and calls raxml on the tmp file, then removes it. You can optionally set the tmpdir
as an argument.
# concatenate loci 2-4 and infer gene tree
tree = ipcoal.phylo.infer_raxml_ng_tree(model, idxs=[2, 3, 4], diploid=True)
tree.draw(scale_bar=True);
infer_raxml_ng_trees¶
The function infer_raxml_ng_trees
is similar to the one above but in addition to the raxml-ng
parallelization arguments nthreads
(used per tree) and nworkers
(num parallel boots per tree); it also includes an ipcoal-level parallelization option njobs
for distributing the individual infer_raxml_ng_tree
jobs in parallel. If you are working on a large cluster and running many thousands of small jobs you can use this to achieve much greater parallelization. Note, raxml-ng
will raise an error if you request more resources among concurrently running jobs than are available. You should very much try to avoid this, as it slows things down greatly. But, if it occurs rarely during a job and you want to avoid the error you can use the options do_not_autoscale_threads
and perf_threads
.
# return a pandas Dataframe w/ trees and stats
ipcoal.phylo.infer_raxml_ng_trees(model, diploid=True, nthreads=2, nworkers=1, nproc=5)
locus | start | end | nbps | nsnps | gene_tree | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 10000 | 10000 | 74 | (a_3:1e-06,a_0:0.000398,... |
1 | 1 | 0 | 10000 | 10000 | 34 | ((b_3:1e-06,b_1:1e-06):0... |
2 | 2 | 0 | 10000 | 10000 | 24 | (b_2:1e-06,(b_1:1e-06,b_... |
3 | 3 | 0 | 10000 | 10000 | 38 | (a_3:1e-06,a_2:1e-06,a_0... |
4 | 4 | 0 | 10000 | 10000 | 40 | (b_3:1e-06,(a_2:1e-06,a_... |
5 | 5 | 0 | 10000 | 10000 | 53 | (b_0:1e-06,(a_0:1e-06,a_... |
6 | 6 | 0 | 10000 | 10000 | 52 | (a_2:1e-06,a_0:1e-06,((a... |
7 | 7 | 0 | 10000 | 10000 | 54 | (b_1:1e-06,b_0:1e-06,b_2... |
8 | 8 | 0 | 10000 | 10000 | 29 | (((b_2:0.0004,b_3:1e-06,... |
9 | 9 | 0 | 10000 | 10000 | 53 | (a_3:1e-06,a_1:1e-06,a_2... |
Diploid sequences¶
The option diploid=True
joins haploid sequences from within the same population together to create diploid sequences. Because samples within the same population are treated as randomly mating (panmictic) the order in which they are joined should not matter. Therefore, we simply join samples sequentially ordered by their alphanumeric tip labels. For example, "a_0" and "a_1" will be joined to create a new diploid "a_0"; and "a_2" and "a_3" would be joined to create the new "a_1", and so on with sequential numbering. The effect of analyzing diploid sequences will be generally greater when heterozygosity of samples is greater.
# infer gene tree from diploid sequences
tree = ipcoal.phylo.infer_raxml_ng_tree(model, idxs=[1], diploid=True)
tree.root("~a").ladderize().draw(scale_bar=True);
# infer gene tree from diploid sequences
tree = ipcoal.phylo.infer_raxml_ng_tree(model, idxs=[1], diploid=False)
tree.root("~a").ladderize().draw(scale_bar=True);
Comparing trees¶
One of the core purposes of simulation is to compare inferred results to the ground truth. (See the Cookbooks/Examples section.) When simulating data we know the true genealogical history for our set of samples exactly at every position of their genomes. Therefore we can measure error as the differences between inferred gene trees and the true genealogies. One way to do this is to use tree distance metrics, many of which are implemented in toytree.
In the example below we simulate a tree sequence across a 100Kb locus for 5-tip species tree and sample 2 genomes per lineage. This creates 449 genealogies.
# simulate sequences for two populations
sptree = toytree.rtree.unittree(ntips=5, treeheight=1e6, seed=123)
model = ipcoal.Model(sptree, Ne=400_000, nsamples=2, seed_mutations=123, seed_trees=123)
model.sim_loci(nloci=1, nsites=100_000)
model.draw_genealogies(shared_axes=True, scale_bar=100_000);
# infer a concatenated gene tree for the locus
ctree = ipcoal.phylo.infer_raxml_ng_tree(model)
# draw the concatenation tree
ctree.draw();
Here I use the toytree.distance.get_treedist_rfg_mci
to return the Generalized Robinson-Foulds distance using Mutual Clustering Information (MCI) to measure the distance between the concatenation tree and each observed genealogy across the length of the simulated chromosome.
# measure MCI RF dist for each gene tree
tdists = [ctree.distance.get_treedist_rfg_mci(i, normalize=True) for i in toytree.mtree(model.df.genealogy)]
We can then plot the tree distances along the length of the chromosome. You can see that for most parts of the genome the local genealogy does not match the concatenation tree topology. There is also a pattern of auto-correlation -- nearby regions of the genome tend to exhibit similar trees, and thus similar tree distances.
import toyplot
canvas = toyplot.Canvas(height=300)
axes = canvas.cartesian(ylabel="Generalized RF distance", xlabel="Genome position")
axes.fill(model.df.start, tdists)
axes.x.ticks.locator = toyplot.locator.Extended(count=6)
axes.x.ticks.show = axes.y.ticks.show = True