Species tree inference¶
We currently implement a wrapper for the species tree inference tool astral3. This takes a set of trees as inputs, and thus provides a useful framework for comparing species trees inferred from true genealogies, versus a species tree inferred from error-prone gene trees. This method can be called from ipcoal.phylo.infer_astral_tree
.
import ipcoal
import toytree
import msprime
tldr;
Call the `ipcoal.phylo.infer_astral_tree` method to infer a species tree using ASTRAL-III from a set of gene trees. This function will return the result as a ToyTree object with optional support values stored to the tree data.
Example dataset¶
Here we set up a demographic model composing a species tree with 5 lineages r0-r4. The root node height is at 0.5M generations, and internal edges are set to equal lengths of ~166K generations. Each interval Ne is set to 2e5, which corresponds to an internal edge length of 0.42 coalescent units. As you can see in the visualization below, this corresponds to a small amount of ILS among the 4 sampled gene copies per lineage. We simulated 1000 loci each 1000 sites in length.
# get a 5-tip imbalanced species tree w/ equal internal edges
sptree = toytree.rtree.unittree(ntips=5, treeheight=5e5, seed=123)
# simulate 100 loci x 1000 sites under the demographic model
model = ipcoal.Model(sptree, Ne=2e5, nsamples=4, seed_trees=123, seed_mutations=123)
model.sim_loci(nloci=100, nsites=1000)
# draw the demographic model with the first genealogy embedded
model.draw_demography(idx=0, container_width=450);
Getting tree sets¶
Genealogies¶
We can apply ASTRAL to infer a species tree from coalescent simulated data in a number of ways. First, we might be interested in the species tree that can be inferred from perfectly accurate input trees. We have these in the form of the simulated genealogies that are stored in the "genealogy" column of the Model.df
dataframe.
# show the first 10 trees in the result dataframe
model.df.head(10)
locus | start | end | nbps | nsnps | tidx | genealogy | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 333 | 333 | 14 | 0 | (r1_2:605395.53255673300... |
1 | 0 | 333 | 380 | 47 | 2 | 1 | (r1_2:605395.53255673300... |
2 | 0 | 380 | 478 | 98 | 4 | 2 | (r1_2:605395.53255673300... |
3 | 0 | 478 | 531 | 53 | 3 | 3 | (r1_2:605395.53255673300... |
4 | 0 | 531 | 826 | 295 | 18 | 4 | (r1_2:605395.53255673300... |
5 | 0 | 826 | 935 | 109 | 3 | 5 | (r1_2:605395.53255673300... |
6 | 0 | 935 | 951 | 16 | 1 | 6 | (r1_2:605395.53255673300... |
7 | 0 | 951 | 1000 | 49 | 2 | 7 | (r1_2:605395.53255673300... |
8 | 1 | 0 | 217 | 217 | 12 | 0 | ((r3_0:544208.9382111437... |
9 | 1 | 217 | 250 | 33 | 2 | 1 | (((((r4_0:53193.45971929... |
However, it is important to consider how these data actually match to the expectations of the multi-species coalescent. For example, if you simulate the coalescent with a recombination rate > 0 for loci that are longer than 1 site in length, as we did here, then each locus may contain more than one genealogy. Indeed, we can see in the dataframe above that the first locus contains 8 genealogies. If we were to input all of these genealogies into ASTRAL it would technically violate an assumption of the MSC model that the trees are expected to be statistically independent. Instead, it would be more appropriate to sample one genealogy per locus, perhaps the first one, or the longest one. Below we extract the full set of genealogies, and a subset composing just one genealogy per locus. Here the trees are stored as a pd.Series
object, which can be used in downstream steps, but it could just as well be a list or any collection object. We will analyze the trees in the next section.
# get all simulated genealogies
all_genealogies = model.df.genealogy
# sample the first tree from every locus
first_genealogies = model.df.genealogy[model.df.tidx == 0]
Using a multitree drawing from toytree we can view the first several simulated genealogies and see that there is some amount of ILS in each tree. We can see also that the trees are fully resolved and the tips align at zero. This is what simulated coalescent genealogies will usually look like.
# draw the first four trees
toytree.mtree(first_genealogies).draw(shape=(1, 4), height=350);
Gene trees¶
We can also apply ASTRAL to infer a species tree from inferred gene trees. This is more similar to the analysis of real data, where gene trees may have unresolved nodes, variable branch lengths, and mismodeled substitutions rates (you could investigate other sources of error as well like missing samples or alignment errors.) Here we will use the ipcoal wrapper for raxml-ng to infer gene trees, but you could alternative use any method to infer trees that could then be loaded back in as ToyTrees from newick, nexus, or related text formats. A benefit of using the raxml wrapper in toytree is that we can easily automate the process of writing the sequence data, analyzing it, and loading back in the results. To keep things fast we do not estimate bootstrap supports for the gene trees here, but this can be easily added.
# infer gene tree for each locus from 4 haploid samples
gene_trees_haploid = ipcoal.phylo.infer_raxml_ng_trees(model, nboots=0, nthreads=4, nproc=2)
# infer gene tree for each locus from 2 diploid samples
gene_trees_diploid = ipcoal.phylo.infer_raxml_ng_trees(model, diploid=True, nboots=0, nthreads=4, nproc=2)
IMAP dict¶
When your data contains multiple individuals or gene copies per lineage it is typical to provide a mapping (imap) to tell the species tree inference software which genomes are members of the same lineage. This is entered to the ipcoal.infer_astral_tree
method as a dictionary, where the keys are species tree tip names, and the values are lists of sample names that belong to that species. It is easy to extract an imap dict for simulations performed from an ipcoal.Model
object by calling its .get_imap_dict
function, which accepts are argument for whether or not pairs samples will be joined into diploid samples.
haploid_imap = model.get_imap_dict()
haploid_imap
{'r0': ['r0_0', 'r0_1', 'r0_2', 'r0_3'], 'r1': ['r1_0', 'r1_1', 'r1_2', 'r1_3'], 'r2': ['r2_0', 'r2_1', 'r2_2', 'r2_3'], 'r3': ['r3_0', 'r3_1', 'r3_2', 'r3_3'], 'r4': ['r4_0', 'r4_1', 'r4_2', 'r4_3']}
diploid_imap = model.get_imap_dict(diploid=True)
diploid_imap
{'r0': ['r0_0', 'r0_1'], 'r1': ['r1_0', 'r1_1'], 'r2': ['r2_0', 'r2_1'], 'r3': ['r3_0', 'r3_1'], 'r4': ['r4_0', 'r4_1']}
Infer species trees¶
See the ASTRAL III documentation for further details on options that can be run.
# infer sptree from all genealogies (multiple linked trees per locus)
atree1 = ipcoal.phylo.infer_astral_tree(all_genealogies, imap=haploid_imap)
# infer sptree from unlinked genealogies (1 per locus)
atree2 = ipcoal.phylo.infer_astral_tree(first_genealogies, imap=haploid_imap)
# infer sptree from inferred gene trees haploid
atree3 = ipcoal.phylo.infer_astral_tree(gene_trees_haploid.gene_tree, imap=haploid_imap)
# infer sptree from inferred gene trees diploid
atree4 = ipcoal.phylo.infer_astral_tree(gene_trees_diploid.gene_tree, imap=diploid_imap)
Compare species trees¶
In this case, all of the trees inferred the same species tree topology, which is not surprising since we simulated a pretty large dataset with a limited amount of ILS, which is a scenario where astral can be quite accurate. However, we can see that the different datasets have resulted in difference estimates of support and branch lengths (in coalescent units).
toytree.mtree([atree1, atree2, atree3, atree4]).root("r3", "r4").draw();