MSC simulations and generation time variation (empirical)¶
The purpose of this notebook is to demonstrate ipcoal simulations on a topology inferred from empirical data. We provide recommendations for how to scale units from a time-calibrated phylogeny to use in coalescent simulations, and how to incorporate biological information about species, such as generation times and population sizes, to perform more realistic simulations.
Simulating coalescent genealogies and sequences on a parameterized species tree model using ipcoal can provide a null expectation for the amount of discordance that you expect to observe across different nodes of a species tree, and can even be used as a posterior predictive tool for phylogenetic analyses.
import numpy as np
import pandas as pd
import ipcoal
import toytree
import toyplot
colormap = toyplot.color.brewer.map("BlueRed", reverse=True)
Mammal phylogeny data set¶
In this example we use published data for mammals. We will use a time-calibrated MCC phylogeny by Upham et al. (2009) as a species tree hypothesis; we will use species geographic areas from the PanTHERIA database as a proxy for effective population sizes; and we will use generation time estimates from the Pacifici et al. (2014) data set, which imputes a lot of missing data from pantheria by using mean values among close relatives.
# load the phylogenetic data (big tree, takes a few seconds)
TREE_URL = (
"https://github.com/eaton-lab/ipcoal/blob/master/"
"notebooks/mammal_dat/MamPhy_fullPosterior_BDvr_DNAonly"
"_4098sp_topoFree_NDexp_MCC_v2_target.tre?raw=true"
)
tree = toytree.tree(TREE_URL)
print(tree.ntips, "tips in the Upham mammal tree")
4100 tips in the Upham mammal tree
# load the mammal biological data (e.g., geo range)
PANTH_URL = (
"https://github.com/eaton-lab/ipcoal/blob/master/"
"notebooks/mammal_dat/PanTHERIA_1-0_WR05_Aug2008.txt?raw=true"
)
panthdf = pd.read_csv(PANTH_URL, sep='\t')
print(panthdf.shape[0], "taxa in PanTHERIA database")
5416 taxa in PanTHERIA database
# load the generation time data
GT_URL = (
"https://github.com/eaton-lab/ipcoal/blob/master/"
"notebooks/mammal_dat/5734-SP-2-Editor.csv?raw=true"
)
gentimedf = pd.read_csv(GT_URL)
print(gentimedf.shape[0], "taxa in Pacifici gentime database")
5427 taxa in Pacifici gentime database
Filtering and selecting taxa¶
We will first trim the data down to include only taxa that are shared among all three data sources and for which there is no missing biological data. This reduces the data set to 3121 taxa. The distribution of geographic range areas is in units of kilometers2
(geogrange
) and generation times is in units of years (gentime
).
# subselect species names and geo range columns from pantheria
sppdata = panthdf.loc[:, ['MSW05_Binomial', '26-1_GR_Area_km2']]
# rename sppdata columns
sppdata.columns = ["species", "georange"]
# make column to record tree tip label names
sppdata["treename"] = np.nan
# dict map: {gen}_{spp} to {gen}_{spp}_{fam}_{order}
tipdict = {i.rsplit("_", 2)[0]: i for i in tree.get_tip_labels()}
# record whether species in pantheria is in the tree tip labels
for idx in sppdata.index:
# match data names to tree names which have underscores
name = sppdata.species[idx]
name_ = name.replace(" ", "_")
# record treename if it is in the database
if name_ in tipdict:
sppdata.loc[idx, "treename"] = tipdict[name_]
# add gentime values to all species matching to names in Pacifici data set
sppdata["gentime"] = np.nan
for idx in gentimedf.index:
# get generation time in units of years
species, gent = gentimedf.loc[idx, ["Scientific_name", "GenerationLength_d"]]
mask = sppdata.species == species
sppdata.loc[mask.values, "gentime"] = gent / 365.
# set missing data (-999) to NaN
sppdata[sppdata == -999.000] = np.nan
# remove rows where either georange or gentime is missing
mask = sppdata.georange.notna() & sppdata.gentime.notna() & sppdata.treename.notna()
sppdata = sppdata.loc[mask, :]
# reorder and reset index for dropped rows
sppdata.sort_values(by="species", inplace=True)
sppdata.reset_index(drop=True, inplace=True)
# show first ten sorted rows
sppdata.head(10)
species | georange | treename | gentime | |
---|---|---|---|---|
0 | Abeomelomys sevia | 53261.73 | Abeomelomys_sevia_MURIDA... | 1.710684 |
1 | Abrocoma bennettii | 54615.98 | Abrocoma_bennettii_ABROC... | 2.829928 |
2 | Abrocoma boliviensis | 5773.97 | Abrocoma_boliviensis_ABR... | 2.829928 |
3 | Abrocoma cinerea | 381391.02 | Abrocoma_cinerea_ABROCOM... | 2.829928 |
4 | Abrothrix andinus | 722551.83 | Abrothrix_andinus_CRICET... | 1.614762 |
5 | Abrothrix hershkovitzi | 1775.72 | Abrothrix_hershkovitzi_C... | 1.614762 |
6 | Abrothrix illuteus | 35359.55 | Abrothrix_illuteus_CRICE... | 1.614762 |
7 | Abrothrix jelskii | 506394.71 | Abrothrix_jelskii_CRICET... | 1.614762 |
8 | Abrothrix lanosus | 43016.67 | Abrothrix_lanosus_CRICET... | 1.614762 |
9 | Abrothrix longipilis | 423823.71 | Abrothrix_longipilis_CRI... | 1.614762 |
Filter the tree to include only taxa in the data table¶
# find names in tree but not in data table
names_in_data = set(sppdata.treename)
names_in_tree = set(tree.get_tip_labels())
names_to_keep = names_in_tree.intersection(names_in_data)
# only keep tree tips that are in the table
ftree = tree.mod.prune(*names_to_keep)
print(len(ftree), "tips in filtered tree (ftree)")
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[17], line 3 1 # only keep tree tips that are in the table 2 ftree = tree.mod.prune(*names_to_keep) ----> 3 print(len(ftree), "tips in filtered tree (ftree)") TypeError: object of type 'ToyTree' has no len()
Convert geographic ranges to Ne values¶
Here we generate a range of Ne values within a selected range that are scaled by the variation in geographic range area sizes among taxa. The distribution is plotted as a histrogram on a y-axis log scale. Many taxa have small Ne, few have very large Ne.
# transform georange into Ne values within selected range
max_Ne = 1000000
min_Ne = 1000
# set Ne values in range scaled by geographic ranges
Ne = max_Ne * (sppdata.georange / sppdata.georange.max())
Ne = [max(min_Ne, i) for i in Ne]
sppdata["Ne"] = np.array(Ne, dtype=int)
# show 10 random samples
sppdata.sample(10)
# plot a histogram of Ne values
a, b = np.histogram(sppdata.Ne, bins=25)
toyplot.bars((a, b), height=300, width=400, yscale="log", ylabel="bin count", xlabel="Ne");
Set Ne and g values for tip and ancestral nodes on the tree object¶
ipcoal can accept different Ne and g values to use in simulations, and the easiest way to set variable values across different parts of the tree is to map the values to the tree object that ipcoal accepts as an argument. We only have estimates of Ne and g for species that are alive today, but it would be useful to also includes estimates for ancestral nodes in the species tree. Here we use a simple ancestral state reconstruction based on Brownian motion to infer states for ancestral nodes.
# make a copy of the filtered tree
tree_ng = ftree.copy()
# dictionaries mapping names to values
dict_ne = {sppdata.treename[i]: sppdata.Ne[i] for i in range(sppdata.shape[0])}
dict_gt = {sppdata.treename[i]: sppdata.gentime[i] for i in range(sppdata.shape[0])}
# set values on nodes of the tree for all species (tips)
tree_ng = tree_ng.set_node_values("Ne", dict_ne)
tree_ng = tree_ng.set_node_values("g", dict_gt)
# estimate and set values on ancestral nodes as well.
tree_ng = tree_ng.pcm.ancestral_state_reconstruction("g")
tree_ng = tree_ng.pcm.ancestral_state_reconstruction("Ne")
Plot tree with Ne and g values¶
Let’s plot just a subset of taxa to start, since it will be much easier to visualize than trying to examine the entire tree. Here we select only the taxa in the genus Mustela. The tree plot shows variation in Ne using the thickness of edges, and generation times are shows by the color of nodes, blue to red, representing shorter to longer times. The ts='p'
drawing option automatically pulls the Ne information from the nodes of the tree to draw the edge thickness.
# make a tree copy
atree = tree_ng.copy()
# get ancestor of all tips that have 'Mustela' in their name
mrca_node_idx = atree.get_mrca_idx_from_tip_labels(wildcard="Mustela_")
# get the TreeNode object of this subtree
node = atree.get_feature_dict("idx")[mrca_node_idx]
# create as a new Toytree
subtree = toytree.tree(node)
# scale the tree height from millions of year to years
subtree = subtree.mod.node_scale_root_height(subtree.treenode.height * 1e6)
subtree.draw(
ts='p',
edge_type='p',
node_sizes=10,
node_labels=False,
node_colors=[
colormap.colors(i, 0.1, 10) for i in subtree.get_node_values('g', 1, 1)
],
width=400,
height=600,
);
Convert edge lengths from time to generations¶
Time in years is converted to units of generations by dividing by each edge length by the generation time for that edge, recorded as ngenerations/year. When this is done the crown root age of the Mustela tree is now at 2.4M generations from the furthest tip in the tree. This tree object (ttree) now contains information in its Ne values mapped to nodes and in its edge lengths to fully represent the data on population sizes and generation time differences among species and their ancestors. This is the tree we will use for our ipcoal simulations.
# divide the edge lengths (in abosolute time) by the generation time
ttree = subtree.set_node_values(
"dist",
{i.name: i.dist / i.g for i in subtree.get_feature_dict()}
)
ttree.draw(
ts='p',
edge_type='p',
tip_labels_align=True,
tip_labels=[i.rsplit("_", 2)[0] for i in ttree.get_tip_labels()],
node_labels=False,
node_sizes=0,
width=400,
height=400,
);