Rooting trees
Rooting or re-rooting trees orients the direction of ancestor-descendant relationships and thus provides "polarization" for the direction of evolution. Most tree inference algorithms return an unrooted tree as a result, and it is up to the researcher to select the placement of the root based on external information (e.g., outgroup designation) or analytical methods (e.g., based on edge lengths; embedding into a species tree; or gene duplication patterns).
This tutorial section provides background on how rooting or re-rooting affects a tree data structure and how to choose the edge and position on which to root a tree. It also clarifies several common misconceptions and sources of error during tree rooting.
import toytree
Take Home
A tree can be manually rooted on an outgroup using tree.root(...), or using one of several algorithms to estimate the root placement. A tree can be unrooted using tree.unroot().
A simple example of rooting or unrooting is demonstrated below.
# an example tree with outgroup (r3,r4)
tree = toytree.rtree.unittree(ntips=5, seed=123)
# modify to make an unrooted tree
utree = tree.unroot()
# root the tree on its original outgroup
rtree = utree.root("r3", "r4")
# or, root the tree on an alternative outgroup
atree = rtree.root("r2")
The treenode
All ToyTree objects contain a node designated as the treenode, which represents the top-level Node object in the collection of nodes that make up the tree hierarchy. This node exists in a tree whether it is rooted or unrooted. We use the term treenode rather than root node to refer to this top-level node, since it is not always a true root node, as in the case of an unrooted tree. This can be a confusing point, but understanding it helps clarify what the process of tree rooting actually represents. It is a little more complex than simply moving or relabeling a node, as described for the three operations below.
Rooting
When rooting an unrooted tree, a new node is inserted on an edge, splitting it into two (it helps me to think of it visually as pinching the edge and pulling it back to insert the node). The new node serves as the treenode. The number of nodes in the tree increases by 1.
Unrooting
When unrooting a rooted tree, the current treenode is removed, and an existing node in the tree (the previous treenode's left child) is designated as the treenode. The number of nodes in the tree decreases by 1.
Re-rooting
When re-rooting a rooted tree, the current treenode is removed and a new node is inserted on a different edge of the tree. The new node is assigned as the treenode. The number of nodes in the tree does not change.
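The node-count bookkeeping for these three operations can be checked with a small pure-Python sketch. It does not use toytree: the nested-list tree representation and the helper names (count_nodes, unroot, root) are hypothetical and chosen only for illustration.

```python
# A tree is a nested list and a tip is a string.
# The outermost list plays the role of the treenode.

def count_nodes(tree):
    """Count all nodes (tips and internal nodes) in a nested-list tree."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)

def unroot(tree):
    """Remove a bifurcating treenode by promoting one internal child."""
    assert len(tree) == 2, "expected a bifurcating treenode"
    left, right = tree
    if isinstance(left, str):
        # a tip cannot become the treenode, so promote the other child
        left, right = right, left
    # the promoted child becomes the treenode and gains the other child
    return list(left) + [right]

def root(tree, index=0):
    """Insert a new treenode on the edge above child `index` of the treenode."""
    outgroup = tree[index]
    ingroup = [child for i, child in enumerate(tree) if i != index]
    return [outgroup, ingroup]

rooted = [["r3", "r4"], ["r2", ["r0", "r1"]]]  # 5 tips
unrooted = unroot(rooted)
rerooted = root(unrooted, index=0)

# rooting adds one node, unrooting removes one node
print(count_nodes(rooted), count_nodes(unrooted), count_nodes(rerooted))  # 9 8 9
```

Re-rooting corresponds to calling unroot followed by root, which is why the node count is unchanged.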
Accessing the treenode
The treenode can be accessed from a ToyTree through its .treenode attribute, or by indexing. Because it is always the last Node in idx order, it can be indexed by -1.
# the .treenode is the top-level node
tree.treenode
<Node(idx=8)>
# it is also accessible as the last indexed node
tree[-1]
<Node(idx=8)>
Rooting visualized
You cannot verify that a tree is unrooted based simply on a visualization, since an unrooted tree can look the same as a rooted tree that contains a polytomy at its root. Thus, it is best practice to mention in a figure legend whether and how a tree is rooted. Another way of hinting that a tree is rooted versus unrooted is by using different tree layouts for visualization. This is demonstrated below.
This first set of drawings uses the default linear down ('d') layout. This places the treenode at the top of the drawing, which makes it easy to interpret how far every other node is from the treenode. This style makes sense for interpreting rooted trees (left and right), but is misleading for the unrooted tree (middle), since it gives the visual impression that the tree is rooted at node 7.
# draw the trees rooted
c, a, m = toytree.mtree([tree, utree, atree]).draw(ts='p', layout='d');
a[0].label.text = "rooted tree"
a[1].label.text = "unrooted tree"
a[2].label.text = "alt rooted tree"
The drawings below show the same trees but using the unrooted/undirected ('un') layout. This places the treenode near the center and projects edges away from it in a way that minimizes overlaps. You can more clearly see by comparing the three trees in this layout that the structure (topology) of the tree does not change during rooting. The only difference between the middle and two outer drawings is the addition of an extra node (node 8) that is inserted either between nodes 6 and 7 on the left tree, or between nodes 0 and 7 on the right tree (note: node idx labels change between trees with different rootings). An undirected layout would typically be used to visualize an unrooted tree (middle) but is not the most informative for rooted trees, since it is harder to interpret the distances of nodes from the root.
# draw the trees unrooted
c, a, m = toytree.mtree([tree, utree, atree]).draw(ts='p', layout='un');
a[0].label.text = "rooted tree"
a[1].label.text = "unrooted tree"
a[2].label.text = "alt rooted tree"
Tip
A key point is to recognize the difference between whether a tree is rooted or not, and whether a tree is drawn using an unrooted/undirected layout or not. These are two distinct things.
Rooting methods
toytree currently supports three methods for rooting a tree: (1) manually; (2) by the midpoint (Farris 1972); and (3) by the minimal ancestor deviation (Tria et al. 2017). The first requires the user to designate the outgroup and optionally specify the length along the edge at which to insert the new root node. The second method automatically places the root node at the midpoint of the longest path between any two tips. The last method calculates a set of statistics that can be used either to automatically place the root node or to score an alternative manual placement. The first two approaches are effectively instantaneous, while the last can take a few seconds for large trees.
The most common methods, .root() and .unroot(), are available from a ToyTree object. These are also available in the toytree.mod subpackage, where the other rooting functions are located. Each is demonstrated with further explanation below.
Manually set outgroup
The .root() function requires manually designating an outgroup. Specifically, you designate the node whose parent edge will be split by the new treenode. The clade composed of the selected node and its descendants becomes the outgroup, and the clade of everything else on the other side of the root becomes the ingroup. To select the edge on which to root the tree, you provide arguments that select the node below it, given the current directed layout of the tree. You can enter the index or name of a single Node, or, if you provide multiple Node selectors, the MRCA of the selected Nodes is found and the tree is rooted on that Node's edge. In the example below I select two tip Nodes by their str names, "r3" and "r4", which selects their MRCA as the Node to root on.
# root the tree using clade (r3,r4) as outgroup
new_tree = utree.root("r3", "r4")
# show the unrooted and newly rooted trees
toytree.mtree([utree, new_tree]).draw(ts='p');
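To make the MRCA selection step concrete, here is a minimal pure-Python sketch of how a set of selected nodes can be reduced to a single node whose parent edge is then used for rooting. It is an illustration only, not toytree's implementation: the child-to-parent dictionary, the internal node names ("n5", "n6", "n7"), and the helper functions are all hypothetical.

```python
def ancestors(node, parent):
    """Return the path from `node` up to the treenode, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(parent, *names):
    """Return the most recent common ancestor of one or more nodes."""
    common = ancestors(names[0], parent)
    for name in names[1:]:
        other = set(ancestors(name, parent))
        # keep only shared ancestors, preserving tip-to-treenode order
        common = [node for node in common if node in other]
    return common[0]

# hypothetical child -> parent map for an unrooted 5-tip tree with treenode "n7"
parent = {
    "r0": "n5", "r1": "n5",
    "r3": "n6", "r4": "n6",
    "r2": "n7", "n5": "n7", "n6": "n7",
}

print(mrca(parent, "r3", "r4"))  # n6: root on the edge above clade (r3,r4)
print(mrca(parent, "r2"))        # r2: a single selector selects itself
print(mrca(parent, "r0", "r3"))  # n7: the selection spans the treenode
```

The last case shows why the selected nodes should form a clade that excludes part of the tree: if their MRCA is the treenode itself, there is no edge above it to root on.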
Midpoint deviation
Rooting on the "midpoint" assumes a clock-like evolutionary rate (i.e., branch lengths are proportional to time) and may yield odd results when this assumption is violated. This algorithm finds the root position by calculating the pairwise path lengths between all tips in an unrooted tree, and places the treenode at the midpoint of the longest path.
# root the tree on the global midpoint and draw it
utree.mod.root_on_midpoint().draw(ts='p');
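The midpoint procedure described above can be sketched in a few lines of pure Python. This is a simplified illustration under my own assumptions, not the code toytree runs: the tree is a hypothetical weighted adjacency dict, all helper names are invented, and the brute-force search over tip pairs is only suitable for small trees.

```python
def build_graph(edges):
    """Build a symmetric adjacency dict {node: {neighbor: length}}."""
    graph = {}
    for u, v, length in edges:
        graph.setdefault(u, {})[v] = length
        graph.setdefault(v, {})[u] = length
    return graph

def path_between(graph, start, goal):
    """Return the unique path of nodes from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nbr in graph[node]:
            if nbr not in path:
                stack.append((nbr, path + [nbr]))
    raise ValueError("nodes are not connected")

def midpoint_position(graph):
    """Return the edge holding the midpoint of the longest tip-to-tip path,
    and the distance of that midpoint from the edge's first node."""
    tips = [node for node in graph if len(graph[node]) == 1]
    longest, longest_path = -1.0, None
    for i, a in enumerate(tips):
        for b in tips[i + 1:]:
            path = path_between(graph, a, b)
            length = sum(graph[u][v] for u, v in zip(path, path[1:]))
            if length > longest:
                longest, longest_path = length, path
    half, walked = longest / 2, 0.0
    for u, v in zip(longest_path, longest_path[1:]):
        if walked + graph[u][v] >= half:
            return (u, v), half - walked
        walked += graph[u][v]

# hypothetical unrooted tree with one long tip edge (b)
edges = [("a", "x", 1.0), ("b", "x", 4.0), ("x", "y", 1.0),
         ("c", "y", 1.0), ("d", "y", 2.0)]
edge, offset = midpoint_position(build_graph(edges))
print(edge, offset)  # ('b', 'x') 3.5
```

Here the longest path runs from b to d (length 7.0), so the root falls 3.5 units from b, on the long edge leading to b. This also shows how a single fast-evolving tip can pull the midpoint root toward itself.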
Balanced Midpoint deviation
Rooting on the "balanced midpoint" also assumes a clock-like evolutionary rate. This algorithm finds the root position by minimizing the max length from the root to all tips. It runs at similar speed to midpoint deviation but should be less sensitive to the presence of outlier branches.
# root the tree on the balanced midpoint
utree.mod.root_on_balanced_midpoint().draw(ts='p');
Minimal-ancestor-deviation
The minimal ancestor deviation (MAD) rooting method is intended to accommodate rate heterogeneity among edges of a tree when inferring the root position of an unrooted tree. It assumes that branch lengths are additive and that the true tree is ultrametric (i.e., tip height variation results from rate heterogeneity). This method finds the point on every edge that minimizes the deviations from all pairwise midpoint rooting positions. The optimal rooting position is on the edge with the lowest MAD score, but the user can also manually select a suboptimal edge and assess its score relative to alternative root placements (see "Infer root w/ MAD" below).
# get a rooted tree with MAD scores stored as features
tree.mod.root_on_minimal_ancestor_deviation().draw(ts='p');
Root dist
When rooting a tree it is important not only to select the correct edge on which to place the treenode, but also the correct position on that edge. For example, the edge could be split at its midpoint, or closer to one child node than the other. The true rooting position is not known, and so this is a place where model-based inference can be useful. One common assumption is that the tree should be as close to ultrametric as possible, and thus a position should be selected on the edge that best aligns the tip nodes. This is the approach taken by the midpoint and minimal-ancestor-deviation methods. In addition, the user can set a position manually using the manual rooting method. If the root_dist argument is left at its default setting of None in the root function, then the edge midpoint will be used. Below I show two manual assignments of the root dist, selecting either 0.1 units or 0.6 units up from the selected Node.
# manual set rooting position 0.1 height units above clade (r3,r4)
utree.root("r3", "r4", root_dist=0.1).draw(ts='p');
# manual set rooting position 0.6 height units above clade (r3,r4)
utree.root("r3", "r4", root_dist=0.6).draw(ts='p');
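The arithmetic behind the root position is simple enough to sketch on its own. The helper below is hypothetical (it is not part of toytree) and only assumes what the text states: the new treenode is placed root_dist units above the selected node, and the edge midpoint is used when no value is given. The range check is my own addition for the sketch.

```python
def split_edge(dist, root_dist=None):
    """Split an edge of length `dist` at a new treenode.

    Returns (below, above): the length from the new treenode down to the
    selected node, and the length from the new treenode to the rest of the tree.
    """
    if root_dist is None:
        # default: place the treenode at the edge midpoint
        root_dist = dist / 2
    if not 0 <= root_dist <= dist:
        raise ValueError("root_dist must lie between 0 and the edge length")
    return root_dist, dist - root_dist

print(split_edge(1.0))        # (0.5, 0.5)
print(split_edge(1.0, 0.25))  # (0.25, 0.75)

# the two halves always sum to the original edge length, which is why
# unrooting and re-rooting with the same root_dist is reversible
below, above = split_edge(1.0, 0.25)
print(below + above)          # 1.0
```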
Infer root w/ MAD
It is always best practice to include an outgroup in phylogenetic analyses, but in some cases the outgroup may be unknown or unavailable. In such cases it can be useful to apply methods for inferring the most likely root position based on edge length information. The best method for this currently implemented in toytree is the minimal ancestor deviation score.
The root_on_minimal_ancestor_deviation function in toytree.mod calculates the MAD score and the root probabilities for each edge in the tree. By default these scores are stored as edge features on the returned tree. In addition, global score information can be returned by using the return_stats=True argument. This includes the minimal_ancestor_deviation score for the rooting edge (lower is better); the root_ambiguity_index (whether another edge is as good as the selected one; lower is better); and the root_clock_coefficient_of_variation (how variable rates are, i.e., how non-ultrametric the tree is). This is demonstrated below for an example where the rooted tree is ultrametric, and for a case where it is very much not.
MAD statistics
When using return_stats=True this function returns two objects, the tree and a dictionary. In this example tree (shown above) the values of each statistic are very low. A low value for the deviation, ambiguity, and variation indicates that the data do not deviate much from a molecular clock, i.e., there is high confidence in this rooting under our assumed model.
tree, stats = tree.mod.root_on_minimal_ancestor_deviation(return_stats=True)
stats
{'minimal_ancestor_deviation': 0.0, 'root_ambiguity_index': 0.0, 'root_clock_coefficient_of_variation': 1.922962686383564e-14}
In contrast, in this example non-ultrametric tree the MAD, ambiguity, and clock variation are all much higher.
# create a non-ultrametric tree and draw it
non_ultrametric_tree = toytree.rtree.rtree(10)
non_ultrametric_tree.draw(layout='d');
# calculate and return the global MAD stats
_, stats = non_ultrametric_tree.mod.root_on_minimal_ancestor_deviation(return_stats=True)
stats
{'minimal_ancestor_deviation': 0.2567240422048876, 'root_ambiguity_index': 0.9951447617191235, 'root_clock_coefficient_of_variation': 28.79169780067905}
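To build intuition for the two summary statistics, here is a rough pure-Python sketch. Both definitions are assumptions on my part rather than toytree's code: I take the ambiguity index to be the ratio of the lowest to the second-lowest per-edge MAD score (my reading of Tria et al. 2017), and I use the percent coefficient of variation of root-to-tip distances as a simple stand-in for the clock statistic. The input values are made up.

```python
from statistics import mean, pstdev

def ambiguity_index(mad_scores):
    """Ratio of the best to the second-best MAD score (assumed definition).

    Values near 0 mean one edge clearly wins; values near 1 mean a tie.
    """
    lowest, second = sorted(mad_scores)[:2]
    if second == 0:
        # two edges fit perfectly, so the root is fully ambiguous
        return 1.0
    return lowest / second

def clock_cv(root_to_tip):
    """Percent coefficient of variation of root-to-tip distances
    (a simplified proxy for departure from ultrametricity)."""
    return 100 * pstdev(root_to_tip) / mean(root_to_tip)

# hypothetical per-edge MAD scores: one edge is a perfect fit
print(ambiguity_index([0.5, 0.0, 0.25]))  # 0.0

# hypothetical scores where two edges are nearly tied
print(round(ambiguity_index([0.25, 0.5, 0.26]), 3))  # 0.962

# an ultrametric tree has identical root-to-tip distances
print(clock_cv([1.0, 1.0, 1.0]))  # 0.0

# unequal root-to-tip distances give a large coefficient of variation
print(clock_cv([0.5, 1.0, 1.5]) > 40)  # True
```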
Compare MAD rootings
Whether or not you use the argument return_stats=True, statistics of the MAD analysis will be stored to the Nodes of the returned ToyTree.
This is one of the real strengths of the MAD approach: it not only finds the best edge on which to root a tree, but it also reports scores for all alternative rootings, and how much better one is than another. You can examine this data stored inside the returned ToyTree object. It is returned for each edge on the tree as a "MAD" and a "MAD_root_prob" feature. For example, in the tree below, the MAD score for the correct root position is 0.0, indicating that the tree is perfectly ultrametric when rooted at this position. The MAD rooting function correctly infers that this is the best root position, and assigns it as the root. The MAD_root_prob for this edge is 0.20 (the same probability is assigned to nodes 6 and 7, since they share the edge on which the root node is placed). As we saw above, the global root_ambiguity_index for this rooting was 0.0, meaning that the 0.201 probability for this placement is significantly better than the 0.15 root probability for the next highest scoring edge.
# get a rooted tree with MAD scores stored as features
mad_tree = tree.mod.root_on_minimal_ancestor_deviation()
# examine all features stored to the ToyTree (which now include MAD info)
mad_tree.get_node_data()
| | idx | name | height | dist | support | MAD | MAD_root_prob |
|---|---|---|---|---|---|---|---|
| 0 | 0 | r0 | 0.000000e+00 | 0.333333 | NaN | 0.500000 | 0.10066 |
| 1 | 1 | r1 | 0.000000e+00 | 0.333333 | NaN | 0.500000 | 0.10066 |
| 2 | 2 | r2 | 0.000000e+00 | 0.666667 | NaN | 0.258199 | 0.14934 |
| 3 | 3 | r3 | 2.220446e-16 | 0.666667 | NaN | 0.258199 | 0.14934 |
| 4 | 4 | r4 | 2.220446e-16 | 0.666667 | NaN | 0.258199 | 0.14934 |
| 5 | 5 | | 3.333333e-01 | 0.333333 | NaN | 0.258199 | 0.14934 |
| 6 | 6 | | 6.666667e-01 | 0.333333 | NaN | 0.000000 | 0.20132 |
| 7 | 7 | | 6.666667e-01 | 0.333333 | NaN | 0.000000 | 0.20132 |
| 8 | 8 | root | 1.000000e+00 | 0.000000 | NaN | NaN | NaN |
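As a sanity check on the table above, the MAD_root_prob values are numerically consistent with weighting each edge by (1 - MAD) and normalizing across the seven edges of the unrooted tree. This is an observation from the numbers shown, not a statement about how toytree computes the feature internally. The edge labels in the dictionary are hypothetical, and the edge shared by nodes 6 and 7 is counted once.

```python
# MAD scores copied from the table; nodes 6 and 7 share the root edge
mad = {
    "r0": 0.5,
    "r1": 0.5,
    "r2": 0.258199,
    "r3": 0.258199,
    "r4": 0.258199,
    "node5": 0.258199,
    "root_edge": 0.0,
}

# weight each edge by how little it deviates, then normalize to sum to 1
weights = {edge: 1.0 - score for edge, score in mad.items()}
total = sum(weights.values())
probs = {edge: weight / total for edge, weight in weights.items()}

for edge, prob in probs.items():
    print(f"{edge:>9}  {prob:.5f}")
```

The printed values match the MAD_root_prob column to five decimal places (0.10066, 0.14934, and 0.20132), and they sum to 1 across edges.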
Finally, we could plot the MAD or MAD_root_prob scores on the edges of a tree easily, since they are stored as features to the returned rooted tree.
# plot and show the MAD rooting probability for other edges
c, a, m = mad_tree.draw('p', width=450);
mad_tree.annotate.add_edge_markers(a, "r3x1", size=14, color="lightgrey", mask=False, xshift=0)
mad_tree.annotate.add_edge_labels(a, "MAD_root_prob", mask=False, font_size=11);
Check root status
The is_rooted method checks whether a tree is rooted based on the resolution of the treenode. Note: it cannot distinguish an unrooted tree from a rooted tree whose treenode is a polytomy. The method simply returns False if the treenode has more than two children, and True otherwise. It is nevertheless still quite useful.
# returns False if the treenode has >2 children, else True
tree.is_rooted()
True
Unrooting
The unroot function can be called to unroot a rooted tree. In an unrooted tree the treenode is always a polytomy. A rooted bifurcating tree has nnodes = (ntips * 2) - 1, whereas an unrooted bifurcating tree has nnodes = (ntips * 2) - 2. In other words, converting from a rooted to an unrooted state removes one node (the former treenode) from the tree, and assigns an existing node as the new treenode.
# get an unrooted copy of the tree
tree.unroot()
<toytree.ToyTree at 0x7da2b3c7f170>
Features/Data and Rooting
The processes of rooting, unrooting, or re-rooting trees should be reversible, meaning that the operations can be performed in any order without the loss of information about the topology, branch lengths, or any associated metadata/features. This is the goal in toytree, and it is always achieved for the topology and branch lengths, but it requires some user knowledge when dealing with arbitrary additional data features assigned to the tree. This is because data can be stored to a tree as a feature of either nodes or edges. Some data stored to a tree are intended to represent information about the edges (splits) in a tree, rather than information about the nodes. This is important because these types of data must be treated differently when doing things like re-rooting a tree and, in some cases, for visualization.
Support values
The way in which support values are displayed on trees is often a source of confusion. This is because support values are often plotted on the nodes of a tree, despite the fact that they are actually features of the edges of a tree. One consequence of this is that the edge which spans the treenode in a tree drawing may appear as if it has two separate support values, when in fact this edge has only one support value. There are a few options for how you can change this in a tree drawing to make it clearer. In the example below, hypothetical support values are assigned to two internal nodes of a copy of the tree.
# create a copy of the tree
example = tree.copy()
# assign hypothetical support values to internal splits
example = example.set_node_data("support", {5: 100, 6: 90}, default="nan")
Here is an example of how we can explicitly draw the support values using annotations to add edge markers and labels. This function recognizes that the edge above nodes 5 and 7 represents the same split in the tree, and thus only one data marker is shown for the support (i.e., the support for the edge to the left of node 7 is also 100).
# draw the tree and store the drawing objects
c, a, m = example.draw();
# add node and edge annotations
example.annotate.add_node_markers(a, "o", size=16, color="lightgrey");
example.annotate.add_node_labels(a, "idx", font_size=10);
example.annotate.add_edge_markers(a, "r2x1", size=16, color="pink");
example.annotate.add_edge_labels(a, "support", font_size=10);
Note that it is possible to assign support values to both edges, and to force a visualization of them. However, this is incorrect for the case of support values. On the other hand, given the rooting position on this tree, each of these edges does have a different dist value, and if you wanted to visualize these you can do so using node marker annotations like below.
# draw the tree and store the drawing objects
c, a, m = example.draw();
# add node and edge annotations
example.annotate.add_node_markers(a, "o", size=16, color="lightgrey");
example.annotate.add_node_labels(a, "idx", font_size=10);
example.annotate.add_node_markers(a, "r2x1", size=17, xshift=-25, mask=(0, 1, 0), color="lightgrey");
example.annotate.add_node_labels(a, "dist", xshift=-25, font_size=10, mask=(0, 1, 0));
Information loss
Is information lost when a tree is unrooted and then re-rooted? The answer is no, as long as the proper root_dist and edge_features information is provided.
# unroot and re-root several times
tree.unroot().root('r2').unroot().root('r3', 'r4').draw('p');
Example from paper
This problem was well described in "A critical review on the use of support values in tree viewers and bioinformatics toolkits" by Czech et al. (2017). In the cell below I parse the newick string of the example problem, which involves a tree with names assigned to both tip and internal nodes. The problem is that when the tree is rooted on a new edge the internal edge information is sometimes not properly polarized, e.g., one or more support values are assigned to the wrong edges. Here, to demonstrate that toytree handles this case correctly, I show the example and use visualizations that assign colors separately to the nodes and edges to make it easy to follow.
# unrooted tree from Czech et al...
czech = "((C,D)1,(A,(B,X)3)2,E)R;"
ctree = toytree.tree(czech, internal_labels="name")
# set data features to color nodes and edges of the unrooted tree
colors = {'1': 'red', '2': 'green', '3': 'orange'}
ctree.set_node_data("ecolor", colors, default="black", inplace=True)
ctree.set_node_data("ncolor", colors, default="black", inplace=True);
# create a reusable dict for other style options for visualization
kwargs = {
'layout': 'd',
'use_edge_lengths': False,
'node_sizes': 10,
'node_labels': 'name',
'node_labels_style': {
'font-size': 20,
'baseline-shift': 10,
'anchor-shift': 10,
}}
# draw original unrooted tree
ctree.draw(node_colors="ncolor", edge_colors="ecolor", **kwargs);
When we root the tree on the edge above "X", this changes the orientation of several nodes on the tree, such that some nodes that were previously parents of another node now appear as its children. This has important consequences for how the edge between these nodes is interpreted. For example, the orange edge, which previously represented the split separating (B,X) from every other node (see the tree drawing above), now instead represents the split between X and every other node (see the re-rooted tree below). Similarly, the green edge, which previously represented the split between (A,B,X) and (C,D,E), now represents the split between (B,X) and (A,C,D,E). Although this is a correct re-orientation of Nodes during rooting, it is an incorrect polarization of the edge information. We can fix this by designating which additional features stored to the ToyTree represent edge data rather than node data.
# root the tree w/o indicating edge_features, error!
rtree1 = ctree.root("X")
rtree1.draw(node_colors="ncolor", edge_colors="ecolor", **kwargs);
Here, to account for the fact that "ecolor" (edge colors) is a data feature of edges, we can include it in the list of edge_features, which changes how it is polarized during rooting. Now the orange edge points down from node 3 rather than up, retaining that this edge feature represents information about the split between nodes 2 and 3. Note: the dist and support features are already handled in this way automatically.
# re-root, treating 'ecolor' but not 'ncolor' as an edge feature.
rtree2 = ctree.root("X", edge_features=["ecolor"])
rtree2.draw(node_colors="ncolor", edge_colors="ecolor", **kwargs);
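The polarization problem can be reproduced without toytree in a short pure-Python sketch. The idea, which is my own simplified framing and not toytree's internal data model, is that an edge feature belongs to an undirected edge (a pair of nodes), so it should be looked up by that pair rather than by whichever node happens to be the child after rooting. The node label "root" and the helper names are hypothetical.

```python
from collections import deque

def build(edges):
    """Build an undirected adjacency dict from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def parent_map(graph, top):
    """Orient the tree away from `top`; return a child -> parent dict."""
    parent, seen, queue = {}, {top}, deque([top])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                parent[nbr] = node
                queue.append(nbr)
    return parent

# undirected edges of the unrooted tree ((C,D)1,(A,(B,X)3)2,E)R;
edges = [("R", "1"), ("R", "2"), ("R", "E"), ("1", "C"), ("1", "D"),
         ("2", "A"), ("2", "3"), ("3", "B"), ("3", "X")]
old_parent = parent_map(build(edges), "R")

# colors stored on child nodes, as in the example above
node_color = {"1": "red", "2": "green", "3": "orange"}
# the same data keyed by the undirected edge each color describes
edge_color = {frozenset((child, old_parent[child])): color
              for child, color in node_color.items()}

# root on the edge above X: split edge (3, X) with a new node "root"
new_edges = [e for e in edges if set(e) != {"3", "X"}]
new_edges += [("root", "3"), ("root", "X")]
new_parent = parent_map(build(new_edges), "root")

results = {}
for child in ("3", "2", "R", "1"):
    edge = frozenset((child, new_parent[child]))
    naive = node_color.get(child, "black")    # feature follows the node
    correct = edge_color.get(edge, "black")   # feature follows the edge
    results[child] = (naive, correct)
    print(child, "->", new_parent[child], "| naive:", naive, "| edge-aware:", correct)
```

For the edge between nodes 2 and 3, the naive node-based lookup reports green while the edge-aware lookup keeps orange, which mirrors the difference between the two rooted drawings above.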
Edge Features
Note that designating data features as edge features is handled automatically by toytree for the "support" and "dist" features, since these are always known to apply to edges. Similarly, we know that "name" and "idx" are always features of nodes. Any other features that you add represent your own data, and thus it is up to you to specify their data type during rooting. This includes features such as "MAD" if you re-root a tree after
Assign a feature to edges
If you want to set a feature to be treated as an edge feature just once, and then forget it, you can add it to the edge_features attribute of a ToyTree object. You can also check this attribute to see which features are currently treated as edge features automatically. As noted above, 'dist' and 'support' are in here. Also, when you root a tree using the minimal ancestor deviation method, the MAD stats are automatically added to this set. You can assign additional features here as well.
tree.edge_features
{'MAD', 'MAD_root_prob', 'dist', 'support'}
You can see in this example that you can add "ecolor" to the edge_features attribute of the tree, rather than indicating it during the root call, and root will know that this feature belongs to edges without your having to indicate it each time.
# create a copy of the czech tree that has ecolors and ncolors
etree = ctree.copy()
# you can add 'ecolor' to the tree's edge_features set
etree.edge_features.add("ecolor")
# root the tree ('ecolor' will be treated as an edge feature automatically)
rtree2 = etree.root("X")
# visualize that it is correct
rtree2.draw(node_colors="ncolor", edge_colors="ecolor", **kwargs);