This notebook is associated with the following reading:
By the end of this notebook you should be able to:
This exercise uses a Python package called toytree to read and write newick-formatted data files, and to manipulate and draw tree visualizations. The goal of this exercise is to learn how to read phylogenetic trees, to interpret their meaning, and to understand how this information is represented as text data. Although you will not need to learn many details of the toytree Python library, you can find more information about it in the documentation here.
Your assigned reading quotes a philosopher of science who stated "It is impossible to really understand evolution without an ability to accurately interpret phylogenetic trees (O'Hara 1988, 1997)", and that "evolution itself is a theory of evolutionary trees". In biology generally, and in this class already, you have seen many examples of phylogenetic trees. But what does a phylogenetic tree represent? And why is it important to understand? There are actually a few common pitfalls that people fall into when reading phylogenetic trees without having spent serious time considering their meaning. Recognizing these mistakes will make you a better biologist. We'll cover some of these pitfalls in this notebook.
A phylogenetic tree is a depiction of the inferred evolutionary relationships among the units represented at the tips. Sometimes, when we have fossil or ancient DNA data, or in cases of experimental evolution (e.g., many generations of bacteria studied over decades) we can trace the evolutionary history of organisms in very fine detail. Most of the time, however, we only have data that can be sampled at the present, and from this limited data we need to infer the past. The goal of phylogenetics is to accurately reconstruct and represent past evolutionary relationships.
You are probably most familiar with the use of phylogenetic trees to represent the evolutionary relationships of species. In that case the splits in a tree represent speciation events. Extant species are represented at the tips, and nodes deeper in the tree represent extinct common ancestors of the descendant lineages.
Phylogenies can actually represent more than just species relationships though. You can use trees to represent the relationships among individuals sampled within a population, or to represent the relationships among genes, including patterns of gene duplication and how orthologs and paralogs have been inherited through speciation events.
The text below defines a tree in newick format. When researchers are working with phylogenetic trees as their data, this is the type of data they are working with. It is simply text. This format can contain just the relationships -- described by nested parentheses like below -- or it can contain additional information such as branch lengths, which we'll see later. You can see how the nested hierarchical relationship of a phylogeny is easily represented by a nested set of parentheses.
# Create the string variable to store a newick tree.
tree1 = "(gibbon, (orangutan, (gorilla, (chimp, human))));"
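To make the link between nested parentheses and nested relationships concrete, here is a minimal sketch of how a topology-only newick string could be turned into nested Python tuples. The function name parse_newick is an illustrative assumption, not part of toytree, and the sketch ignores branch lengths and other newick features.

```python
def parse_newick(newick):
    """Parse a topology-only newick string into nested tuples of tip names.

    Illustrative helper (not part of toytree); branch lengths are not handled.
    """
    text = newick.strip().rstrip(";").replace(" ", "")
    pos = 0  # current reading position in the text

    def parse_clade():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_clade()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_clade())
            pos += 1  # skip ")"
            return tuple(children)
        # otherwise read a tip name up to the next delimiter
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_clade()


tree1 = "(gibbon, (orangutan, (gorilla, (chimp, human))));"
nested = parse_newick(tree1)
print(nested)
# ('gibbon', ('orangutan', ('gorilla', ('chimp', 'human'))))
```

Each pair of parentheses becomes one tuple, so the depth of nesting in the tuple mirrors the depth of the clade in the tree.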
Below is a visualization of the tree structure defined in the newick string above. The toytree.tree() function returns an object (called a Toytree) which we save in the variable tre. This object has many functions associated with it for manipulating and drawing trees. The .draw() function returns a plot of the tree which is displayed in the cell output. The argument tree_style='s' to the draw function tells it to draw the figure in a particular style. Each node in the tree is labeled with a number.
import toytree

# create a toytree object
tre = toytree.tree(tree1)

# return a tree drawing
tre.draw(tree_style='s');
Rotating nodes simply changes the order in which the labels at the tips are arranged. It does not change the evolutionary relationships, because the relative branching order of the tree remains the same. Below we changed the order of human, chimp, and gorilla at the tips of the tree, but you can see that the common ancestor of human and chimp is still node 5, and their common ancestor with gorilla is still node 6. Although the drawing changed slightly, the relationships it represents remained the same.
toytree.tree(tree1).rotate_node(["gorilla", "chimp", "human"]).draw(tree_style='s');
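One way to convince yourself that rotation leaves the relationships untouched is to compare the set of clades (groups of tips sharing a common ancestor) in two trees. The sketch below does this with hand-written nested tuples standing in for newick strings; the helper names and the third, differently branching tree are hypothetical examples, not part of toytree.

```python
def tip_sets(node, found):
    """Return the set of tips below `node`, adding every clade to `found`."""
    if isinstance(node, str):
        return frozenset([node])  # a tip contains only itself
    tips = frozenset().union(*(tip_sets(child, found) for child in node))
    found.add(tips)
    return tips


def same_relationships(tree_a, tree_b):
    """True if two rooted trees contain exactly the same clades."""
    clades_a, clades_b = set(), set()
    tip_sets(tree_a, clades_a)
    tip_sets(tree_b, clades_b)
    return clades_a == clades_b


# the tree from above, written as nested tuples
original = ("gibbon", ("orangutan", ("gorilla", ("chimp", "human"))))
# the same tree with the human, chimp, and gorilla tips reordered
rotated = ("gibbon", ("orangutan", (("human", "chimp"), "gorilla")))
# a hypothetical tree with a different branching order
different = ("gibbon", ("gorilla", ("orangutan", ("chimp", "human"))))

print(same_relationships(original, rotated))    # True
print(same_relationships(original, different))  # False
```

Because sets ignore order, swapping the children of a node never changes the result; only moving a tip into a different clade does.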
Below are four newick strings, and the resulting four phylogenies. The order in which names and parentheses are written in the newick string does not necessarily change the relationships; what matters is the pattern by which the parentheses are nested within one another. Similarly, rotating nodes on the phylogeny changes the order of the tips, but does not change the distance to a common ancestor among those tips. Thus the relationships are retained.
# read in four newick strings
tre1 = toytree.tree('(gibbon,(orangutan,(gorilla,(chimp,human))));')
tre2 = toytree.tree('((((human,chimp),gorilla),orangutan),gibbon);')
tre3 = toytree.tree('((((human,chimp),orangutan),gorilla),gibbon);')
tre4 = toytree.tree('(((gorilla,(human,chimp)),orangutan),gibbon);')

# draw each tree with some arbitrary node rotating
tre1.rotate_node(["human", "chimp"]).draw(use_edge_lengths=False);
tre2.rotate_node(["orangutan"]).draw(use_edge_lengths=False);
tre3.rotate_node(["human", "chimp", "orangutan"]).draw(use_edge_lengths=False);
tre4.rotate_node(["gibbon"]).draw(use_edge_lengths=False);
Additional information such as the ages of clades is easy to include in the newick format. Below you can see that the lengths of branches are simply numeric values placed after a colon following each tip name or closing parenthesis (the nodes of the tree). Below we use a different tree style for plotting (tree_style='n') since this style will show branch length differences.
The units of this plot are not indicated. Thus we do not know if it is thousands of years, millions of years, or if the units are even meant to represent time. Branch lengths on a tree can represent different things. Sometimes branch lengths represent the number of character differences separating taxa, and these characters could be counted from morphological data, or genetic substitutions, or even by counting other features of the genome such as inversions or transposable elements.
If a tree is inferred from DNA sequence data then the branch lengths could represent the number of observed DNA differences between species. Converting units of substitutions into units of time is a tricky business that involves making assumptions about the rate of mutation. We will discuss this more later.
tree2 = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
toytree.tree(tree2).draw(tree_style='n', scalebar=True);
A tree where the branch lengths represent modeled DNA substitutions, as opposed to time, will likely not have all the tips align perfectly at zero. This is because different lineages may have different rates of evolution, or, even if their rates are the same, some may have accumulated more mutations by chance. Below is an example of what an inferred tree might look like when the edges are substitutions instead of time. This is sometimes called a phylogram while the above tree is a chronogram. Generally, though, we refer to both as phylogenies and simply label the axes and figure legends to describe what they represent.
tree2 = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"
toytree.tree(tree2).draw(tree_style='n', scalebar=True);
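The difference between the two trees can also be checked numerically by summing the branch lengths from the root to each tip. The sketch below is a small hand-rolled reader for newick strings with branch lengths; the names root_to_tip and is_ultrametric are illustrative assumptions, not toytree functions, and the reader only handles the simple "name:length" form used in this notebook.

```python
def root_to_tip(newick):
    """Return {tip name: summed branch length from the root to that tip}."""
    text = newick.strip().rstrip(";").replace(" ", "")
    distances = {}
    pos = 0  # current reading position in the text

    def read_length():
        """Read an optional ':length' at the current position (0.0 if absent)."""
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            start = pos + 1
            pos = start
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            return float(text[start:pos])
        return 0.0

    def parse_clade():
        """Parse one clade and return the list of tip names it contains."""
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            tips = parse_clade()
            while text[pos] == ",":
                pos += 1  # skip ","
                tips += parse_clade()
            pos += 1  # skip ")"
        else:
            start = pos
            while pos < len(text) and text[pos] not in ",():":
                pos += 1
            tips = [text[start:pos]]
            distances[tips[0]] = 0.0
        # the branch above this clade lies on the path to every tip inside it
        length = read_length()
        for tip in tips:
            distances[tip] += length
        return tips

    parse_clade()
    return distances


def is_ultrametric(distances, tolerance=1e-9):
    """True if every tip is the same distance from the root."""
    values = list(distances.values())
    return max(values) - min(values) <= tolerance


chronogram = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
phylogram = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"

print(is_ultrametric(root_to_tip(chronogram)))  # True
print(is_ultrametric(root_to_tip(phylogram)))   # False
```

In the first tree every tip sits 3 units from the root, which is why the tips line up in the drawing; in the second tree the root-to-tip sums differ, so the tips do not align.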
# example answer
mytree = "(((apple,rose),(pea,peanut)),(wheat,rice));"
toytree.tree(mytree).draw();
Every tree has a true root position. The root defines the direction in which evolution took place (from the past to the present). This is called the "polarity" of the tree. The problem is that for many types of phylogenetic inference we only care about counting the number of changes on the tree, which we can calculate without knowing the root. In fact, the resulting trees can be represented in unrooted form, and this can still be informative, but depending on where the true root is, our interpretation of the relationships may change. Best practice is to state whether a tree is rooted or not. Often a polytomy (unresolved split) at the base of the tree indicates that it is unrooted. The tree below is unrooted. It does not include a split telling us whether the (gorilla, (chimp, human)) clade is more closely related to orangutan or to gibbon, or equally related to both.
For example, suppose we place the root of the tree on the gibbon branch. Then the orangutan is more closely related to the (gorilla,(chimp,human)) clade than the gibbon is. To conceptualize what it means to "place a root on the tree", think of it as taking a point on one of the branches of the tree and pinching it like it's a string, and pulling it back to form a new node.
Whereas if the root is placed on the orangutan branch then the result is different:
The root could also be on the edge that separates (orangutan, gibbon) from (gorilla,(human,chimp)).
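The "pinch a branch and pull" idea can be sketched in a few lines of plain Python. Here the unrooted tree is stored as a list of branches between nodes, and rooting on a branch simply writes out what hangs off each end of it. The internal node labels A, B, and C and the function names are hypothetical, chosen only for this illustration; this is not how toytree represents trees internally.

```python
# branches of the unrooted tree; A, B, C are made-up labels for internal nodes
edges = [
    ("A", "gibbon"), ("A", "orangutan"), ("A", "B"),
    ("B", "gorilla"), ("B", "C"),
    ("C", "chimp"), ("C", "human"),
]

# build a lookup of each node's neighbors
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)


def subtree(node, parent):
    """Newick text for everything reached from `node` when moving away from `parent`."""
    children = [n for n in neighbors[node] if n != parent]
    if not children:
        return node  # a tip has no further branches
    return "(" + ",".join(subtree(child, node) for child in children) + ")"


def root_on_branch(a, b):
    """Place the root on the branch between nodes `a` and `b`."""
    return "(" + subtree(a, b) + "," + subtree(b, a) + ");"


print(root_on_branch("gibbon", "A"))
# (gibbon,(orangutan,(gorilla,(chimp,human))));
print(root_on_branch("orangutan", "A"))
# (orangutan,(gibbon,(gorilla,(chimp,human))));
print(root_on_branch("A", "B"))
# ((gibbon,orangutan),(gorilla,(chimp,human)));
```

The same set of branches yields three different rooted trees, each implying a different answer to which of gibbon or orangutan is closer to the other apes.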
As you can see, knowing the root of the tree is important for interpreting the relationships of taxa. If we put the root in the wrong place, as in some of the examples above, then we could misinterpret the relationships in the tree. Information from external sources, such as other phylogenies, fossil data, morphology, or the lengths of branches, is often used to set the root of a tree. When inferring a tree we can include a very distant sample (an outgroup) and place the root on that sample to ensure the tree is correctly rooted. Consider the example below where we add a walrus into the tree. We know for sure that all of the primates are more closely related to each other than they are to the walrus.
tree3 = "(walrus:0.1,(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001):0.1);"
toytree.tree(tree3).root("walrus").draw(tree_style="n");