Notebook 12.1: Tree thinking

Topics:

This notebook is associated with the following reading:

  • Baum, David A., and Susan Offner. 2008. “Phylogenies & Tree-Thinking.” The American Biology Teacher 70 (4): 222–30.

Learning objectives:

By the end of this notebook you should be able to:

  1. Describe relationships on a phylogenetic tree.
  2. Describe ortholog/paralog relationships on a phylogenetic tree.
  3. Recognize the newick format for storing tree data.
  4. Plot a phylogenetic tree in Python.

Introduction

This exercise uses a Python package called toytree to read and write newick formatted data files, and to manipulate and draw tree visualizations. The goal of this exercise is to learn how to read phylogenetic trees, to interpret their meaning, and to understand how this information is represented as text data. You will not need to learn many details of the toytree Python library, but you can find more information about it in its documentation.

In [3]:
import toytree

Why is phylogeny important?

Your assigned reading quotes a philosopher of science who stated "It is impossible to really understand evolution without an ability to accurately interpret phylogenetic trees (O'Hara 1988, 1997)", and that "evolution itself is a theory of evolutionary trees". In biology generally, and in this class already, you have seen many examples of phylogenetic trees. But what does a phylogenetic tree represent? And why is it important to understand? There are actually a few common pitfalls that people fall into when reading phylogenetic trees without having spent serious time considering their meaning. Recognizing these mistakes will make you a better biologist. We'll cover some of these pitfalls in this notebook.

Phylogenetic inference

A phylogenetic tree is a depiction of the inferred evolutionary relationships among the units represented at the tips. Sometimes, when we have fossil or ancient DNA data, or in cases of experimental evolution (e.g., many generations of bacteria studied over decades) we can trace the evolutionary history of organisms in very fine detail. Most of the time, however, we only have data that can be sampled at the present, and from this limited data we need to infer the past. The goal of phylogenetics is to accurately reconstruct and represent past evolutionary relationships.

What is represented at the tips?

You are probably most familiar with the use of phylogenetic trees to represent the evolutionary relationships of species. In that case the splits in a tree represent speciation events. Extant species are represented at the tips, and nodes deeper in the tree represent extinct common ancestors of the descendant lineages.

Phylogenies can actually represent more than just species relationships though. You can use trees to represent the relationships among individuals sampled within a population, or to represent the relationships among genes, including patterns of gene duplication and how orthologs and paralogs have been inherited through speciation events.

Question [1]: Look at Figures 1-3 in the paper by Baum and Offner. In this paper they are describing how to teach literacy in reading phylogenetic trees. What are they trying to depict through these three figures? Do you find this approach useful for teaching the interpretation of a phylogeny? Answer in Markdown below.
They are showing different perspectives on ancestry. From examining parent-offspring relationships, to population differentiation, to speciation events and isolation of species. It nicely depicts what a phylogeny represents when you zoom in on it.

Newick tree format

The text below defines a tree in newick format. When researchers are working with phylogenetic trees as their data, this is the type of data they are working with. It is simply text. This format can contain just the relationships -- described by nested parentheses like below -- or it can contain additional information such as branch lengths, which we'll see later. You can see how the nested hierarchical relationship of a phylogeny is easily represented by a nested set of parentheses.

In [4]:
# Create the string variable to store a newick tree.
tree1 = "(gibbon, (orangutan, (gorilla, (chimp, human))));"
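
To make the link between nested parentheses and nested relationships concrete, here is a minimal sketch of a parser that turns a newick string into nested Python tuples. It is not how toytree reads trees. The function name `parse_newick` is hypothetical, and the sketch assumes a topology-only string (no branch lengths, node labels, or quoted names).

```python
def parse_newick(newick):
    """Parse a topology-only newick string into nested tuples of tip names."""
    # Drop surrounding whitespace, the final semicolon, and any spaces.
    text = newick.strip().rstrip(";").replace(" ", "")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # step past "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # step past ","
                children.append(parse_node())
            pos += 1  # step past ")"
            return tuple(children)
        # Otherwise read a tip name up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


nested = parse_newick("(gibbon, (orangutan, (gorilla, (chimp, human))));")
print(nested)
```

Each pair of parentheses becomes one tuple, so the depth of nesting in the printed result mirrors the depth of the splits in the tree.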

Draw a tree

Below is a visualization of the tree structure defined in the tree1 variable.

The toytree.tree() function returns an object (called a Toytree) which we save in the variable tre. This object has many functions associated with it for manipulating and drawing trees. The .draw() function returns a plot of the tree which is displayed in the cell output. The argument tree_style='s' to the draw function tells it to draw the figure in a particular style. Each node in the tree is labeled with a number.

In [5]:
# create a toytree object
tre = toytree.tree(tree1)

# return a tree drawing
tre.draw(tree_style='s');
[Figure: rooted tree with tips human (0), chimp (1), gorilla (2), orangutan (3), and gibbon (4). Internal nodes are numbered 5 (joins human and chimp), 6 (adds gorilla), 7 (adds orangutan), and 8 (root).]
Question [2]: In terms of the number labels on nodes, which node represents the common ancestor of chimp and human? Which is the common ancestor of orangutan and gibbon? Which of those pairs is more closely related? See Figure 6 in Baum and Offner if you need help. Answer in Markdown below.
- Node 5 is the most recent common ancestor (MRCA) of chimp and human.
- Node 8 (the root) is the MRCA of orangutan and gibbon.
- Chimp and human are more closely related.
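
The idea of a most recent common ancestor can also be sketched in code: the MRCA of two tips is the root of the smallest subtree that contains both. The example below is an illustration only, not toytree's method. It assumes the tree is written by hand as nested tuples, and the helper names `tips` and `mrca` are hypothetical.

```python
# The primate tree from above, written by hand as nested tuples.
tree = ("gibbon", ("orangutan", ("gorilla", ("chimp", "human"))))


def tips(node):
    """Return the set of tip names at or below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def mrca(node, names):
    """Return the smallest subtree of `node` that contains every name in `names`."""
    if not isinstance(node, str):
        for child in node:
            # If one child already holds all the names, the MRCA lies inside it.
            if names <= tips(child):
                return mrca(child, names)
    return node


chimp_human = mrca(tree, {"chimp", "human"})
orang_gibbon = mrca(tree, {"orangutan", "gibbon"})

# A smaller subtree means a more recent common ancestor.
print(len(tips(chimp_human)), "tips descend from the chimp/human ancestor")
print(len(tips(orang_gibbon)), "tips descend from the orangutan/gibbon ancestor")
```

The pair whose MRCA has fewer descendant tips shares the more recent ancestor and is therefore the more closely related pair.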

Rotating nodes

Rotating nodes simply changes the order in which the labels at the tips are arranged. However, it does not change the evolutionary relationships. This is because the relative branching order of the tree remains the same. Below we changed the order of human, chimp, and gorilla at the tips of the tree, but you can see that the common ancestor of human and chimp is still node 5, and their common ancestor with gorilla is still node 6. Although the drawing changed slightly, the relationships it represents remained the same.

In [8]:
toytree.tree(tree1).rotate_node(["gorilla", "chimp", "human"]).draw(tree_style='s');
[Figure: the same tree after rotating the node that joins gorilla, chimp, and human. Node numbering is unchanged: node 5 joins human and chimp, node 6 adds gorilla, node 7 adds orangutan, and node 8 is the root.]
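
One way to convince yourself that rotation changes nothing is to put the children of every node into a fixed order and compare the results. The sketch below assumes hand-written nested tuples for the original and rotated trees. The function name `canonical` is hypothetical and this is not a toytree feature.

```python
def canonical(node):
    """Return the tree with the children of every node in a fixed (sorted) order."""
    if isinstance(node, str):
        return node
    # repr() gives every child a string key, so tips and subtrees sort together.
    return tuple(sorted((canonical(child) for child in node), key=repr))


original = ("gibbon", ("orangutan", ("gorilla", ("chimp", "human"))))
# Same tree with gorilla and (chimp, human) swapped around their shared node.
rotated = ("gibbon", ("orangutan", (("chimp", "human"), "gorilla")))
# A tree with genuinely different relationships, for contrast.
different = ("gibbon", ("gorilla", ("orangutan", ("chimp", "human"))))

print(original == rotated)                        # written order differs
print(canonical(original) == canonical(rotated))  # relationships match
print(canonical(original) == canonical(different))
```

Two trees that differ only by rotations collapse to the same canonical form, while a tree with a different branching order does not.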

Test on interpreting phylogenies

Below are four newick strings and the four phylogenies they produce. Changing the order in which names are written in a newick string does not necessarily change the relationships; what matters is the pattern by which the parentheses are nested within one another. Similarly, rotating nodes on the phylogeny changes the order of the tips, but does not change the distance to a common ancestor among those tips. Thus the relationships are retained.

In [9]:
# read in four newick strings
tre1 = toytree.tree('(gibbon,(orangutan,(gorilla,(chimp,human))));')
tre2 = toytree.tree('((((human,chimp),gorilla),orangutan),gibbon);')
tre3 = toytree.tree('((((human,chimp),orangutan),gorilla),gibbon);')
tre4 = toytree.tree('(((gorilla,(human,chimp)),orangutan),gibbon);')

# draw each tree with some arbitrary node rotating
tre1.rotate_node(["human", "chimp"]).draw(use_edge_lengths=False);
tre2.rotate_node(["orangutan"]).draw(use_edge_lengths=False);
tre3.rotate_node(["human", "chimp", "orangutan"]).draw(use_edge_lengths=False);
tre4.rotate_node(["gibbon"]).draw(use_edge_lengths=False);
[Figures: four tree drawings, one per newick string above, each with the tips human, chimp, gorilla, orangutan, and gibbon plotted in a different order.]
Question [3]: Three of the rooted phylogenies above represent the same relationships among organisms, while one of them does not. Which tree (1-4) shows a different relationship? I rotated some nodes to make it more difficult. Answer in Markdown below.
- Tree 3 is different.
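
The answer can also be checked without looking at the drawings. Two rooted trees show the same relationships when they contain the same set of clades, where a clade is the group of tips enclosed by one pair of parentheses. The sketch below is a stdlib-only illustration under that definition. The function name `clades` is hypothetical, and it assumes topology-only newick strings.

```python
def clades(newick):
    """Return the set of clades (frozensets of tip names) in a newick string."""
    stack = []     # one set of tip names per currently open parenthesis
    found = set()  # every clade closed so far
    name = ""      # tip name being read character by character
    for ch in newick:
        if ch == "(":
            stack.append(set())
        elif ch in ",);":
            if name:
                stack[-1].add(name)
                name = ""
            if ch == ")":
                done = stack.pop()
                found.add(frozenset(done))
                if stack:
                    stack[-1] |= done  # tips also belong to the enclosing clade
        elif not ch.isspace():
            name += ch
    return found


trees = {
    1: "(gibbon,(orangutan,(gorilla,(chimp,human))));",
    2: "((((human,chimp),gorilla),orangutan),gibbon);",
    3: "((((human,chimp),orangutan),gorilla),gibbon);",
    4: "(((gorilla,(human,chimp)),orangutan),gibbon);",
}

reference = clades(trees[1])
for number, newick in trees.items():
    print(f"tree {number} matches tree 1: {clades(newick) == reference}")
```

Because sets ignore order, the comparison is unaffected by how the names are arranged within each pair of parentheses.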

Edge length information (Divergence times, i.e., ages of clades)

Additional information such as the ages of clades is easy to include in the newick format. Below you can see that branch lengths are simply numeric values written after a colon that follows each tip name or closing parenthesis (that is, each node of the tree). Below we use a different tree style for plotting (tree_style='n') since this style will show branch length differences.

The units of this plot are not indicated. Thus we do not know if it is thousands of years, millions of years, or if the units are even meant to represent time. Branch lengths on a tree can represent different things. Sometimes branch lengths represent the number of character differences separating taxa, and these characters could be counted from morphological data, or genetic substitutions, or even by counting other features of the genome such as inversions or transposable elements.

If a tree is inferred from DNA sequence data then the branch lengths could represent the number of observed DNA differences between species. Converting units of substitutions into units of time is a tricky business that involves making assumptions about the rate of mutations. We will discuss this more later.

In [10]:
tree2 = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
toytree.tree(tree2).draw(tree_style='n', scalebar=True);
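[Figure: tree with tips human, chimp, gorilla, orangutan, and gibbon, drawn with branch lengths and a scale bar from 0 to 3; all tips align at 0.]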

A tree where the branch lengths represent modeled DNA substitutions, as opposed to time, will likely not have all the tips align perfectly at zero. This is because different lineages may have different rates of evolution, or, even if their rates are the same, some may have accumulated more mutations by chance. Below is an example of what an inferred tree might look like when the edges are substitutions instead of time. This is sometimes called a phylogram while the above tree is a chronogram. Generally, though, we refer to both as phylogenies and simply label the axes and figure legends to describe what they represent.

In [11]:
tree2 = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"
toytree.tree(tree2).draw(tree_style='n', scalebar=True);
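[Figure: tree with tips human, chimp, gorilla, orangutan, and gibbon, drawn with branch lengths and a scale bar from 0.00 to 0.03; the tips do not all align.]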
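
The difference between the two trees above can be checked numerically by summing the branch lengths from the root to each tip. In a tree scaled to time every tip is the same distance from the root, while in a tree scaled to substitutions the distances usually differ. The sketch below is a stdlib-only illustration. The function name `root_to_tip` is hypothetical, and it assumes simple newick strings with unquoted names and a length after every tip and internal node except the root.

```python
import re


def root_to_tip(newick):
    """Return {tip name: summed branch length from the root} for a newick string."""
    # Tokens are structural characters, names, or ":length" items.
    tokens = re.findall(r"[(),;]|[^(),;:\s]+|:[0-9.eE+-]+", newick)
    dist = {}   # tip name -> distance accumulated so far
    stack = []  # one list of tip names per currently open parenthesis
    last = []   # tips below the node that was most recently read or closed
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            last = stack.pop()
            if stack:
                stack[-1].extend(last)
        elif tok.startswith(":"):
            # A length applies to every tip below the node it follows.
            for tip in last:
                dist[tip] += float(tok[1:])
        elif tok not in ",;":
            dist[tok] = 0.0
            last = [tok]
            stack[-1].append(tok)
    return dist


chronogram = "(gibbon:3,(orangutan:2,(gorilla:1,(chimp:0.25,human:0.25):0.75):1):1);"
phylogram = "(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001);"

for label, newick in [("time-scaled", chronogram), ("substitution-scaled", phylogram)]:
    distances = root_to_tip(newick)
    spread = max(distances.values()) - min(distances.values())
    print(label, {tip: round(d, 4) for tip, d in distances.items()})
    print("  tips equally far from the root:", spread < 1e-9)
```
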
Action [4]: Try to write a newick string for the relationships of six taxa (use new names, not the primates I used above) and plot it. Next, try to add branch lengths to the tree. Hint: just like in the code above, you need to store the newick string as a variable (e.g., tree1) and then load it with toytree using the `toytree.tree()` command, followed by the `.draw()` command. Look at the examples of newick strings to write yours following a similar style. The number of parentheses must match and you must put a semicolon at the end of the newick string. What source did you use to find the relationship among the organisms that you are drawing?
In [17]:
# example answer
mytree = "(((apple,rose),(pea,peanut)),(wheat,rice));"
toytree.tree(mytree).draw();
[Figure: tree with tips peanut, pea, rose, apple, rice, and wheat.]
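
If your own string fails to load, the two rules from the hint (matching parentheses and a final semicolon) are easy to check before calling toytree. The helper below is a hypothetical convenience, not part of toytree, and it checks only those two rules rather than the full newick grammar.

```python
def check_newick(newick):
    """Return a list of problems found in a newick string (empty if none)."""
    problems = []
    text = newick.strip()
    if not text.endswith(";"):
        problems.append("missing semicolon at the end")
    depth = 0  # number of parentheses currently open
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without a matching opening one")
                break
    if depth > 0:
        problems.append(f"{depth} opening parenthesis(es) never closed")
    return problems


good = "(((apple,rose),(pea,peanut)),(wheat,rice));"
bad = "((apple,rose),(pea,peanut)"

print(check_newick(good))
print(check_newick(bad))
```

An empty list means the string passed both checks; it may still contain other mistakes, such as a missing comma.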

What is tree rooting?

Every tree has a true root position. The root defines the direction in which evolution took place (from the past to the present). This is called the "polarity" of the tree. The problem is that many methods of phylogenetic inference only count the number of changes on the tree, which can be calculated without knowing the root. The resulting trees can be represented in unrooted form, and this can still be informative, but our interpretation of the relationships may change depending on where the true root is. Best practice is to state whether a tree is rooted or not. Often a polytomy (unresolved split) at the base of the tree indicates that it is unrooted. The tree below is unrooted. It does not include a split telling us whether the (gorilla, (chimp, human)) clade is more closely related to orangutan or to gibbon, or equally related to both.

In [18]:
toytree.tree(tree1).unroot().draw(tree_style='s');
[Figure: unrooted tree with tips human (0), chimp (1), gorilla (2), orangutan (3), and gibbon (4) and internal nodes 5, 6, and 7; the base of the tree is a three-way split.]

For example, suppose we place the root of the tree on the gibbon branch. Then orangutan is more closely related to the (gorilla, (chimp, human)) clade than gibbon is. To conceptualize what it means to "place a root on the tree", think of it as taking a point on one of the branches of the tree and pinching it like it's a string, and pulling it back to form a new node.

In [19]:
toytree.tree(tree1).root("gibbon").draw(tree_style='s');
[Figure: tree rooted on the gibbon branch, with tips human (0), chimp (1), gorilla (2), orangutan (3), and gibbon (4) and internal nodes 5 to 8.]

Whereas if the root is placed on the orangutan branch then the result is different:

In [20]:
toytree.tree(tree1).root("orangutan").draw(tree_style='s');
[Figure: tree rooted on the orangutan branch, with tips human (0), chimp (1), gorilla (2), gibbon (3), and orangutan (4) and internal nodes 5 to 8; node 8 is labeled root.]

The root could also be on the edge that separates (orangutan, gibbon) from (gorilla, (human, chimp)).

In [21]:
toytree.tree(tree1).root(["orangutan", "gibbon"]).draw(tree_style='s');
[Figure: tree rooted on the edge leading to orangutan and gibbon, with tips human (0), chimp (1), gorilla (2), gibbon (3), and orangutan (4) and internal nodes 5 to 8; node 8 is labeled root.]
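
The three rootings can be compared in code. Moving the root changes which groups of tips form clades, but it does not change the underlying unrooted tree, which is defined by its splits (each internal edge divides the tips into two groups). The sketch below assumes the three rooted trees are written by hand as nested tuples to mirror the drawings above. The function names are hypothetical and this is not toytree's implementation.

```python
def collect(node, found):
    """Add the tip set of every internal node to `found`; return this node's tips."""
    if isinstance(node, str):
        return frozenset([node])
    below = frozenset().union(*(collect(child, found) for child in node))
    found.add(below)
    return below


def rooted_clades(tree):
    """Return the set of clades implied by a rooted tree."""
    found = set()
    collect(tree, found)
    return found


def unrooted_splits(tree):
    """Return the non-trivial splits of the tips, ignoring where the root is."""
    found = set()
    all_tips = collect(tree, found)
    splits = set()
    for clade in found:
        other = all_tips - clade
        # Skip splits that only separate a single tip (or nothing) from the rest.
        if len(clade) > 1 and len(other) > 1:
            splits.add(frozenset([clade, other]))
    return splits


inner = ("gorilla", ("chimp", "human"))
rootings = {
    "gibbon": ("gibbon", ("orangutan", inner)),
    "orangutan": ("orangutan", ("gibbon", inner)),
    "orangutan+gibbon": (("orangutan", "gibbon"), inner),
}

reference = rootings["gibbon"]
for name, tree in rootings.items():
    same_clades = rooted_clades(tree) == rooted_clades(reference)
    same_splits = unrooted_splits(tree) == unrooted_splits(reference)
    print(f"root on {name}: same clades {same_clades}, same splits {same_splits}")
```
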

As you can see, knowing the root of the tree is important for interpreting the relationships of taxa. If we put the root in the wrong place, as in some of the examples above, then we could misinterpret the relationships in the tree. Information from external sources, such as other phylogenies, fossil data, morphology, or the lengths of branches, is often used to set the root of a tree. When inferring a tree we can include a very distant sample (an outgroup) and place the root on that sample to ensure the tree is correctly rooted. Consider the example below where we add a walrus to the tree. We know for sure that all of the primates are more closely related to each other than they are to the walrus.

In [22]:
tree3 = "(walrus:0.1,(gibbon:0.03,(orangutan:0.02,(gorilla:0.01,(chimp:0.0075,human:0.0025):0.0075):0.001):0.001):0.1);"
toytree.tree(tree3).root("walrus").draw(tree_style="n");
[Figure: tree rooted on walrus, with tips human, chimp, gorilla, orangutan, gibbon, and walrus.]