Tree Parsing (I/O)¶
Parsing tree data involves loading a tree topology and associated metadata from a serialized text format into a data structure. toytree loads trees from a variety of text formats (Newick, Nexus, NHX) stored in a file, URL, or string, and returns a ToyTree class object.
toytree makes this simple through the general-purpose toytree.tree() function. In most cases, you can call this function on your data (string, file, or URL) without specifying the input data type or format.
import toytree
# example newick string
DATA = "((tip1:2,tip2:2):1,tip3:3);"
# load/parse into a ToyTree
tree = toytree.tree(DATA)
tree
<toytree.ToyTree at 0x7f6e4600b640>
Take Home
You can parse almost any tree data (file, string, Nexus, Newick, etc.) using toytree.tree().
Tree data formats¶
Below are examples of the common Newick, NHX, and Nexus tree data formats. Newick is the base format, and the other two are extensions of it. Parsing of each format is described in more detail further below. While a few additional formats (e.g., JSON or XML) are sometimes used to store tree data, these Newick-based formats are the most common.
# newick: represents a topology using nested parentheses
NEWICK0 = "((,),);"
# newick: name strings are usually present for tips as `(label,)`
NEWICK1 = "((tip1,tip2),tip3);"
# newick: names can also be present for internal nodes as `()label`
NEWICK2 = "((tip1,tip2)internal1,tip3)internal2;"
# newick: edge lengths (dists) are usually present as `()label:dist`
NEWICK3 = "((tip1:2,tip2:2):1,tip3:3);"
# newick: support values can be stored in place of internal names `()support`
NEWICK4 = "((tip1,tip2)100,tip3);"
# nhx: additional metadata is stored as key=value pairs as `()[meta]`
NHX1 = "((tip1[&trait=2],tip2[&trait=4])[&trait=3],tip3[&trait=1])[&trait=5];"
# nexus: newick/NHX data with other code blocks between (begin... end;)
NEXUS1 = """
#NEXUS
begin trees;
translate
1 apple,
2 blueberry,
3 cantaloupe,
4 durian,
;
tree tree0 = [&U] ((1,2),(3,4));
end;
"""
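To make the nested-parentheses idea concrete, here is a minimal sketch of a recursive Newick reader. It is an illustration only, not toytree's parser: the function names `parse_newick` and `tip_labels` and the dict layout are assumptions made for this example, and it handles only plain Newick (labels and `:dist` values), not NHX brackets or quoted labels.

```python
def parse_newick(newick: str) -> dict:
    """Parse a plain Newick string into nested dicts (hypothetical layout)."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        # a '(' opens a comma-separated list of child subtrees
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ')'
        # the optional label runs until the next structural character
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        # an optional ':' introduces the edge length (dist)
        dist = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            dist = float(text[start:pos])
        return {"label": label, "dist": dist, "children": children}

    root = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return root


def tip_labels(node: dict) -> list:
    """Collect the labels of all tips (nodes without children), left to right."""
    if not node["children"]:
        return [node["label"]]
    return [lab for child in node["children"] for lab in tip_labels(child)]


tree = parse_newick("((tip1:2,tip2:2):1,tip3:3);")
print(tip_labels(tree))
print(tree["children"][0]["dist"])
```

Each `(` starts a new level of nesting, which is why a recursive function maps naturally onto the format. In practice toytree.tree() does this work for you and returns a full ToyTree object.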
Parsing ToyTrees (tldr;)¶
Parsing tree data is made simple in toytree through the general-purpose toytree.tree() function. For example, this function can parse all of the above data strings correctly without any additional arguments to specify the data or metadata formats. Moreover, it can parse these data regardless of whether they are entered as a string, a file path, or a public URL. In this way, toytree.tree() acts as a sort of Swiss Army knife for tree data parsing.
# parse all 7 tree data strings from above into ToyTree objects
data = [NEWICK0, NEWICK1, NEWICK2, NEWICK3, NEWICK4, NHX1, NEXUS1]
trees = [toytree.tree(i) for i in data]
trees
[<toytree.ToyTree at 0x7f6e4600be50>, <toytree.ToyTree at 0x7f6e46048280>, <toytree.ToyTree at 0x7f6e460485e0>, <toytree.ToyTree at 0x7f6e46048a90>, <toytree.ToyTree at 0x7f6e46048f40>, <toytree.ToyTree at 0x7f6e460493f0>, <toytree.ToyTree at 0x7f6e46049900>]
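One way to picture how a single function can accept strings, file paths, and URLs is a small dispatch step that inspects the input before parsing. The sketch below is a guess at the general idea, not toytree's actual logic: the function name `classify_tree_input`, its return labels, and the example path `trees/example.nwk` are all hypothetical.

```python
def classify_tree_input(data: str) -> str:
    """Guess what kind of tree input a string represents (illustrative only)."""
    text = data.strip()
    # public URLs are recognized by their scheme prefix
    if text.startswith(("http://", "https://")):
        return "url"
    # Nexus data starts with a "#NEXUS" header
    if text.upper().startswith("#NEXUS"):
        return "nexus string"
    # Newick and NHX strings start with an opening parenthesis
    if text.startswith("("):
        return "newick string"
    # anything else is assumed here to be a path to a file
    return "file path"


for example in [
    "((tip1,tip2),tip3);",
    "#NEXUS\nbegin trees;\nend;",
    "https://eaton-lab.org/data/Cyathophora.tre",
    "trees/example.nwk",  # hypothetical local path
]:
    print(classify_tree_input(example))
```

A real implementation would also need to confirm that a file exists and that its contents are valid tree data; this sketch only shows the dispatch idea.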
Newick format¶
A ToyTree can be flexibly loaded from a range of text formats. When parsing Newick data it is important to be aware of its limitations. Specifically, internal node labels are used for different purposes: to store node names, node support values (as ints or floats), or sometimes other forms of metadata. The toytree.tree function will auto-detect whether these labels should be stored as names or supports based on whether their values are numeric. However, you can also override this behavior to assign the values to a feature name of your choice. This is demonstrated below using two examples of Newick strings with different internal node label types (NEWICK2 and NEWICK4, from above).
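The auto-detection rule described above can be sketched in a few lines. This is a simplified illustration under the assumption that "numeric" means "parses as a float"; the function name `classify_internal_labels` is hypothetical and the code is not taken from toytree.

```python
def classify_internal_labels(labels: list) -> str:
    """Return 'support' if every non-empty label is numeric, else 'name'."""
    present = [label for label in labels if label != ""]
    if not present:
        # no labels to inspect; fall back to treating them as names
        return "name"
    for label in present:
        try:
            float(label)
        except ValueError:
            # a single non-numeric label means all are treated as names
            return "name"
    return "support"


# internal labels of NEWICK2 and NEWICK4 from above
print(classify_internal_labels(["internal1", "internal2"]))
print(classify_internal_labels(["100", ""]))
```

Note that a float-based check would also accept strings such as "1e3" or "nan" as numeric, which is one reason an explicit override is useful.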
Internal labels as names¶
If any internal node labels present are non-numeric then they will be parsed and stored as "name" features of Nodes. In the example below the Newick string is parsed into a ToyTree object and its .get_node_data() method is called to show the tree's metadata, confirming that the labels were assigned to 'name'.
# print newick with str labels for tips and internal nodes
print(f"Newick = {NEWICK2}")
# parse the newick string with .tree()
tree = toytree.tree(NEWICK2)
# show the tree data (labels were assigned to 'name' feature)
tree.get_node_data()
Newick = ((tip1,tip2)internal1,tip3)internal2;
 | idx | name | height | dist | support |
---|---|---|---|---|---|
0 | 0 | tip1 | 0.0 | 1.0 | NaN |
1 | 1 | tip2 | 0.0 | 1.0 | NaN |
2 | 2 | tip3 | 1.0 | 1.0 | NaN |
3 | 3 | internal1 | 1.0 | 1.0 | NaN |
4 | 4 | internal2 | 2.0 | 0.0 | NaN |
Internal labels as support¶
In contrast to the example above, you can see that the internal labels here are numeric and have thus been stored as "support" features instead of "name", and the internal nodes have names set to the default empty strings. This is the typical format of a Newick string generated by phylogenetic inference software, usually representing some kind of support values. Note that tip nodes/edges do not have support values, nor does the root edge. Support values are actually features of edges, not nodes. This is important for how they are re-oriented when trees are re-rooted (see Edge Features).
# print newick with str labels for tips and int labels for internal nodes
print(f"Newick = {NEWICK4}")
# parse the newick string with .tree()
tree = toytree.tree(NEWICK4)
# show the tree data (labels assigned to 'support' for internal Node)
tree.get_node_data()
Newick = ((tip1,tip2)100,tip3);
 | idx | name | height | dist | support |
---|---|---|---|---|---|
0 | 0 | tip1 | 0.0 | 1.0 | NaN |
1 | 1 | tip2 | 0.0 | 1.0 | NaN |
2 | 2 | tip3 | 1.0 | 1.0 | NaN |
3 | 3 | | 1.0 | 1.0 | 100.0 |
4 | 4 | | 2.0 | 0.0 | NaN |
Internal labels explicit¶
As you've seen, the use of internal Newick labels can be inconsistent, which is one of the main reasons that the extended Newick format (NHX), introduced next, was developed. Nevertheless, instead of relying on the toytree.tree function to automatically parse the internal labels as names or support values, you can explicitly enter the feature name you want the values assigned to using the internal_labels arg. For example, you could enter "name" or "support", in which case the values will still be parsed as str or float types, respectively, or you can enter any other name to store them as a different feature.
# parse the newick string and assign its internal str labels to a custom feature
tre0 = toytree.tree(NEWICK2, internal_labels="arbitrary")
# show the tree data where labels were assigned to "arbitrary"
tre0.get_node_data()
 | idx | name | height | dist | support | arbitrary |
---|---|---|---|---|---|---|
0 | 0 | tip1 | 0.0 | 1.0 | NaN | NaN |
1 | 1 | tip2 | 0.0 | 1.0 | NaN | NaN |
2 | 2 | tip3 | 1.0 | 1.0 | NaN | NaN |
3 | 3 | | 1.0 | 1.0 | NaN | internal1 |
4 | 4 | | 2.0 | 0.0 | NaN | internal2 |
NHX format¶
The extended New Hampshire format (NHX) has emerged as a more recent and popular format for tree data storage (although unfortunately the precise rules for the format are not consistently followed). In addition to the standard information in Newick data provided by parentheses (topology) and edge lengths, any additional and arbitrary metadata can be stored within square brackets.
The toytree.tree() function will automatically detect if square brackets are present in a Newick string and parse the associated metadata. It is important to note that different programs sometimes vary in the way that they store data inside of the square brackets, and so toytree.tree takes a number of additional optional arguments that can be entered to properly parse the NHX metadata. Below are some examples.
Finally, NHX format has the advantage over Newick in that it can distinguish between data that is assigned to Nodes versus Edges in a tree. Data on edges, such as support values, are treated differently than data on nodes, such as trait values, when re-rooting trees (See Data/Features for more on this).
# only tip Node metadata
NHX1 = "((a[&N=1],b[&N=2]),c[&N=3]);"
# only internal Node metadata
NHX2 = "((a,b)[&N=4],c)[&N=5];"
# both tip and internal Node metadata
NHX3 = "((a[&N=1],b[&N=2])[&N=4],c[&N=3])[&N=5];"
# only edge metadata
NHX4 = "((a:1[&E=1],b:1[&E=2]):1[&E=4],c:1[&E=3]);"
# both node and edge metadata
NHX5 = "((a[&N=1]:1[&E=1],b[&N=2]:1[&E=2])[&N=4]:1[&E=4],c[&N=3]:1[&E=3])[&N=5];"
# NHX1 has only tip node data mapped to feature "N"
toytree.tree(NHX1).get_node_data()
 | idx | name | height | dist | support | N |
---|---|---|---|---|---|---|
0 | 0 | a | 0.0 | 1.0 | NaN | 1.0 |
1 | 1 | b | 0.0 | 1.0 | NaN | 2.0 |
2 | 2 | c | 1.0 | 1.0 | NaN | 3.0 |
3 | 3 | | 1.0 | 1.0 | NaN | NaN |
4 | 4 | | 2.0 | 0.0 | NaN | NaN |
# NHX5 has all node data mapped to feature "N" and edge data to feature "E"
toytree.tree(NHX5).get_node_data()
 | idx | name | height | dist | support | E | N |
---|---|---|---|---|---|---|---|
0 | 0 | a | 0.0 | 1.0 | NaN | 1.0 | 1.0 |
1 | 1 | b | 0.0 | 1.0 | NaN | 2.0 | 2.0 |
2 | 2 | c | 1.0 | 1.0 | NaN | 3.0 | 3.0 |
3 | 3 | | 1.0 | 1.0 | NaN | 4.0 | 4.0 |
4 | 4 | | 2.0 | 0.0 | NaN | NaN | 5.0 |
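To illustrate how bracketed metadata can be separated from the topology, and how node data might be told apart from edge data, here is a rough regex-based sketch. It assumes the `[&key=value]` layout used in the examples above, with comma-separated pairs, and it assumes that a block placed directly after a `:dist` value describes an edge while any other block describes a node. Both the position rule and the helper names (`extract_nhx_metadata`, `strip_nhx_metadata`) are assumptions for this example, not toytree's implementation.

```python
import re

# matches one "[&...]" metadata block and captures its contents
NHX_BLOCK = re.compile(r"\[&([^\]]*)\]")
# matches when the text before a block ends in an edge length such as ":1"
EDGE_LENGTH_SUFFIX = re.compile(r":[^,()\[\]:]+$")


def extract_nhx_metadata(newick: str) -> list:
    """Return (kind, {key: value}) tuples in the order blocks appear."""
    records = []
    for match in NHX_BLOCK.finditer(newick):
        pairs = {}
        for item in match.group(1).split(","):
            if "=" in item:
                key, value = item.split("=", 1)
                pairs[key.strip()] = value.strip()
        # assumed rule: a block right after ":dist" belongs to the edge
        preceding = newick[: match.start()]
        kind = "edge" if EDGE_LENGTH_SUFFIX.search(preceding) else "node"
        records.append((kind, pairs))
    return records


def strip_nhx_metadata(newick: str) -> str:
    """Remove all metadata blocks, leaving plain Newick."""
    return NHX_BLOCK.sub("", newick)


NHX5 = "((a[&N=1]:1[&E=1],b[&N=2]:1[&E=2])[&N=4]:1[&E=4],c[&N=3]:1[&E=3])[&N=5];"
for kind, pairs in extract_nhx_metadata(NHX5):
    print(kind, pairs)
print(strip_nhx_metadata(NHX5))
```

Values are kept as strings in this sketch, whereas the tables above show them converted to floats. Files written by other programs may use different delimiters or prefixes inside the brackets, which is why toytree.tree exposes optional arguments for these cases.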
NEXUS format¶
The NEXUS format is popular in the field of phylogenetics because it provides a flexible format for storing a variety of information -- both data and instructions -- that can be used by multiple software tools. A NEXUS file starts with a "#NEXUS" header, and then contains one or more blocks delimited by "begin" and "end;" statements. For example, a "data" block would start with "begin data" and could contain morphological or molecular data. Another block might include code instructions for the mrbayes software, which takes a NEXUS file as input with instructions for an analysis. This could then write results to a "trees" block, which contains one or more Newick or NHX strings. In this way, a NEXUS file can fully describe an analysis from data -> analysis -> trees, as in the example below.
For now, as far as toytree is concerned, only the "trees" block is of interest, and all other blocks are ignored. The toytree.tree() function will parse the tree data inside a NEXUS file just as it parses other Newick or NHX strings.
# nexus: Newick/NHX data with other code blocks between (begin... end;)
NEXUS_EXAMPLE = """
#NEXUS
begin data;
...
end;
begin mrbayes;
...
end;
begin trees;
translate
1 apple,
2 blueberry,
3 cantaloupe,
4 durian,
;
tree tree0 = [&U] ((1,2),(3,4));
end;
"""
# parse NEXUS file and show tree data
tree = toytree.tree(NEXUS_EXAMPLE)
tree.get_node_data()
 | idx | name | height | dist | support |
---|---|---|---|---|---|
0 | 0 | apple | 0.0 | 1.0 | NaN |
1 | 1 | blueberry | 0.0 | 1.0 | NaN |
2 | 2 | cantaloupe | 0.0 | 1.0 | NaN |
3 | 3 | durian | 0.0 | 1.0 | NaN |
4 | 4 | | 1.0 | 1.0 | NaN |
5 | 5 | | 1.0 | 1.0 | NaN |
6 | 6 | | 2.0 | 0.0 | NaN |
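The handling of a "trees" block can be sketched as three steps: locate the block, read the translate table, and substitute the translated names into each tree string. The code below is a simplified illustration, not toytree's parser; the function name `parse_nexus_trees` is hypothetical, and it assumes tip labels contain no whitespace, commas, or semicolons.

```python
import re


def parse_nexus_trees(nexus: str) -> list:
    """Return Newick strings from a NEXUS trees block with names translated."""
    flags = re.IGNORECASE | re.DOTALL
    # step 1: isolate the text between "begin trees;" and "end;"
    block = re.search(r"begin\s+trees\s*;(.*?)\bend\s*;", nexus, flags)
    if block is None:
        raise ValueError("no trees block found")
    body = block.group(1)

    # step 2: read "key name" pairs from the optional translate table
    translate = {}
    table = re.search(r"\btranslate\b(.*?);", body, flags)
    if table is not None:
        for entry in table.group(1).split(","):
            parts = entry.split()
            if len(parts) == 2:
                translate[parts[0]] = parts[1]

    # step 3: collect each tree statement and swap in the translated names
    trees = []
    pattern = r"\btree\s+\S+\s*=\s*(?:\[&[UR]\]\s*)?([^;]*;)"
    for match in re.finditer(pattern, body, re.IGNORECASE):
        newick = re.sub(
            r"(?<=[(,])([^(),:;\[\]]+)",  # a label right after '(' or ','
            lambda m: translate.get(m.group(1), m.group(1)),
            match.group(1),
        )
        trees.append(newick)
    return trees


EXAMPLE = """
#NEXUS
begin trees;
translate
1 apple,
2 blueberry,
3 cantaloupe,
4 durian,
;
tree tree0 = [&U] ((1,2),(3,4));
end;
"""
print(parse_nexus_trees(EXAMPLE))
```

Blocks other than "trees" are simply never matched by the first regular expression, which mirrors the statement above that they are ignored.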
Parsing MultiTrees¶
Sometimes data from multiple trees are stored together in a single file, such as the results of a bootstrap analysis, or a posterior distribution of trees from a Bayesian phylogenetic inference. toytree can parse and load all trees in a multiple tree input using the toytree.mtree function. This returns a MultiTree object (see MultiTree), which has methods that can apply to sets of trees, and from which individual ToyTrees can be indexed and extracted.
# a str with Newick data separated by new lines
MULTILINE_NEWICK = """
(((a:1,b:1):1,(d:1.5,e:1.5):0.5):1,c:3);
(((a:1,d:1):1,(b:1,e:1):1):1,c:3);
(((a:1.5,b:1.5):1,(d:1,e:1):1.5):1,c:3.5);
(((a:1.25,b:1.25):0.75,(d:1,e:1):1):1,c:3);
(((a:1,b:1):1,(d:1.5,e:1.5):0.5):1,c:3);
(((b:1,a:1):1,(d:1.5,e:1.5):0.5):2,c:4);
(((a:1.5,b:1.5):0.5,(d:1,e:1):1):1,c:3);
(((b:1.5,d:1.5):0.5,(a:1,e:1):1):1,c:3);
"""
# parse with .mtree
mtree = toytree.mtree(MULTILINE_NEWICK)
mtree
<toytree.MultiTree ntrees=8>
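Because every Newick string ends with a semicolon, a multi-tree string can be split into individual trees with very little code. The sketch below shows the idea with a hypothetical helper, `split_newick_trees`; it assumes that semicolons never occur inside labels or comments, and it is not how toytree.mtree is implemented.

```python
def split_newick_trees(text: str) -> list:
    """Split text holding several Newick trees into one string per tree."""
    # each tree ends with ';', so split there and restore the terminator
    return [chunk.strip() + ";" for chunk in text.split(";") if chunk.strip()]


THREE_TREES = """
(((a:1,b:1):1,(d:1.5,e:1.5):0.5):1,c:3);
(((a:1,d:1):1,(b:1,e:1):1):1,c:3);
(((a:1.5,b:1.5):1,(d:1,e:1):1.5):1,c:3.5);
"""
newicks = split_newick_trees(THREE_TREES)
print(len(newicks))
print(newicks[0])
```

Each resulting string could then be parsed on its own, although toytree.mtree handles the splitting and returns a MultiTree directly.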
# a Nexus str with trees in a trees block
MULTI_NEXUS = """
#NEXUS
begin trees;
translate
1 a,
2 b,
3 c,
4 d,
5 e,
;
tree 1 = [&R] (((1:1,2:1):1,(4:1.5,5:1.5):0.5):1,3:3);
tree 2 = [&R] (((1:1,4:1):1,(2:1,5:1):1):1,3:3);
tree 3 = [&R] (((1:1.5,2:1.5):1,(4:1,5:1):1.5):1,3:3.5);
tree 4 = [&R] (((1:1.25,2:1.25):0.75,(4:1,5:1):1):1,3:3);
tree 5 = [&R] (((1:1,2:1):1,(4:1.5,5:1.5):0.5):1,3:3);
tree 6 = [&R] (((2:1,1:1):1,(4:1.5,5:1.5):0.5):2,3:4);
tree 7 = [&R] (((1:1.5,2:1.5):0.5,(4:1,5:1):1):1,3:3);
tree 8 = [&R] (((2:1.5,4:1.5):0.5,(1:1,5:1):1):1,3:3);
end;
"""
# parse with .mtree
mtree = toytree.mtree(MULTI_NEXUS)
mtree
<toytree.MultiTree ntrees=8>
If you call toytree.mtree on a file containing a single tree then it will simply return a MultiTree object containing only a single ToyTree within it. If you call toytree.tree on a file containing multiple trees it will return the first tree in the file as a ToyTree, but will also print a warning to make sure you know that the input contained multiple trees.
# calling .mtree on a single tree input is OK
toytree.mtree(NEWICK1)
<toytree.MultiTree ntrees=1>
# calling .tree on a multiple tree input is also OK, but logs a warning
toytree.tree(MULTILINE_NEWICK)
⚠️ toytree | parse:parse_tree | Data contains (8) trees. Loading first using `toytree.tree`. Use `toytree.mtree` to instead load a MultiTree.
<toytree.ToyTree at 0x7f6e4600aec0>
Loading trees from URLs¶
A convenient feature of toytree.tree is the ability to load tree data from a public URL. If you provide a string as input that begins with "http" then the text at that URL will be checked for valid tree data, and if it is valid, it is returned as a tree. You can thus store your trees on any public site, such as a GitHub repo, and easily load them without having to write out a long file path.
toytree.tree("https://eaton-lab.org/data/Cyathophora.tre")
<toytree.ToyTree at 0x7f6e45e92ad0>
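If you want to inspect the raw text at a URL before parsing it, the download-and-check step can be approximated with the standard library. This is a minimal sketch under stated assumptions: `looks_like_tree_data` and `fetch_tree_text` are hypothetical helpers, the validity check is deliberately crude, and toytree's own handling may differ.

```python
from urllib.request import urlopen


def looks_like_tree_data(text: str) -> bool:
    """Cheap check that text resembles Nexus or Newick tree data."""
    text = text.strip()
    if text.upper().startswith("#NEXUS"):
        return True
    return text.startswith("(") and text.endswith(";")


def fetch_tree_text(url: str, timeout: float = 10.0) -> str:
    """Download text from a public URL and confirm it looks like tree data."""
    if not url.startswith("http"):
        raise ValueError(f"not an http(s) URL: {url!r}")
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    if not looks_like_tree_data(text):
        raise ValueError("downloaded text does not look like tree data")
    return text


# the check itself needs no network access
print(looks_like_tree_data("((tip1,tip2),tip3);"))
print(looks_like_tree_data("<html><body>Not found</body></html>"))
```

The string returned by `fetch_tree_text` could then be passed to toytree.tree() like any other Newick or Nexus string.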