Data/Features¶
A common mistake when working with tree data is assigning trait values to the wrong nodes of the tree. This is most prevalent when trait data are stored in a matrix or dataframe separate from the tree object itself, and operations such as re-rooting or ladderizing are then applied to the tree. It is important that trees and trait data are kept in sync. To avoid this problem, we recommend using ToyTree objects themselves as the primary data storage object in your analyses. It is very simple to assign data to nodes of a tree, and to fetch data back from a tree as a dataframe or in various alternative formats. A recommended workflow for working with data on trees is demonstrated in this section.
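The out-of-sync problem described above can be sketched in plain Python without toytree. The tip names, orders, and values here are hypothetical and serve only to show why values aligned by position go stale once a tree is reordered.

```python
# hypothetical tip order of a tree, and trait values aligned to it by position
tip_order = ["a", "b", "c"]
traits_by_position = [1.0, 2.0, 3.0]

# an operation such as ladderizing may change the order of the tips ...
reordered = ["c", "a", "b"]

# ... so a positional lookup now pairs tips with the wrong values
stale = dict(zip(reordered, traits_by_position))

# values keyed by the tip's own name survive the reordering
traits_by_name = dict(zip(tip_order, traits_by_position))
synced = {name: traits_by_name[name] for name in reordered}
```

In the stale mapping tip "c" receives the value that belongs to "a", whereas the name-keyed mapping still returns 3.0 for "c". Storing values on the nodes themselves avoids the positional lookup altogether.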
import toytree
import toyplot
import numpy as np
Simple Example¶
The functions set_node_data and get_node_data provide a broad suite of functionality for setting data on one or more nodes of a tree and subsequently fetching the data back in a variety of formats, and in the correct order for plotting. By default, the data setting function returns a modified copy of the tree with the new data assigned. Alternatively, you can use the argument inplace=True to set data on the tree object in place. In either case, a tree is returned by the function, which allows you to chain it with the data getter function to return the data for one or more node features.
# an example tree
tree = toytree.rtree.unittree(ntips=5, seed=123)
# set the feature "X" to a value of 10 on all Nodes in a tree
tree.set_node_data(feature="X", default=10, inplace=True);
# get the values of "X" for all Nodes in idx traversal order
tree.get_node_data("X")
0    10
1    10
2    10
3    10
4    10
5    10
6    10
7    10
8    10
dtype: int64
# chain the two functions together to set & get values for a feature
tree.set_node_data("X", default=10).get_node_data("X")
0    10
1    10
2    10
3    10
4    10
5    10
6    10
7    10
8    10
dtype: int64
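The difference between returning a modified copy and modifying in place can be illustrated with a small plain-Python sketch. The TinyTree class and its set_data and get_data methods are hypothetical stand-ins invented for this illustration; they are not part of toytree and do not describe how set_node_data is implemented.

```python
import copy


class TinyTree:
    """Hypothetical stand-in for a tree that stores feature data."""

    def __init__(self):
        self.data = {}

    def set_data(self, feature, value, inplace=False):
        # modify this object when inplace=True, otherwise a deep copy of it
        tree = self if inplace else copy.deepcopy(self)
        tree.data[feature] = value
        # returning the tree in both cases is what makes chaining possible
        return tree

    def get_data(self, feature):
        return self.data.get(feature)


original = TinyTree()

# default: the original is untouched and the returned copy holds the data
chained = original.set_data("X", 10).get_data("X")
kept_clean = original.get_data("X")

# inplace=True: the same object is modified and returned
same_object = original.set_data("X", 10, inplace=True) is original
```

A practical consequence of this pattern is that data set without inplace=True is lost unless the returned tree is stored in a variable.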
Features¶
In toytree terminology a "feature" refers to the name of a trait for which data is stored on nodes in a tree. For example, each ToyTree has several data features by default, such as name, dist, and support. You can create and store data under any arbitrary feature name (except for a few disallowed names and characters). In fact, when you load a tree from a newick, NHX, or NEXUS formatted data file created by a phylogenetic tree inference program, it will often contain additional metadata that will be loaded as features. Several examples of this are shown in the tree parsing documentation. A ToyTree contains a dynamic property .features containing all feature names currently assigned to that tree. (This includes any feature that is assigned to at least one Node in the tree; it does not need to be assigned to every Node in the tree.)
# all feature names assigned to at least one Node in this tree
tree.features
('idx', 'name', 'height', 'dist', 'support', 'X')
Data as Node attributes¶
When storing data to a ToyTree it is actually stored to individual Node objects as Node attributes. This is demonstrated below where data is assigned to a feature named "Z" for two Nodes in the tree. Setting and retrieving data directly from Nodes as attributes like this is the fastest way to set/get data, and is thus especially useful for developers. However, for general toytree usage, we recommend using the helper functions set_node_data and get_node_data to set and retrieve data, as they provide a number of benefits, especially in terms of dealing with missing values, checking data types, and ordering data values.
# set a value for the attribute (feature) named "Z" on two Nodes
tree[0].Z = "A"
tree[1].Z = "B"
When the get_node_data function is called without any features selected, it returns a dataframe showing all features on the current tree. Here, this tree includes the five default features in addition to the new feature "X", for which we assigned a value of 10 to all Nodes above, and it also includes the attribute "Z", which has been assigned to only two Nodes in the tree. For Nodes that do not have a "Z" feature, a default missing value of NaN (numpy.nan) is returned in the dataframe (but note that NaN is not assigned to the "Z" attribute of those Nodes by this function).
# return a dataframe with all feature data
tree.get_node_data()
| | idx | name | height | dist | support | X | Z |
|---|---|---|---|---|---|---|---|
| 0 | 0 | r0 | 0.000000 | 0.333333 | NaN | 10 | A |
| 1 | 1 | r1 | 0.000000 | 0.333333 | NaN | 10 | B |
| 2 | 2 | r2 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 3 | 3 | r3 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 4 | 4 | r4 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 5 | 5 | | 0.333333 | 0.333333 | NaN | 10 | NaN |
| 6 | 6 | | 0.666667 | 0.333333 | NaN | 10 | NaN |
| 7 | 7 | | 0.666667 | 0.333333 | NaN | 10 | NaN |
| 8 | 8 | | 1.000000 | 0.000000 | NaN | 10 | NaN |
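The point that missing values are filled in the returned table without being written back to the Nodes can be sketched in plain Python. The Node class and get_feature function below are hypothetical stand-ins for illustration and are not toytree's own classes or implementation.

```python
import math


class Node:
    """Hypothetical stand-in for a tree node; not toytree's Node class."""

    def __init__(self, idx):
        self.idx = idx


def get_feature(nodes, feature, missing=math.nan):
    """Return {idx: value}, substituting `missing` where the attribute is absent."""
    return {node.idx: getattr(node, feature, missing) for node in nodes}


nodes = [Node(i) for i in range(4)]
nodes[0].Z = "A"
nodes[1].Z = "B"

filled = get_feature(nodes, "Z", missing="C")

# reading with a fallback does not create the attribute on the other nodes
still_unset = [hasattr(node, "Z") for node in nodes]
```

Here the fallback value appears only in the returned mapping, and the last two nodes still have no "Z" attribute afterwards.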
Set Node data¶
The set_node_data function is the generally recommended way to assign data to nodes on a tree. Data can be entered using either a dictionary or a sequence of values, and a number of options are available to make it easier to assign values to many nodes without having to type each out individually. A related function, set_node_data_from_dataframe, is also available, which allows setting multiple features at the same time from tabular data loaded as a pandas DataFrame. Here, however, we will focus on adding a single feature at a time.
Setting data by dict¶
The simplest way to enter specific data values is by using a dictionary. The keys of the dictionary can correspond to any valid Node Query to select one or more Nodes, and the corresponding value will be assigned to these Nodes. Notably, you can enter a dict selecting only a few Nodes and the feature will be assigned to those Nodes only, and not to any Nodes that are not selected in the dict. If you want to assign a default value to all other nodes you can do so using the default argument. Finally, the argument inherit can be used to also assign a value to all descendants of a selected Node. Examples of each of these are shown below.
# set data to feature "Y" for three Nodes
data = {0: 10, 1: 20, 2: 30}
tree.set_node_data("Y", data=data).get_node_data("Y")
0    10.0
1    20.0
2    30.0
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
dtype: float64
In the next example the data dictionary selects nodes using a variety of Node Queries. The first is a regular expression that matches the first four nodes in the tree, the second matches the node named "r4", and the last matches the node with an int index of 8. Finally, we use the default arg to set a value of 0 on all other Nodes not selected in the data dict. In this way, we easily assign values to all 9 nodes in the tree without having to write a value for each.
# set data to feature "Y" using a dict w/ node queries, and the default arg
data = {"~r[0-3]": 10, "r4": 20, 8: 50}
tree.set_node_data(feature="Y", data=data, default=0).get_node_data("Y")
0    10
1    10
2    10
3    10
4    20
5     0
6     0
7     0
8    50
dtype: int64
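How a dict of mixed queries can expand to a value for every node is sketched below in plain Python. The resolve_query function is hypothetical, and the matching rules it uses (an int is an index, a "~" prefix marks a regular expression applied with re.search, any other string must equal the name exactly) are assumptions made for illustration rather than a description of toytree's query engine.

```python
import re


def resolve_query(names, query):
    """Return the node indices selected by one query (illustrative rules only)."""
    if isinstance(query, int):
        return [query]
    if query.startswith("~"):
        pattern = re.compile(query[1:])
        return [i for i, name in enumerate(names) if pattern.search(name)]
    return [i for i, name in enumerate(names) if name == query]


# node names in idx order; internal nodes are unnamed
names = ["r0", "r1", "r2", "r3", "r4", "", "", "", ""]
data = {"~r[0-3]": 10, "r4": 20, 8: 50}

# start every node at the default, then overwrite the selected nodes
values = [0] * len(names)
for query, value in data.items():
    for idx in resolve_query(names, query):
        values[idx] = value
```

Under these assumed rules the result is 10 for the first four nodes, 20 for "r4", 50 for node 8, and the default of 0 elsewhere, matching the Series shown above.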
The inherit arg provides another convenient way to assign data to Nodes in a tree based on their hierarchical relationships. For example, to assign a trait value that is inherited by all descendants of a particular node in a tree, you need only assign the value to one or more internal nodes while using the inherit=True argument. The inherited values are assigned to nodes in order from root to tips, so that you can enter values for nested clades using this argument.
# set data to feature "Y" for a clade using inherit=True
tree.set_node_data("Y", data={6: True}, inherit=True).get_node_data("Y")
0    True
1    True
2    True
3     NaN
4     NaN
5    True
6    True
7     NaN
8     NaN
dtype: object
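The root-to-tip order is what lets a nested clade override the value inherited from an enclosing clade. A plain-Python sketch of that idea follows. The assign_with_inherit function and the small parent-to-children mapping are hypothetical and written only for illustration; they do not reproduce toytree's implementation.

```python
def assign_with_inherit(children, root, data):
    """Assign values from the root toward the tips, passing each value to descendants."""
    values = {}
    stack = [(root, None)]
    while stack:
        node, inherited = stack.pop()
        # a value entered for this node overrides whatever it would inherit
        value = data.get(node, inherited)
        if value is not None:
            values[node] = value
        for child in children.get(node, []):
            stack.append((child, value))
    return values


# illustrative topology: parent idx -> child idxs
children = {8: [6, 7], 6: [5, 2], 5: [0, 1], 7: [3, 4]}

outer = assign_with_inherit(children, 8, {6: "red"})
nested = assign_with_inherit(children, 8, {6: "red", 5: "blue"})
```

In the second call node 5 and its two tips receive "blue", while node 6 and its other descendant keep "red". Nodes outside the selected clade receive no value at all.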
Setting data by array¶
Alternatively, you can set data on all Nodes in a tree by entering the values as a sequence (e.g., list, ndarray, Series) in Node idx order. Note that this requires you to have already properly ordered your input data and to be aware of the Node idx order of your current tree. Thus, this method is more error-prone than assigning data by dictionary. Nevertheless, the option is available. Here we use it to assign random integer values to all Nodes by using the numpy.random library to generate an array of random ints.
# set data using an array of random int values
data = np.random.randint(10, 20, size=tree.nnodes)
tree.set_node_data(feature="Y", data=data).get_node_data("Y")
0    14
1    11
2    19
3    10
4    16
5    11
6    17
7    14
8    15
dtype: int64
Get Node data¶
The get_node_data function is used to retrieve feature data that has been assigned to Nodes in a tree, and to return them in the correct idx order for plotting. Data can be returned for a single feature as a pandas Series, or for multiple features as a pandas DataFrame. When a feature has not been assigned to all Nodes in a tree, a default value of np.nan will be returned for missing values, but this can be changed to any arbitrary alternative value by entering an argument for the option missing.
Get a single feature¶
By entering the name of a feature in the tree a pandas Series will be returned with all of the Node values for that feature. Here the Series index contains Node idx labels representing the Nodes in an idxorder traversal of the tree.
# return values for feature "dist"
tree.get_node_data(feature="dist")
0    0.333333
1    0.333333
2    0.666667
3    0.666667
4    0.666667
5    0.333333
6    0.333333
7    0.333333
8    0.000000
dtype: float64
# return values for feature 'Z' which has data for only 2 Nodes
tree.get_node_data("Z")
0      A
1      B
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
dtype: object
# return values for feature 'Z' with an imputed missing value
tree.get_node_data("Z", missing="C")
0    A
1    B
2    C
3    C
4    C
5    C
6    C
7    C
8    C
dtype: object
The pandas Series object is convenient to work with since it is accepted by many other toytree functions as input, and can be easily converted to other object types, as demonstrated below.
# convert a single trait Series to a dict
tree.get_node_data("Z", missing="C").to_dict()
{0: 'A', 1: 'B', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'C', 7: 'C', 8: 'C'}
# convert a single trait Series to a numpy ndarray
tree.get_node_data("Z", missing="C").values
array(['A', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], dtype=object)
Get multiple features¶
By default the get_node_data function returns a DataFrame with data for all features in a tree. In addition to the option to subselect a single feature from the tree, as shown above, you can also select a subset of features to return a DataFrame containing just those features. Finally, an important use of this function is dealing with missing data. The missing argument can accept either a single value to assign to all missing values in the DataFrame, or a list of values to apply separately to each column.
# return Node values for all features
tree.get_node_data()
| | idx | name | height | dist | support | X | Z |
|---|---|---|---|---|---|---|---|
| 0 | 0 | r0 | 0.000000 | 0.333333 | NaN | 10 | A |
| 1 | 1 | r1 | 0.000000 | 0.333333 | NaN | 10 | B |
| 2 | 2 | r2 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 3 | 3 | r3 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 4 | 4 | r4 | 0.000000 | 0.666667 | NaN | 10 | NaN |
| 5 | 5 | | 0.333333 | 0.333333 | NaN | 10 | NaN |
| 6 | 6 | | 0.666667 | 0.333333 | NaN | 10 | NaN |
| 7 | 7 | | 0.666667 | 0.333333 | NaN | 10 | NaN |
| 8 | 8 | | 1.000000 | 0.000000 | NaN | 10 | NaN |
# return values for two features, with different imputed missing values
tree.get_node_data(["support", "Z"], missing=[100, "C"])
| | support | Z |
|---|---|---|
| 0 | 100 | A |
| 1 | 100 | B |
| 2 | 100 | C |
| 3 | 100 | C |
| 4 | 100 | C |
| 5 | 100 | C |
| 6 | 100 | C |
| 7 | 100 | C |
| 8 | 100 | C |
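The two accepted forms of the missing argument, one shared value or one value per column, can be sketched in plain Python. The fill_missing function is hypothetical, and it uses None to mark missing entries purely for simplicity; it is not how toytree or pandas represent or fill missing data.

```python
def fill_missing(columns, missing):
    """Fill None entries using one shared value or one value per column."""
    names = list(columns)
    if isinstance(missing, (list, tuple)):
        if len(missing) != len(names):
            raise ValueError("need one missing value per column")
        per_column = dict(zip(names, missing))
    else:
        per_column = {name: missing for name in names}
    return {
        name: [per_column[name] if value is None else value for value in values]
        for name, values in columns.items()
    }


columns = {"support": [None, None, 90], "Z": ["A", None, None]}

# one value applied to every column
shared = fill_missing(columns, 0)

# a separate value for each column, in column order
separate = fill_missing(columns, [100, "C"])
```

Per-column values are useful when the columns hold different data types, as with the numeric "support" and string "Z" features above.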
Using features¶
One of the primary uses for assigning data to nodes on a tree is to visualize the data. Many arguments to the tree drawing functions accept values to designate the size, color, width, etc. of nodes or edges. These can be entered in three main ways: (1) by extracting the data as a Series using get_node_data; (2) by entering the feature name directly as an argument; and (3) by using range or color mapping. The latter two cases simply provide a shorthand syntax for plotting a feature that uses get_node_data under the hood. Examples are shown below for two features, "C" and "D", representing a discrete and a continuous data set, respectively.
# set a color name as 'red' or 'blue' to all nodes for feature "C"
tree.set_node_data("C", {6: "red"}, inherit=True, default="blue", inplace=True).get_node_data("C")
0     red
1     red
2     red
3    blue
4    blue
5     red
6     red
7    blue
8    blue
dtype: object
# set random float values in (0-1) to all nodes for feature "D"
tree.set_node_data("D", np.random.random(tree.nnodes), inplace=True).get_node_data("D")
0    0.773484
1    0.840667
2    0.503796
3    0.736419
4    0.374351
5    0.737745
6    0.537203
7    0.387983
8    0.136558
dtype: float64
(1) The first method for extracting data from a tree to use for plotting makes use of the get_node_data function call. Here we call the function from the same tree object that is being plotted, and select the feature "C" of discrete data values. This returns a Series object with the values in the correct order (idxorder) for plotting on the nodes, which is then passed as the node_colors argument.
# draw with node colors entered from the "C" discrete data feature
tree.draw(node_sizes=15, node_mask=False, node_colors=tree.get_node_data("C"));
(2) The second method for extracting data from a tree uses a simpler syntax, entering only the feature name as a string to the node_colors argument. Here, the draw function will identify that "C" is a valid feature of this tree object, and it will extract the "C" feature data from the tree. Compared to the syntax above, this looks cleaner, but has the downside that you cannot enter additional options to fill a value for missing data.
# draw with node colors automatically extracted from the "C" feature
tree.draw(node_sizes=15, node_mask=False, node_colors="C");
(3) The third method uses toytree's "tuple syntax" that is used for range and color mapping (See range-mapping and color-mapping). For colors, you can enter (feature name, colormap, min-value, max-value, nan-value), to map any feature to any range of colors in a colormap. Only the first argument is required, with additional args providing style options, as shown below.
# draw with node colors extracted and colormapped from the "C" feature
tree.draw(node_sizes=15, node_mask=False, node_colors=("C",));
# draw with node colors extracted and colormapped to BlueRed from "C"
tree.draw(node_sizes=15, node_mask=False, node_colors=("C", "BlueRed"));
# draw with node colors extracted and colormapped to BlueRed from "D"
tree.draw(node_sizes=15, node_mask=False, node_colors=("D", "BlueRed"));
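The idea behind mapping a continuous feature onto a color range, using a minimum, a maximum, and a fallback for NaN, can be sketched in plain Python. The map_to_colors function, its parameter names, and the two-color blue-to-red ramp are assumptions made for illustration. Toytree relies on real colormaps for this, so the sketch shows only the general principle for numeric values and does not cover discrete features.

```python
import math


def map_to_colors(values, low=(0, 0, 255), high=(255, 0, 0),
                  vmin=None, vmax=None, nan_color=(128, 128, 128)):
    """Linearly interpolate RGB colors between `low` and `high` for each value."""
    finite = [v for v in values if not math.isnan(v)]
    vmin = min(finite) if vmin is None else vmin
    vmax = max(finite) if vmax is None else vmax
    span = (vmax - vmin) or 1.0  # avoid dividing by zero for constant data
    colors = []
    for v in values:
        if math.isnan(v):
            colors.append(nan_color)
            continue
        # position of the value within [vmin, vmax], clipped to [0, 1]
        t = min(max((v - vmin) / span, 0.0), 1.0)
        colors.append(tuple(round(a + t * (b - a)) for a, b in zip(low, high)))
    return colors


colors = map_to_colors([0.0, 0.25, 1.0, math.nan])
clipped = map_to_colors([0.0, 5.0], vmin=0.0, vmax=1.0)
```

The smallest value maps to the low color, the largest to the high color, NaN to the fallback color, and values outside an explicit range are clipped to the nearest end.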
Node vs Edge features¶
Some data stored to a tree are intended to represent information about the edges (splits) in a tree, rather than information about the nodes. This is important as these types of data must be treated differently when doing things like re-rooting a tree, and in some cases, for visualization. (See the rooting tutorial for an example of how this is automatically handled in toytree.) Any feature can be optionally plotted as a marker and/or label on edges of a tree rather than on nodes. This can be done in a simple way within the .draw function by using the argument node_as_edge_data=True, or it can be done with many more options by using functions in the toytree.annotate subpackage.
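Why edge data needs special treatment when a tree is re-rooted can be sketched in plain Python. The node names and the child-to-parent mappings below are hypothetical. The sketch assumes that an edge value is stored on the child node of that edge, and it does not reproduce toytree's rooting code.

```python
# hypothetical child -> parent relationships; the node named "root" has no parent
parent = {"a": "x", "b": "x", "x": "root", "c": "root"}

# a support value stored on node "x" describes the edge between "x" and "root"
support_on_child = {"x": 95}

# key each value by the unordered pair of nodes that the edge connects
edge_support = {
    frozenset((child, parent[child])): value
    for child, value in support_on_child.items()
}

# after re-rooting on "x" the same edge points the other way: "root" is now the child
rerooted_parent = {"a": "x", "b": "x", "root": "x", "c": "root"}

# re-attach each edge value to whichever node is now the child of that edge
support_after = {
    child: edge_support[frozenset((child, par))]
    for child, par in rerooted_parent.items()
    if frozenset((child, par)) in edge_support
}
```

The value that started on node "x" ends up on node "root", because it belongs to the edge between them rather than to either node. A plain node feature, by contrast, would stay on "x".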
Examples of plotted edge features are shown below. These have a few key features in visualization: (1) values are plotted on the midpoint of edges; (2) no value is shown for the root edge, since it does not represent a true split in the tree; and (3) only one of the two edges descended from the root shows a value, since these are actually the same edge, on which the root node has been placed. As an example of this last point, a value such as a support score, or edge length, is a feature of this entire edge. Thus, the value is the same whether the tree is rooted or unrooted, as shown below.
# draw a feature as EDGE data
tree.draw(
    node_mask=False,
    node_labels="idx",
    node_labels_style={'font-size': 18},
    node_as_edge_data=True,
);
# draw a feature as EDGE data for the same tree, unrooted.
tree.unroot().draw(
    node_mask=False,
    node_labels="idx",
    node_labels_style={'font-size': 18},
    node_as_edge_data=True,
);
Annotation methods can also be used to plot edge data. See the annotation docs.
# annotate a tree with EDGE data
canvas, axes, mark = tree.draw();
tree.annotate.add_edge_labels(axes=axes, labels="idx", font_size=18, mask=False);