Node query/selection¶
Many methods in toytree require selecting one or more nodes from a tree to operate on. This can be challenging because most nodes in a tree do not have unique names assigned to them, and selecting nodes by numeric index is error-prone if the indices change. We have designed the node query and selection methods in toytree to be flexible and easy to use while also guarding against simple, common mistakes.
import toytree
# load a toytree from a newick string at a URL and root it
tree = toytree.tree("https://eaton-lab.org/data/Cyathophora.tre").root("~prz")
Select Nodes by index (idx)¶
The simplest and fastest way to get Node objects from a ToyTree is to select them by their idx label. In fact, storing Nodes in a cached traversal order for fast recall is one of the main advantages of the ToyTree class. The tip nodes are labeled from left to right (or bottom to top, depending on the tree orientation) with idx labels from 0 to ntips - 1, and the root node has idx label nnodes - 1.
# draw tree showing the idx labels representing the cached idxorder traversal
tree.draw('s');
Nodes can be selected from a ToyTree by indexing or slicing by idx label.
# select a single node by idx
tree[1]
<Node(idx=1, name='33588_przewalskii')>
# select a slice of nodes by idx
tree[3:5]
[<Node(idx=3, name='30556_thamno')>, <Node(idx=4, name='40578_rex')>]
# select a list of nodes by idx
tree[[3, 4, 8, 9]]
[<Node(idx=3, name='30556_thamno')>, <Node(idx=4, name='40578_rex')>, <Node(idx=8, name='38362_rex')>, <Node(idx=9, name='29154_superba')>]
# select all tip (leaf) nodes by slicing
tree[:tree.ntips]
[<Node(idx=0, name='32082_przewalskii')>, <Node(idx=1, name='33588_przewalskii')>, <Node(idx=2, name='33413_thamno')>, <Node(idx=3, name='30556_thamno')>, <Node(idx=4, name='40578_rex')>, <Node(idx=5, name='35855_rex')>, <Node(idx=6, name='35236_rex')>, <Node(idx=7, name='39618_rex')>, <Node(idx=8, name='38362_rex')>, <Node(idx=9, name='29154_superba')>, <Node(idx=10, name='30686_cyathophylla')>, <Node(idx=11, name='41954_cyathophylloides')>, <Node(idx=12, name='41478_cyathophylloides')>]
# select all internal nodes by slicing
tree[tree.ntips:tree.nnodes]
[<Node(idx=13)>, <Node(idx=14)>, <Node(idx=15)>, <Node(idx=16)>, <Node(idx=17)>, <Node(idx=18)>, <Node(idx=19)>, <Node(idx=20)>, <Node(idx=21)>, <Node(idx=22)>, <Node(idx=23)>, <Node(idx=24, name='root')>]
# select the root node
tree[-1]
<Node(idx=24, name='root')>
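The labeling convention described above (tips first, root last) can be illustrated with a small pure-Python sketch. This is not toytree's implementation: `MiniNode`, `postorder`, and `assign_idx` are hypothetical names, and ordering the internal nodes by a postorder traversal is an assumption made only so that the root ends up last.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class MiniNode:
    """Hypothetical stand-in for a tree node (not a toytree class)."""
    name: str = ""
    children: List["MiniNode"] = field(default_factory=list)
    idx: int = -1


def postorder(node: MiniNode) -> Iterator[MiniNode]:
    """Yield children before their parent, left to right."""
    for child in node.children:
        yield from postorder(child)
    yield node


def assign_idx(root: MiniNode) -> List[MiniNode]:
    """Label tips 0..ntips-1, then internal nodes, so the root is last."""
    nodes = list(postorder(root))
    tips = [n for n in nodes if not n.children]
    internals = [n for n in nodes if n.children]
    cache = tips + internals
    for i, node in enumerate(cache):
        node.idx = i
    return cache


# build the tree ((a,b),c) and cache its nodes in idx order
a, b, c = MiniNode("a"), MiniNode("b"), MiniNode("c")
ab = MiniNode(children=[a, b])
root = MiniNode("root", children=[ab, c])
cache = assign_idx(root)
ntips = 3

tip_names = [n.name for n in cache[:ntips]]   # slice of all tips
last = cache[-1]                              # the root node
```

Because the nodes sit in a plain list ordered by idx, indexing, slicing, and negative indices all behave as ordinary list operations, which is why `cache[:ntips]` returns the tips and `cache[-1]` returns the root.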
Select Nodes by name¶
To select nodes by name, use the get_nodes function. This is most useful for selecting tip nodes, since these are often the only nodes that have unique names, whereas internal nodes usually have empty name attributes. Internal nodes can be queried by their idx labels or, as demonstrated below, selected based on tip names with the function get_mrca_node.
# select one node by name
tree.get_nodes("40578_rex")
[<Node(idx=4, name='40578_rex')>]
# select multiple nodes by name
tree.get_nodes("40578_rex", "38362_rex")
[<Node(idx=4, name='40578_rex')>, <Node(idx=8, name='38362_rex')>]
Using regular expressions ~¶
A regular expression is a sequence of characters that defines a search pattern. Many functions in toytree optionally accept regular expressions as input to make it easy to select multiple nodes. This is useful either because the operation acts on each matched node individually (e.g., toytree.mod.drop_tips) or because it finds the most recent common ancestor of the matched nodes and acts on that edge or subtree (e.g., toytree.mod.root or toytree.mod.extract_subtree; see below).
All of these functions that accept name strings as input use the get_nodes function under the hood to find the matched nodes, so the demonstrations below use this function. In addition to one or more individual name strings, it also accepts regular expressions.
To indicate that an entry should be treated as a regular expression, use the ~ prefix. The rest of the string is then passed to the Python standard library function re.search() to find any nodes whose names match.
# match any node name containing 'prz'
tree.get_nodes("~prz")
[<Node(idx=1, name='33588_przewalskii')>, <Node(idx=0, name='32082_przewalskii')>]
# match any node name containing 855
tree.get_nodes("~855")
[<Node(idx=5, name='35855_rex')>]
# match any node name starting with a 4
tree.get_nodes("~^4")
[<Node(idx=11, name='41954_cyathophylloides')>, <Node(idx=4, name='40578_rex')>, <Node(idx=12, name='41478_cyathophylloides')>]
# match any node name ending with an 'a'
tree.get_nodes("~a$")
[<Node(idx=9, name='29154_superba')>, <Node(idx=10, name='30686_cyathophylla')>]
# match name containing a 3 followed by 8 or 9 then any chars followed by 'rex'
# (no comma inside the brackets: '[8,9]' would also match a literal ',')
tree.get_nodes("~3[89].+rex")
[<Node(idx=7, name='39618_rex')>, <Node(idx=8, name='38362_rex')>]
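The name matching shown above can be approximated with the standard library alone. The sketch below assumes only what the text states (a `~` prefix marks a regular expression that is applied with `re.search()`); the helper `match_names` is a hypothetical name, and the list of tip names is copied from the example tree.

```python
import re

# tip names of the example tree, in idx order
names = [
    "32082_przewalskii", "33588_przewalskii", "33413_thamno",
    "30556_thamno", "40578_rex", "35855_rex", "35236_rex",
    "39618_rex", "38362_rex", "29154_superba",
    "30686_cyathophylla", "41954_cyathophylloides",
    "41478_cyathophylloides",
]


def match_names(query, names):
    """Return names matching a '~' regex query, else exact matches only."""
    if query.startswith("~"):
        pattern = re.compile(query[1:])
        # re.search matches anywhere in the name unless anchored with ^ or $
        return [name for name in names if pattern.search(name)]
    return [name for name in names if name == query]


contains_prz = match_names("~prz", names)
starts_with_4 = match_names("~^4", names)
ends_with_a = match_names("~a$", names)
exact_only = match_names("rex", names)   # no '~': must equal a full name
```

Since `re.search()` scans the whole string, a bare pattern such as `prz` behaves like a "contains" test, and anchors are needed only to pin a match to the start or end of a name. Without the `~` prefix the query must equal a full name, so `"rex"` on its own matches nothing here.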
Node queries¶
We define a query as a flexible type of input used to match one or more nodes. For functions that accept a query as input, an int is treated as a Node idx label, a str is treated as a Node name, and a str starting with ~ is treated as a regular expression. These functions also accept a Node object as input, and you can mix these argument types to select multiple nodes.
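The dispatch rules above can be sketched as a small resolver. This is an illustration under stated assumptions, not toytree's code: `MiniNode` and `resolve` are hypothetical names, and returning the matches sorted by idx is a choice made for this sketch (the order returned by toytree may differ).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniNode:
    """Hypothetical stand-in for a Node with an idx label and a name."""
    idx: int
    name: str = ""


def resolve(nodes, *queries):
    """Resolve mixed queries (Node, int, str, '~regex') to unique nodes."""
    found = {}
    for query in queries:
        if isinstance(query, MiniNode):
            matches = [query]                       # Node: used as-is
        elif isinstance(query, int):
            matches = [nodes[query]]                # int: idx label
        elif isinstance(query, str) and query.startswith("~"):
            pattern = re.compile(query[1:])         # '~...': regex on names
            matches = [n for n in nodes if pattern.search(n.name)]
        elif isinstance(query, str):
            matches = [n for n in nodes if n.name == query]  # exact name
        else:
            raise TypeError(f"unsupported query type: {type(query).__name__}")
        if not matches:
            raise ValueError(f"no node matched query: {query!r}")
        for node in matches:
            found[node.idx] = node                  # de-duplicate by idx
    return [found[i] for i in sorted(found)]


nodes = [
    MiniNode(0, "32082_przewalskii"),
    MiniNode(1, "33588_przewalskii"),
    MiniNode(2, "40578_rex"),
    MiniNode(3, "38362_rex"),
    MiniNode(4),  # unnamed internal node
]

# mix an int, an exact name, a regex, and a node object in one call
picked = resolve(nodes, 0, "40578_rex", "~prz", nodes[3])
```

Node 0 is matched twice here (by its idx and by the regex) but appears once in the result, since matches are collected by idx before being returned.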
get_nodes()¶
The function get_nodes is used widely, both by users and internally by other functions. It takes *query as input, meaning that it accepts any number of queries.
# mix query types in one call: int idx labels, a str name, and a Node object
tree.get_nodes(0, 1, '40578_rex', tree[8])
[<Node(idx=4, name='40578_rex')>, <Node(idx=1, name='33588_przewalskii')>, <Node(idx=8, name='38362_rex')>, <Node(idx=0, name='32082_przewalskii')>]
get_mrca_node()¶
Many tree operations require selecting an internal node to operate on, for example when rooting a tree on a clade. This is most easily done by selecting two tip nodes by name whose most recent common ancestor (mrca) is the target internal node, and passing them to the get_mrca_node function. This function accepts query arguments in the same way as get_nodes: int, str, or ~regex entries.
Consider the example below, where we wish to find the internal node that is the mrca of the five tip nodes forming the "rex" clade in the example tree. We can select this node in several ways:
# if you already know its idx (e.g., by tree visualization) you can index it
tree[17]
<Node(idx=17)>
# or, you can find the mrca by knowing the tip node idxs
tree.get_mrca_node(4, 5, 6, 7, 8)
<Node(idx=17)>
# you actually only need to provide the minimal spanning nodes
tree.get_mrca_node(4, 8)
<Node(idx=17)>
# safer, however, is to enter node names, since these never change
tree.get_mrca_node("35855_rex", "40578_rex", "39618_rex", "35236_rex", "38362_rex")
<Node(idx=17)>
# again, you only need to enter the minimal required
tree.get_mrca_node("35855_rex", "38362_rex")
<Node(idx=17)>
# simpler, use a regular expression to match all names with 'rex'
tree.get_mrca_node("~rex")
<Node(idx=17)>
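Why only the minimal spanning nodes are needed becomes clearer from a sketch of how an mrca can be computed. The version below is a generic approach (intersecting root-ward lineages) on a toy tree stored as a child-to-parent dictionary; it is not taken from toytree, and the names `ancestors`, `mrca`, and the node labels are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from a node up to the root, the node itself first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(parent, *nodes):
    """Return the deepest node that lies on every query node's lineage."""
    common = ancestors(nodes[0], parent)
    for node in nodes[1:]:
        lineage = set(ancestors(node, parent))
        # keep root-ward order while dropping nodes not shared by this lineage
        common = [n for n in common if n in lineage]
    return common[0]


# toy tree ((r1,(r2,r3)x)y,o)root stored as child -> parent
parent = {
    "r2": "x", "r3": "x",
    "r1": "y", "x": "y",
    "y": "root", "o": "root",
}

all_three = mrca(parent, "r1", "r2", "r3")
spanning_pair = mrca(parent, "r1", "r3")
```

Adding `r2` to the query does not change the answer because its lineage already passes through the ancestor shared by `r1` and `r3`; any two tips that span the clade are enough to identify it.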
Efficiency/speed¶
Because matching nodes by name requires traversing all nodes in the tree to find matches, it is much slower than selecting nodes by indexing with idx labels. All of the methods are still fast (tens of microseconds or less on this 13-tip tree), so the difference only matters when writing very time-intensive code. This is demonstrated below.
%%timeit
# time to select a tip node by its idx (superfast)
tree[7]
152 ns ± 1.8 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%%timeit
# time to select a tip node by its name
tree.get_nodes("39618_rex")
10.8 µs ± 36.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
# time to select an internal node (17) by its known index
tree[17]
156 ns ± 0.887 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%%timeit
# time to find mrca (17) by mrca of idx labels
tree.get_mrca_node(4, 8)
27.6 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
# time to find mrca (17) by regex match on names
tree.get_mrca_node("~rex")
63.4 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
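The gap between the two kinds of lookup can be reproduced outside a notebook with the standard `timeit` module. The sketch below uses a plain list of made-up names as a stand-in for the cached nodes, so it illustrates the constant-time versus linear-scan difference only and says nothing about toytree's actual timings; `by_idx` and `by_name` are hypothetical helpers.

```python
import timeit

# stand-in for a cached node list: 25 made-up names in idx order
names = [f"tip_{i}" for i in range(25)]


def by_idx(i):
    """Constant-time lookup: index directly into the cached list."""
    return names[i]


def by_name(name):
    """Linear-time lookup: scan every entry until the name is found."""
    for candidate in names:
        if candidate == name:
            return candidate
    raise KeyError(name)


# total seconds for 100,000 calls of each lookup
t_idx = timeit.timeit(lambda: by_idx(17), number=100_000)
t_name = timeit.timeit(lambda: by_name("tip_17"), number=100_000)
print(f"by idx:  {t_idx:.4f} s")
print(f"by name: {t_name:.4f} s")
```

Absolute numbers depend on the machine, and the cost of the scan grows with the number of nodes, which is why the difference matters mostly for large trees or for lookups repeated many times.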
Best practices¶
There are many situations in which you know the tree structure will not change, and indexing by node idx is then faster and preferable to slower name selection, especially when selecting the tip or root nodes, which have obvious numeric labels. In other cases it is preferable to select nodes by name, such as when adding traits or labels to internal nodes for tree drawings, since this makes your code more readable and explicit.
Node Queries are everywhere¶
You will find that many functions in toytree accept query inputs that match nodes following the methods described above. These are especially common in the toytree.mod subpackage.
# root the tree on the 'prz' clade
tree.mod.root("~prz").draw();
# drop all 'rex' samples ('~rex*' would match 're' followed by zero or more 'x')
tree.mod.drop_tips("~rex").draw();
# keep only the subtree connecting 'rex' samples
tree.mod.prune("~rex").draw();