Simulating Discrete Traits (Markov Model)¶
This notebook demonstrates toytree.pcm.simulate_discrete_data() for
single and multiple replicates, including trait naming and plotting the
result on a tree.
Continuous-time Markov models¶
simulate_discrete_data() uses a continuous-time Markov chain (CTMC)
to model transitions among a fixed number of discrete states along
each branch. The model is defined by an instantaneous rate matrix
Q where each off-diagonal entry q_ij is the rate of change from
state i to state j. Diagonal entries are set so each row sums
to zero, and transition probabilities over time t are computed as
P(t) = expm(Q * t).
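To make this concrete, here is a small illustrative sketch (plain numpy/scipy, not toytree API) that builds a two-state ER rate matrix and computes its transition probabilities over a branch of length t:
import numpy as np
from scipy.linalg import expm
rate = 1.0                       # shared off-diagonal rate (ER)
Q = np.array([[-rate, rate],
              [rate, -rate]])    # diagonal set so each row sums to zero
t = 0.5                          # branch length (time)
P = expm(Q * t)                  # P[i, j] = probability of changing from i to j over time t
P                                # each row of P sums to one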
Supported model types differ in how Q is parameterized:
- ER (equal rates): all off-diagonal rates are the same value. This is the simplest model and assumes every state changes to every other state at the same rate.
- SYM (symmetric rates): rates are symmetric (q_ij = q_ji), but different state pairs can have different rates. This allows variation among pairs while keeping reversibility.
- ARD (all rates different): every off-diagonal rate can be different (q_ij and q_ji may differ). This is the most flexible model, allowing directional biases between states.
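The sketch below (illustrative only, not toytree API) writes out example rate matrices for a 3-state trait under each parameterization; the numeric rate values are arbitrary:
import numpy as np
def as_q_matrix(off_diagonal):
    # set diagonal entries so each row of the rate matrix sums to zero
    q = np.array(off_diagonal, dtype=float)
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q
# ER: one rate shared by every off-diagonal entry
Q_er = as_q_matrix(np.full((3, 3), 1.0))
# SYM: q_ij == q_ji, but different state pairs can have different rates
Q_sym = as_q_matrix([[0, 1.0, 2.0], [1.0, 0, 0.5], [2.0, 0.5, 0]])
# ARD: every off-diagonal rate can differ (q_ij and q_ji may differ)
Q_ard = as_q_matrix([[0, 3.0, 0.2], [0.5, 0, 1.0], [0.1, 2.0, 0]])
Q_er, Q_sym, Q_ard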
In all cases, you can provide relative_rates and state_frequencies
explicitly, or let toytree sample valid parameters for the chosen
model. The rate_scalar multiplies the relative rates to set the
overall tempo of change.
Full vs tips-only datasets¶
Empirical datasets typically include observations only at the tips
(extant taxa), which corresponds to tips_only=True. Simulations
can also return full datasets that include internal nodes, which
is useful for validating methods.
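The plotting examples in the next section assume a tree (tree_full) plus two simulated datasets on it: a full dataset with values for every node (full_series) and a tips-only dataset (tips_series). A minimal setup sketch, with an arbitrary tree size, seed, and number of states:
import toytree
import toyplot
# tree used for the node-coloring plots below
tree_full = toytree.rtree.unittree(10, seed=42)
# full dataset: trait values simulated for internal nodes and tips
full_series = toytree.pcm.simulate_discrete_data(
    tree=tree_full, nstates=3, model='ER', nreplicates=1,
    trait_name='state', tips_only=False,
)
# tips-only dataset: trait values for the tips only
tips_series = toytree.pcm.simulate_discrete_data(
    tree=tree_full, nstates=3, model='ER', nreplicates=1,
    trait_name='state', tips_only=True,
)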
Plot trait values on nodes¶
Below we color nodes by a discrete trait using a categorical colormap. For tips-only data, internal nodes are left in a neutral color.
Full dataset (all nodes)
# Plot full dataset (all nodes)
cm = toyplot.color.brewer.map('Set2')
vals = full_series.reindex(range(tree_full.nnodes)).astype(int).to_numpy()
node_colors = [cm.colors(v, 0, 2) for v in vals]
c1, a1, m1 = tree_full.draw(layout='r', node_colors=node_colors, node_sizes=14, tip_labels=True, scale_bar=True)
c1
Tips-only dataset (internal nodes neutral)
# Plot tips-only dataset (internal nodes neutral)
cm = toyplot.color.brewer.map('Set2')
neutral = '#c7c7c7'
tip_vals = tips_series.to_dict()
colors2 = []
for node in tree_full:
    if node.is_leaf():
        colors2.append(cm.colors(int(tip_vals[node.name]), 0, 2))
    else:
        colors2.append(neutral)
c2, a2, m2 = tree_full.draw(layout='r', node_colors=colors2, node_sizes=14, tip_labels=True, scale_bar=True)
c2
import toytree
tree = toytree.rtree.unittree(12, seed=123)
Single trait (returns Series)¶
When nreplicates is 1 (or less), the function returns a pd.Series.
traits = toytree.pcm.simulate_discrete_data(
    tree=tree,
    nstates=3,
    model='ER',
    nreplicates=1,
    trait_name='state',
    tips_only=True,
)
traits
Plot the trait values by coloring tip labels.
color_map = {0: '#4c78a8', 1: '#f58518', 2: '#54a24b'}
tip_colors = [color_map[traits[name]] for name in tree.get_tip_labels()]
c, a, m = tree.draw(layout='r', tip_labels=True, tip_labels_colors=tip_colors, scale_bar=True)
c
Inplace storage¶
If inplace=True, the simulated trait(s) are stored on the tree
as node features instead of being returned.
tree2 = toytree.rtree.unittree(8, seed=7)
toytree.pcm.simulate_discrete_data(
    tree=tree2,
    nstates=2,
    model='ER',
    nreplicates=1,
    trait_name='state',
    tips_only=True,
    inplace=True,
)
# Access stored feature values
tree2.get_node_data('state').head()
Model parameters and their effects¶
The parameters below control the Markov model and simulation:
- relative_rates: Off-diagonal entries of Q. Larger values increase the transition frequency between specific state pairs.
- state_frequencies: Stationary frequencies (equilibrium proportions) that influence long-run state occupancy and, for reversible models, how rates are scaled among states.
- root_state: Fixes the starting state at the root. This biases the entire simulation toward descendants of that root state.
- rate_scalar: Multiplies the relative rates, controlling the overall tempo of evolution (higher values -> more changes).
import numpy as np
tree_param = toytree.rtree.unittree(10, seed=999)
# 1) relative_rates: favor transitions 0->1 over 1->0
rates = np.array([[0, 3.0], [0.5, 0]])
traits_rates = toytree.pcm.simulate_discrete_data(
    tree=tree_param,
    nstates=2,
    model='ARD',
    relative_rates=rates,
    nreplicates=1,
    trait_name='rate_bias',
    tips_only=True,
)
# 2) state_frequencies: favor state 0 at equilibrium
freqs = np.array([0.8, 0.2])
traits_freq = toytree.pcm.simulate_discrete_data(
    tree=tree_param,
    nstates=2,
    model='ER',
    state_frequencies=freqs,
    nreplicates=1,
    trait_name='freq_bias',
    tips_only=True,
)
# 3) root_state: fix the root to state 1
traits_root = toytree.pcm.simulate_discrete_data(
    tree=tree_param,
    nstates=2,
    model='ER',
    root_state=1,
    nreplicates=1,
    trait_name='root_fixed',
    tips_only=True,
)
# 4) rate_scalar: compare slow vs fast rates
traits_slow = toytree.pcm.simulate_discrete_data(
    tree=tree_param,
    nstates=2,
    model='ER',
    rate_scalar=0.2,
    nreplicates=1,
    trait_name='slow',
    tips_only=True,
)
traits_fast = toytree.pcm.simulate_discrete_data(
    tree=tree_param,
    nstates=2,
    model='ER',
    rate_scalar=3.0,
    nreplicates=1,
    trait_name='fast',
    tips_only=True,
)
traits_rates.head(), traits_freq.head(), traits_root.head(), traits_slow.head(), traits_fast.head()
Visual comparison¶
Below we color tips for each scenario to visualize the effect of parameter changes.
cm2 = toyplot.color.brewer.map('Set2')
def plot_tip_trait(trait_series, title):
    colors = [cm2.colors(int(trait_series[name]), 0, 1) for name in tree_param.get_tip_labels()]
    c, a, m = tree_param.draw(layout='r', tip_labels=True, tip_labels_colors=colors, scale_bar=True)
    a.label.text = title
    return c
c_rate = plot_tip_trait(traits_rates, 'relative_rates bias')
c_rate
c_freq = plot_tip_trait(traits_freq, 'state_frequencies bias')
c_freq
c_root = plot_tip_trait(traits_root, 'root_state fixed')
c_root
c_slow = plot_tip_trait(traits_slow, 'rate_scalar slow')
c_slow
c_fast = plot_tip_trait(traits_fast, 'rate_scalar fast')
c_fast
Multiple replicates (returns DataFrame)¶
When nreplicates is greater than 1, the function returns a pd.DataFrame
with one column per replicate.
traits_multi = toytree.pcm.simulate_discrete_data(
    tree=tree,
    nstates=2,
    model='SYM',
    nreplicates=3,
    trait_name='trait',
    tips_only=True,
)
traits_multi.head()
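As a small follow-up sketch, you can summarize each replicate; the loop below iterates the DataFrame columns without assuming their exact names:
# count how many tips are in each state, per replicate column
for col in traits_multi.columns:
    print(col, traits_multi[col].value_counts().to_dict())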