Fit discrete traits with an Mk model (ML)¶
This example fits a discrete-state continuous-time Markov model (Mk) on a phylogeny and estimates transition rates by maximum likelihood.
Rate models¶
"ER": equal rates (one rate shared by all allowed transitions)
"SYM": symmetric rates (q_ij = q_ji)
"ARD": all rates different (fully asymmetric)
(See the “sim discrete traits” page for more about CTMC rate matrices and simulation.)
(See the "infer ancestral states" page for computing marginal ancestral-state posteriors conditional on fitted model parameters.)
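The three models differ only in how many free off-diagonal parameters the instantaneous rate matrix Q has. A minimal numpy sketch of the three structures (illustrative rate values, not toytree output; in any valid Q matrix each row sums to zero):

```python
import numpy as np

# ER: one shared rate r for every allowed transition
r = 0.5
Q_er = np.full((3, 3), r)
np.fill_diagonal(Q_er, 0.0)
np.fill_diagonal(Q_er, -Q_er.sum(axis=1))  # set diagonal so rows sum to zero

# SYM: q_ij = q_ji, so one free rate per unordered state pair
q01, q02, q12 = 0.5, 0.2, 0.8
Q_sym = np.array([
    [0.0, q01, q02],
    [q01, 0.0, q12],
    [q02, q12, 0.0],
])
np.fill_diagonal(Q_sym, -Q_sym.sum(axis=1))

# ARD: every off-diagonal rate is free
Q_ard = np.array([
    [0.0, 0.4, 0.1],
    [0.9, 0.0, 0.3],
    [0.2, 0.7, 0.0],
])
np.fill_diagonal(Q_ard, -Q_ard.sum(axis=1))

print(Q_er.sum(axis=1))  # [0. 0. 0.]
```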
import toytree
import numpy as np
Simulate example data (tip state observations)¶
The trait "X" is simulated with 3 states evolving under equal transition rates (the "ER" model).
Input expectations (important)¶
Data can be a pandas.Series (a single trait) whose index is labeled by tip names or idx labels, or a str feature name if the data are stored on the nodes of a ToyTree. Missing values can be np.nan.
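For example, a valid single-trait input might look like the following (hypothetical tip names; np.nan marks an unobserved tip):

```python
import numpy as np
import pandas as pd

# trait observations keyed by tip name; np.nan = missing data
trait = pd.Series(
    {"r0": "A", "r1": "B", "r2": np.nan, "r3": "C"},
    name="X",
)
print(trait)
```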
# birth-death tree; exponential waiting time to speciation of ~1M time units
tree = toytree.rtree.bdtree(ntips=50, b=1e-6, d=1e-7, seed=12345)
# simulate discrete trait with ~3 expected transitions over the root-to-tip distance
tree.pcm.simulate_discrete_trait(
nstates=3,
model="ER",
trait_name="X",
state_names="ABC",
rate_scalar=3e-6,
inplace=True,
tips_only=True,
seed=123,
);
# view trait states
tree.get_node_data("X").head(10)
0    B
1    C
2    A
3    A
4    B
5    A
6    C
7    B
8    C
9    C
dtype: object
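A quick way to summarize tip states before fitting is a frequency table (standard pandas; shown here on a hypothetical series matching the first ten states above, since full counts depend on the seed):

```python
import pandas as pd

# hypothetical tip states; with a real tree use tree.get_node_data("X")
states = pd.Series(list("BCAABACBCC"))
freqs = states.value_counts(normalize=True).sort_index()
print(freqs)
```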
Visualize traits¶
# draw with tip state colors
tree.draw(
width=650, height=350,
node_sizes=8,
node_colors=("X", "Set2"),
node_mask=(1, 0, 0),
layout='d',
scale_bar=1e6,
);
m1 = tree.pcm.fit_discrete_ctmc("X", nstates=3, model="ER")
print(m1)
FitMarkovModelResult(
    model=ER, nstates=3, nparams=1,
    log_likelihood=-53.832,
    state_frequencies=[0.3333, 0.3333, 0.3333],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=None,
)
m2 = tree.pcm.fit_discrete_ctmc("X", nstates=3, model="SYM")
print(m2)
FitMarkovModelResult(
    model=SYM, nstates=3, nparams=5,
    log_likelihood=-53.5611,
    state_frequencies=[0.3468, 0.3676, 0.2856],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=None,
)
m3 = tree.pcm.fit_discrete_ctmc("X", nstates=3, model="ARD")
print(m3)
FitMarkovModelResult(
    model=ARD, nstates=3, nparams=8,
    log_likelihood=-53.5617,
    state_frequencies=[0.3386, 0.3461, 0.3153],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=None,
)
AIC model selection¶
You can select the best model using an LRT (likelihood-ratio test) or an AIC criterion. Below is an example using AIC.
toytree.pcm.aic_table([m1, m2, m3])
|   | model | loglik | k | AIC | dAIC | weight_AIC | evidence_ratio_vs_best_AIC | weight | evidence_ratio_vs_best | rank_by |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ER | -53.832002 | 1 | 109.664004 | 0.000000 | 0.975410 | 1.000000 | 0.975410 | 1.000000 | AIC |
| 1 | SYM | -53.561079 | 5 | 117.122158 | 7.458154 | 0.023424 | 41.640647 | 0.023424 | 41.640647 | AIC |
| 2 | ARD | -53.561652 | 8 | 123.123304 | 13.459299 | 0.001166 | 836.853990 | 0.001166 | 836.853990 | AIC |
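The table follows from AIC = 2k − 2·logL, with Akaike weights computed from the ΔAIC values. A short numpy sketch reproducing the numbers above from the fitted log-likelihoods:

```python
import numpy as np

loglik = np.array([-53.832002, -53.561079, -53.561652])  # ER, SYM, ARD
k = np.array([1, 5, 8])                                  # free parameters

aic = 2 * k - 2 * loglik
daic = aic - aic.min()
weights = np.exp(-daic / 2) / np.exp(-daic / 2).sum()

print(aic.round(3))      # [109.664 117.122 123.123]
print(weights.round(3))  # ER carries nearly all of the weight
```

Here the simpler ER model is preferred: the extra parameters of SYM and ARD barely improve the log-likelihood, so AIC penalizes them.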
Parameter effects¶
Below are short examples showing how key fitting parameters influence inference.
fixed_rates¶
# Example: constrain relative rates (ARD)
rates = np.array([
[0.0, 2.0, 0.5],
[1.0, 0.0, 0.2],
[0.8, 1.5, 0.0],
])
fit_ard_fixed_rates = tree.pcm.fit_discrete_ctmc(
data="X",
nstates=3,
model='ARD',
fixed_rates=rates,
)
fit_ard_fixed_rates
FitMarkovModelResult(
    model=ARD, nstates=3, nparams=2,
    log_likelihood=-53.5613,
    state_frequencies=[0.3217, 0.1403, 0.538],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=None,
)
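Fixing the relative rates pins the ratios among transitions while an overall rate is still estimated, which is why nparams drops. A sketch (not toytree internals) of how a relative-rate matrix combines with a rate scalar into a valid Q matrix:

```python
import numpy as np

rates = np.array([
    [0.0, 2.0, 0.5],
    [1.0, 0.0, 0.2],
    [0.8, 1.5, 0.0],
])
overall_rate = 0.3  # the kind of scalar the optimizer would fit

Q = overall_rate * rates
np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of a Q matrix sum to zero

# ratios among off-diagonal rates are unchanged by the scalar
print(Q[0, 1] / Q[1, 0])  # 2.0, same as rates[0, 1] / rates[1, 0]
```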
fixed_state_frequencies¶
# Example: constrain state frequencies
freqs = np.array([0.7, 0.2, 0.1])
fit_freqs = tree.pcm.fit_discrete_ctmc(
data="X",
nstates=3,
model='ER',
fixed_state_frequencies=freqs,
)
fit_freqs
FitMarkovModelResult(
    model=ER, nstates=3, nparams=1,
    log_likelihood=-53.832,
    state_frequencies=[0.3333, 0.3333, 0.3333],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=[0.7, 0.2, 0.1],
)
root_prior¶
# Example: set a root prior
root_prior = np.array([0.05, 0.05, 0.90])
fit_root = tree.pcm.fit_discrete_ctmc(
data="X",
nstates=3,
model='ER',
root_prior=root_prior,
)
fit_root
FitMarkovModelResult(
    model=ER, nstates=3, nparams=1,
    log_likelihood=-53.832,
    state_frequencies=[0.3333, 0.3333, 0.3333],
    relative_rates=<array 3x3>,
    qmatrix=<array 3x3>,
    fixed_rates=<array 3x3>,
    fixed_state_frequencies=None,
)
Fossil constraints: with vs without¶
Compare inference with and without internal-node observations.