Orthologs are homologous gene copies that diverged from a common ancestral gene during speciation events. After speciation, lineages evolve independently such that orthologous gene copies accumulate different mutations over time. This process can continue through many speciation events. As mutations arise in the periods between speciation they leave behind a record of the existence of the common ancestral sequence. In this way, the mutational process constructs a record that we can eventually use to reconstruct the order and timing of speciation events. By comparing the orthologous gene sequences of the descendant lineages alive today we can contruct a phylogeny as a snapshot of the past.
In this notebook we will not dive too deep into the algorithmic and methodological complexity of phylogenetic inference. In its most basic form, the goal of molecular phylogenetics is to count the number of character differences (e.g., substitutions) between multiple orthologous sequences, and to build a tree that minimizes the number of steps needed to explain those differences. We can do this using simple parsimony methods (the fewest steps is the best), or with methods that model DNA substitution as a probabilistic process (e.g., maximum likelihood).
The details of how various phylogenetic software tools differ, and what assumptions they make, is important to learn and relevant if you use phylogenetics for research. And we'll get to some of those details later. But for now, we'll focus on the applied end of things. How do I infer a phylogenetic tree in practice? We'll turn our attention back to the more theoretical and statistical details later.
There are three basic steps involved in inferring a phylogenetic tree. We'll complete a practical example from start to finish in this notebook. Then at the end you will try to replicate this in an organism of your choice.
import os
import subprocess
import toytree
import requests
import pandas as pd
To find individual genes or gene regions sampled from particular species the best place to look is the NCBI Genbank sequence archive. This database includes both Sanger sequenced DNA regions that were targeted by researchers for sequencing using primers, as well as genes or gene regions from sequenced and annotated genomes. In an earlier notebook we used API tools to access fasta data directly from the database. Here we'll use that code again, written as Python functions. Execute the cells to load these functions, and follow along by reading the code and comments in it. We'll mostly focus on the results that the code produces.
The esearch
API returns unique IDs for the search term that we use. Our query below is searching for a particular marker that is commonly used in plant phylogenetic studies (trnK
). This is a chloroplast gene that has an important function and so tends to be highly conserved, which makes it easy to sequence by designing primers to amplify it. However, it also contains an intron sequence nearby that can be sequenced along with it, and this region is more variable, and thus provides more phylogenetic information than the conserved part. It has historically been difficult to find genomic regions that are both easy to amplify and variable enough to work as good phylogenetic markers. The few regions that were first discovered to fit this category have been used extensively. And because of this inertia they have become what is called universal markers -- a few gene regions that have been sequenced across thousands of organisms by many research groups over several decades. These markers provide the best sampled phylogenetic data that we have, and through a community effort of collaboration and data sharing have made it possible to combine available data to build enormous phylogenies for thousands of species that help to elucidate the tree of life.
In addition to targeting the trnK
gene we're also restricting our query to a specific organism, the plant genus Pedicularis. This is a group of plants that I study. Please follow along using this search term for now, you will have a chance to explore other search terms later below.
# execute this cell to create the get_uids function
def get_uids(term, retmax=10):
"Search NCBI nucleotide database and return N sequence matches"
# make a request to esearch
res = requests.get(
url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params={
"db": "nucleotide",
"term": term,
"sort": "length",
"retmode": "text",
"retmax": retmax,
"tool": "genomics-course",
"email": "student@university.edu",
},
)
# parse the xml output
count = res.text.split("<Count>")[1].split("</Count>")[0].strip()
# if nothing found then bail out
if not int(count):
raise ValueError("No UIDs found")
# return the list of UIDs
uids = []
ids = res.text.split("<IdList>")[1].split("</IdList>")[0].strip()
for item in ids.split("\n"):
uids.append(item[4:-5])
return uids
Our query statement to the database is stored as the variable term
below. We enter this as an argument to the get_uids()
function and tell it to return at most 20 results. The function returns a list of strings that represent unique IDs matching our query.
# the search term
term = "trnk[GENE] AND Pedicularis[ORGN]"
# call the function to get uids and store as a list
ingroups = get_uids(term=term, retmax=20)
# show the first ten values
ingroups[:10]
['1720554516', '1720554514', '1720554512', '1720554510', '1720554508', '1720554506', '1720554504', '1720554502', '63253964', '63253962']
As we learned in the last notebook, it is important to root your tree so that you know the direction of evolution when interpreting evolutionary relationships. One way to do this is to include an outgroup sample in your analysis -- a sample that you know is more distantly related to everything else. Here we'll include 2 samples from a closely related genus. To do this I changed the search term string below to look for the same gene but from a different organism. Finally we add the list results of this search (2 UIDs) with the list results of the first one (20 UIDs) to have a total of 22 samples in the final list uids
. We could combine many searches to select samples from a variety of species, or genera, or families in this way.
# the search term
term = "trnk[GENE] AND Orobanche[ORGN]"
# call the function to get uids and store as a list
outgroups = get_uids(term=term, retmax=2)
# show these uids
print(outgroups)
# add these uids to the list
uids = ingroups + outgroups
['1690492016', '1690492014']
The next function fetches fasta data from the database based on the unique IDs that we got from the previous search. In the API call here we specify the strand of DNA, which will ensure they are all returned in the same orientation, and we also specify a start and stop limit just to ensure we do not get back sequences longer than 2000 base pairs. Here automation can sometimes be dangerous because there are errors in the metadata of the database. For example, people sometimes enter the wrong strand information into the metadata when they upload the sequences, which can cause your sequences to not align well. It is important to visually inspect your data to check for these types of errors, as we will do below.
# execute this cell to create the get_fasta function
def get_fasta(uids):
"""
Fetch fasta sequence data from NCBI for a list of uids
and return as a dictionary of {name: sequence}.
"""
# make a request to efetch
res = requests.get(
url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
params={
"db": "nucleotide",
"id": ",".join(uids),
"seq_start": "1",
"seq_stop": "2000",
"strand": "1",
"retmode": "text",
"rettype": "fasta",
"tool": "genomics-course",
"email": "student@university.edu",
},
)
# split fasta string into separate fasta sequences
fastas = res.text.strip().split("\n\n")
# write to output file
with open("sequences.fasta", 'w') as out:
for fasta in fastas:
# separate name and sequence
name, sequence = fasta.split("\n", 1)
# reorder and shorten fasta names for easier reading
bits = name.split()
genus = bits[1][0]
species = bits[2]
accession = bits[0][1:].split(":")[0]
name = ">{}_{}_{}".format(genus, species, accession)
# remove line breaks from sequence
sequence = sequence.replace("\n", "")
# write to file
out.write("{}\n{}\n".format(name, sequence))
# print statement
print("Wrote {} sequences to ./sequences.fasta".format(len(fastas)))
# run the get_fasta command
fastadict = get_fasta(uids)
Wrote 22 sequences to ./sequences.fasta
We have now downloaded orthologous gene sequences for 22 samples (20 Pedicularis and 2 Orobanche). The function above printed a statement saying that it printed the data to ./sequences.fasta
. You can open the file in your file viewer to look at it. Or, we can also peek at it here in the notebook. Let's look at just the first few columns and rows of data. Here I'm going to use a cool trick in Pandas to display the DNA as a DataFrame with colored cells.
# execute to create the following functions
def highlight_dna(cell):
"A function to color cells by DNA base"
color = {"A": 'red', "T": 'blue', "C": 'green', "G": 'yellow', "-": "lightgrey"}
return 'background-color: {}'.format(color[cell])
def colored_slice(fasta, start, stop):
"returns a colored dataframe over a given slice"
# load seq data
with open(fasta) as infile:
# load sequence file as a dictionary
fdict = {}
for fa in infile.read().strip()[1:].split("\n>"):
name, seq = fa.split("\n")
fdict[name] = list(seq)[start: stop]
# make dataframe from dictionary
df = pd.DataFrame(fdict, index=range(start, stop)).T
# show dataframe as colored cells
return df.style.applymap(highlight_dna)
We can now use this function to look at a slice of this fasta file. You can scroll to the right to see more columns as well. Each column here represents one sequenced base. When all of the samples have the same base at a position this means that the site is not variable (at least given our sampling). No mutations have occurred at that site since these species diverged from each other. You can see that most columns for the Pedicularis samples are invariant, but then there is a lot of mismatch with the two outgroup samples in the bottom two rows.
Now remember, the sequences in this file have not been aligned yet, so we expect that they may look pretty bad at this stage. As you scroll to the right you'll see that the ingroup samples (Pedicularis) also fall out of phase pretty quickly. That is because there was an insertion or deletion at some point which is not being accounted for. Let's see if aligning makes this look better.
# we can't tell yet that the sequences are not aligned
colored_slice("./sequences.fasta", 0, 100)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
P_metaszetschuanica_LC377635.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_szetschuanica_LC377634.1 | A | C | C | T | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_refracta_LC377633.1 | T | A | G | A | T | C | T | T | T | A | C | A | C | A | T | T | T | G | G | A | T | G | A | A | G | T | G | A | G | A | A | A | T | T | C | G | T | C | C | A | T | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T |
P_refracta_LC377632.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_refracta_LC377631.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_refracta_LC377630.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_refracta_LC377629.1 | C | T | C | G | G | C | T | T | T | T | T | A | A | G | T | G | C | G | G | C | T | A | G | A | A | T | C | T | T | T | T | A | C | A | C | A | T | T | T | G | G | A | T | G | A | A | G | T | G | A | G | A | A | A | T | T | C | G | T | C | C | A | T | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G |
P_refracta_LC377628.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_verticillata_AY949769.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_axillaris_AY949768.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_schistostegia_AY949767.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_davidii_AY949766.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | A | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | G | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_furfuracea_AY949765.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_anas_AY949764.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_integrifolia_AY949763.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_verticillata_AY949762.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_thamnophila_AY949761.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | T | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | G | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_trichocymba_AY949760.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_cranolopha_AY949759.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | A | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | T | A | T | T | T | C | A | T | T | C |
P_ternata_AY949758.1 | A | C | C | A | T | T | G | G | T | A | A | A | G | T | T | T | G | G | A | A | G | A | C | C | A | C | G | A | C | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | T | A | T | T | T | C | A | T | T | C |
O_lycoctoni_MK588421.1 | C | T | A | C | T | T | T | T | G | A | T | T | T | A | A | T | T | T | C | A | A | A | T | G | G | A | G | G | A | A | A | T | C | C | A | A | A | A | A | T | A | T | T | T | A | C | A | G | C | T | A | G | A | A | A | G | A | T | C | T | C | A | A | C | A | A | C | A | C | A | G | T | T | T | C | C | T | A | T | A | T | C | C | A | C | T | T | A | T | T | T | T | T | C | A | G | G | A | G | T |
O_flava_MK588420.1 | A | C | A | C | T | T | A | T | T | A | C | G | A | T | T | C | T | T | T | A | T | C | A | A | C | A | A | A | A | A | T | T | G | G | A | A | T | A | G | T | T | T | T | A | T | T | A | C | T | C | C | A | A | A | G | A | A | A | G | C | C | A | G | T | T | C | T | T | T | T | T | T | T | G | C | A | A | A | A | A | G | T | A | A | T | C | A | A | A | G | A | T | T | A | T | T | C | T | T | A |
The muscle alignment program is a command line tool that can be called as muscle -in infile -out outfile
. In the function below we use the Python library subprocess
to call and run the program from within Python with subprocess.call()
. This is a convenient way in which to use Python code as "glue" to tie together programs written in other languages with tasks you want to complete in Python.
The program muscle uses an algorithm to insert "-" characters into the alignment so that it minimizes differences among sites. The details of this algorithm are interesting, and there are many ways to perform alignments, but that is outside the scope of this tutorial. Below we simply call the muscle program and parse the output to write it back to a file in a way that will be easy for us to load it back in as a pandas DataFrame to look at again with color.
# execute this cell to create the align_fasta function
def align_fasta(fasta_file):
# the output aligned file name to use
aligned = fasta_file + ".aligned"
# run muscle alignment program on fasta file to produce output file
cmd = ["muscle", "-in", fasta_file, "-out", aligned]
subprocess.call(cmd)
# read in results of aligned file
with open(aligned) as infile:
fastas = infile.read().strip()[1:].split("\n>")
# remove newlines from aligned file
with open(aligned, 'w') as out:
for fasta in fastas:
# separate name and sequence
name, sequence = fasta.split("\n", 1)
# remove line breaks from sequence
sequence = sequence.replace("\n", "")
# write to file
out.write(">{}\n{}\n".format(name, sequence))
# print statement
print("Wrote aligned file to {}".format(aligned))
# run the alignment program
align_fasta("./sequences.fasta")
Wrote aligned file to ./sequences.fasta.aligned
Muscle produced a new fasta file where the data is now aligned. Let's take a look to ensure that it looks better than our previous non-aligned data. Scroll around and see that there are now grey "-" characters imputed to improve the alignment. Is likely not perfect, but hopefully it's pretty accurate. Now when there is variation at a site we expect that variation represents substitutions that have happened at a single orthologous site.
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 100, 200)
100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
O_lycoctoni_MK588421.1 | T | A | T | T | T | A | C | A | G | C | T | A | G | A | A | A | G | A | T | C | T | C | A | A | C | A | A | C | A | C | A | G | T | T | T | C | C | T | A | T | A | T | C | C | A | C | T | T | A | T | T | T | T | T | C | A | G | G | A | G | T | A | T | A | T | T | T | A | T | T | C | A | T | T | T | G | C | T | C | A | T | G | A | T | C | G | T | G | G | T | T | T | C | G | G | T | A | G | A | T |
O_flava_MK588420.1 | C | G | A | T | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | T | T | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | C | A | A | A | A | A | - | - | - | - | T | T | G | G | A | A | - |
P_davidii_AY949766.1 | T | G | A | T | C | C | T | A | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | G | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_cranolopha_AY949759.1 | T | G | A | T | C | C | T | A | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_thamnophila_AY949761.1 | T | G | A | T | C | C | T | G | A | A | G | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_furfuracea_AY949765.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_axillaris_AY949768.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_trichocymba_AY949760.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_schistostegia_AY949767.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | G | G | G | A | - |
P_szetschuanica_LC377634.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_metaszetschuanica_LC377635.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_verticillata_AY949769.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_verticillata_AY949762.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377633.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377632.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377631.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377630.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377629.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_refracta_LC377628.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_anas_AY949764.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_integrifolia_AY949763.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
P_ternata_AY949758.1 | T | G | A | T | C | C | T | G | A | A | A | G | G | G | A | A | T | A | A | A | T | G | G | A | A | A | A | A | A | T | A | G | C | A | T | G | T | C | G | T | A | T | C | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | A | T | A | G | A | G | A | G | T | T | C | T | G | T | G | G | A | - |
Similarly to how we called the program muscle
above, here we will call the program raxmlHPC
using the Python subprocess library again. The raxml software is complex and requires a number of arguments. You do not need to worry about those for now. You can look at the documentation for the software to learn more. The take home message here is that in just a minute or two this software can take our input aligned fasta file and infer a phylogeny as well as an estimate of confidence/support in that estimate. We'll discuss more later the complexities of how tree inference works.
# execute this cell to create the infer_tree command
def infer_tree(aligned_fasta):
# remove previous results if they exist
oldname = "RAxML_info.{}".format(os.path.basename(aligned_fasta))
if os.path.exists(oldname):
os.remove(oldname)
# the raxml command
cmd = [
"raxmlHPC",
"-f", "a",
"-s", aligned_fasta,
"-m", "GTRGAMMA",
"-n", os.path.basename(aligned_fasta),
"-p", "12345",
"-x", "54321",
"-w", os.path.realpath("."),
"-N", "100",
]
# print the raxml command
print("running: {}\n".format(" ".join(cmd)))
subprocess.call(cmd)
# print statement
print("Tree file written to ./RAxML_bipartitions.{}".format(
os.path.basename(aligned_fasta)))
# run the tree inference
infer_tree("./sequences.fasta.aligned")
running: raxmlHPC -f a -s ./sequences.fasta.aligned -m GTRGAMMA -n sequences.fasta.aligned -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100 Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned
Now that you've inferred a tree the next step is of course to check your results by visualization. Plot the tree. We can then compare the topology to our expectation based on morphology, or geography, or other expectations. Do all of the branch lengths look reasonable? If not, perhaps you have a bad sequence in the alignment. Visually inspect it and consider removing it.
In the tree below everything looks reasonable and likely correct given my knowledge of this group of organisms. I rooted the tree on the two outgroup sequences using the .root()
function and a wildcard selector that will match any names starting with "O_". In the final tree I also included a scalebar. You'll notice that on this tree the tips (black lines) do not all line up at 0 (the dashed lines are added just make the name labels align to make reading it easier). This is because this is not a time-scaled tree. The branch lengths are in units of substitutions per site. Some tips branches had more substitutions than others, which may reflect a higher rate of molecular evolution. Using this phylogeny we can now begin to test other evolutionary hypotheses.
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned")
# root on the outgroups
rtre = tre.root(regex="O_*")
# draw with support values
rtre.draw(width=700, tip_labels_align=True, scalebar=True);
We've now written functions to automate the minimal three steps required for inferring a phylogeny. And we can now reuse these functions to perform these tasks again with a different starting query to the database.
Every organism is not in the NCBI database for all genes. The hardest part of this challenge will be finding a good genetic marker that is sequenced across enough taxa in the group you are searching. I would recommend starting by searching the internet for the name of common phylogenetic markers used in your organism (or in animals versus plants). You can search these terms directly on NCBI https://www.ncbi.nlm.nih.gov/nuccore/. Search for terms like "universal phylogenetic marker", or do a google scholar search for "phylogeny" + the name of the organism you're interested in. Find a phylogenetic study that used a marker and try to sample that data yourself here. Search the NCBI taxonomy database if needed to find taxonomic groups to restrict your search.
Once you develop a good search term then you can insert it into the code below. If you get stuck head to the chat room. Don't try to move on to the next step until you get some UIDs printed below. If your search term does not match anything the function below will raise an error telling you that.
After you get your ingroup samples you next need to find an outgroup. This means finding an organism outside of your core group, but still pretty closely related. This will require you to do some research as well. NCBI includes a lot of information about taxonomy, as do many other sites including Wikipedia, or published phylogenetic papers.
Change the 'term' and 'oterm' search strings below to perform your own search.
# call the function to get uids and store as a list
term = "COI[GENE] AND Delphinidae[ORGN]"
ingroup = get_uids(term=term, retmax=20)
# call the function to uids from an *outgroup* and add to list
oterm = "COI[GENE] AND Phocoenidae[ORGN]"
outgroup = get_uids(term=oterm, retmax=5)
# show uids
uids = ingroup + outgroup
print(uids)
['323100323', '323100313', '323100305', '296465606', '1624685187', '1624685173', '1624685159', '1624685145', '1624685131', '1624685117', '1624685103', '1624685089', '1624685075', '1624685061', '1624685047', '1624685033', '1624685019', '1624685005', '1624684991', '1624684977', '1139695912', '1139695898', '1139695884', '1139695870', '1022717220']
# get fasta data for the UIDs
get_fasta(uids)
Wrote 25 sequences to ./sequences.fasta
Call the alignment function on your fasta file and then use the colored_slice
function to look at your alignment. Does it look good, or a complete mess? If it's a mess do not proceed to the next step. Maybe you need to try a different search term, a more restrictive definition of the gene segment you are searching for.
# run the alignment program
align_fasta("./sequences.fasta")
Wrote aligned file to ./sequences.fasta.aligned
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 700, 900)
700 | 701 | 702 | 703 | 704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | 810 | 811 | 812 | 813 | 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 | 883 | 884 | 885 | 886 | 887 | 888 | 889 | 890 | 891 | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
P_phocoena_KX079493.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | G | T | A | T | A | T | A | G | G | G | A | T | G | G | T | A | T | G | A | G | C | T | A | T | G | G | T | C | T | C | T | A | T | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | C | G | G | T | T | T | C | C | T | A | G | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | G | T | T | T | T | A | T | C |
G_macrorhynchus_MK790767.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | T | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790778.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790776.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790782.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790781.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790780.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790779.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790777.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790775.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790774.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790773.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790770.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790768.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790769.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790772.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
G_macrorhynchus_MK790771.1 | C | A | G | C | A | A | A | C | C | C | T | A | A | A | A | A | A | G | G | A | A | T | G | A | A | A | G | T | A | A | G | C | A | C | A | A | C | T | A | T | T | G | C | A | C | G | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | A | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | C | A | A | T | A | A | G | A | A | C | A | C | C | C | C | C | T | A | A | A | C | T | T | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
N_phocaenoides_KX650871.1 | C | A | G | C | A | A | A | C | C | C | T | - | A | A | A | A | A | G | G | A | A | T | A | A | A | A | G | T | A | A | G | C | A | T | A | A | T | T | A | C | A | G | C | A | C | A | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | G | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | T | A | A | C | A | A | G | A | A | C | A | C | C | C | C | T | T | A | A | G | C | C | C | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
N_phocaenoides_KX650869.1 | C | A | G | C | A | A | A | C | C | C | T | - | A | A | A | A | A | G | G | A | A | T | A | A | A | A | G | T | A | A | G | C | A | T | A | A | T | T | A | C | A | G | C | A | C | A | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | G | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | T | A | A | C | A | A | G | A | A | C | A | C | C | C | C | T | T | A | A | G | C | C | C | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
N_phocaenoides_KX650872.1 | C | A | G | C | A | A | A | C | C | C | T | - | A | A | A | A | A | G | G | A | A | T | A | A | A | A | G | T | A | A | G | C | A | T | A | A | T | T | A | C | A | G | C | A | C | A | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | G | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | T | A | A | C | A | A | G | A | A | C | A | C | C | C | C | T | T | A | A | G | C | C | C | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
N_phocaenoides_KX650870.1 | C | A | G | C | A | A | A | C | C | C | T | - | A | A | A | A | A | G | G | A | A | T | A | A | A | A | G | T | A | A | G | C | A | T | A | A | T | T | A | C | A | G | C | A | C | A | T | A | A | A | A | A | C | G | T | T | A | G | G | T | C | A | A | G | G | T | G | T | A | A | C | C | T | A | T | G | G | G | T | T | G | G | G | A | A | G | A | A | A | T | G | G | G | C | T | A | C | A | T | T | T | T | C | T | A | T | A | A | C | A | A | G | A | A | C | A | C | C | C | C | T | T | A | A | A | C | C | C | A | C | A | C | G | A | A | A | G | T | T | T | T | T | A | T | G | A | A | A | T | C | T | A | A | A | A | A | C | T | A | A | A | G | G | A | G | G | A | T | T | T | A | G | C | A | G | T | A | A | A | T | T | A | A | G | A | A | T | A | G | A | A | T | G | C | T | T | A | A | T | T |
S_guianensis_HQ918433.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | G | C | T | C | A | T | G | C | T | T | T | C | G | T | A | A | T | A | - | - | - | - | - | - | - | - | - | - | A | T | T | T | T | C | T | T | T | - | A | T | A | G | T | T | A | T | A | C | C | T | A | T | C | A | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | T | A | A | T | T | G | G | G | G | G | T | T | T | C | G | - | - | - | - | - | - | - | - | - | - | G | G | A | A | C | T | G | A | C | T | A | G | T | T | C | C | T | T |
S_guianensis_HQ918428.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | G | C | T | C | A | T | G | C | T | T | T | C | G | T | A | A | T | A | - | - | - | - | - | - | - | - | - | - | A | T | T | T | T | C | T | T | T | - | A | T | A | G | T | T | A | T | A | C | C | T | A | T | C | A | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | T | A | A | T | T | G | G | G | G | G | T | T | T | C | G | - | - | - | - | - | - | - | - | - | - | G | G | A | A | C | T | G | A | C | T | A | G | T | T | C | C | T | T |
S_guianensis_HQ918424.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | G | C | T | C | A | T | G | C | T | T | T | C | G | T | A | A | T | A | - | - | - | - | - | - | - | - | - | - | A | T | T | T | T | C | T | T | T | - | A | T | A | G | T | T | A | T | A | C | C | T | A | T | C | A | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | T | A | A | T | T | G | G | G | G | G | T | T | T | C | G | - | - | - | - | - | - | - | - | - | - | G | G | A | A | C | T | G | A | C | T | A | G | T | T | C | C | T | T |
D_sp._GU674138.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | A | G | C | T | C | A | T | G | C | C | T | T | C | G | T | A | A | T | A | - | - | - | - | - | - | - | - | - | - | A | T | T | T | T | C | T | T | T | - | A | T | A | G | T | T | A | T | A | C | C | T | A | T | C | A | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | T | A | A | T | T | G | G | A | G | G | T | T | T | T | G | - | - | - | - | - | - | - | - | - | - | G | G | A | A | C | T | G | A | T | T | A | G | T | C | C | C | C | T |
If your alignment is good then there's not much to this step. Run it and wait for it to finish.
# run the tree inference
infer_tree("./sequences.fasta.aligned")
running: raxmlHPC -f a -s ./sequences.fasta.aligned -m GTRGAMMA -n sequences.fasta.aligned -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100 Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned")
# root on the outgroups
rtre = tre.root(regex="[G,N]_*")
# draw with support values
rtre.draw(width=500, tip_labels_align=True, scalebar=True);
When performing phylogenetic inference we make the assumption that the gene regions we are studying are homologous. To do so we rely on alignment algorithms to align genomic sites that are homologous. As you can see in the empirical alignments above, especially if you tried to align fairly distantly related organisms, is that sometimes genes diverge so significantly that even if we can still identify that the genes are homologous, there are few sites that can be confidently assigned as homologous, and there is a lot of room for error in alignments.
here are additional tools that can be applied to sequence alignments after they have been aligned and before they are used for phylogenetic inference that aim to mask (i.e., remove) sites that are poorly aligned to improve phylogenetic inference. One example is the tool GBLOCKs. This can apply filtering options such as "Min. seq. for flank pos.: 85%", "Max. contig. nonconserved pos.: 8", "Min. block length: 10", "Gaps in final blocks: no". Below is a function that can be applied to clean an alignment using the default GBLOCKS options.
def gblock_filter_alignment(aligned_fasta):
"""
Takes an aligned fasta file and applies GBLOCKS to filter it and
write the results to a new fasta file.
"""
# run muscle alignment program on fasta file to produce output file
cmd = [
"Gblocks", aligned_fasta,
"-t=d",
"-e=.filt",
"-d=n",
]
print("command:", " ".join(cmd))
subprocess.call(cmd)
# print statement
print("Wrote filtered fastsa file to {}".format(aligned_fasta + ".filt"))
gblock_filter_alignment("./sequences.fasta.aligned")
command: Gblocks ./sequences.fasta.aligned -t=d -e=.filt -d=n Wrote filtered fastsa file to ./sequences.fasta.aligned.filt
# run the tree inference
infer_tree("./sequences.fasta.aligned.filt")
running: raxmlHPC -f a -s ./sequences.fasta.aligned.filt -m GTRGAMMA -n sequences.fasta.aligned.filt -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100 Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned.filt
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned.filt")
# root on the outgroups
rtre = tre.root(regex="[G,N]_*")
# draw with support values
rtre.draw(width=500, tip_labels_align=True, scalebar=True);