Orthologs are homologous gene copies that diverged from a common ancestral gene during speciation events. After speciation, lineages evolve independently such that orthologous gene copies accumulate different mutations over time. This process can continue through many speciation events. As mutations arise in the periods between speciation they leave behind a record of the existence of the common ancestral sequence. In this way, the mutational process constructs a record that we can eventually use to reconstruct the order and timing of speciation events. By comparing the orthologous gene sequences of the descendant lineages alive today we can contruct a phylogeny as a snapshot of the past.
In this notebook we will not dive too deep into the algorithmic and methodological complexity of phylogenetic inference. In its most basic form, the goal of molecular phylogenetics is to count the number of character differences (e.g., substitutions) between multiple orthologous sequences, and to build a tree that minimizes the number of steps needed to explain those differences. We can do this using simple parsimony methods (the fewest steps is the best), or with methods that model DNA substitution as a probabilistic process (e.g., maximum likelihood).
The details of how various phylogenetic software tools differ, and what assumptions they make, is important to learn and relevant if you use phylogenetics for research. And we'll get to some of those details later. But for now, we'll focus on the applied end of things. How do I infer a phylogenetic tree in practice? We'll turn our attention back to the more theoretical and statistical details later.
There are three basic steps involved in inferring a phylogenetic tree. We'll complete a practical example from start to finish in this notebook. Then at the end you will try to replicate this in an organism of your choice.
import os
import subprocess
import toytree
import requests
import pandas as pd
To find individual genes or gene regions sampled from particular species the best place to look is the NCBI Genbank sequence archive. This database includes both Sanger sequenced DNA regions that were targeted by researchers for sequencing using primers, as well as genes or gene regions from sequenced and annotated genomes. In an earlier notebook we used API tools to access fasta data directly from the database. Here we'll use that code again, written as Python functions. Execute the cells to load these functions, and follow along by reading the code and comments in it. We'll mostly focus on the results that the code produces.
The esearch
API returns unique IDs for the search term that we use. Our query below is searching for a particular marker that is commonly used in plant phylogenetic studies (trnK
). This is a chloroplast gene that has an important function and so tends to be highly conserved, which makes it easy to sequence by designing primers to amplify it. However, it also contains an intron sequence nearby that can be sequenced along with it, and this region is more variable, and thus provides more phylogenetic information than the conserved part. It has historically been difficult to find genomic regions that are both easy to amplify and variable enough to work as good phylogenetic markers. The few regions that were first discovered to fit this category have been used extensively. And because of this inertia they have become what is called universal markers -- a few gene regions that have been sequenced across thousands of organisms by many research groups over several decades. These markers provide the best sampled phylogenetic data that we have, and through a community effort of collaboration and data sharing have made it possible to combine available data to build enormous phylogenies for thousands of species that help to elucidate the tree of life.
In addition to targeting the trnK
gene we're also restricting our query to a specific organism, the plant genus Pedicularis. This is a group of plants that I study. Please follow along using this search term for now, you will have a chance to explore other search terms later below.
# execute this cell to create the get_uids function
def get_uids(term, retmax=10):
"Search NCBI nucleotide database and return N sequence matches"
# make a request to esearch
res = requests.get(
url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
params={
"db": "nucleotide",
"term": term,
"sort": "length",
"retmode": "text",
"retmax": retmax,
"tool": "genomics-course",
"email": "student@university.edu",
},
)
# parse the xml output
count = res.text.split("<Count>")[1].split("</Count>")[0].strip()
# if nothing found then bail out
if not int(count):
raise ValueError("No UIDs found")
# return the list of UIDs
uids = []
ids = res.text.split("<IdList>")[1].split("</IdList>")[0].strip()
for item in ids.split("\n"):
uids.append(item[4:-5])
return uids
Our query statement to the database is stored as the variable term
below. We enter this as an argument to the get_uids()
function and tell it to return at most 20 results. The function returns a list of strings that represent unique IDs matching our query.
# the search term
term = "trnk[GENE] AND Pedicularis[ORGN]"
# call the function to get uids and store as a list
ingroups = get_uids(term=term, retmax=20)
# show the first ten values
ingroups[:10]
As we learned in the last notebook, it is important to root your tree so that you know the direction of evolution when interpreting evolutionary relationships. One way to do this is to include an outgroup sample in your analysis -- a sample that you know is more distantly related to everything else. Here we'll include 2 samples from a closely related genus. To do this I changed the search term string below to look for the same gene but from a different organism. Finally we add the list results of this search (2 UIDs) with the list results of the first one (20 UIDs) to have a total of 22 samples in the final list uids
. We could combine many searches to select samples from a variety of species, or genera, or families in this way.
# the search term
term = "trnk[GENE] AND Orobanche[ORGN]"
# call the function to get uids and store as a list
outgroups = get_uids(term=term, retmax=2)
# show these uids
print(outgroups)
# add these uids to the list
uids = ingroups + outgroups
The next function fetches fasta data from the database based on the unique IDs that we got from the previous search. In the API call here we specify the strand of DNA, which will ensure they are all returned in the same orientation, and we also specify a start and stop limit just to ensure we do not get back sequences longer than 2000 base pairs. Here automation can sometimes be dangerous because there are errors in the metadata of the database. For example, people sometimes enter the wrong strand information into the metadata when they upload the sequences, which can cause your sequences to not align well. It is important to visually inspect your data to check for these types of errors, as we will do below.
# execute this cell to create the get_fasta function
def get_fasta(uids):
"""
Fetch fasta sequence data from NCBI for a list of uids
and return as a dictionary of {name: sequence}.
"""
# make a request to efetch
res = requests.get(
url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
params={
"db": "nucleotide",
"id": ",".join(uids),
"seq_start": "1",
"seq_stop": "2000",
"strand": "1",
"retmode": "text",
"rettype": "fasta",
"tool": "genomics-course",
"email": "student@university.edu",
},
)
# split fasta string into separate fasta sequences
fastas = res.text.strip().split("\n\n")
# write to output file
with open("sequences.fasta", 'w') as out:
for fasta in fastas:
# separate name and sequence
name, sequence = fasta.split("\n", 1)
# reorder and shorten fasta names for easier reading
bits = name.split()
genus = bits[1][0]
species = bits[2]
accession = bits[0][1:].split(":")[0]
name = ">{}_{}_{}".format(genus, species, accession)
# remove line breaks from sequence
sequence = sequence.replace("\n", "")
# write to file
out.write("{}\n{}\n".format(name, sequence))
# print statement
print("Wrote {} sequences to ./sequences.fasta".format(len(fastas)))
# run the get_fasta command
fastadict = get_fasta(uids)
We have now downloaded orthologous gene sequences for 22 samples (20 Pedicularis and 2 Orobanche). The function above printed a statement saying that it printed the data to ./sequences.fasta
. You can open the file in your file viewer to look at it. Or, we can also peek at it here in the notebook. Let's look at just the first few columns and rows of data. Here I'm going to use a cool trick in Pandas to display the DNA as a DataFrame with colored cells.
# execute to create the following functions
def highlight_dna(cell):
"A function to color cells by DNA base"
color = {"A": 'red', "T": 'blue', "C": 'green', "G": 'yellow', "-": "lightgrey"}
return 'background-color: {}'.format(color[cell])
def colored_slice(fasta, start, stop):
"returns a colored dataframe over a given slice"
# load seq data
with open(fasta) as infile:
# load sequence file as a dictionary
fdict = {}
for fa in infile.read().strip()[1:].split("\n>"):
name, seq = fa.split("\n")
fdict[name] = list(seq)[start: stop]
# make dataframe from dictionary
df = pd.DataFrame(fdict, index=range(start, stop)).T
# show dataframe as colored cells
return df.style.applymap(highlight_dna)
We can now use this function to look at a slice of this fasta file. You can scroll to the right to see more columns as well. Each column here represents one sequenced base. When all of the samples have the same base at a position this means that the site is not variable (at least given our sampling). No mutations have occurred at that site since these species diverged from each other. You can see that most columns for the Pedicularis samples are invariant, but then there is a lot of mismatch with the two outgroup samples in the bottom two rows.
Now remember, the sequences in this file have not been aligned yet, so we expect that they may look pretty bad at this stage. As you scroll to the right you'll see that the ingroup samples (Pedicularis) also fall out of phase pretty quickly. That is because there was an insertion or deletion at some point which is not being accounted for. Let's see if aligning makes this look better.
# we can't tell yet that the sequences are not aligned
colored_slice("./sequences.fasta", 0, 100)
The muscle alignment program is a command line tool that can be called as muscle -in infile -out outfile
. In the function below we use the Python library subprocess
to call and run the program from within Python with subprocess.call()
. This is a convenient way in which to use Python code as "glue" to tie together programs written in other languages with tasks you want to complete in Python.
The program muscle uses an algorithm to insert "-" characters into the alignment so that it minimizes differences among sites. The details of this algorithm are interesting, and there are many ways to perform alignments, but that is outside the scope of this tutorial. Below we simply call the muscle program and parse the output to write it back to a file in a way that will be easy for us to load it back in as a pandas DataFrame to look at again with color.
# execute this cell to create the align_fasta function
def align_fasta(fasta_file):
# the output aligned file name to use
aligned = fasta_file + ".aligned"
# run muscle alignment program on fasta file to produce output file
cmd = ["muscle", "-in", fasta_file, "-out", aligned]
subprocess.call(cmd)
# read in results of aligned file
with open(aligned) as infile:
fastas = infile.read().strip()[1:].split("\n>")
# remove newlines from aligned file
with open(aligned, 'w') as out:
for fasta in fastas:
# separate name and sequence
name, sequence = fasta.split("\n", 1)
# remove line breaks from sequence
sequence = sequence.replace("\n", "")
# write to file
out.write(">{}\n{}\n".format(name, sequence))
# print statement
print("Wrote aligned file to {}".format(aligned))
# run the alignment program
align_fasta("./sequences.fasta")
Muscle produced a new fasta file where the data is now aligned. Let's take a look to ensure that it looks better than our previous non-aligned data. Scroll around and see that there are now grey "-" characters imputed to improve the alignment. Is likely not perfect, but hopefully it's pretty accurate. Now when there is variation at a site we expect that variation represents substitutions that have happened at a single orthologous site.
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 100, 200)