Notebook 12.2: Phylogenetics in practice

Introduction

Orthologs are homologous gene copies that diverged from a common ancestral gene during speciation events. After speciation, lineages evolve independently such that orthologous gene copies accumulate different mutations over time. This process can continue through many speciation events. As mutations arise in the periods between speciation they leave behind a record of the existence of the common ancestral sequence. In this way, the mutational process constructs a record that we can eventually use to reconstruct the order and timing of speciation events. By comparing the orthologous gene sequences of the descendant lineages alive today we can contruct a phylogeny as a snapshot of the past.

Inferring phylogenies

In this notebook we will not dive too deep into the algorithmic and methodological complexity of phylogenetic inference. In its most basic form, the goal of molecular phylogenetics is to count the number of character differences (e.g., substitutions) between multiple orthologous sequences, and to build a tree that minimizes the number of steps needed to explain those differences. We can do this using simple parsimony methods (the fewest steps is the best), or with methods that model DNA substitution as a probabilistic process (e.g., maximum likelihood).

The details of how various phylogenetic software tools differ, and what assumptions they make, is important to learn and relevant if you use phylogenetics for research. And we'll get to some of those details later. But for now, we'll focus on the applied end of things. How do I infer a phylogenetic tree in practice? We'll turn our attention back to the more theoretical and statistical details later.

Practical tree inference

There are three basic steps involved in inferring a phylogenetic tree. We'll complete a practical example from start to finish in this notebook. Then at the end you will try to replicate this in an organism of your choice.

  1. Generate/retrieve sequence data from orthologous gene regions. For this we will use the NCBI API utilities that we learned about in an earlier session to query and fetch fasta files directly from a large online database.
  1. Align the data to identify substitutions at homologous sites. For this we will use a command line program called muscle. There are many alignment programs available, I chose this one because it's easy to use. We will input a fasta file that includes all of our sequence data, but for which each ortholog may be a different length and may start or stop at a different position in the gene. The muscle program will ouput a new fasta file where the sequences are aligned by imputing "-" symbols to improve the alignment. We'll see an example of this below.
  1. Infer a tree by entering the aligned data into a phylogenetic inference program. For this we'll enter the aligned fasta files into a program called RAxML. This is a commonly used phylogenetic inference program that uses Maximum Likelihood to infer the best tree to fit the data. As output RAxML produces a newick formatted tree file (like we learned about in the last notebook). Finally, we'll plot the tree result.
In [1]:
import os
import subprocess
import toytree
import requests
import pandas as pd
Question [5]: What are the 3 steps involved in practical phylogenetic inference? Answer concisely in Markdown below.
- collect character data (e.g., sequences) for each sample you want to study. - align the characters/sequences to identify homology. - infer a tree using one of several possible inference methods. We will also accept detailed answers about tree inference: - propose a starting tree - calculate score of the tree (e.g., parsimony or likelihood) - propose a change to the tree and repeat.

Retrieving sequence data

To find individual genes or gene regions sampled from particular species the best place to look is the NCBI Genbank sequence archive. This database includes both Sanger sequenced DNA regions that were targeted by researchers for sequencing using primers, as well as genes or gene regions from sequenced and annotated genomes. In an earlier notebook we used API tools to access fasta data directly from the database. Here we'll use that code again, written as Python functions. Execute the cells to load these functions, and follow along by reading the code and comments in it. We'll mostly focus on the results that the code produces.

Retrieve unique IDs for some number of samples

The esearch API returns unique IDs for the search term that we use. Our query below is searching for a particular marker that is commonly used in plant phylogenetic studies (trnK). This is a chloroplast gene that has an important function and so tends to be highly conserved, which makes it easy to sequence by designing primers to amplify it. However, it also contains an intron sequence nearby that can be sequenced along with it, and this region is more variable, and thus provides more phylogenetic information than the conserved part. It has historically been difficult to find genomic regions that are both easy to amplify and variable enough to work as good phylogenetic markers. The few regions that were first discovered to fit this category have been used extensively. And because of this inertia they have become what is called universal markers -- a few gene regions that have been sequenced across thousands of organisms by many research groups over several decades. These markers provide the best sampled phylogenetic data that we have, and through a community effort of collaboration and data sharing have made it possible to combine available data to build enormous phylogenies for thousands of species that help to elucidate the tree of life.

In addition to targeting the trnK gene we're also restricting our query to a specific organism, the plant genus Pedicularis. This is a group of plants that I study. Please follow along using this search term for now, you will have a chance to explore other search terms later below.

In [5]:
# execute this cell to create the get_uids function

def get_uids(term, retmax=10):
    "Search NCBI nucleotide database and return N sequence matches"
    
    # make a request to esearch 
    res = requests.get(
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", 
        params={
            "db": "nucleotide",
            "term": term,
            "sort": "length",
            "retmode": "text",
            "retmax": retmax,
            "tool": "genomics-course", 
            "email": "student@university.edu",
            },
        )
    
    # parse the xml output
    count = res.text.split("<Count>")[1].split("</Count>")[0].strip()
    
    # if nothing found then bail out
    if not int(count):
        raise ValueError("No UIDs found")
    
    # return the list of UIDs
    uids = []
    ids = res.text.split("<IdList>")[1].split("</IdList>")[0].strip()
    for item in ids.split("\n"):
        uids.append(item[4:-5])
    return uids

Our query statement to the database is stored as the variable term below. We enter this as an argument to the get_uids() function and tell it to return at most 20 results. The function returns a list of strings that represent unique IDs matching our query.

In [6]:
# the search term
term = "trnk[GENE] AND Pedicularis[ORGN]"

# call the function to get uids and store as a list
ingroups = get_uids(term=term, retmax=20)

# show the first ten values
ingroups[:10]
Out[6]:
['1720554516',
 '1720554514',
 '1720554512',
 '1720554510',
 '1720554508',
 '1720554506',
 '1720554504',
 '1720554502',
 '63253964',
 '63253962']

Outgroup sample

As we learned in the last notebook, it is important to root your tree so that you know the direction of evolution when interpreting evolutionary relationships. One way to do this is to include an outgroup sample in your analysis -- a sample that you know is more distantly related to everything else. Here we'll include 2 samples from a closely related genus. To do this I changed the search term string below to look for the same gene but from a different organism. Finally we add the list results of this search (2 UIDs) with the list results of the first one (20 UIDs) to have a total of 22 samples in the final list uids. We could combine many searches to select samples from a variety of species, or genera, or families in this way.

In [7]:
# the search term
term = "trnk[GENE] AND Orobanche[ORGN]"

# call the function to get uids and store as a list
outgroups = get_uids(term=term, retmax=2)

# show these uids
print(outgroups)

# add these uids to the list
uids = ingroups + outgroups
['1690492016', '1690492014']

Retrieve fasta data for the UIDs

The next function fetches fasta data from the database based on the unique IDs that we got from the previous search. In the API call here we specify the strand of DNA, which will ensure they are all returned in the same orientation, and we also specify a start and stop limit just to ensure we do not get back sequences longer than 2000 base pairs. Here automation can sometimes be dangerous because there are errors in the metadata of the database. For example, people sometimes enter the wrong strand information into the metadata when they upload the sequences, which can cause your sequences to not align well. It is important to visually inspect your data to check for these types of errors, as we will do below.

In [8]:
# execute this cell to create the get_fasta function

def get_fasta(uids):
    """
    Fetch fasta sequence data from NCBI for a list of uids
    and return as a dictionary of {name: sequence}.
    """
    # make a request to efetch 
    res = requests.get(
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi", 
            params={
            "db": "nucleotide",
            "id": ",".join(uids),
            "seq_start": "1",
            "seq_stop": "2000",
            "strand": "1",
            "retmode": "text",   
            "rettype": "fasta",
            "tool": "genomics-course", 
            "email": "student@university.edu",
            },
        )
    
    # split fasta string into separate fasta sequences
    fastas = res.text.strip().split("\n\n")
    
    # write to output file
    with open("sequences.fasta", 'w') as out:
        for fasta in fastas:

            # separate name and sequence
            name, sequence = fasta.split("\n", 1)

            # reorder and shorten fasta names for easier reading
            bits = name.split()
            genus = bits[1][0]
            species = bits[2]
            accession = bits[0][1:].split(":")[0]
            name = ">{}_{}_{}".format(genus, species, accession)

            # remove line breaks from sequence
            sequence = sequence.replace("\n", "")

            # write to file
            out.write("{}\n{}\n".format(name, sequence))
    
    # print statement
    print("Wrote {} sequences to ./sequences.fasta".format(len(fastas)))
    
In [9]:
# run the get_fasta command
fastadict = get_fasta(uids)
Wrote 22 sequences to ./sequences.fasta

Looking at the fasta file

We have now downloaded orthologous gene sequences for 22 samples (20 Pedicularis and 2 Orobanche). The function above printed a statement saying that it printed the data to ./sequences.fasta. You can open the file in your file viewer to look at it. Or, we can also peek at it here in the notebook. Let's look at just the first few columns and rows of data. Here I'm going to use a cool trick in Pandas to display the DNA as a DataFrame with colored cells.

In [10]:
# execute to create the following functions

def highlight_dna(cell):
    "A function to color cells by DNA base"
    color = {"A": 'red', "T": 'blue', "C": 'green', "G": 'yellow', "-": "lightgrey"}
    return 'background-color: {}'.format(color[cell])


def colored_slice(fasta, start, stop):
    "returns a colored dataframe over a given slice"
    # load seq data 
    with open(fasta) as infile:

        # load sequence file as a dictionary
        fdict = {}
        for fa in infile.read().strip()[1:].split("\n>"):
            name, seq = fa.split("\n")
            fdict[name] = list(seq)[start: stop]

        # make dataframe from dictionary
        df = pd.DataFrame(fdict, index=range(start, stop)).T

    # show dataframe as colored cells
    return df.style.applymap(highlight_dna)

We can now use this function to look at a slice of this fasta file. You can scroll to the right to see more columns as well. Each column here represents one sequenced base. When all of the samples have the same base at a position this means that the site is not variable (at least given our sampling). No mutations have occurred at that site since these species diverged from each other. You can see that most columns for the Pedicularis samples are invariant, but then there is a lot of mismatch with the two outgroup samples in the bottom two rows.

Now remember, the sequences in this file have not been aligned yet, so we expect that they may look pretty bad at this stage. As you scroll to the right you'll see that the ingroup samples (Pedicularis) also fall out of phase pretty quickly. That is because there was an insertion or deletion at some point which is not being accounted for. Let's see if aligning makes this look better.

In [11]:
# we can't tell yet that the sequences are not aligned
colored_slice("./sequences.fasta", 0, 100)
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
P_metaszetschuanica_LC377635.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_szetschuanica_LC377634.1 A C C T T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377633.1 T A G A T C T T T A C A C A T T T G G A T G A A G T G A G A A A T T C G T C C A T A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T
P_refracta_LC377632.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377631.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377630.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377629.1 C T C G G C T T T T T A A G T G C G G C T A G A A T C T T T T A C A C A T T T G G A T G A A G T G A G A A A T T C G T C C A T A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G
P_refracta_LC377628.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_verticillata_AY949769.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_axillaris_AY949768.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_schistostegia_AY949767.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_davidii_AY949766.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G G A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_furfuracea_AY949765.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_anas_AY949764.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_integrifolia_AY949763.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_verticillata_AY949762.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_thamnophila_AY949761.1 A C C A T T G G T A A A G T T T T G A A G A C C A C G A C T G A T C C T G A A G G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_trichocymba_AY949760.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_cranolopha_AY949759.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_ternata_AY949758.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
O_lycoctoni_MK588421.1 C T A C T T T T G A T T T A A T T T C A A A T G G A G G A A A T C C A A A A A T A T T T A C A G C T A G A A A G A T C T C A A C A A C A C A G T T T C C T A T A T C C A C T T A T T T T T C A G G A G T
O_flava_MK588420.1 A C A C T T A T T A C G A T T C T T T A T C A A C A A A A A T T G G A A T A G T T T T A T T A C T C C A A A G A A A G C C A G T T C T T T T T T T G C A A A A A G T A A T C A A A G A T T A T T C T T A

Alignment

The muscle alignment program is a command line tool that can be called as muscle -in infile -out outfile. In the function below we use the Python library subprocess to call and run the program from within Python with subprocess.call(). This is a convenient way in which to use Python code as "glue" to tie together programs written in other languages with tasks you want to complete in Python.

The program muscle uses an algorithm to insert "-" characters into the alignment so that it minimizes differences among sites. The details of this algorithm are interesting, and there are many ways to perform alignments, but that is outside the scope of this tutorial. Below we simply call the muscle program and parse the output to write it back to a file in a way that will be easy for us to load it back in as a pandas DataFrame to look at again with color.

In [12]:
# execute this cell to create the align_fasta function

def align_fasta(fasta_file):
    
    # the output aligned file name to use
    aligned = fasta_file + ".aligned"
    
    # run muscle alignment program on fasta file to produce output file
    cmd = ["muscle", "-in", fasta_file, "-out", aligned]
    subprocess.call(cmd)

    # read in results of aligned file
    with open(aligned) as infile:
        fastas = infile.read().strip()[1:].split("\n>")
        
    # remove newlines from aligned file
    with open(aligned, 'w') as out:
        for fasta in fastas:

            # separate name and sequence
            name, sequence = fasta.split("\n", 1)

            # remove line breaks from sequence
            sequence = sequence.replace("\n", "")

            # write to file
            out.write(">{}\n{}\n".format(name, sequence))
    
    # print statement
    print("Wrote aligned file to {}".format(aligned))
    
In [13]:
# run the alignment program
align_fasta("./sequences.fasta")
Wrote aligned file to ./sequences.fasta.aligned

Look at the aligned fasta file

Muscle produced a new fasta file where the data is now aligned. Let's take a look to ensure that it looks better than our previous non-aligned data. Scroll around and see that there are now grey "-" characters imputed to improve the alignment. Is likely not perfect, but hopefully it's pretty accurate. Now when there is variation at a site we expect that variation represents substitutions that have happened at a single orthologous site.

In [14]:
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 100, 200)
Out[14]:
100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
O_lycoctoni_MK588421.1 T A T T T A C A G C T A G A A A G A T C T C A A C A A C A C A G T T T C C T A T A T C C A C T T A T T T T T C A G G A G T A T A T T T A T T C A T T T G C T C A T G A T C G T G G T T T C G