Notebook 12.2: Phylogenetics in practice

Introduction

Orthologs are homologous gene copies that diverged from a common ancestral gene during speciation events. After speciation, lineages evolve independently such that orthologous gene copies accumulate different mutations over time. This process can continue through many speciation events. As mutations arise in the periods between speciation they leave behind a record of the existence of the common ancestral sequence. In this way, the mutational process constructs a record that we can eventually use to reconstruct the order and timing of speciation events. By comparing the orthologous gene sequences of the descendant lineages alive today we can contruct a phylogeny as a snapshot of the past.

Inferring phylogenies

In this notebook we will not dive too deep into the algorithmic and methodological complexity of phylogenetic inference. In its most basic form, the goal of molecular phylogenetics is to count the number of character differences (e.g., substitutions) between multiple orthologous sequences, and to build a tree that minimizes the number of steps needed to explain those differences. We can do this using simple parsimony methods (the fewest steps is the best), or with methods that model DNA substitution as a probabilistic process (e.g., maximum likelihood).

The details of how various phylogenetic software tools differ, and what assumptions they make, is important to learn and relevant if you use phylogenetics for research. And we'll get to some of those details later. But for now, we'll focus on the applied end of things. How do I infer a phylogenetic tree in practice? We'll turn our attention back to the more theoretical and statistical details later.

Practical tree inference

There are three basic steps involved in inferring a phylogenetic tree. We'll complete a practical example from start to finish in this notebook. Then at the end you will try to replicate this in an organism of your choice.

  1. Generate/retrieve sequence data from orthologous gene regions. For this we will use the NCBI API utilities that we learned about in an earlier session to query and fetch fasta files directly from a large online database.
  1. Align the data to identify substitutions at homologous sites. For this we will use a command line program called muscle. There are many alignment programs available, I chose this one because it's easy to use. We will input a fasta file that includes all of our sequence data, but for which each ortholog may be a different length and may start or stop at a different position in the gene. The muscle program will ouput a new fasta file where the sequences are aligned by imputing "-" symbols to improve the alignment. We'll see an example of this below.
  1. Infer a tree by entering the aligned data into a phylogenetic inference program. For this we'll enter the aligned fasta files into a program called RAxML. This is a commonly used phylogenetic inference program that uses Maximum Likelihood to infer the best tree to fit the data. As output RAxML produces a newick formatted tree file (like we learned about in the last notebook). Finally, we'll plot the tree result.
In [1]:
import os
import subprocess
import toytree
import requests
import pandas as pd
Question [5]: What are the 3 steps involved in practical phylogenetic inference? Answer concisely in Markdown below.
- collect character data (e.g., sequences) for each sample you want to study. - align the characters/sequences to identify homology. - infer a tree using one of several possible inference methods. We will also accept detailed answers about tree inference: - propose a starting tree - calculate score of the tree (e.g., parsimony or likelihood) - propose a change to the tree and repeat.

Retrieving sequence data

To find individual genes or gene regions sampled from particular species the best place to look is the NCBI Genbank sequence archive. This database includes both Sanger sequenced DNA regions that were targeted by researchers for sequencing using primers, as well as genes or gene regions from sequenced and annotated genomes. In an earlier notebook we used API tools to access fasta data directly from the database. Here we'll use that code again, written as Python functions. Execute the cells to load these functions, and follow along by reading the code and comments in it. We'll mostly focus on the results that the code produces.

Retrieve unique IDs for some number of samples

The esearch API returns unique IDs for the search term that we use. Our query below is searching for a particular marker that is commonly used in plant phylogenetic studies (trnK). This is a chloroplast gene that has an important function and so tends to be highly conserved, which makes it easy to sequence by designing primers to amplify it. However, it also contains an intron sequence nearby that can be sequenced along with it, and this region is more variable, and thus provides more phylogenetic information than the conserved part. It has historically been difficult to find genomic regions that are both easy to amplify and variable enough to work as good phylogenetic markers. The few regions that were first discovered to fit this category have been used extensively. And because of this inertia they have become what is called universal markers -- a few gene regions that have been sequenced across thousands of organisms by many research groups over several decades. These markers provide the best sampled phylogenetic data that we have, and through a community effort of collaboration and data sharing have made it possible to combine available data to build enormous phylogenies for thousands of species that help to elucidate the tree of life.

In addition to targeting the trnK gene we're also restricting our query to a specific organism, the plant genus Pedicularis. This is a group of plants that I study. Please follow along using this search term for now, you will have a chance to explore other search terms later below.

In [5]:
# execute this cell to create the get_uids function

def get_uids(term, retmax=10):
    "Search NCBI nucleotide database and return N sequence matches"
    
    # make a request to esearch 
    res = requests.get(
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", 
        params={
            "db": "nucleotide",
            "term": term,
            "sort": "length",
            "retmode": "text",
            "retmax": retmax,
            "tool": "genomics-course", 
            "email": "student@university.edu",
            },
        )
    
    # parse the xml output
    count = res.text.split("<Count>")[1].split("</Count>")[0].strip()
    
    # if nothing found then bail out
    if not int(count):
        raise ValueError("No UIDs found")
    
    # return the list of UIDs
    uids = []
    ids = res.text.split("<IdList>")[1].split("</IdList>")[0].strip()
    for item in ids.split("\n"):
        uids.append(item[4:-5])
    return uids

Our query statement to the database is stored as the variable term below. We enter this as an argument to the get_uids() function and tell it to return at most 20 results. The function returns a list of strings that represent unique IDs matching our query.

In [6]:
# the search term
term = "trnk[GENE] AND Pedicularis[ORGN]"

# call the function to get uids and store as a list
ingroups = get_uids(term=term, retmax=20)

# show the first ten values
ingroups[:10]
Out[6]:
['1720554516',
 '1720554514',
 '1720554512',
 '1720554510',
 '1720554508',
 '1720554506',
 '1720554504',
 '1720554502',
 '63253964',
 '63253962']

Outgroup sample

As we learned in the last notebook, it is important to root your tree so that you know the direction of evolution when interpreting evolutionary relationships. One way to do this is to include an outgroup sample in your analysis -- a sample that you know is more distantly related to everything else. Here we'll include 2 samples from a closely related genus. To do this I changed the search term string below to look for the same gene but from a different organism. Finally we add the list results of this search (2 UIDs) with the list results of the first one (20 UIDs) to have a total of 22 samples in the final list uids. We could combine many searches to select samples from a variety of species, or genera, or families in this way.

In [7]:
# the search term
term = "trnk[GENE] AND Orobanche[ORGN]"

# call the function to get uids and store as a list
outgroups = get_uids(term=term, retmax=2)

# show these uids
print(outgroups)

# add these uids to the list
uids = ingroups + outgroups
['1690492016', '1690492014']

Retrieve fasta data for the UIDs

The next function fetches fasta data from the database based on the unique IDs that we got from the previous search. In the API call here we specify the strand of DNA, which will ensure they are all returned in the same orientation, and we also specify a start and stop limit just to ensure we do not get back sequences longer than 2000 base pairs. Here automation can sometimes be dangerous because there are errors in the metadata of the database. For example, people sometimes enter the wrong strand information into the metadata when they upload the sequences, which can cause your sequences to not align well. It is important to visually inspect your data to check for these types of errors, as we will do below.

In [8]:
# execute this cell to create the get_fasta function

def get_fasta(uids):
    """
    Fetch fasta sequence data from NCBI for a list of uids
    and return as a dictionary of {name: sequence}.
    """
    # make a request to efetch 
    res = requests.get(
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi", 
            params={
            "db": "nucleotide",
            "id": ",".join(uids),
            "seq_start": "1",
            "seq_stop": "2000",
            "strand": "1",
            "retmode": "text",   
            "rettype": "fasta",
            "tool": "genomics-course", 
            "email": "student@university.edu",
            },
        )
    
    # split fasta string into separate fasta sequences
    fastas = res.text.strip().split("\n\n")
    
    # write to output file
    with open("sequences.fasta", 'w') as out:
        for fasta in fastas:

            # separate name and sequence
            name, sequence = fasta.split("\n", 1)

            # reorder and shorten fasta names for easier reading
            bits = name.split()
            genus = bits[1][0]
            species = bits[2]
            accession = bits[0][1:].split(":")[0]
            name = ">{}_{}_{}".format(genus, species, accession)

            # remove line breaks from sequence
            sequence = sequence.replace("\n", "")

            # write to file
            out.write("{}\n{}\n".format(name, sequence))
    
    # print statement
    print("Wrote {} sequences to ./sequences.fasta".format(len(fastas)))
    
In [9]:
# run the get_fasta command
fastadict = get_fasta(uids)
Wrote 22 sequences to ./sequences.fasta

Looking at the fasta file

We have now downloaded orthologous gene sequences for 22 samples (20 Pedicularis and 2 Orobanche). The function above printed a statement saying that it printed the data to ./sequences.fasta. You can open the file in your file viewer to look at it. Or, we can also peek at it here in the notebook. Let's look at just the first few columns and rows of data. Here I'm going to use a cool trick in Pandas to display the DNA as a DataFrame with colored cells.

In [10]:
# execute to create the following functions

def highlight_dna(cell):
    "A function to color cells by DNA base"
    color = {"A": 'red', "T": 'blue', "C": 'green', "G": 'yellow', "-": "lightgrey"}
    return 'background-color: {}'.format(color[cell])


def colored_slice(fasta, start, stop):
    "returns a colored dataframe over a given slice"
    # load seq data 
    with open(fasta) as infile:

        # load sequence file as a dictionary
        fdict = {}
        for fa in infile.read().strip()[1:].split("\n>"):
            name, seq = fa.split("\n")
            fdict[name] = list(seq)[start: stop]

        # make dataframe from dictionary
        df = pd.DataFrame(fdict, index=range(start, stop)).T

    # show dataframe as colored cells
    return df.style.applymap(highlight_dna)

We can now use this function to look at a slice of this fasta file. You can scroll to the right to see more columns as well. Each column here represents one sequenced base. When all of the samples have the same base at a position this means that the site is not variable (at least given our sampling). No mutations have occurred at that site since these species diverged from each other. You can see that most columns for the Pedicularis samples are invariant, but then there is a lot of mismatch with the two outgroup samples in the bottom two rows.

Now remember, the sequences in this file have not been aligned yet, so we expect that they may look pretty bad at this stage. As you scroll to the right you'll see that the ingroup samples (Pedicularis) also fall out of phase pretty quickly. That is because there was an insertion or deletion at some point which is not being accounted for. Let's see if aligning makes this look better.

In [11]:
# we can't tell yet that the sequences are not aligned
colored_slice("./sequences.fasta", 0, 100)
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
P_metaszetschuanica_LC377635.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_szetschuanica_LC377634.1 A C C T T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377633.1 T A G A T C T T T A C A C A T T T G G A T G A A G T G A G A A A T T C G T C C A T A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T
P_refracta_LC377632.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377631.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377630.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_refracta_LC377629.1 C T C G G C T T T T T A A G T G C G G C T A G A A T C T T T T A C A C A T T T G G A T G A A G T G A G A A A T T C G T C C A T A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G
P_refracta_LC377628.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_verticillata_AY949769.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_axillaris_AY949768.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_schistostegia_AY949767.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_davidii_AY949766.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G G A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_furfuracea_AY949765.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_anas_AY949764.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_integrifolia_AY949763.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_verticillata_AY949762.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
P_thamnophila_AY949761.1 A C C A T T G G T A A A G T T T T G A A G A C C A C G A C T G A T C C T G A A G G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_trichocymba_AY949760.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_cranolopha_AY949759.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G G G G A T A T T T C A T T C
P_ternata_AY949758.1 A C C A T T G G T A A A G T T T G G A A G A C C A C G A C T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C A A T A G A G A G T T C T G T G G A T A T T T C A T T C
O_lycoctoni_MK588421.1 C T A C T T T T G A T T T A A T T T C A A A T G G A G G A A A T C C A A A A A T A T T T A C A G C T A G A A A G A T C T C A A C A A C A C A G T T T C C T A T A T C C A C T T A T T T T T C A G G A G T
O_flava_MK588420.1 A C A C T T A T T A C G A T T C T T T A T C A A C A A A A A T T G G A A T A G T T T T A T T A C T C C A A A G A A A G C C A G T T C T T T T T T T G C A A A A A G T A A T C A A A G A T T A T T C T T A

Alignment

The muscle alignment program is a command line tool that can be called as muscle -in infile -out outfile. In the function below we use the Python library subprocess to call and run the program from within Python with subprocess.call(). This is a convenient way in which to use Python code as "glue" to tie together programs written in other languages with tasks you want to complete in Python.

The program muscle uses an algorithm to insert "-" characters into the alignment so that it minimizes differences among sites. The details of this algorithm are interesting, and there are many ways to perform alignments, but that is outside the scope of this tutorial. Below we simply call the muscle program and parse the output to write it back to a file in a way that will be easy for us to load it back in as a pandas DataFrame to look at again with color.

In [12]:
# execute this cell to create the align_fasta function

def align_fasta(fasta_file):
    
    # the output aligned file name to use
    aligned = fasta_file + ".aligned"
    
    # run muscle alignment program on fasta file to produce output file
    cmd = ["muscle", "-in", fasta_file, "-out", aligned]
    subprocess.call(cmd)

    # read in results of aligned file
    with open(aligned) as infile:
        fastas = infile.read().strip()[1:].split("\n>")
        
    # remove newlines from aligned file
    with open(aligned, 'w') as out:
        for fasta in fastas:

            # separate name and sequence
            name, sequence = fasta.split("\n", 1)

            # remove line breaks from sequence
            sequence = sequence.replace("\n", "")

            # write to file
            out.write(">{}\n{}\n".format(name, sequence))
    
    # print statement
    print("Wrote aligned file to {}".format(aligned))
    
In [13]:
# run the alignment program
align_fasta("./sequences.fasta")
Wrote aligned file to ./sequences.fasta.aligned

Look at the aligned fasta file

Muscle produced a new fasta file where the data is now aligned. Let's take a look to ensure that it looks better than our previous non-aligned data. Scroll around and see that there are now grey "-" characters imputed to improve the alignment. Is likely not perfect, but hopefully it's pretty accurate. Now when there is variation at a site we expect that variation represents substitutions that have happened at a single orthologous site.

In [14]:
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 100, 200)
Out[14]:
100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
O_lycoctoni_MK588421.1 T A T T T A C A G C T A G A A A G A T C T C A A C A A C A C A G T T T C C T A T A T C C A C T T A T T T T T C A G G A G T A T A T T T A T T C A T T T G C T C A T G A T C G T G G T T T C G G T A G A T
O_flava_MK588420.1 C G A T T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - T T T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A C A A A A A - - - - T T G G A A -
P_davidii_AY949766.1 T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G G A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_cranolopha_AY949759.1 T G A T C C T A A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_thamnophila_AY949761.1 T G A T C C T G A A G G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_furfuracea_AY949765.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_axillaris_AY949768.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_trichocymba_AY949760.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_schistostegia_AY949767.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G G G G A -
P_szetschuanica_LC377634.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_metaszetschuanica_LC377635.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_verticillata_AY949769.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_verticillata_AY949762.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377633.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377632.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377631.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377630.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377629.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_refracta_LC377628.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_anas_AY949764.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_integrifolia_AY949763.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -
P_ternata_AY949758.1 T G A T C C T G A A A G G G A A T A A A T G G A A A A A A T A G C A T G T C G T A T C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A A T A G A G A G T T C T G T G G A -

Infer a phylogeny

Similarly to how we called the program muscle above, here we will call the program raxmlHPC using the Python subprocess library again. The raxml software is complex and requires a number of arguments. You do not need to worry about those for now. You can look at the documentation for the software to learn more. The take home message here is that in just a minute or two this software can take our input aligned fasta file and infer a phylogeny as well as an estimate of confidence/support in that estimate. We'll discuss more later the complexities of how tree inference works.

In [15]:
# execute this cell to create the infer_tree command

def infer_tree(aligned_fasta):
    
    # remove previous results if they exist
    oldname = "RAxML_info.{}".format(os.path.basename(aligned_fasta))
    if os.path.exists(oldname):
        os.remove(oldname)
        
    # the raxml command
    cmd = [
        "raxmlHPC", 
        "-f", "a",
        "-s", aligned_fasta, 
        "-m", "GTRGAMMA", 
        "-n", os.path.basename(aligned_fasta),
        "-p", "12345", 
        "-x", "54321",
        "-w", os.path.realpath("."),
        "-N", "100",
    ]
        
    # print the raxml command
    print("running: {}\n".format(" ".join(cmd)))
    subprocess.call(cmd)
    
    # print statement
    print("Tree file written to ./RAxML_bipartitions.{}".format(
        os.path.basename(aligned_fasta)))
    
In [16]:
# run the tree inference 
infer_tree("./sequences.fasta.aligned")
running: raxmlHPC -f a -s ./sequences.fasta.aligned -m GTRGAMMA -n sequences.fasta.aligned -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100

Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned

Plot/draw the phylogeny

Now that you've inferred a tree the next step is of course to check your results by visualization. Plot the tree. We can then compare the topology to our expectation based on morphology, or geography, or other expectations. Do all of the branch lengths look reasonable? If not, perhaps you have a bad sequence in the alignment. Visually inspect it and consider removing it.

In the tree below everything looks reasonable and likely correct given my knowledge of this group of organisms. I rooted the tree on the two outgroup sequences using the .root() function and a wildcard selector that will match any names starting with "O_". In the final tree I also included a scalebar. You'll notice that on this tree the tips (black lines) do not all line up at 0 (the dashed lines are added just make the name labels align to make reading it easier). This is because this is not a time-scaled tree. The branch lengths are in units of substitutions per site. Some tips branches had more substitutions than others, which may reflect a higher rate of molecular evolution. Using this phylogeny we can now begin to test other evolutionary hypotheses.

In [17]:
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned")

# root on the outgroups
rtre = tre.root(regex="O_*")

# draw with support values
rtre.draw(width=700, tip_labels_align=True, scalebar=True);
P_refracta_LC377633.1P_refracta_LC377632.1P_refracta_LC377631.1P_refracta_LC377629.1P_refracta_LC377628.1P_refracta_LC377630.1P_verticillata_AY949762.1P_verticillata_AY949769.1P_metaszetschuanica_LC377635.1P_szetschuanica_LC377634.1P_ternata_AY949758.1P_davidii_AY949766.1P_cranolopha_AY949759.1P_axillaris_AY949768.1P_trichocymba_AY949760.1P_furfuracea_AY949765.1P_schistostegia_AY949767.1P_thamnophila_AY949761.1P_anas_AY949764.1P_integrifolia_AY949763.1O_lycoctoni_MK588421.1O_flava_MK588420.10.000.070.150.220.290.370.44

Challenge: Use these functions to infer a tree for some other organismal group

We've now written functions to automate the minimal three steps required for inferring a phylogeny. And we can now reuse these functions to perform these tasks again with a different starting query to the database.

Action [6]: Infer a phylogeny for a group of organisms of your choice, and using a genetic marker of your choice, by following the instructions below.

1. Find data

Every organism is not in the NCBI database for all genes. The hardest part of this challenge will be finding a good genetic marker that is sequenced across enough taxa in the group you are searching. I would recommend starting by searching the internet for the name of common phylogenetic markers used in your organism (or in animals versus plants). You can search these terms directly on NCBI https://www.ncbi.nlm.nih.gov/nuccore/. Search for terms like "universal phylogenetic marker", or do a google scholar search for "phylogeny" + the name of the organism you're interested in. Find a phylogenetic study that used a marker and try to sample that data yourself here. Search the NCBI taxonomy database if needed to find taxonomic groups to restrict your search.

Once you develop a good search term then you can insert it into the code below. If you get stuck head to the chat room. Don't try to move on to the next step until you get some UIDs printed below. If your search term does not match anything the function below will raise an error telling you that.

After you get your ingroup samples you next need to find an outgroup. This means finding an organism outside of your core group, but still pretty closely related. This will require you to do some research as well. NCBI includes a lot of information about taxonomy, as do many other sites including Wikipedia, or published phylogenetic papers.

Change the 'term' and 'oterm' search strings below to perform your own search.

In [18]:
# call the function to get uids and store as a list
term = "COI[GENE] AND Delphinidae[ORGN]"
ingroup = get_uids(term=term, retmax=20)

# call the function to uids from an *outgroup* and add to list
oterm = "COI[GENE] AND Phocoenidae[ORGN]"
outgroup = get_uids(term=oterm, retmax=5)

# show uids
uids = ingroup + outgroup
print(uids)
['323100323', '323100313', '323100305', '296465606', '1624685187', '1624685173', '1624685159', '1624685145', '1624685131', '1624685117', '1624685103', '1624685089', '1624685075', '1624685061', '1624685047', '1624685033', '1624685019', '1624685005', '1624684991', '1624684977', '1139695912', '1139695898', '1139695884', '1139695870', '1022717220']
In [19]:
# get fasta data for the UIDs
get_fasta(uids)
Wrote 25 sequences to ./sequences.fasta

2. Align the fasta file

Call the alignment function on your fasta file and then use the colored_slice function to look at your alignment. Does it look good, or a complete mess? If it's a mess do not proceed to the next step. Maybe you need to try a different search term, a more restrictive definition of the gene segment you are searching for.

In [20]:
# run the alignment program
align_fasta("./sequences.fasta")
Wrote aligned file to ./sequences.fasta.aligned
In [21]:
# looking in the middle of the file it's clear they are not aligned
colored_slice("./sequences.fasta.aligned", 700, 900)
Out[21]:
700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899
P_phocoena_KX079493.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - G T A T A T A G G G A T G G T A T G A G C T A T G G T C T C T A T - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - C G G T T T C C T A G - - - - - - - - - - - - - - - - - - - - G T T T T A T C
G_macrorhynchus_MK790767.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A T A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790778.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790776.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790782.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790781.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790780.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790779.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790777.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790775.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790774.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790773.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790770.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790768.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790769.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790772.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
G_macrorhynchus_MK790771.1 C A G C A A A C C C T A A A A A A G G A A T G A A A G T A A G C A C A A C T A T T G C A C G T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G A T T G G G A A G A A A T G G G C T A C A T T T T C T A C A A T A A G A A C A C C C C C T A A A C T T A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
N_phocaenoides_KX650871.1 C A G C A A A C C C T - A A A A A G G A A T A A A A G T A A G C A T A A T T A C A G C A C A T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G G T T G G G A A G A A A T G G G C T A C A T T T T C T A T A A C A A G A A C A C C C C T T A A G C C C A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
N_phocaenoides_KX650869.1 C A G C A A A C C C T - A A A A A G G A A T A A A A G T A A G C A T A A T T A C A G C A C A T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G G T T G G G A A G A A A T G G G C T A C A T T T T C T A T A A C A A G A A C A C C C C T T A A G C C C A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
N_phocaenoides_KX650872.1 C A G C A A A C C C T - A A A A A G G A A T A A A A G T A A G C A T A A T T A C A G C A C A T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G G T T G G G A A G A A A T G G G C T A C A T T T T C T A T A A C A A G A A C A C C C C T T A A G C C C A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
N_phocaenoides_KX650870.1 C A G C A A A C C C T - A A A A A G G A A T A A A A G T A A G C A T A A T T A C A G C A C A T A A A A A C G T T A G G T C A A G G T G T A A C C T A T G G G T T G G G A A G A A A T G G G C T A C A T T T T C T A T A A C A A G A A C A C C C C T T A A A C C C A C A C G A A A G T T T T T A T G A A A T C T A A A A A C T A A A G G A G G A T T T A G C A G T A A A T T A A G A A T A G A A T G C T T A A T T
S_guianensis_HQ918433.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A G C T C A T G C T T T C G T A A T A - - - - - - - - - - A T T T T C T T T - A T A G T T A T A C C T A T C A - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - T A A T T G G G G G T T T C G - - - - - - - - - - G G A A C T G A C T A G T T C C T T
S_guianensis_HQ918428.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A G C T C A T G C T T T C G T A A T A - - - - - - - - - - A T T T T C T T T - A T A G T T A T A C C T A T C A - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - T A A T T G G G G G T T T C G - - - - - - - - - - G G A A C T G A C T A G T T C C T T
S_guianensis_HQ918424.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A G C T C A T G C T T T C G T A A T A - - - - - - - - - - A T T T T C T T T - A T A G T T A T A C C T A T C A - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - T A A T T G G G G G T T T C G - - - - - - - - - - G G A A C T G A C T A G T T C C T T
D_sp._GU674138.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - A G C T C A T G C C T T C G T A A T A - - - - - - - - - - A T T T T C T T T - A T A G T T A T A C C T A T C A - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - T A A T T G G A G G T T T T G - - - - - - - - - - G G A A C T G A T T A G T C C C C T

Infer the tree

If your alignment is good then there's not much to this step. Run it and wait for it to finish.

In [22]:
# run the tree inference 
infer_tree("./sequences.fasta.aligned")
running: raxmlHPC -f a -s ./sequences.fasta.aligned -m GTRGAMMA -n sequences.fasta.aligned -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100

Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned

Draw your tree

Before drawing your tree be sure to change the code below to select the proper names of your outgroup sequences to root the tree since their names are likely different than the ones that I used. If you need help with rooting the tree you can find more information here.

In [25]:
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned")

# root on the outgroups
rtre = tre.root(regex="[G,N]_*")

# draw with support values
rtre.draw(width=500, tip_labels_align=True, scalebar=True);
G_macrorhynchus_MK790782.1G_macrorhynchus_MK790781.1G_macrorhynchus_MK790777.1G_macrorhynchus_MK790779.1G_macrorhynchus_MK790775.1G_macrorhynchus_MK790780.1G_macrorhynchus_MK790770.1G_macrorhynchus_MK790774.1G_macrorhynchus_MK790768.1G_macrorhynchus_MK790773.1G_macrorhynchus_MK790776.1G_macrorhynchus_MK790778.1G_macrorhynchus_MK790767.1G_macrorhynchus_MK790771.1G_macrorhynchus_MK790772.1G_macrorhynchus_MK790769.1N_phocaenoides_KX650870.1N_phocaenoides_KX650871.1N_phocaenoides_KX650869.1N_phocaenoides_KX650872.1S_guianensis_HQ918428.1S_guianensis_HQ918433.1S_guianensis_HQ918424.1D_sp._GU674138.1P_phocoena_KX079493.10.00.30.71.01.4
Question [7]: Reflect on your phylogeny above. Do you think it is accurate? Can you find a published phylogeny online for this group of organisms? If so, include it in your answer. What gene/marker did they use in that study? Do your trees show the same relationships for the selected organisms?
- Yes I think the tree is pretty accurate since most species appear monophyletic. But most of the tree is not resolved suggesting we could use more data. - Find a published study and paste a link or title here. - Name the gene in the study, here it would be COI. - State whether the trees are the same or different.
Question [8]: If you were writing a program to automate fetching data from online and inferring a tree, what would you do differently? What steps could be made easier? How do you imagine this will change as more and more annotated genes from assembled genomes are added to NCBI?
- It would be nice if you could more easily examine all of the stats of the samples before downloading the fasta data so that you could select more easily which samples to include. I think this could be done with further modifications to the API code above. - I think that the 'term' used to search becomes very important for extracting data from the database. To automate it you must come up with a term that finds all of the data available while still restricting it to only the data that is relevant to your study.

Orthology and alignments

When performing phylogenetic inference we make the assumption that the gene regions we are studying are homologous. To do so we rely on alignment algorithms to align genomic sites that are homologous. As you can see in the empirical alignments above, especially if you tried to align fairly distantly related organisms, is that sometimes genes diverge so significantly that even if we can still identify that the genes are homologous, there are few sites that can be confidently assigned as homologous, and there is a lot of room for error in alignments.

here are additional tools that can be applied to sequence alignments after they have been aligned and before they are used for phylogenetic inference that aim to mask (i.e., remove) sites that are poorly aligned to improve phylogenetic inference. One example is the tool GBLOCKs. This can apply filtering options such as "Min. seq. for flank pos.: 85%", "Max. contig. nonconserved pos.: 8", "Min. block length: 10", "Gaps in final blocks: no". Below is a function that can be applied to clean an alignment using the default GBLOCKS options.

In [26]:
def gblock_filter_alignment(aligned_fasta):
    """
    Takes an aligned fasta file and applies GBLOCKS to filter it and 
    write the results to a new fasta file. 
    """
    # run muscle alignment program on fasta file to produce output file
    cmd = [
        "Gblocks", aligned_fasta,
        "-t=d",
        "-e=.filt",
        "-d=n",
    ]
    print("command:", " ".join(cmd))
    subprocess.call(cmd)
    
    # print statement
    print("Wrote filtered fastsa file to {}".format(aligned_fasta + ".filt"))
    
Question [9]: Clean your alignment using GBLOCKS and infer the tree again. Did you get a different result? Do you think it was an improvement? If not, can you think of an alternative way to improve your alignment?
- In this case although GBLOCKs excluded many sites from the alignment, the final inferred tree looks very similar. Other ways to improve the alignment might involve sampling more evenly so that there are fewer replicates within species, and thus more similar differences among species.
In [28]:
gblock_filter_alignment("./sequences.fasta.aligned")
command: Gblocks ./sequences.fasta.aligned -t=d -e=.filt -d=n
Wrote filtered fastsa file to ./sequences.fasta.aligned.filt
In [29]:
# run the tree inference 
infer_tree("./sequences.fasta.aligned.filt")
running: raxmlHPC -f a -s ./sequences.fasta.aligned.filt -m GTRGAMMA -n sequences.fasta.aligned.filt -p 12345 -x 54321 -w /home/deren/Documents/genomics/2020-phylogenetics-I/notebooks -N 100

Tree file written to ./RAxML_bipartitions.sequences.fasta.aligned.filt
In [30]:
# load the tree
tre = toytree.tree("./RAxML_bestTree.sequences.fasta.aligned.filt")

# root on the outgroups
rtre = tre.root(regex="[G,N]_*")

# draw with support values
rtre.draw(width=500, tip_labels_align=True, scalebar=True);
G_macrorhynchus_MK790779.1G_macrorhynchus_MK790774.1G_macrorhynchus_MK790781.1G_macrorhynchus_MK790778.1G_macrorhynchus_MK790780.1G_macrorhynchus_MK790767.1G_macrorhynchus_MK790768.1G_macrorhynchus_MK790775.1G_macrorhynchus_MK790770.1G_macrorhynchus_MK790769.1G_macrorhynchus_MK790771.1G_macrorhynchus_MK790782.1G_macrorhynchus_MK790773.1G_macrorhynchus_MK790776.1G_macrorhynchus_MK790772.1G_macrorhynchus_MK790777.1N_phocaenoides_KX650872.1N_phocaenoides_KX650871.1N_phocaenoides_KX650869.1N_phocaenoides_KX650870.1S_guianensis_HQ918424.1S_guianensis_HQ918433.1S_guianensis_HQ918428.1D_sp._GU674138.1P_phocoena_KX079493.10.00.30.50.81.1