Notebook 5.2 -- BLAST, NCBI, and Python APIs¶

This notebook is intended to accompany the following assigned reading:

Waterhouse, Robert M., Fredrik Tegenfeldt, Jia Li, Evgeny M. Zdobnov, and Evgenia V. Kriventseva. 2013. “OrthoDB: A Hierarchical Catalog of Animal, Fungal and Bacterial Orthologs.” Nucleic Acids Research 41: D358–65. https://doi.org/10.1093/nar/gks1116.

Learning objectives:¶

By the end of this notebook your should:

Understand the difference between BLAST and the NCBI sequence database.
Recognize that NCBI holds a vast network of information.
Understand that APIs allow online database queries using code.

BLAST, ENTREZ, and NCBI databases¶

BLAST is an algorithm and command line program for comparing sequences and computing a similarity score. It is not, however, the name of a sequence database. In fact, you can perform BLAST on many different sequence databases, or even create your own database to search with blast. This is often a point of confusion when first learning about BLAST.

NCBI (National Center for Biotechnology Information) is a U.S. institution that hosts genomic data and tools. It is the home of an online BLAST server that we just used, as well as the Entrez database, which is an enormous relational database for storing metadata information about sequence data that has been made publicly available. The Entrez system includes tools for querying sequences according to unique identifier (GI codes); sequence names; sequences (using BLAST); or by taxonomy (Entrez hosts a complex and continually updated taxonomy).

Reproducibility: Using APIs¶

In the BLAST and OrthoDB examples we ran in the last notebook my instructions had you click on buttons and enter in web forms. Using the tools this way was convenient because the results of your search were displayed nicely in a browser at the end. But, a major short-coming of this approach is that it was not easy to keep a record of which links you clicked on, and therefore to communicate to someone else later how you did what you did, and how to repeat it. This is a major problem in genomics, and in science generally. Software with a graphical user inferface (GUI) is easy to use, but hard to repeat and share.

A good solution for this is to use browser/GUI based tools for exploratory analyses to initially find results, but then after you have found the data you wish to retrieve, to follow this up by writing code that can be executed to repeat your analysis. This will be the goal of the section below, which describes the use of a Application Programming Interface (API) to interact with web-based tools using Python code.

The Entrez API¶

Genomic data are stored in the Entrez database according to a complex hierarchy. We will not delve into the details of this. But the gist is that just about everything has a type of ID associated with it. There is a species-ID to identify the species used in any study; there is a sample-ID for the specific individuals included in a study; there is a run-ID for the sequence run used to generate genomic data for a study; and there are study-IDs and project-IDs that are used to store that all of these other IDs are related to each other through your research. Entrez stores all of this information, and in doing so makes it possible to retrieve the data later using complex searches. For this, the Entrez database has been designed to be accessible through both online searches in a web-browser as well as by using API code.

Accessing Entrez data using the API¶

To use the API simply means to use code to request data from particular URLs instead of going through the process of clicking on many links in a browser. There is documentation on NCBI for how to use the Entrez API (https://www.ncbi.nlm.nih.gov/books/NBK25501/). I don't expect you to read it. Many other online tools or programs have APIs as well. This is meant to demonstrate their utility, but I do not expect you to memorize how to access the API. It is typical that you need to read the documentation for any specific database to figure it out.

We can query the Entrez database using the Python package requests. We have used this before to download data from the web. Requests has a simple framework for building API web queries using a Python dictionary. Let's learn through some examples below.

import requests

The esearch tool¶

Esearch is an Entrez tool for searching data based on a text query. We can search terms in PubMed and get abstracts that match them, or we can search taxon names or project-IDs for published genomic studies and get the accession numbers of all of the samples in the study. To make a search we need to start by telling NCBI that we are using the esearch tool (supplied in the url argument below), then we supply arguments to it about which database to search (using the params argument), as well as the term to search for. Finally we need to tell the API who we are by supplying an email address (you won't get spammed).

esearch API call with requests¶

# search term
# term = "SARS-CoV-2[ORGN] complete genome"
term = "FOXP2[GENE] AND Mammalia[ORGN] AND phylogenetic study[PROP]"

# make a request to esearch 
res = requests.get(
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", 
    params={
        "db": "nucleotide",
        "term": term,
        "sort": "Organism Name",
        "retmode": "text",
        "retmax": "20",
        "tool": "genomics-course", 
        "email": "student@columbia.edu",
        },
    )

The requests object builds a URL string that it will use to query data from the internet. The results are then returned as a response object, which we stored in the variable res. I show the full URL below. The results of the query are accessible from the .text attribute.

# the url built from the command above
res.url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=FOXP2%5BGENE%5D+AND+Mammalia%5BORGN%5D+AND+phylogenetic+study%5BPROP%5D&sort=Organism+Name&retmode=text&retmax=20&tool=genomics-course&email=student%40columbia.edu'

# get the text result of the request
res.text

'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult><Count>62</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>\n<Id>22596910</Id>\n<Id>22596958</Id>\n<Id>24496240</Id>\n<Id>263198301</Id>\n<Id>263198296</Id>\n<Id>22596906</Id>\n<Id>22596954</Id>\n<Id>22596956</Id>\n<Id>24496234</Id>\n<Id>39762715</Id>\n<Id>263198298</Id>\n<Id>39762716</Id>\n<Id>39762714</Id>\n<Id>263198295</Id>\n<Id>24496236</Id>\n<Id>263198299</Id>\n<Id>263198300</Id>\n<Id>22596912</Id>\n<Id>263198302</Id>\n<Id>512739739</Id>\n</IdList><TranslationSet><Translation>     <From>Mammalia[ORGN]</From>     <To>"Mammalia"[Organism]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>FOXP2[GENE]</Term>    <Field>GENE</Field>    <Count>2125</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>"Mammalia"[Organism]</Term>    <Field>Organism</Field>    <Count>79586164</Count>    <Explode>Y</Explode>   </TermSet>   <OP>AND</OP>   <TermSet>    <Term>phylogenetic study[PROP]</Term>    <Field>PROP</Field>    <Count>6496756</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>  </TranslationStack><QueryTranslation>FOXP2[GENE] AND "Mammalia"[Organism] AND phylogenetic study[PROP]</QueryTranslation></eSearchResult>\n'

Parsing the output¶

OK, the text results above looks a little scary, but it's actually mostly composed of metadata about the parameters of our query. The only thing we want to parse out of it is the part between the <Id> tags. I wrote a small function below to do this which we will use. This function just uses the .split() function of string objects to split the string into multiple parts on delimiters. The final result of our esearch request is a list with the unique IDs for sequences matching our term.

def parse_ids(xml):
    "parses xml output to get unique IDs (uids)"
    uids = []
    xml = xml.split("<IdList>")[1].split("</IdList>")[0].strip()
    for item in xml.split("\n"):
        uids.append(item[4:-5])
    return uids

uids = parse_ids(res.text)
uids

['22596910',
 '22596958',
 '24496240',
 '263198301',
 '263198296',
 '22596906',
 '22596954',
 '22596956',
 '24496234',
 '39762715',
 '263198298',
 '39762716',
 '39762714',
 '263198295',
 '24496236',
 '263198299',
 '263198300',
 '22596912',
 '263198302',
 '512739739']

Efetch API call¶

The efetch tool is meant to be used to get actual sequence data from NCBI based on the unique IDs that you obtain with esearch. I know, it seems like it would be easier if these two things acted as a single tool. Oh well. Now that we have the taxonomic ID we can use this to get data from the nucleotide database. I use the string join function below to combine the list of uids above into a comma separated string to use as an argument in the params.

# make a request to esearch 
res = requests.get(
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi", 
        params={
        "db": "nucleotide",
        "id": ",".join(uids),
        "strand": "1",
        "retmode": "text",   
        "rettype": "fasta",
        "tool": "genomics-course", 
        "email": "student@columbia.edu",
        },
    )

# show the url query
res.url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=22596910%2C22596958%2C24496240%2C263198301%2C263198296%2C22596906%2C22596954%2C22596956%2C24496234%2C39762715%2C263198298%2C39762716%2C39762714%2C263198295%2C24496236%2C263198299%2C263198300%2C22596912%2C263198302%2C512739739&strand=1&retmode=text&rettype=fasta&tool=genomics-course&email=student%40columbia.edu'

# the fasta data returned from NCBI
fastas = res.text

Go to the nucleotide database search page of NCBI (https://www.ncbi.nlm.nih.gov/nuccore/) and search for the same term that we did above (FOXP2[GENE] AND Mammalia[ORGN] AND phylogenetic study[PROP]). This will find the FOXP2 gene in any mammal from any study that was listed as a phylogenetic study. The order may be a bit different. To download the fasta sequences of all of the hits on the website would require some work. But, our code above was able to download all of these sequences super quickly. It requires some work up front to learn how to set up the query, but if we plan to perform this task many times (imagine we planned to download some number of genes for every mammal species) then it would be well worth it for the automation (and of course it's good for reproduciblity).

# parse the fasta data and print only headers
fna = [i for i in fastas.strip().split("\n\n")]
for seq in fna:
    print(seq.split("\n")[0][:120], '...')

>AF512949.1 Pongo pygmaeus forkhead box P2 (FOXP2) mRNA, complete cds ...
>AF515053.1 Pongo pygmaeus forkhead box P2 (FOXP2) gene, exons 4 through 7 and partial cds ...
>AY143181.1 Pongo pygmaeus FOXP2 protein (FOXP2) mRNA, complete cds ...
>GQ911141.1 Lagothrix lagotricha FOXP2 gene, promoter region and 5' UTR ...
>GQ911136.1 Macaca nigra FOXP2 gene, promoter region and 5' UTR ...
>AF512947.1 Pan troglodytes forkhead box P2 (FOXP2) mRNA, complete cds ...
>AF515051.1 Pan troglodytes forkhead box P2 (FOXP2) gene, exons 4 through 7 and partial cds ...
>AF515052.1 Pan troglodytes forkhead box P2 (FOXP2) gene, exons 4 through 7 and partial cds ...
>AY143178.1 Pan troglodytes FOXP2 protein (FOXP2) mRNA, complete cds ...
>AY406744.1 Pan troglodytes FOXP2 gene, VIRTUAL TRANSCRIPT, partial sequence, genomic survey sequence ...
>GQ911138.1 Macaca fascicularis FOXP2 gene, promoter region and 5' UTR ...
>AY406745.1 Mus musculus FOXP2 gene, VIRTUAL TRANSCRIPT, partial sequence, genomic survey sequence ...
>AY406743.1 Homo sapiens FOXP2 gene, VIRTUAL TRANSCRIPT, partial sequence, genomic survey sequence ...
>GQ911135.1 Macaca nemestrina FOXP2 gene, promoter region and 5' UTR ...
>AY143179.1 Pan paniscus FOXP2 protein (FOXP2) mRNA, complete cds ...
>GQ911139.1 Erythrocebus patas FOXP2 gene, promoter region and 5' UTR ...
>GQ911140.1 Saguinus labiatus FOXP2 gene, promoter region and 5' UTR ...
>AF512950.1 Macaca mulatta forkhead box P2 (FOXP2) mRNA, complete cds ...
>GQ911142.1 Macaca mulatta FOXP2 gene, promoter region and 5' UTR ...
>JX502243.1 Rhinolophus pearsonii voucher YL23 forkhead box protein P2 (FoxP2) gene, partial cds ...

# print a fasta sequence
print(fna[0])

>AF512949.1 Pongo pygmaeus forkhead box P2 (FOXP2) mRNA, complete cds
ATGATGCAGGAATCTGTGACAGAGACAATAAGCAACAGTTCAATGAATCAAAATGGAATGAGCACTCTAA
GCAGCCAATTAGATGCTGGCAGCAGAGATGGAAGATCAAGTGGTGACACCAGCTCTGAAGTAAGCACAGT
AGAACTGCTGCATCTGCAACAACAGCAGGCTCTCCAGGCAGCAAGACAACTTCTTTTACAGCAGCAAACA
AGTGGATTGAAATCTCCTAAGAGCAGTGATAAACAGAGACCACTGCAGGTGCCTGTGTCAGTGGCCATGA
TGACTCCCCAGGTGATCACCCCTCAGCAAATGCAGCAGATCCTTCAGCAACAAGTCCTGTCTCCTCAGCA
GCTCCAAGCCCTTCTCCAACAACAGCAGGCCGTCATGCTGCAGCAGCAACAACTACAAGAGTTTTACAAG
AAACAGCAAGAGCAGTTACATCTTCAGCTTTTGCAGCAGCAGCAACAGCAGCAGCAGCAACAACAGCAGC
AACAACAGCAGCAGCAGCAACAACAACAACAGCAGCAGCAACAGCAGCAGCAGCAGCAACAGCAGCAGCA
GCAGCAACAGCATCCTGGAAAGCAAGCGAAAGAGCAGCAGCAGCAGCAGCAGCAACAGCAGTTGGCAGCC
CAGCAGCTTGTCTTCCAGCAGCAGCTTCTCCAGATGCAACAACTCCAGCAGCAGCAGCATCTGCTCAGCC
TTCAGCGTCAGGGACTCATCTCCATTCCACCTGGCCAGGCAGCCCTTCCTGTCCAATCGCTGCCTCAAGC
TGGCTTAAGTCCTGCTGAGATTCAGCAGTTATGGAAAGAAGTGACTGGAGTTCACAGTATGGAAGACAAT
GGCATTAAACATGGAGGGCTAGACCTCACTACTAACAATTCCTCCTCGACTACCTCCTCCACCACTTCCA
AAGCATCACCACCAATAACTCATCATTCCATAGTGAATGGACAGTCTTCAGTTCTAAATGCAAGACGAGA
CAGCTCGTCACATGAGGAGACTGGGGCCTCTCACACTCTCTATGGCCATGGAGTTTGCAAATGGCCAGGC
TGTGAAAGCATTTGTGAAGATTTTGGACAGTTTTTAAAGCACCTTAACAATGAACACGCATTGGATGACC
GAAGCACTGCTCAGTGTCGAGTGCAAATGCAGGTGGTGCAACAGTTAGAAATACAGCTTTCTAAAGAACG
CGAACGTCTTCAAGCAATGATGACCCACTTGCACATGCGACCCTCAGAGCCCAAACCATCTCCCAAACCT
CTAAATCTGGTGTCTAGTGTCACCATGTCGAAGAATATGTTGGAGACATCCCCACAGAGCTTACCTCAAA
CCCCTACCACACCAACGGCCCCAGTCACCCCGATTACCCAGGGACCCTCAGTAATCACCCCAGCCAGTGT
GCCCAATGTGGGAGCCATACGAAGGCGACATTCAGACAAATACAACATTCCCATGTCATCAGAAATTGCC
CCAAACTACGAATTTTATAAAAATGCAGATGTCAGACCTCCATTTACTTATGCAACTCTCATAAGGCAGG
CTATCATGGAGTCATCTGACAGGCAGTTAACACTTAATGAAATTTACAGCTGGTTTACACGGACGTTTGC
TTACTTCAGGCGTAATGCAGCAACTTGGAAGAATGCAGTACGTCATAATCTTAGCCTGCACAAGTGTTTT
GTTCGAGTAGAAAATGTTAAAGGAGCAGTATGGACTGTGGATGAAGTAGAATACCAGAAGCGAAGGTCAC
AAAAGATAACAGGAAGTCCAACCTTAGTAAAAAATATACCTACCAGTTTAGGCTATGGAGCAGCTCTTAA
TGCCAGTTTGCAGGCTGCCTTGGCAGAGAGCAGTTTACCTTTGCTAAGTAATCCTGGACTGATCAATAAT
GCATCCAGTGGCCTACTGCAGGCCGTCCACGAAGACCTCAATGGTTCTCTGGATCACATTGACAGCAATG
GAAACAGTAGTCCGGGCTGCTCGCCTCAGCCGCACATACATTCAATCCATGTCAAGGAAGAGCCAGTGAT
TGCAGAGGATGAAGACTGCCCAATGTCCTTAGTGACAACAGCTAATCACAGTCCAGAATTAGAAGACGAC
AGAGAGATTGAAGAAGAGCCTTTATCTGAAGATCTGGAATGA

Choosing sequences¶

Depending on the type of investigation you are doing you may have different reasons for accessing public sequence data online. If you only need the information from a BLAST search to identify the source of some genetic material then a simple web query can get the job done. If you have a sequenced gene for one or more species and you want to compare it to many other sequences using phylogenetic methods then you will want to sample at least one sequence from each of the species you are interested in. And if you are interested in a specific gene and how it is expressed and translated into a protein then you may want to find all of the alternative transcripts available for that gene. Finding the specific type of data that you want, and filtering through the other stuff on NCBI can be grueling, and confusing. But there's a wealth of data out there waiting to be mined, and many interesting things that can be done with it.

This was just an introduction to API methods for querying sequence data. You will not be tested on the details of learning code to make API queries. But we will use API calls in later notebooks to find and grab sequence data from online, so it is worth becoming familiar with.

Question [12]: Think of a study that you might complete using publicly available genomic data. What type of genomic regions/features would you target, and from what organisms? What is a question you would like to answer? Please provide a long form answer here, written in full paragraphs. Think creatively, and search online to see if the type of data you are proposing is available. Answer in Markdown below.

EXAMPLE: I'm interested in the development of floral diversity in angiosperms, so I could target genes that are associated with floral development (e.g., CYCLOIDEA) and get the sequences across a large number of families and genera. In OrthoDB I found 144 genes from 73 species for this gene. I would like to know if the rates of evolution of this gene varies among plant groups, and if it is associated with floral morphological diversity. I could try to answer this question by aligning all of the sequences and building a phylogeny, and then estimating branch lengths.

Save and download this notebook to upload to courseworks when finished.