This notebook is intended to accompany the following assigned reading:
By the end of this notebook you should:
BLAST is an algorithm and command line program for comparing sequences and computing a similarity score. It is not, however, the name of a sequence database. In fact, you can perform BLAST searches against many different sequence databases, or even create your own database to search with BLAST. This is often a point of confusion when first learning about BLAST.
NCBI (National Center for Biotechnology Information) is a U.S. institution that hosts genomic data and tools. It is the home of the online BLAST server that we just used, as well as the Entrez database, which is an enormous relational database for storing metadata about sequence data that has been made publicly available. The Entrez system includes tools for querying sequences by unique identifier (GI codes), by sequence name, by sequence similarity (using BLAST), or by taxonomy (Entrez hosts a complex and continually updated taxonomy).
In the BLAST and OrthoDB examples we ran in the last notebook my instructions had you click on buttons and enter text into web forms. Using the tools this way was convenient because the results of your search were displayed nicely in a browser at the end. But a major shortcoming of this approach is that it was not easy to keep a record of which links you clicked on, and therefore hard to communicate to someone else later how you did what you did, and how to repeat it. This is a major problem in genomics, and in science generally. Software with a graphical user interface (GUI) is easy to use, but hard to repeat and share.
A good solution is to use browser/GUI based tools for exploratory analyses to initially find results, and then, after you have found the data you wish to retrieve, to follow up by writing code that can be executed to repeat your analysis. This is the goal of the section below, which describes the use of an Application Programming Interface (API) to interact with web-based tools using Python code.
Genomic data are stored in the Entrez database according to a complex hierarchy. We will not delve into the details of this, but the gist is that just about everything has a type of ID associated with it. There is a species-ID to identify the species used in any study; there is a sample-ID for the specific individuals included in a study; there is a run-ID for the sequencing run used to generate genomic data for a study; and there are study-IDs and project-IDs that record how all of these other IDs are related to each other within a research project. Entrez stores all of this information, and in doing so makes it possible to retrieve the data later using complex searches. For this, the Entrez database has been designed to be accessible both through online searches in a web browser and by using API code.
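To picture the hierarchy, here is a purely illustrative sketch in Python of how these IDs might nest within one study. The accession-style values below are made up for illustration (though the prefixes PRJNA, SAMN, and SRR follow real NCBI conventions, and 9606 is the NCBI taxonomy ID for Homo sapiens); this is not how Entrez actually stores records internally:

```python
# a toy sketch of how Entrez-style IDs relate within a study;
# all accession numbers here are invented for illustration
study = {
    "project-ID": "PRJNA000001",       # the umbrella research project
    "species-ID": 9606,                # NCBI taxonomy ID (Homo sapiens)
    "samples": {                       # individual samples in the study...
        "SAMN000001": {"run-IDs": ["SRR000001", "SRR000002"]},
        "SAMN000002": {"run-IDs": ["SRR000003"]},
    },
}
```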
To use the API simply means to use code to request data from particular URLs instead of clicking on many links in a browser. There is documentation on NCBI for how to use the Entrez API (https://www.ncbi.nlm.nih.gov/books/NBK25501/); I don't expect you to read it. Many other online tools and programs have APIs as well. This section is meant to demonstrate their utility, but I do not expect you to memorize how to access any particular API. It is typical that you need to read the documentation for a specific database to figure it out.
We can query the Entrez database using the Python package `requests`, which we have used before to download data from the web. The `requests` package has a simple framework for building API web queries from a Python dictionary. Let's learn through some examples below.
import requests
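To see how `requests` turns a dictionary of parameters into a query URL, we can build a request without actually sending it. This sketch uses `requests.Request` and `.prepare()`, which construct the URL locally; the search term here is just an example:

```python
import requests

# build (but do not send) a request to inspect the URL it constructs
req = requests.Request(
    "GET",
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "nucleotide", "term": "FOXP2[GENE]"},
)
prepared = req.prepare()

# the params dict is percent-encoded into the query string of the URL
print(prepared.url)
```

Every key-value pair in the dictionary becomes a `key=value` piece of the query string, with special characters like `[` and `]` percent-encoded automatically.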
Esearch is an Entrez tool for searching data based on a text query. We can search terms in PubMed and get the abstracts that match them, or we can search taxon names or project-IDs of published genomic studies and get the accession numbers of all of the samples in a study. To make a search we start by telling NCBI that we are using the `esearch` tool (supplied in the `url` argument below), then we supply arguments to it about which database to search and which term to search for (using the `params` argument). Finally, we need to tell the API who we are by supplying an email address (you won't get spammed).
# search term
# term = "SARS-CoV-2[ORGN] complete genome"
term = "FOXP2[GENE] AND Mammalia[ORGN] AND phylogenetic study[PROP]"
# make a request to esearch
res = requests.get(
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "nucleotide",
        "term": term,
        "sort": "Organism Name",
        "retmode": "text",
        "retmax": "20",
        "tool": "genomics-course",
        "email": "student@columbia.edu",
    },
)
The requests object builds a URL string that it uses to query data from the internet. The results are then returned as a response object, which we stored in the variable `res`. I show the full URL below. The results of the query are accessible from the `.text` attribute.
# the url built from the command above
res.url
# get the text result of the request
res.text
OK, the text result above looks a little scary, but it's actually mostly composed of metadata about the parameters of our query. The only thing we want to parse out of it is the part between the `<Id>` tags. I wrote a small function below to do this, which we will use. It just uses the `.split()` method of string objects to split the string into multiple parts on delimiters. The final result of our `esearch` request is a list of the unique IDs of sequences matching our term.
def parse_ids(xml):
    "parses xml output to get unique IDs (uids)"
    uids = []
    # keep only the part of the XML between the <IdList> tags
    xml = xml.split("<IdList>")[1].split("</IdList>")[0].strip()
    for item in xml.split("\n"):
        # strip whitespace, then drop the "<Id>" and "</Id>" tags
        uids.append(item.strip()[4:-5])
    return uids
uids = parse_ids(res.text)
uids
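String splitting works here, but Python's standard library can also parse XML properly, which is more robust if the formatting of the response ever changes. Below is an alternative parser using `xml.etree.ElementTree`; the XML snippet is a shortened, made-up example in the shape of an esearch response, and the function name `parse_ids_xml` is our own:

```python
import xml.etree.ElementTree as ET

# a shortened, made-up example of esearch-style XML output
xml_text = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>1841323421</Id>
    <Id>1841323419</Id>
  </IdList>
</eSearchResult>"""

def parse_ids_xml(xml_string):
    "parse esearch XML with ElementTree instead of string splitting"
    root = ET.fromstring(xml_string)
    # iterate over every <Id> element and collect its text
    return [node.text for node in root.iter("Id")]

parse_ids_xml(xml_text)
```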
The `efetch` tool is meant to be used to get actual sequence data from NCBI based on the unique IDs that you obtain with `esearch`. I know, it seems like it would be easier if these two things acted as a single tool. Oh well. Now that we have the unique IDs we can use them to get data from the nucleotide database. I use the string `join` method below to combine the list of uids above into a comma-separated string to use as an argument in the params.
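As a quick reminder of how `join` works, it is called on the separator string and takes a list of strings as its argument (the IDs below are made up for illustration):

```python
# join a list of ID strings into one comma-separated string
example_uids = ["1841323421", "1841323419", "1841323417"]
joined = ",".join(example_uids)
print(joined)  # → 1841323421,1841323419,1841323417
```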
# make a request to efetch
res = requests.get(
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    params={
        "db": "nucleotide",
        "id": ",".join(uids),
        "strand": "1",
        "retmode": "text",
        "rettype": "fasta",
        "tool": "genomics-course",
        "email": "student@columbia.edu",
    },
)
# show the url query
res.url
# the fasta data returned from NCBI
fastas = res.text
Go to the nucleotide database search page of NCBI (https://www.ncbi.nlm.nih.gov/nuccore/) and search for the same term that we used above (`FOXP2[GENE] AND Mammalia[ORGN] AND phylogenetic study[PROP]`). This will find the FOXP2 gene in any mammal from any study that was listed as a phylogenetic study. The order may be a bit different. Downloading the fasta sequences of all of the hits through the website would require some work, but our code above was able to download all of these sequences very quickly. It takes some work up front to learn how to set up the query, but if we plan to perform this task many times (imagine we planned to download some number of genes for every mammal species) then it would be well worth it for the automation (and of course it's good for reproducibility).
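As a sketch of what that automation could look like, the two steps above can be wrapped into a single function. This is our own illustration, not an NCBI-provided tool: the name `fetch_fasta`, its defaults, and the simple string-splitting of the XML (which assumes the search returns at least one hit) are all assumptions layered on the same endpoints and parameters we used above.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def fetch_fasta(term, email, retmax=20, db="nucleotide"):
    "search Entrez for `term` and return matching sequences in fasta format"
    # step 1: esearch to get the unique IDs of sequences matching the term
    res = requests.get(
        url=f"{EUTILS}/esearch.fcgi",
        params={
            "db": db,
            "term": term,
            "retmax": str(retmax),
            "retmode": "text",
            "tool": "genomics-course",
            "email": email,
        },
    )
    # parse IDs from between the <IdList> tags (assumes >= 1 hit)
    xml = res.text.split("<IdList>")[1].split("</IdList>")[0].strip()
    uids = [i.strip()[4:-5] for i in xml.split("\n")]

    # step 2: efetch to download fasta sequences for those IDs
    res = requests.get(
        url=f"{EUTILS}/efetch.fcgi",
        params={
            "db": db,
            "id": ",".join(uids),
            "retmode": "text",
            "rettype": "fasta",
            "tool": "genomics-course",
            "email": email,
        },
    )
    return res.text

# usage would look like:
# fastas = fetch_fasta("FOXP2[GENE] AND Mammalia[ORGN]", "student@columbia.edu")
```

With a function like this, downloading many genes across many species becomes a loop over search terms rather than a series of clicks.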
# parse the fasta data and print only headers
fna = fastas.strip().split("\n\n")
for seq in fna:
    print(seq.split("\n")[0][:120], '...')
# print a fasta sequence
print(fna[0])
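If you want to go beyond printing, the fasta records can also be parsed into a dictionary mapping sequence names to sequences. Here is a minimal sketch using a tiny made-up fasta string; real NCBI records have much longer headers and sequences, and the parsing assumes records are separated by blank lines as above:

```python
# a tiny example fasta string (two made-up records) to demonstrate parsing
fasta_text = (
    ">seq1 Homo sapiens FOXP2\nATGATG\nCCGT\n"
    "\n"
    ">seq2 Mus musculus FOXP2\nATGTTT\nGGGA"
)

seqs = {}
for record in fasta_text.strip().split("\n\n"):
    lines = record.split("\n")
    name = lines[0].split()[0][1:]    # first word of the header, minus ">"
    seqs[name] = "".join(lines[1:])   # concatenate the sequence lines

print(seqs)  # → {'seq1': 'ATGATGCCGT', 'seq2': 'ATGTTTGGGA'}
```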
Depending on the type of investigation you are doing, you may have different reasons for accessing public sequence data online. If you only need the information from a BLAST search to identify the source of some genetic material, then a simple web query can get the job done. If you have a sequenced gene for one or more species and you want to compare it to many other sequences using phylogenetic methods, then you will want to sample at least one sequence from each of the species you are interested in. And if you are interested in a specific gene and how it is expressed and translated into a protein, then you may want to find all of the alternative transcripts available for that gene. Finding the specific type of data that you want, and filtering through everything else on NCBI, can be grueling and confusing. But there is a wealth of data out there waiting to be mined, and many interesting things that can be done with it.
This was just an introduction to API methods for querying sequence data. You will not be tested on the details of writing code to make API queries. But we will use API calls in later notebooks to find and grab sequence data online, so it is worth becoming familiar with them.