
Notebook 12.0 RESTful GBIF

This notebook will introduce you to methods for accessing and parsing data from online databases by querying websites that have RESTful APIs.

Learning objectives

  • Gain familiarity with the GBIF biological database
  • Understand the structure of REST API requests
  • Be able to use the requests Python library to query REST APIs

GBIF - The Global Biodiversity Information Facility

Check out how amazing GBIF is as a source of biological data. Wouldn't it be great if you could automate access to the information here?

Required software

Install the following with conda before running this notebook

# conda install pandas requests toyplot -c conda-forge
import toyplot
import requests
import pandas as pd

The design of REST APIs

The idea behind a REST API is that data on a server (like a webpage) can be accessed with a consistent type of argument in the form of a URL, and that the queried data will be returned in a form that is easy to analyze (usually JSON or XML), as opposed to messy HTML that needs to be parsed. Many websites have REST APIs, but some are much easier to use than others.
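As a concrete preview of what this looks like in practice (using the same GBIF query we will build step by step below), a REST request is just a URL with parameters attached, and the response comes back as structured JSON rather than HTML:

# a minimal sketch previewing the GBIF query built later in this notebook
import requests

# the query parameters become part of the URL (.../search?q=Bombus)
response = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"q": "Bombus"},
)

# the response body is JSON: a structured set of key:value pairs, not raw HTML
print(list(response.json().keys()))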

Limits

Many REST APIs have limits on the way that you can use them. For example, the REST APIs for Twitter, GitHub, and Reddit require that you log in in order to access data, and some sites will throttle how many requests you can make per hour. In this notebook we will focus on accessing data from websites that are designed to be queried -- they purposefully created an API for this purpose. It's worth noting that there exist other methods to parse data from the raw HTML representation of websites, but that is not our focus here. By contrast, sites with API access typically have data stored in a server database that allows efficient access to even enormous datasets with billions of records (e.g., Twitter).

Good REST APIs

A good REST API will have good documentation explaining its intended usage. Two very good examples are the USDA Bison API and the Global Biodiversity Information Facility (GBIF) API. We'll focus on the latter in this notebook. The GBIF database is an international effort to collect all observation data on plants, animals, and fungi into a single place where it can be searched. It is actually a conglomeration of many separate databases, with data from museums and similar institutions all over the world. These APIs are free to use but request that you cite them if the data is eventually used in a publication.

(Although GBIF is free to use, it does limit the size of each request you can make at once; there are ways around this, as explained in its documentation.)

What does GBIF do?

GBIF can be used to find specimen collection records, or other types of observation data, stored in museum-type databases. The website offers a convenient way to request taxa by name and to select specific search criteria. For example, if we wanted to find all specimens of bumblebees (genus Bombus) that were collected between 1910 and 1920, we could request this through the website. It will draw a nice map with their locations, and you can download a table with coordinates of where they were collected. This is actually one of the best-organized databases around, since it already makes the data quite easy to download from the website; nevertheless, we'll use it as our example for learning REST APIs.

REST APIs and websites

It's worth noting that the GBIF website itself is designed around the REST API. When you fill in the search form on the site to tell it to search a particular species from a particular area, its response is collected from the database in the same way that your response would be when you make the same request using a URL passed to the API. The website simply takes this response (likely in JSON format) and renders it nicely into a table or map for you to view on the website. This is how many websites work!

Even though GBIF has a very nice web interface, it is obviously often more efficient to be able to query this database programmatically, instead of having to type each name we wish to search, and click on several buttons. This can provide a much more powerful way of applying filters over many different types of searches. That is the idea behind REST APIs and the reason why GBIF provides one.

The base-url

The base URL is the web address of the API. This is simply a string that we will add arguments to in order to request that particular types of data be returned to us from the database. For GBIF this is the following, which we'll store as a string for now. This base-url address is given right at the top of the GBIF API documentation. You can see that it looks much like any other web address.

# store the base url as a string variable
baseurl = "http://api.gbif.org/v1/occurrence/search?"

How to query GBIF

As you can see in the URL below, an API query just has additional arguments added to the baseurl. The string below searches for records with the name Bombus, which is the genus for bumblebees. We've added just the 'query' option 'q' and the name. How did I know that 'q' was a valid parameter for this API? By reading the documentation.

We'll see next how to make more complex queries. But first, copy the URL below (without the quotation marks around it) and paste it into a web browser. This will show you what the returned data looks like. It might look a bit different depending on which browser you are using (I recommend using Firefox or Chrome), but the underlying data is the same, and is called JSON data.

# store an endpoint request as a string variable
search_url = "http://api.gbif.org/v1/occurrence/search?q=Bombus"

JSON format

We will be using the requests library to get data from the web, but before we do, let's talk a bit about how the data will be structured so we know what to expect. The data that you should see in your browser now is called JSON formatted data. You'll notice that this format is almost identical to what a Python dictionary looks like: it is composed of key:value pairs. This will make it particularly easy to work with.
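To make the analogy concrete, here is a tiny sketch using Python's built-in json module and a made-up snippet of data (not a real GBIF response), showing that JSON text parses directly into a Python dictionary:

import json

# a small, made-up JSON string, similar in shape to what GBIF returns
text = '{"offset": 0, "limit": 20, "count": 2, "results": [{"genus": "Bombus"}]}'

# json.loads converts the JSON text into nested Python dicts and lists
data = json.loads(text)
print(data["count"])         # 2
print(data["results"][0])    # {'genus': 'Bombus'}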

Requests

Documentation

The requests package works a bit like an automated web browser. We've used requests briefly in the past, but now we'll start to use it more effectively. The main function we will call is .get(), which sends a GET command (one of the HTTP methods the web is built on) to the web address and returns a Response class object. We will then access attributes and functions of the Response instance to see if our request worked, and to parse the resulting text from it. Let's try this on our search_url string defined above.

# create a Response instance from a request
response = requests.get(search_url)
# check that your request worked (200 = success; other codes indicate a problem)
response.status_code 
# or, run this to check if it worked.
# This raises an error if the request failed (else it returns None)
response.raise_for_status()

Parse a Response

Previously, when we used requests, we parsed the results as plain text, since the data was usually easiest to work with as a string (we used requests in an earlier notebook to download iris-data-dirty.csv). In this case, we are going to access the data a bit differently, in JSON format, which is available from the Response object just like the text is. The plain text is not very easy to read or parse, whereas the JSON version can be accessed and searched much more easily.

# first 500 characters of the .text string from GBIF API query
response.text[:500]

As you can see above, the result is a string. But we want it to be parsed as a dictionary. The .json() function of the response object will do this for us.

# or, get results as a dictionary (JSON converted)
rdict = response.json()

# get some quick info on the dictionary keys
list(rdict.keys())

Parsing the results

In GBIF our response can be parsed into a dictionary object using the JSON format, and this has the six keys shown above. These are explained in the API docs and correspond to information about which records are available for our query. However, the response did not return all of the data for those records yet. That would be too easy. Instead, databases usually limit the amount of data returned per request, both to limit the bandwidth needed for sending the data and to keep responses fast. For GBIF the default number of records, shown under the "limit" key, is 20, and the default starting position, shown under "offset", is 0. The total number of records is in "count". So for Bombus, as we show below, there are >2M records, but only records 1-20 were returned to us so far.

## how many records are there for this query
rdict["count"]
## how many records were returned
rdict["limit"]
## starting from which record
rdict["offset"]

So where's the data?

It's stored under the results key, and is returned as a list of dictionaries, where each dictionary is a record with lots of information. Below I show the first record from our search.

# here is the first record, it's also a dictionary
rdict["results"][0]

There are too many columns to see them all here. Below we load the results into a dataframe and call .columns to see all of the column names printed as a list.

# load as a dataframe
sdf = pd.json_normalize(rdict['results'])
sdf.head()
sdf.columns

Building a request

Here we add more arguments to further filter the results. To see which options are available, you can either look at the results from our calls so far, or you can read further into the API docs. Sometimes API docs will be incomplete though, so it can be useful to learn to infer which options are possible by looking at the results. A more complex search is accomplished by building a URL with more key:value pairs, each appended to the end of the URL and separated by a "&" symbol. For large searches this begins to get difficult to write out by hand, and that is where requests comes in handy. Here we enter the additional arguments we want as a simple Python dictionary passed to the 'params' argument.

# previously we wrote this request by hand
urlpath = "http://api.gbif.org/v1/occurrence/search?q=Bombus"
# here we create the same urlpath using params
response = requests.get(
    url="https://api.gbif.org/v1/occurrence/search/",
    params={"q": "Bombus"}
)

# show url path
print(response.url)

Narrowing request/responses

If you looked closely at the results above you may have noticed that the records returned are not actually all for organisms in the genus Bombus. Instead results include things like Chaetocercus bombus and other organisms that happen to have "bombus" in their names.

This is why it's important to look closely at your data. Looking back at the documentation, we can see that the 'q=something' search parameter returns a fuzzy hit to anything that has the query in its data. If we instead want to restrict results to the genus Bombus, we need to find the genusKey for Bombus. This can be found using the 'species' endpoint in the API, so let's take a side track to find it. Note we are searching a different base URL now, looking in the 'species' path instead of the 'occurrence' path.

The results below provide unique identifiers that are more reliable for searching the database. We will use the genusKey=1340278 for our next search of the occurrence database.

# get taxonomy info for the genus Bombus
res = requests.get(
    url="https://api.gbif.org/v1/species/match/",
    params={"genus": "Bombus"},
)
res.json()
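If you prefer to pull the key out programmatically rather than reading it off the printed result, it should be available under the 'genusKey' key of the JSON response (assuming the match succeeded and returned a genusKey field):

# extract the genus key from the match result (assumes a successful genus match)
genus_key = res.json()["genusKey"]
print(genus_key)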

Here is a different search for the genus Pedicularis. This is a group of plants that I study. You can see that it returns a different set of taxonomic keys. Feel free to try searching a taxon of your choice.

# get taxonomy info for the genus Pedicularis
res = requests.get(
    url="https://api.gbif.org/v1/species/match/",
    params={"genus": "Pedicularis"},
)
res.json()

Building more complex queries

Below I show the URL for when we add the requirement that a record have coordinate data, and for when we add additional arguments to control the offset and limit of the records returned. The maximum number of records returned at a time (the 'limit' parameter) is 300; above that you need to increment the 'offset' to fetch later records. You can see that the URL simply appends additional key=value pairs after the ? symbol to build more complex queries.

# add requirement that the record have coordinate data
res = requests.get(
    url="https://api.gbif.org/v1/occurrence/search/",
    params={
        "genusKey": 1340278, 
        "hasCoordinate": "true",
    }
)
res.url
# request records 100-119 (offset=100, limit=20)
res = requests.get(
    url="https://api.gbif.org/v1/occurrence/search/",
    params={
        "genusKey": 1340278, 
        "hasCoordinate": "true",
        "offset": 100,
        "limit": 20,
    }
)
res.url

Here I request all Bombus records from 1900-1910 that are associated with a preserved specimen (as opposed to HUMAN_OBSERVATION or FOSSIL_SPECIMEN), have spatial data, and are in the US. The 'count' shows us that there are >6000 records meeting these requirements. This individual search returned only 20 of those results though, and can return at most 300 at a time. So we will need to use a trick to get all of the records.

res = requests.get(
    url="https://api.gbif.org/v1/occurrence/search/",
    params={
        "genusKey": 1340278, 
        "year": "1900,1910", 
        "basisOfRecord": "PRESERVED_SPECIMEN",
        "hasCoordinate": "true",
        "hasGeospatialIssue": "false",
        "country": "US",
    },
)

print(res.json()["count"])

Combining many searches

If we want to collect all records for a given search then we need to increment the "offset" argument until we reach the end of the records. Each batch is returned as a list of dictionaries, so we can just join all of those lists together and return them. That sounds a bit complex, so let's do it in two parts: first we'll write a function to fulfill a single request, and then a function to call many requests.

def get_single_batch(genusKey, year, offset=0, limit=20):
    """
    Returns a GBIF REST query with records between offset
    and offset + limit in JSON format. The genusKey and 
    year interval can be changed.
    """
    res = requests.get(
        url="https://api.gbif.org/v1/occurrence/search/",
        params={
            "genusKey": genusKey,
            "year": year,
            "offset": offset,
            "limit": limit,
            "hasCoordinate": "true",
            "country": "US",
        }
    )
    return res.json()
# test single batch function
jdata = get_single_batch(
    genusKey=3171670,
    year="1990,2020",
    offset=0, 
    limit=20
)

# how many results were fetched?
print(len(jdata["results"]))
# did we reach the end of the records?
jdata["endOfRecords"]
def get_all_records(genusKey, year):
    """
    Iterate requests over incremental offset positions until
    all records have been fetched. When the last record has
    been fetched the key 'endOfRecords' will be 'true'. Takes
    the API params as a dictionary. Returns result as a list
    of dictionaries.
    """
    # for storing results
    alldata = []

    # continue until we call 'break'
    offset = 0
    while True:

        # get JSON data for a batch 
        jdata = get_single_batch(genusKey, year, offset, 300)

        # increment counter by 300 (the max limit)
        offset += 300

        # add this batch of data to the growing list
        alldata.extend(jdata["results"])

        # stop when end of record is reached
        if jdata["endOfRecords"]:
            print(f'Done. Found {len(alldata)} records')
            break

        # print a dot on each rep to show progress
        print('.', end='')

    return alldata
# call function to search over all offset values until end. 
# THIS MAY TAKE A FEW MINUTES TO RUN
jdata = get_all_records(1340278, "1900,1902")

The full data

# convert to a data frame
df = pd.json_normalize(jdata)
# keys (columns) in the dataframe (there are many!)
list(df.columns)
# view just the columns we're interested in for now.
sdf = df[["species", "year", "decimalLatitude", "decimalLongitude"]]
sdf.head()
# how many records?
sdf.shape
# which unique species?
print(sdf.species.unique())
# plot the number of each species in order (hover over bars for names)
sp_counts = df.species.value_counts()
toyplot.bars(sp_counts, height=350, title=sp_counts.index);

Hover over the bars in the plot above to see the names of species. From this we can easily see which species have the most records, and which are more rare.

Assignment:

Task 1:

Write a class object called Records that can be given a taxon query and a range of years and will return a class instance that can fetch all results from GBIF for the queried range, using the same params from our example above but allowing the 'genusKey' and 'year' arguments to be changed. You can reuse the code above to create the core functions for your object, and modify it further if you wish. Look at the example below for the intended usage of your Records class. It should do the following:

  1. Store the genusKey and year params during init().
  2. Write the function get_single_batch to take optional arguments that limit the number of results, and have it return a JSON result as a dictionary.
  3. Write the function get_all_records to store the JSON results to the instance object, storing JSON as a dictionary to self.json and as a dataframe to self.df.
# skeleton of a class

class Records:
    def __init__(self, genusKey=None, year=None):

        # store input params
        self.genusKey = genusKey
        self.year = year

        # will be used to store output results
        self.df = None
        self.json = None

    def get_single_batch(self, offset=0, limit=20):
        "returns JSON result for a small batch query"
        # ...

    def get_all_records(self):
        "stores result for all records to self.json and self.df"
        # ...

When finished, you should be able to use your class object in the following way: after writing your Records class, test it using the code below and make sure it gives proper results; otherwise, keep hacking away at it.

You need to write the Records class before executing the code below

# create instance by entering query and a range of years as integers
rec = Records(genusKey=1340278, year="1980,1985")

# show a small result
print(rec.get_single_batch(offset=0, limit=10))

# get all records
rec.get_all_records()

# access all of the returned records as a dataframe 
# (here asking for the shape to see how many records there are)
rec.df.shape

Task 2:

Once you have tested your Record class object in this notebook and it is working, follow these instructions to create a Python package with the following file structure:

records/
├── setup.py
└── records/
    ├── __init__.py
    └── records.py
  1. Create a new GitHub repo called 'records'.
  2. Clone it to your computer.
  3. Create a subfolder called records (records/records/)
  4. Create an init file (records/records/__init__.py)
  5. Create a module called records.py (records/records/records.py)
  6. Open the folder in your text editor (VS Code or Sublime)
  7. Copy your Records class object from here to records.py.
  8. Create a setup.py script to make your package installable (records/setup.py); a minimal example is sketched after this list.
  9. Run pip install -e . from records/ to install your package locally.
  10. Test that your package is installed and working by importing your new package (called records) in the final code cell below and running the code. Try to make the Records class importable in this way. (You may need to restart your notebook.) Seek help if you get stuck.
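As mentioned in step 8, a minimal setup.py might look like the sketch below. This assumes you use setuptools; the metadata values are placeholders to adjust for your own repo.

# records/setup.py -- a minimal sketch using setuptools; edit metadata to suit your repo
from setuptools import setup, find_packages

setup(
    name="records",
    version="0.0.1",
    description="Fetch occurrence records from the GBIF REST API",
    packages=find_packages(),
    install_requires=["requests", "pandas"],
)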
# import your library
import records

# get an instance given some query parameters
rec = records.Records(genusKey=1340278, year="1990,2000")

# access the dataframe results
print(rec.get_single_batch())

[Optional] Advanced challenge

If you accomplished this task easily, then try adding a function to your package that allows entering the genus as a string; your code should search the taxonomic database to automatically find the integer genusKey to use in the occurrence database query, instead of requiring you to enter it as a numeric genusKey.

Assessment

You will be graded on the records class package in your GitHub repo. This notebook is only for instructions and learning and does not need to be submitted.