Programming and Data Science for Biology
(EEEB G4050)

Lecture 12: Accessing biological data online with REST APIs

Lecture 12.0 Outline:

  • Data formats and practices
  • Using a REST API

Data structures and formats

As a scientist you have likely spent time recording data in a spreadsheet. Tabular data (e.g., CSV or TSV) is a common standard in data science, but a few other formats, like XML or JSON, are also popular.

CSV data

CSV (comma-separated values) is a flat data format. You only need to know
the row and column indices to find a value. Typically, each row is an observation,
and common factors (e.g., setosa) are repeated across many rows.

              
                import pandas as pd
                pd.read_csv("https://eaton-lab.org/data/iris-data-dirty.csv", header=None).head(10)
              

                   0    1    2    3            4
              0  5.1  3.5  1.4  0.2  Iris-setosa
              1  4.9  3.0  1.4  0.2  Iris-setosa
              2  4.7  3.2  1.3  0.2  Iris-setosa
              3  4.6  3.1  1.5  0.2  Iris-setosa
              4  5.0  3.6  1.4  0.2  Iris-setosa
              5  5.4  3.9  1.7  0.4  Iris-setosa
              6  4.6  3.4  1.4  0.3  Iris-setosa
              7  5.0  3.4  1.5  0.2  Iris-setosa
              8  4.4  2.9  1.4  0.2  Iris-setosa
              9  4.9  3.1  1.5  0.1  Iris-setosa
              

CSV data: best practices

- record data in tabular form (repetitive columns and all).
- begin data analysis with a cleaning step (e.g., filter missing, relabel typos).
- document the cleaning step (keep the raw data file and cleaning script/nb).
- perform analyses on clean data and archive both data files (e.g., on GitHub/Zenodo).

              
                # load tabular data
                df = pd.read_csv(
                    "https://eaton-lab.org/data/iris-data-dirty.csv", 
                    names=["trait1", "trait2", "trait3", "species"],
                )

                # analyses reveal inconsistencies (mislabeled spp names)
                df.groupby("species").mean()
              

                                    trait1    trait2    trait3
                species                                       
                Iris-setosa       3.418367  1.461224  0.244898
                Iris-setsa        3.400000  1.600000  0.200000
                Iris-versicolor   2.761702  4.251020  1.324490
                Iris-versicolour  3.200000  4.700000  1.400000
                Iris-virginica    2.974000  5.552000  2.026000
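
The cleaning step described above can be sketched as follows. This is a minimal example, assuming the two misspelled labels visible in the table are the only typos. The small `raw` table is a hypothetical stand-in for the downloaded file (its trait values are made up for illustration), and `fixes` is an illustrative name.

```python
import pandas as pd

# hypothetical stand-in for the raw table; trait values are illustrative only
raw = pd.DataFrame({
    "trait3": [0.2, 0.2, 1.3, 1.4, 2.0, None],
    "species": [
        "Iris-setosa", "Iris-setsa", "Iris-versicolor",
        "Iris-versicolour", "Iris-virginica", "Iris-virginica",
    ],
})

# map each mislabeled name to its corrected spelling
fixes = {"Iris-setsa": "Iris-setosa", "Iris-versicolour": "Iris-versicolor"}

# relabel typos and drop rows with missing values; `raw` itself is left untouched
clean = raw.assign(species=raw["species"].replace(fixes)).dropna()

# the grouped summary now has one row per true species
print(clean.groupby("species").size())
```

Keeping `raw` unchanged and writing the fixes as an explicit mapping documents the cleaning step in the script itself.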
              

JSON data

- JavaScript Object Notation (in Python, think of it as a dictionary).
- a hierarchical data format (nested key:value pairs).
- commonly used on the web.

              
                # example: a list of two records
                [
                    {
                        "id": 0,
                        "data": {
                            "first": "deren",
                            "last": "eaton",
                            "dog": "phylo"
                        }
                    },
                    {
                        "id": 1,
                        "data": {
                            "first": "john",
                            "last": "smith",
                            "dog": "fido"
                        }
                    }
                ]
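
In Python, a JSON string parses into nested lists and dictionaries. The sketch below uses the standard-library `json` module. The `text` string is an illustrative pair of records written as valid JSON (each record is its own object, with no trailing commas).

```python
import json

# two records as a JSON array; valid JSON has no trailing commas
text = """
[
  {"id": 0, "data": {"first": "deren", "last": "eaton", "dog": "phylo"}},
  {"id": 1, "data": {"first": "john", "last": "smith", "dog": "fido"}}
]
"""

# parse the string into Python objects (a list of dicts)
records = json.loads(text)

# walk down the hierarchy: list item first, then nested keys
for rec in records:
    print(rec["id"], rec["data"]["dog"])

# convert one record back into a JSON string
print(json.dumps(records[0], indent=2))
```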
              

JSON data: best practices

- pandas can be used to convert between JSON and CSV.

              
                # convert dataframe to a json string
                sjson = df.to_json(orient="index")

                # load as a dict
                import json
                print(json.loads(sjson))
              

                {'0': {
                  'trait0': 5.1,
                  'trait1': 3.5,
                  'trait2': 1.4,
                  'trait3': 0.2,
                  'species': 'Iris-setosa'
                  },
                 '1': {
                  'trait0': 4.9,
                  'trait1': 3.0,
                  'trait2': 1.4,
                  'trait3': 0.2,
                  'species': 'Iris-setosa'
                  },
                  ...
              

JSON data: best practices


                # a semi-structured JSON (a list of dicts; one record is missing 'id')
                data = [
                    {'id': 1,
                     'name': "Cole Volk",
                     'fitness': {'height': 130, 'weight': 60},
                    },
                    {
                     'name': "Mose Reg",
                     'fitness': {'height': 130, 'weight': 60},
                    },
                    {'id': 2, 
                     'name': 'Faye Raker',
                     'fitness': {'height': 130, 'weight': 60},
                    },
                ]
                # pandas can load it
                pd.json_normalize(data, max_level=1)
            

                  id        name  fitness.height  fitness.weight
              0  1.0   Cole Volk             130              60
              1  NaN    Mose Reg             130              60
              2  2.0  Faye Raker             130              60
            

Why care about JSON? REST APIs

"Representational State Transfer" (REST) is a set of rules for transferring data over the web using specific URL paths. These paths naturally describe a hierarchical (JSON-like) structure.

REST API terminology

A URL represents a request, and the data sent back to you is called a response.
The URL is called an endpoint, and is composed of a root and path.


                # root for GBIF REST API
                ROOT = "https://api.gbif.org/v1"

                # path for a specific request
                PATH = "/species/search"

                # endpoint = root + path
                ENDPOINT = ROOT + PATH  # https://api.gbif.org/v1/species/search
            

Visit the endpoint: https://api.gbif.org/v1/species/search/
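
Joining a root and a path can also be done with the standard library, which avoids doubled or missing slashes. This is a small sketch; `endpoint` is a hypothetical helper name, not part of any GBIF package.

```python
from urllib.parse import urljoin

# API root, written with a trailing slash so paths are appended to /v1/
ROOT = "https://api.gbif.org/v1/"


def endpoint(path: str) -> str:
    """Join the API root and a path into a full endpoint URL."""
    # strip any leading slash: urljoin treats "/x" as relative to the host,
    # which would silently drop the "/v1/" part of the root
    return urljoin(ROOT, path.lstrip("/"))


print(endpoint("/species/search"))
print(endpoint("occurrence/search"))
```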

REST API terminology

The API request returns a JSON response that (most) browsers will display.
Here we can see there are >30M results (count), but only 20 are shown (limit).
These 20 results are available in the "results" key.


              {
                "offset": 0,
                "limit": 20,
                "endOfRecords": false,
                "count": 30009607,
                "results": [
                  {
                    "key": 0,
                    "datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
                    "constituentKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
                    "kingdom": "incertae sedis",
                    "kingdomKey": 0,
                    ...
                  },
                  {
                    "key": 1,
                    "datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
                    "constituentKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
                    "kingdom": "Animalia",
                    "kingdomKey": 1,
                    ...
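
The paging fields in a response like this one can be read as ordinary dictionary keys. The sketch below assumes, as the field names suggest, that `limit` is the page size and `offset` is the index of the first record on the page. Check the API documentation before relying on this. The `text` string is a trimmed copy of the fields shown above, with the results left empty.

```python
import json
import math

# trimmed copy of the paging fields from the response ("results" left empty here)
text = '{"offset": 0, "limit": 20, "endOfRecords": false, "count": 30009607, "results": []}'
page = json.loads(text)

# number of requests needed to see every record at this page size
n_pages = math.ceil(page["count"] / page["limit"])

# offset at which the next page would start
next_offset = page["offset"] + page["limit"]

print(n_pages, next_offset, page["endOfRecords"])
```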
            

REST API rules

Some API endpoints accept parameters that restrict the results to those matching a query.
Each API is designed differently, so you have to read its documentation. For example, the query
https://api.gbif.org/v1/species/search/?q="Pedicularis rex"
returns only about 13K hits.


              {
                "offset": 0,
                "limit": 20,
                "endOfRecords": false,
                "count": 13678,
                "results": [
                  {
                    "key": 103267516,
                    "datasetKey": "fab88965-e69d-4491-a04d-e3198b626e52",
                    "parentKey": 103266166,
                    "parent": "Pedicularis",
                    "kingdom": "Viridiplantae",
                    "phylum": "Streptophyta",
                    "order": "Lamiales",
                    "family": "Orobanchaceae",
                    "genus": "Pedicularis",
                    "species": "Pedicularis rex",
                    "kingdomKey": 102974832,
                    "phylumKey": 102986054,
                    ...
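
Query parameters are attached to the endpoint after a `?` as `key=value` pairs, and characters such as spaces must be escaped. Here is a minimal sketch of how such a URL is built and decoded with the standard library (the variable names are illustrative):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://api.gbif.org/v1/species/search"
params = {"q": "Pedicularis rex"}

# urlencode escapes the space so the query is safe to place in a URL
url = f"{base}?{urlencode(params)}"
print(url)

# decoding the query string recovers the original value
print(parse_qs(urlsplit(url).query))
```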
            

REST APIs

There are MANY free and publicly available REST APIs, and there are also
many that require registration or even payment to use. We will learn about
some of these in the assignments. Use Google to search for more; you may be surprised.

  • Twitter
  • Reddit
  • GitHub
  • GBIF
  • ProPublica
  • CityBikes
  • USDA soil data
  • NCBI taxonomy
  • Pubmed publications

Query REST APIs in Python

It would be cumbersome to type a URL into your browser for every search.
Instead, we can automate API requests in Python. This is best done with the
requests library (a third-party package).


                import requests

                # search with the simple gbif search parameter q=
                response = requests.get(
                    url="https://api.gbif.org/v1/species/search",
                    params={"q": "Pedicularis rex"},
                )

                print(response.url)
            

                https://api.gbif.org/v1/species/search?q=Pedicularis+rex
            

Query REST APIs in Python


                # search with the simple gbif search parameter q=
                response = requests.get(
                    url="https://api.gbif.org/v1/species/search",
                    params={"q": "Pedicularis rex"},
                )

                # was the request successful? (status_code is an attribute; 200 means OK)
                print(response.status_code)

                # get data as text or as parsed JSON
                print(response.text)
                print(response.json())
            

Query REST APIs in Python


                # search the occurrence database for a specific species
                response = requests.get(
                    url="https://api.gbif.org/v1/occurrence/search",
                    params={"scientificName": "Pedicularis anas"},
                )
                spdict = response.json()

                # convert JSON back into a dataframe for easy viewing
                df = pd.json_normalize(spdict['results'])

                # show specific columns
                df.loc[:8, ['species', 'decimalLongitude','decimalLatitude', 'datasetName']]
            

                            species  decimalLongitude  decimalLatitude                               datasetName
                0  Pedicularis anas        102.606667        32.234167  Harvard University Herbaria: All Records
                1  Pedicularis anas               NaN              NaN                                       NaN
                2  Pedicularis anas               NaN              NaN  Harvard University Herbaria: All Records
                3  Pedicularis anas               NaN              NaN                                       NaN
                4  Pedicularis anas        100.764700        32.488300                                 Ree China
                5  Pedicularis anas        101.916100        32.896400                                 Ree China
                6  Pedicularis anas               NaN              NaN                                 Ree China
                7  Pedicularis anas        102.572800        34.148100                                 Ree China
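
A single response holds only one page of results. Below is a hedged sketch of collecting several pages. It assumes the API accepts `offset` and `limit` query parameters and reports `endOfRecords`, as in the responses shown earlier. `iter_results`, `fetch_page`, and `fake_fetch` are hypothetical names. `fetch_page` can be any function that returns one decoded JSON page, so the loop can be tried offline before it is pointed at a real API.

```python
from typing import Callable, Iterator


def iter_results(
    fetch_page: Callable[[int, int], dict],
    limit: int = 20,
    max_records: int = 100,
) -> Iterator[dict]:
    """Yield records page by page until the end is reported or max_records is reached."""
    offset = 0
    n_yielded = 0
    while n_yielded < max_records:
        page = fetch_page(offset, limit)
        for record in page["results"]:
            yield record
            n_yielded += 1
            if n_yielded >= max_records:
                return
        # stop when the API says there is nothing more, or a page comes back empty
        if page["endOfRecords"] or not page["results"]:
            return
        offset += limit


def fake_fetch(offset: int, limit: int) -> dict:
    """Offline stand-in for a real request: serves 45 made-up records."""
    records = [{"key": i} for i in range(45)]
    chunk = records[offset: offset + limit]
    return {"results": chunk, "endOfRecords": offset + limit >= len(records)}


keys = [rec["key"] for rec in iter_results(fake_fetch, limit=20)]
print(len(keys))
```

With the requests library, `fetch_page` could wrap a call to `requests.get` that passes `offset` and `limit` in `params` and returns `response.json()`. The `max_records` cap keeps a broad query from issuing an unbounded number of requests.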