As a scientist you have likely spent time recording data in a spreadsheet. Tabular data (e.g., CSV or TSV) is a common standard in data science, but a few other formats, like XML or JSON, are also popular.
CSV (comma-separated values) is a flat data format. You only need to know
the row and column indices to find a value. Typically, each row is an observation,
and categorical values (e.g., Iris-setosa) are repeated across many rows.
pd.read_csv("https://eaton-lab.org/data/iris-data-dirty.csv", header=None).head(10)
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
- record data in tabular form (repetitive columns and all).
- begin data analysis with a cleaning step (e.g., filter missing, relabel typos).
- document the cleaning step (keep the raw data file and the cleaning script or notebook).
- perform analyses on clean data and archive both data files (e.g., on GitHub/Zenodo).
# load tabular data
df = pd.read_csv(
"https://eaton-lab.org/data/iris-data-dirty.csv",
names=["trait1", "trait2", "trait3", "species"],
)
# grouping by species reveals inconsistencies (misspelled species names)
df.groupby("species").mean()
trait1 trait2 trait3
species
Iris-setosa 3.418367 1.461224 0.244898
Iris-setsa 3.400000 1.600000 0.200000
Iris-versicolor 2.761702 4.251020 1.324490
Iris-versicolour 3.200000 4.700000 1.400000
Iris-virginica 2.974000 5.552000 2.026000
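The cleaning step for the misspelled labels above could look something like the sketch below. It is only an illustration: the small table and its trait values are made up, and the `fixes` mapping is a hypothetical choice of which spelling to treat as correct.

```python
import pandas as pd

# Small stand-in table with the same kinds of label typos seen above
# (the trait values here are made up for illustration).
raw = pd.DataFrame({
    "trait1": [3.5, 3.4, 3.2, 2.8, 3.0],
    "species": [
        "Iris-setosa", "Iris-setsa", "Iris-versicolour",
        "Iris-versicolor", "Iris-virginica",
    ],
})

# Hypothetical mapping from each inconsistent label to the preferred one.
fixes = {
    "Iris-setsa": "Iris-setosa",
    "Iris-versicolour": "Iris-versicolor",
}

# Relabel on a copy so the raw data stays untouched.
clean = raw.copy()
clean["species"] = clean["species"].replace(fixes)

print(sorted(clean["species"].unique().tolist()))
```

Keeping the mapping in a script (rather than editing the file by hand) documents exactly which labels were changed.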
- JavaScript Object Notation (in Python, think of it as a dictionary).
- a hierarchical data format (nested key: value pairs).
- commonly used on the web.
# example
[
  {
    "id": 0,
    "data": {
      "first": "deren",
      "last": "eaton",
      "dog": "phylo"
    }
  },
  {
    "id": 1,
    "data": {
      "first": "john",
      "last": "smith",
      "dog": "fido"
    }
  }
]
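To make the "think of it as a dictionary" idea concrete, here is a minimal sketch that parses JSON text with Python's standard `json` module. It assumes the two records are stored as a list of objects, which is one reasonable way to lay out the example data.

```python
import json

# A JSON string holding a list of two records, each with a nested "data" object.
text = """
[
  {"id": 0, "data": {"first": "deren", "last": "eaton", "dog": "phylo"}},
  {"id": 1, "data": {"first": "john", "last": "smith", "dog": "fido"}}
]
"""

# json.loads converts JSON text into Python lists and dicts.
records = json.loads(text)

# Nested values are reached by chaining keys, one per level.
for record in records:
    print(record["id"], record["data"]["dog"])
```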
- pandas can be used to convert between JSON and CSV.
# convert dataframe to a json string
sjson = df.to_json(orient="index")
# load as a dict
import json
print(json.loads(sjson))
{'0': {
'trait0': 5.1,
'trait1': 3.5,
'trait2': 1.4,
'trait3': 0.2,
'species': 'Iris-setosa'
},
'1': {
'trait0': 4.9,
'trait1': 3.0,
'trait2': 1.4,
'trait3': 0.2,
'species': 'Iris-setosa'
},
...
# semi-structured JSON-like data (a list of dicts, one of which is missing the 'id' key)
data = [
{'id': 1,
'name': "Cole Volk",
'fitness': {'height': 130, 'weight': 60},
},
{
'name': "Mose Reg",
'fitness': {'height': 130, 'weight': 60},
},
{'id': 2,
'name': 'Faye Raker',
'fitness': {'height': 130, 'weight': 60},
},
]
# pandas can load it
pd.json_normalize(data, max_level=1)
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mose Reg 130 60
2 2.0 Faye Raker 130 60
"Representational State Transfer" (REST) is a set of rules for transferring data over the web using specific URL paths. These paths naturally describe a hierarchical (JSON-like) structure.
A URL represents a request, and the data sent back to you is called a response.
The URL is called an endpoint, and is composed of a root and a path.
# root for GBIF REST API
ROOT = "https://api.gbif.org/v1/"
# path for a specific request
PATH = "species/suggest"
# endpoint (root + path)
ENDPOINT = ROOT + PATH  # https://api.gbif.org/v1/species/suggest
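As a small sketch of how a root and a path combine into an endpoint, the standard library's `urljoin` can be used. The variable names `ROOT` and `PATH` are just illustrative labels for the GBIF root and path discussed here.

```python
from urllib.parse import urljoin

# Root of the GBIF REST API and the path of one request.
ROOT = "https://api.gbif.org/v1/"
PATH = "species/suggest"

# urljoin appends a relative path to a root that ends with "/".
endpoint = urljoin(ROOT, PATH)
print(endpoint)
```

Note that `urljoin` treats a path with a leading slash as starting from the server's top level, which would drop the `v1/` part, so the path is written here without one.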
The API request returns a JSON response that (most) browsers will display.
Here we can see there are >30M results (count), but only 20 are shown (limit).
These 20 results are available in the "results" key.
{
"offset": 0,
"limit": 20,
"endOfRecords": false,
"count": 30009607,
"results": [
{
"key": 0,
"datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
"constituentKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
"kingdom": "incertae sedis",
"kingdomKey": 0,
...
},
{
"key": 1,
"datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
"constituentKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c",
"kingdom": "Animalia",
"kingdomKey": 1,
...
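Because only `limit` results come back at a time, fetching more means requesting successive pages. The sketch below only computes the starting offset of each page; it assumes the endpoint accepts `offset` and `limit` as query parameters (check the API documentation), and `page_offsets` is a hypothetical helper, not part of any library.

```python
def page_offsets(count, limit, max_records=None):
    """Return the starting offset of each page needed to cover the records."""
    # Optionally cap the total so we never plan to download millions of rows.
    total = count if max_records is None else min(count, max_records)
    return list(range(0, total, limit))


# Using the count and limit from the response above, capped at 100 records.
offsets = page_offsets(count=30009607, limit=20, max_records=100)
print(offsets)
```

Each offset would then be sent along with the same limit in a separate request.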
Some API endpoints accept parameters and return only the results matching a query.
Each API is designed differently, so you have to read its documentation. For example:
https://api.gbif.org/v1/species/search/?q="Pedicularis rex"
returns only about 13.7K hits.
{
"offset": 0,
"limit": 20,
"endOfRecords": false,
"count": 13678,
"results": [
{
"key": 103267516,
"datasetKey": "fab88965-e69d-4491-a04d-e3198b626e52",
"parentKey": 103266166,
"parent": "Pedicularis",
"kingdom": "Viridiplantae",
"phylum": "Streptophyta",
"order": "Lamiales",
"family": "Orobanchaceae",
"genus": "Pedicularis",
"species": "Pedicularis rex",
"kingdomKey": 102974832,
"phylumKey": 102986054,
...
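The part of the URL after the `?` is a query string of `key=value` pairs. As a rough sketch, it can be assembled with the standard library's `urlencode`; the `limit` parameter here is an assumption added for illustration, so confirm it against the API documentation.

```python
from urllib.parse import urlencode

ENDPOINT = "https://api.gbif.org/v1/species/search"

# Query parameters are key=value pairs, joined by "&" after a "?".
params = {"q": "Pedicularis rex", "limit": 5}

# urlencode escapes characters that are not URL-safe (a space becomes "+").
url = f"{ENDPOINT}?{urlencode(params)}"
print(url)
```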
There are many free and publicly available REST APIs, and many others that
require registration or even payment to use. We will learn about
some of these in the assignments. Search Google for APIs; you may be surprised by what you find.
It would be cumbersome to type the URL into your browser for every search.
Instead, we can automate API requests in Python. This is best done with the
requests library (a third-party package).
import requests
# search with the simple gbif search parameter q=
response = requests.get(
url="https://api.gbif.org/v1/species/search",
params={"q": "Pedicularis rex"},
)
print(response.url)
https://api.gbif.org/v1/species/search?q=Pedicularis+rex
# search with the simple gbif search parameter q=
response = requests.get(
url="https://api.gbif.org/v1/species/search",
params={"q": "Pedicularis rex"},
)
# was the request successful? (200 means OK)
print(response.status_code)
# get data as text or json
print(response.text)
print(response.json())
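In a script it is usually worth checking the status before using the data. The sketch below shows one way to do that. `extract_results` is a hypothetical helper, and `FakeResponse` is only an offline stand-in that mimics the `status_code` attribute and `json()` method of a real response so the example runs without a network connection.

```python
def extract_results(response):
    """Return the "results" list from a successful JSON response."""
    # Stop early with a clear message on any non-200 status.
    if response.status_code != 200:
        raise RuntimeError(f"request failed with status {response.status_code}")
    return response.json().get("results", [])


class FakeResponse:
    """Offline stand-in exposing the two members used above."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


ok = FakeResponse(200, {"count": 1, "results": [{"species": "Pedicularis rex"}]})
print(extract_results(ok))
```

With a real request, the object returned by `requests.get` would be passed in place of `ok`.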
# search the occurrence database for specific species
response = requests.get(
url="https://api.gbif.org/v1/occurrence/search",
params={"scientificName": "Pedicularis anas"},
)
spdict = response.json()
# convert JSON back into a dataframe for easy viewing
df = pd.json_normalize(spdict['results'])
# show specific columns
df.loc[:8, ['species', 'decimalLongitude','decimalLatitude', 'datasetName']]
species decimalLongitude decimalLatitude datasetName
0 Pedicularis anas 102.606667 32.234167 Harvard University Herbaria: All Records
1 Pedicularis anas NaN NaN NaN
2 Pedicularis anas NaN NaN Harvard University Herbaria: All Records
3 Pedicularis anas NaN NaN NaN
4 Pedicularis anas 100.764700 32.488300 Ree China
5 Pedicularis anas 101.916100 32.896400 Ree China
6 Pedicularis anas NaN NaN Ree China
7 Pedicularis anas 102.572800 34.148100 Ree China
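Many of these records lack coordinates, so a typical next cleaning step is to keep only the rows that have both. A minimal sketch, using a few rows copied from the table above as a stand-in for the full result:

```python
import pandas as pd

# Stand-in for a few rows of the table above (values copied from it).
occ = pd.DataFrame({
    "species": ["Pedicularis anas"] * 4,
    "decimalLongitude": [102.606667, None, 100.7647, None],
    "decimalLatitude": [32.234167, None, 32.4883, None],
    "datasetName": [
        "Harvard University Herbaria: All Records", None,
        "Ree China", "Ree China",
    ],
})

# Keep only records that have both coordinates.
mapped = occ.dropna(subset=["decimalLongitude", "decimalLatitude"])
print(len(occ), "records,", len(mapped), "with coordinates")
```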