14.1 -- Data Visualization

Prerequisites

Let's make sure we have the necessary modules installed, including matplotlib and scikit-learn, which we will introduce briefly here and cover in more detail next week.

conda install -c conda-forge matplotlib scikit-learn

Open a new notebook in your hack-5-python/notebooks directory and rename this notebook to "14.1-plotting.ipynb". In a new cell, include a few imports that we will need:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

Fetch the Iris data

We have been loading and cleaning the iris data by hand up to this point, but now we are going to make use of a nice feature in scikit-learn, which provides the iris data as a pre-loaded dataset. This chunk of code is 'data curation': we load in the iris data, transform it into a pandas DataFrame (it is not structured natively as a DataFrame because it carries a bunch of other information internally), and then create a new column called species to hold the species IDs.

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
iris_df
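
To see why the conversion step is needed, it can help to peek at what load_iris() actually returns. The following is a minimal sketch, not part of the lesson code; the as_frame option is assumed to be available (scikit-learn 0.23 or later).

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like "Bunch" object, not a DataFrame
iris = load_iris()
print(type(iris).__name__)
print(list(iris.keys()))
print(iris.data.shape)        # the measurements as a plain numpy array
print(iris.target_names)      # species names, indexed by the integer codes in iris.target

# Alternative: ask scikit-learn to build a DataFrame for us
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.columns.tolist())
```

Note that the frame built by as_frame=True stores the species as integer codes in a column called target, so the species column created above is still a useful addition.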

Exploring the data 1-D at a time with histograms

Let's say we want to visualize the differences between the different measurements for each of the species. A simple way to do this is with histograms. Here we will use the DataFrame groupby() method to group the data by species ID.

# Define a variable for the feature we wish to plot
feature = "sepal length (cm)"
# Group the data in the dataframe by species ID
gb = iris_df.groupby("species")
# Iterate through the groupby object
for sp, dat in gb:
    # For each species, plot the data for the given feature as a histogram
    plt.hist(dat[feature], label=sp)
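
If it is not obvious what the for loop receives on each pass, the sketch below prints it out. Iterating over a groupby object yields pairs of (group key, sub-DataFrame). The group_sizes dictionary is a hypothetical helper used only for illustration, and observed=True is an optional argument included here to avoid a warning that some pandas versions raise for categorical keys.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the same DataFrame used in the lesson
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

gb = iris_df.groupby("species", observed=True)
group_sizes = {}
for sp, dat in gb:
    # sp is the species name, dat is a DataFrame holding only that species' rows
    group_sizes[sp] = len(dat)
    print(sp, dat.shape, round(dat["sepal length (cm)"].mean(), 2))
```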

Set the alpha channel

The initial result looks good, but hist defaults to plotting all histograms fully opaque, so it's not clear what is happening in regions where the histograms overlap. Matplotlib plotting functions can take an opacity parameter called alpha, which takes values between 0 (fully transparent) and 1 (fully opaque). Try setting the alpha to a small value and replotting.

feature = "sepal length (cm)"
gb = iris_df.groupby("species")
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.01)
plt.legend()

This produces a ghostly outline of the histogram. Try experimenting with different alpha values until you find something you are happy with.
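
One way to compare several alpha values at once is to draw them side by side. This is only a sketch: the three alpha values are arbitrary choices, and plt.subplots() is used here to get one panel per value.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

feature = "sepal length (cm)"
alphas = [0.2, 0.5, 0.8]  # arbitrary values picked for comparison

# One panel per alpha value, sharing the y axis so the counts are comparable
fig, axes = plt.subplots(1, len(alphas), figsize=(12, 3), sharey=True)
for ax, alpha in zip(axes, alphas):
    for sp, dat in iris_df.groupby("species", observed=True):
        ax.hist(dat[feature], label=sp, alpha=alpha)
    ax.set_title(f"alpha={alpha}")
axes[0].legend()
```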

Change the colors

By default, matplotlib will choose different colors for each histogram, and this is fine, but you might want to control the colors for each histogram to unify color schemes across your plots, and also to beautify them. You can specify colors in several different ways, but the most straightforward way is to pick from the large list of matplotlib named colors. Here are some examples:

feature = "sepal length (cm)"
gb = iris_df.groupby("species")
# Create a dictionary mapping species IDs to color names, which we will access inside the for loop
cdict = {'setosa':'cornflowerblue',
         'versicolor':'salmon',
         'virginica':'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])

Experiment with different color schemes until you find one you are happy with.
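
Named colors are only one of the accepted formats. As a small illustration, the sketch below mixes a named color, a hex string, and an RGB tuple in the same dictionary and uses matplotlib.colors.to_hex() to show what each one resolves to. The specific hex and RGB values are just examples.

```python
import matplotlib.colors as mcolors

# Three different ways of writing a color that matplotlib understands
cdict = {
    "setosa": "cornflowerblue",        # a matplotlib named color
    "versicolor": "#FA8072",           # a hex string
    "virginica": (0.85, 0.65, 0.13),   # an RGB tuple with values between 0 and 1
}

for sp, color in cdict.items():
    # to_hex() converts any valid color specification to a hex string
    print(sp, mcolors.to_hex(color))
```

Any of these values can be passed to the color argument of plt.hist() in exactly the same way as the named colors above.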

Add a legend

If you add labels to the hist call (as we have done), then matplotlib can easily generate a legend for your figure with the plt.legend() method.

feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa':'cornflowerblue',
         'versicolor':'salmon',
         'virginica':'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
plt.legend()

By default matplotlib legend() will place itself in the most unoccupied region of the figure, and generally it does this pretty well, but you can control the placement of the legend within the figure using the loc argument, which you can learn more about in the legend documentation.
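
As a quick illustration of the loc argument, here is a sketch that pins the legend to the upper right corner and gives it a title. The choice of corner and the title text are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

feature = "sepal length (cm)"
fig, ax = plt.subplots()
for sp, dat in iris_df.groupby("species", observed=True):
    ax.hist(dat[feature], label=sp, alpha=0.5)

# loc accepts strings such as "upper right", "lower left", "center", or "best" (the default)
legend = ax.legend(loc="upper right", title="Species")
```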

Add axis labels and title

Finally, we will wrap up this figure by providing axis labels and a plot title, which we can do with the plt.title(), plt.xlabel(), and plt.ylabel() functions.

feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa':'cornflowerblue',
         'versicolor':'salmon',
         'virginica':'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])

plt.title(f'Histogram of {feature} by Species')
plt.ylabel('Count')
plt.xlabel(feature)
# Keep the legend from the previous step
plt.legend()

If you wish to modify the style of the title and axis labels, you can change many of their properties (for example the fontsize), all of which are documented in the matplotlib text documentation.
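
For example, a sketch of setting a few text properties might look like the following. The particular sizes and the bold weight are arbitrary choices, not recommendations from the lesson.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

feature = "sepal length (cm)"
fig, ax = plt.subplots()
for sp, dat in iris_df.groupby("species", observed=True):
    ax.hist(dat[feature], label=sp, alpha=0.5)

# Text properties are passed as keyword arguments to the labeling functions
plt.title(f"Histogram of {feature} by Species", fontsize=16, fontweight="bold")
plt.ylabel("Count", fontsize=12)
plt.xlabel(feature, fontsize=12)
plt.legend()
```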

Saving figures to a file

Once you are happy with your figure, you can right-click on the image in the notebook and choose "copy output to clipboard", but this copies a low-resolution version of the image, which is fine for presentations but not great for publications. For this reason matplotlib provides a savefig() function that saves the figure in any of several output formats and lets you control the output resolution (using the dpi argument, for example). Here is an example of saving as a standard-resolution PNG file.

feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa':'cornflowerblue',
         'versicolor':'salmon',
         'virginica':'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])

plt.title(f'Histogram of {feature} by Species')
plt.ylabel('Count')
plt.xlabel(feature)
plt.legend()
plt.savefig('sepal_length.png')

After you run this code you should see a new file called "sepal_length.png" in your JupyterLab file browser, and you can double-click it to open it in JupyterLab and verify that it looks good.

You can see what other file formats are available by calling the get_supported_filetypes() function, like this:

plt.gcf().canvas.get_supported_filetypes()
{'eps': 'Encapsulated Postscript',
 'jpg': 'Joint Photographic Experts Group',
 'jpeg': 'Joint Photographic Experts Group',
 'pdf': 'Portable Document Format',
 'pgf': 'PGF code for LaTeX',
 'png': 'Portable Network Graphics',
 'ps': 'Postscript',
 'raw': 'Raw RGBA bitmap',
 'rgba': 'Raw RGBA bitmap',
 'svg': 'Scalable Vector Graphics',
 'svgz': 'Scalable Vector Graphics',
 'tif': 'Tagged Image File Format',
 'tiff': 'Tagged Image File Format',
 'webp': 'WebP Image Format'}
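
Putting the dpi argument and the alternative formats together, a sketch of saving a higher-resolution PNG and a vector PDF could look like this. The filenames and the dpi value of 300 are assumptions chosen for illustration, and bbox_inches="tight" is an optional extra that trims surrounding whitespace.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

feature = "sepal length (cm)"
fig, ax = plt.subplots()
for sp, dat in iris_df.groupby("species", observed=True):
    ax.hist(dat[feature], label=sp, alpha=0.5)
ax.set_xlabel(feature)
ax.set_ylabel("Count")
ax.legend()

# Hypothetical output filenames; the format is inferred from the extension
png_path = Path("sepal_length_300dpi.png")
pdf_path = Path("sepal_length.pdf")

# Raster output: dpi controls the resolution
fig.savefig(png_path, dpi=300, bbox_inches="tight")
# Vector output: scales without pixelation, so dpi matters much less
fig.savefig(pdf_path, bbox_inches="tight")
```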

Post a copy of your finished histogram

I have created a google slides presentation for sharing our data visualizations. Open this link (you will need to use your CU account), and paste your favorite visualization into a new slide.

Challenge: Visualize another column of data

Now that you have the code settled for one column of the data, and you have abstracted out the column to plot as a variable named feature, it should be simple to plot another column by assigning a different column name to feature. Go ahead and try it.
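
If you want to go one step further, one possible approach is to wrap the plotting loop in a function and call it once per column. This is a sketch under stated assumptions: plot_feature_hist is a hypothetical helper name, and the 2x2 grid layout is just one way to arrange the four panels.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

cdict = {"setosa": "cornflowerblue", "versicolor": "salmon", "virginica": "goldenrod"}


def plot_feature_hist(df, feature, ax):
    """Draw one histogram per species for the given feature on the given axes."""
    for sp, dat in df.groupby("species", observed=True):
        ax.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
    ax.set_title(f"Histogram of {feature} by Species")
    ax.set_xlabel(feature)
    ax.set_ylabel("Count")


# One panel for each of the four measurement columns
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.flat, iris.feature_names):
    plot_feature_hist(iris_df, feature, ax)
axes[0, 0].legend()
fig.tight_layout()
```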

Visualizing 2-D data with scatterplots

Often we will have more than one dimension of data, and we might have hypotheses about how the different dimensions co-vary in our dataset. Let's remind ourselves of what types of data we have available in the iris data.

print(iris_df.columns)
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'species'],
      dtype='object')

Let's choose sepal length and width, as these seem like they might be reasonably correlated. The machinery of plotting 2D data is pretty similar to plotting 1D data, with the exception that we need to provide x and y values for each datapoint. Whereas above with hist we only passed in one column of data, now we need to pass in two columns. Otherwise, the details of manipulating the 'look' of the figure are quite similar.

gb = iris_df.groupby("species")
x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"
for sp, dat in gb:
    plt.scatter(dat[x_feature], dat[y_feature], label=sp)
plt.legend()
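
The guess that sepal length and width are correlated can also be checked numerically. The sketch below uses the pandas Series.corr() method (Pearson correlation by default) on the pooled data and then within each species; the two views can differ, which is one reason plotting by group is useful.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"

# Correlation across all 150 flowers, ignoring species
overall = iris_df[x_feature].corr(iris_df[y_feature])
print("all species:", round(overall, 3))

# Correlation within each species separately
by_species = {}
for sp, dat in iris_df.groupby("species", observed=True):
    by_species[sp] = dat[x_feature].corr(dat[y_feature])
    print(sp, round(by_species[sp], 3))
```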

Styling scatterplots

The details of figure styling for scatterplots are nearly identical to those for histograms, so we can copy our approach for styling the figure above.

x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"

gb = iris_df.groupby("species")
cdict = {'setosa':'cornflowerblue',
         'versicolor':'salmon',
         'virginica':'goldenrod'}
for sp, dat in gb:
    plt.scatter(dat[x_feature], dat[y_feature], label=sp, alpha=0.75, color=cdict[sp])

plt.title(f'{y_feature} vs. {x_feature} by Species')
plt.ylabel(y_feature)
plt.xlabel(x_feature)
plt.legend()
# Use a new filename so the histogram saved earlier is not overwritten
plt.savefig('sepal_length_vs_width.png')

One difference with scatter is that you can define a "marker style", which controls the shape of the scatter points. You can set the marker for a scatterplot using the marker argument; for example, marker="*" would use stars instead of points. There are many options for marker style that you can see in the matplotlib markers documentation.
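
Following the same pattern as the color dictionary, a sketch of assigning a different marker to each species might look like this. The mdict name and the particular marker choices are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"
cdict = {"setosa": "cornflowerblue", "versicolor": "salmon", "virginica": "goldenrod"}
# Hypothetical mapping of species to marker styles: circle, triangle, star
mdict = {"setosa": "o", "versicolor": "^", "virginica": "*"}

fig, ax = plt.subplots()
for sp, dat in iris_df.groupby("species", observed=True):
    ax.scatter(dat[x_feature], dat[y_feature], label=sp,
               alpha=0.75, color=cdict[sp], marker=mdict[sp])
ax.set_xlabel(x_feature)
ax.set_ylabel(y_feature)
ax.legend()
```

Using distinct markers as well as distinct colors keeps the groups distinguishable when a figure is printed in grayscale.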

Scatterplot challenge

In fact, you can reasonably represent three dimensions in a scatterplot by using the size of the marker to indicate the third dimension. This was shown in this week's reading, in section 04.02. If you have time, see if you can recreate that figure.
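
As a starting point, marker size is controlled by the s argument of scatter, which accepts one value per point. The sketch below is not a recreation of the figure from the reading; the choice of petal width as the third dimension and the scaling factor of 100 are assumptions you will probably want to adjust.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"
size_feature = "petal width (cm)"  # assumed choice for the third dimension

fig, ax = plt.subplots()
for sp, dat in iris_df.groupby("species", observed=True):
    # s takes marker areas in points squared; scale the raw values so differences are visible
    ax.scatter(dat[x_feature], dat[y_feature], s=100 * dat[size_feature],
               alpha=0.4, label=sp)
ax.set_xlabel(x_feature)
ax.set_ylabel(y_feature)
ax.set_title(f"Marker size shows {size_feature}")
ax.legend()
```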

Post a copy of your finished scatterplot

As before, post your finished scatterplot to our doc for sharing our data visualizations. Go ahead and post it on the same slide as your histogram, side by side.

Further challenges if time remains

Use the requests module to download a small mammal life history dataset that was published by Ernest in 2003 in the journal Ecology.

Load this data into a pandas DataFrame. You will need to use a couple additional arguments when reading this file, which indicate tab separated values and replace missing data values with NaN. Here is the call you'll need to load the data: pd.read_csv("Mammal_lifehistories_v2.txt", sep="\t", na_values=['-999', '-999.00'])

  • Use a scatter plot to investigate adult mass vs. newborn mass. What do you notice about this plot?
  • Try transforming the data using np.log10() and then replotting the same data
  • Develop a hypothesis about the relationship between adult body mass and litter size. Should litter size be smaller or larger with increasing body mass? Plot the data to get a sense of whether you were right about this. When plotting, will you log-transform both, one, or neither of the axes? Justify your decision.
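
To show the mechanics of the log transform without giving away the answer, here is a sketch on synthetic data. Nothing in it comes from the mammal dataset: the column names adult_mass and newborn_mass, the random values, and the relationship between them are all made up for illustration, so substitute the real column names after loading the file.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, made-up data spanning several orders of magnitude
rng = np.random.default_rng(0)
adult = 10 ** rng.uniform(0, 6, size=200)
newborn = 0.1 * adult ** 0.9 * 10 ** rng.normal(0, 0.2, size=200)
df = pd.DataFrame({"adult_mass": adult, "newborn_mass": newborn})

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Raw values: a few very large values dominate the axes
ax_raw.scatter(df["adult_mass"], df["newborn_mass"], alpha=0.5)
ax_raw.set_title("Raw values")

# log10-transformed values: the points spread out across the whole range
log_adult = np.log10(df["adult_mass"])
log_newborn = np.log10(df["newborn_mass"])
ax_log.scatter(log_adult, log_newborn, alpha=0.5)
ax_log.set_title("log10-transformed values")

for ax in (ax_raw, ax_log):
    ax.set_xlabel("adult mass")
    ax.set_ylabel("newborn mass")
fig.tight_layout()
```

In the real data, missing values are loaded as NaN, and np.log10() leaves them as NaN, so those rows will simply not appear in the plot.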

Wrapping up

Add, commit, and push your notebook to your GitHub repo.