14.1 -- Data Visualization
Prerequisites¶
Let's make sure we have the necessary modules installed, including matplotlib and
scikit-learn, which we will briefly introduce here and cover in more detail next week.
conda install -c conda-forge matplotlib scikit-learn
Open a new notebook in your hack-5-python/notebooks directory and rename this notebook to
"14.1-plotting.ipynb". In a new cell, include a few imports that we will need:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
Fetch the Iris data¶
We have been loading and cleaning the iris data by hand up to this point, but now
we are going to make use of a nice feature in scikit-learn, which provides the iris
data as a pre-loaded dataset. This chunk of code is 'data curation': we load in
the iris data, transform it into a pandas DataFrame (it is not structured natively as
a dataframe because it has a bunch of other features internally), and then create a new
column called species to hold the species IDs.
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
iris_df
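To see what pd.Categorical.from_codes does in the curation step, here is a minimal sketch. The codes list is made up for illustration and is not the real iris.target; the idea is only that integer codes are mapped onto a list of category names.

```python
import pandas as pd

# Integer codes index into the list of category names,
# in the same way iris.target indexes into iris.target_names.
codes = [0, 2, 1, 0]  # made-up codes for illustration
names = ["setosa", "versicolor", "virginica"]

species = pd.Categorical.from_codes(codes, names)
print(list(species))
```

Each code is replaced by the name at that position, so the column reads as species names rather than bare integers.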
Exploring the data 1-D at a time with histograms¶
Let's say we want to visualize the differences between the different measurements for each
of the species. A simple way to do this is with histograms. Here we will use the DataFrame
groupby() method to group the data by species ID.
# Define a variable for the feature we wish to plot
feature = "sepal length (cm)"
# Group the data in the dataframe by species ID
gb = iris_df.groupby("species")
# Iterate through the groupby object
for sp, dat in gb:
    # For each species, plot the data for the given feature as a histogram
    plt.hist(dat[feature], label=sp)
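If it is unclear what the for loop receives on each pass, the sketch below may help. It uses a tiny made-up table (toy_df is a hypothetical stand-in for iris_df) to show that iterating over a groupby object yields pairs of group name and sub-DataFrame.

```python
import pandas as pd

# A tiny made-up table standing in for iris_df
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal length (cm)": [5.1, 4.9, 6.3],
})

# Iterating over a groupby object yields (group name, sub-DataFrame) pairs
group_sizes = {}
for sp, dat in toy_df.groupby("species"):
    group_sizes[sp] = len(dat)
    print(sp, dat["sepal length (cm)"].tolist())
```

Here sp is the species name and dat holds only the rows for that species, which is why dat[feature] can be passed straight to plt.hist.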
Set the alpha channel¶
The initial result looks good, but hist defaults to plotting all histograms fully
opaque, so it's not clear what is happening in regions where the histograms overlap. Matplotlib
plotting functions can take an opacity parameter called alpha, which takes values between
0 (fully transparent) and 1 (fully opaque).
Try setting alpha to a small value and replotting.
feature = "sepal length (cm)"
gb = iris_df.groupby("species")
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.01)
plt.legend()
This produces a ghostly outline of the histogram. Try experimenting with different alpha values until you find something you are happy with.
Change the colors¶
By default, matplotlib will choose different colors for each histogram, and this is fine, but you might want to control the colors for each histogram to unify color schemes across your plots, and also to beautify them. You can specify colors in several different ways, but the most straightforward way is to pick from the large list of matplotlib named colors. Here are some examples:
feature = "sepal length (cm)"
gb = iris_df.groupby("species")
# Create a dictionary mapping species IDs to color names, which we will access inside the for loop
cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
Experiment with different color schemes until you find one you are happy with.
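Named colors are only one of the ways to specify a color. As a rough sketch, the matplotlib.colors helpers to_hex and to_rgba show the hex string and RGBA tuple that correspond to a name; either form can be passed to the color argument in place of the name.

```python
import matplotlib.colors as mcolors

cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}

# Each named color resolves to a hex string
hex_colors = {sp: mcolors.to_hex(name) for sp, name in cdict.items()}
print(hex_colors)

# The same color as an RGBA tuple (red, green, blue, alpha), each in the range 0 to 1
rgba = mcolors.to_rgba('cornflowerblue', alpha=0.5)
print(rgba)
```

Hex strings are handy when you want to match a color taken from another tool or a journal style guide.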
Add a legend¶
If you add labels to the hist call (as we have done), then matplotlib can easily
generate a legend for your figure with the plt.legend() function.
feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
plt.legend()
By default, legend() places itself where it overlaps least with the plotted data, and
generally it does this pretty well, but you can control the placement of the legend within the figure
using the loc argument, which you can learn more about in the legend documentation.
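As a small illustration of the loc argument, the sketch below pins the legend to the upper left corner. The data are synthetic random numbers, not the iris measurements, and the group names are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the iris measurements

plt.figure()
for name, center in [("group A", 5.0), ("group B", 6.5)]:
    plt.hist(rng.normal(center, 0.4, size=50), label=name, alpha=0.5)

# Pin the legend to a corner instead of letting matplotlib choose,
# and give the legend box its own title
leg = plt.legend(loc="upper left", title="Group")
```

Other accepted strings include "upper right", "lower left", and "center".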
Add axis labels and title¶
Finally, we will wrap up this figure by providing axis labels and a plot title, which we can do with
the plt.title() and plt.xlabel()/plt.ylabel() functions.
feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
plt.title(f'Histogram of {feature} by Species')
plt.ylabel('Count')
plt.xlabel(feature)
If you wish to modify the style of the title and axis labels you can change many of these
properties (for example the fontsize), all of which are documented in the text documentation.
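For instance, text properties are passed as keyword arguments to the same calls. The sizes and weight below are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

plt.figure()
feature = "sepal length (cm)"

# Text properties such as fontsize and fontweight are keyword arguments
title = plt.title(f"Histogram of {feature} by Species", fontsize=16, fontweight="bold")
xlab = plt.xlabel(feature, fontsize=12)
ylab = plt.ylabel("Count", fontsize=12)
```

Each call returns the Text object it created, so the properties can also be inspected or changed afterwards.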
Saving figures to a file¶
Once you are happy with your figure, you can right-click on the image in the
notebook and choose "copy output to clipboard", but this copies a low-resolution version of the image, which is fine for presentations but not great for publications. For this reason
matplotlib provides a savefig() function that saves the figure in one of several
output formats and lets you control the output resolution (using the dpi argument, for
example). Here is an example of saving as a standard-resolution PNG file.
feature = "sepal length (cm)"
gb = iris_df.groupby("species")
cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}
for sp, dat in gb:
    plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
plt.title(f'Histogram of {feature} by Species')
plt.ylabel('Count')
plt.xlabel(feature)
plt.savefig('sepal_length.png')
You can see what other file formats are available by calling the
get_supported_filetypes() method of the figure canvas, like this:
plt.gcf().canvas.get_supported_filetypes()
{'eps': 'Encapsulated Postscript',
'jpg': 'Joint Photographic Experts Group',
'jpeg': 'Joint Photographic Experts Group',
'pdf': 'Portable Document Format',
'pgf': 'PGF code for LaTeX',
'png': 'Portable Network Graphics',
'ps': 'Postscript',
'raw': 'Raw RGBA bitmap',
'rgba': 'Raw RGBA bitmap',
'svg': 'Scalable Vector Graphics',
'svgz': 'Scalable Vector Graphics',
'tif': 'Tagged Image File Format',
'tiff': 'Tagged Image File Format',
'webp': 'WebP Image Format'}
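Putting the dpi argument and the format list together, here is a sketch of saving a higher-resolution PNG and a vector PDF. The data are synthetic, and the temporary output directory and file names are hypothetical; in your notebook you could pass plain file names instead.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # synthetic data for illustration
plt.figure()
plt.hist(rng.normal(5.8, 0.8, size=150), alpha=0.5, color="cornflowerblue")
plt.xlabel("sepal length (cm)")
plt.ylabel("Count")

# Hypothetical output location
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "sepal_length_300dpi.png")
pdf_path = os.path.join(out_dir, "sepal_length.pdf")

# Higher dpi gives a sharper raster image; bbox_inches="tight" trims extra whitespace
plt.savefig(png_path, dpi=300, bbox_inches="tight")
# The output format is inferred from the file extension
plt.savefig(pdf_path)
```

Vector formats such as PDF and SVG scale without pixelation, which is often what journals ask for.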
Post a copy of your finished histogram¶
I have created a Google Slides presentation for sharing our data visualizations. Open this link (you will need to use your CU account), and paste your favorite visualization into a new slide.
Challenge: Visualize another column of data¶
Now that you have the code settled for one column of the data,
and you abstracted out the feature to plot as a variable named feature, it should be simple to plot another feature by dropping in another column name. Go ahead and try it.
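One possible way to make swapping features even easier is to wrap the plotting code in a function. The sketch below is only one approach: plot_feature_hist is a hypothetical helper name, and observed=True is added to the groupby call so that only species present in the data are iterated over.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild iris_df as in the data curation step
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}

def plot_feature_hist(df, feature):
    """Plot one histogram per species for the given column (hypothetical helper)."""
    fig = plt.figure()
    for sp, dat in df.groupby("species", observed=True):
        plt.hist(dat[feature], label=sp, alpha=0.5, color=cdict[sp])
    plt.title(f'Histogram of {feature} by Species')
    plt.ylabel('Count')
    plt.xlabel(feature)
    plt.legend()
    return fig

fig = plot_feature_hist(iris_df, "petal length (cm)")
```

Calling the function again with a different column name produces a new figure with matching labels.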
Visualizing 2-D data with scatterplots¶
Often we will have more than one dimension of data and we might have hypotheses about how the different dimensions of data co-vary in our dataset. Let's remind ourselves of what types of data we have available in the iris data.
print(iris_df.columns)
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)', 'species'],
dtype='object')
Let's choose sepal length and width, as these seem like they might be reasonably
correlated. The machinery of plotting 2D data is pretty similar to plotting 1D data,
except that we need to provide x and y values for each data point. Whereas
above with hist we only passed in one column of data, now we need to pass in two columns;
otherwise the details of manipulating the 'look' of the figure are quite similar.
gb = iris_df.groupby("species")
x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"
for sp, dat in gb:
    plt.scatter(dat[x_feature], dat[y_feature], label=sp)
plt.legend()
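If you want a number to go with the visual impression, one option is to compute the Pearson correlation with the pandas Series.corr method, both across all flowers and within each species. This is a sketch rather than a full analysis; comparing the pooled value with the per-species values can be informative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild iris_df as in the data curation step
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"

# Pearson correlation across all flowers, ignoring species
overall_r = iris_df[x_feature].corr(iris_df[y_feature])
print(f"all species: r = {overall_r:.2f}")

# Pearson correlation within each species
within_r = {}
for sp, dat in iris_df.groupby("species", observed=True):
    within_r[sp] = dat[x_feature].corr(dat[y_feature])
    print(f"{sp}: r = {within_r[sp]:.2f}")
```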
Styling scatterplots¶
The details of figure styling for scatterplots are nearly identical to those for histograms, so we can copy our approach for styling the figure above.
x_feature = "sepal length (cm)"
y_feature = "sepal width (cm)"
gb = iris_df.groupby("species")
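cdict = {'setosa': 'cornflowerblue',
         'versicolor': 'salmon',
         'virginica': 'goldenrod'}
for sp, dat in gb:
    plt.scatter(dat[x_feature], dat[y_feature], label=sp, alpha=0.75, color=cdict[sp])
# Label the scatterplot with its own features rather than the histogram's
plt.title(f'{y_feature} vs. {x_feature} by Species')
plt.ylabel(y_feature)
plt.xlabel(x_feature)
plt.legend()
# Use a new file name so the saved histogram is not overwritten
plt.savefig('sepal_length_vs_width.png')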
One difference with scatter is that you can define a "marker style", which controls
the form of the scatter points. You can define the marker for a scatterplot using the
marker argument; for example, marker="*" would use stars instead of points. There are
many options for marker style that you can see in the documentation.
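As a sketch of how the marker argument might be combined with the per-group loop, the example below gives each group its own marker through a dictionary, much like cdict does for colors. The points are synthetic, and mdict and the group names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # synthetic points, not the iris measurements

# Hypothetical mapping from group name to marker style
mdict = {"group A": "*", "group B": "s", "group C": "^"}

plt.figure()
for i, (name, marker) in enumerate(mdict.items()):
    x = rng.normal(5 + i, 0.3, size=20)
    y = rng.normal(3, 0.3, size=20)
    plt.scatter(x, y, label=name, marker=marker, alpha=0.75)
plt.legend()
```

Distinct markers keep the groups distinguishable when a figure is printed in grayscale.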
Scatterplot challenge¶
In fact, you can reasonably represent three dimensions in a scatterplot by using the size of the marker to indicate a third dimension. This is something that was shown in the reading this week in section 04.02. If you have time, experiment with seeing if you can recreate this figure.
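As a starting point, marker size is controlled by the s argument of scatter, which accepts one value per point. The sketch below uses synthetic numbers rather than the iris columns, and the scaling factor of 100 is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # synthetic stand-ins for three measurements
x = rng.normal(5.8, 0.8, size=50)
y = rng.normal(3.0, 0.4, size=50)
z = rng.uniform(0.1, 2.5, size=50)  # third variable to show as marker size

plt.figure()
# s sets the marker area in points squared; 100 is an arbitrary scaling factor
points = plt.scatter(x, y, s=100 * z, alpha=0.3, color="cornflowerblue")
plt.xlabel("x measurement")
plt.ylabel("y measurement")
```

A low alpha helps here because large markers overlap their neighbors.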
Post a copy of your finished scatterplot¶
As before, post your finished scatterplot to our doc for sharing our data visualizations. Go ahead and post it on the same slide as your histogram, side by side.
Further challenges if time remains¶
Use the requests module to download a small mammal life history dataset that was published by Ernest in 2003 in the journal Ecology.
Load this data into a pandas DataFrame. You will need to use a couple of additional arguments when
reading this file, which indicate tab-separated values and replace missing data values with NaN.
Here is the call you'll need to load the data:
pd.read_csv("Mammal_lifehistories_v2.txt", sep="\t", na_values=['-999', '-999.00'])
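To see what the sep and na_values arguments do, here is a sketch that reads a made-up two-row table from a string instead of the real file. The column names and values are invented for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# A made-up two-row, tab-separated table; column names are hypothetical
raw = "species\tmass(g)\tlitter size\nA\t45000\t-999\nB\t-999.00\t2.5\n"

# sep="\t" splits on tabs; na_values lists the strings to treat as missing
toy = pd.read_csv(io.StringIO(raw), sep="\t", na_values=['-999', '-999.00'])
print(toy)

n_missing = int(toy.isna().sum().sum())
print(n_missing)
```

Without na_values, the -999 placeholders would be read as real measurements and would distort any plot or summary.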
- Use a scatter plot to investigate adult mass vs. newborn mass. What do you notice about this plot?
- Try transforming the data using np.log10() and then replotting the same data.
- Develop a hypothesis about the relationship between adult body mass and litter size. Should litter size be smaller or larger with increasing body mass? Plot the data to get a sense of whether you were right about this. When plotting, will you log-transform both, one, or neither of the axes? Justify your decision.
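There are two common ways to get a log view of the data, sketched below: transform the values with np.log10() and plot on linear axes, or plot the raw values and switch the axes to a log scale. The masses here are synthetic numbers spanning several orders of magnitude, not the mammal data.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic masses spanning several orders of magnitude (not the mammal data)
adult = 10 ** rng.uniform(1, 6, size=100)
newborn = 0.05 * adult * 10 ** rng.normal(0, 0.2, size=100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Option 1: transform the values and keep linear axes
ax1.scatter(np.log10(adult), np.log10(newborn), alpha=0.5)
ax1.set_xlabel("log10(adult mass)")
ax1.set_ylabel("log10(newborn mass)")

# Option 2: keep the raw values and put both axes on a log scale
ax2.scatter(adult, newborn, alpha=0.5)
ax2.set_xscale("log")
ax2.set_yscale("log")
ax2.set_xlabel("adult mass")
ax2.set_ylabel("newborn mass")

fig.tight_layout()
```

The point clouds look the same in both panels; the difference is whether the tick labels show exponents or the original units.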
Wrapping up¶
Add, commit, and push your notebook to your GitHub repo.