
Intro to Machine Learning with Scikit-Learn

In this session we will get a hands-on introduction to machine learning with the scikit-learn package. We will become familiar with:

  • Basic machine learning vocabulary
  • Data importing, transformation, and test/train splitting
  • Unsupervised learning for clustering/dimensionality reduction
  • Supervised learning for classification/regression tasks

Prerequisites

You should already have the necessary packages installed, but if you don't you can install them like this:

conda install -c conda-forge matplotlib scikit-learn

Open a new notebook in your hack-5-python/notebooks directory and rename this notebook to "15.1-intro-scikit.ipynb". In a new cell, include a few imports that we will need:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.decomposition import PCA

Fetch the Iris data

We have been loading and cleaning the iris data by hand up to this point, but now we are going to make use of a nice feature in scikit-learn, which provides the iris data as a pre-loaded dataset.

The iris data is loaded as a dictionary with several keys. Spend a few moments printing the values associated with each of those keys and make sure you understand the structure of this data.

iris = datasets.load_iris()

iris.keys()
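
The load_iris() return value behaves like a dictionary (and also exposes each key as an attribute). If you want a quick overview of its contents, a short sketch like this prints a few of the more useful entries:

# Inspect a few of the entries stored in the loaded dataset
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # names of the three species
print(iris.data.shape)      # (150, 4): 150 flowers, 4 measurements each
print(iris.target[:10])     # integer species labels for the first 10 flowers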

Data Cleaning

Previously we removed rows containing missing values from the iris data. This might not be the best strategy: a row with a missing value may still contain useful values in its other columns, and throwing away data feels bad. If you want to retain all of your data, you can instead impute the missing values, which means filling them in using some (hopefully defensible) strategy. A very simple and defensible strategy is to impute missing data with the mean value of the corresponding column. Much more detailed information about imputation is available in the sklearn.impute documentation.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  
imputed_data = imputer.fit_transform(iris.data)
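
The built-in iris data actually has no missing values, so the imputer leaves it unchanged here. To see what mean imputation does, here is a tiny toy example (the array below is made up for illustration and is not part of the iris data):

# Toy example: a 3x2 array with one missing value in each column.
# Mean imputation replaces each np.nan with that column's mean.
toy = np.array([[1.0, np.nan],
                [2.0, 4.0],
                [np.nan, 8.0]])
print(SimpleImputer(strategy='mean').fit_transform(toy))
# column 0 mean is 1.5, column 1 mean is 6.0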

Visualizing raw data

As we did last week, we can use matplotlib's scatter() to visualize the relationship between two of the dimensions of the data.

plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

Unsupervised learning: Dimension reduction & clustering

The nature of 'Big Data' is that it is hugely multidimensional, so plotting two dimensions at a time might be a pointless operation. Unsupervised learning is a class of methods that takes unlabeled high-dimensional data and collapses it into a smaller number of dimensions that can capture in some way the relationships between the data points. I sometimes think of unsupervised learning as 'exploratory data analysis', because it's a way of looking at the structure of the data through different lenses. Unsupervised learning also includes clustering methods for finding groupings of data points that are 'similar' given some statistical model. We'll only talk about dimension reduction (or 'decomposition') today, but you can learn more about both of these things in the scikit-learn documentation on "Unsupervised learning: Clustering" & "Unsupervised learning: Decomposition".
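
Clustering is not covered in this lesson, but if you are curious, a minimal sketch using KMeans looks like the code below (choosing n_clusters=3 is an assumption based on us already knowing there are three species):

from sklearn.cluster import KMeans

# Ask KMeans to find 3 clusters in the unlabeled measurements
kmeans = KMeans(n_clusters=3, n_init=10)
cluster_labels = kmeans.fit_predict(iris.data)
print(cluster_labels[:20])  # cluster assignments for the first 20 flowers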

Principal component analysis

Principal component analysis is a dimensionality reduction method used to transform and project data points onto fewer orthogonal axes that can explain the greatest amount of variance in the data.

We begin by constructing a PCA object, specifying that we want to retain 3 components, and then calling fit_transform(), passing in the iris data. fit_transform() is actually a shortcut for calling pca.fit(iris.data).transform(iris.data); fitting and transforming are two different actions, but in some cases (as with PCA) it isn't necessary (or useful) to fit a model to data without transforming it as well.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_r = pca.fit_transform(iris.data)

Here X_r is a new variable holding the values of each data point transformed with PCA.
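
If you want to convince yourself that the shortcut really is equivalent to the two-step version described above, you can run fit() and transform() separately and compare (this is just a sanity check, not something you need in a normal workflow):

# Equivalent two-step version: fit the PCA model, then transform the data
pca_two_step = PCA(n_components=3)
pca_two_step.fit(iris.data)
X_r_two_step = pca_two_step.transform(iris.data)

# The two approaches give (numerically) the same transformed coordinates
print(np.allclose(X_r, X_r_two_step))  # should print True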

Evaluate the PCA

One key piece of information with PCA is the amount of variance explained along each orthogonal PC axis. By definition the variance explained monotonically decreases from the first to the last PC, meaning the first axis captures the most variation, and the second axis captures the second most, and so on. This allows you to 'read the tea leaves' a bit in interpreting the relationship between samples in PC-space.

# Percentage of variance explained by each component
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

Visualize samples in PC space

We can now plot the transformed data to see a 2-dimensional representation of our 4-dimensional iris dataset.

colors = ["navy", "turquoise", "darkorange"]

for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_r[iris.target == i, 0], X_r[iris.target == i, 1], color=color, label=target_name)

plt.legend()

PCA is a powerful and very widely used tool for dimension reduction, and it has many features that we have not discussed today, which you can find in the PCA documentation.

Challenge: Plot PCs other than 1 and 2

In the previous example we plotted PCs 1 and 2, which by definition explain the most variance in the data. Because we asked for n_components=3 we actually have 3 PC dimensions to play with, so try plotting dimensions other than 1 and 2. Try plotting 2 and 3, for example, but first think about how the relationship between the points will change when looking at these PCs.
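
If you get stuck, the only change needed from the plot above is which columns of X_r are passed to plt.scatter(). One possible version, reusing the colors list from above and plotting PCs 2 and 3 (columns 1 and 2 of X_r), is:

# Plot PC2 (column 1) against PC3 (column 2) of the PCA-transformed data
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_r[iris.target == i, 1], X_r[iris.target == i, 2], color=color, label=target_name)

plt.xlabel('PC2')
plt.ylabel('PC3')
plt.legend()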

Supervised learning: Classification tasks

Whereas unsupervised learning uses unlabeled data and statistical models to help identify patterns in the data, supervised learning is a class of methods that uses labeled high-dimensional data, with the goal of learning a mapping between the features (explanatory or predictor variables) and the targets (response variables, or really the labels applied to the data). We could spend an entire semester on supervised learning in scikit-learn, so we will only touch on one specific model and the broad outline of the fit/transform/predict paradigm.

The first kind of supervised learning model we will explore is a classification model. In a classification task the labels are categorical variables, and the goal of the inference is to learn a mapping between the values of the data columns (features) and the corresponding labels (targets).

In the simplest terms, we want to be able to look at the values of our iris measurements and learn something about how all of these values combined can tell us about species ID, with the goal of being able to ID a new flower simply by plugging in values for these measurements.

Train/Test split

For classification problems the target must be categorical, and so we will choose 'species' as the target, and the data columns as the features. One important step in ML workflows is splitting the data into training and testing sets. The reason for this is we want to train the ML model on one set of data, and then we want to evaluate how it performs on a set of data that it hasn't 'seen' yet. This provides a more honest evaluation of model performance.

scikit-learn provides a function for this called train_test_split() which handles all the details of randomly partitioning your data. In ML (particularly in the scikit world), features are indicated by variables labeled X and targets are indicated by variables labeled y, and in the specific case of train_test_split(), which returns a 4-tuple, the train and test data variables are almost always named this way:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

Look at the shape of some of these new variables. What is the shape of iris.data? What can you infer about the default ratio of training to testing data that this function uses?
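
One way to answer these questions is simply to print the shapes; the row counts in each split will reveal the default train/test ratio:

# Compare the number of rows in the training and testing splits to the full dataset
print(iris.data.shape)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)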

Random Forest Classifier

Random Forest models have a huge wealth of tunable parameters. We will side-step these for now by specifying n_estimators=100 as the only parameter, and point you to the wonderful documentation dedicated to tuning the hyper-parameters of an estimator for more in-depth information.

After we instantiate a new RandomForestClassifier object, the next thing we will want to do is "fit" the model to the features and the targets.

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100) 
rf_model.fit(X_train, y_train)
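
If you later want to experiment with the hyper-parameter tuning mentioned above, a minimal sketch using GridSearchCV might look like this (the parameter grid below is an arbitrary illustration, not a recommendation):

from sklearn.model_selection import GridSearchCV

# Try a few combinations of two common Random Forest parameters,
# scoring each combination with 5-fold cross-validation on the training data
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 2, 5],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)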

Model evaluation

For classification, key metrics include:

  • Accuracy: Overall proportion of correct predictions
  • Precision: Proportion of positive predictions that are actual positives
  • Recall: Proportion of actual positives that are correctly predicted

from sklearn.metrics import classification_report

y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))

Visualizing classification performance

scikit-learn also provides a nice way of visually representing classification performance with a confusion matrix. The ConfusionMatrixDisplay class has a method called from_estimator which takes as arguments a trained model and the test data for features and targets, and plots a nice graphical representation that captures most of the information in the classification report in a way that is very confusing at first, which is how it got this name. ;)

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test)

Supervised learning: Regression tasks

In our previous example of a classification task, we had a categorical target (response variable), so the predictions were category probabilities (with the most probable category being the predicted value). If we have continuous value targets (e.g. numerical values), we can approach this task as a regression procedure where we are attempting to find a (potentially non-linear) relationship between the features and the target(s).

Train/Test split

Let's look back at the iris data one more time, to see what features we have available.

print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

For this exercise we will choose petal width (cm) as the target (response variable), and we'll use the other 3 measurements as features (predictor variables). Essentially we will ask how well we can predict petal width if we only have information about sepal length/width and petal length. Here we have to redo the train_test_split() call, because now we will split on a different subset of the data.

# imputed_data[:, :3] - The first three columns of the imputed_data matrix (the features)
# imputed_data[:, -1] - The last column of this matrix (petal width, the target)

X_train, X_test, y_train, y_test = train_test_split(imputed_data[:, :3], imputed_data[:, -1])

Random Forest Regression

The format of the call to fit the random forest regressor is identical to the call to fit the classifier, we simply call fit() and pass in the features and targets from the training set.

from sklearn.ensemble import RandomForestRegressor

rgr_model = RandomForestRegressor(n_estimators=100) 
rgr_model.fit(X_train, y_train)

Model evaluation

Sklearn regression estimators have a different evaluation mechanism for performance, because the 'targets' (response variables) are continuous and not categorical. To evaluate regression performance you can use the score() method, passing in the held-out features (X_test) and targets (y_test); for regressors, score() returns the coefficient of determination (R²).

print(rgr_model.score(X_test, y_test))
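
If you want additional numeric summaries beyond R², you can compute standard error metrics from predictions on the held-out data (a short sketch; the next section makes the same predict() call when plotting):

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Generate predictions for the held-out features and compute error metrics
y_pred = rgr_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))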

Visualizing regression performance

You can also visualize regression prediction performance by calling predict() on the held-out features (X_test) to generate target predictions (here y_pred) and then plotting the results in a scatter plot, with true values on the x-axis and predicted values on the y-axis.

y_pred = rgr_model.predict(X_test)
plt.scatter(y_test, y_pred)
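
A couple of optional touches make this plot easier to interpret: axis labels and a 1:1 reference line (points that fall on the line are perfect predictions):

# Same scatter plot with labels and a dashed 1:1 reference line
plt.scatter(y_test, y_pred)
plt.xlabel('True petal width (cm)')
plt.ylabel('Predicted petal width (cm)')
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color='gray', linestyle='--')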

Challenge: Re-do this analysis workflow with a new dataset

Go through the same workflow we did above with exploratory analysis (PCA dimension reduction), classification, and regression using the penguin_data.csv file from today's debugging challenge.
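
One possible way to get started (the exact path to penguin_data.csv will depend on where you saved it relative to your notebook, so adjust as needed):

# Load the penguin data and take a first look at its structure
penguins = pd.read_csv('penguin_data.csv')
penguins.head()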

Wrapping up

Add, commit, and push your notebook to your GitHub repo.