Intro to Machine Learning with Scikit-Learn¶
In this session we will get a hands-on introduction to machine learning with the scikit-learn package. Along the way we will become familiar with:
- Basic machine learning vocabulary
- Data importing, transformation, and test/train splitting
- Unsupervised learning for clustering/dimensionality reduction
- Supervised learning for classification/regression tasks
Prerequisites¶
You should already have the necessary packages installed, but if you don't you can install them like this:
conda install -c conda-forge matplotlib scikit-learn
Open a new notebook in your hack-5-python/notebooks directory and rename it to "15.1-intro-scikit.ipynb". In a new cell, include a few imports that we will need:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
Fetch the Iris data¶
We have been loading and cleaning the iris data by hand up to this point, but now we are going to make use of a nice feature in scikit-learn, which provides the iris data as a pre-loaded dataset.
The iris data is loaded as a dictionary with several keys. Spend a few moments inspecting the values associated with each of those keys and make sure you understand the structure of this data.
iris = datasets.load_iris()
iris.keys()
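If you want a quick overview of what each key holds, a small loop like this (a minimal sketch, not part of the original lesson) prints the type of the value stored under every key:
# Peek at the type of the value stored under each key
for key in iris.keys():
    print(f"{key}: {type(iris[key])}")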
Data Cleaning¶
Previously we had been removing rows that contained missing values from the iris
data. This might not be the best strategy: a row with missing data may still have
valid values in its other columns, and throwing away data feels bad. In the case
where you want to retain all your data you can elect to impute missing values,
which means to fill them in using some strategy that is hopefully defensible. A
very simple and defensible strategy is to impute missing data with the mean value
for a given column. Much more detailed information about imputation is available
in the sklearn.impute documentation.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(iris.data)
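As a quick sanity check (a small sketch, not part of the original lesson), you can confirm that the imputed array has the same shape as the input and contains no remaining missing values. The built-in iris dataset happens to have no missing values, so the imputer simply passes the data through unchanged:
# Same shape as iris.data, and no NaNs left after imputation
print(imputed_data.shape)
print(np.isnan(imputed_data).any())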
Visualizing raw data¶
As we did last week, we can use matplotlib scatter to visualize the relationship between 2 of the dimensions of the data.
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
Unsupervised learning: Dimension reduction & clustering¶
The nature of 'Big Data' is that it is hugely multidimensional, so plotting 2 dimensions at a time might be a pointless operation. Unsupervised learning is a class of methods that takes unlabeled high-dimensional data and collapses it into a smaller number of dimensions that can capture in some way the relationship between the data points. I sometimes think of unsupervised learning as 'exploratory data analysis', because it's a way of looking at the structure of the data through different lenses. Unsupervised learning also includes clustering methods for finding groupings of data points that are 'similar' given some statistical model. We'll only talk about dimension reduction (or 'decomposition') today, but you can learn more about both of these things in the scikit-learn documentation on "Unsupervised learning: Clustering" & "Unsupervised learning: Decomposition".
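We won't cover clustering in this lesson, but just to give a flavor, here is a minimal sketch (using KMeans, one of several clustering algorithms in scikit-learn; not part of the original lesson) that asks for 3 clusters in the unlabeled iris measurements:
from sklearn.cluster import KMeans

# Ask KMeans for 3 clusters using only the measurements (no labels)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(iris.data)
print(cluster_labels[:10])
Note that the cluster labels are arbitrary integers, not species IDs: the algorithm never sees the labels.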
Principal component analysis¶
Principal component analysis is a dimensionality reduction method used to transform and project data points onto fewer orthogonal axes that can explain the greatest amount of variance in the data.
We begin by constructing a PCA object, specifying that we want to retain 3 components, and then calling fit_transform(), passing in the iris data. fit_transform() is actually a shortcut for calling pca.fit(iris.data).transform(iris.data): fitting and transforming are two different actions, but in some cases (as with PCA) it is rarely useful to fit a model to data without also transforming it.
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X_r = pca.fit_transform(iris.data)
X_r
is a new variable holding the values of each data point transformed with PCA.
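To convince yourself of this (a quick check, not part of the original lesson), you can compare shapes before and after the transformation, and verify that calling transform() on the already-fitted pca object reproduces fit_transform():
# The data goes from 4 measurement columns to 3 PC columns
print(iris.data.shape, X_r.shape)
# fit_transform() is equivalent to fitting and then transforming
print(np.allclose(pca.transform(iris.data), X_r))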
Evaluate the PCA¶
One key piece of information with PCA is the amount of variance explained along each orthogonal PC axis. By definition the variance explained monotonically decreases from the first to the last PC, meaning the first axis captures the most variation, and the second axis captures the second most, and so on. This allows you to 'read the tea leaves' a bit in interpreting the relationship between samples in PC-space.
## Percentage of variance explained for each component
print(f"Explained variance ratio (first two components): {pca.explained_variance_ratio_})
Visualize samples in PC space¶
We can now plot the transformed data to see a 2-dimensional representation of our 4-dimensional iris dataset.
colors = ["navy", "turquoise", "darkorange"]
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
plt.scatter(X_r[iris.target == i, 0], X_r[iris.target == i, 1], color=color, label=target_name)
plt.legend()
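To make the plot easier to interpret, you can label the axes with the fraction of variance each PC explains (a small optional addition; add these lines to the plotting cell above, before plt.legend()):
# Label each axis with the proportion of variance explained by that PC
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} of variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} of variance)")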
PCA is a powerful and very widely used tool for dimension reduction, which has many features we have not discussed today; you can find them in the PCA documentation.
Challenge: Plot PCs other than 1 and 2¶
In the previous example we plotted PCs 1 and 2, which by definition explain the most
variance in the data. Because we asked for n_components=3
we actually have 3 PC dimensions
to play with, so try plotting dimensions other than 1 and 2. Try plotting 2 and 3, for example,
but first think about how the relationship between the points will change when looking at these PCs.
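Once you have thought it through, one possible sketch (just one way to do it; not part of the original lesson) looks almost identical to the plot above, but uses columns 1 and 2 of X_r (i.e. PCs 2 and 3) instead of columns 0 and 1:
# Plot PC2 vs PC3 instead of PC1 vs PC2
colors = ["navy", "turquoise", "darkorange"]
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_r[iris.target == i, 1], X_r[iris.target == i, 2], color=color, label=target_name)
plt.legend()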
Supervised learning: Classification tasks¶
Whereas unsupervised learning uses unlabeled data and statistical models to help identify patterns in that data, supervised learning is a class of methods that uses labeled high-dimensional data, with the goal of learning a mapping between the features (explanatory or predictor variables) and the targets (response variables, or really the labels applied to the data). We could spend an entire semester on supervised learning in scikit-learn, so we will only touch on one specific model and the broad outline of the fit/transform/predict paradigm.
The first kind of supervised learning models we will explore are classification models. In a classification task the labels are categorical variables, and the goal of the inference is to learn a mapping between the values of the data columns (features) and the corresponding labels (targets).
In the simplest terms, we want to be able to look at the values of our iris measurements and learn something about how all these values combined can tell us about species ID, with the goal of being able to ID a new flower simply by plugging in values for these measurements.
Train/Test split¶
For classification problems the target must be categorical, and so we will choose 'species'
as the target, and the data columns as the features. One important step in ML workflows
is splitting the data into training
and testing
sets. The reason for this is we want to
train the ML model on one set of data, and then we want to evaluate how it performs on a
set of data that it hasn't 'seen' yet. This provides a more honest evaluation of model performance.
scikit-learn provides a function for this called train_test_split(), which handles all the details of randomly partitioning your data. In ML (particularly in the scikit world), features are indicated by variables labeled X and targets are indicated by variables labeled y, and in the specific case of train_test_split(), which returns a 4-tuple, the train and test data variables are almost always named this way:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(imputed_data, iris.target)
Look at the shape of some of these new variables. What is the shape of iris.data? What can you infer about the default ratio of training to testing data this function uses?
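One quick way to answer these questions (a small sketch, not part of the original lesson) is to print the shapes directly; by default, train_test_split() holds out 25% of the rows for testing:
# iris.data is (150, 4); the split gives roughly a 75/25 partition
print(iris.data.shape)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)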
Random Forest Classifier¶
There is a huge wealth of parameters for tuning
Random Forest models. We will side-step
these for now by specifying n_estimators=100
as the only parameter, and guiding you to the
wonderful documentation dedicated to tuning hyper-parameters of an estimator for more in-depth information.
After we instantiate a new RandomForestClassifier
object, the next thing we will want to do
is "fit" the model to the features and the targets.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
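Before moving on to formal evaluation, it can be reassuring to take a quick peek (a small sketch, not part of the original lesson) at the fitted model's predictions for a few held-out samples and compare them to the true labels:
# Compare predictions for the first 5 held-out samples to the true targets
print(rf_model.predict(X_test[:5]))
print(y_test[:5])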
Model evaluation¶
For classification, key metrics include:
- Accuracy: Overall proportion of correct predictions
- Precision: Proportion of positive predictions that are actual positives
- Recall: Proportion of actual positives that are correctly predicted as positive
from sklearn.metrics import classification_report
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))
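If you only want one of these numbers rather than the full report, scikit-learn also exposes the individual metrics as functions; for example, accuracy alone can be computed like this (a small optional addition):
from sklearn.metrics import accuracy_score

# Overall proportion of correct predictions on the held-out data
print(accuracy_score(y_test, y_pred))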
Visualizing classification performance¶
scikit-learn also provides a nice way of visually representing classification
performance with a confusion matrix. The ConfusionMatrixDisplay
class has a method
called from_estimator
which takes as arguments a trained model, and the test data
for features and targets, and plots a nice graphical representation that captures most
of the information in the classification report in a way that is very confusing at first,
which is how it got this name. ;)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test)
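If you would rather read the matrix as proportions than raw counts, from_estimator() also accepts a normalize argument (an optional variation, not part of the original lesson):
# Normalize each row so values are the fraction of true samples per class
ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test, normalize="true")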
Supervised learning: Regression tasks¶
In our previous example of a classification task, we had a categorical target (response variable), so the predictions were category probabilities (with the most probable category being the predicted value). If we have continuous value targets (e.g. numerical values), we can approach this task as a regression procedure where we are attempting to find a (potentially non-linear) relationship between the features and the target(s).
Train/Test split¶
Let's look back at the iris data one more time, to see what features we have available.
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
We'll use petal width (cm) as the target (response variable), and the other 3 measurements as features (predictor variables). Essentially we will ask how well we can predict petal width, if we only have information about sepal length/width and petal length. Here we have to redo the train_test_split() call, because now we will split on a different subset of the data.
# imputed_data[:, :3] - The first three columns of the imputed_data matrix (the features)
# imputed_data[:, -1] - The last column of this matrix (petal width, the target)
X_train, X_test, y_train, y_test = train_test_split(imputed_data[:, :3], imputed_data[:, -1])
Random Forest Regression¶
The format of the call to fit the random forest regressor is identical to the call to fit the classifier: we simply call fit() and pass in the features and targets from the training set.
from sklearn.ensemble import RandomForestRegressor
rgr_model = RandomForestRegressor(n_estimators=100)
rgr_model.fit(X_train, y_train)
Model evaluation¶
scikit-learn regression estimators have a different evaluation mechanism for performance, because the 'targets' (response variables) are continuous and not categorical. To evaluate regression performance you can use the score() method, passing in the held-out features (X_test) and targets (y_test); for regressors this returns the coefficient of determination (R²), where a value of 1.0 indicates a perfect fit.
print(rgr_model.score(X_test, y_test))
Visualizing regression performance¶
You can also visualize regression prediction performance by calling predict() on the held-out features (X_test) to generate target predictions (here y_pred), and then plotting the results in a scatter plot, with true values on the x-axis and predicted values on the y-axis.
y_pred = rgr_model.predict(X_test)
plt.scatter(y_test, y_pred)
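A perfect model would put every point on the 1:1 line, so adding a reference line and axis labels makes the plot easier to read (a small optional addition; include these lines in the same cell as the scatter call):
# Dashed 1:1 line: points on this line are perfectly predicted
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, linestyle="--", color="gray")
plt.xlabel("True petal width (cm)")
plt.ylabel("Predicted petal width (cm)")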
Challenge: Re-do this analysis workflow with a new dataset¶
Go through the same workflow we did above with exploratory analysis (PCA dimension
reduction), classification, and regression using the penguin_data.csv
file from
today's debugging challenge.
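If you want a starting point, a minimal sketch (assuming penguin_data.csv sits in your current working directory; adjust the path and the column choices to match the actual file) might begin like this:
# Load the penguin data and keep only the numeric measurement columns
penguins = pd.read_csv("penguin_data.csv")
numeric = penguins.select_dtypes(include="number")

# Impute any missing values with column means, then reduce with PCA
imputed_penguins = SimpleImputer(strategy="mean").fit_transform(numeric)
penguin_pcs = PCA(n_components=2).fit_transform(imputed_penguins)
print(penguin_pcs[:5])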
Wrapping up¶
Add, commit, and push your notebook to your GitHub repo.