Topics¶

Review of model evaluation procedures
Steps for K-fold cross-validation
Comparing cross-validation to train/test split
Cross-validation recommendations
Cross-validation example: parameter tuning
Cross-validation example: model selection
Cross-validation example: feature selection
Improvements to cross-validation
Resources

This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos.

1. Review of model evaluation procedures¶

Motivation: Need a way to choose between machine learning models

Goal is to estimate likely performance of a model on out-of-sample data

Initial idea: Train and test on the same data

But, maximizing training accuracy rewards overly complex models which overfit the training data

Alternative idea: Train/test split

Split the dataset into two pieces, so that the model can be trained and tested on different data
Testing accuracy is a better estimate than training accuracy of out-of-sample performance
Problem with train/test split
- It provides a high variance estimate since changing which observations happen to be in the testing set can significantly change testing accuracy
- Testing accuracy can change a lot depending on a which observation happen to be in the testing set

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target

# use train/test split with different random_state values
# we can change the random_state values that changes the accuracy scores
# the accuracy changes a lot
# this is why testing accuracy is a high-variance estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.97368421052631582

Question: What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?

Answer: That's the essense of cross-validation!

2. Steps for K-fold cross-validation¶

Split the dataset into K equal partitions (or "folds")
- So if k = 5 and dataset has 150 observations
- Each of the 5 folds would have 30 observations
Use fold 1 as the testing set and the union of the other folds as the training set
- Testing set = 30 observations (fold 1)
- Training set = 120 observations (folds 2-5)
Calculate testing accuracy
Repeat steps 2 and 3 K times, using a different fold as the testing set each time
- We will repeat the process 5 times
- 2nd iteration
  - fold 2 would be the testing set
  - union of fold 1, 3, 4, and 5 would be the training set
- 3rd iteration
  - fold 3 would be the testing set
  - union of fold 1, 2, 4, and 5 would be the training set
- And so on...
Use the average testing accuracy as the estimate of out-of-sample accuracy

Diagram of 5-fold cross-validation:

5-fold cross-validation

# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.cross_validation import KFold
kf = KFold(25, n_folds=5, shuffle=False)


# print the contents of each training and testing set
# ^ - forces the field to be centered within the available space
# .format() - formats the string similar to %s or %n
# enumerate(sequence, start=0) - returns an enumerate object
print('{} {:^61} {}'.format('Iteration', 'Training set obsevations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
    print('{!s:^9} {} {!s:^25}'.format(iteration, data[0], data[1]))

Iteration                   Training set obsevations                    Testing set observations
    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]
    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]
    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]

Dataset contains 25 observations (numbered 0 through 24)
5-fold cross-validation, thus it runs for 5 iterations
For each iteration, every observation is either in the training set or the testing set, but not both
Every observation is in the testing set exactly once

3. Comparing cross-validation to train/test split¶

Advantages of cross-validation:

More accurate estimate of out-of-sample accuracy
More "efficient" use of data
- This is because every observation is used for both training and testing

Advantages of train/test split:

Runs K times faster than K-fold cross-validation
- This is because K-fold cross-validation repeats the train/test split K-times
Simpler to examine the detailed results of the testing process

4. Cross-validation recommendations¶

K can be any number, but K=10 is generally recommended
- This has been shown experimentally to produce the best out-of-sample estimate
For classification problems, stratified sampling is recommended for creating the folds
- Each response class should be represented with equal proportions in each of the K folds
  - If dataset has 2 response classes
    - Spam/Ham
    - 20% observation = ham
    - Each cross-validation fold should consist of exactly 20% ham
- scikit-learn's cross_val_score function does this by default

5. Cross-validation example: parameter tuning¶

Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset

We want to choose the best tuning parameters that best generalize the data

from sklearn.cross_validation import cross_val_score

# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
# k = 5 for KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# Use cross_val_score function
# We are passing the entirety of X and y, not X_train or y_train, it takes care of splitting the dat
# cv=10 for 10 folds
# scoring='accuracy' for evaluation metric - althought they are many
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)

[ 1.          0.93333333  1.          1.          0.86666667  0.93333333
  0.93333333  1.          1.          1.        ]

In the first iteration, the accuracy is 100%
Second iteration, the accuracy is 93% and so on

cross_val_score executes the first 4 steps of k-fold cross-validation steps which I have broken down to 7 steps here in detail

Split the dataset (X and y) into K=10 equal partitions (or "folds")
Train the KNN model on union of folds 2 to 10 (training set)
Test the model on fold 1 (testing set) and calculate testing accuracy
Train the KNN model on union of fold 1 and fold 3 to 10 (training set)
Test the model on fold 2 (testing set) and calculate testing accuracy
It will do this on 8 more times
When finished, it will return the 10 testing accuracy scores as a numpy array

# use average accuracy as an estimate of out-of-sample accuracy
# numpy array has a method mean()
print(scores.mean())

0.966666666667

Our goal here is to find the optimal value of K

# search for an optimal value of K for KNN

# range of k we want to try
k_range = range(1, 31)
# empty list to store scores
k_scores = []

# 1. we will loop through reasonable values of k
for k in k_range:
    # 2. run KNeighborsClassifier with k neighbours
    knn = KNeighborsClassifier(n_neighbors=k)
    # 3. obtain cross_val_score for KNeighborsClassifier with k neighbours
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    # 4. append mean of scores for k neighbors to k_scores list
    k_scores.append(scores.mean())


print(k_scores)

[0.95999999999999996, 0.95333333333333337, 0.96666666666666656, 0.96666666666666656, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.98000000000000009, 0.96666666666666656, 0.96666666666666656, 0.97333333333333338, 0.95999999999999996, 0.96666666666666656, 0.95999999999999996, 0.96666666666666656, 0.95333333333333337, 0.95333333333333337, 0.95333333333333337]

# in essence, this is basically running the k-fold cross-validation method 30 times because we want to run through K values from 1 to 30
# we should have 30 scores here
print('Length of list', len(k_scores))
print('Max of list', max(k_scores))

Length of list 30
Max of list 0.98

# plot how accuracy changes as we vary k
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
# plt.plot(x_axis, y_axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')

<matplotlib.text.Text at 0x111cb07b8>

The maximum cv accuracy occurs from k=13 to k=20

The general shape of the curve is an upside down yield
- This is quite typical when examining the model complexity and accuracy
- This is an example of bias-variance trade off
  - Low values of k (low bias, high variance)
    - The 1-Nearest Neighbor classifier is the most complex nearest neighbor model
    - It has the most jagged decision boundary, and is most likely to overfit
  - High values of k (high bias, low variance)
    - underfit
  - Best value is the middle of k (most likely to generalize out-of-sample data)
    - just right
The best value of k
- Higher values of k produce less complex model
  - So we will choose 20 as our best KNN model

6. Cross-validation example: model selection¶

Goal: Compare the best KNN model with logistic regression on the iris dataset

# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)

# Instead of saving 10 scores in object named score and calculating mean
# We're just calculating the mean directly on the results
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

0.98

# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

0.953333333333

We can conclude that KNN is likely a better choice than logistic regression

7. Cross-validation example: feature selection¶

Goal: Select whether the Newspaper feature should be included in the linear regression model on the advertising dataset

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# read in the advertising dataset
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
# since we're selecting only one column, we can select the attribute using .attribute
y = data.Sales

# 10-fold cross-validation with all three features
# instantiate model
lm = LinearRegression()

# store scores in scores object
# we can't use accuracy as our evaluation metric since that's only relevant for classification problems
# RMSE is not directly available so we will use MSE
scores = cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')
print(scores)

[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]

MSE should be positive

But why is the MSE here negative?
MSE is a loss function
- It is something we want to minimize
- A design decision was made so that the results are made negative
- The best results would be the largest number (the least negative) so we can still maximize similar to classification accuracy
Classification Accuracy is a reward function
- It is something we want to maximize

# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)

[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]

# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]

# calculate the average RMSE
print(rmse_scores.mean())

1.69135317081

# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')).mean())

1.67967484191

Without Newspaper

Average RMSE = 1.68
lower number than with model with Newspaper
- RMSE is something we want to minimize
- So the model excluding Newspaper is a better model

8. Improvements to cross-validation¶

Repeated cross-validation

Repeat cross-validation multiple times (with different random splits of the data) and average the results
More reliable estimate of out-of-sample performance by reducing the variance associated with a single trial of cross-validation

Creating a hold-out set

"Hold out" a portion of the data before beginning the model building process
Locate the best model using cross-validation on the remaining data, and test it using the hold-out set
More reliable estimate of out-of-sample performance since hold-out set is truly out-of-sample

Feature engineering and selection within cross-validation iterations

Normally, feature engineering and selection occurs before cross-validation
Instead, perform all feature engineering and selection within each cross-validation iteration
More reliable estimate of out-of-sample performance since it better mimics the application of the model to out-of-sample data

9. Resources¶

scikit-learn documentation: Cross-validation, Model evaluation
scikit-learn issue on GitHub: MSE is negative when returned by cross_val_score
Section 5.1 of An Introduction to Statistical Learning (11 pages) and related videos: K-fold and leave-one-out cross-validation (14 minutes), Cross-validation the right and wrong ways (10 minutes)
Scott Fortmann-Roe: Accurately Measuring Model Prediction Error
Machine Learning Mastery: An Introduction to Feature Selection
Harvard CS109: Cross-Validation: The Right and Wrong Way
Journal of Cheminformatics: Cross-validation pitfalls when selecting and assessing regression and classification models

Cross-Validation