Efficiently Searching Optimal Tuning Parameters
Topics¶
- Review of K-fold cross-validation
- Review of parameter tuning using cross_val_score
- More efficient parameter tuning using GridSearchCV
- Searching multiple parameters simultaneously
- Using the best parameters to make predictions
- Reducing computational expense using RandomizedSearchCV
- Resources
This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos.
1. Review of K-fold cross-validation¶
Steps for cross-validation (illustrated by the sketch after this list):
- Dataset is split into K "folds" of equal size
- Each fold acts as the testing set 1 time, and as part of the training set K-1 times
- Average testing performance is used as the estimate of out-of-sample performance
- Also known as cross-validated performance
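As a concrete illustration of the steps above, here is a minimal sketch of the K-fold procedure written out by hand. It assumes scikit-learn 0.18+ (sklearn.model_selection), whereas the notebook below uses the older sklearn.cross_validation module; in practice, cross_val_score does all of this for you.
# minimal sketch of K-fold cross-validation, assuming scikit-learn 0.18+
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # each fold is the testing set exactly once; the remaining folds form the training set
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))
# the cross-validated performance is the average of the per-fold scores
print(sum(fold_scores) / len(fold_scores))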
Benefits of cross-validation:
- More reliable estimate of out-of-sample performance than train/test split
- Reduces the variance of a single trial of a train/test split
- Can be used for
- Selecting tuning parameters
- Choosing between models
- Selecting features
Drawbacks of cross-validation:
- Can be computationally expensive
- Especially when the data set is very large or the model is slow to train
2. Review of parameter tuning using cross_val_score¶
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset
- That is, select the best value of K (n_neighbors) for the KNN model that predicts species
In [1]:
# imports
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
print('X matrix dimensionality:', X.shape)
print('y vector dimensionality:', y.shape)
In [3]:
# 10-fold (cv=10) cross-validation with K=5 (n_neighbors=5) for KNN (the n_neighbors parameter)
# instantiate model
knn = KNeighborsClassifier(n_neighbors=5)
# store scores in scores object
# scoring metric used here is 'accuracy' because it's a classification problem
# cross_val_score takes care of splitting X and y into the 10 folds, which is why we pass the full X and y instead of X_train and y_train
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
In [4]:
# use average accuracy as an estimate of out-of-sample accuracy
# scores is a numpy array so we can use the mean method
print(scores.mean())
In [28]:
# search for an optimal value of K for KNN
# list of integers 1 to 30
# integers we want to try
k_range = range(1, 31)
# list of scores from k_range
k_scores = []
# 1. we will loop through reasonable values of k
for k in k_range:
    # 2. run KNeighborsClassifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    # 3. obtain cross_val_score for KNeighborsClassifier with k neighbors
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    # 4. append the mean of the scores for k neighbors to the k_scores list
    k_scores.append(scores.mean())
print(k_scores)
In [29]:
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
Out[29]:
3. More efficient parameter tuning using GridSearchCV¶
Allows you to define a grid of parameters that will be searched using K-fold cross-validation
- This is like an automated version of the "for loop" above
In [30]:
from sklearn.grid_search import GridSearchCV
In [46]:
# define the parameter values that should be searched
# in Python 2, k_range = range(1, 31) would suffice, since range returns a list
k_range = list(range(1, 31))
print(k_range)
In [50]:
# create a parameter grid: map the parameter names to the values that should be searched
# simply a python dictionary
# key: parameter name
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_neighbors=k_range)
print(param_grid)
In [52]:
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
- The grid object is ready to do 10-fold cross-validation on a KNN model using classification accuracy as the evaluation metric
- In addition, there is a parameter grid to repeat the 10-fold cross-validation process 30 times
  - Each time, the n_neighbors parameter is given a different value from the list
- We can't give GridSearchCV just a list
  - We have to specify that n_neighbors should take on the values 1 through 30, which is why param_grid maps the parameter name to the list
- You can set n_jobs=-1 to run computations in parallel (if supported by your computer and OS); see the sketch below
  - This is also called parallel programming
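For example, the instantiation above could be written as follows (a hedged variant, not run in this notebook):
# same grid as above, but with the computations spread across all available CPU cores
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', n_jobs=-1)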
In [53]:
# fit the grid with data
grid.fit(X, y)
Out[53]:
Remember, this is running 10-fold cross-validation 30 times
- The KNN model is being fit and predictions are being made 30 x 10 = 300 times
In [54]:
# view the complete results (list of named tuples)
grid.grid_scores_
Out[54]:
List of 30 named tuples
- First tuple
- When n_neighbors = 1
- Mean of accuracy scores = 0.96
- Standard deviation of accuracy scores = 0.053
- If SD is high, the cross-validated estimate of the accuracy might not be as reliable
- There is one tuple for each of the 30 trials of CV
In [60]:
# examine the first tuple
# we will index the list with [] and access each tuple's elements using dot notation
print('Parameters')
print(grid.grid_scores_[0].parameters)
# Array of 10 accuracy scores during 10-fold cv using the parameters
print('')
print('CV Validation Score')
print(grid.grid_scores_[0].cv_validation_scores)
# Mean of the 10 scores
print('')
print('Mean Validation Score')
print(grid.grid_scores_[0].mean_validation_score)
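The standard deviation mentioned above can be checked the same way, since cv_validation_scores is a NumPy array (a small sketch reusing the fitted grid object):
# spread of the 10 accuracy scores for the first tuple
print(grid.grid_scores_[0].cv_validation_scores.std())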
In [66]:
# create a list of the mean scores only
# list comprehension to loop through grid.grid_scores_
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
print(grid_mean_scores)
In [67]:
# plot the results
# this is identical to the one we generated above
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
Out[67]:
In [69]:
# examine the best model
# Single best score achieved across all params (k)
print(grid.best_score_)
# Dictionary containing the parameters (k) used to generate that score
print(grid.best_params_)
# Actual model object fit with those best parameters
# Shows default parameters that we did not specify
print(grid.best_estimator_)
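Note that in scikit-learn 0.18 and later, GridSearchCV lives in sklearn.model_selection and grid_scores_ is replaced by the cv_results_ dictionary. Under that assumption, a minimal equivalent of the workflow above looks like this (a sketch, not part of the original tutorial):
# equivalent workflow on scikit-learn 0.18+, where sklearn.grid_search is deprecated
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
# cv_results_ is a dictionary of arrays instead of a list of named tuples
print(grid.cv_results_['mean_test_score'])
print(grid.best_score_)
print(grid.best_params_)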
4. Searching multiple parameters simultaneously¶
- Example: tuning max_depth and min_samples_leaf for a DecisionTreeClassifier (a sketch follows this list)
- Could tune the parameters independently: change max_depth while leaving min_samples_leaf at its default value, and vice versa
- But the best performance might be achieved when neither parameter is at its default value
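A hedged sketch of what such a grid might look like; the specific value ranges are illustrative assumptions, not from the original tutorial:
# illustrative multi-parameter grid for a decision tree (the candidate values are arbitrary choices)
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
tree_param_grid = dict(max_depth=[1, 3, 5, 7, None], min_samples_leaf=[1, 2, 5, 10])
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=1), tree_param_grid, cv=10, scoring='accuracy')
tree_grid.fit(X, y)
print(tree_grid.best_params_)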
In [70]:
# define the parameter values that should be searched
k_range = list(range(1, 31))
# another parameter besides k that we might vary is the weights parameter
# default option --> 'uniform' (all points in the neighborhood are weighted equally)
# another option --> 'distance' (weights closer neighbors more heavily than farther neighbors)
# we create a list
weight_options = ['uniform', 'distance']
In [71]:
# create a parameter grid: map the parameter names to the values that should be searched
# dictionary = dict(key=values, key=values)
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
In [73]:
# instantiate and fit the grid
# exhaustive grid-search because it's trying every combination
# 10-fold cross-validation is being performed 30 x 2 = 60 times
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
Out[73]:
In [74]:
# view the complete results
grid.grid_scores_
Out[74]:
In [76]:
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
# Best score did not improve for this model
5. Using the best parameters to make predictions¶
In [77]:
# train your model using all data and the best known parameters
# instantiate model with best parameters
knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
# fit with X and y, not X_train and y_train
# even if we use train/test split, we should train on X and y before making predictions on new data
# otherwise we throw away potential valuable data we can learn from
knn.fit(X, y)
# make a prediction on out-of-sample data
# note: predict expects a 2D array-like (a list of observations), hence the double brackets
knn.predict([[3, 5, 4, 2]])
Out[77]:
In [78]:
# shortcut:
# GridSearchCV automatically refits the best model using all of the data
# that best fitted model is stored in the grid object
# we can then make predictions using the best fitted model
# this gives the same prediction as the cell above
grid.predict([[3, 5, 4, 2]])
Out[78]:
6. Reducing computational expense using RandomizedSearchCV¶
- This is a close cousin to GridSearchCV
- Searching many different parameters at once may be computationally infeasible
- For example, searching 10 different values for each of 4 parameters would require 10^4 = 10,000 trials of cross-validation
  - With 10-fold CV, that equals 100,000 model fits
  - And 100,000 sets of predictions
- RandomizedSearchCV searches a random subset of the parameter combinations, and you control the computational "budget"
  - You decide how many combinations it tries (and therefore how long it runs) based on the computational time you have
  - For the grid above (30 x 2 = 60 combinations), n_iter=10 means 10 x 10 = 100 model fits instead of 600
In [79]:
from sklearn.grid_search import RandomizedSearchCV
In [81]:
# specify "parameter distributions" rather than a "parameter grid"
# since both parameters are discrete, param_dist is identical to param_grid
param_dist = dict(n_neighbors=k_range, weights=weight_options)
# if a parameter were continuous (like a regularization strength), we would specify a distribution instead of a list
- Important: Specify a continuous distribution (rather than a list of values) for any continuous parameters; a sketch follows this note
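A hedged sketch of what a continuous parameter distribution might look like, using logistic regression's regularization parameter C as an illustrative example (not part of the original tutorial):
# illustrative only: a continuous "parameter distribution" for a regularization parameter
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in 0.18+
logreg_param_dist = dict(C=uniform(loc=0.01, scale=10))  # samples C uniformly from [0.01, 10.01)
logreg_rand = RandomizedSearchCV(LogisticRegression(), logreg_param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
logreg_rand.fit(X, y)
print(logreg_rand.best_params_)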
In [82]:
# instantiate the randomized search
# 2 new parameters compared to GridSearchCV:
# n_iter --> controls the number of random parameter combinations it will try
# random_state --> for reproducibility
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
# fit
rand.fit(X, y)
# scores
rand.grid_scores_
Out[82]:
In [83]:
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
print(rand.best_estimator_)
In [89]:
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
    rand.fit(X, y)
    best_scores.append(rand.best_score_)
print(best_scores)
7. Resources¶
- scikit-learn documentation: Grid search, GridSearchCV, RandomizedSearchCV
- Timed example: Comparing randomized search and grid search
- scikit-learn workshop by Andreas Mueller: Video segment on randomized search (3 minutes), related notebook
- Paper by Bergstra and Bengio: Random Search for Hyper-Parameter Optimization