Topics¶
- Types of supervised learning
- Reading data using pandas
- Visualizing data using seaborn
- Linear regression pros and cons
- Form of linear regression
- Preparing X and y using pandas
- Splitting X and y into training and testing sets
- Linear regression in scikit-learn
- Interpreting model coefficients
- Making predictions
- Model evaluation metrics for regression
- Computing the RMSE for our Sales predictions
- Feature selection
- Resources
This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so that anyone, including myself, can follow this tutorial without watching the videos.
1. Types of supervised learning¶
- Classification: Predict a categorical response
- Regression: Predict a continuous response
2. Reading data using pandas¶
Pandas: popular Python library for data exploration, manipulation, and analysis
- Anaconda users: pandas is already installed
- Other users: installation instructions
# conventional way to import pandas
import pandas as pd
# read the CSV file directly from a URL and save the results
# pd.read_csv accepts either a local file path or a URL
# to learn more about this method, place the cursor inside the call and press shift + tab (twice) in Jupyter
# index_col=0 uses the first column of the file as the row index
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# display the first 5 rows
data.head()
Primary object types (both are checked in the snippet below):
- DataFrame: rows and columns (like a spreadsheet or matrix)
- The first row of the file is used for the column headers
- The first column is used as the index (because we set index_col=0)
- Series: a single column (vector)
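A quick way to confirm these two object types:
# the full table is a DataFrame; a single selected column is a Series
print(type(data))
print(type(data['Sales']))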
# display the last 5 rows
data.tail()
# check the shape of the DataFrame (rows, columns)
# there are 200 rows x 4 columns
data.shape
What are the features?
- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- Radio: advertising dollars spent on Radio
- Newspaper: advertising dollars spent on Newspaper
What is the response?
- Sales: sales of a single product in a given market (in thousands of items)
What else do we know?
- Because the response variable is continuous, this is a regression problem.
- There are 200 observations (represented by the rows), and each observation is a single market.
3. Visualizing data using seaborn¶
Seaborn: Python library for statistical data visualization built on top of Matplotlib
- Anaconda users: run conda install seaborn from the command line
- Other users: installation instructions
# conventional way to import seaborn
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
# visualize the relationship between the features and the response using scatterplots
# this produces one scatterplot per feature against the response
# use size= and aspect= to control the size and shape of each plot
# (note: seaborn 0.9+ renamed size= to height=)
# use kind='reg' to overlay a fitted regression line on each plot
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg')
Observations from the fitted regression lines (quantified below):
- Strong relationship between TV ads and sales
- Weak relationship between Radio ads and sales
- Very weak to no relationship between Newspaper ads and sales
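These visual impressions can be quantified with pairwise correlations, a quick check that is not part of the original workflow:
# correlation of each feature with Sales backs up the scatterplots
data.corr()['Sales']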
4. Linear regression pros and cons¶
Pros:
- Fast
- No tuning required
- Highly interpretable
- Well-understood
Cons:
- Unlikely to produce the best predictive accuracy
- Presumes a linear relationship between the features and response
- If the relationship is highly non-linear, as it is in many real-world scenarios, a linear model will not capture it well and its predictions will not be accurate
5. Form of linear regression¶
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$
- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)
In this case:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
The $\beta$ values are called the model coefficients
- These values are "learned" during the model fitting step using the "least squares" criterion
- Then, the fitted model can be used to make predictions (a sketch with made-up values follows)
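For intuition, here is a minimal sketch of how such a model turns coefficients into a prediction; the coefficient and spending values below are made up for illustration, not the ones we will learn later:
# hypothetical coefficients and feature values, for illustration only
beta_0, beta_1, beta_2, beta_3 = 3.0, 0.05, 0.2, 0.01
tv, radio, newspaper = 100, 25, 10
# predicted response = intercept + sum of coefficient * feature value
sales_pred = beta_0 + beta_1 * tv + beta_2 * radio + beta_3 * newspaper
print(sales_pred)  # 13.1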
6. Preparing X and y using pandas¶
- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays
- However, pandas is built on top of NumPy
- Thus, X can be a pandas DataFrame (matrix) and y can be a pandas Series (vector)
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# equivalent command to do this in one line using double square brackets
# the inner bracket is a list
# the outer bracket accesses a subset of the original DataFrame
X = data[['TV', 'Radio', 'Newspaper']]
# print the first 5 rows
X.head()
# check the type and shape of X
print(type(X))
print(X.shape)
# select a Series from the DataFrame
y = data['Sales']
# equivalent command that works if the column name has no spaces
# you can access the Sales column as an attribute of the DataFrame
y = data.Sales
# print the first 5 values
y.head()
# check the type and shape of y
print(type(y))
print(y.shape)
7. Splitting X and y into training and testing sets¶
# import
# note: in scikit-learn 0.18+, train_test_split lives in model_selection
# (older versions imported it from sklearn.cross_validation)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
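The split fraction can also be set explicitly with the test_size parameter; this call should be equivalent to the default:
# equivalent call with the default 25% test fraction made explicit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)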
8. Linear regression in scikit-learn¶
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
9. Interpreting model coefficients¶
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
# pair the feature names with the coefficients
# the order is hard to remember, so we use Python's zip function to pair them
# wrap zip in list() so the pairs display in Python 3
list(zip(feature_cols, linreg.coef_))
How do we interpret the TV coefficient (0.0466)?
- For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
- Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items (this association is checked numerically below).
Important notes:
- This is a statement of association, not causation
- If an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
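One way to check this association directly is to compare the fitted model's predictions for two inputs that differ only in TV spending (a sketch; X_new and the spending values are made up):
# two markets identical except for one extra "unit" of TV spending
X_new = pd.DataFrame({'TV': [100, 101], 'Radio': [25, 25], 'Newspaper': [25, 25]})
preds = linreg.predict(X_new)
# the difference matches the TV coefficient (about 0.0466)
print(preds[1] - preds[0])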
10. Making predictions¶
# make predictions on the testing set
y_pred = linreg.predict(X_test)
We need an evaluation metric in order to compare our predictions with the actual values.
11. Model evaluation metrics for regression¶
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.
Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
# calculate MAE by hand
print((10 + 0 + 20 + 10) / 4)
# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
# calculate MSE by hand
import numpy as np
print((10**2 + 0**2 + 20**2 + 10**2) / 4)
# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
# calculate RMSE by hand
import numpy as np
print(np.sqrt(((10**2 + 0**2 + 20**2 + 10**2) / 4)))
# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))
Comparing these metrics:
- MAE is the easiest to understand, because it's the average error.
- MSE is more popular than MAE, because MSE "punishes" larger errors (illustrated after this list).
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
- Easier to put in context as it's the same units as our response variable
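To see how squaring punishes larger errors, compare two sets of predictions with the same MAE but different RMSE (the values below are mine, for illustration):
# same total absolute error, but the second set has one large error
true_vals = [10, 10, 10, 10]
pred_even = [12, 12, 12, 12]   # four errors of 2
pred_spiky = [10, 10, 10, 18]  # one error of 8
print(metrics.mean_absolute_error(true_vals, pred_even))           # 2.0
print(metrics.mean_absolute_error(true_vals, pred_spiky))          # 2.0
print(np.sqrt(metrics.mean_squared_error(true_vals, pred_even)))   # 2.0
print(np.sqrt(metrics.mean_squared_error(true_vals, pred_spiky)))  # 4.0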
12. Computing the RMSE for our Sales predictions¶
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
13. Feature selection¶
Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?
Let's remove it from the model and check the RMSE!
# create a Python list of feature names
feature_cols = ['TV', 'Radio']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# select a Series from the DataFrame
y = data.Sales
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
# make predictions on the testing set
y_pred = linreg.predict(X_test)
# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.
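To make this kind of comparison repeatable, the steps can be wrapped in a small helper function (a sketch; the function name train_test_rmse is my own):
# helper: compute the testing RMSE for a given list of feature names
def train_test_rmse(feature_cols):
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# compare feature subsets by their testing RMSE
print(train_test_rmse(['TV', 'Radio', 'Newspaper']))
print(train_test_rmse(['TV', 'Radio']))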
14. Resources¶
Linear regression:
- Longer notebook on linear regression by Data School
- Chapter 3 of An Introduction to Statistical Learning and related videos by Hastie and Tibshirani (Stanford)
- Quick reference guide to applying and interpreting linear regression by Data School
- Introduction to linear regression by Robert Nau (Duke)
Pandas:
- Three-part pandas tutorial by Greg Reda
- read_csv and read_table documentation
Seaborn: