Machine Learning introduction by Data School


  1. Types of supervised learning
  2. Reading data using pandas
  3. Visualizing data using seaborn
  4. Linear regression pros and cons
  5. Form of linear regression
  6. Preparing X and y using pandas
  7. Splitting X and y into training and testing sets
  8. Linear regression in scikit-learn
  9. Interpreting model coefficients
  10. Making predictions
  11. Model evaluation metrics for regression
  12. Computing the RMSE for our Sales predictions
  13. Feature selection
  14. Resources

This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos.

1. Types of supervised learning

  • Classification: Predict a categorical response
  • Regression: Predict a continuous response

2. Reading data using pandas

Pandas: popular Python library for data exploration, manipulation, and analysis

In [1]:
# conventional way to import pandas
import pandas as pd
In [2]:
# read CSV file directly from a URL and save the results
# pd.read_csv accepts either a local file path or a URL
# in a Jupyter notebook, place the cursor inside the call and press Shift+Tab (twice) to see its documentation
# index_col=0 uses the first column of the file as the row index
data = pd.read_csv('', index_col=0)

# display the first 5 rows
data.head()
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9

Primary object types:

  • DataFrame: rows and columns (like a spreadsheet or matrix)
    • First row will always be the column headers
    • First column is an index
  • Series: a single column (vector)
In [3]:
# display the last 5 rows
data.tail()
TV Radio Newspaper Sales
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4
In [4]:
# check the shape of the DataFrame (rows, columns)
# there are 200 rows x 4 columns
data.shape
(200, 4)

What are the features?

  • TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: advertising dollars spent on Radio
  • Newspaper: advertising dollars spent on Newspaper

What is the response?

  • Sales: sales of a single product in a given market (in thousands of items)

What else do we know?

  • Because the response variable is continuous, this is a regression problem.
  • There are 200 observations (represented by the rows), and each observation is a single market.

3. Visualizing data using seaborn

Seaborn: Python library for statistical data visualization built on top of Matplotlib

In [5]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline
In [6]:
# visualize the relationship between the features and the response using scatterplots
# this produces one scatterplot of Sales against each feature
# use height= and aspect= to control the size of the graphs (height= was called size= before seaborn 0.9)
# use kind='reg' to draw a fitted linear regression line on each graph
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')

Linear regression

  • Strong relationship between TV ads and sales
  • Weak relationship between Radio ads and sales
  • Very weak to no relationship between Newspaper ads and sales

4. Linear regression pros and cons

Pros:

  • Fast
  • No tuning required
  • Highly interpretable
  • Well-understood

Cons:

  • Unlikely to produce the best predictive accuracy
    • Presumes a linear relationship between the features and response
    • If the relationship is highly non-linear, as it is in many real scenarios, a linear model will not capture it and its predictions will be inaccurate
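
To make the last point concrete, here is a small sketch on synthetic data (not the advertising dataset). The response is exactly $x^2$, so a straight line cannot capture it. The variable names and the use of `np.polyfit` are choices made for this illustration only.

```python
import numpy as np

# synthetic, perfectly quadratic data (an assumption for illustration only)
x = np.arange(-3, 4, dtype=float)   # -3, -2, ..., 3
y = x ** 2

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# fit a straight line (degree 1) and a parabola (degree 2) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
linear_rmse = rmse(y, slope * x + intercept)
quadratic_rmse = rmse(y, np.polyval(np.polyfit(x, y, deg=2), x))

print(linear_rmse, quadratic_rmse)
```

The best straight line here is flat (slope 0, intercept 4), so it predicts the same value everywhere and leaves a large error, while the degree-2 fit reproduces the data almost exactly.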

5. Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

  • $y$ is the response
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for $x_1$ (the first feature)
  • $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the model coefficients

  • These values are "learned" during the model fitting step using the "least squares" criterion
  • Then, the fitted model can be used to make predictions
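
As a rough sketch of what "learning" the coefficients by least squares means, the example below fits made-up data generated from $y = 1 + 2x_1 + 3x_2$ using NumPy's `np.linalg.lstsq`. The data and variable names are assumptions for illustration, and this is not necessarily how scikit-learn performs the fit internally.

```python
import numpy as np

# toy data generated from y = 1 + 2*x1 + 3*x2 (made-up numbers, no noise)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# prepend a column of ones so the first learned value is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# least squares: choose the betas that minimize the sum of squared residuals
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = betas[0], betas[1:]

# use the fitted betas to predict for a new observation (x1=6, x2=7)
new_prediction = intercept + coefs @ np.array([6.0, 7.0])
print(intercept, coefs, new_prediction)
```

Because the toy data has no noise, the recovered values match the generating equation (intercept 1, coefficients 2 and 3) up to floating-point error.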

6. Preparing X and y using pandas

  • scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays
  • However, pandas is built on top of NumPy
  • Thus, X can be a pandas DataFrame (matrix) and y can be a pandas Series (vector)
In [7]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# equivalent command to do this in one line using double square brackets
# inner brackets define a Python list
# outer brackets select a subset of the original DataFrame
X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows
X.head()
TV Radio Newspaper
1 230.1 37.8 69.2
2 44.5 39.3 45.1
3 17.2 45.9 69.3
4 151.5 41.3 58.5
5 180.8 10.8 58.4
In [8]:
# check the type and shape of X
print(type(X))
print(X.shape)
<class 'pandas.core.frame.DataFrame'>
(200, 3)
In [9]:
# select a Series from the DataFrame
y = data['Sales']

# equivalent command that works if there are no spaces in the column name
# you can select the Sales as an attribute of the DataFrame
y = data.Sales

# print the first 5 values
y.head()
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64
In [10]:
# check the type and shape of y
print(type(y))
print(y.shape)
<class 'pandas.core.series.Series'>
(200,)

7. Splitting X and y into training and testing sets

In [11]:
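# import train_test_split (it lives in sklearn.model_selection; the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)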
In [12]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(X_test.shape)
(150, 3)
(50, 3)
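
To see where the 150/50 split comes from, here is a simplified pure-Python sketch of a shuffled 75/25 split. The function `simple_train_test_split` is hypothetical and does not reproduce scikit-learn's exact shuffling, so it would not select the same rows as `train_test_split` for the same seed.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=1):
    """Shuffle row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))  # stand-in for the 200 markets
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 150 50
```

Fixing the seed plays the same role as `random_state=1`: the split is random, but it is the same every time the code runs.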

8. Linear regression in scikit-learn

In [13]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

9. Interpreting model coefficients

In [14]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
[ 0.04656457  0.17915812  0.00345046]
In [15]:
# pair the feature names with the coefficients
# the order of the coefficients is hard to remember, so we use Python's zip function to pair each feature name with its coefficient
# wrap the result in list() so that Python 3 displays the pairs instead of a zip object
list(zip(feature_cols, linreg.coef_))
$$y = 2.88 + 0.0466 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$$

How do we interpret the TV coefficient (0.0466)?

  • For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
  • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

Important notes:

  • This is a statement of association, not causation
  • If an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
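
The arithmetic behind the "46.6 items" statement can be checked with a short sketch. It uses the rounded coefficients from the fitted equation, and the function name `predict_sales` and the two example markets are hypothetical.

```python
def predict_sales(tv, radio, newspaper):
    """Apply the fitted equation using the rounded coefficients (inputs in thousands of dollars)."""
    return 2.88 + 0.0466 * tv + 0.179 * radio + 0.00345 * newspaper

# two hypothetical markets that differ only by 1 unit ($1,000) of TV spending
baseline = predict_sales(tv=100, radio=20, newspaper=10)
more_tv = predict_sales(tv=101, radio=20, newspaper=10)

# difference in predicted Sales, converted from thousands of items to items
extra_items = (more_tv - baseline) * 1000
print(round(extra_items, 1))  # 46.6
```

Holding Radio and Newspaper fixed is what makes the difference equal to the TV coefficient alone.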

10. Making predictions

In [16]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an evaluation metric in order to compare our predictions with the actual values.

11. Model evaluation metrics for regression

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

In [17]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
In [18]:
# calculate MAE by hand
print((10 + 0 + 20 + 10) / 4)

# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
In [19]:
# calculate MSE by hand
import numpy as np
print((10**2 + 0**2 + 20**2 + 10**2) / 4)

# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
In [20]:
# calculate RMSE by hand
import numpy as np
print(np.sqrt(((10**2 + 0**2 + 20**2 + 10**2) / 4)))

# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average absolute error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
    • This makes it easier to put in context, since it is in the same units as the response variable
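
The sketch below writes the three metrics as plain Python functions and illustrates the "punishes larger errors" point. The function names and the second set of example numbers (`even_errors`, `one_big_error`) are made up for this illustration.

```python
import math

def mae(true, pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def mse(true, pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

def rmse(true, pred):
    """Square root of the MSE, in the same units as the response."""
    return math.sqrt(mse(true, pred))

true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
print(mae(true, pred), mse(true, pred), rmse(true, pred))

# same total absolute error, spread evenly vs. concentrated in one prediction
actual = [0, 0, 0, 0]
even_errors = [10, 10, 10, 10]
one_big_error = [0, 0, 0, 40]
print(mae(actual, even_errors), mae(actual, one_big_error))  # 10.0 10.0
print(mse(actual, even_errors), mse(actual, one_big_error))  # 100.0 400.0
```

MAE cannot tell the two prediction sets apart, while MSE (and therefore RMSE) is four times larger when the error is concentrated in a single prediction.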

12. Computing the RMSE for our Sales predictions

In [21]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

13. Feature selection

Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?

Let's remove it from the model and check the RMSE!

In [22]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.Sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, this feature is unlikely to be useful for predicting Sales and should be removed from the model.
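
The same kind of comparison can be sketched as a reusable helper. Everything here is an assumption for illustration: the data is synthetic (two informative columns and one pure-noise column), the helper name `holdout_rmse` is hypothetical, and NumPy's least squares solver stands in for `LinearRegression`. No particular result is claimed for the advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# synthetic stand-in data: columns 0 and 1 drive y, column 2 is pure noise
X = rng.uniform(0, 100, size=(n, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=n)

# fixed 75/25 split shared by every candidate feature set
order = rng.permutation(n)
train, test = order[:150], order[150:]

def holdout_rmse(columns):
    """Fit least squares on the training rows and return RMSE on the test rows."""
    def design(idx):
        return np.column_stack([np.ones(len(idx)), X[np.ix_(idx, columns)]])
    betas, *_ = np.linalg.lstsq(design(train), y[train], rcond=None)
    residuals = y[test] - design(test) @ betas
    return float(np.sqrt(np.mean(residuals ** 2)))

# compare the full feature set against the set without the noise column
for columns in ([0, 1, 2], [0, 1]):
    print(columns, holdout_rmse(columns))
```

Using one fixed split for every candidate keeps the comparison fair, because each feature set is scored on exactly the same held-out rows.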

14. Resources

Linear regression:

