Dimensionality reduction and feature transformation with scikit-learn.
Dimensionality Reduction: Feature Transformation
Feature Transformation
- The problem of pre-processing a set of m features to create a new set of n features while retaining as much information as possible
- Number of features m is reduced to n, where
- $$n < m$$
- Transformation operator $P^T$ maps each input example x into the new feature space
- $$P^Tx$$
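To make the operator concrete, here is a minimal numpy sketch (not from the original notes); the matrix P below is arbitrary and only illustrates the shapes involved:

import numpy as np

# Toy example: m = 3 original features, n = 2 transformed features.
# P is an (m x n) matrix; the values are made up, only the shapes matter here.
P = np.array([[0.7, 0.1],
              [0.7, -0.1],
              [0.1, 0.9]])

x = np.array([2.0, 3.0, 5.0])  # one example with m = 3 features

x_new = P.T @ x                # P^T x gives the n = 2 new features
print(x_new.shape)             # (2,)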
Ad Hoc Query Problem: Google Search
- A huge number of words (features)
- Polysemy (a word with multiple meanings)
- "Car" can mean
- An automobile
- The first element in a cons cell in Lisp
- This gives false positives
- Synonymy (multiple words with the same meaning), e.g. "car" and "automobile"
- This gives false negatives
- We can combine words together into composite features that are better indicators (see the sketch after this list)
- We can solve this using feature transformation algorithms such as
- Principal Components Analysis
- Independent Components Analysis
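As a rough, hypothetical illustration of combining words into composite features, the sketch below applies PCA to a tiny made-up term-document count matrix; the documents and counts are invented, but related words such as "car" and "automobile" should load on the same component:

import numpy as np
from sklearn.decomposition import PCA

# Made-up term-document counts; columns are counts of "car", "automobile", "lisp"
X = np.array([
    [3, 2, 0],   # document about vehicles
    [2, 3, 0],   # document about vehicles
    [0, 0, 4],   # document about Lisp
    [1, 0, 3],   # Lisp document that mentions car (the cons-cell sense)
])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Each row of components_ is a weighted combination of words;
# "car" and "automobile" should get weights of the same sign and
# comparable size on the first component
print(pca.components_)
print(X_2d)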
Measureable vs Latent Features
- Measurable features
- Square footage
- Number of rooms
- School ranking
- Neighborhood Safety
- Latent features: the two underlying variables you are really "measuring" through all the measurable features
- Size
- Neighborhood
- How do we condense our features while preserving information?
- We can use Scikit-learn
- SelectKBest (k = no. of features to keep)
- In this scenario we could use this, since we know we want only 2 latent variables
- This throws away all features except the k best
- SelectPercentile
- You could also use this with percentile=50 to keep 2 of the 4 features
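A minimal sketch of both selectors (not from the original notes); the four-feature dataset below is synthetic, standing in for the housing features above:

import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Hypothetical data: 4 measurable features per house
# (square footage, number of rooms, school ranking, neighborhood safety)
# and a target such as price.
rng = np.random.RandomState(42)
X = rng.rand(100, 4)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Keep the 2 best features according to a univariate score
X_k = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)
print(X_k.shape)   # (100, 2)

# Equivalent here: keep the top 50% of the 4 features
X_p = SelectPercentile(score_func=f_regression, percentile=50).fit_transform(X, y)
print(X_p.shape)   # (100, 2)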
Principal Component
- We have many features, but we hypothesize that a smaller number of underlying features actually drives the patterns
- We can try making a composite feature (a principal component) that more directly probes the underlying phenomenon
- Using PCA for dimensionality reduction
- Using PCA for unsupervised learning
- Example
- Measurable Features
- Square Footage
- Number of Rooms
- Latent Feature
- Size
- We will project the points onto the principal component
- The data will then be one-dimensional (see the sketch after this list)
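A minimal sketch of this example, using made-up square footage and room counts that are strongly correlated, so PCA collapses them into a single "size" feature:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical houses: square footage and number of rooms are strongly
# correlated, so one latent "size" direction explains most of the variance.
rng = np.random.RandomState(0)
sqft = rng.uniform(500, 3000, size=50)
rooms = sqft / 400 + rng.normal(scale=0.5, size=50)
X = np.column_stack([sqft, rooms])

pca = PCA(n_components=1)
size = pca.fit_transform(X)           # each house projected onto the PC

print(pca.explained_variance_ratio_)  # close to 1.0 for the first component
# (here mostly because square footage has a much larger numeric scale;
#  standardizing first, as in the Iris example below, is usually wise)
print(size[:5])                       # the one-dimensional "size" feature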
Determining the Principal Component
- The principal component of a dataset is the direction along which the data has the largest variance
- Variance
- The spread of the data distribution
- Projecting onto this direction retains the maximum amount of "information" in the original data
Maximal Variance and Information Loss
- Information loss
- Example (from the course figure)
- The information lost for a single point is the distance from the point to its projection on the principal component (the yellow line in the figure)
- The longer that distance, the more information is lost
- Total information loss is the sum of the distances from all points to their projections onto the principal component
- Projection onto the direction of maximal variance minimizes the distance from each old (higher-dimensional) data point to its new transformed value
- This minimizes information loss (the sum of the red lines in the figure rather than the blue lines); see the sketch after this list
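As a hedged numerical check (not part of the original notes), the sketch below compares the summed point-to-projection distances for the first principal component versus an arbitrary other direction on made-up 2-D data; strictly, PCA minimizes the sum of squared distances, which gives the same qualitative result here:

import numpy as np
from sklearn.decomposition import PCA

# Made-up correlated 2-D data, centered
rng = np.random.RandomState(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.8 * x1 + rng.normal(scale=0.3, size=200)])
X = X - X.mean(axis=0)

def total_projection_distance(X, direction):
    """Sum of distances from each point to its projection onto `direction`."""
    d = direction / np.linalg.norm(direction)
    projections = np.outer(X @ d, d)
    return np.linalg.norm(X - projections, axis=1).sum()

pc1 = PCA(n_components=1).fit(X).components_[0]
other = np.array([1.0, -1.0])   # an arbitrary alternative direction

print(total_projection_distance(X, pc1))    # smaller: less information loss
print(total_projection_distance(X, other))  # larger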
Algorithm 1
PCA as a General Algorithm for Feature Transformation
When to use PCA
- Latent features driving the patterns in the data
- Dimensionality reduction
- Visualize high-dimensional data
- You can easily draw scatterplots with 2-dimensional data
- Reduce noise
- You get rid of noise by throwing away less useful components
- Make other algorithms work better with fewer inputs
- Very high dimensionality might result in overfitting or take a lot of computing power (time)
- A typical example is eigenfaces
- You can use PCA to reduce the dimensionality and then, for example, train an SVM on the reduced features
PCA Review
- Systematized way to transform input features into principal components (PC)
- Use PCs as new features
- PCs are directions in the data that maximize variance (minimize information loss) when you project/compress down onto them
- The more variance of data along a PC, the higher that PC is ranked
- The direction with the most variance (most information) is the first PC
- The direction with the second-most variance is the second PC
- Max number of PCs = number of input features
PCA with Scikit-learn
In [10]:
# Import
from sklearn.decomposition import PCA
from sklearn import datasets
In [23]:
# Create data
iris = datasets.load_iris()
X = iris.data
y = iris.target
Standardizing
- Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement scales of the original features
- Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it was measured on different scales. Although all features in the Iris dataset were measured in centimeters, we will still transform the data onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.
In [16]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
In [19]:
# Instantiate
pca = PCA(n_components=2)
# Fit and Apply dimensionality reduction on X
pca.fit_transform(X_std)
Out[19]:
In [20]:
# Fraction of the total variance explained by each component:
# the first and second components explain a and b percent
# of the variance respectively
pca.explained_variance_ratio_
Out[20]:
In [24]:
# Access components
pc_1 = pca.components_[0]
print(pc_1)
pc_2 = pca.components_[1]
print(pc_2)
PCA for Facial Recognition
- Facial recognition is well suited to PCA because
- Pictures of faces generally have high input dimensionality
- Many pixels
- Faces have general patterns that could be captured in a smaller number of dimensions
- A pair of eyes
- Mouth
- Chin
Scikit-learn: PCA for Facial Recognition
- We will be reducing 1850 input features (pixels) to 150 principal components
In [86]:
from time import time
import logging
import matplotlib.pyplot as pl
# In newer scikit-learn, cross_validation and grid_search live in model_selection,
# and RandomizedPCA is replaced by PCA(svd_solver='randomized')
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.decomposition import PCA
from sklearn.svm import SVC
In [68]:
# Data of famous people's faces
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = faces.data
y = faces.target
target_names = faces.target_names
n_classes = target_names.shape[0]
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
In [78]:
# Introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = faces.images.shape
# For machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = faces.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = faces.target
target_names = faces.target_names
n_classes = target_names.shape[0]
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
In [65]:
# Compute a PCA (eigenfaces) on the face dataset
n_components = 150
print("Extracting the top {} eigenfaces from {} faces".format(n_components, X_train.shape[0]))
pca = PCA(n_components=n_components, whiten=True, svd_solver='randomized').fit(X_train)
# eigenfaces: the principal components
# Take pca.components_ and reshape them into images
# We've gone from 1850 pixel features to 150 components
eigenfaces = pca.components_.reshape((n_components, h, w))
# Transform data into principal components representation
print("Projecting the input data on the eigenfaces orthonormal basis")
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
In [62]:
# Train an SVM classification model
print("Fitting the classifier to the training set")
param_grid = {
    'C': [1e3, 5e3, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# Instantiate model
svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)
# GridSearch
clf = GridSearchCV(svm, param_grid)
clf.fit(X_train_pca, y_train)
print(clf.best_estimator_)
In [76]:
# Quantitative evaluation of the model quality on the test set
print("Predicting the people names on the testing test")
y_pred = clf.predict(X_test_pca)
In [74]:
# Classification report and confusion matrix
print(classification_report(y_test, y_pred, target_names=target_names))
In [92]:
# F1-score average
# The F1 score can be interpreted as a weighted average of the precision and recall
# Where an F1 score reaches its best value at 1 and worst score at 0
# The relative contribution of precision and recall to the F1 score are equal
# The formula for the F1 score is:
# F1 = 2 * (precision * recall) / (precision + recall)
print(f1_score(y_test, y_pred, average='weighted'))
In [79]:
# Confusion Matrix
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
In [82]:
# Qualitative evaluation of the predictions using matplotlib
%matplotlib inline
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())

# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

pl.show()
Variance explained by each principal component
In [85]:
# Variance explained by first component: 0.19346474
# Variance explained by second component: 0.15116931
pca.explained_variance_ratio_
Out[85]:
F1 score variation as we change the number of principal components
- How do we select how many PCs to use?
- Train on different number of PCs
- See how accuracy responds
- Cut off when it becomes apparent that adding more PCs does not buy you much more discrimination
- DO NOT select features before performing PCA
- As you add more PCs
- Each additional PC should give you more signal to improve performance
- But it is also possible that the added complexity results in overfitting
In [100]:
PC = [10, 15, 25, 50, 100, 250]
scores = []
for i in PC:
    # Loop through number of components
    n_components = i
    # Instantiate
    pca = PCA(n_components=n_components, whiten=True, svd_solver='randomized').fit(X_train)
    # Redefine training data
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    # Set param_grid
    param_grid = {
        'C': [1e3, 5e3, 5e4, 1e5],
        'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
    }
    # Instantiate model
    svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)
    # GridSearch
    clf = GridSearchCV(svm, param_grid, n_jobs=-1)
    clf.fit(X_train_pca, y_train)
    # clf.best_estimator_
    # Predict
    y_pred = clf.predict(X_test_pca)
    # Score
    score = f1_score(y_test, y_pred, average='weighted')
    # Append score to list
    scores.append(score)
print(scores)
In [103]:
# Zip the data to compare
list(zip(scores, PC))
# As you can see, the greater the number of PCs, the greater the F1 score
# However, it starts to decrease when you have too many PCs
# You can see the decrease in f1_score from PCs=100 to PCs=250
Out[103]:
Algorithm 2
Independent Components Analysis (ICA)
- PCA
- Minimizing correlation by maximizing variance
- Cares about orthogonality
- Here, orthogonality means the principal components are mutually perpendicular (uncorrelated) directions
- Finding "common characteristics"
- Eigenfaces problem
- ICA
- Finding independence by converting (through a linear transformation) your input features into a new feature space such that
- New features are independent of one another
- Cocktail party problem
- Each microphone records a mixture of all the sound sources
- Once you run an ICA algorithm, you can separate the recordings into 3 independent features, one per original source, instead of each recording being a mix of everything
- Police car
- Foreign language
- News
Property | PCA | ICA |
---|---|---|
Mutually Orthogonal | ✓ | |
Mutually Independent | | ✓ |
Maximal Variance | ✓ | |
Maximal Mutual Information | | ✓ |
Ordered Features | ✓ | |
Bag of Features | ✓ | ✓ |
Blind Source Separation Problem | | ✓ |
Directional | | ✓ |
ICA with Scikit-learn: Blind Source Separation
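No code accompanies this heading in the notes, so here is a hedged sketch of blind source separation with scikit-learn's FastICA; the three synthetic sources and the mixing matrix are made up for illustration:

import numpy as np
from sklearn.decomposition import FastICA

# Three synthetic, independent source signals (made up for illustration)
n_samples = 2000
time = np.linspace(0, 8, n_samples)
s1 = np.sin(2 * time)                                   # sinusoid
s2 = np.sign(np.sin(3 * time))                          # square wave
s3 = np.random.RandomState(0).laplace(size=n_samples)   # noisy signal
S = np.column_stack([s1, s2, s3])

# Mix the sources: each "microphone" hears a different linear combination
A = np.array([[1.0, 1.0, 1.0],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])
X = S @ A.T   # observed recordings, shape (n_samples, 3)

# Recover the independent sources (up to order and scale)
ica = FastICA(n_components=3, random_state=0)
S_estimated = ica.fit_transform(X)
print(S_estimated.shape)   # (2000, 3)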
Other alternatives
- Random Components Analysis (RCA), also known as random projection
- Generates random directions (projections) to project the data onto
- Advantages over PCA and ICA
- Fast
- Simple
- Linear Discriminant Analysis (LDA)
- Finds a projection that discriminates based on the label
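Neither alternative is demonstrated in the notes; as a hedged sketch, scikit-learn exposes random projection through GaussianRandomProjection and LDA through LinearDiscriminantAnalysis, shown here on the Iris data used earlier:

from sklearn.datasets import load_iris
from sklearn.random_projection import GaussianRandomProjection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# RCA: project onto random directions (fast and simple, ignores the labels)
rca = GaussianRandomProjection(n_components=2, random_state=42)
X_rca = rca.fit_transform(X)
print(X_rca.shape)   # (150, 2)

# LDA: find projections that best separate the classes, using the labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)   # (150, 2)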