Feature engineering and scaling with scikit-learn.
Feature Engineering: Scaling and Selection
Feature Scaling
- Formula (min-max rescaling to the range [0, 1])
- $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$
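As a quick worked check of the formula (using the sample weights that appear in the code below), the middle value 140 rescales to:
$$X' = \frac{140 - 115}{175 - 115} = \frac{25}{60} \approx 0.417$$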
Algorithms affected by feature rescaling
- Algorithms in which two or more dimensions jointly affect the outcome (typically through a distance calculation) are affected by rescaling; a sketch follows this list
  - SVM with RBF kernel
    - When you maximize the margin, two or more dimensions trade off against each other
    - Think of x and y axes measured in different units: you need to calculate a distance that mixes both, so the larger-ranged axis dominates
  - K-means clustering
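A minimal sketch of the effect (the two-feature weight/height data below is made up for illustration, not taken from these notes): with unscaled features the larger-ranged axis dominates the Euclidean distances that k-means uses, while min-max scaling puts both axes on an equal footing.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# weight (lbs) and height (ft): very different numeric ranges
X = np.array([[115.0, 5.2],
              [140.0, 5.9],
              [175.0, 6.1],
              [120.0, 6.0]])

# unscaled: the weight column dominates every distance computation
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# min-max scaled: both columns now contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print(raw_labels)     # cluster assignments may change once features share a scale
print(scaled_labels)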
Feature Scaling Manually in Python
In [21]:
### FYI, the most straightforward implementation might
### throw a divide-by-zero error if the min and max values are the same --
### but think about this for a second: that means that every
### data point has the same value for that feature!
### Why would you rescale it? Or even use it at all?
def featureScaling(arr):
    max_num = max(arr)
    min_num = min(arr)
    lst = []
    for num in arr:
        X_prime = (num - min_num) / (max_num - min_num)
        lst.append(X_prime)
    return lst
# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))
Feature Scaling in Scikit-learn
In [23]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
In [28]:
# 3 different training points for 1 feature
weights = np.array([[115], [140], [175]]).astype(float)
In [29]:
# Instantiate
scaler = MinMaxScaler()
In [33]:
# Rescale
rescaled_weights = scaler.fit_transform(weights)
rescaled_weights
Out[33]:
array([[0.        ],
       [0.41666667],
       [1.        ]])
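A short usage note (the 150-lb value below is just an illustrative new data point): the fitted scaler remembers the training min and max, so new data should be passed through transform rather than re-fitting the scaler.
# map a new weight onto the same [0, 1] range learned from the training data
scaler.transform(np.array([[150.0]]))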
Feature Selection
- Why do we want to select features?
  - Knowledge discovery
    - Interpretability
    - Insight
  - Curse of dimensionality (the amount of data you need grows exponentially with the number of features)
Feature Selection: Algorithms
- How hard is the problem?
  - Exponential
    - Choosing a new set of m features out of the original n features (where m <= n) gives $${n \choose m}$$ possibilities for a fixed m
    - Searching over all possible subsets means $$2^n$$ candidates in total
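The two expressions describe the same search space: summing the number of possible subsets of each size m over all m recovers the total number of subsets of n features,
$$\sum_{m=0}^{n} {n \choose m} = 2^n$$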
Filtering vs Wrapping
- Disadvantages of filtering
  - No feedback: the learning algorithm cannot report back on the impact of the chosen features
  - The search criteria are built in, with no reference to the learner
  - Ignores the learning problem: features are looked at in isolation
- Advantages of filtering
  - Fast
- Disadvantages of wrapping
  - Slow
- Advantages of wrapping
  - Feedback: the selection criteria are built into the learner itself
  - Takes the model's bias and the learning problem into account (both approaches are sketched in code after this list)
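A minimal scikit-learn sketch contrasting the two approaches (the dataset, the five-feature budget, and the choice of SelectKBest/RFE are my own illustrative assumptions, not from the notes): the filter scores each feature on its own, while the wrapper repeatedly asks an actual learner which features to keep.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filtering: univariate ANOVA F-scores, computed with no reference to any learner
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapping: recursive feature elimination drops features based on the learner's own importances
wrapper_selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5).fit(X, y)

print("filter keeps:", filter_selector.get_support(indices=True))
print("wrapper keeps:", wrapper_selector.get_support(indices=True))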
Filtering: Search Part
- We can use a decision tree for the search part: let it choose the features, then feed only those features to a learner that does not do well at filtering features on its own (sketched in code after this list)
  - This works because decision trees are good at filtering out the best features through their splitting criterion
  - If you keep all of the features instead, you can easily overfit
- Other generic filtering criteria
  - Information gain
  - Entropy
  - "Useful" features
  - Independent, non-redundant features
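A sketch of the tree-then-learner idea above (the breast-cancer dataset, the top-5 cutoff, and k-nearest neighbors as the downstream learner are my own illustrative choices): the decision tree does the filtering via its entropy-based splits, and the second learner only ever sees the selected columns.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# filter step: an entropy-based tree ranks the features by importance
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
top = np.argsort(tree.feature_importances_)[::-1][:5]  # indices of the 5 most important features

# learner step: KNN is trained on the selected columns only
knn = KNeighborsClassifier().fit(X_train[:, top], y_train)
print("accuracy on selected features:", knn.score(X_test[:, top], y_test))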
Wrapping: Search Part (has to deal with the exponential search space)
- Hill climbing
- Randomized optimization
- Forward selection (sketched in code after this list)
  - Start with your n candidate features: pass them individually to the learning algorithm, get their scores, and keep the best single feature
  - Then score the 2-feature sets and keep the pair with the highest score
  - Then score the 3-feature sets and keep the highest score
  - If the best 3-feature score is lower than the 2-feature score, stop and keep the 2 features
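A sketch of forward selection, assuming a scikit-learn version (0.24 or later) that provides SequentialFeatureSelector; the dataset, estimator, and three-feature budget are my own illustrative choices. Note that this tool adds features up to a fixed count scored by cross-validation, whereas the stopping rule in the notes (stop when the score drops) corresponds to picking the count at which the cross-validated score stops improving.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# start from zero features and add the single best-scoring feature each round
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=3,
                                direction="forward").fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))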
Relevance: Measures Effect on BOC
- A feature x_i is strongly relevant if removing it degrades the Bayes Optimal Classifier (BOC)
  - The BOC is the best you can do on average, if you could find it
- A feature x_i is weakly relevant if
  - It is not strongly relevant, and
  - There exists some subset of features S such that adding x_i to S improves the BOC
- Otherwise, x_i is irrelevant
Usefulness: Measures Effect on Particular Predictor
- A feature is useful to the extent that it minimizes error given a particular model/learner (relevance measures information content; usefulness measures the effect on error for a specific predictor)