Convert categorical data into numerical data automatically
One-Hot Encoding in Scikit-learn
Intuition
- You will first encode each categorical column as integers using LabelEncoder()
- You will then apply OneHotEncoder() to the integer-encoded DataFrame produced in step 1
In [2]:
# import
import numpy as np
import pandas as pd
In [4]:
# load dataset
X = pd.read_csv('titanic_data.csv')
X.head(3)
Out[4]:
In [6]:
# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)
Out[6]:
In [49]:
# check original shape
X.shape
Out[49]:
In [8]:
# import preprocessing from sklearn
from sklearn import preprocessing
In [21]:
# view columns using df.columns
X.columns
Out[21]:
In [31]:
# TODO: create a LabelEncoder object and fit it to each feature in X
# 1. INSTANTIATE
# encodes labels with values between 0 and n_classes-1
le = preprocessing.LabelEncoder()
# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()
Out[31]:
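One caveat with `X.apply(le.fit_transform)` above: the single `le` object is re-fitted on every column, so after the loop it only remembers the classes of the last column, and you cannot reliably call `inverse_transform` on the others. A common workaround is to keep one fitted encoder per column. The sketch below uses toy values standing in for the Titanic columns (the data here is hypothetical):

```python
import pandas as pd
from sklearn import preprocessing

# toy frame standing in for the Titanic string columns (hypothetical values)
df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'S']})

# keep one fitted LabelEncoder per column so we can invert the mapping later
encoders = {}
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = preprocessing.LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# round-trip: recover the original strings for one column
recovered = encoders['Sex'].inverse_transform(encoded['Sex'])
```

Classes are assigned integer codes in sorted order ('female' → 0, 'male' → 1), and each column's encoder survives in the `encoders` dict for later decoding.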
OneHotEncoder
- Encode categorical integer features using a one-hot (also known as one-of-K) scheme.
- The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
- The output will be a sparse matrix where each column corresponds to one possible value of one feature.
- It is assumed that input features take on values in the range [0, n_values).
- This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
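The bullet points above can be illustrated on a minimal, self-contained example (toy integers, not the Titanic data): two categorical features, the first taking 2 possible values and the second 3, expand into 2 + 3 = 5 one-hot columns.

```python
import numpy as np
from sklearn import preprocessing

# two integer-coded categorical features:
# the first takes values {0, 1}, the second {0, 1, 2}
X_int = np.array([[0, 0],
                  [1, 2],
                  [0, 1]])

enc = preprocessing.OneHotEncoder()
onehot = enc.fit_transform(X_int).toarray()  # sparse matrix -> dense array
# result has one column per (feature, value) pair: 2 + 3 = 5 columns
```

Each row contains exactly one 1 per original feature, marking which value that feature took.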
In [50]:
# TODO: create a OneHotEncoder object, and fit it to all of X
# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()
# 2. FIT
enc.fit(X_2)
# 3. Transform
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape
# as you can see, the row count is unchanged (891 rows),
# but there are many more columns: one per (feature, value) pair,
# since each categorical value now gets its own binary column
Out[50]:
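Worth knowing: since scikit-learn 0.20, OneHotEncoder accepts string columns directly, so the intermediate LabelEncoder pass is no longer required on recent versions. A sketch on toy values (hypothetical data, not the Titanic frame):

```python
import pandas as pd
from sklearn import preprocessing

# toy string-valued frame (hypothetical values)
df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'S']})

# scikit-learn >= 0.20: OneHotEncoder handles strings directly
enc = preprocessing.OneHotEncoder()
onehot = enc.fit_transform(df).toarray()
# 2 values in 'Sex' + 2 values in 'Embarked' -> 4 one-hot columns
```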
In [47]:
onehotlabels
Out[47]:
In [48]:
type(onehotlabels)
Out[48]:
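Since `enc.transform(...).toarray()` returns a plain NumPy array, the column names are lost. If you want a labeled result, `pd.get_dummies` is a common pandas-only alternative that one-hot encodes object columns and keeps readable column names (again sketched on hypothetical toy values):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'S']})

# one-hot encode the object columns, keeping names like 'Sex_male'
dummies = pd.get_dummies(df)
```

This is convenient for exploration; the scikit-learn encoder remains the better fit inside a modeling pipeline, since it can be fitted on training data and reused on new data.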