Handling missing values in pandas
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
Handling missing values in pandas¶
In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/uforeports'
ufo = pd.read_csv(url)
In [4]:
ufo.tail()
Out[4]:
NaN shows missing values
In [5]:
# checking which are NaN (True) using isnull
# creates DataFrame of True's and False's
ufo.isnull().tail()
Out[5]:
In [6]:
ufo.notnull().tail()
Out[6]:
In [14]:
# count number of missing values in each column
# sum True's
ufo.isnull().sum()
Out[14]:
In [11]:
# we create a pandas series of booleans
booleans = pd.Series([True, False, True])
In [13]:
# use sum() on series
# this would sum all True
booleans.sum()
# sum() uses axis=0 by default
# the following code does the same thing
# booleans.sum(axis=0)
Out[13]:
In [17]:
# this allows us to see the 25 rows of missing values in the column City
ufo[ufo.City.isnull()]
Out[17]:
What do we do about the missing values?
Method 1: drop missing values
In [18]:
ufo.shape
Out[18]:
In [19]:
# drop rows if any of the 5 columns have a missing value
# how='any' is the default, you need not include this
ufo.dropna(how='any').shape
Out[19]:
In [20]:
# no changes are made
# you can change using inplace='true'
# ufo.dropna(how='any', inplace=True).shape
In [21]:
# drop row if all of the columns are missing
ufo.dropna(how='all').shape
Out[21]:
In [23]:
# drop row if either City or Shape Reported are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
Out[23]:
In [25]:
# drop row if both City and Shape Reported are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
Out[25]:
Method 2: Filling missing values
In [28]:
# this shows missing values (NaN)
ufo['Shape Reported'].value_counts(dropna=False)
Out[28]:
In [29]:
# inplace=True makes the change to the data
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
In [32]:
ufo['Shape Reported'].value_counts(dropna=False)
Out[32]: