Filtering rows of a pandas DataFrame by column value
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
Filtering rows of a pandas DataFrame by column value¶
In [1]:
import pandas as pd
In [2]:
# url
url = 'http://bit.ly/imdbratings'
# create DataFrame called movies
movies = pd.read_csv(url)
In [3]:
movies.head()
Out[3]:
In [4]:
movies.shape
Out[4]:
In [6]:
# booleans
type(True)
type(False)
Out[6]:
We want to create a list of booleans with the same number of rows as the movies' DataFrame
- True if duration > 200
- False if otherwise
In [7]:
# create list
booleans = []
# loop
for length in movies.duration:
if length >= 200:
booleans.append(True)
else:
booleans.append(False)
In [8]:
booleans[0:5]
Out[8]:
In [9]:
# len(booleans) is the same as the number of rows in movies' DataFrame
len(booleans)
Out[9]:
In [11]:
# convert booleans into a Pandas series
is_long = pd.Series(booleans)
In [12]:
is_long.head()
Out[12]:
In [16]:
# pulls out genre
movies['genre']
Out[16]:
In [17]:
# this pulls out duration >= 200mins
movies[is_long]
Out[17]:
Faster method without a for loop
In [20]:
# this line of code replaces the for loop
# when you use a series name using pandas and use a comparison operator, it will loop through each row
is_long = movies.duration >= 200
is_long.head()
Out[20]:
In [21]:
movies[is_long]
Out[21]:
Even better way to simplify movies[is_long]
In [25]:
movies[movies.duration >= 200]
Out[25]:
Additional tip: we want to study the duration and only the genre instead of all the columns
In [28]:
# this is a DataFrame, we use dot or bracket notation to get what we want
movies[movies.duration >= 200]['genre']
movies[movies.duration >= 200].genre
Out[28]:
In [29]:
# best practice is to use .loc instead of what we did above by selecting columns
movies.loc[movies.duration >= 200, 'genre']
Out[29]: