Making DataFrame smaller and faster in pandas
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
Making pandas DataFrame smaller and faster
In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)
In [3]:
drinks.head()
Out[3]:
In [4]:
drinks.info()
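- An object dtype usually means the column holds strings
- Memory usage
- The DataFrame takes at least 9.1 KB of memory
- The true figure may be much higher, because by default info() counts only the references stored in object columns, not the objects they point to
- In this case, the object columns just hold the names of countries and continents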
In [6]:
# we can count the actual memory usage using the following command
drinks.info(memory_usage='deep')
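To see why the default figure is only a lower bound, here is a minimal sketch on a tiny made-up DataFrame (the `sample` data below is a hypothetical stand-in, not the drinks dataset). It compares the shallow and deep memory usage of an object column against a numeric one.

```python
import pandas as pd

# Hypothetical stand-in data: one object (string) column and one integer column
sample = pd.DataFrame({
    'country': pd.Series(['Afghanistan', 'Albania', 'Algeria', 'Andorra'], dtype='object'),
    'beer_servings': [0, 89, 25, 245],
})

# Shallow: object columns are counted as references only
shallow = sample.memory_usage()
# Deep: also counts the Python string objects the references point to
deep = sample.memory_usage(deep=True)

print(shallow)
print(deep)
```

The numeric column should report the same size both ways, while the object column should grow once the strings themselves are counted.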
In [10]:
# we can check how much space each column is actually taking
# the numbers are in bytes, not kilobytes
drinks.memory_usage(deep=True)
Out[10]:
In [11]:
type(drinks.memory_usage(deep=True))
Out[11]:
In [13]:
# since it is a series, we can use .sum()
drinks.memory_usage(deep=True).sum()
Out[13]:
In [21]:
# there are only 6 unique values of continent
# we can replace the strings with integers to save space
sorted(drinks.continent.unique())
Out[21]:
In [20]:
drinks.continent.head()
Out[20]:
In [24]:
# converting continent from object to category
# it stores the strings as integers
drinks['continent'] = drinks.continent.astype('category')
In [23]:
drinks.dtypes
Out[23]:
In [26]:
drinks.continent.head()
Out[26]:
In [30]:
# .cat is an accessor, similar to .str
# it exposes attributes and methods specific to category columns
# here we can see how pandas represents the continents as integers
drinks.continent.cat.codes.head()
Out[30]:
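As a rough illustration of what the category dtype stores, the sketch below uses a small made-up Series (not the drinks data) to show the two pieces: the lookup table of labels and the integer codes. It then rebuilds the original strings from them.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration
continents = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Asia'], dtype='category')

categories = continents.cat.categories  # lookup table of the unique labels
codes = continents.cat.codes            # one small integer per row, pointing into the table

print(list(categories))
print(list(codes))

# Rebuild the original labels by looking up each code in the table
rebuilt = categories.take(codes.to_numpy())
print(list(rebuilt))
```

Each row costs only one small integer, and each distinct string is stored once in the lookup table.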
In [32]:
# before the conversion, the continent column took over 12332 bytes
# now it takes 584 bytes
drinks.memory_usage(deep=True)
Out[32]:
In [34]:
# we can convert country to a category too
drinks.dtypes
Out[34]:
In [35]:
drinks['country'] = drinks.country.astype('category')
In [39]:
# the country column now takes more memory than before!
# this is because it has too many categories
drinks.memory_usage(deep=True)
Out[39]:
In [37]:
# country is now stored as 193 integer codes
# which point to a lookup table with 193 strings!
drinks.country.cat.categories
Out[37]:
Converting to category saves memory only when the column has few distinct values relative to its number of rows. If there are too many categories, we should not convert.
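One way to check this before converting is to measure both representations. The sketch below is only an illustration: `category_savings` is a hypothetical helper, and the `few` and `many` Series are made-up data, not columns from the drinks dataset.

```python
import pandas as pd

def category_savings(series):
    """Return (bytes as object, bytes as category) for a column, excluding the index."""
    as_object = series.astype('object').memory_usage(index=False, deep=True)
    as_category = series.astype('category').memory_usage(index=False, deep=True)
    return as_object, as_category

# Made-up data: 3 distinct values repeated many times vs. every value unique
few = pd.Series(['Asia', 'Europe', 'Africa'] * 1000, dtype='object')
many = pd.Series([f'country_{i}' for i in range(3000)], dtype='object')

few_obj, few_cat = category_savings(few)
many_obj, many_cat = category_savings(many)

print('few distinct values :', few_obj, '->', few_cat)
print('all values distinct :', many_obj, '->', many_cat)
```

The column with only three distinct values should shrink dramatically. The all-unique column gains little or nothing, because the lookup table has to hold every string anyway and the integer codes are stored on top of it.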
In [46]:
# create a small DataFrame by passing a dictionary to the DataFrame constructor
id_list = [100, 101, 102, 103]
quality_list = ['good', 'very good', 'good', 'excellent']
df = pd.DataFrame({'ID': id_list, 'quality': quality_list})
df
Out[46]:
In [52]:
# this sorts in alphabetical order
# but these categories have a logical order, which we need to tell pandas about
df.sort_values('quality')
Out[52]:
In [49]:
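# how do we tell pandas there is a logical order?
# recent pandas versions no longer accept categories= and ordered= in astype(), so we define an ordered dtype first
from pandas.api.types import CategoricalDtype
quality_list_ordered = ['good', 'very good', 'excellent']
quality_type = CategoricalDtype(categories=quality_list_ordered, ordered=True)
df['quality'] = df.quality.astype(quality_type)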
In [53]:
# here we have good < very good < excellent
df.quality
Out[53]:
In [56]:
# now it sorts using the logical order we defined
df.sort_values('quality')
Out[56]:
In [58]:
# we can now use boolean conditions with this
# here we want all rows (with all columns) where quality is better than 'good'
df.loc[df.quality > 'good', :]
Out[58]:
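To make the ordered comparison concrete, here is a small self-contained sketch. It builds the ordering with `CategoricalDtype` from `pandas.api.types`; the `ratings` Series is made up for illustration and mirrors the quality column above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtype: good < very good < excellent
quality_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
ratings = pd.Series(['good', 'very good', 'good', 'excellent']).astype(quality_type)

# Comparisons follow the declared order, not alphabetical order
better_than_good = ratings[ratings > 'good']
print(better_than_good.tolist())

# min and max also respect the declared order
print(ratings.min(), ratings.max())
```

Without `ordered=True`, a comparison such as `ratings > 'good'` would raise an error, since pandas has no way to rank unordered categories.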