Using Pandas index
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
Pandas Index
In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)
In [3]:
drinks.head()
Out[3]:
In [4]:
drinks.index
Out[4]:
The default index is a RangeIndex that runs from 0 to 192 (0, 1, 2, 3, 4 ... 192). Its stop value, 193, is exclusive and equals the number of rows.
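To see the default index without downloading the file, here is a minimal sketch on a tiny hand-made DataFrame. The name `toy` and its values are made up for illustration and are not taken from the drinks dataset.

```python
import pandas as pd

# hypothetical three-row frame; the values are illustrative only
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'beer_servings': [10, 20, 30]})

# with no index specified, pandas builds a RangeIndex (stop is exclusive)
print(toy.index)
print(list(toy.index))  # [0, 1, 2]
```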
In [5]:
drinks.columns
Out[5]:
In [6]:
# shape is (rows, columns); the index is not counted as a column
drinks.shape
Out[6]:
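The point that the index is not counted in the shape can be checked on a small example. This is a sketch with made-up data (`toy` and `indexed` are hypothetical names): moving a column into the index reduces the column count by one.

```python
import pandas as pd

# hypothetical two-row frame; the values are illustrative only
toy = pd.DataFrame({'country': ['A', 'B'], 'wine_servings': [5, 7]})
print(toy.shape)      # (2, 2)

# once 'country' becomes the index it no longer counts as a column
indexed = toy.set_index('country')
print(indexed.shape)  # (2, 1)
```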
In [7]:
# occasionally a file has no header row; with header=None pandas
# names the columns with integers (0, 1, 2, ...) instead
url2 = 'http://bit.ly/movieusers'
pd.read_table(url2, header=None, sep='|').head()
Out[7]:
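A small offline sketch of the same idea, assuming a pipe-delimited file with no header row. The two rows of text and the column names in `cols` are invented for illustration. `pd.read_csv` with `sep='|'` is used here, which behaves like `pd.read_table` with the same separator.

```python
import io
import pandas as pd

# hypothetical pipe-delimited content with no header row
raw = io.StringIO("1|24|M|technician\n2|53|F|other\n")

# header=None: columns get default integer names
no_names = pd.read_csv(raw, sep='|', header=None)
print(list(no_names.columns))  # [0, 1, 2, 3]

# names= supplies column labels (these labels are assumptions)
raw.seek(0)
cols = ['user_id', 'age', 'gender', 'occupation']
named = pd.read_csv(raw, sep='|', header=None, names=cols)
print(list(named.columns))
```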
What are indexes for?
- Identification
- Selection
- Alignment
1. Identification
In [8]:
# after filtering, each row keeps its original index label,
# so we can still identify which rows of the full DataFrame these are
drinks[drinks.continent=='South America']
Out[8]:
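The identification role can be illustrated on a toy frame. This is a sketch with invented data: the filtered result keeps the original labels rather than being renumbered from 0.

```python
import pandas as pd

# hypothetical frame; the values are illustrative only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'continent': ['Asia', 'Europe', 'Asia', 'Africa'],
})

# boolean filtering keeps the original index labels
asia = toy[toy.continent == 'Asia']
print(list(asia.index))  # [0, 2]
```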
2. Selection
In [9]:
# .loc is a label-based indexer: .loc[row_label, column_label] retrieves a single cell
drinks.loc[23, 'beer_servings']
Out[9]:
In [10]:
# set the 'country' column as the index
# inplace=True modifies drinks directly instead of returning a new DataFrame
drinks.set_index('country', inplace=True)
drinks.head()
Out[10]:
In [11]:
drinks.index
Out[11]:
In [12]:
# country is no longer one of the columns
drinks.columns
Out[12]:
In [13]:
# with a meaningful index we can select by country name
# instead of by row number
drinks.loc['Brazil', 'beer_servings']
Out[13]:
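As an alternative to `inplace=True`, `set_index` can return a new DataFrame and leave the original untouched. The sketch below uses made-up data and hypothetical names (`toy`, `by_country`) to show label-based selection after the index is set.

```python
import pandas as pd

# hypothetical frame; the values are illustrative only
toy = pd.DataFrame({'country': ['A', 'B'], 'beer_servings': [10, 20]})

# without inplace=True, set_index returns a new frame
by_country = toy.set_index('country')

print(by_country.loc['B', 'beer_servings'])  # 20
print('country' in by_country.columns)       # False
print('country' in toy.columns)              # True, original is unchanged
```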
'country' is the name of the index, not a row of data.
The name is optional, so we can clear it.
In [14]:
# clearing index name
drinks.index.name = None
drinks.head()
Out[14]:
In [15]:
# to return to the default integer index and get the 'country' column back,
# restore the index name first, then reset the index
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
Out[15]:
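The reason for restoring the index name before resetting can be seen in a short sketch (invented data, hypothetical names): `reset_index` uses the index name as the new column name, and falls back to `'index'` when the name is missing.

```python
import pandas as pd

# hypothetical frame indexed by a named index
by_country = pd.DataFrame({'beer_servings': [10, 20]},
                          index=pd.Index(['A', 'B'], name='country'))

# the index name becomes the column name
restored = by_country.reset_index()
print(list(restored.columns))  # ['country', 'beer_servings']

# with no index name, the new column is called 'index'
unnamed = by_country.rename_axis(None)
print(list(unnamed.reset_index().columns))  # ['index', 'beer_servings']
```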
In [16]:
drinks.describe()
Out[16]:
In [21]:
# describe() returns a DataFrame, so we can work with it like any other
type(drinks.describe())
Out[21]:
In [18]:
drinks.describe().index
Out[18]:
In [19]:
drinks.describe().columns
Out[19]:
In [23]:
# .loc is a label-based indexer available on every DataFrame
# format of .loc:
# .loc['index_label', 'column_name']
drinks.describe().loc['25%', 'beer_servings']
Out[23]:
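A self-contained sketch of the same lookup, assuming a toy column of five made-up values: the summary returned by `describe()` is indexed by statistic name, so `.loc` can pick out one statistic.

```python
import pandas as pd

# hypothetical column; the values are illustrative only
toy = pd.DataFrame({'beer_servings': [0, 10, 20, 30, 40]})

summary = toy.describe()
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# select one statistic by its index label
print(summary.loc['25%', 'beer_servings'])  # 10.0
```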
3. Alignment
In [24]:
drinks.head()
Out[24]:
In [26]:
drinks.continent.head()
Out[26]:
In [27]:
drinks.set_index('country', inplace=True)
In [28]:
drinks.head()
Out[28]:
In [29]:
drinks.continent.head()
Out[29]:
In [30]:
type(drinks.continent.head())
Out[30]:
In [31]:
drinks.continent.value_counts()
Out[31]:
In [32]:
type(drinks.continent.value_counts())
Out[32]:
In [33]:
drinks.continent.value_counts().values
Out[33]:
In [35]:
# we can use an index label to select a value from the Series
# this is similar to .loc on a DataFrame, but a Series has no columns,
# so a single label inside the brackets is enough
drinks.continent.value_counts()['Africa']
Out[35]:
In [37]:
# sort based on values in the Series
drinks.continent.value_counts().sort_values()
Out[37]:
In [38]:
# sort by the index labels in ascending order
drinks.continent.value_counts().sort_index()
Out[38]:
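The difference between selecting by label, sorting by value and sorting by index can be sketched on a small made-up Series (the data and the names `continents` and `counts` are assumptions for illustration).

```python
import pandas as pd

# hypothetical continent column; the values are illustrative only
continents = pd.Series(['Africa', 'Europe', 'Africa',
                        'Asia', 'Africa', 'Europe'])

# value_counts returns a Series whose index holds the unique values
counts = continents.value_counts()
print(counts['Africa'])  # 3

# sort_values orders by the counts, sort_index by the labels
print(list(counts.sort_values().index))  # ['Asia', 'Europe', 'Africa']
print(list(counts.sort_index().index))   # ['Africa', 'Asia', 'Europe']
```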
In [50]:
# creating a pandas Series
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people
Out[50]:
In [51]:
drinks.beer_servings.head()
Out[51]:
In [53]:
# arithmetic between two Series is aligned on their index labels;
# countries that are missing from people produce NaN
drinks.beer_servings * people
Out[53]:
In [55]:
# axis=1 concatenates column-wise (side by side)
# rows are aligned automatically on the index
pd.concat([drinks, people], axis=1).head()
Out[55]:
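The same column-wise concatenation can be tried on a small hand-made frame. This is a sketch under the assumption of a four-country `drinks_toy` frame with illustrative values; rows with no matching label in `people` get NaN in the new column.

```python
import pandas as pd

# hypothetical stand-in for the country-indexed drinks frame
drinks_toy = pd.DataFrame(
    {'beer_servings': [0, 89, 25, 245]},
    index=['Afghanistan', 'Albania', 'Algeria', 'Andorra'])
people = pd.Series([3000000, 85000],
                   index=['Albania', 'Andorra'],
                   name='population')

# the Series name becomes the new column name
combined = pd.concat([drinks_toy, people], axis=1)
print(combined)
```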