This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Pandas Index¶

import pandas as pd

url = 'http://bit.ly/drinksbycountry'
drinks = pd.read_csv(url)

drinks.head()

drinks.index

RangeIndex(start=0, stop=193, step=1)

The index is from 0 to 193 (0, 1, 2, 3, 4... 193)

drinks.columns

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

# index is not part of the DataFrame
drinks.shape

(193, 6)

# rarely people leave columns without headers
url2 = 'http://bit.ly/movieusers'
pd.read_table(url2, header=None, sep='|').head()

What are indexes for?

Identification
Selection
Alignment

1. Identification

# you can identify what rows we are working with here
drinks[drinks.continent=='South America']

2. Selection

# .loc method to retrieve element/cell
drinks.loc[23, 'beer_servings']

245

# inplace=True makes the change 
# sets the index to 'country'
drinks.set_index('country', inplace=True)
drinks.head()

drinks.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

# country is no longer one of the columns
drinks.columns

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

# we can select based on country instead of a number 
# we can select more easily by setting a meaningful index
drinks.loc['Brazil', 'beer_servings']

245

'country' is the name of the index
We can clear this out

# clearing index name
drinks.index.name = None
drinks.head()

# say you prefer to use the default index and you want back the column of countries
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()

drinks.describe()

type(drinks.describe())
# you can see this is a DataFrame so we can interact with it accordingly

pandas.core.frame.DataFrame

drinks.describe().index

Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

drinks.describe().columns

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol'],
      dtype='object')

# .loc is a DataFrame method
# format of .loc
# .loc['index_name_or_number', 'column_name]
drinks.describe().loc['25%', 'beer_servings']

20.0

3. Alignment

drinks.head()

drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

drinks.set_index('country', inplace=True)

drinks.head()

drinks.continent.head()

country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

type(drinks.continent.head())

pandas.core.series.Series

drinks.continent.value_counts()

Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

type(drinks.continent.value_counts())

pandas.core.series.Series

drinks.continent.value_counts().values

array([53, 45, 44, 23, 16, 12])

# we can use the index to select values from the series
# this is similar to .loc for DataFrame
# because series does not have multiple columns, we can do this
drinks.continent.value_counts()['Africa']

53

# sort based on values in the Series
drinks.continent.value_counts().sort_values()

South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64

# sort index based on ascending order
drinks.continent.value_counts().sort_index()

Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

# creating a a pandas series
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people

Albania    3000000
Andorra      85000
Name: population, dtype: int64

drinks.beer_servings.head()

country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64

# you can do math based on shared index
drinks.beer_servings * people

Afghanistan                     NaN
Albania                 267000000.0
Algeria                         NaN
Andorra                  20825000.0
Angola                          NaN
Antigua & Barbuda               NaN
Argentina                       NaN
Armenia                         NaN
Australia                       NaN
Austria                         NaN
Azerbaijan                      NaN
Bahamas                         NaN
Bahrain                         NaN
Bangladesh                      NaN
Barbados                        NaN
Belarus                         NaN
Belgium                         NaN
Belize                          NaN
Benin                           NaN
Bhutan                          NaN
Bolivia                         NaN
Bosnia-Herzegovina              NaN
Botswana                        NaN
Brazil                          NaN
Brunei                          NaN
Bulgaria                        NaN
Burkina Faso                    NaN
Burundi                         NaN
Cabo Verde                      NaN
Cambodia                        NaN
                           ...     
Sudan                           NaN
Suriname                        NaN
Swaziland                       NaN
Sweden                          NaN
Switzerland                     NaN
Syria                           NaN
Tajikistan                      NaN
Tanzania                        NaN
Thailand                        NaN
Timor-Leste                     NaN
Togo                            NaN
Tonga                           NaN
Trinidad & Tobago               NaN
Tunisia                         NaN
Turkey                          NaN
Turkmenistan                    NaN
Tuvalu                          NaN
USA                             NaN
Uganda                          NaN
Ukraine                         NaN
United Arab Emirates            NaN
United Kingdom                  NaN
Uruguay                         NaN
Uzbekistan                      NaN
Vanuatu                         NaN
Venezuela                       NaN
Vietnam                         NaN
Yemen                           NaN
Zambia                          NaN
Zimbabwe                        NaN
dtype: float64

# axis=1, column concatenation
# beauty of automatic alignment using index
pd.concat([drinks, people], axis=1).head()

	0	1	2	3	4
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
6	Argentina	193	25	221	8.3	South America
20	Bolivia	167	41	8	3.8	South America
23	Brazil	245	145	16	7.2	South America
35	Chile	130	124	172	7.6	South America
37	Colombia	159	76	3	4.2	South America
52	Ecuador	162	74	3	4.2	South America
72	Guyana	93	302	1	7.1	South America
132	Paraguay	213	117	74	7.3	South America
133	Peru	163	160	21	6.1	South America
163	Suriname	128	178	7	5.6	South America
185	Uruguay	115	35	220	6.6	South America
188	Venezuela	333	100	3	7.7	South America

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
count	193.000000	193.000000	193.000000	193.000000
mean	106.160622	80.994819	49.450777	4.717098
std	101.143103	88.284312	79.697598	3.773298
min	0.000000	0.000000	0.000000	0.000000
25%	20.000000	4.000000	1.000000	1.300000
50%	76.000000	56.000000	8.000000	4.200000
75%	188.000000	128.000000	59.000000	7.200000
max	376.000000	438.000000	370.000000	14.400000

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

Using Pandas Index

Pandas Index¶