Exploring Pandas Series
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
In [1]:
import pandas as pd
In [2]:
url = 'http://bit.ly/imdbratings'
movies = pd.read_csv(url)
In [3]:
movies.head()
Out[3]:
In [4]:
movies.dtypes
Out[4]:
We will focus on two columns:
- genre (object)
- duration (integer)
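The column type matters because `describe()` reports different statistics for text and numeric data. The sketch below illustrates this on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the IMDb data), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the movies data (values are invented)
toy = pd.DataFrame({
    'genre': ['Crime', 'Drama', 'Crime', 'Action'],
    'duration': [142, 175, 200, 152],
})

# Text column: describe() reports count, unique, top and freq
genre_summary = toy.genre.describe()

# Integer column: describe() reports count, mean, std, min, quartiles and max
duration_summary = toy.duration.describe()

print(genre_summary)
print(duration_summary)
```

The same two calls on `movies.genre` and `movies.duration` should produce summaries of the same shape.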
In [5]:
# basic summary
movies.genre.describe()
Out[5]:
In [7]:
# frequency of different genres
movies.genre.value_counts()
Out[7]:
In [8]:
# turn raw counts into proportions (fractions of the total)
movies.genre.value_counts(normalize=True)
Out[8]:
In [9]:
type(movies.genre.value_counts(normalize=True))
Out[9]:
The result is itself a Series, so we can chain any Series method onto it, such as .head()
- Whenever you run a method, think about which other DataFrame or Series methods you could chain onto the result
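As a minimal sketch of this kind of chaining, the example below calls `.head()` directly on the output of `value_counts()`. The `genres` Series is a made-up stand-in for `movies.genre`, used only so the snippet is self-contained.

```python
import pandas as pd

# Hypothetical genre data (invented for illustration)
genres = pd.Series(['Drama', 'Comedy', 'Drama', 'Action', 'Drama', 'Comedy'])

# value_counts() returns a Series indexed by genre, sorted by frequency
shares = genres.value_counts(normalize=True)

# Because shares is a Series, any Series method can follow it
top_two = shares.head(2)

print(type(shares))
print(top_two)
```

With the real data, the equivalent chain would be `movies.genre.value_counts(normalize=True).head()`.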
In [11]:
# finding out unique values
movies.genre.unique()
Out[11]:
In [13]:
# number of unique values
movies.genre.nunique()
Out[13]:
In [15]:
# crosstab is useful for exploring the data further
pd.crosstab(movies.genre, movies.content_rating)
Out[15]:
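To make clearer what `pd.crosstab` computes, here is a small sketch on invented data (the `toy` DataFrame is an assumption, not part of the IMDb file). Each cell counts the rows that share a given genre and content rating, and combinations that never occur are filled with 0.

```python
import pandas as pd

# Hypothetical rows (invented for illustration)
toy = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Drama', 'Drama', 'Drama'],
    'content_rating': ['R', 'R', 'PG-13', 'R', 'PG-13'],
})

# Rows are genres, columns are content ratings, cells are row counts
table = pd.crosstab(toy.genre, toy.content_rating)

print(table)
```

The cell counts always add up to the number of rows in the input, which is a quick sanity check on the table.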
In [16]:
movies.duration.describe()
Out[16]:
In [17]:
movies.duration.mean()
Out[17]:
In [18]:
movies.duration.max()
Out[18]:
In [19]:
movies.duration.min()
Out[19]:
In [20]:
movies.duration.value_counts()
Out[20]:
Visualization
In [21]:
%matplotlib inline
In [22]:
data = movies.duration
In [30]:
data
Out[30]:
In [25]:
data.plot(kind='hist')
Out[25]:
In [26]:
data_counts = movies.genre.value_counts()
In [29]:
data_counts
Out[29]:
In [28]:
data_counts.plot(kind='bar')
Out[28]: