Calculate variance, interquartile range, and other measures of variability
Variability of Data
Theory
- Visualizations of data
- Histogram
- Boxplots
- Range = max - min
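- The range can change whenever new data is added to the dataset
- It depends only on the two extreme values, so a single outlier can change it drastically
- To reduce this sensitivity, statisticians typically cut off the top and bottom 25% of the data
- The spread of the remaining middle 50% is called the interquartile range (IQR) = Q3 - Q1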
- Quartiles
- Sort the data and split it into two halves
- Median of the whole dataset = Q2
- Median of the lower half = Q1
- Median of the upper half = Q3
- IQR = Q3 - Q1
- About 50% of data falls within the IQR
- IQR is not affected by every value in the dataset
- IQR is not affected by outliers
- Outliers
- A common rule of thumb flags a value x as an outlier if either condition holds
- x < Q1 - 1.5*IQR
- x > Q3 + 1.5*IQR
- Deviation from mean = x_i - x_mean
- Mean absolute deviation = sum(|x_i - x_mean|) / n
- n is the number of examples
- Squared deviation = (x_i - x_mean)^2
- Mean squared deviation = variance = sum((x_i - x_mean)^2) / n
- Sum of squares (SS) = sum((x_i - x_mean)^2)
- Standard deviation (SD) = variance^0.5
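- For approximately normally distributed data, about 68% of values fall within 1 SD of the mean
- For approximately normally distributed data, about 95% of values fall within 2 SD of the mean
- For approximately normally distributed data, about 99.7% of values fall within 3 SD of the mean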
- Bessel's Correction
- In general, samples underestimate the variability of a population
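- This is because a sample tends to contain values from near the center of the population and rarely includes its extremes
- We can correct for this using Bessel's Correction
- We divide by n - 1 instead of n (n - 1 degrees of freedom; ddof = 1 in pandas)
- This makes the estimated standard deviation slightly larger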
- In summary
- If we are trying to estimate the standard deviation of the population, we divide by n - 1
- If we are actually measuring the standard deviation of the population, we divide by n
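The quartile, IQR, and outlier rules from the list above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper name `quartiles` and the small dataset are made up for the example. For an odd number of values the median is left out of both halves, which is only one of several common conventions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # skips the median itself when n is odd
    return median(lower), median(data), median(upper)

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]   # hypothetical example data
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

# Fences from the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)   # 3.5 7 11.0
print(iqr)          # 7.5
print(outliers)     # [40]
```

Only the value 40 lies outside the fences (-7.75 and 22.25), so it is the only point flagged as an outlier.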
Calculating variability of data using pandas
In [1]:
import pandas as pd
In [31]:
lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)
In [32]:
type(sample)
Out[32]:
In [33]:
sample
Out[33]:
In [34]:
sample.mean()
Out[34]:
In [35]:
sample.median()
Out[35]:
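The quartiles and IQR from the theory section can also be computed with `Series.quantile`. The sketch below uses the same numbers as `sample`. Note that pandas interpolates linearly between data points by default, so its Q1 and Q3 can differ from the values obtained by taking the median of each half by hand.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: flag values that fall outside the fences
is_outlier = (sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)
outliers = sample[is_outlier]

print(q1, q3, iqr)
print(outliers.tolist())
```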
In [47]:
# standard deviation
# pandas default: ddof = 1
# divides by n - 1
sample.std()
Out[47]:
In [48]:
# standard deviation
# ddof = 0
# divides by n
sample.std(ddof=0)
Out[48]:
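To see why dividing by n - 1 matters, a small simulation can help. The sketch below is an illustration under assumptions: the population is synthetic (normally distributed with mean 50 and SD 10), and the sample size and number of trials are arbitrary choices. It uses only the standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_var = statistics.pvariance(population)    # divides by N

n = 5            # size of each sample
trials = 5_000   # number of samples drawn
biased, corrected = [], []
for _ in range(trials):
    s = rng.choices(population, k=n)           # sample with replacement
    biased.append(statistics.pvariance(s))     # divides by n
    corrected.append(statistics.variance(s))   # divides by n - 1

avg_biased = statistics.mean(biased)
avg_corrected = statistics.mean(corrected)

print(f"population variance:      {true_var:.1f}")
print(f"average when dividing by n:     {avg_biased:.1f}")
print(f"average when dividing by n - 1: {avg_corrected:.1f}")
```

On average, the divide-by-n estimate should come out at roughly (n - 1)/n of the population variance, while the corrected estimate should land close to it.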
In [43]:
# variance with ddof = 0
# sum((x_i - x_mean)^2) / n
sample.var(ddof=0)
Out[43]:
In [44]:
# variance with ddof = 1
# sum((x_i - x_mean)^2) / (n-1)
sample.var(ddof=1)
Out[44]:
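As a sanity check, the variance formulas in the comments above can be reproduced by hand and compared with what pandas returns. This is a small sketch on the same numbers as `sample`; the variable names are chosen for the example.

```python
import math
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])

n = len(sample)
deviations = sample - sample.mean()   # x_i - x_mean
ss = (deviations ** 2).sum()          # sum of squares (SS)

var_n = ss / n                        # population formula
var_n_minus_1 = ss / (n - 1)          # with Bessel's correction

# The manual results agree with pandas for the matching ddof
print(math.isclose(var_n, sample.var(ddof=0)))
print(math.isclose(var_n_minus_1, sample.var(ddof=1)))
print(math.isclose(math.sqrt(var_n_minus_1), sample.std()))
```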
In [45]:
# mean (average) absolute deviation
# Series.mad() was removed in pandas 2.0, so compute it directly
(sample - sample.mean()).abs().mean()
Out[45]:
Summary
In [54]:
lst2 = [38946, 43420, 49191, 50430, 50557, 52580, 53595, 54135, 60181, 62076]
In [55]:
sample2 = pd.Series(lst2)
In [61]:
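print(sample2.std(ddof=0))
print(sample2.mean())
# mean absolute deviation, computed directly (Series.mad() was removed in pandas 2.0)
print((sample2 - sample2.mean()).abs().mean())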
Reading from a csv
In [63]:
path = './salary.csv'
salary = pd.read_csv(path)
In [65]:
# read_csv returns a pandas DataFrame
salary.head()
Out[65]:
In [67]:
# standard deviation of each column
# ddof = 0
# divides by n instead of n - 1
salary.std(ddof=0)
Out[67]:
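If the CSV contains text columns, calling `std` on the whole DataFrame may raise an error in recent pandas versions. The sketch below shows one way around that with `numeric_only=True`. The contents of `salary.csv` are not shown here, so the columns and values are a hypothetical stand-in, loaded from an in-memory string instead of a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for salary.csv; the real file's columns are not shown
csv_text = """name,salary,years
a,100,1
b,200,2
c,300,3
"""
salary = pd.read_csv(io.StringIO(csv_text))

# numeric_only=True restricts the calculation to numeric columns
sd = salary.std(ddof=0, numeric_only=True)
print(sd)
```

The result is a Series with one standard deviation per numeric column.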