Variability of Data

Theory

Visualizations of data
- Histogram
- Boxplots
Range = max - min
- Can change whenever we add new data to the dataset
  - In particular, a single outlier can change it drastically
  - To reduce this sensitivity, statisticians typically cut off the top and bottom 25% of the data
    - The width of the remaining middle 50% is the interquartile range (IQR) = Q3 - Q1
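As a rough illustration of how sensitive the range is, the sketch below reuses the salary list from the pandas section of these notes and appends one extreme value. The value 250000 is made up purely for the example.

```python
# Salary values taken from the pandas section of these notes.
data = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]


def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)


range_before = value_range(data)

# Hypothetical extreme value, added only to illustrate the effect.
with_outlier = data + [250000]
range_after = value_range(with_outlier)

print(range_before)  # 55611
print(range_after)   # 216781
```

One added value roughly quadruples the range here, even though the other ten values are unchanged.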
Quartiles
- Sort the data and split it into two halves
  - Median of everything = Q2
  - First half's median = Q1
  - Second half's median = Q3
  - IQR = Q3 - Q1
    - About 50% of the data falls between Q1 and Q3
    - Unlike the range or the mean, IQR does not depend on every value in the dataset
    - As a result, IQR is largely unaffected by outliers
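The median-of-halves procedure above can be sketched in plain Python as follows. The helper name `quartiles` is hypothetical, and the handling of an odd number of values (leaving the middle value out of both halves) is one common convention assumed here, not something the notes specify.

```python
from statistics import median

# Salary values taken from the pandas section of these notes.
data = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values. For an odd count, the middle value
    is left out of both halves (an assumed convention).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # first half of the sorted data
    upper = ordered[-half:]  # second half of the sorted data
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 38801 47218.0 56863 18062
```

Note that `pandas.Series.quantile` interpolates between data points by default, so it can return slightly different values for Q1 and Q3 than this method does.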
Outliers
- A common rule of thumb flags a value x as an outlier if either
  - x < Q1 - 1.5*IQR
  - x > Q3 + 1.5*IQR
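A minimal sketch of the 1.5*IQR rule, again using the salary list from the pandas section and the median-of-halves quartiles described earlier. The names `lower_fence` and `upper_fence` are illustrative only.

```python
from statistics import median

# Salary values taken from the pandas section of these notes.
data = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]

ordered = sorted(data)
half = len(ordered) // 2
q1 = median(ordered[:half])   # median of the first half
q3 = median(ordered[-half:])  # median of the second half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the two fences is flagged as an outlier.
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)  # 11708.0 83956.0
print(outliers)                  # [88830]
```

Under this rule only the largest salary, 88830, is flagged; 78070 sits inside the upper fence.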
Deviation from mean = x_i - x_mean
Mean absolute deviation = sum(|x_i - x_mean|) / n
- n is the number of examples
- The absolute value is needed because the signed deviations always sum to 0
Squared deviation = (x_i - x_mean)^2
Mean squared deviation = variance = sum((x_i - x_mean)^2) / n
- Sum of squares (SS) = sum((x_i - x_mean)^2)
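The definitions above can be checked by hand on a small list. The values below are made up and chosen so that the arithmetic comes out to round numbers.

```python
# Made-up values, chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n  # 5.0

# Deviation from the mean for every value.
deviations = [x - mean for x in values]

# Mean absolute deviation: average of |x_i - x_mean|.
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Sum of squares (SS) and mean squared deviation (variance, dividing by n).
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / n

print(sum(deviations))  # 0.0, signed deviations cancel out
print(mean_abs_dev)     # 1.5
print(sum_of_squares)   # 32.0
print(variance)         # 4.0
```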
Standard deviation (SD) = variance^0.5
- For data that is approximately normally distributed:
  - Approximately 68% of data falls within 1 SD of the mean
  - Approximately 95% of data falls within 2 SD of the mean
  - Approximately 99.7% of data falls within 3 SD of the mean
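The 68 / 95 / 99.7 figures come from the normal distribution. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

For data that is strongly skewed or heavy-tailed, these percentages need not hold.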
Bessel's Correction
- In general, samples underestimate the variability of a population
  - Most population values lie near the middle, so a sample tends to miss the extremes
  - Deviations are also measured from the sample mean, which sits closer to the sampled values than the true population mean does
  - We can correct for this using Bessel's Correction
    - We divide by n - 1 instead of n (in pandas, ddof = 1)
    - This makes the variance and standard deviation slightly larger
- In summary
  - If we are estimating the standard deviation of a population from a sample, we divide by n - 1
  - If we have data for the entire population, we divide by n
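To see the underestimate concretely, the sketch below takes a tiny made-up population, enumerates every possible sample of size 2 drawn with replacement, and averages the variance computed both ways. The population values and the helper name `variance` are assumptions chosen for illustration.

```python
from itertools import product

population = [2, 4, 6]  # tiny made-up population
N = len(population)
pop_mean = sum(population) / N
pop_variance = sum((x - pop_mean) ** 2 for x in population) / N


def variance(values, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - ddof)


# Every possible sample of size 2, drawn with replacement (9 samples).
samples = list(product(population, repeat=2))

avg_divide_by_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(pop_variance)             # about 2.667
print(avg_divide_by_n)          # about 1.333, too small
print(avg_divide_by_n_minus_1)  # about 2.667, matches the population
```

Dividing by n gives an average that is only half the true variance for samples this small, while dividing by n - 1 recovers it on average.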

Calculating variability of data using pandas

import pandas as pd

lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)

type(sample)

pandas.core.series.Series

sample

0    33219
1    36254
2    38801
3    46335
4    46840
5    47596
6    55130
7    56863
8    78070
9    88830
dtype: int64

sample.mean()

52793.800000000003

sample.median()

47218.0

# standard deviation 
# default ddof = 1
# divided by n - 1
sample.std()

18000.701849279834

# standard deviation 
# ddof = 0
# divided by n
sample.std(ddof=0)

17076.965197598776

# variance with ddof = 0
# sum((x_i - x_mean)^2) / n
sample.var(ddof=0)

291622740.35999984

# variance with ddof = 1
# sum((x_i - x_mean)^2) / (n-1)
sample.var(ddof=1)

324025267.06666648
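As a sanity check, the pandas results can be reproduced from the formulas in the theory section. This is a sketch that rebuilds the same `sample` Series; the tolerance used in the comparisons is an arbitrary choice.

```python
import pandas as pd

sample = pd.Series([33219, 36254, 38801, 46335, 46840,
                    47596, 55130, 56863, 78070, 88830])
n = len(sample)

# Sum of squares: sum((x_i - x_mean)^2)
sum_of_squares = ((sample - sample.mean()) ** 2).sum()

var_population = sum_of_squares / n    # same formula as sample.var(ddof=0)
var_sample = sum_of_squares / (n - 1)  # same formula as sample.var(ddof=1)

# Compare with pandas, allowing for floating-point rounding.
print(abs(var_population - sample.var(ddof=0)) < 1e-3)  # True
print(abs(var_sample - sample.var()) < 1e-3)            # True, default ddof is 1
print(abs(var_sample ** 0.5 - sample.std()) < 1e-6)     # True
```

The two variances differ only by the factor n / (n - 1), which is 10 / 9 for this sample.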

# mean (average) absolute deviation
# Series.mad() was removed in pandas 2.0, so compute it directly
(sample - sample.mean()).abs().mean()

13543.560000000001

Summary

lst2 = [38946, 43420, 49191, 50430, 50557, 52580, 53595, 54135, 60181, 62076]

sample2 = pd.Series(lst2)

print(sample2.std(ddof=0))
print(sample2.mean())
# mean absolute deviation (Series.mad() was removed in pandas 2.0)
print((sample2 - sample2.mean()).abs().mean())

6557.16326547
51511.1
5002.3

Reading from a csv

path = './salary.csv'
salary = pd.read_csv(path)

# data read into a pandas DataFrame with a single column, salary
salary.head()

	salary
0	59147.29
1	61379.14
2	55683.19
3	56272.76
4	52055.88

# standard deviation
# ddof = 0
# divided by n instead of divided by n - 1
salary.std(ddof=0)

salary    10656.952669
dtype: float64
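`pd.read_csv` returns a DataFrame, which is why `salary.std(ddof=0)` prints a labelled value per column rather than a single number. The sketch below shows the difference using only the five rows displayed by `salary.head()`, loaded from an in-memory string instead of the real `salary.csv`, so its result will not match the full-file figure.

```python
import io

import pandas as pd

# Only the five rows shown by salary.head(); the real salary.csv has more data.
csv_text = "salary\n59147.29\n61379.14\n55683.19\n56272.76\n52055.88\n"
salary = pd.read_csv(io.StringIO(csv_text))

# On a DataFrame, std() returns a Series with one entry per column.
per_column = salary.std(ddof=0)

# Selecting the column first gives a Series, so std() returns a single float.
single_value = salary["salary"].std(ddof=0)

print(type(salary).__name__)      # DataFrame
print(type(per_column).__name__)  # Series
print(single_value)
```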
