Sampling and the Central Limit Theorem
Agenda
- Differentiate terms: discriptive/inferential, statistics population/sample, paramater/statistic, sample distribution/sampling distribution
- Define and calculate standard error
- Use Numpy to randomly sample a distribution
- Describe the central limit theorem and connect it to our knowledge of distributions and sampling.
- Citi Bike Example
Probability vs Statistics
- Probability starts with known probabilities and obtains how probable any particular observation would be
- Statistics works the other way around. Start with and observations (data) and try to determine its probability
Descriptive vs Inferential Statistics
- Descriptive Statistics
simply describe what is observed. The average height of a high school football team can be directly calculated by measuring all of the current players height.
- Inferential statistics
try to say something general about a larger group of subjects than those we have measured. For example, we would be doing inferential statistics if we wanted to know about the average height of all high school football teams.
- To put it another way, statistical inference is the process by which we take observations of a subset of a group and generalize to the whole group.
Population Inference
The mayor's office has hired Flatiron Data Science Immersive students to determine a way to fix traffic congestion. A good starting point is to determine what proportion of the population of Seattle owns a car.
In order for us to make any determinations about a population, we must first get information about it.
Because it's usually completely impractical to get data about everyone in a population, we must take a sample.
Key Terms
- the entire group is known as the population
- the subset is a known as the sample
- We would use samples if the population is:
- Too big to enumerate
- too difficult/time consuming or expensive to sample in its entirety.
Random sampling is not easy to do
Continuing our Seattle car example, how would we take a sample?
Here are two strategies we might employ:
-
Stand outside of Flatiron at 12 pm and ask random people until n responses
-
Go to a randomly assigned street corner and at a random time and ask n people if they own a car
Which strikes you as better?
What do we want our sample to look like?
In particular, what relationship do we want between the sample and the population? What steps can we take to improve our odds of success in achieving this?
Discussion
The first way of sampling is considered a convenience sample. You are going about collection in a non-random manner
Sample Conditions
-
The sampled observations must be independent
- The sampling method must be random
-
Sample size distribution:
- The more skewed the sample the larger samples we need.
- n > 30 is considered a large enough sample unless there is extreme skew
Population v Sample Terminology
Characteristics of populations are called parameters
Characteristics of a sample are called statistics
A sample statistic is a point estimate of the population parameter
A Simulation to Reinforce Our Definitions
Let's create a population of systolic blood pressure of adult males in Chicago, assuming a mean of 114 mmHg with a standard deviation of 11 mmHg. We will also assume the adult male population to be 1.5 million.
It is impossible to measure the systolic blood pressure of every man in Chicago, but let's assume multiple investigations have led to the conclusion the the mean and std of this population is 114 and 11, respecively. These are therefore estimators of the population parameter.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pop = int(1.5*10**6)
# Use numpy to generate a normal distribution of the
sys_pop = np.random.normal(loc=114, scale=11, size=pop)
fig, ax = plt.subplots()
sns.kdeplot(sys_pop, ax=ax, shade=True)
ax.set_title('Distribution of Adult Male Systolic Blood Pressure')
ax.set_xlabel('Systolic BP')
Text(0.5, 0, 'Systolic BP')
Let's then imagine we develop an effective manner of random sampling, and simulate with numpy. Our sample size is 40 people.
sample_size = 40
sample = np.random.choice(sys_pop, sample_size)
# We can look at the distribution of the values in the sample.
fig, ax = plt.subplots()
sns.distplot(sample, ax=ax, bins=15)
ax.set_title('Sample Distribution of Systolic BP Measurements')
Text(0.5, 1.0, 'Sample Distribution of Systolic BP Measurements')
We can then calculate the sample statistics:
print(f'Sample mean: {sample.mean()}')
print(f'Sample standard deviation: {sample.std()}')
print(f'Sample median: {np.median(sample)}')
Sample mean: 115.4907660259872
Sample standard deviation: 8.634151702239585
Sample median: 115.4998911419724
If we repeated this process, taking samples of the population repeatedly, we would get an array of sample statistics.
number_of_samples = 1000
sample_size = 50
sample_stats = []
for _ in range(number_of_samples):
sample = np.random.choice(sys_pop, sample_size)
# collect the mean of each of the 1000 samples in sample stats
sample_stats.append(sample.mean())
The collection of sample stats represents our sampling distribution
fig, ax = plt.subplots()
ax.hist(sorted(sample_stats), bins=20)
ax.set_title('Sampling Distribution\n of Systolic BP')
ax.set_xlabel("Systolic Blood Pressure")
ax.set_ylabel('Count');
An interesting property of this sampling distribution:
As we continue to sample, the mean of the sampling distribution gets closer and closer to the population mean.
Standard Error of the Mean
The standard error of the mean is the standard deviation of the sampling distribution.
The issue is that a sample is not an exact replica of the population. We need to account for that fact in order to make our estimate of the
Population sigma
- $ \sigma _{x}$ = standard error of
$\bar{x} $ - $ \sigma $ = standard deviation of population
What if we do not know the population sigma?
If we do not know the population standard deviation, we can approximate it by using the sample standard deviation.
- s = sample standard deviation
Sample size impact on standard error of mean
How should sample size influence standard error of the mean?
It will get smaller as sample size increases
Important implication: The Standard Error of the mean remains the same as long as the population standard deviation is known and sample size remains the same.
def standard_error(distribution, largest_sample_size, population_std=None):
'''
Calculate the standard errors for a range of sample sizes
to demonstrate how standard error decreases when sample
size increases.
'''
std_errors = {}
for sample_size in range(50,largest_sample_size+1):
sample = np.random.choice(distribution, size=sample_size, replace=True)
# Standard error with sample distribution standard deviation
# in place of population
if population_std == None:
std_err = np.std(sample)/np.sqrt(sample_size)
std_errors[sample_size] = std_err
else:
std_err = population_std/np.sqrt(sample_size)
std_errors[sample_size] = std_err
return std_errors
std_errors = standard_error(sys_pop, 1000)
fig, ax = plt.subplots()
sns.scatterplot(list(std_errors.keys()), list(std_errors.values()))
<matplotlib.axes._subplots.AxesSubplot at 0x1a242ffa20>
std_errors = standard_error(sys_pop, 1000, population_std=114)
fig, ax = plt.subplots()
sns.scatterplot(list(std_errors.keys()), list(std_errors.values()))
<matplotlib.axes._subplots.AxesSubplot at 0x1a23d56908>
Word Exercise
Put the variables in the correct place.
var_1 = 'population'
var_2 = 'sample'
var_3 = 'point estimate'
var_4 = 'statistic'
var_5 = 'parameter'
var_6 = 'sampling'
print(f"""We sampled 40 bee hives and calcuted the mean colony population
to be 75,690 bees. 75,690 is a {var_1} of the population paramter\n""")
print(f"""We repeatedly sample 40 people at random from Seattle and
measure their heart rate,then calculate the mean of each sample.
We call the plot of this collection of statistics
the {var_2} distribution.
""")
print(f"""There are exactly 58 Javan Rhino's left in the wild.
Their mean length has been measured accurately at 5 feet.
This mean length is considered a population {var_3}.
""")
print(f"""If we plot a histogram of individual pistil lengths
measured on 50 hibiscus flowers, we would be plotting the distribution
of an attribute of our {var_4} of hibiscus flowers.
""")
print(f"""Since every restaurant in Chicago is required by law to register
with the city, we can accurately count the number of active pizza restaurants
operating right now. This group represents the {var_5} of actively
operating, registered pizza restaurants in Chicago.
""")
print(f"""The mean number of hourly hits to Jelle's Marble Racing website
randomly sampled across a seven day period represents a sample
{var_6}.
""")
We sampled 40 bee hives and calcuted the mean colony population
to be 75,690 bees. 75,690 is a population of the population paramter
We repeatedly sample 40 people at random from Seattle and
measure their heart rate,then calculate the mean of each sample.
We call the plot of this collection of statistics
the sample distribution.
There are exactly 58 Javan Rhino's left in the wild.
Their mean length has been measured accurately at 5 feet.
This mean length is considered a population point estimate.
If we plot a histogram of individual pistil lengths
measured on 50 hibiscus flowers, we would be plotting the distribution
of an attribute of our statistic of hibiscus flowers.
Since every restaurant in Chicago is required by law to register
with the city, we can accurately count the number of active pizza restaurants
operating right now. This group represents the parameter of actively
operating, registered pizza restaurants in Chicago.
The mean number of hourly hits to Jelle's Marble Racing website
randomly sampled across a seven day period represents a sample
sampling.
#__SOLUTION__
# Word Exercise
var_1 = 'population'
var_2 = 'sample'
var_3 = 'point estimate'
var_4 = 'statistic'
var_5 = 'parameter'
var_6 = 'sampling'
print(f"""We sampled 40 bee hives and calcuted the mean colony population
to be 75,690 bees. 75,690 is a {var_3} of the population paramter\n""")
print(f"""We repeatedly sample 40 people at random from Seattle and
measure their heart rate,then calculate the mean of each sample.
We call the plot of this collection of statistics
the {var_6} distribution.
""")
print(f"""There are exactly 58 Javan Rhino's left in the wild.
Their mean length has been measured accurately at 5 feet.
This mean length is considered a population {var_5}.
""")
print(f"""If we plot a histogram of individual pistil lengths
measured on 50 hibiscus flowers, we would be plotting the distribution
of an attribute of our {var_2} of hibiscus flowers.
""")
print(f"""Since every restaurant in Chicago is required by law to register
with the city, we can accurately count the number of active pizza restaurants
actively operating right now. This group represents the {var_1} of actively
operating, registered pizza restaurants in Chicago.
""")
print(f"""The mean number of hourly hits to Jelle's Marble Racing website
randomly sampled across a seven day period represents a sample
{var_4}.
""")
We sampled 40 bee hives and calcuted the mean colony population
to be 75,690 bees. 75,690 is a point estimate of the population paramter
We repeatedly sample 40 people at random from Seattle and
measure their heart rate,then calculate the mean of each sample.
We call the plot of this collection of statistics
the sampling distribution.
There are exactly 58 Javan Rhino's left in the wild.
Their mean length has been measured accurately at 5 feet.
This mean length is considered a population parameter.
If we plot a histogram of individual pistil lengths
measured on 50 hibiscus flowers, we would be plotting the distribution
of an attribute of our sample of hibiscus flowers.
Since every restaurant in Chicago is required by law to register
with the city, we can accurately count the number of active pizza restaurants
actively operating right now. This group represents the population of actively
operating, registered pizza restaurants in Chicago.
The mean number of hourly hits to Jelle's Marble Racing website
randomly sampled across a seven day period represents a sample
statistic.
2. Use numpy to randomly sample a distribution
Group Exercise
Below, we have four different sample scenarios. Each group will code out the following:
You are given a "population" to sample from based on the type of distribution.
-
Take a random sample of size n, where n > 30, from the population and calculate the mean of that population.
-
Repeat the sample n numbers of times (n = 1000).
-
Plot the sampling distribution
mccalister = ['Adam', 'Amanda','Chum', 'Dann',
'Jacob', 'Jason', 'Johnhoy', 'Karim',
'Leana','Luluva', 'Matt', 'Maximilian' ]
for n in range(1,4):
group = np.random.choice(mccalister, 4, replace=False)
print(f'group {n}', group)
for name in list(group):
mccalister.remove(name)
group 1 ['Adam' 'Jason' 'Matt' 'Dann']
group 2 ['Maximilian' 'Johnhoy' 'Chum' 'Jacob']
group 3 ['Leana' 'Luluva' 'Amanda' 'Karim']
Group 1:
A bowler on the PBA rolls a strike 60% of the time. The population strikes of all games ever bowled is stored in in the population variable below.
population = np.random.binomial(12, .6, 10000)
fig, ax = plt.subplots()
ax.bar(range(0,12), np.unique(population, return_counts=True)[1])
ax.set_title('Strikes Per Game')
Text(0.5, 1.0, 'Strikes Per Game')
#__SOLUTION__
sample_means = []
for n in range(1000):
sample = np.random.choice(population, 50)
sample_means.append(sample.mean())
fig, ax = plt.subplots()
ax.hist(sample_means, bins = 20)
(array([ 1., 0., 1., 1., 5., 28., 34., 68., 83., 100., 115.,
167., 111., 121., 67., 51., 24., 12., 6., 5.]),
array([6.26 , 6.343, 6.426, 6.509, 6.592, 6.675, 6.758, 6.841, 6.924,
7.007, 7.09 , 7.173, 7.256, 7.339, 7.422, 7.505, 7.588, 7.671,
7.754, 7.837, 7.92 ]),
<a list of 20 Patch objects>)
Group 2:
Stored in the variable below is the number of pieces of mail that arrive per week at your door for each of the 4500 weeks in your life.
mail_population = np.random.poisson(3, 4500)
counts = np.unique(mail_population, return_counts=True)
fig, ax = plt.subplots()
ax.bar(np.unique(counts[0]), counts[1])
ax.set_title('Distribution of Pieces of Mail/Week')
ax.set_xlabel("Pieces of Mail")
Text(0.5, 0, 'Pieces of Mail')
#__SOLUTION__
sample_means = []
for n in range(1000):
sample = np.random.choice(mail_population, 50)
sample_means.append(sample.mean())
fig, ax = plt.subplots()
ax.hist(sample_means, bins = 30)
ax.set_title('Sample Means of Pieces of Mail\n Arriving at your door')
ax.set_xlabel('Number of pieces of mail')
Text(0.5, 0, 'Number of pieces of mail')
Group 3
The population data for the number of minutes between customers arriving in a Piggly Wiggly is stored in the variable piggly_population.
# on average, 20 customers enter per hour
piggly_population = np.random.exponential(1/(20/60), size=10000)
fig, ax = plt.subplots()
ax.hist(piggly_population, bins = 50, normed=True)
ax.set_title('Sample Means of Time Between Piggle Wiggly Customers')
ax.set_xlabel('Minutes');
/Users/johnmaxbarry/.local/lib/python3.7/site-packages/ipykernel_launcher.py:4: MatplotlibDeprecationWarning:
The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead.
after removing the cwd from sys.path.
#__SOLUTION__
sample_means = []
for n in range(1000):
sample = np.random.choice(piggly_population, 50)
sample_means.append(sample.mean())
fig, ax = plt.subplots()
ax.hist(sample_means, bins = 30);
ax.set_title("""Sample means of number of minutes\n between people entering a Piggly Wiggly""")
ax.set_xlabel("Number of minutes between customers")
Text(0.5, 0, 'Number of minutes between customers')
3. Central Limit Theorem
If we take repeated samples of a population, the sampling distribution of sample means will approximate to a normal distribution, no matter the underlying distribution!
$E(\bar{x_{n}}) = \mu$
as n --> "large"
Let's look at an example taken from the ubiquitous Iris dataset. This histogram represents the distributions of sepal length:
https://www.kaggle.com/tentotheminus9/central-limit-theorem-animation
As we will see in hypothesis testing, pairing this theorem with the Empirical rule will be very powerful.
Knowing that any sampling distribtion, no matter the underlying population distribution, will approach normality, we will be able to judge, given the empirical rule, how rare a given sample statistic is.
Bike Example
Capital bike share is trying to figure out their pricing for members versus non-members. The first step in their analysis is to see if members vs non-members ride for different amounts of time per ride.
Let's head over here for some DC bike data!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
%matplotlib inline
! wget 'https://s3.amazonaws.com/capitalbikeshare-data/201905-capitalbikeshare-tripdata.zip' -O temp.zip
! unzip temp.zip
--2020-06-03 08:43:31-- https://s3.amazonaws.com/capitalbikeshare-data/201905-capitalbikeshare-tripdata.zip
Resolving s3.amazonaws.com... 52.216.160.245
Connecting to s3.amazonaws.com|52.216.160.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7837266 (7.5M) [application/zip]
Saving to: 'temp.zip'
temp.zip 50%[=========> ] 3.81M 1.80MB/s ^C
Archive: temp.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of temp.zip or
temp.zip.zip, and cannot find temp.zip.ZIP, period.
df = pd.read_csv('201905-capitalbikeshare-tripdata.csv')
df.tail()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Duration | Start date | End date | Start station number | Start station | End station number | End station | Bike number | Member type | |
---|---|---|---|---|---|---|---|---|---|
337699 | 434 | 2019-05-31 23:59:15 | 2019-06-01 00:06:30 | 31281 | 8th & O St NW | 31627 | 3rd & M St NE | W23767 | Member |
337700 | 227 | 2019-05-31 23:59:45 | 2019-06-01 00:03:32 | 31201 | 15th & P St NW | 31229 | New Hampshire Ave & T St NW | W23691 | Member |
337701 | 1638 | 2019-05-31 23:59:45 | 2019-06-01 00:27:03 | 31261 | 21st St & Constitution Ave NW | 31247 | Jefferson Dr & 14th St SW | W20810 | Member |
337702 | 1621 | 2019-05-31 23:59:51 | 2019-06-01 00:26:53 | 31261 | 21st St & Constitution Ave NW | 31247 | Jefferson Dr & 14th St SW | W21714 | Casual |
337703 | 373 | 2019-05-31 23:59:51 | 2019-06-01 00:06:04 | 31281 | 8th & O St NW | 31201 | 15th & P St NW | W21769 | Member |
What shape do you expect our dataset to have?
import seaborn as sns
from scipy import stats
fig, ax = plt.subplots(2,1, figsize=(10,10))
sns.distplot(df.Duration, bins = 20, ax=ax[0])
sns.boxplot(df.Duration, ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1a35f11cc0>
The shape is difficult to see because of the outliers. Let's remove some to get a better sense of the shape
pop_no_fliers = df[np.abs(stats.zscore(df.Duration) > 3)]
fig, ax = plt.subplots()
sns.distplot(pop_no_fliers.Duration)
<matplotlib.axes._subplots.AxesSubplot at 0x1a267166d8>
That shape looks like an exponential curve. We briefly touched on the exponential in the plot exercise at the beginning of distributions (remember the lifespan of a computer part)?
member_df = df[df['Member type'] == 'Member']
casual_df = df[df['Member type'] == 'Casual']
plt.boxplot(member_df['Duration']);
from scipy import stats
print(member_df.shape)
sum(stats.zscore(member_df.Duration)>3)
(286079, 9)
1899
member_df_nofliers = member_df[np.abs(stats.zscore(member_df.Duration)) < 3]
member_df_nofliers.shape
(284180, 9)
import seaborn as sns
fig, ax = plt.subplots()
sns.distplot(member_df_nofliers.Duration, bins = 20, ax=ax);
fig, ax = plt.subplots()
sns.boxplot(casual_df.Duration)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2be4b630>
casual_df_nofliers = casual_df[np.abs(stats.zscore(casual_df.Duration)) < 3]
fig, ax = plt.subplots()
sns.distplot(casual_df_nofliers.Duration, bins=20, ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x1a264c89b0>
fig, ax = plt.subplots()
sns.distplot(member_df_nofliers.Duration, bins = 20, ax=ax, color='blue');
sns.distplot(casual_df_nofliers.Duration, bins=20, ax=ax, color='green')
<matplotlib.axes._subplots.AxesSubplot at 0x1a2d81c5c0>
Get population statistics
df.median()
Duration 724.0
Start station number 31266.0
End station number 31264.0
dtype: float64
Let's treat the whole dataset as our population.
pop_mean = df.Duration.mean()
pop_std = df.Duration.std()
print(f'pop_mean is {pop_mean} \npop_std is {pop_std}')
pop_mean is 1138.3041865065263
pop_std is 2282.5139701034595
def one_sample_mean(population):
sample = np.random.choice(population, size=200, replace=True)
return sample.mean()
one_sample_mean(df.Duration)
1237.33
When we take multiple samples from the distribution,and plot the means of each sample, the shape of the curve shifts
d = [one_sample_mean(df.Duration) for i in range(1000)]
plt.hist(d, bins=50)
(array([ 3., 5., 8., 26., 25., 50., 50., 75., 72., 68., 68., 73., 58.,
60., 48., 55., 33., 20., 31., 27., 17., 12., 16., 18., 12., 9.,
10., 4., 4., 7., 13., 2., 6., 0., 2., 1., 0., 1., 4.,
0., 0., 2., 1., 0., 0., 0., 0., 0., 2., 2.]),
array([ 829.1 , 852.8163, 876.5326, 900.2489, 923.9652, 947.6815,
971.3978, 995.1141, 1018.8304, 1042.5467, 1066.263 , 1089.9793,
1113.6956, 1137.4119, 1161.1282, 1184.8445, 1208.5608, 1232.2771,
1255.9934, 1279.7097, 1303.426 , 1327.1423, 1350.8586, 1374.5749,
1398.2912, 1422.0075, 1445.7238, 1469.4401, 1493.1564, 1516.8727,
1540.589 , 1564.3053, 1588.0216, 1611.7379, 1635.4542, 1659.1705,
1682.8868, 1706.6031, 1730.3194, 1754.0357, 1777.752 , 1801.4683,
1825.1846, 1848.9009, 1872.6172, 1896.3335, 1920.0498, 1943.7661,
1967.4824, 1991.1987, 2014.915 ]),
<a list of 50 Patch objects>)
import seaborn as sns
def central_limit_theorem_plotter(distribution, sample_size, num_samples, color='blue'):
sample_means = np.zeros(num_samples)
for idx, num in enumerate(range(num_samples)):
sample = np.random.choice(distribution, size=sample_size, replace=True)
sample_means[idx] = sample.mean()
sns.distplot(sample_means, bins=80, kde=True, color=color)
title = f'Sample Distribution n = {sample_size} and number of samples = {num_samples},\
std error = {pop_std / num_samples}'
print(f'mean = {sample_means.mean()}')
plt.title(title)
The number of samples drives the shape of the curve more than the sample size itself
central_limit_theorem_plotter(df.Duration, 1000, 500);
mean = 1136.515084
Larger sample size, Fewer samples
central_limit_theorem_plotter(df.Duration, 5000, 50);
mean = 1132.6324960000002
- What happens as we increase the sample size?
- How does the height of the distribution change? Why does it change?
central_limit_theorem_plotter(member_df.Duration, 1000, 500, 'blue')
central_limit_theorem_plotter(casual_df.Duration, 1000, 500, 'green')
mean = 885.1660820000001
mean = 2523.241936