-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path04_SamplingDistribution.Rmd
323 lines (243 loc) · 18.3 KB
/
04_SamplingDistribution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
# Sampling Distribution of $\bar{X}$
```{r, echo=FALSE}
# Unattach any packages that happen to already be loaded. In general this is unecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs > 0)){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3, cache=TRUE )
```
```{r, message=FALSE, warning=FALSE}
# load the ggplot2 and dplyr packages... which I use constantly.
library(ggplot2)
library(dplyr)
# Set default behavior of ggplot2 graphs to be black/white theme
theme_set(theme_bw())
# other packages I'll only use occasionally so instead of loading the
# whole package, I'll just do packageName::functionName() when I use
# the function.
```
In the previous chapter, we using bootstrapping to estimate the sampling distribution of $\bar{X}$. We then used this bootstrap distribution to calculate a confidence interval for the population mean. We noticed that the sampling distribution of $\bar{X}$ almost always looked like a normal distribution. Prior to the advent of modern computing, statisticians used a theoretical approximation known as the Central Limit Theorem (CLT). Even today, statistical procedures based on the CLT are widely used and often perform as the corresponding re-sampling technique. In this chapter we'll lay the theoretical foundations for the CLT as well as introduce computation
## Enlightening Example
Suppose we are sampling from a population that has a mean of $\mu=5$ and is skewed. For this example, I'll use a Chi-squared distribution with parameter $\nu=5$.
```{r}
# Population is a Chi-sq distribution with df=5
PopDist <- data.frame(x = seq(0,20,length=10000)) %>%
mutate(density=dchisq(x,df=5))
ggplot(PopDist, aes(x=x, y=density)) +
geom_area(fill='salmon') +
ggtitle('Population Distribution')
```
We want to estimate the mean $\mu$ and take a random sample of $n=5$. Lets do this a few times and notice that the sample mean is never exactly 5, but is a bit off from that.
```{r}
n <- 5 # Our Sample Size!
mosaic::do(3) * {
Sample.Data <- data.frame( x = rchisq(n,df=5) )
Sample.Data %>% summarise( xbar = mean(x) )
}
```
```{r}
n <- 5
SampDist <- mosaic::do(10000) * {
Sample.Data <- data.frame( x = rchisq(n,df=5) )
Sample.Data %>% summarise( xbar = mean(x) )
}
```
We will compare the population distribution to the sampling distribution graphically.
```{r}
ggplot() +
geom_area(data=PopDist, aes(x=x, y=density), fill='salmon') +
geom_histogram(data=SampDist, aes(x=xbar, y=..density..),
binwidth=.1,
alpha=.6) # alpha is the opacity of the layer
```
From the histogram of the sample means, we notice three things:
* The sampling distribution of $\bar{X}$ is centered at the population mean $\mu$.
* The sampling distribution of $\bar{X}$ has less spread than the population distribution.
* The sampling distribution of $\bar{X}$ is less skewed than the population distribution.
## Mathematical details
### Probability Rules for Expectations and Variances
Claim: For random variables $X$ and $Y$ and constant $a$ the following statements hold:
$$E\left(aX\right) = aE\left(X\right)$$
$$Var\left(aX\right) = a^{2}Var\left(X\right)$$
$$E\left(X+Y\right) = E\left(X\right)+E\left(Y\right)$$
$$E\left(X-Y\right) = E\left(X\right)-E\left(Y\right)$$
$$Var\left(X\pm Y\right) = Var\left(X\right)+Var\left(Y\right)\;\textrm{if X,Y are independent}$$
Proving these results is relatively straight forward and is done in almost all introductory probability text books.
### Mean and Variance of the Sample Mean
We have been talking about random variables drawn from a known distribution and being able to derive their expected values and variances. We now turn to the mean of a collection of random variables. Because sample values are random, any function of them is also random. So even though the act of calculating a mean is not a random process, the numbers that are feed into the algorithm are random. Thus the sample mean will change from sample to sample and we are interested in how it varies.
Using the rules we have just confirmed, it is easy to calculate the expectation and variance of the sample mean. Given a sample $X_{1},X_{2},\dots,X_{n}$ of observations where all the observations are independent of each other and all the observations have expectation $E\left[X_{i}\right]=\mu$ and variance $Var\left[X_{i}\right]=\sigma^{2}$ then
$$\begin{aligned}E\left[\bar{X}\right]
&= E\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}X_{i}\right] \\
&= \frac{1}{n}\sum_{i=1}^{n}E\left[X_{i}\right] \\
&= \frac{1}{n}\sum_{i=1}^{n}\mu \\
&= \frac{1}{n}\,n\mu \\
&= \mu\end{aligned}$$
and
$$\begin{aligned} Var\left[\bar{X}\right]
&= Var\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}\right] \\
&= \frac{1}{n^{2}}Var\left[\sum_{i=1}^{n}X_{i}\right] \\
&= \frac{1}{n^{2}}\sum_{i=1}^{n}Var\left[X_{i}\right] \\
&= \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma^{2} \\
&= \frac{1}{n^{2}}\,n\sigma^{2} \\
&= \frac{\sigma^{2}}{n}
\end{aligned}$$
Notice that the sample mean has the same expectation as the original distribution that the samples were pulled from, but it has a smaller variance! So the sample mean is an unbiased estimator of the population mean $\mu$
and the average distance of the sample mean to the population mean decreases as the sample size becomes larger.
## Distribution of $\bar{X}$
If the samples were drawn from a normal distribution
If $X_{i}\stackrel{iid}{\sim}N\left(\mu,\sigma^{2}\right)$ then it is well known (and proven in most undergraduate probability classes) that $\bar{X}$ is also normally distributed with a mean and variance that were already established. That is
$$\bar{X}\sim N\left(\mu_{\bar{X}}=\mu,\;\sigma_{\bar{X}}^{2}=\frac{\sigma^{2}}{n}\right)$$
Notation: Because the expectations of $X$ and $\bar{X}$ are the same, I could drop the subscript for the expectation of $\bar{X}$ but it is sometimes helpful to be precise. Because the variances are different we will use $\sigma_{\bar{X}}$ to denote the standard deviation of $\bar{X}$ and $\sigma_{\bar{X}}^{2}$ to denote variance of $\bar{X}$. If there is no subscript, we are referring to the population parameter of the distribution from which we taking the sample from.
Exercise: A researcher measures the wingspan of a captured Mountain Plover three times. Assume that each of these $X_{i}$ measurements comes from a $N\left(\mu=6\textrm{ inches},\,\sigma^{2}=1^{2}\textrm{ inch}\right)$ distribution.
1. What is the probability that the first observation is greater than 7?
$$\begin{aligned}P\left(X\ge7\right)
&= P\left(\frac{X-\mu}{\sigma}\ge\frac{7-6}{1}\right) \\
&= P\left(Z\ge1\right) \\
&= 0.1587
\end{aligned}$$
2. What is the distribution of the sample mean?
$$\bar{X}\sim N\left(\mu_{\bar{X}}=6,\,\;\sigma_{\bar{X}}^{2}=\frac{1^{2}}{3}\right)$$
3. What is the probability that the sample mean is greater than 7?
$$\begin{aligned}P\left(\bar{X}\ge7\right)
&= P\left(\frac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}\ge\frac{7-6}{\sqrt{\frac{1}{3}}}\right) \\
&= P\left(Z\ge\sqrt{3}\right) \\
&= P\left(Z\ge1.73\right) \\
&= 0.0418
\end{aligned}$$
Example: Suppose that the weight of an adult black bear is normally distributed with standard deviation $\sigma=50$ pounds. How large a sample do I need to take to be $95\%$ certain that my sample mean is within $10$ pounds of the true mean $\mu$?
So we want $$\left|\bar{X}-\mu\right| \le 10$$ which we rewrite as $$-10 \le\bar{X}-\mu_{\bar{X}}\le 10$$
$$\frac{-10}{\left(\frac{50}{\sqrt{n}}\right)} \le\frac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}\le \frac{10}{\left(\frac{50}{\sqrt{n}}\right)}$$
$$\frac{-10}{\left(\frac{50}{\sqrt{n}}\right)} \le Z\le \frac{10}{\left(\frac{50}{\sqrt{n}}\right)}$$
Next we look in our standard normal table to find a $z$-value such that $P\left(-z\le Z\le z\right)=0.95$ and that value is $z=1.96$.
```{r}
data <- data.frame( z= seq(-3, 3, length=1000) ) %>%
mutate( y = dnorm(z) )
ggplot(data, aes(x=z, y=y)) +
geom_line() +
geom_area( data = data %>% filter(abs(z) <= 1.96), fill='grey', alpha=.7) +
geom_text( x=0, y=.2, label='95%')
```
So all we need to do is solve the following equation for $n$
$$1.96 = \frac{10}{ \left( \frac{50}{\sqrt{n}} \right) }$$
$$\frac{1.96}{10}\left(50\right) = \sqrt{n}$$
$$96 \approx n$$
## Central Limit Theorem
> I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along. - Sir Francis Galton (1822-1911)
It was not surprising that the average of a number of normal random variables is also a normal random variable. Because the average of a number of binomial random variables cannot be binomial since the average could be something besides a $0$ or $1$ and the average of Poisson random variables does not have to be an integer. The question arises, what can we say the distribution of the sample mean if the data comes from a non-normal distribution? The answer is quite a lot! Provided the distribution sample from has a non-infinite variance and we have a sufficient sample size.
**Central Limit Theorem**
Let $X_{1},\dots X_{n}$ be independent observations collected from a distribution with expectation $\mu$ and variance $\sigma^{2}$. Then the distribution of $\bar{X}$ converges to a normal distribution with expectation $\mu$ and variance $\sigma^{2}/n$ as $n\rightarrow\infty$.
In practice this means that if $n$ is large (usually $n>30$ is sufficient), then $$\bar{X}\stackrel{\cdot}{\sim}N\left(\mu_{\bar{X}}=\mu,\,\,\,\sigma_{\bar{X}}^{2}=\frac{\sigma^{2}}{n}\right)$$
So what does this mean?
1. Variables that are the sum or average of a bunch of other random variables will be close to normal. Example: human height is determined by genetics, prenatal nutrition, food abundance during adolescence, etc. Similar reasoning explains why the normal distribution shows up surprisingly often in natural science.
2. With sufficient data, the sample mean will have a known distribution and we can proceed as if the sample mean came from a normal distribution.
Example: Suppose the waiting time from order to delivery at a fast-food restaurant is a exponential random variable with rate $\lambda=1/2$
minutes and so the expected wait time is $2$ minutes and the variance is $4$ minutes. What is the approximate probability that we observe a sample of size $n=40$
with a mean time greater than $2.5$ minutes?
$$\begin{aligned}P\left(\bar{X}\ge2.5\right)
&= P\left(\frac{\bar{X}-\mu_{\bar{X}}}{\sigma_{\bar{X}}}\ge\frac{2.5-\mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) \\
&\approx P\left(Z\ge\frac{2.5-2}{\frac{2}{\sqrt{40}}}\right) \\
&= P\left(Z\ge1.58\right) \\
&= 0.0571 \end{aligned}$$
```{r}
# Answer obtained via simulation
SampDist <- mosaic::do(10000) *{ # make 10,000
Sample <- data.frame( x= rexp(n=40, rate=1/2 ) ) # simulated xbar
Sample %>% summarise( xbar = mean( x ) ) # values
}
SampDist %>% # What proportion of those
mutate(Greater = ifelse(xbar >= 2.5, 1, 0)) %>% # xbar values are
summarise( ProportionGreater = mean(Greater) ) # greater than 2.5?
```
**Summary**
Often we have sampled $n$ elements from some population $Y_{1},Y_{2},\dots,Y_{n}$ independently and $E\left(Y_{i}\right)=\mu$ and $Var\left(Y_{i}\right)=\sigma^{2}$ and we want to understand the distribution of the sample mean, that is we want to understand how the sample mean varies from sample to sample.
$E\left(\bar{Y}\right)=\mu$. That states that the distribution of the sample mean will centered at $\mu$. We expect to sometimes take samples where the sample mean is higher than $\mu$ and sometimes less than $\mu$, but the average underestimate is the same magnitude as the average overestimate.
$Var\left(\bar{Y}\right)=\frac{\sigma^{2}}{n}$. This states that as our sample size increases, we trust the sample mean to be close to $\mu$. The larger the sample size, the greater our expectation that the $\bar{Y}$ will be close to $\mu$.
If $Y_{1},Y_{2},\dots,Y_{n}$ were sampled from a $N\left(\mu,\sigma^{2}\right)$ distribution then $\bar{Y}$
is normally distributed.
$$\bar{Y} \sim N\left(\mu_{\bar{Y}}=\mu,\;\;\sigma_{\bar{Y}}^{2}=\frac{\sigma^{2}}{n}\right)$$
If $Y_{1},Y_{2},\dots,Y_{n}$ were sampled from a distribution that is _not_ normal but has mean $\mu$ and variance $\sigma^{2}$, and our sample size is large, then $\bar{Y}$ is _approximately_ normally distributed.
$$\bar{Y} \stackrel{\cdot}{\sim} N\left(\mu_{\bar{Y}}=\mu,\;\;\sigma_{\bar{Y}}^{2}=\frac{\sigma^{2}}{n}\right)$$
## Exercises
1. Suppose that the amount of fluid in a small can of soda can be well approximated by a Normal distribution. Let $X$ be the amount of soda (in milliliters) in a single can and $X\sim N\left(\mu=222,\;\sigma=5\right)$.
a) $P\left(X>230\right)=$
b) Suppose we take a random sample of 6 cans such that the six cans are independent. What is the expected value of the mean of those six cans? In other words, what is $E\left(\bar{X}\right)$?
c) What is $Var\left(\bar{X}\right)$? (Recall we denote this as $\sigma_{\bar{X}}^{2}$)
d) What is the standard deviation of $\bar{X}$? (Recall we denote this as $\sigma_{\bar{X}}$)
e) What is the probability that the sample mean will be greater than 230 ml? That is, find $P\left(\bar{X}>230\right)$.
2. Suppose that the number of minutes that I spend waiting for my order at Big Foot BBQ can be well approximated by a Normal distribution with mean $\mu=10$ minutes and standard deviation $\sigma=1.5$ minutes.
a) Tonight I am planning on going to Big Foot BBQ. What is the probability I have to wait less than 9 minutes?
b) Over the next month, I'll visit Big Foot BBQ 5 times. What is the probability that the mean waiting time of those 5 visits is less than 9 minutes? (This assumes independence of visits but because I don't hit the same restaurant the same night each week, this assumption is probably OK.)
3. 5. Suppose that we have a population with the following distribution that has mean $\mu=5.2$ and standard deviation $\sigma=3.0$:
```{r, echo=FALSE, Ch3_HistogramMatching, cache=TRUE}
df.star <- function(x){
X <- splines::bs(x, knots=c(1.95,4,8), Boundary.knots=c(0,10))
return( X %*% c(0, 2, 0, 0, 2, 0) )
}
C <- integrate(df.star, 0, 10)$value
my.df <- function(x){
df.star(x) / C
}
my.rf <- function(n){
out <- rep(NA, n)
M <- .3
for(i in 1:n){
keep.going <- TRUE
while(keep.going){
x <- runif(1, 0,10)
if(runif(1) < my.df(x)/M){
keep.going <- FALSE
}
}
out[i] <- x
}
return(out)
}
mu <- integrate(function(x){ x*my.df(x) }, 0, 10)$value
#mu
sigma2 <- integrate( function(x){ (x-mu)^2 * my.df(x) }, 0, 10)$value
#sqrt(sigma2)
x <- seq(0,10, length=1001)
Population <- ggplot2::qplot(x, my.df(x), ylab='Density', main='Population Distribution', geom='line') +
labs(x=NULL)
mean.rf <- function(n, sample.size=10){
M <- matrix(my.rf(n*sample.size), nrow=n)
return( as.vector( apply(M, 1, mean) ) )
}
D.D <- data.frame(x = my.rf(10000), Group='D')
D.C <- data.frame(x = mean.rf(5000, 4), Group='C')
D.B <- data.frame(x = mean.rf(5000, 36), Group='B')
D.A <- data.frame(x = c(rnorm(20, mean=1, sd=.4),
rnorm(6000, mean=3.95, sd=.4),
rnorm(5000, mean=6.80, sd=.4),
rnorm(20, mean=8.5, sd=.2)), Group='A')
data <- rbind(D.A, D.B, D.C, D.D) %>%
mutate( Group = factor(Group, levels=c('A','B','C','D')))
summary.stats <- data %>%
group_by(Group) %>%
summarize(xbar = round(mean(x),digits=1),
s=round(sd(x),digits=1))
Possibilities <- ggplot(data, aes(x=x)) +
geom_histogram(binwidth=.2, aes(y = ..density..)) +
facet_wrap( ~ Group ) + theme_bw() +
geom_text( data=summary.stats, x=2, y=.6, aes(label=paste("Mean =",xbar) )) +
geom_text( data=summary.stats, x=8, y=.6, aes(label=paste("StdDev =",s) )) +
theme_bw() + labs(x=NULL, y='Density')
Rmisc::multiplot(Population, Possibilities)
```
a) Which of the histograms would most likely represent the distribution of the sample mean $\bar{x}$ of $n=4$ observations?
b) Which of the histograms would most likely represent the distribution of the sample mean $\bar{x}$ of $n=30$ observations?
c) Justify your choices in parts (a) and (b).
3. A bottling company uses a machine to fill bottles with a tasty beverage. The bottles are advertised to contain 300 milliliters (ml), but in reality the amount varies according to a normal distribution with mean $\mu=298$ ml and standard deviation $\sigma=3$ ml. (For this problem, we'll assume $\sigma$ is known and carry out the calculations accordingly).
a) What is the probability that a randomly chosen bottle contains less than 296 ml?
b) Given a simple random sample of size $n=6$ bottles, what is the probability that the sample mean is less than $296$ ml?
c) What is the probability that a single bottle is filled within $1$ ml of the true mean $\mu=298$ ml? *Hint: Draw the distribution and shade in what probability you want... then convert that to a question about standard normals. To find the answer using a table or R, you need to look up two values and perform a subtraction.*
d) What is the probability that the mean of $10$ randomly selected bottles is within $1$ ml of the mean? What about the mean of a sample of $n=100$ bottles?
e) If a sample of size $n=50$ has a sample mean of $\bar{x}=298$, should this be evidence that the filling machine is out of calibration? i.e., assuming the machine has a mean fill amount of $\mu=300$ and $\sigma=3$, what is $P\left(\bar{X}\le298\right)$?