---
title: "Foundations for inference - Sampling distributions"
output: statsr:::statswithr_lab
runtime: shiny
---
<div id="instructions">
Complete all **Exercises**, and submit answers to **Questions** on the Coursera
platform.
</div>
## Getting Started
### Load packages
In this lab we will explore the data using the `dplyr` package and visualize it
using the `ggplot2` package. The data can be found in the
companion package for this course, `statsr`.
Let's load the packages.
```{r load-packages, message=FALSE}
library(statsr)
library(dplyr)
library(shiny)
library(ggplot2)
```
### The data
We consider real estate data from the city of Ames, Iowa. The details of
every real estate transaction in Ames are recorded by the City Assessor's
office. Our particular focus for this lab will be all residential home sales
in Ames between 2006 and 2010. This collection represents our population of
interest. In this lab we would like to learn about these home sales by taking
smaller samples from the full population. Let's load the data.
```{r load-data}
data(ames)
```
We see that there are quite a few variables in the data set, enough to do a
very in-depth analysis. For this lab, we'll restrict our attention to just
two of the variables: the above ground living area of the house in square feet
(`area`) and the sale price (`price`).
We can explore the distribution of areas of homes in the population of home
sales visually and with summary statistics. Let's first create a visualization,
a histogram:
```{r area-hist}
ggplot(data = ames, aes(x = area)) +
  geom_histogram(binwidth = 250)
```
Let's also obtain some summary statistics. We can do this using the
`summarise` function, which lets us calculate as many statistics as we want
in a single call by listing them one after another. Some of the functions below should
be self-explanatory (like `mean`, `median`, `sd`, `IQR`, `min`, and `max`). A
new function here is the `quantile` function, which we can use to calculate
values corresponding to specific percentile cutoffs in the distribution. For
example, `quantile(x, 0.25)` will yield the cutoff value for the 25th percentile (Q1)
in the distribution of x. Finding these values is useful for describing the
distribution, as we can use them for descriptions like *"the middle 50% of the
homes have areas between such and such square feet"*.
```{r area-stats}
ames %>%
  summarise(mu = mean(area), pop_med = median(area),
            sigma = sd(area), pop_iqr = IQR(area),
            pop_min = min(area), pop_max = max(area),
            pop_q1 = quantile(area, 0.25),  # first quartile, 25th percentile
            pop_q3 = quantile(area, 0.75))  # third quartile, 75th percentile
```
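As a small illustration of how `quantile` and `IQR` relate, the sketch below uses a made-up vector of nine areas (hypothetical values, not taken from `ames`) and checks that the IQR is simply the distance between the first and third quartiles.

```r
# Hypothetical toy vector of living areas in square feet (NOT the Ames data)
x <- c(900, 1100, 1200, 1400, 1500, 1700, 1800, 2400, 3100)

q1 <- quantile(x, 0.25)  # first quartile (25th percentile)
q3 <- quantile(x, 0.75)  # third quartile (75th percentile)

# IQR() returns the distance between the two quartiles
iqr_by_hand <- unname(q3 - q1)
stopifnot(isTRUE(all.equal(iqr_by_hand, IQR(x))))

# Proportion of values lying between Q1 and Q3 (inclusive)
middle_share <- mean(x >= q1 & x <= q3)
print(c(q1 = unname(q1), q3 = unname(q3), iqr = iqr_by_hand))
print(middle_share)
```

With only nine values the share between the quartiles is not exactly one half; on a large data set such as `ames` it will be very close to 50%.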
1. Which of the following is **false**?
<ol>
<li> The distribution of areas of houses in Ames is unimodal and right-skewed. </li>
<li> 50\% of houses in Ames are smaller than 1,499.69 square feet. </li>
<li> The middle 50\% of the houses range between approximately 1,126 square feet and 1,742.7 square feet. </li>
<li> The IQR is approximately 616.7 square feet. </li>
<li> The smallest house is 334 square feet and the largest is 5,642 square feet. </li>
</ol>
ans: 50\% of houses in Ames are smaller than 1,499.69 square feet. (1,499.69 square feet is the population mean, not the median.)
## The unknown sampling distribution
In this lab we have access to the entire population, but this is rarely the
case in real life. Gathering information on an entire population is often
extremely costly or impossible. Because of this, we often take a sample of
the population and use that to understand the properties of the population.
If we are interested in estimating the mean living area in Ames based on a
sample, we can use the following command to survey the population.
```{r samp1}
samp1 <- ames %>%
  sample_n(size = 50)
```
This command collects a simple random sample of `size` 50 from the `ames` dataset,
which is assigned to `samp1`. This is like going into the City
Assessor's database and pulling up the files on 50 random home sales. Working
with these 50 files would be considerably simpler than working with all 2930
home sales.
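To see what drawing a simple random sample of rows involves, here is a minimal base R sketch. It assumes a simulated stand-in population (`pop`, a hypothetical data frame of 2,930 gamma-distributed areas) rather than the real `ames` data, and it samples row indices with `sample()`, which is conceptually what `sample_n` does for us.

```r
set.seed(1)

# Hypothetical stand-in for the population: 2,930 right-skewed "areas".
# These values are simulated and are NOT the Ames data.
pop <- data.frame(area = rgamma(2930, shape = 9, rate = 0.006))

# Draw 50 distinct row indices (sample() is without replacement by default)
idx <- sample(nrow(pop), size = 50)

# Keep only the sampled rows; drop = FALSE keeps the result a data frame
samp <- pop[idx, , drop = FALSE]

print(nrow(samp))        # number of sampled rows
print(mean(samp$area))   # sample mean
print(mean(pop$area))    # population mean, for comparison
```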
<div id="exercise">
**Exercise**: Describe the distribution of this sample. How does it compare to the distribution of the population? **Hint:** The `sample_n` function takes a random sample of observations (i.e. rows) from the dataset, so you can still refer to the variables in the dataset by the same names. The code you used in the previous exercise will also be helpful for visualizing and summarizing the sample; however, be careful not to label values `mu` and `sigma` anymore, since these are sample statistics, not population parameters. You can customize the labels of any of the statistics to indicate that they come from the sample.
</div>
```{r samp1-dist}
# type your code for the Exercise here, and Run Document
```
If we're interested in estimating the average living area in homes in Ames
using the sample, our best single guess is the sample mean.
```{r mean-samp1}
samp1 %>%
  summarise(x_bar = mean(area))
```
Depending on which 50 homes you selected, your estimate could be a bit above
or a bit below the true population mean of 1,499.69 square feet. In general,
though, the sample mean turns out to be a pretty good estimate of the average
living area, and we were able to get it by sampling less than 3\% of the
population.
2. Suppose we took two more samples, one of size 100 and one of size 1000. Which sample would you expect to provide the most accurate estimate of the population mean?
<ol>
<li> Sample size of 50. </li>
<li> Sample size of 100. </li>
<li> Sample size of 1000. </li>
</ol>
ans: Sample size of 1000.
Let's take one more sample of size 50, and view the mean area in this sample:
```{r mean-samp2}
ames %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(area))
```
Not surprisingly, every time we take another random sample, we get a different
sample mean. It's useful to get a sense of just how much variability we
should expect when estimating the population mean this way. The distribution
of sample means, called the *sampling distribution*, can help us understand
this variability. In this lab, because we have access to the population, we
can build up the sampling distribution for the sample mean by repeating the
above steps many times. Here we will generate 15,000 samples and compute the
sample mean of each. Note that we sample with replacement
(`replace = TRUE`), since sampling distributions are constructed by sampling
with replacement.
```{r loop}
sample_means50 <- ames %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  summarise(x_bar = mean(area))

ggplot(data = sample_means50, aes(x = x_bar)) +
  geom_histogram(binwidth = 20)
```
Here we use R to take 15,000 samples of size 50 from the population, calculate
the mean of each sample, and store the results in a data frame called
`sample_means50`, with one row per sample. Next, we review how this code works.
<div id="exercise">
**Exercise**: How many elements are there in `sample_means50`? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.
</div>
```{r sampling-dist}
# type your code for the Exercise here, and Run Document
```
## Interlude: Sampling distributions
The idea behind the `rep_sample_n` function is *repetition*. Earlier we took
a single sample of size `n` (50) from the population of all houses in Ames. With
this new function we are able to repeat this sampling procedure `reps` times in order
to build a distribution of a series of sample statistics, which is called the
**sampling distribution**.
Note that in practice one rarely gets to build sampling distributions,
because we rarely have access to data from the entire population.
Without the `rep_sample_n` function, this would be painful. We would have to
manually run the following code 15,000 times
```{r sample-code, eval=FALSE}
ames %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(area))
```
as well as store the resulting sample means each time in a separate vector.
Note that for each of the 15,000 times we computed a mean, we did so from a
**different** sample!
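The manual repetition described above can be sketched as an explicit loop in base R. This is only an illustration of the idea behind `rep_sample_n`, not its actual implementation, and it assumes a simulated stand-in population (`pop_area`, hypothetical values) instead of `ames$area`.

```r
set.seed(42)

# Hypothetical stand-in for ames$area: 2,930 simulated right-skewed values
pop_area <- rgamma(2930, shape = 9, rate = 0.006)

n_reps <- 15000             # number of repeated samples
x_bars <- numeric(n_reps)   # pre-allocated vector to hold each sample mean

for (i in seq_len(n_reps)) {
  # a fresh sample of size 50, drawn with replacement, on every pass
  samp <- sample(pop_area, size = 50, replace = TRUE)
  x_bars[i] <- mean(samp)
}

print(length(x_bars))   # one stored mean per repetition
print(mean(x_bars))     # center of the simulated sampling distribution
```

Each element of `x_bars` is the mean of a different sample of 50 values, which is exactly what each row of `sample_means50` represents.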
<div id="exercise">
**Exercise**: To make sure you understand how sampling distributions are built, and exactly what the `rep_sample_n` and `summarise` functions do, try modifying the code to create a sampling distribution of **25 sample means** from **samples of size 10**, and put them in a data frame named `sample_means_small`. Print the output. How many observations are there in this object called `sample_means_small`? What does each observation represent?
</div>
```{r practice-sampling-dist}
# type your code for the Exercise here, and Run Document
```
3. How many elements are there in this object called `sample_means_small`?
<ol>
<li> 0 </li>
<li> 3 </li>
<li> 25 </li>
<li> 100 </li>
<li> 5,000 </li>
</ol>
ans: 25
```{r sample-means-small}
# type your code for Question 3 here, and Run Document
```
4. Which of the following is **true** about the elements in the sampling distributions you created?
<ol>
<li> Each element represents a mean square footage from a simple random sample of 10 houses. </li>
<li> Each element represents the square footage of a house. </li>
<li> Each element represents the true population mean of square footage of houses. </li>
</ol>
ans: Each element represents a mean square footage from a simple random sample of 10 houses.
## Sample size and the sampling distribution
Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to
compute a sampling distribution, specifically, this one.
```{r hist}
ggplot(data = sample_means50, aes(x = x_bar)) +
  geom_histogram(binwidth = 20)
```
The sampling distribution that we computed tells us much about estimating
the average living area in homes in Ames. Because the sample mean is an
unbiased estimator, the sampling distribution is centered at the true average
living area of the population, and the spread of the distribution
indicates how much variability is induced by sampling only 50 home sales.
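Both claims can be checked numerically. The sketch below assumes a simulated stand-in population (`pop_area`, hypothetical values rather than `ames$area`) and compares the center of the simulated sampling distribution with the population mean. It also compares the spread with $\sigma / \sqrt{n}$, the standard result for the standard error of a mean when sampling with replacement.

```r
set.seed(2024)

# Hypothetical stand-in for ames$area: 2,930 simulated right-skewed values
pop_area <- rgamma(2930, shape = 9, rate = 0.006)

n <- 50
# 5,000 sample means, each from a sample of size n drawn with replacement
x_bars <- replicate(5000, mean(sample(pop_area, size = n, replace = TRUE)))

center_gap   <- mean(x_bars) - mean(pop_area)  # should be close to 0
se_simulated <- sd(x_bars)                     # spread of the sample means
se_theory    <- sd(pop_area) / sqrt(n)         # sigma / sqrt(n)

print(c(center_gap = center_gap,
        se_simulated = se_simulated,
        se_theory = se_theory))
```

The two standard errors will not match exactly, since one is a simulation estimate, but they should agree closely.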
In the remainder of this section we will work on getting a sense of the effect that
sample size has on our sampling distribution.
<div id="exercise">
**Exercise**: Use the app below to create sampling distributions of means of `area`s from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How do the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations?
</div>
```{r shiny, echo=FALSE}
shinyApp(
  ui = fluidPage(
    # Sidebar with inputs for the variable, sample size, and number of samples
    sidebarLayout(
      sidebarPanel(
        selectInput("selected_var",
                    "Variable:",
                    choices = list("area", "price"),
                    selected = "area"),
        numericInput("n_samp",
                     "Sample size:",
                     min = 1,
                     max = nrow(ames),
                     value = 30),
        numericInput("n_sim",
                     "Number of samples:",
                     min = 1,
                     max = 30000,
                     value = 15000)
      ),
      # Show a plot of the generated distribution
      mainPanel(
        plotOutput("sampling_plot"),
        verbatimTextOutput("sampling_mean"),
        verbatimTextOutput("sampling_se")
      )
    )
  ),
  # Define server logic required to draw a histogram
  server = function(input, output) {
    # create sampling distribution: draw all values at once, arrange them
    # with one sample per row, and take the mean of each row
    sampling_dist <- reactive({
      ames[[input$selected_var]] %>%
        sample(size = input$n_samp * input$n_sim, replace = TRUE) %>%
        matrix(ncol = input$n_samp) %>%
        rowMeans() %>%
        data.frame(x_bar = .)
      #ames %>%
      #  rep_sample_n(size = input$n_samp, reps = input$n_sim, replace = TRUE) %>%
      #  summarise_(x_bar = mean(input$selected_var))
    })
    # plot sampling distribution
    output$sampling_plot <- renderPlot({
      x_min <- quantile(ames[[input$selected_var]], 0.1)
      x_max <- quantile(ames[[input$selected_var]], 0.9)
      ggplot(sampling_dist(), aes(x = x_bar)) +
        geom_histogram() +
        xlim(x_min, x_max) +
        ylim(0, input$n_sim * 0.35) +
        ggtitle(paste0("Sampling distribution of mean ",
                       input$selected_var, " (n = ", input$n_samp, ")")) +
        xlab(paste("mean", input$selected_var)) +
        theme(plot.title = element_text(face = "bold", size = 16))
    })
    # mean of sampling distribution
    output$sampling_mean <- renderText({
      paste0("mean of sampling distribution = ", round(mean(sampling_dist()$x_bar), 2))
    })
    # standard error (SD) of sampling distribution
    output$sampling_se <- renderText({
      paste0("SE of sampling distribution = ", round(sd(sampling_dist()$x_bar), 2))
    })
  },
  options = list(height = 500)
)
```
5. It makes intuitive sense that as the sample size increases, the center of the sampling distribution becomes a more reliable estimate for the true population mean. Also as the sample size increases, the variability of the sampling distribution ________.
<ol>
<li> decreases </li>
<li> increases </li>
<li> stays the same </li>
</ol>
ans :decreases
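The effect of sample size on the spread can also be checked without the app. The following sketch assumes a simulated stand-in population (`pop_area`, hypothetical values rather than `ames$area`) and computes the standard deviation of 5,000 sample means for each of three sample sizes.

```r
set.seed(7)

# Hypothetical stand-in for ames$area: 2,930 simulated right-skewed values
pop_area <- rgamma(2930, shape = 9, rate = 0.006)

sample_sizes <- c(10, 50, 100)

# For each sample size, simulate 5,000 sample means and record their SD
se_by_n <- sapply(sample_sizes, function(n) {
  x_bars <- replicate(5000, mean(sample(pop_area, size = n, replace = TRUE)))
  sd(x_bars)
})
names(se_by_n) <- paste0("n = ", sample_sizes)

print(round(se_by_n, 1))
```

The printed values should shrink as `n` grows, mirroring what the app shows when you increase the sample size.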
<div id="exercise">
**Exercise**: Take a random sample of size 50 from `price`. Using this sample, what is your best point estimate of the population mean?
</div>
```{r price-sample}
# type your code for this Exercise here, and Run Document
```
<div id="exercise">
**Exercise**: Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called `sample_means50`. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?
</div>
```{r price-sampling}
# type your code for this Exercise here, and Run Document
```
<div id="exercise">
**Exercise**: Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called `sample_means150`. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
</div>
```{r price-sampling-more}
# type your code for this Exercise here, and Run Document
```
* * *
So far, we have only focused on estimating the mean living area in homes in
Ames. Now you'll try to estimate the mean home price.
Note that while you might be able to answer some of these questions using the app,
you are expected to write the required code and produce the necessary plots and
summary statistics. You are welcome to use the app for exploration.
<div id="exercise">
**Exercise**: Take a sample of size 15 from the population and calculate the mean `price` of the homes in this sample. Using this sample, what is your best point estimate of the population mean of prices of homes?
</div>
```{r price-sample-small}
# type your code for this Exercise here, and Run Document
```
<div id="exercise">
**Exercise**: Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 2000 samples from the population of size 15 and computing 2000 sample means. Store these means in a vector called `sample_means15`. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
</div>
```{r price-sampling-small}
# type your code for this Exercise here, and Run Document
```
<div id="exercise">
**Exercise**: Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called `sample_means150`. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
</div>
```{r price-sampling-big}
# type your code for this Exercise here, and Run Document
```
6. Which of the following is false?
<ol>
<li> The variability of the sampling distribution with the smaller sample size (`sample_means50`) is smaller than the variability of the sampling distribution with the larger sample size (`sample_means150`). </li>
<li> The means for the two sampling distributions are roughly similar. </li>
<li> Both sampling distributions are symmetric. </li>
</ol>
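ans: The variability of the sampling distribution with the smaller sample size (`sample_means50`) is smaller than the variability of the sampling distribution with the larger sample size (`sample_means150`).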
```{r price-sampling-compare}
# type your code for Question 6 here, and Run Document
```
<div id="license">
This is a derivative of an [OpenIntro](https://www.openintro.org/stat/labs.php) lab, and is released under an [Attribution-NonCommercial-ShareAlike 3.0 United States](https://creativecommons.org/licenses/by-nc-sa/3.0/us/) license.
</div>