---
title: 'SFB 1036 Course on Data Analysis: Lesson 10'
author: "Simon Anders"
output:
  html_document:
    df_print: paged
---
# Analysis of Variance
```{r}
library( tidyverse )
```
We load some example data found on the web
```{r}
medley <-
  read_csv( url( "https://raw.githubusercontent.com/aurielfournier/biometry_materials/master/20160225_biometry_lab_6_anova_assumptions/medley.csv" ) )[,1:3]
colnames(medley) <- str_to_lower( colnames(medley) )
medley
```
This is data from Medley and Clement, *Responses of Diatom Communities to heavy
metals in streams: the influence of longitudinal variation* (Ecological
Applications, [8:631](https://doi.org/10.1890/1051-0761(1998)008[0631:RODCTH]2.0.CO;2) (1998)).
I have taken the idea to use this as an example from Quinn and Keough's book *Experimental
design and data analysis for biologists* (Cambridge University Press, 2002).
Medley and Clement measured, for different sites along five streams, the concentration
of zinc and the [Shannon-Wiener diversity](https://en.wikipedia.org/wiki/Diversity_index) of [diatom](https://en.wikipedia.org/wiki/Diatom) species. Do high levels of zinc cause a reduction in
species diversity?
In order to discuss one-way ANOVA, let's use only one column, the zinc concentration, for now, and
ignore the stream column.
As a first step, we should plot the data
```{r}
ggplot( medley ) +
  geom_point( aes( x = zinc, y = diversity ) )
```
Let's bring the x axis into the correct order and then try again
```{r}
medley$zinc <- fct_relevel( medley$zinc,
  c( "BACK", "LOW", "MED", "HIGH" ) )
library( ggbeeswarm )
ggplot( medley ) +
  geom_point( aes( x = zinc, y = diversity ) )
```
In most papers, however, you will see this
```{r}
ggplot( medley ) +
  geom_boxplot( aes( x = zinc, y = diversity ) )
```
or even this
```{r}
medley %>%
  group_by( zinc ) %>%
  summarise(
    mean = mean( diversity ),
    sem = sd( diversity ) / sqrt( n() ) ) %>%
  rename( diversity = mean ) %>%
  ggplot() +
    geom_bar( aes( x = zinc, y = diversity ), stat = "identity", width = .7 ) +
    geom_errorbar( aes( x = zinc, ymin = diversity - sem, ymax = diversity + sem ),
      width = .3 )
```
Don't hide your data, so don't use bar charts.
Maybe combine boxplots and beeswarm plots:
```{r}
ggplot( medley, aes( x = zinc, y = diversity ) ) +
  geom_boxplot( color = "gray40" ) +
  geom_beeswarm()
```
Back to the original question.
Is there a difference in species diversity between the four groups?
We could run t tests:
```{r}
t.test(
  medley %>% filter( zinc == "BACK" ) %>% pull(diversity),
  medley %>% filter( zinc == "MED" ) %>% pull(diversity) )
```
But do we really want to compare each group against every other group? That would
just give us a multiple testing problem.
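To get a feeling for the size of the problem: if each of the six pairwise tests is run at the usual 5% level and we (as an idealization) treat them as independent, the chance of at least one false positive is already $1 - 0.95^6 \approx 0.26$, i.e., about 26% instead of 5%:

```{r}
# Chance of at least one false positive among 6 independent tests at alpha = 0.05
# (an idealization -- the six tests share data, so this is only a rough figure)
1 - 0.95^6
```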
Let's make a short detour to see why this is not a good idea.
We have four groups and, say, 30 measurement values, drawn completely at random, with
no connection between measurement values and groups. We do all 6 pairwise comparisons
with t tests. How often do we get a significant result? Run the following code a
couple of times.
```{r}
tibble(
  group = sample( c( "A", "B", "C", "D" ), 30, replace=TRUE ),
  value = rnorm( 30, mean=10, sd=3 ) ) ->
    simdata
expand( simdata, group1 = group, group2 = group ) %>%
  filter( group1 < group2 ) %>%
  group_by_all() %>%
  summarise( pval = t.test(
      simdata %>% filter( group == group1 ) %>% pull(value),
      simdata %>% filter( group == group2 ) %>% pull(value) )$p.value ) %>%
  mutate( signif = pval < 0.05 )
```
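To quantify this, here is a small sketch along the same lines: we repeat the whole simulation many times and count how often at least one of the six pairwise tests comes out "significant", even though there is nothing to find. (To keep the sketch simple, the 30 values are split into groups of roughly equal size, so that no group ends up nearly empty.)

```{r}
# Fraction of null simulations in which at least one of the 6 pairwise
# t tests gives p < 0.05 (the family-wise error rate)
replicate( 1000, {
  sim <- tibble(
    group = sample( rep( c( "A", "B", "C", "D" ), length.out = 30 ) ),
    value = rnorm( 30, mean = 10, sd = 3 ) )
  pvals <- combn( c( "A", "B", "C", "D" ), 2, function( g )
    t.test(
      sim %>% filter( group == g[1] ) %>% pull( value ),
      sim %>% filter( group == g[2] ) %>% pull( value ) )$p.value )
  any( pvals < 0.05 )
} ) %>% mean
```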
So, multiple pairwise tests are not a good idea. How can we compare all four groups at once?
Sneak preview: We use one-way ANOVA:
```{r}
anova( lm( diversity ~ zinc, medley ) )
```
As we can see, there is a significant p value. But what do all the other numbers mean?
To understand them, let's break the ANOVA calculation down into its pieces and do it "by hand".
In *analysis of variance*, or ANOVA, we examine how much of the variance the groups explain. First, a graphical representation of the idea:
```{r fig.height=2.5, fig.width=7}
cowplot::plot_grid(
  ggplot( medley ) + geom_jitter( aes( x = "all", y = diversity ), width=.1 ) +
    ylim( 0.2, 3.3 ),
  ggplot( medley ) + geom_jitter( aes( x = zinc, y = diversity ), width=.1 ) +
    ylim( 0.2, 3.3 ),
  rel_widths = c( 1, 3 ) )
```
The total variance of our diversity measurements is
```{r}
var( medley$diversity )
```
Remember that to get this number, we compute the *squared deviations from the mean*, sum them up,
and divide by the number of samples minus 1.
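In formula form (with $x_1, \dots, x_N$ denoting the $N$ measured values and $\bar{x}$ their mean):
$$ \operatorname{Var}(x) = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2 $$
In R: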
```{r}
( medley$diversity - mean(medley$diversity) )^2
sum( ( medley$diversity - mean(medley$diversity) )^2 ) / ( nrow(medley) - 1 )
```
For the moment, let's ignore the denominator and focus on the numerator of the variance, the *sum of squares* (SS or SumSq).
Here we calculate the sum of squares once more, now without dividing by `( nrow(medley) - 1 )`.
```{r}
medley %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) )
```
We call this quantity the *total sum of squares* (TSS).
In the TSS, we have subtracted from each value the mean over *all* samples. What happens
if we take the mean separately for each of the four groups?
```{r}
medley %>%
  group_by( zinc ) %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) )
```
Let's sum this up:
```{r}
medley %>%
  group_by( zinc ) %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  summarise( ss = sum(ss) )
```
Now, we get a lower sum of squares! Why? Because the means of the individual groups can be closer to the values
than the overall mean. This can be because the values scatter around different, group-specific true means, or simply by
chance.
First, some terminology: We call this second sum the *residual sum of squares* (RSS), because it is what is left over
after we make use of our knowledge of the differences in zinc concentration. The ratio RSS/TSS is called the *fraction of
variance unexplained*. It seems that here, about 72% of the total variance is not "explained" by the zinc concentration bin,
while 28% of the total variance *is* "explained" by it.
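Behind this is a general identity (the usual ANOVA sum-of-squares decomposition): writing $y_{ij}$ for the $j$-th value in group $i$, $\bar{y}_i$ for the mean of group $i$, $n_i$ for its size and $\bar{y}$ for the overall mean, we have
$$ \underbrace{\sum_{i}\sum_{j} \left( y_{ij} - \bar{y} \right)^2}_{\text{TSS}} \;=\; \underbrace{\sum_{i}\sum_{j} \left( y_{ij} - \bar{y}_i \right)^2}_{\text{RSS}} \;+\; \underbrace{\sum_{i} n_i \left( \bar{y}_i - \bar{y} \right)^2}_{\text{explained SS}} . $$
The last term can never be negative, which is why the RSS can never exceed the TSS. As a quick check of the percentages just quoted (the variable names `tss_zinc` and `rss_zinc` are used only for this sketch):

```{r}
tss_zinc <- medley %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  pull( ss )
rss_zinc <- medley %>%
  group_by( zinc ) %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  summarise( ss = sum( ss ) ) %>%
  pull( ss )
rss_zinc / tss_zinc    # fraction of variance unexplained
```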
Why the quotation marks? Because even noise can seem to explain something: Let's try the same on our simulation
data.
First the total sum of squares (TSS):
```{r}
simdata %>%
  summarise( ss = sum( ( value - mean(value) )^2 ) )
```
Now the residual sum of squares (RSS):
```{r}
simdata %>%
  group_by( group ) %>%
  summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
  summarise( ss = sum( ss ) )
```
How do we know what fraction of variance we can expect to see "explained" by chance?
For our simulation, we can just run it many times:
```{r}
replicate( 3000, {
  tibble(
    group = sample( c( "A", "B", "C", "D" ), 30, replace=TRUE ),
    value = rnorm( 30, mean=10, sd=3 ) ) ->
      simdata
  simdata %>%
    summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
    pull( ss ) ->
      tss
  simdata %>%
    group_by( group ) %>%
    summarise( ss = sum( ( value - mean(value) )^2 ) ) %>%
    summarise( ss = sum( ss ) ) %>%
    pull( ss ) ->
      rss
  rss/tss
}) -> frac_unexpl
hist( frac_unexpl, 30 )
```
Instead of looking at this unexplained fraction of the TSS, it is more common to
look at the so-called F statistic
```{r}
Fstat <- ( 1 - frac_unexpl ) / frac_unexpl * 26/3
```
This is the ratio of explained to unexplained variance. Why the `26/3`? Because we have used
sums of squares, but the F statistic is calculated from variances. So, we need to reintroduce
the $(N-1)$ denominator -- only that now it is $(N-4)$, because we had $N=30$ values and subtracted
$K=4$ group means for the $K$ groups instead of the usual single mean. This is called the number of *residual degrees of freedom*,
while $N-1$ is the total number of degrees of freedom, and the difference between them, $K-1=3$, is the
number of degrees of freedom associated with the group factor. If this sounds complicated,
don't worry -- we need this only for historical reasons, i.e., in order to be able to compare with
standard tables.
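Written out, the F statistic is the explained mean square divided by the residual mean square:
$$ F = \frac{ (\mathrm{TSS} - \mathrm{RSS}) / (K-1) }{ \mathrm{RSS} / (N-K) } $$
With $K-1 = 3$ and $N-K = 26$ for the simulation, this is exactly the `( 1 - frac_unexpl ) / frac_unexpl * 26/3` computed above.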
Here is the same histogram, but now for F:
```{r}
hist( Fstat, 100, freq=FALSE )
fg <- seq( 0, 10, length.out=100 )
lines( fg, df( fg, 3, 26 ), col="magenta" )
```
The point of the F statistic is that there is a formula for its distribution under
the null hypothesis (i.e., under the assumptions of our null simulation). So, we do not
actually need to perform the simulation (which takes time, and would have taken way too much
time before we had fast computers); we can simply look things up.
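As a quick sanity check (just a sketch; the cutoff of 2.5 is arbitrary), we can compare the tail probability estimated from the simulation with the one given by `pf`:

```{r}
# Fraction of simulated F values above an arbitrary cutoff of 2.5 ...
mean( Fstat > 2.5 )
# ... versus the tail probability of the F distribution with 3 and 26 degrees of freedom
pf( 2.5, 3, 26, lower.tail=FALSE )
```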
Coming back to the zinc data: There, our TSS and RSS values are (see above):
```{r}
medley %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  pull(ss) ->
    tss
medley %>%
  group_by( zinc ) %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  summarise( ss = sum(ss) ) %>%
  pull(ss) ->
    rss
c( tss = tss, rss = rss, unexpl_frac = rss/tss )
```
Also, we have $N=34$ samples and $K=4$ groups, and hence $K-1=3$ degrees of freedom associated
with the group (zinc) and $N-K=30$ remaining degrees of freedom. Our F value is hence
```{r}
(tss-rss) / rss * 30/3
```
The probability of explaining more variance by chance than what zinc seems to explain here, i.e., of getting a lower RSS or a higher F value by chance, is, according to the F distribution formula:
```{r}
pf( (tss-rss) / rss * 30/3, 3, 30, lower.tail=FALSE )
```
All this lengthy computation can be done automatically with a single call to
the `anova` function:
```{r}
anova( lm( diversity ~ zinc, medley ) )
```
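As a cross-check (a small sketch, not part of the `anova` output itself), the entries of this table can be rebuilt from the hand-computed `tss` and `rss`:

```{r}
# Rebuild the ANOVA table rows from the sums of squares and degrees of freedom
tibble(
  term      = c( "zinc", "Residuals" ),
  Df        = c( 3, 30 ),
  `Sum Sq`  = c( tss - rss, rss ),
  `Mean Sq` = `Sum Sq` / Df,
  `F value` = c( `Mean Sq`[1] / `Mean Sq`[2], NA ) )
```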
Permutation test
----------------
Instead of relying on the F distribution, we can also get a p value by permuting the group labels: if zinc had no influence on diversity, it should not matter which zinc label is attached to which measurement. So, we shuffle the labels many times and check how often the residual sum of squares comes out as low as the one actually observed.
```{r}
medley %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  pull ->
    tss
medley %>%
  group_by( zinc ) %>%
  summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
  summarise( ss = sum(ss) ) %>%
  pull ->
    rss
replicate( 1000, {
  medley %>%
    mutate( zinc = sample( zinc ) ) %>%
    group_by( zinc ) %>%
    summarise( ss = sum( ( diversity - mean(diversity) )^2 ) ) %>%
    summarise( ss = sum(ss) ) %>%
    pull } ) ->
  rss_perm
hist( rss_perm, 30 )
abline( v = rss, col="green" )
abline( v = tss, col="purple" )
```
```{r}
table( rss_perm < rss)
```
```{r}
( sum( rss_perm < rss ) + 1 ) / ( length(rss_perm) + 1 )
```
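This is the permutation p value: the fraction of label shufflings with an even lower RSS than the real labels, where the `+ 1` in numerator and denominator ensures that the estimate can never be exactly zero. For comparison, the parametric p value from the F distribution once more (it should be in the same ballpark):

```{r}
pf( (tss-rss) / rss * 30/3, 3, 30, lower.tail=FALSE )
```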
Two-way ANOVA
-------------
So far, we have ignored the "stream" column. Let's plot it:
```{r}
ggplot( medley ) +
  geom_point( aes( x = zinc, y = diversity, col = stream ) ) +
  scale_color_brewer( palette="Set3" )
```
Here is how a two-way ANOVA looks:
```{r}
anova( lm( diversity ~ stream + zinc, medley ) )
```
Details to follow.