---
bibliography: [library.bib]
link-citations: true
biblio-style: "apalike"
---
# Too good to be false: Nonsignificant results revisited
```{r echo = FALSE}
suppressPackageStartupMessages(library(httr))
suppressPackageStartupMessages(library(latex2exp))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(car))
suppressPackageStartupMessages(library(xtable))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(kableExtra))
source('assets/functions/tgtbf-functions.R')
```
According to @isbn:9780415278430 falsifiability serves as one of the main demarcating criteria in the social sciences, which stipulates that a hypothesis is required to have the possibility of being proven false to be considered scientific. Within the theoretical framework of scientific hypothesis testing, accepting or rejecting a hypothesis is unequivocal, because the hypothesis is either true or false. Statistical hypothesis testing, on the other hand, is a probabilistic operationalization of scientific hypothesis testing [@doi:10.1016/j.appsy.2004.02.001] and, in view of its probabilistic nature, is subject to decision errors. Such decision errors are the topic of this chapter.
Null Hypothesis Significance Testing (NHST) is the most prevalent paradigm for statistical hypothesis testing in the social sciences [@isbn:9781433805615]. In NHST the hypothesis $H_0$ is tested, where $H_0$ most often regards the absence of an effect. If deemed false, an alternative, mutually exclusive hypothesis $H_1$ is accepted. These decisions are based on the $p$-value; the probability of the sample data, or more extreme data, given $H_0$ is true. If the $p$-value is smaller than the decision criterion $\alpha$ [typically .05; @doi:10.3758/s13428-015-0664-2], $H_0$ is rejected and $H_1$ is accepted.
Table \@ref(tab:tgtbf-tab1) summarizes the four possible situations that can occur in NHST. The columns indicate which hypothesis is true in the population and the rows indicate what is decided based on the sample data. When there is discordance between the true- and decided hypothesis, a decision error is made. More specifically, when $H_0$ is true in the population, but $H_1$ is accepted ($'H_1'$), a Type I error is made ($\alpha$); a false positive (lower left cell). When $H_1$ is true in the population and $H_0$ is not rejected ($'H_0'$), a Type II error is made ($\beta$); a false negative (upper right cell). However, when the null hypothesis is true in the population and $H_0$ is not rejected ($'H_0'$), this is a true negative (upper left cell; $1-\alpha$). The true negative rate is also called specificity of the test. Conversely, when the alternative hypothesis is true in the population and $H_1$ is accepted ($'H_1'$), this is a true positive (lower right cell). The probability of finding a statistically significant result if $H_1$ is true is the power ($1-\beta$), which is also called the sensitivity of the test. Power is a positive function of the (true) population effect size, the sample size, and the alpha of the study, such that higher power can always be achieved by altering either the sample size or the alpha level [@Aberson2010-xa].
```{r tgtbf-tab1, echo=FALSE}
capt <- "Summary table of possible NHST results. Columns indicate the true situation in the population, rows indicate the decision based on a statistical test. The true positive probability is also called power and sensitivity, whereas the true negative rate is also called specificity."
tabdf <- read.csv('assets/tables/ch5-tab1.csv', stringsAsFactors = FALSE)
names(tabdf) <- c('', '', '$H_0$', '$H_1$')
if (!knitr::is_html_output()) {
knitr::kable(tabdf, caption = capt, escape = FALSE, booktabs = TRUE, format = 'latex') %>%
kableExtra::add_header_above(c(' ' = 2, 'Population' = 2)) %>%
kableExtra::kable_styling(latex_options = c('striped', 'hold_position', 'scale_down'), position = 'center')
} else {
knitr::kable(tabdf, caption = capt, escape = FALSE, booktabs = TRUE) %>%
kableExtra::add_header_above(c(' ' = 2, 'Population' = 2)) %>%
kableExtra::kable_styling(position = 'center',
bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
}
```
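As a minimal sketch of how power depends on effect size, sample size, and alpha, the base R function `power.t.test()` can be used. The effect sizes, group sizes, and alpha levels below are illustrative assumptions and are not taken from this chapter's data.

```r
# Power of a two-sample t-test for illustrative (assumed) values.
# Each call changes one ingredient relative to the baseline.
p_base   <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = .05)$power
p_more_n <- power.t.test(n = 60, delta = 0.5, sd = 1, sig.level = .05)$power
p_more_a <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = .10)$power
p_more_d <- power.t.test(n = 30, delta = 0.8, sd = 1, sig.level = .05)$power

round(c(baseline = p_base, larger_n = p_more_n,
        larger_alpha = p_more_a, larger_effect = p_more_d), 3)
```

Each of the three changes raises power relative to the baseline, but only the sample size and the alpha level are under the researcher's control.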
Unfortunately, NHST has led to many misconceptions and misinterpretations [@doi:10.1053/j.seminhematol.2008.04.003;@doi:10.1037/h0020412]. The most serious mistake relevant to our chapter is that many researchers accept the null hypothesis and claim no effect when a result is statistically nonsignificant [about 60%, see @doi:10.3758/bf03213921]. Hence, most researchers overlook that the outcome of hypothesis testing is probabilistic (if the null hypothesis is true, or the alternative hypothesis is true and power is less than 1) and interpret outcomes of hypothesis testing as reflecting the absolute truth. At least partly because of mistakes like this, many researchers ignore the possibility of false negatives and false positives, and both remain pervasive in the literature.
Readers should be aware that outcomes of inferential statistics such as $p$-values are generally highly unstable [e.g., @isbn:9780415879682], and that we should consider statistical results as more incomplete and uncertain than is currently the norm [@doi:10.1080/00031305.2018.1543137]. Nonetheless, as NHST and decisions based on that are still the norm in many empirical sciences, the present chapter focuses on the NHST framework and erroneous decisions based on this framework.
Recent debate about false positives has received much attention in science and psychological science in particular. The Reproducibility Project Psychology (RPP), which replicated 100 effects reported in prominent psychology journals in 2008, found that only 36% of these effects were statistically significant in the replication [@doi:10.1126/science.aac4716]. Besides in psychology, reproducibility problems have also been indicated in economics [@doi:10.1126/science.aaf0918] and medicine [@doi:10.1038/483531a]. Although these studies suggest substantial evidence of false positives in these fields, replications show considerable variability in resulting effect size estimates [@doi:10.1027/1864-9335/a000178;@doi:10.1177/1745691614528518]. Therefore caution is warranted when wishing to draw conclusions on the presence of an effect in individual (original or replication) studies [@doi:10.1126/science.aac4716;@doi:10.1126/science.aad7243;@doi:10.1126/science.aad9163].
The debate about false positives is driven by the current overemphasis on statistical significance of research results [@doi:10.1177/1745691612457576]. This overemphasis is substantiated by the finding that more than 90% of results in the psychological literature are statistically significant [@doi:10.1126/science.aac4716;@doi:10.2307/2684823;@doi:10.2307/2282137] despite low statistical power due to small sample sizes [@doi:10.1037/h0045186;@doi:10.1037/0033-2909.105.2.309;@doi:10.2466/03.11.pms.112.2.331-348;@doi:10.1177/1745691612459060]. Consequently, publications have become biased by overrepresenting statistically significant results [@doi:10.1037/h0076157], which generally results in effect size overestimation in both individual studies [@doi:10.3758/s13428-015-0664-2] and meta-analyses [@doi:10.1037/met0000025;@doi:10.1111/j.2044-8317.1978.tb00578.x;@isbn:9780470870150;@isbn:9781119964377]. The overemphasis on statistically significant effects has been accompanied by questionable research practices [QRPs; @doi:10.1177/0956797611430953] such as erroneously rounding $p$-values towards significance, which for example occurred for 13.8% of all $p$-values reported as "$p =.05$" in articles from eight major psychology journals in the period 1985-2013 [@doi:10.7717/peerj.1935].
The concern for false positives has overshadowed the concern for false negatives in the recent debate, which seems unwarranted. @doi:10.1037/h0045186 was the first to indicate that psychological science was (severely) underpowered, where underpowered is defined as the chance of finding a statistically significant effect in the sample being lower than 50% when there is truly an effect in the population. This has not changed throughout the subsequent fifty years [@doi:10.1177/1745691612459060;@doi:10.1371/journal.pone.0109019]. Given that false negatives are the complement of true positives (i.e., power), there is likewise no evidence that the problem of false negatives has been resolved in psychology. Moreover, @doi:10.1177/1745691612462587 expressed the concern that an increased focus on false positives is too shortsighted because false negatives are more difficult to detect than false positives. They also argued that, because of the focus on statistically significant results, negative results are less likely to be the subject of replications than positive results, decreasing the probability of detecting a false negative. Additionally, the Positive Predictive Value [PPV, the proportion of statistically significant effects that are true; @doi:10.1371/journal.pmed.0020124] has been a major point of discussion in recent years, whereas the Negative Predictive Value (NPV) has rarely been mentioned.
The research objective of the current chapter is to examine evidence for false negative results in the psychology literature. Assuming the framework of NHST, this chapter operates under the premise of naive realism about the interpretation of statistically nonsignificant results, where naive realism would result in neglecting the probabilistic nature of these results. To this end, we inspected a large number of nonsignificant results from eight flagship psychology journals. First, we compared the observed effect distributions of nonsignificant results for eight journals (combined and separately) to the expected null distribution based on simulations, where a discrepancy between observed and expected distribution was anticipated (i.e., presence of false negatives). Second, we propose to use the Fisher test to test the hypothesis that $H_0$ is true for all nonsignificant results reported in a paper, which we show to have high power to detect false negatives in a simulation study. Third, we applied the Fisher test to the nonsignificant results in 14,765 psychology papers from these eight flagship psychology journals to inspect how many papers show evidence of at least one false negative result. Fourth, we examined evidence of false negatives in reported gender effects. Gender effects are particularly interesting, because gender is typically a control variable and not the primary focus of studies. Hence we expect little $p$-hacking and substantial evidence of false negatives in reported gender effects in psychology. Finally, as another application, we applied the Fisher test to the 64 nonsignificant replication results of the RPP [@doi:10.1126/science.aac4716] to examine whether at least one of these nonsignificant results may actually be a false negative.
The approach in this chapter contrasts with research on statistical power in the current psychological literature. Research on statistical power in a literature or set of papers typically asks the *theoretical* question 'what is the power of detecting true effect size $x$ with the sample size of paper $y$?' for all papers $y$ in that set of papers or literature, as a function of $x$. In contrast, this chapter answers the *empirical* question 'do we reject the null hypothesis of no effect based on a set of statistically nonsignificant $p$-values?', although we also examine the statistical power of the procedure to answer this question.
## Theoretical framework
We begin by reviewing the probability density function of both an individual $p$-value and a set of independent $p$-values as a function of population effect size. Subsequently, we apply the Kolmogorov-Smirnov test to inspect whether a collection of nonsignificant results across papers deviates from what would be expected under $H_0$. We also propose an adapted Fisher method to test whether nonsignificant results deviate from $H_0$ within a paper. These methods will be used to test whether there is evidence for false negatives in the psychology literature.
### Distributions of _p_-values
The distribution of one $p$-value is a function of the population effect, the observed effect and the precision of the estimate. When the population effect is zero, the probability distribution of one $p$-value is uniform. When there is a non-zero effect, the probability distribution is right-skewed. More specifically, as sample size or true effect size increases, the probability distribution of one $p$-value becomes increasingly right-skewed. These regularities also generalize to a set of independent $p$-values, which are uniformly distributed when there is no population effect and right-skew distributed when there is a population effect, with more right-skew as the population effect and/or precision increases [@Fisher1925-jl].
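These regularities can be illustrated with a small simulation. The sketch below assumes a two-sample $t$-test with 30 observations per group and a standardized mean difference of 0.5 under the alternative; these values are chosen for illustration only.

```r
set.seed(1)

# Simulate p-values of two-sample t-tests for a given true mean difference.
sim_p <- function(delta, n = 30, reps = 2000) {
  replicate(reps, t.test(rnorm(n, mean = delta), rnorm(n))$p.value)
}

p_null <- sim_p(delta = 0)    # no population effect: roughly uniform
p_alt  <- sim_p(delta = 0.5)  # population effect: right-skewed, mass near 0

# Proportion of p-values below .05 and the median p-value in each condition
c(null = mean(p_null < .05), alt = mean(p_alt < .05))
c(null = median(p_null), alt = median(p_alt))
```

Plotting `hist(p_null)` and `hist(p_alt)` shows the flat and the right-skewed shape, respectively.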
Considering that the present chapter focuses on false negatives, we primarily examine nonsignificant $p$-values and their distribution. Since the test we apply is based on nonselected $p$-values, it requires random variables distributed between 0 and 1. We apply the following transformation to each nonsignificant $p$-value that is selected:
\begin{equation}
p^*_i=\frac{p_i-\alpha}{1-\alpha}
(\#eq:pistar)
\end{equation}
where $p_i$ is the reported nonsignificant $p$-value, $\alpha$ is the selected significance cutoff (i.e., $\alpha=.05$), and $p^*_i$ the transformed $p$-value. Note that this transformation retains the distributional properties of the original $p$-values for the selected nonsignificant results. Both one-tailed and two-tailed tests can be included in this way.
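A short sketch of Equation \@ref(eq:pistar) follows. It draws uniform $p$-values (as expected under $H_0$), keeps the nonsignificant ones, and checks that the transformed values again cover the unit interval; the number of draws is an arbitrary choice.

```r
set.seed(2024)
alpha <- .05

p        <- runif(10000)        # p-values under H0 are uniform on (0, 1)
p_nonsig <- p[p >= alpha]       # selection on nonsignificance
p_star   <- (p_nonsig - alpha) / (1 - alpha)  # rescale (alpha, 1) to (0, 1)

range(p_star)                   # lies within the unit interval
mean(p_star)                    # close to .5, as for a uniform variable
ks.test(p_star, "punif")        # compare against the uniform distribution
```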
### Testing for false negatives: the Fisher test
We applied the Fisher test to inspect whether the distribution of observed nonsignificant $p$-values deviates from those expected under $H_0$. The Fisher test was initially introduced as a meta-analytic technique to synthesize results across studies [@Fisher1925-jl;@Hedges1985-dy]. When applied to transformed nonsignificant $p$-values (see Equation \@ref(eq:pistar)) the Fisher test tests for evidence against $H_0$ in a set of nonsignificant $p$-values. In other words, the null hypothesis we test with the Fisher test is that all included nonsignificant results are true negatives. The Fisher test statistic is calculated as
\begin{equation}
\chi^2_{2k}=-2\sum\limits^k_{i=1}\ln(p^*_i)
(\#eq:fishertest)
\end{equation}
where $k$ is the number of nonsignificant $p$-values and $\chi^2$ has $2k$ degrees of freedom. A larger $\chi^2$ value indicates more evidence for at least one false negative in the set of $p$-values. We conclude that there is sufficient evidence of at least one false negative result, if the Fisher test is statistically significant at $\alpha=.10$, similar to tests of publication bias that also use $\alpha=.10$ [@doi:10.1016/s0895-43560000242-0;@doi:10.1177/1740774507079441;@doi:10.3758/s13423-012-0227-9].
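Equations \@ref(eq:pistar) and \@ref(eq:fishertest) can be combined in a few lines of R. The function name `fisher_nonsig()` and the two sets of $p$-values are hypothetical and serve only to illustrate the computation.

```r
# Adapted Fisher test for a set of nonsignificant p-values.
fisher_nonsig <- function(p, alpha = .05) {
  stopifnot(all(p >= alpha & p <= 1))
  p_star <- (p - alpha) / (1 - alpha)   # Equation (pistar)
  chi2   <- -2 * sum(log(p_star))       # Equation (fishertest)
  df     <- 2 * length(p)
  c(chi2 = chi2, df = df,
    p = pchisq(chi2, df = df, lower.tail = FALSE))
}

# Made-up p-values that cluster just above .05
res_low  <- fisher_nonsig(c(.06, .08, .11, .35))
# Made-up p-values spread over the nonsignificant range
res_flat <- fisher_nonsig(c(.30, .50, .70, .90))

res_low["p"]  < .10   # evidence for at least one false negative
res_flat["p"] < .10   # no such evidence
```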
We estimated the power of detecting false negatives with the Fisher test as a function of sample size $N$, true correlation effect size $\eta$, and $k$ nonsignificant test results (the full procedure is described in Appendix A). The three levels of sample size used in our simulation study (33, 62, 119) correspond to the 25th, 50th (median) and 75th percentiles of the degrees of freedom of reported $t$, $F$, and $r$ statistics in eight flagship psychology journals (see Application 1 below). Degrees of freedom of these statistics are directly related to sample size, for instance, for a two-group comparison including 100 people, $df=98$.
Table \@ref(tab:tgtbf-tab2) summarizes the results for the simulations of the Fisher test when the nonsignificant $p$-values are generated by either small or medium population effect sizes. Results for all 5,400 conditions can be found on the OSF ([osf.io/qpfnw](https://osf.io/qpfnw)). The results indicate that the Fisher test is a powerful method to test for a false negative among nonsignificant results. For example, for small true effect sizes ($\eta=.1$), 25 nonsignificant results from medium samples result in 85% power (7 nonsignificant results from large samples yield 83% power). For medium true effects ($\eta=.25$), three nonsignificant results from small samples ($N=33$) already provide 89% power for detecting a false negative with the Fisher test. For large effects ($\eta=.4$), two nonsignificant results from small samples already almost always detect the existence of false negatives (not shown in Table \@ref(tab:tgtbf-tab2)).
```{r tgtbf-tab2, echo = FALSE}
capt <- "Power of Fisher test to detect false negatives for small- and medium effect sizes (i.e., $\\eta=.1$ and $\\eta=.25$), for different sample sizes (i.e., $N$) and number of test results (i.e., $k$). Results of each condition are based on 10,000 iterations. Power was rounded to 1 whenever it was larger than .9995."
tabdf <- read.csv('assets/tables/ch5-tab2.csv', stringsAsFactors = FALSE)
names(tabdf) <- c('', '$N=33$', '$N=62$', '$N=119$', '$N=33$', '$N=62$', '$N=119$')
if (!knitr::is_html_output()) {
knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = FALSE, format = 'latex') %>%
kableExtra::add_header_above(c(" " = 1, "$\\\\\\eta=.1$" = 3, "$\\\\\\eta=.25$" = 3), escape = FALSE) %>%
kableExtra::kable_styling(latex_options = c('striped', 'hold_position'), position = 'center')
} else {
knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = FALSE) %>%
kableExtra::add_header_above(c(" " = 1, "$\\\\\\eta=.1$" = 3, "$\\\\\\eta=.25$" = 3), escape = FALSE) %>%
kableExtra::kable_styling(position = 'center',
bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
}
```
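The logic of such a power simulation can be sketched as follows. This is a simplified illustration and not the procedure of Appendix A: it assumes that correlations are tested with the Fisher $z$ approximation, and the function name `fisher_power()` and its defaults are hypothetical, so the resulting values need not match Table \@ref(tab:tgtbf-tab2).

```r
set.seed(123)

# Estimated power of the Fisher test for k nonsignificant results,
# given a true correlation rho and sample size N (simplified sketch).
fisher_power <- function(rho, N, k, n_iter = 2000,
                         alpha = .05, alpha_f = .10) {
  se <- 1 / sqrt(N - 3)                 # standard error of Fisher's z
  reject <- logical(n_iter)
  for (i in seq_len(n_iter)) {
    p <- numeric(0)
    while (length(p) < k) {             # sample until k nonsignificant results
      z     <- rnorm(2 * k, mean = atanh(rho), sd = se)
      p_new <- 2 * pnorm(-abs(z / se))  # two-tailed p-values
      p     <- c(p, p_new[p_new >= alpha])
    }
    p    <- p[seq_len(k)]
    chi2 <- -2 * sum(log((p - alpha) / (1 - alpha)))
    reject[i] <- pchisq(chi2, df = 2 * k, lower.tail = FALSE) < alpha_f
  }
  mean(reject)
}

power_null <- fisher_power(rho = 0,   N = 62, k = 10)  # approx. alpha_f
power_alt  <- fisher_power(rho = .25, N = 62, k = 10)
c(null = power_null, alternative = power_alt)
```

Under a true correlation of zero the rejection rate approximates the nominal $\alpha=.10$ of the Fisher test, whereas it is clearly higher when the nonsignificant results stem from a true effect.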
```{r echo = FALSE}
tmp_power <- function(es, n, digits){
  # Step 1 - critical value (upper tail, alpha = .10)
  tCV <- qt(0.1, df = n - 1, lower.tail = FALSE)
  # Step 2 - noncentrality parameter
  f2 <- es^2 / (1 - es^2)
  ncp <- f2 * n
  # Step 3 - power
  power <- pt(q = tCV, df = n - 1, ncp = ncp, lower.tail = FALSE)
  round(power, digits)
}
```
To put the power of the Fisher test into perspective, we can compare its power to reject the null based on one statistically nonsignificant result ($k=1$) with the power of a regular $t$-test to reject the null. If $\eta=.1$, the power of a regular $t$-test equals `r tmp_power(.1, 33, 3)`, `r tmp_power(.1, 62, 3)`, `r tmp_power(.1, 119, 3)` for sample sizes of 33, 62, 119, respectively; if $\eta=.25$, power values equal `r tmp_power(.25, 33, 3)`, `r tmp_power(.25, 62, 3)`, `r tmp_power(.25, 119, 3)` for these sample sizes. The power values of the regular $t$-test are higher than those of the Fisher test, because the Fisher test does not make use of the more informative statistically significant findings.
## Application 1: Evidence of false negatives in articles across eight major psychology journals
To show that statistically nonsignificant results do not warrant the interpretation that there is truly no effect, we analyzed statistically nonsignificant results from eight major psychology journals. First, we investigate whether, and by how much, the distribution of reported nonsignificant effect sizes deviates from the effect size distribution expected if there is truly no effect (i.e., $H_0$). Second, we investigate how many research articles report nonsignificant results and how many of those show evidence for at least one false negative using the Fisher test [@Fisher1925-jl]. Note that this application only investigates the evidence of false negatives in articles, not how authors might interpret these findings (i.e., we do not assume all these nonsignificant results are interpreted as evidence for the null).
### Method
APA style $t$, $r$, and $F$ test statistics were extracted from eight psychology journals with the `R` package `statcheck` [@doi:10.3758/s13428-015-0664-2;@statcheck]. APA style is defined as the format where the type of test statistic is reported, followed by the degrees of freedom (if applicable), the observed test value, and the $p$-value [e.g., $t(85)=2.86, p=.005$; @isbn:9781433805615]. The `statcheck` package also recalculates $p$-values. We reuse the data from Nuijten et al. [https://osf.io/gdr4q; @doi:10.3758/s13428-015-0664-2]. Table \@ref(tab:tgtbf-tab3) depicts the journals, the timeframe, and summaries of the results extracted. The database also includes $\chi^2$ results, which we did not use in our analyses because effect sizes based on these results are not readily mapped on the correlation scale. Two erroneously reported test statistics were eliminated, such that these did not confound results.
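To illustrate what recalculating a $p$-value from an APA-style result involves, the following sketch parses the example string above with a regular expression and recomputes the two-tailed $p$-value. It is not the `statcheck` implementation; the pattern only handles this simple $t$-test format and is an assumption made for illustration.

```r
apa <- "t(85)=2.86, p=.005"

# Capture degrees of freedom, test value, and reported p-value.
pattern <- "t\\(([0-9]+)\\) *= *(-?[0-9]*\\.?[0-9]+), *p *= *(\\.[0-9]+)"
m <- regmatches(apa, regexec(pattern, apa))[[1]]

df    <- as.numeric(m[2])
t_val <- as.numeric(m[3])
p_rep <- as.numeric(m[4])

# Recalculated two-tailed p-value based on the test value and df
p_calc <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)

c(reported = p_rep, recalculated = round(p_calc, 4))
```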
```{r echo = FALSE, results = 'hide'}
# Downloads the data if not available locally
# The data from the Nuijten et al paper
if(!file.exists('assets/data/statcheck_full_anonymized.csv'))
{
GET('https://osf.io/gdr4q/?action=download',
write_disk('assets/data/statcheck_full_anonymized.csv', overwrite = TRUE))
}
if(!file.exists('assets/data/datafilegender500_post.csv'))
{
# the gender related results (ALL)
GET('https://raw.githubusercontent.com/chartgerink/2014tgtbf/master/data/datafilegender500_post.csv',
write_disk('assets/data/datafilegender500_post.csv', overwrite = TRUE))
}
if(!file.exists('assets/data/gendercoded cleaned and discussed.csv'))
{
# the coded gender results
GET('https://raw.githubusercontent.com/chartgerink/2014tgtbf/master/data/gendercoded%20cleaned%20and%20discussed.csv',
write_disk('assets/data/gendercoded cleaned and discussed.csv', overwrite = TRUE))
}
dat <- read.csv2('assets/data/statcheck_full_anonymized.csv',
stringsAsFactors = F,
dec = ",",
sep = ";")[-1]
# There are two test statistic indicators that are NA
# Manually correct these
dat$Statistic[is.na(dat$Statistic)] <- "F"
# Computing unadjusted and adjusted effect sizes (OBSERVED)
dat <- cbind(dat, esComp.statcheck(dat))
dat$adjESComp[dat$adjESComp < 0] <- 0
# Turning df1 for t and r into 1.
dat$df1[dat$Statistic == "t" | dat$Statistic == "r"] <- 1
# Select out incorrectly extracted r values
dat <- dat[!(dat$Statistic=="r" & dat$Value > 1),]
# Select out irrefutably wrong df reporting
dat <- dat[!dat$df1 == 0,]
# select out NA computed p-values
dat <- dat[!is.na(dat$Computed),]
# Selecting only the t, r and F values
dat <- dat[dat$Statistic == 't' | dat$Statistic == 'r' | dat$Statistic == 'F',]
nsig <- dat$Computed >= .05
esR <- c(.1, .25, .4)
alpha = .05
alphaF = .1
# Statistical properties of the Fisher method -----------------------------
# condition setting
N <- c(as.numeric(summary(dat$df2)[2]), # 25th percentile
as.numeric(summary(dat$df2)[3]), # 50th percentile
as.numeric(summary(dat$df2)[5]) # 75th percentile
)
ES <- c(.00,
seq(.01, .99, .01))
P <- c(seq(1, 10, 1), seq(15, 50, 5))
alpha <- .05
alphaF <- 0.10
n.iter <- 10000
# NOTE: the simulation only runs if all results files (N_33.csv, N_62.csv, N_119.csv) are absent
if(!file.exists('assets/data/N_33.csv') &&
!file.exists('assets/data/N_62.csv') &&
!file.exists('assets/data/N_119.csv'))
{
set.seed(35438759)
source('assets/functions/simCode.R')
}
# Load all results back in
files <- list.files('assets/data')
files <- files[grepl(pattern = "N_", files)]
names <- str_sub(files,start=1L, end=-5L)
for(i in seq_along(files)){
assign(x = names[i], read.csv(sprintf('assets/data/%s', files[i])))
assign(x = names[i], t(get(x = names[i])[ ,-1]))
}
# Data for table
# rows are k
# columns are effect size
# N = 33
t(get(x = names[2]))
# N = 62
t(get(x = names[3]))
# N = 119
t(get(x = names[1]))
# Table 3 power computations
ser <- 1/sqrt(c(33, 62, 119)-3)
rho <- .1
zcv <- 1.282
rcv <- (exp(2*(zcv*ser))-1)/(exp(2*(zcv*ser))+1)
zrcv <- .5*log((1+rcv)/(1-rcv))
zrho <- .5*log((1+rho)/(1-rho))
# round(1-pnorm(zrcv, mean=zrho, sd=ser),4)
rho <- .25
rcv <- (exp(2*(zcv*ser))-1)/(exp(2*(zcv*ser))+1)
zrcv <- .5*log((1+rcv)/(1-rcv))
zrho <- .5*log((1+rho)/(1-rho))
# round(1-pnorm(zrcv, mean=zrho, sd=ser),4)
# Normal-approximation (Wald) 90% interval for a rejection rate of .1 based on 10,000 iterations
.1 - qnorm(.95, 0, 1) * (sqrt((1/10000) * .1 * .9))
.1 + qnorm(.95, 0, 1) * (sqrt((1/10000) * .1 * .9))
```
The analyses reported in this chapter use the recalculated $p$-values to eliminate potential errors in the reported $p$-values [@doi:10.3758/s13428-011-0089-5;@doi:10.3758/s13428-015-0664-2]. However, our recalculated $p$-values assumed that all other test statistics (degrees of freedom, test values of $t$, $F$, or $r$) are correctly reported. Errors in these reported statistics may have affected the results of our analyses. Since most $p$-values and corresponding test statistics were consistent in our data set (`r round((dim(dat)[1] - sum(dat$Error[!is.na(dat$Error)])) / dim(dat)[1] * 100,1)`%), we do not believe such reporting errors substantially affected our results and conclusions based on them.
```{r tgtbf-tab3, echo = FALSE}
capt <- "Summary table of articles downloaded per journal, their mean number of results, and proportion of (non)significant results. Statistical significance was determined using $\\alpha=.05$, two-tailed test."
if (knitr::is_html_output()) {
fil <- 'assets/tables/ch5-tab3-html.csv'
} else {
fil <- 'assets/tables/ch5-tab3-latex.csv'
}
tabdf <- read.csv(fil, stringsAsFactors = FALSE)
names(tabdf) <- c('Journal Acronym',
'Time frame',
'Results',
'Mean results per article',
'Significant (\\%)',
'Nonsignificant (\\%)')
if (!knitr::is_html_output()) {
knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = FALSE, format = 'latex') %>%
landscape() %>%
kableExtra::kable_styling(latex_options = c('striped', 'scale_down'), position = 'center')
} else {
knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = FALSE) %>%
kableExtra::scroll_box(width = "100%") %>%
kableExtra::kable_styling(position = 'center',
bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
}
```
```{r echo = FALSE}
# Descriptives data set
journals <- sort(unique(dat$journals.jour.))
# for(j in 1:length(journals)){
# selJournal <- dat$journals.jour. == journals[j]
# meanK <- mean(table(dat$Source[selJournal]))
# len <- length(dat$Computed[selJournal & !is.na(dat$Computed)])
# sigRes <- sum(dat$Computed[selJournal & !is.na(dat$Computed)] < .05)
# )
# }
iccSS <- Anova(lm(dat$Computed[nsig] ~ dat$Source[nsig]), type="III")
pstar <- ((dat$Computed[nsig & !(dat$Computed == 1)] - .05) / (1 - .05))
tmp <- log(pstar / (1 - pstar))
iccSStmp <- Anova(lm(tmp ~ dat$Source[nsig & !(dat$Computed == 1)]), type="III")
# df <- read.csv('assets/tables/tgtbf-descriptive.csv')
# knitr::kable(df, caption = 'Summary table of articles downloaded per journal, their mean number of results, and proportion of (non)significant results. Statistical significance was determined using alpha=.05, two-tailed test.') %>%
# kable_styling(latex_options = c('scale_down'))
```
First, we compared the observed nonsignificant effect size distribution (computed with observed test results) to the expected nonsignificant effect size distribution under $H_0$. The expected effect size distribution under $H_0$ was approximated using simulation. We first randomly drew an observed test result (with replacement) and subsequently drew a random nonsignificant $p$-value between 0.05 and 1 (i.e., from the uniform distribution expected under $H_0$). Based on the drawn $p$-value and the degrees of freedom of the drawn test result, we computed the accompanying test statistic and the corresponding effect size (for details on effect size computation see Appendix A). This procedure was repeated 163,785 times, which is three times the number of observed nonsignificant test results (54,595). The collection of simulated results approximates the expected effect size distribution under $H_0$, assuming independence of test results in the same paper. We inspected this possible dependency with the intraclass correlation ($ICC$), where $ICC=1$ indicates full dependency and $ICC=0$ indicates full independence. For the set of observed results, the $ICC$ for nonsignificant $p$-values was `r round(iccSS$Sum[2]/(iccSS$Sum[3]+iccSS$Sum[2]), 3)`, indicating independence of $p$-values within a paper (the $ICC$ of the log odds transformed $p$-values was similar, with $ICC=`r round(iccSStmp$Sum[2]/(iccSStmp$Sum[3]+iccSStmp$Sum[2]), 5)`$ after excluding $p$-values equal to 1 for computational reasons). The resulting expected effect size distribution was compared to the observed effect size distribution (i) across all journals and (ii) per journal. To test for differences between the expected and observed nonsignificant effect size distributions we applied the Kolmogorov-Smirnov test. This is a non-parametric goodness-of-fit test for equality of distributions, which is based on the maximum absolute deviation between the cumulative distribution functions of the two independent samples being compared [denoted $D$; @doi:10.2307/2280095].
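The simulation of the expected distribution and its comparison with the observed distribution can be sketched as follows. The "observed" results here are synthetic and generated from a noncentral $F$ distribution purely for illustration, and the effect size formula $\eta^2 = F \cdot df_1 / (F \cdot df_1 + df_2)$ is an assumption standing in for the computation described in Appendix A.

```r
set.seed(42)

# Assumed effect size computation (explained variance from an F statistic)
eta2 <- function(f, df1, df2) (f * df1) / (f * df1 + df2)

# Synthetic "observed" results generated under a true effect (hypothetical)
n_obs <- 2000
df1   <- rep(1, n_obs)
df2   <- sample(c(33, 62, 119), n_obs, replace = TRUE)
f_obs <- rf(n_obs, df1, df2, ncp = 2)
p_obs <- pf(f_obs, df1, df2, lower.tail = FALSE)
keep  <- p_obs >= .05                      # nonsignificant results only
obs_es <- eta2(f_obs[keep], df1[keep], df2[keep])

# Expected distribution under H0: resample the degrees of freedom of the
# observed results and draw p-values uniformly between .05 and 1
n_sim  <- 3 * sum(keep)
idx    <- sample(which(keep), n_sim, replace = TRUE)
p_sim  <- runif(n_sim, min = .05, max = 1)
f_sim  <- qf(p_sim, df1[idx], df2[idx], lower.tail = FALSE)
sim_es <- eta2(f_sim, df1[idx], df2[idx])

# Kolmogorov-Smirnov test on the two effect size distributions
ks <- ks.test(obs_es, sim_es)
ks$statistic   # D, the maximum distance between the two empirical CDFs
ks$p.value
```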
Second, we applied the Fisher test to test how many research papers show evidence of at least one false negative statistical result. To recapitulate, the Fisher test tests whether the distribution of observed nonsignificant $p$-values deviates from the uniform distribution expected under $H_0$. In order to compute the result of the Fisher test, we applied equations \@ref(eq:pistar) and \@ref(eq:fishertest) to the recalculated nonsignificant $p$-values in each paper ($\alpha=.05$).
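Applying the Fisher test per paper amounts to grouping the nonsignificant $p$-values by their source. The toy data frame below is made up; its column names `Source` and `Computed` mirror those used in the analysis code of this chapter.

```r
# Made-up results from three hypothetical papers
toy <- data.frame(
  Source   = c("paper_a", "paper_a", "paper_a", "paper_b", "paper_b", "paper_c"),
  Computed = c(.06, .07, .09, .40, .85, .03),
  stringsAsFactors = FALSE
)

alpha  <- .05
nonsig <- toy[toy$Computed >= alpha, ]   # paper_c has no nonsignificant result

# Fisher test p-value per paper, based on its nonsignificant results
fisher_p <- tapply(nonsig$Computed, nonsig$Source, function(p) {
  p_star <- (p - alpha) / (1 - alpha)
  pchisq(-2 * sum(log(p_star)), df = 2 * length(p), lower.tail = FALSE)
})

fisher_p < .10   # TRUE indicates evidence for at least one false negative
```

Papers without any nonsignificant result, such as the hypothetical `paper_c`, drop out of the analysis.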
### Results
#### Observed effect size distribution.
```{r percentages, echo = FALSE}
# Observed effect sizes on the correlation scale, excluding missing values
etaObs <- sqrt(dat$esComp)
etaObs <- etaObs[!is.na(etaObs)]
# Percentage of observed effects in each effect size category
small <- mean(etaObs < .1) * 100
medium <- mean(etaObs >= .1 & etaObs < .25) * 100
large <- mean(etaObs >= .25 & etaObs < .4) * 100
larger <- mean(etaObs >= .4) * 100
fig1cap <- sprintf('Density of observed effect sizes of results reported in eight psychology journals, with %s percent of effects in the category none-small, %s percent small-medium, %s percent medium-large, and %s percent large and beyond.', round(small, 0), round(medium, 0), round(large, 0), round(larger, 0))
```
Figure \@ref(fig:tgtbf-fig1) shows the distribution of observed effect sizes (in $|\eta|$) across all articles and indicates that, of the 223,082 observed effects, `r round(small, 0)`% were zero to small (i.e., $0\leq|\eta|<.1$), `r round(medium, 0)`% were small to medium (i.e., $.1\leq|\eta|<.25$), `r round(large, 0)`% medium to large (i.e., $.25\leq|\eta|<.4$), and `r round(larger, 0)`% large or larger [i.e., $|\eta|\geq.4$; @isbn:9780805802832]. This suggests that the majority of effects reported in psychology is medium or smaller (i.e., `r round(medium+small,0)`%), which is somewhat in line with a previous study on effect distributions [@doi:10.1016/j.paid.2016.06.069]. Of the full set of 223,082 test results, 54,595 (24.5%) were nonsignificant, which is the data set for our main analyses.
```{r tgtbf-fig1, fig.cap=fig1cap, echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig1.pdf.svg.png', auto_pdf = TRUE)
```
Our data set indicated that more nonsignificant results are reported throughout the years, strengthening the case for inspecting potential false negatives. The proportion of reported nonsignificant results showed an upward trend, as depicted in Figure \@ref(fig:tgtbf-fig2), from approximately 20% in the eighties to approximately 30% of all reported APA results in 2015.
```{r tgtbf-fig2, fig.cap="Observed proportion of nonsignificant test results per year.", echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig2.pdf.svg.png', auto_pdf = TRUE)
```
#### Expected effect size distribution.
For the entire set of nonsignificant results across journals, Figure \@ref(fig:tgtbf-fig3) indicates that there is substantial evidence of false negatives. Under $H_0$, 46% of all observed effects are expected to be within the range $0\leq|\eta|<.1$, as can be seen in the left panel of Figure \@ref(fig:tgtbf-fig3) highlighted by the lowest grey line (dashed). However, of the observed effects, only 26% fall within this range, as highlighted by the lowest black line. Similarly, we would expect 85% of all effect sizes to be within the range $0\leq|\eta|<.25$ (middle grey line), but we observed 14 percentage points less in this range (i.e., 71%; middle black line); 96% is expected for the range $0\leq|\eta|<.4$ (top grey line), but we observed 4 percentage points less (i.e., 92%; top black line). These differences indicate that the nonsignificant effects reported in papers are larger than expected under a null effect. This indicates the presence of false negatives, which is confirmed by the Kolmogorov-Smirnov test, $D=0.3$, $p<.000000000000001$. Results were similar when the nonsignificant effects were considered separately for the eight journals, although deviations were smaller for the Journal of Applied Psychology (see [https://osf.io/au3wv/](https://osf.io/au3wv/) for results per journal).
```{r tgtbf-fig3, fig.cap="Observed and expected (adjusted and unadjusted) effect size distribution for statistically nonsignificant APA results reported in eight psychology journals. Grey lines depict expected values; black lines depict observed values. The three vertical dotted lines correspond to a small, medium, large effect, respectively. Header includes Kolmogorov-Smirnov test results.", echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig3.pdf.svg.png', auto_pdf = TRUE)
```
Because effect sizes and their distribution typically overestimate population effect size $\eta^2$, particularly when sample size is small [@doi:10.1027/1614-2241.3.1.35;@doi:10.3102/10769986006002107], we also compared the observed and expected adjusted nonsignificant effect sizes that correct for such overestimation of effect sizes (right panel of Figure \@ref(fig:tgtbf-fig3); see Appendix A). Such overestimation affects all effects in a model, both focal and non-focal. The distribution of adjusted effect sizes of nonsignificant results tells the same story as the unadjusted effect sizes: observed effect sizes are larger than expected effect sizes. For instance, the distribution of adjusted reported effect sizes suggests 49% of effect sizes are at least small, whereas under the $H_0$ only 22% is expected.
#### Evidence of false negatives in articles.
The Fisher test was applied to the nonsignificant test results of each of the 14,765 papers separately, to inspect for evidence of false negatives. More technically, we inspected whether $p$-values within a paper deviate from what can be expected under the $H_0$ (i.e., uniformity). If $H_0$ is in fact true for all results, we would still expect the Fisher test to indicate evidence for false negatives in 10\% of the papers (a meta-false positive), because the Fisher test was evaluated at a significance level of .10. Table \@ref(tab:tgtbf-tab4) shows the number of papers with evidence for false negatives, specified per journal and per $k$ number of nonsignificant test results. The first row indicates the number of papers that report no nonsignificant results. When $k=1$, the Fisher test is simply another way of testing whether the result deviates from a null effect, conditional on the result being statistically nonsignificant. Overall results (last row) indicate that 47.1% of all articles show evidence of false negatives (i.e., 6,951 articles). Of articles reporting at least one nonsignificant result, 66.7% show evidence of false negatives, which is much more than the 10% predicted by chance alone. Results did not substantially differ if nonsignificance is determined based on $\alpha=.10$ (the analyses can be rerun with any set of $p$-values larger than a certain value based on the code provided on OSF; <https://osf.io/qpfnw>).
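The chance baseline of 10% can be illustrated with a small simulation. The sketch below assumes the transformation $p^*=(p-\alpha)/(1-\alpha)$ for Equation \@ref(eq:pistar) and uses hypothetical papers with 1 to 20 nonsignificant results each; `fisher_p`, `k_per_paper`, and `p_papers` are illustrative names.

```r
# Sketch: rejection rate of the Fisher test when H0 is true for every result.
set.seed(42)
alpha   <- .05  # threshold for individual test results
alpha_f <- .10  # threshold for the Fisher test

# Fisher test p-value for the nonsignificant p-values of one paper
fisher_p <- function(p) {
  p_star <- (p - alpha) / (1 - alpha)
  pchisq(-2 * sum(log(p_star)), df = 2 * length(p), lower.tail = FALSE)
}

# 10,000 hypothetical papers; under H0 nonsignificant p-values are uniform
k_per_paper <- sample(1:20, 10000, replace = TRUE)
p_papers <- vapply(k_per_paper,
                   function(k) fisher_p(runif(k, min = alpha, max = 1)),
                   numeric(1))

# Proportion of papers flagged by chance alone (close to alpha_f)
rate_h0 <- mean(p_papers < alpha_f)
rate_h0

# Share of all articles with evidence of false negatives, in percent
share_all <- round(6951 / 14765 * 100, 1)
share_all
```

The simulated rate stays near .10 regardless of how many nonsignificant results a paper reports, which is why the observed 66.7% is informative.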
```{r tgtbf-tab4, echo = FALSE}
capt <- "Summary table of Fisher test results applied to the nonsignificant results ($k$) of each article separately, overall and specified per journal. A significant Fisher test result is indicative of a false negative (FN). DP = Developmental Psychology; FP = Frontiers in Psychology; JAP = Journal of Applied Psychology; JCCP = Journal of Consulting and Clinical Psychology; JEPG = Journal of Experimental Psychology: General; JPSP = Journal of Personality and Social Psychology; PLOS = Public Library of Science; PS = Psychological Science."
if (knitr::is_html_output()) {
  fil <- 'assets/tables/ch5-tab4-html.csv'
  tabdf <- read.csv(fil, stringsAsFactors = FALSE)
  names(tabdf)[1:2] <- ''
  knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = TRUE) %>%
    kableExtra::scroll_box(width = "100%") %>%
    # full_width is an argument of kable_styling(), not a bootstrap option
    kableExtra::kable_styling(position = 'center',
                              bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                              full_width = FALSE)
} else {
  fil <- 'assets/tables/ch5-tab4-latex.csv'
  tabdf <- read.csv(fil, stringsAsFactors = FALSE)
  names(tabdf)[1:2] <- ''
  knitr::kable(tabdf, caption = capt, format = 'latex', booktabs = TRUE, escape = FALSE) %>%
    kableExtra::kable_styling(latex_options = c('striped', 'hold_position', 'scale_down'), position = 'center')
}
```
Table \@ref(tab:tgtbf-tab4) also shows evidence of false negatives for each of the eight journals. The lowest proportion of articles with evidence of at least one false negative was for the Journal of Applied Psychology (49.4\%; penultimate row). The remaining journals show higher proportions, with a maximum of 81.3\% (Journal of Personality and Social Psychology). Researchers should thus be wary of interpreting negative results in journal articles as a sign that there is no effect; at least half of the papers provide evidence for at least one false negative finding.
As would be expected, we found a higher proportion of articles with evidence of at least one false negative for higher numbers of statistically nonsignificant results ($k$; see Table \@ref(tab:tgtbf-tab4)). For instance, 84\% of all papers that report more than 20 nonsignificant results show evidence for false negatives, whereas 57.7\% of all papers with only 1 nonsignificant result show evidence for false negatives. Consequently, we observe that journals with articles containing a higher number of nonsignificant results, such as JPSP, have a higher proportion of articles with evidence of false negatives. This is the result of higher power of the Fisher method when there are more nonsignificant results and does not necessarily reflect that a nonsignificant $p$-value in e.g. JPSP has a higher probability of being a false negative than one in another journal.
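To illustrate why power increases with $k$, the sketch below simulates nonsignificant results under a small true effect. It is not the simulation study referred to in this chapter: the true correlation, the sample size, and the normal approximation via the Fisher $z$-transformation are all assumptions, and `draw_nonsig_p` and `fisher_p` are hypothetical helpers.

```r
# Sketch: power of the Fisher test for k = 1, 5, and 20 nonsignificant results.
set.seed(7)
alpha   <- .05
alpha_f <- .10
rho     <- .1   # assumed true correlation
n       <- 50   # assumed sample size per result

# Draw k nonsignificant two-sided p-values given the true effect
# (normal approximation: z ~ N(atanh(rho) * sqrt(n - 3), 1))
draw_nonsig_p <- function(k) {
  out <- numeric(0)
  while (length(out) < k) {
    z <- rnorm(2 * k, mean = atanh(rho) * sqrt(n - 3))
    p <- 2 * pnorm(-abs(z))
    out <- c(out, p[p > alpha])  # discard significant results
  }
  out[seq_len(k)]
}

fisher_p <- function(p) {
  p_star <- (p - alpha) / (1 - alpha)
  pchisq(-2 * sum(log(p_star)), df = 2 * length(p), lower.tail = FALSE)
}

# Proportion of simulated papers with a significant Fisher test, per k
k_values <- c(1, 5, 20)
power_k <- sapply(k_values, function(k) {
  mean(replicate(2000, fisher_p(draw_nonsig_p(k))) < alpha_f)
})
names(power_k) <- paste0("k", k_values)
power_k
```

Under these assumptions the same true effect is detected more often when a paper contains more nonsignificant results, in line with the pattern across rows of Table \@ref(tab:tgtbf-tab4).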
```{r echo = FALSE}
load('assets/data/fishRes')
load('assets/data/fishDF')
alphaF <- 0.10
# Compute the number of papers and the proportion of significant Fisher tests per journal
final <- NULL
kLen <- c(1, 2, 3, 4, 5, 10, 20)
for(journals in sort(unique(dat$journals.jour.))){
  sel <- fishDF$journal == journals
  # Number of papers in a journal
  amount <- length(fishDF$FisherP[sel])
  # Number of papers with a significant Fisher test result
  amountSig <- sum(fishDF$FisherP[sel & !is.na(fishDF$FisherP)] < alphaF)
  # Number of papers without a Fisher test result (i.e., no nonsignificant results)
  countNA <- sum(is.na(fishDF$FisherP[sel]))
  journalSet <- NULL
  # Proportion of significant Fisher tests per number of nonsignificant results (k)
  for(k in 1:length(kLen)){
    if(k == 7){
      # Last category: kRes of at least kLen[7]
      x <- sum(fishDF$FisherP[sel & !is.na(fishDF$FisherP) & fishDF$kRes >= kLen[k]] < alphaF) / length(fishDF$FisherP[sel & fishDF$kRes >= kLen[k]])
      journalSet <- cbind(journalSet, x)
    } else if(k == 5 | k == 6){
      # Binned categories: kLen[k] <= kRes < kLen[k+1]
      x <- sum(fishDF$FisherP[sel & !is.na(fishDF$FisherP) & fishDF$kRes >= kLen[k] & fishDF$kRes < kLen[k+1]] < alphaF) / length(fishDF$FisherP[sel & fishDF$kRes >= kLen[k] & fishDF$kRes < kLen[k+1]])
      journalSet <- cbind(journalSet, x)
    } else{
      # Exact categories: kRes equal to 1, 2, 3, or 4
      x <- sum(fishDF$FisherP[sel & !is.na(fishDF$FisherP) & fishDF$kRes == kLen[k]] < alphaF) / length(fishDF$FisherP[sel & fishDF$kRes == kLen[k]])
      journalSet <- cbind(journalSet, x)
    }
  }
  temp <- cbind(journals,
                journalSet,
                amountSig / amount,
                countNA,
                amountSig,
                amount)
  # This is the result that goes into the table
  final <- rbind(final, temp)
}
temp <- "Overall"
for(i in 1:length(kLen)){
  if(i == 7){
    temp[i+1] <- sum(fishDF$FisherP[fishDF$kRes >= kLen[i] & !is.na(fishDF$FisherP)] < alphaF) / length(fishDF$FisherP[fishDF$kRes >= kLen[i]])
  } else if(i == 5 | i == 6){
    temp[i+1] <- sum(fishDF$FisherP[fishDF$kRes >= kLen[i] & fishDF$kRes < kLen[i+1] & !is.na(fishDF$FisherP)] < alphaF) / length(fishDF$FisherP[fishDF$kRes >= kLen[i] & fishDF$kRes < kLen[i+1]])
  } else{
    temp[i+1] <- sum(fishDF$FisherP[fishDF$kRes == kLen[i] & !is.na(fishDF$FisherP)] < alphaF) / length(fishDF$FisherP[fishDF$kRes == kLen[i]])
  }
}
temp <- c(temp,
sum(as.numeric(as.character(final[,dim(final)[2]-1])))/sum(as.numeric(as.character(final[,dim(final)[2]]))),
sum(as.numeric(as.character(final[,dim(final)[2]-2]))),
sum(as.numeric(as.character(final[,dim(final)[2]-1]))),
sum(as.numeric(as.character(final[,dim(final)[2]]))))
final <- rbind(as.character(temp), final)
final <- as.data.frame(final)
names(final) <- c('journals', paste0('k', kLen), 'overall', 'countNA', 'amountSig', 'nrpapers')
# write.csv(final, 'table4.csv', row.names=F)
# Overall
# table(fishDF$kRes)
# # Per journal
# table(fishDF$kRes[fishDF$journal=="DP"])
# table(fishDF$kRes[fishDF$journal=="FP"])
# table(fishDF$kRes[fishDF$journal=="JAP"])
# table(fishDF$kRes[fishDF$journal=="JCCP"])
# table(fishDF$kRes[fishDF$journal=="JEPG"])
# table(fishDF$kRes[fishDF$journal=="JPSP"])
# table(fishDF$kRes[fishDF$journal=="PLOS"])
# table(fishDF$kRes[fishDF$journal=="PS"])
# Median per journal
x = as.numeric(as.character(final$amountSig)) / (as.numeric(as.character(final$nrpapers)) - as.numeric(as.character(final$countNA)))
y = c(median(fishDF$kRes[fishDF$journal=="DP"]),
median(fishDF$kRes[fishDF$journal=="FP"]),
median(fishDF$kRes[fishDF$journal=="JAP"]),
median(fishDF$kRes[fishDF$journal=="JCCP"]),
median(fishDF$kRes[fishDF$journal=="JEPG"]),
median(fishDF$kRes[fishDF$journal=="JPSP"]),
median(fishDF$kRes[fishDF$journal=="PLOS"]),
median(fishDF$kRes[fishDF$journal=="PS"]))
# cor(x[-1], y)
# Computing the number of significant Fisher results per year
# As proportion of all papers reporting nonsignificant results
fishDF$logicalP <- ifelse(fishDF$FisherP < alphaF, 1, 0)
fisherYear <- ddply(fishDF, .(year), summarise, propYear=mean(logicalP, na.rm=TRUE))
knsYear <- ddply(fishDF, .(year), summarise, kYear=mean(kRes, na.rm=TRUE))
pmeanYear <- ddply(dat[dat$Reported.Comparison == '=' & dat$Reported.P.Value > .05, ],
.(years.y.), summarise, pYear=mean(Reported.P.Value, na.rm=TRUE))
esYear <- ddply(dat[dat$Reported.Comparison == '=' & dat$Reported.P.Value > .05, ],
.(years.y.), summarise, esYear=mean(esComp, na.rm=TRUE))
mydf <- data.frame(x = fisherYear$year,
y = fisherYear$propYear,
count = knsYear$kYear)
# mydf
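# Sample size development over time (degrees of freedom as proxy)
medianN <- NULL
p25 <- NULL
p75 <- NULL
i <- 1
# Loop variable named yr so that it does not overwrite y defined above
for(yr in 1985:2013){
  temp <- summary(dat$df2[dat$years.y. == yr])
  medianN[i] <- temp[3]  # median
  p25[i] <- temp[2]      # first quartile
  p75[i] <- temp[5]      # third quartile
  i <- i + 1
}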
```
We also checked whether evidence of at least one false negative at the article level changed over time. Figure \@ref(fig:tgtbf-fig4) depicts the proportion of articles with evidence of false negatives per year (1985-2013); point size in the figure corresponds to the mean number of nonsignificant results per article (mean $k$) in that year. Interestingly, the proportion of articles with evidence for false negatives decreased from `r round(mydf$y[1]*100, 0)`\% in 1985 to `r round(mydf$y[29]*100, 0)`\% in 2013, despite the increase in mean $k$ (from `r round(mydf$count[1], 2)` in 1985 to `r round(mydf$count[29], 2)` in 2013). This decreasing proportion of papers with evidence over time cannot be explained by a decrease in sample size over time, as sample size in psychology articles has stayed stable across time (see Figure \@ref(fig:tgtbf-fig5); degrees of freedom are a direct proxy of sample size, as they equal the sample size minus the number of parameters in the model). One (at least partial) explanation of this surprising result is that in the early days researchers reported fewer APA results overall and relatively more APA results with 'marginally significant' $p$-values (i.e., $p$-values slightly larger than .05), compared to nowadays. This explanation is supported by both a smaller number of reported APA results in the past and the smaller mean reported nonsignificant $p$-value (`r round(mean(dat$Reported.P.Value[dat$Reported.Comparison == '=' & dat$years.y. == 1985 & dat$Reported.P.Value > .05]), 3)` in 1985, `r round(mean(dat$Reported.P.Value[dat$Reported.Comparison == '=' & dat$years.y. == 2013 & dat$Reported.P.Value > .05]), 3)` in 2013). We do not know whether these marginally significant $p$-values were interpreted as evidence in favor of a finding (or not) and how these interpretations changed over time.
Another potential explanation is that the effect sizes being studied have become smaller over time (mean correlation effect $r=$ `r round(mean(sqrt(dat$esComp[dat$Reported.Comparison == '=' & dat$years.y. == 1985 & dat$Reported.P.Value > .05])), 3)` in 1985, `r round(mean(sqrt(dat$esComp[dat$Reported.Comparison == '=' & dat$years.y. == 2013 & dat$Reported.P.Value > .05])), 3)` in 2013), which results in both higher $p$-values over time and lower power of the Fisher test. Using the data at hand, we cannot distinguish between the two explanations.
```{r tgtbf-fig4, fig.cap="Proportion of papers reporting nonsignificant results in a given year, showing evidence for false negative results. Larger point size indicates a higher mean number of nonsignificant results reported in that year.", echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig4.pdf.svg.png')
```
```{r tgtbf-fig5, fig.cap="Sample size development in psychology throughout 1985-2013, based on degrees of freedom across 258,050 test results. P25 = 25th percentile. P50 = 50th percentile (i.e., median). P75 = 75th percentile.", echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig5.pdf.svg.png')
```
### Discussion
The result that 2 out of 3 papers containing nonsignificant results show evidence of at least one false negative, when applying the NHST framework, empirically verifies previously voiced concerns about insufficient attention for false negatives [@doi:10.1177/1745691612462587]. The Fisher test proved a powerful test to inspect for false negatives in our simulation study, where three nonsignificant results already result in high power to detect evidence of a false negative if sample size is at least 33 per result and the population effect is medium. Journals differed in the proportion of papers that showed evidence of false negatives, but this was largely due to differences in the number of nonsignificant results reported in these papers. More generally, we observed that more nonsignificant results were reported in 2013 than in 1985.
The repeated concern about power and false negatives throughout the last decades seems not to have trickled down into substantial change in psychology research practice. @doi:10.1037/h0045186 and @doi:10.1037/0033-2909.105.2.309 already voiced concern decades ago and showed that power in psychology was low. @doi:10.1177/1745691612462587 contended that false negatives are harder to detect in the current scientific system and therefore warrant more concern. Despite recommendations of increasing power by increasing sample size, we found no evidence for increased sample size (see Figure \@ref(fig:tgtbf-fig5)). To the contrary, the data indicate that average sample sizes have been remarkably stable since 1985, despite the improved ease of collecting participants with data collection tools such as online services.
However, what has changed is the amount of nonsignificant results reported in the literature. Our data show that more nonsignificant results are reported throughout the years (see Figure \@ref(fig:tgtbf-fig2)), which seems contrary to findings that indicate that relatively more significant results are being reported [@doi:10.1007/s11192-011-0494-7;@doi:10.2307/2684823;@doi:10.2307/2282137;@doi:10.7717/peerj.733]. It would seem the field is not shying away from publishing negative results per se, as proposed before [@doi:10.1007/s11192-011-0494-7;@doi:10.1037/h0076157;@doi:10.1177/1745691612459058;@doi:10.1037/0033-2909.86.3.638;@doi:10.1037/a0029487], but whether this is also the case for results relating to hypotheses of explicit interest in a study and not all results reported in a paper, requires further research. Other research strongly suggests that most reported results relating to hypotheses of explicit interest are statistically significant [@doi:10.1126/science.aac4716].
## Application 2: Evidence of false negative gender effects in eight major psychology journals
In order to illustrate the practical value of the Fisher test to test for evidential value of (non)significant $p$-values with the NHST framework, we investigated gender related effects in a random subsample of our database. Gender effects are particularly interesting because gender is typically a control variable and not the primary focus of studies. Hence, we expect little $p$-hacking and substantial evidence of false negatives in reported gender effects in psychology. We apply the Fisher test to significant and nonsignificant gender results to test for evidential value [@doi:10.1037/met0000025;@doi:10.1037/a0033242]. More precisely, we investigate whether evidential value depends on whether or not the result is statistically significant, and whether or not the results were in line with expectations expressed in the paper.
### Method
We planned to test for evidential value in six categories (expectation [3 levels] $\times$ significance [2 levels]). Expectations were specified as '$H_1$ expected', '$H_0$ expected', or 'no expectation'. Prior to data collection, we assessed the required sample size for the Fisher test based on research on the gender similarities hypothesis [@doi:10.1037/0003-066x.60.6.581]. We calculated that the required number of statistical results for the Fisher test, given $r=.11$ [@doi:10.1037/0003-066x.60.6.581] and 80\% power, is 15 $p$-values per condition, requiring 90 results in total. However, the six categories are unlikely to occur equally throughout the literature, hence we sampled 90 significant and 90 nonsignificant results pertaining to gender, with an expected cell size of 30 if results are equally distributed across the six cells of our design. Significance was coded based on the reported $p$-value, where $\leq.05$ was used as the decision criterion to determine significance [@doi:10.3758/s13428-015-0664-2].
We sampled the 180 gender results from our database of over 250,000 test results in four steps. First, we automatically searched for "gender", "sex", "female" AND "male", " man" AND " woman" [sic], or " men" AND " women" [sic] in the $100$ characters before the statistical result and $100$ after the statistical result (i.e., range of $200$ characters surrounding the result), which yielded 27,523 results. Second, the first author inspected $500$ characters before and after the first result of a randomly ordered list of all 27,523 results and coded whether it indeed pertained to gender. This was done until 180 results pertaining to gender were retrieved from 180 different articles. Third, these results were independently coded by all authors with respect to the expectations of the original researcher(s) (coding scheme available at [https://osf.io/9ev63](https://osf.io/9ev63)). The coding included checks for qualifiers pertaining to the expectation of the statistical result (confirmed/theorized/hypothesized/expected/etc.). If researchers reported such a qualifier, we assumed they correctly represented these expectations with respect to the statistical significance of the result. For example, if the text stated "as expected no evidence for an effect was found, $t(12)=1, p=.337$" we assumed the authors expected a nonsignificant result. Fourth, discrepant codings were resolved by discussion (25 cases [13.9\%]; two cases remained unresolved and were dropped). 178 valid results remained for analysis.
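The keyword screen of the first step can be sketched as below. The function name `is_gender_context` and the example strings are hypothetical, and case-sensitive matching of fixed strings is an assumption; the original search may have been implemented differently.

```r
# Sketch of the keyword screen; `context` stands for the 200 characters
# surrounding each statistical result (hypothetical input).
is_gender_context <- function(context) {
  has <- function(pattern) grepl(pattern, context, fixed = TRUE)
  has("gender") | has("sex") |
    (has("female") & has("male")) |
    (has(" man") & has(" woman")) |
    (has(" men") & has(" women"))
}

context <- c("no gender difference was found, t(58) = 0.41, p = .68",
             "the main effect of load was significant, F(1, 30) = 5.2",
             "for men and women alike, r(98) = .05, p = .62")
flagged <- is_gender_context(context)
flagged
```

Such a screen only flags candidates; whether a flagged result indeed pertains to gender still requires the manual coding of the second step.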
Prior to analyzing these 178 $p$-values for evidential value with the Fisher test, we transformed them to variables ranging from 0 to 1. Statistically nonsignificant results were transformed with Equation \@ref(eq:pistar); statistically significant $p$-values were divided by $\alpha=.05$ [@doi:10.1037/met0000025;@doi:10.1037/a0033242].
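A minimal sketch of both transformations follows, assuming that Equation \@ref(eq:pistar) has the form $p^*=(p-\alpha)/(1-\alpha)$ and using $p\leq.05$ as the criterion for significance. The function name `transform_p` and the example values are hypothetical.

```r
# Rescale p-values to the unit interval, separately for significant and
# nonsignificant results (sketch; assumed form of the pistar equation).
transform_p <- function(p, alpha = .05) {
  ifelse(p > alpha,
         (p - alpha) / (1 - alpha),  # nonsignificant results
         p / alpha)                  # significant results
}

p_transformed <- transform_p(c(.01, .05, .525, 1))
p_transformed
```

Under $H_0$ both transformed sets are uniformly distributed between 0 and 1, so the same Fisher statistic can be applied to either set.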
### Results
The coding of the 178 results indicated that results rarely specify whether these are in line with the hypothesized effect (see Table \@ref(tab:tgtbf-tab5)). For the 178 results, only 15 clearly stated whether their results were as expected, whereas the remaining 163 did not. Illustrative of the lack of clarity in expectations is the following quote: "*As predicted, there was little gender difference [...] p < .06.*" There were two results that were presented as significant but contained $p$-values larger than .05; these two were dropped (i.e., 176 results were analyzed). As a result, the conditions significant-$H_0$ expected, nonsignificant-$H_0$ expected, and nonsignificant-$H_1$ expected contained too few results for meaningful investigation of evidential value (i.e., with sufficient statistical power).
```{r tgtbf-tab5, echo = FALSE}
capt <- "Number of gender results coded per condition in a 2 (significance: significant or nonsignificant) by 3 (expectation: $H_0$ expected, $H_1$ expected, or no expectation) design. Cells printed in bold had sufficient results to inspect for evidential value."
if (knitr::is_html_output()) {
  fil <- 'assets/tables/ch5-tab5-html.csv'
  tabdf <- read.csv(fil, stringsAsFactors = FALSE)
  names(tabdf) <- c('', '$H_0$ expected', '$H_1$ expected', 'No expectation')
  knitr::kable(tabdf, caption = capt, booktabs = TRUE, escape = FALSE) %>%
    # full_width is an argument of kable_styling(), not a bootstrap option
    kableExtra::kable_styling(position = 'center',
                              bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                              full_width = FALSE)
} else {
  fil <- 'assets/tables/ch5-tab5-latex.csv'
  tabdf <- read.csv(fil, stringsAsFactors = FALSE)
  names(tabdf) <- c('', '$H_0$ expected', '$H_1$ expected', 'No expectation')
  knitr::kable(tabdf, caption = capt, format = 'latex', booktabs = TRUE, escape = FALSE) %>%
    kableExtra::kable_styling(latex_options = c('striped', 'hold_position'), position = 'center')
}
```
Figure \@ref(fig:tgtbf-fig6) presents the distributions of both transformed significant and nonsignificant $p$-values. For significant results, applying the Fisher test to the $p$-values showed evidential value for a gender effect both when an effect was expected ($\chi^2(22)=358.904$, $p<.001$) and when no expectation was presented at all ($\chi^2(15)=1094.911$, $p<.001$). Similarly, applying the Fisher test to nonsignificant gender results without stated expectation yielded evidence of at least one false negative ($\chi^2(174)=324.374$, $p<.001$). Unfortunately, we could not examine whether evidential value of gender effects is dependent on the hypothesis/expectation of the researcher, because these effects are most frequently reported without stated expectations.
```{r tgtbf-fig6, fig.cap="Probability density distributions of the $p$-values for gender effects, split for nonsignificant and significant results. A uniform density distribution indicates the absence of a true effect.", echo=FALSE, fig.align = 'center', out.width = '100%', fig.pos = 'h'}
knitr::include_graphics('assets/figures/tgtbf-fig6.pdf.svg.png', auto_pdf = TRUE)
```
### Discussion
We observed evidential value of gender effects both in the statistically significant (no expectation or $H_1$ expected) and nonsignificant results (no expectation) using the NHST framework. The data from the 178 results we investigated indicated that in only 15 cases the expectation of the test result was clearly explicated. This indicates that based on test results alone, it is very difficult to differentiate between results that relate to a priori hypotheses and results that are of an exploratory nature. The importance of being able to differentiate between confirmatory and exploratory results has been previously demonstrated [@doi:10.1177/1745691612463078] and has been incorporated into the Transparency and Openness Promotion guidelines [TOP; @doi:10.1126/science.aab2374] with explicit attention paid to preregistration.
## Application 3: Reproducibility Project Psychology
Out of the 100 replicated studies in the RPP, 64 did not yield a statistically significant effect size, despite the fact that high replication power was one of the aims of the project [@doi:10.1126/science.aac4716]. Regardless, the authors suggested "*...that at least one replication could be a false negative*" (p. aac4716-4). Here we estimate how many of these nonsignificant replications might be false negatives, by applying the Fisher test to these nonsignificant effects.
### Method
Of the 64 nonsignificant studies in the RPP data ([https://osf.io/fgjvw](https://osf.io/fgjvw)), we selected the 63 nonsignificant studies with a test statistic. We eliminated one result because it was a regression coefficient that could not be used in the following procedure. We first applied the Fisher test to the nonsignificant results, after transforming them to variables ranging from 0 to 1 using equations \@ref(eq:pistar) and \@ref(eq:fishertest). Denote the value of this Fisher test by $Y$; note that under the $H_0$ of no evidential value $Y$ is $\chi^2$-distributed with 126 degrees of freedom.
Subsequently, we hypothesized that $X$ out of these 63 nonsignificant results had a weak, medium, or strong population effect size [i.e., $\rho=.1$, $.3$, $.5$, respectively; @isbn:9780805802832] and the remaining $63-X$ had a zero population effect size. For each of these hypotheses, we generated 10,000 data sets (see next paragraph for details) and used them to approximate the distribution of the Fisher test statistic (i.e., $Y$). Using this distribution, we computed the probability that a simulated Fisher statistic exceeds the observed value of $Y$, further denoted by $p_Y$. We then used the inversion method [@isbn:9780534243128] to compute confidence intervals of $X$, the number of nonzero effects. Specifically, the confidence interval for $X$ is ($X_{LB};X_{UB}$), where $X_{LB}$ is the value of $X$ for which $p_Y$ is closest to $.025$ and $X_{UB}$ is the value of $X$ for which $p_Y$ is closest to $.975$. We computed three confidence intervals of $X$: one each for the number of weak, medium, and strong effects.
We computed $p_Y$ for a combination of a value of $X$ and a true effect size using 10,000 randomly generated data sets, in three steps. For each data set we:
+ Randomly selected $X$ out of the 63 effects to be generated by true nonzero effects, with the remaining $63-X$ generated by true zero effects;
+ Randomly generated $p$-values given the degrees of freedom of the effects, using central distributions for the $63-X$ zero effects and noncentral distributions for the $X$ nonzero effects selected in step 1;
+ Computed the Fisher statistic $Y$ by applying Equation \@ref(eq:fishertest) to the transformed $p$-values (see Equation \@ref(eq:pistar)) of step 2.
Probability $p_Y$ equals the proportion of 10,000 data sets with $Y$ exceeding the value of the Fisher statistic applied to the RPP data. See [osf.io/egnh9](https://osf.io/egnh9) for the analysis script to compute the confidence intervals of $X$.
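The three steps and the inversion method can be sketched as follows; the analysis script on OSF remains the authoritative implementation. The sketch makes several assumptions: all effects are treated as $t$-tests, the degrees of freedom are hypothetical, a correlation $\rho$ is mapped to a noncentrality parameter $\rho/\sqrt{1-\rho^2}\times\sqrt{df+2}$, $p$-values are drawn conditional on being nonsignificant, and only 500 data sets and a coarse grid of $X$ are used to keep the example fast.

```r
# Sketch of the p_Y simulation; df values and the effect-to-ncp mapping are
# assumptions, not the RPP data or the original script.
set.seed(2016)
alpha <- .05
k     <- 63
df    <- sample(20:150, k, replace = TRUE)  # hypothetical degrees of freedom
y_obs <- 155.2382                           # Fisher statistic of the RPP data (see Results)

# Nonsignificant two-sided p-values from (non)central t-distributions, drawn
# by inverting the CDF within the nonsignificant region
draw_nonsig_p <- function(df, ncp) {
  t_crit <- qt(1 - alpha / 2, df)
  u <- runif(length(df), pt(-t_crit, df, ncp), pt(t_crit, df, ncp))
  2 * pt(-abs(qt(u, df, ncp)), df)
}

# Fisher statistic on the rescaled p-values (guarded against rounding error)
fisher_y <- function(p) {
  p_star <- pmax((p - alpha) / (1 - alpha), .Machine$double.eps)
  -2 * sum(log(p_star))
}

simulate_p_y <- function(x, rho, n_sets = 500) {
  ncp_nonzero <- rho / sqrt(1 - rho^2) * sqrt(df + 2)        # assumed mapping
  y_sim <- replicate(n_sets, {
    nonzero <- seq_len(k) %in% sample(k, x)                  # step 1
    p <- draw_nonsig_p(df, ifelse(nonzero, ncp_nonzero, 0))  # step 2
    fisher_y(p)                                              # step 3
  })
  mean(y_sim > y_obs)  # proportion of data sets exceeding the observed Y
}

# Inversion method on a coarse grid of X (the full grid would be 0:63)
x_grid <- c(0, 32, 63)
p_y <- sapply(x_grid, simulate_p_y, rho = .1)
ci_x <- c(lower = x_grid[which.min(abs(p_y - .025))],
          upper = x_grid[which.min(abs(p_y - .975))])
p_y
ci_x
```

With $X=0$ the simulated $p_Y$ approximates the $\chi^2$-based $p$-value of the Fisher test, because the transformed $p$-values are uniform under $H_0$ irrespective of the degrees of freedom.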
### Results
Upon reanalysis of the 63 statistically nonsignificant replications within RPP we determined that many of these "failed" replications say hardly anything about whether there are truly no effects when using the adapted Fisher method. The Fisher test of these 63 nonsignificant results indicated some evidence for the presence of at least one false negative finding ($\chi^2(126)=155.2382$, $p=.039$). Assuming $X$ small nonzero true effects among the nonsignificant results yields a confidence interval of 0-63 (0-100\%). More specifically, if all results are in fact true negatives then $p_Y=.039$, whereas if all true effects are $\rho=.1$ then $p_Y=.872$. Hence, the 63 statistically nonsignificant results of the RPP are in line with any number of true small effects, from none to all. Consequently, we cannot draw firm conclusions about the state of the field of psychology concerning the frequency of false negatives using the RPP results and the Fisher test, when all true effects are small. Assuming $X$ medium or strong true effects underlying the nonsignificant results from RPP yields confidence intervals 0-21 (0-33.3\%) and 0-13 (0-20.6\%), respectively. In other words, the 63 statistically nonsignificant RPP results are also in line with some true effects actually being medium or even large.
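The reported Fisher test $p$-value and the percentages of the confidence bounds can be checked from the numbers above. The object names are illustrative only.

```r
# Check of the reported Fisher test result and confidence bound percentages
y_obs <- 155.2382
k     <- 63

# Upper-tail probability of the chi-square distribution with 2k = 126 df;
# should be close to the reported p = .039
p_fisher <- pchisq(y_obs, df = 2 * k, lower.tail = FALSE)
round(p_fisher, 3)

# Upper bounds for medium and strong effects as a percentage of the 63 results
upper_pct <- round(c(medium = 21, strong = 13) / k * 100, 1)
upper_pct
```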
### Discussion
The reanalysis of the nonsignificant RPP results using the Fisher method and the NHST framework demonstrates that any conclusions on the validity of individual effects based on "failed" replications, as determined by statistical significance, are unwarranted. This was also noted by both the original RPP team [@doi:10.1126/science.aac4716;@doi:10.1126/science.aad9163] and in a critique of the RPP [@doi:10.1126/science.aad7243]. Replication efforts such as the RPP or the Many Labs project remove publication bias and result in a less biased assessment of the true effect size. Nonetheless, single replications should not be seen as the definitive result, considering that these results indicate there remains much uncertainty about whether a nonsignificant result is a true negative or a false negative. The explanation of this finding is that most of the RPP replications, although often statistically more powerful than the original studies, still did not have enough statistical power to distinguish a true small effect from a true zero effect [@doi:10.1037/a0039400]. Interpreting results of replications should therefore also take the precision of the estimate of both the original and replication into account [@doi:10.1177/0956797613504966] and publication bias of the original studies [@doi:10.1371/journal.pone.0149794].
Very recently four statistical papers have reanalyzed the RPP results to either estimate the frequency of studies testing true zero hypotheses or to estimate the individual effects examined in the original and replication study. All four papers account for the possibility of publication bias in the original study. @doi:10.1080/01621459.2016.1240079 estimated a Bayesian statistical model including a distribution of effect sizes among studies for which the null hypothesis is false. On the basis of their analyses they conclude that at least 90\% of psychology experiments tested negligible true effects. Johnson et al.'s model as well as our Fisher's test are not useful for estimation and testing of individual effects examined in original and replication study. Interpreting results of individual effects should take the precision of the estimate of both the original and replication into account [@doi:10.1177/0956797613504966]. @doi:10.1371/journal.pone.0149794 reanalyzed the RPP at the level of individual effects, using Bayesian models incorporating publication bias. They concluded that 64\% of individual studies did not provide strong evidence for either the null or the alternative hypothesis in either the original of the replication study. This agrees with our interpretaion of the RPP findings and that of @doi:10.1037/a0039400. As opposed to @doi:10.1371/journal.pone.0149794, @doi:10.3758/s13428-017-0967-6 use a statistically significant original and a replication study to evaluate the common true underlying effect size, adjusting for publication bias. From their Bayesian analysis [@doi:10.1371/journal.pone.0175302] assuming equally likely zero, small, medium, large true effects, they conclude that only 13.4\% of individual effects contain substantial evidence (Bayes factor > 3) of a true zero effect. For a staggering 62.7\% of individual effects no substantial evidence in favor of zero, small, medium, or large true effect size was obtained. 
All in all, conclusions of our analyses using the Fisher test are in line with other statistical papers reanalyzing the RPP data [with the exception of @doi:10.1080/01621459.2016.1240079] suggesting that studies in psychology are typically not powerful enough to distinguish zero from nonzero true findings.
## General Discussion
Much attention has been paid to false positive results in recent years. Our study demonstrates the importance of paying attention to false negatives alongside false positives. We examined evidence for false negatives in nonsignificant results in three different ways, using the NHST framework. Specifically, we adapted the Fisher method to detect the presence of at least one false negative in a set of statistically nonsignificant results. Simulations indicated the adapted Fisher test to be a powerful method for that purpose. The three applications indicated that (i) approximately two out of three psychology articles reporting nonsignificant results contain evidence for at least one false negative, (ii) nonsignificant results on gender effects contain evidence of true nonzero effects, and (iii) the statistically nonsignificant replications from the Reproducibility Project Psychology (RPP) do not warrant strong conclusions about the absence or presence of true zero effects underlying these nonsignificant results (RPP does yield less biased estimates of the effect; the original studies severely overestimated the effects of interest).
The methods used in the three different applications provide crucial context to interpret the results. In applications 1 and 2, we did not differentiate between main and peripheral results. Hence, the interpretation of a significant Fisher test result pertains to the evidence of at least one false negative in all reported results, not the evidence for at least one false negative in the main results. Nonetheless, even when we focused only on the main results in application 3, the Fisher test does not indicate specifically which result is a false negative; rather, it only provides evidence for a false negative in a set of results. As such, the Fisher test is primarily useful to test a set of potentially underpowered results in a more powerful manner, albeit that the result then applies to the complete set. Additionally, in applications 1 and 2 we focused on results reported in eight psychology journals; extrapolating the results to other journals might not be warranted given that there might be substantial differences in the type of results reported in other journals or fields.
More generally, our results in these three applications confirm that the problem of false negatives in psychology remains pervasive. Previous concern about power [@doi:10.1037/h0045186;@doi:10.1037/0033-2909.105.2.309;@doi:10.1177/1745691612459060;@doi:10.2466/03.11.pms.112.2.331-348], which was even addressed by an APA Statistical Task Force in 1999 that recommended increased statistical power [@doi:10.1037/0003-066x.54.8.594], seems not to have resulted in actual change [@doi:10.2466/03.11.pms.112.2.331-348]. Potential explanations for this lack of change are that researchers overestimate statistical power when designing a study for small effects [@doi:10.1177/0956797616647519], use $p$-hacking to artificially increase statistical power, and can act strategically by running multiple underpowered studies rather than one large powerful study [@doi:10.1177/1745691612459060]. The effects of $p$-hacking are likely to be the most pervasive, with many people admitting to using such behaviors at some point [@doi:10.1177/0956797611430953] and publication bias pushing researchers to find statistically significant results. As such, the problems of false positives, publication bias, and false negatives are intertwined and mutually reinforcing.
Reducing the emphasis on binary decisions in individual studies and increasing the emphasis on the precision of a study might help reduce the problem of decision errors [@doi:10.1177/0956797613504966]. For example, a large but statistically nonsignificant study might yield a confidence interval (CI) of the effect size of [-0.01; 0.05], whereas a small but significant study might yield a CI of [0.01; 1.30]. In a purely binary decision mode, the small but significant study would result in the conclusion that there is an effect because it provided a statistically significant result, despite it containing much more uncertainty than the larger study about the underlying true effect size. In a precision mode, the large study provides a more certain estimate and therefore is deemed more informative and provides the best estimate. Using meta-analyses to combine estimates obtained in studies on the same effect may further increase the overall estimate's precision. Although the emphasis on precision and the meta-analytic approach is fruitful in theory, we should realize that publication bias will result in precise but biased (overestimated) effect size estimation of meta-analyses [@doi:10.1037/gpr0000034].
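To make the contrast between the two hypothetical studies concrete, the following sketch recovers a point estimate and standard error from each confidence interval and pools them with inverse-variance weights. It rests on assumptions not stated in the text: both intervals are treated as symmetric, normal-theory 95\% CIs, the pooling uses a simple fixed-effect model, and publication bias is ignored. The function names are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def estimate_and_se(ci_low, ci_high, z=Z_95):
    """Recover the point estimate and standard error from a symmetric CI."""
    estimate = (ci_low + ci_high) / 2.0
    se = (ci_high - ci_low) / (2.0 * z)
    return estimate, se


def fixed_effect_pool(studies):
    """Inverse-variance weighted mean and its standard error.

    `studies` is a list of (estimate, standard_error) pairs.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


large = estimate_and_se(-0.01, 0.05)  # large, nonsignificant study
small = estimate_and_se(0.01, 1.30)   # small, significant study
pooled_estimate, pooled_se = fixed_effect_pool([large, small])

print(f"large study: estimate = {large[0]:.3f}, SE = {large[1]:.4f}")
print(f"small study: estimate = {small[0]:.3f}, SE = {small[1]:.4f}")
print(f"pooled:      estimate = {pooled_estimate:.3f}, SE = {pooled_se:.4f}")
```

Under these assumptions the pooled estimate stays close to that of the large study, because its weight is several hundred times that of the small study, and the pooled standard error is slightly smaller than either study's own.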
### Limitations and further research
For all three applications, the Fisher tests' conclusions are limited to detecting at least one false negative in a *set of results*. The method cannot be used to draw inferences on individual results in the set. To draw inferences on the true effect size underlying one specific observed effect size, generally more information (i.e., studies) is needed to increase the precision of the effect size estimate.
Another potential caveat relates to the data collected with the `R` package `statcheck` and used in applications 1 and 2. `statcheck` extracts inline, APA style reported test statistics, but does not extract results reported in tables or results that are not reported as the APA prescribes. Consequently, our results and conclusions may not be generalizable to *all* results reported in articles.
Given that the results indicate that false negatives are still a problem in psychology, albeit slowly on the decline in published research, further research is warranted. Further research could focus on comparing evidence for false negatives in main and peripheral results. Our results in combination with results of previous studies suggest that publication bias mainly operates on results of tests of main hypotheses, and less so on peripheral results. Another avenue for future research is using the Fisher test to re-examine evidence in the literature on certain other effects or often-used covariates, such as age and race, or to see if it helps researchers prevent dichotomous thinking with individual $p$-values [@doi:10.3758/bf03213921]. Finally, the Fisher test can be, and is, also used to meta-analyze effect sizes of different studies. Whereas Fisher used his method to test the null hypothesis of an underlying true zero effect using several studies' $p$-values, the method has recently been extended to yield unbiased effect estimates using only statistically significant $p$-values. The principle of uniformly distributed $p$-values given the true effect size, on which the Fisher method is based, also underlies newly developed methods of meta-analysis that adjust for publication bias, such as $p$-uniform [@doi:10.1037/met0000025] and $p$-curve [@doi:10.1037/a0033242]. Extensions of these methods to include nonsignificant as well as significant $p$-values and to estimate heterogeneity are still under construction. Other approaches to examine statistically nonsignificant results may include inspecting effect sizes directly, similar to our approach used to construct Figure \@ref(fig:tgtbf-fig3), instead of relying on the NHST framework.
To conclude, our three applications indicate that false negatives remain a problem in the psychology literature, despite the decreased attention, and that we should be wary of interpreting statistically nonsignificant results as there being no effect in reality. One way to combat this interpretation of statistically nonsignificant results is to incorporate testing for potential false negatives, which the Fisher method facilitates in a highly approachable manner (a spreadsheet for carrying out such a test is available at [https://osf.io/tk57v/](https://osf.io/tk57v/)).
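The set-level nature of the test is visible in how it is computed. The sketch below is a minimal illustration, not the code used for the analyses. It assumes the adapted method rescales each nonsignificant $p$-value to the unit interval as $p^* = (p - \alpha)/(1 - \alpha)$ and then combines the rescaled values as $\chi^2_{2k} = -2\sum_{i=1}^{k}\ln(p^*_i)$. The function names and the example $p$-values are hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def adapted_fisher(p_values, alpha=0.05):
    """Combine nonsignificant p-values into a single set-level test.

    Assumption: each p-value is rescaled as (p - alpha) / (1 - alpha) so that
    it is uniform on (0, 1] when the true effect is zero.
    Returns (chi-square statistic, degrees of freedom, p-value).
    """
    rescaled = []
    for p in p_values:
        if not alpha < p <= 1.0:
            raise ValueError("only nonsignificant p-values (alpha < p <= 1) are allowed")
        rescaled.append((p - alpha) / (1.0 - alpha))
    chi2 = -2.0 * sum(math.log(ps) for ps in rescaled)
    df = 2 * len(rescaled)
    return chi2, df, chi2_upper_tail_even_df(chi2, df)


# Hypothetical nonsignificant p-values, made up for illustration only.
example_p_values = [0.06, 0.08, 0.11, 0.35, 0.72]
chi2, df, p_set = adapted_fisher(example_p_values)
print(f"chi2({df}) = {chi2:.2f}, p = {p_set:.4f}")
```

The function returns one statistic and one $p$-value for the whole set, so a small $p$-value says only that at least one result in the set is unlikely to be a true negative, not which one.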