
# Reanalyzing Head et al. (2015): investigating the robustness of widespread $p$-hacking

```{r echo=FALSE}
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(kableExtra))
```
@doi:10.1371/journal.pbio.1002106 provided a large collection of $p$-values that, from their perspective, indicates widespread statistical significance seeking (i.e., $p$-hacking) throughout the sciences. This result has been questioned from an epistemological perspective because analyzing all reported $p$-values in research articles answers the supposedly inappropriate question of evidential value across all results [@doi:10.1037/xge0000104]. Adjacent to epistemological concerns, the robustness of widespread $p$-hacking in these data can be questioned because of the large variation in a priori choices with regard to data analysis. @doi:10.1371/journal.pbio.1002106 had to make several decisions with respect to the data analysis, which might have affected the results. In this chapter I evaluate the data analysis approach with which @doi:10.1371/journal.pbio.1002106 found widespread $p$-hacking and propose that this effect is not robust to several justifiable changes. The underlying models for their findings have been discussed in several preprints [e.g., @doi:10.7287/peerj.preprints.1266v1;@doi:10.6084/m9.figshare.1500901.v1] and publications [e.g., @doi:10.1037/xge0000104;@doi:10.1371/journal.pone.0149144], but the data have not been extensively reanalyzed for robustness.

The $p$-value distribution of a set of true- and null results without $p$-hacking should be a mixture distribution of only the uniform $p$-value distribution under the null hypothesis $H_0$ and right-skew $p$-value distributions under the alternative hypothesis $H_1$. $P$-hacking behaviors affect the distribution of statistically significant $p$-values, potentially resulting in left-skew below .05 (i.e., a bump), but not necessarily so [@doi:10.7717/peerj.1935;@doi:10.1080/17470218.2014.982664;@doi:10.7717/peerj.1715]. An example of a questionable behavior that can result in left-skew is optional stopping (i.e., data peeking) if the null hypothesis is true [@doi:10.1080/17470218.2014.982664].
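
The shape of these component distributions can be illustrated with a small simulation. The following is a minimal sketch and not part of the original analyses: it assumes two-sided $z$-tests, and the noncentrality of 2 under $H_1$ is a hypothetical value chosen only for illustration.

```r
set.seed(1)
n_sim <- 10000

# Two-sided p-value for a z statistic
p_from_z <- function(z) 2 * pnorm(-abs(z))

# Under H0 the z statistics are centered at 0, so p-values are uniform
p_null <- p_from_z(rnorm(n_sim, mean = 0))

# Under H1 (hypothetical noncentrality of 2) small p-values are overrepresented
p_alt <- p_from_z(rnorm(n_sim, mean = 2))

# Among significant results, the share falling in the lower half (p < .025)
# is about one half under H0 and clearly above one half under H1 (right-skew)
sig_null <- p_null[p_null < .05]
sig_alt <- p_alt[p_alt < .05]
share_low_null <- mean(sig_null < .025)
share_low_alt <- mean(sig_alt < .025)
```

A left-skew among significant $p$-values would show up as the opposite pattern, that is, a share clearly below one half.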

Consequently, @doi:10.1371/journal.pbio.1002106 correctly argue that an aggregate $p$-value distribution could show a bump below .05 when left-skew $p$-hacking occurs frequently. Questionable behaviors that result in seeking statistically significant results, such as (but not limited to) the aforementioned optional stopping under $H_0$, could result in a bump below .05. Hence, a systematic bump below .05 (i.e., not due to sampling error) is a sufficient condition for the presence of specific forms of $p$-hacking. However, this bump below .05 is not a necessary condition, because other types of $p$-hacking can still occur without a bump below .05 presenting itself [@doi:10.7717/peerj.1935;@doi:10.1080/17470218.2014.982664;@doi:10.7717/peerj.1715]. For example, one might use optional stopping when there is a true effect or conduct multiple analyses, but only report that statistical test which yielded the smallest $p$-value. Therefore, if no bump of statistically significant $p$-values is found, this does not exclude that $p$-hacking occurs at a large scale.

In the current chapter, the conclusion from @doi:10.1371/journal.pbio.1002106 is inspected for robustness. Their conclusion is that the data fulfill the sufficient condition for $p$-hacking (i.e., show a systematic bump below .05) and hence provide evidence for the presence of specific forms of $p$-hacking. The robustness of this conclusion is inspected in three steps: (i) explaining the data and data analysis strategies (original and reanalysis), (ii) reevaluating the evidence for a bump below .05 (i.e., the sufficient condition) based on the reanalysis, and (iii) discussing whether this means that there is widespread $p$-hacking in the literature.

## Data and methods

In the original paper, over two million reported $p$-values were mined from the [Open Access subset of PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). PubMed Central indexes the biomedical and life sciences and permits bulk downloading of full-text Open Access articles. The text-mining procedure of @doi:10.1371/journal.pbio.1002106 extracted all reported $p$-values from these full-text articles, including those that were reported without an accompanying test statistic. For example, the $p$-value from the result $t(59)=1.75,p>.05$ was included, but also a lone $p<.05$. Subsequently, @doi:10.1371/journal.pbio.1002106 analyzed a subset of statistically significant $p$-values (assuming $\alpha=.05$) that were exactly reported (e.g., $p=.043$; the same subset is analyzed in this chapter).

The data analysis approach of @doi:10.1371/journal.pbio.1002106 focused on comparing frequencies in the last and penultimate bins below .05 at a binwidth of .005 (i.e., $.04<p<.045$ versus $.045<p<.05$). Based on the tenet that a sufficient condition for $p$-hacking is a systematic bump of $p$-values below .05 [@doi:10.1037/a0033242], sufficient evidence for $p$-hacking is present if the last bin has a significantly higher frequency than the penultimate bin in a binomial test. Applying the binomial test (i.e., Caliper test) to two frequency bins has previously been used in publication bias research [@doi:10.1177/1532673x09350979;@doi:10.1371/journal.pone.0105825]; here it is applied specifically to test for $p$-hacking behaviors that result in a bump below .05. The binwidth of .005 and the bins $.04<p<.045$ and $.045<p<.05$ were chosen by @doi:10.1371/journal.pbio.1002106 because they expected the signal of this form of $p$-hacking to be strongest in this part of the distribution (regions of the $p$-value distribution closer to zero are more likely to contain evidence of true effects than regions close to .05). They excluded $p=.05$ "because [they] suspect[ed] that many authors do not regard $p=0.05$ as significant" (p.4).

```{r head-hist, fig.cap="Histograms of $p$-values as selected in Head et al. (in green; $.04 < p < .045$ versus $.045 < p < .05$) and the significant $p$-value distribution as selected in Head et al. (in grey; $0<p\\leq.00125$, $.00125<p\\leq.0025$, ..., $.0475<p\\leq.04875$, $.04875<p<.05$, binwidth = .00125). The green and grey histograms exclude $p=.045$ and $p=.05$; the black histogram shows the frequencies of results that are omitted because of this ($.04375<p\\leq.045$ and $.04875<p\\leq.05$, binwidth = .00125).", out.width="80%", fig.align="center", echo=FALSE}
knitr::include_graphics('assets/figures/head-fig1.png', auto_pdf = TRUE)
```

Figure \@ref(fig:head-hist) shows the selection of $p$-values in @doi:10.1371/journal.pbio.1002106 in two ways: (1) in green, which shows the results as analyzed by Head et al. (i.e., $.04<p<.045$ versus $.045<p<.05$), and (2) in grey, which shows the entire distribution of significant $p$-values (assuming $\alpha=.05$) available to Head et al. after eliminating $p=.045$ and $p=.05$ (depicted by the black bins). The heights of the two green bins (i.e., the sums of the grey bins in the same ranges) show a bump below .05, which indicates $p$-hacking. The grey histogram in Figure \@ref(fig:head-hist) shows a more fine-grained depiction of the $p$-value distribution and does not clearly show a bump below .05, because whether a bump appears depends on which bins are compared. However, the grey histogram clearly indicates that $p$-values at the second decimal (e.g., .02, .03) tend to be reported more frequently when $p\geq.01$.

Theoretically, the $p$-value distribution should be a smooth, decreasing function, but the grey distribution shows systematically more reported $p$-values for .01, .02, .03, .04 (and .05 when the black histogram is included). As such, there seems to be a tendency to report $p$-values to two decimal places instead of three. For example, $p=.041$ might be correctly rounded down to $p=.04$, or $p=.046$ rounded up to $p=.05$. A potential post-hoc explanation is that three-decimal reporting of $p$-values is a relatively recent standard, if a standard at all. For example, it has only been prescribed since 2010 in psychology [@isbn:9781433805615], where two-decimal reporting was prescribed previously [@American_Psychological_Association1983-yf;@American_Psychological_Association2001-uw]. Given the results, it seems reasonable to assume that other fields might also report to two decimal places instead of three most of the time.
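
To make the rounding tendency concrete, the sketch below flags which reported $p$-values sit exactly on a second decimal. It is only an illustration: the vector `p_reported` is hypothetical and the tolerance of `1e-8` is an assumption to guard against floating-point error.

```r
# Hypothetical reported p-values, for illustration only
p_reported <- c(.04, .041, .05, .032, .03, .0375)

# A value is a two-decimal report if multiplying by 100 yields a whole number
is_two_decimal <- abs(p_reported * 100 - round(p_reported * 100)) < 1e-8

# Share of values reported at exactly two decimals
share_two_decimal <- mean(is_two_decimal)

# Correct rounding maps .041 to .04 and .046 to .05
rounded <- round(c(.041, .046), 2)
```

In a smooth distribution the flagged values would be no more frequent than their neighbors, so a high share of two-decimal reports points to rounding in reporting.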

Moreover, the data analysis approach used by @doi:10.1371/journal.pbio.1002106 eliminates $p=.045$ for symmetry of the compared bins and $p=.05$ based on a potentially invalid assumption about when researchers regard results as statistically significant. $P=.045$ is not included in the selected bins ($.04<p<.045$ versus $.045<p<.05$), although its exclusion could affect the results. If $p=.045$ is included, no evidence of a bump below .05 is found (the left black bin in Figure \@ref(fig:head-hist) is then included; the frequency is 20114 for $.04<p\leq.045$ versus 18132 for $.045<p<.05$). However, the bins are then asymmetrical and require a different analysis. To this end, I supplement the Caliper tests with Fisher's method [@Fisher1925-jl;@doi:10.2307/2681650] based on the same range analyzed by @doi:10.1371/journal.pbio.1002106. This analysis includes $.04<p<.05$ (i.e., it does not exclude $p=.045$ as in the binned Caliper test). Fisher's method tests for a deviation from uniformity and was computed as
\begin{equation}
  \chi^2_{2k}=-2\sum^k_{i=1}\ln\left(\frac{p_i-.04}{.01}\right)
  (\#eq:fishmeth)
\end{equation}
where $p_i$ are the $p$-values in the range $.04<p<.05$. Effectively, Equation \@ref(eq:fishmeth) tests for a bump between .04 and .05 (i.e., the transformation ensures that the transformed $p$-values range from 0 to 1 and that Fisher's method inspects left-skew instead of right-skew). $P=.05$ was consistently excluded by @doi:10.1371/journal.pbio.1002106 because they assumed researchers did not interpret this as statistically significant. However, researchers interpret $p=.05$ as statistically significant more frequently than they thought: 94\% of 236 cases investigated by @doi:10.3758/s13428-015-0664-2 interpreted $p=.05$ as statistically significant, indicating this assumption might not be valid.
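
Equation \@ref(eq:fishmeth) can be sketched in R as follows. This is a minimal illustration and not the original analysis script: the function name `fisher_interval`, its arguments, and the vector `p_example` are hypothetical. Note that, as written, the statistic grows when $p$-values lie close to .04; the `complement` argument (an assumption of this sketch, not part of the equation) reverses the transformed values so that the statistic grows when $p$-values lie close to .05 instead.

```r
# Fisher's method restricted to the open interval (lower, upper)
fisher_interval <- function(p, lower = .04, upper = .05, complement = FALSE) {
  p <- p[p > lower & p < upper]
  # Rescale the selected p-values to the unit interval
  u <- (p - lower) / (upper - lower)
  # Assumption of this sketch: optionally reverse the rescaled values
  if (complement) u <- 1 - u
  chi2 <- -2 * sum(log(u))
  df <- 2 * length(p)
  list(chi2 = chi2, df = df,
       p.value = pchisq(chi2, df = df, lower.tail = FALSE))
}

# Hypothetical p-values, for illustration only
p_example <- c(.041, .043, .044, .045, .046, .048, .049)
res <- fisher_interval(p_example)

# The degrees of freedom reported in this chapter equal twice the number of
# p-values in the range: 2 * (20114 + 18132) = 76492
df_reported <- 2 * (20114 + 18132)

# Upper-tail probability of the reported test statistic
reported_p <- pchisq(70328.86, df = df_reported, lower.tail = FALSE)
```

The last line only recomputes the tail probability from the reported statistic and degrees of freedom; it does not reproduce the statistic itself, which requires the original data.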

Given that systematically more $p$-values are reported to two decimal places and the adjustments described in the previous paragraph, I did not exclude $p=.045$ and $p=.05$ and I adjusted the bin selection to $.03875<p\leq.04$ versus $.04875<p\leq.05$. Visually, the newly selected data are the grey and black bins from Figure \@ref(fig:head-hist) combined, where the rightmost black bin (i.e., $.04875<p\leq.05$) is compared with the large grey bin at .04 (i.e., $.03875<p\leq.04$). The bins $.03875<p\leq.04$ and $.04875<p\leq.05$ were selected to take into account that $p$-values are typically rounded (both up and down) in the observed data. Moreover, if incorrect or excessive rounding-down of $p$-values occurs strategically [e.g., $p=.054$ reported as $p=.05$; @doi:10.1080/19312458.2015.1096333], this can be considered $p$-hacking. If $p=.05$ is excluded from the analyses, these types of $p$-hacking behaviors are eliminated from the analyses, potentially decreasing the sensitivity of the test for a bump.

The reanalysis approach for the bins $.03875<p\leq.04$ and $.04875<p\leq.05$ is similar to @doi:10.1371/journal.pbio.1002106 and applies the Caliper test to detect a bump below .05, with the addition of Bayesian Caliper tests. The Caliper test investigates whether the two bins contain equally many results or whether the penultimate bin (i.e., $.03875<p\leq.04$) contains more results than the ultimate bin (i.e., $.04875<p\leq.05$; $H_0:Proportion\leq.5$), where $Proportion$ is the share of results in the ultimate bin. Sensitivity analyses were also conducted, altering the binwidth from .00125 to .005 and .01. Moreover, the analyses were conducted separately for the $p$-values extracted from the Abstract sections and from the Results sections.
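
The frequentist Caliper test on these bins can be sketched with `binom.test()` from base R. This is an illustration under stated assumptions rather than the original analysis code: the function name `caliper_test` and the vector `p_reported` are hypothetical, and I assume that the wider bins of the sensitivity analyses keep their upper bounds at .04 and .05.

```r
caliper_test <- function(p, width = .00125) {
  # Work on an integer grid (units of 1e-5) to avoid floating-point
  # surprises at the bin edges; assumes at most five reported decimals
  p_int <- round(p * 1e5)
  w_int <- round(width * 1e5)
  penultimate <- sum(p_int > 4000 - w_int & p_int <= 4000)
  ultimate <- sum(p_int > 5000 - w_int & p_int <= 5000)
  # H0: Proportion <= .5, where Proportion is the share in the ultimate bin
  binom.test(ultimate, ultimate + penultimate, p = .5,
             alternative = "greater")
}

# Hypothetical reported p-values, for illustration only
p_reported <- c(.04, .04, .04, .039, .05, .05, .049, .03, .045, .012)
res <- caliper_test(p_reported)
res$estimate  # share of results in the ultimate bin
res$p.value
```

Calling `caliper_test(p_reported, width = .005)` or `width = .01` gives the corresponding sensitivity analyses under the same assumption about bin placement.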

The Bayesian Caliper test and the traditional, frequentist Caliper test yield results with different interpretations. The $p$-value of the Caliper test gives the probability of more extreme results if the null hypothesis is true, but does not quantify the relative support for the null and alternative hypotheses. The added value of the Bayes Factor ($BF$) is that it does quantify this relative support as a ratio, either as $BF_{10}$, the alternative hypothesis versus the null hypothesis, or vice versa, $BF_{01}$. A $BF$ of 1 indicates that the data are equally likely under both hypotheses. All Bayesian proportion tests were conducted with highly uncertain priors ($r=1$, 'ultrawide' prior) using the `BayesFactor` package [@bf]. In this specific instance, $BF_{10}$ is computed and values $>1$ can be interpreted, for our purposes, as: the data are more likely under $p$-hacking that results in a bump below .05 (i.e., left-skew $p$-hacking) than under no left-skew $p$-hacking. $BF_{10}$ values $<1$ indicate that the data are more likely under no left-skew $p$-hacking than under left-skew $p$-hacking. The further removed from $1$, the more evidence in the direction of either hypothesis is available.
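
A Bayesian proportion test of this kind might be set up as sketched below with `proportionBF()` from the `BayesFactor` package. The counts are hypothetical, and the use of `nullInterval = c(.5, 1)` to express the directional hypothesis is an assumption of this sketch; the exact call used for the reanalysis is not shown in this chapter.

```r
# Hypothetical bin counts, for illustration only
n_ultimate <- 40
n_penultimate <- 60

bf10 <- NULL
if (requireNamespace("BayesFactor", quietly = TRUE)) {
  bf <- BayesFactor::proportionBF(
    y = n_ultimate,
    N = n_ultimate + n_penultimate,
    p = .5,
    rscale = 1,               # prior scale r = 1 ('ultrawide')
    nullInterval = c(.5, 1)   # assumption: directional alternative, Proportion > .5
  )
  # First row: alternative restricted to the interval versus the point null
  bf10 <- BayesFactor::extractBF(bf)$bf[1]
}
```

The guard around the call only ensures the sketch does nothing when the package is not installed.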

## Reanalysis results

Fisher's method applied to all $p$-values in the range $.04<p<.05$ (i.e., without excluding $p=.045$) fails to find evidence for a bump below .05, $\chi^2(76492)=70328.86,p>.999$. Additionally, no evidence for a bump below .05 remains when I focus on the more frequently reported second-decimal bins, which could include $p$-hacking behaviors such as incorrect or excessive rounding down to $p=.05$. Reanalyses showed no evidence for left-skew $p$-hacking, $Proportion=.417,p>.999, BF_{10}<.001$ for the Results sections and $Proportion=.358,p>.999,BF_{10}<.001$ for the Abstract sections. Table \@ref(tab:caliper-table) summarizes these results for alternate binwidths (.00125, .005, and .01) and shows that results are consistent across different binwidths. Separated per discipline, no binomial test for left-skew $p$-hacking is statistically significant in either the Results or the Abstract sections (see the Supporting Information). This indicates that the evidence for $p$-hacking that results in a bump below .05, as presented by @doi:10.1371/journal.pbio.1002106, seems not to be robust to minor changes in the analysis, such as including $p=.045$ by evaluating $.04<p<.05$ continuously instead of binning, or taking into account the observed tendency to round $p$-values to two decimal places during the bin selection.

```{r caliper-table, echo=FALSE, results='asis'}
caliper <- read.csv('assets/tables/caliper-table.csv', header = TRUE)
# caliper$X.1 <- c('')

names(caliper)[1:2] <- ''

if (!knitr::is_html_output()) {
  knitr::kable(caliper, format = 'latex',
             caption = "Results of the reanalysis across various binwidths (i.e., .00125, .005, .01) and different sections of the paper.",
             booktabs=TRUE, escape = FALSE) %>%
  kableExtra::kable_styling(latex_options = c('striped', 'hold_position'), position = 'center')
} else {
  knitr::kable(caliper,
             caption = "Results of the reanalysis across various binwidths (i.e., .00125, .005, .01) and different sections of the paper.",
             booktabs=TRUE, escape = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
  }
```

## Discussion

@doi:10.1371/journal.pbio.1002106 collected $p$-values from full-text articles and analyzed these for $p$-hacking, concluding that "$p$-hacking is widespread throughout science" [see abstract; @doi:10.1371/journal.pbio.1002106]. Given the implications of such a finding, I inspected whether evidence for widespread $p$-hacking was robust to some substantively justified changes in the data selection. A minor adjustment from comparing bins to continuously evaluating $.04<p<.05$, the latter not excluding .045, already indicated that this finding seems not to be robust. Additionally, after altering the inspected bins because systematically more $p$-values are reported to the second decimal, and after including $p=.05$ in the analyses, the results indicate that evidence for widespread $p$-hacking, as presented by @doi:10.1371/journal.pbio.1002106, is not robust to these substantive changes in the analysis. Moreover, the frequency of $p=.05$ is directly affected by $p$-hacking when rounding-down of $p$-values is done strategically. The conclusion drawn by @doi:10.1371/journal.pbio.1002106 might still be correct, but the data do not indisputably show so. Moreover, even if there is no $p$-hacking that results in a bump of $p$-values below .05, other forms of $p$-hacking that do not cause such a bump can still be present and prevalent [@doi:10.7717/peerj.1935;@doi:10.1080/17470218.2014.982664;@doi:10.7717/peerj.1715].

Second-decimal reporting tendencies of $p$-values should be taken into consideration when selecting bins for inspection because this data set does not allow for the elimination of such reporting tendencies. Their substantive consequences are clearly depicted in the results of the reanalysis, and Figure \@ref(fig:head-hist) illustrates how the theoretical properties of $p$-value distributions do not hold for the reported $p$-value distribution. Previous research has indicated that when the recalculated $p$-value distribution is inspected, the theoretically expected smooth distribution re-emerges even when the reported $p$-value distribution shows reporting tendencies [@doi:10.7717/peerj.1935;@doi:10.1371/journal.pone.0127872]. Given that the text-mining procedure implemented by @doi:10.1371/journal.pbio.1002106 does not allow for recalculation of $p$-values, the effect of reporting tendencies needs to be mitigated by altering the data analysis approach.

Even after mitigating the effect of reporting tendencies, these analyses were all conducted on a set of aggregated $p$-values, which can detect $p$-hacking that results in a bump of $p$-values below .05 if it is widespread, but cannot prove that no $p$-hacking is going on in any of the individual papers. Firstly, there is the risk of an ecological fallacy. These analyses take place at the aggregate level, but there might still be research papers that show a bump below .05 at the paper level. Secondly, some forms of $p$-hacking also result in right-skew, which is not picked up in these analyses and is difficult to detect in a set of heterogeneous results [attempted in @doi:10.7717/peerj.1935]. As such, if any detection of $p$-hacking is attempted, this should be done at the paper level and after careful scrutiny of which results are included [@doi:10.1037/xge0000104;@doi:10.7717/peerj.1715].

## Limitations and conclusion

In this reanalysis two limitations remain with respect to the data analysis. First, selecting the bins just below .04 and .05 results in selecting non-adjacent bins. Hence, the test might be less sensitive to detect a bump below .05. In light of this limitation I ran the original analysis from @doi:10.1371/journal.pbio.1002106, but included the second decimal (i.e., $.04\leq p<.045$ versus $.045<p\leq.05$). This analysis also yielded no evidence for a bump of $p$-values below .05, $Proportion=.431,p>.999,BF_{10}<.001$. Second, the selection of only exactly reported $p$-values might have distorted the $p$-value distribution due to reporting tendencies in rounding. For example, a researcher with a $p$-value of .047 might be more likely to report $p<.05$ than a researcher with a $p$-value of .037 reporting $p<.04$. Given that these analyses exclude all values reported as $p<X$, this could have affected the results. There is some indication that this tendency to round up is relatively stronger around .05 than around .04 [a factor of 1.25 approximately based on the [original Figure 5](https://doi.org/10.1371/journal.pone.0127872.g005); @doi:10.1371/journal.pone.0127872], which might result in an underrepresentation of $p$-values around .05.

Given the implications of the findings by @doi:10.1371/journal.pbio.1002106, it is important that these findings are robust to choices that can vary. Moreover, the evidence for the absence of a bump below .05 seems to be stronger than the evidence for its presence throughout the literature: a reanalysis of a previous paper, which found evidence for a bump below .05 [@doi:10.1080/17470218.2012.711335], yielded no evidence for a bump below .05 [@doi:10.1080/17470218.2014.982664]; two new data sets also did not reveal a bump below .05 [@doi:10.7717/peerj.1935;@doi:10.1080/19312458.2015.1096333]. Consequently, findings that claim there is a bump below .05 need to be robust. In this chapter, I explained why a different data analysis approach to the data of @doi:10.1371/journal.pbio.1002106 can be justified and that, as a result, no evidence of widespread $p$-hacking that results in a bump of $p$-values below .05 is found. Although this does not mean that no $p$-hacking occurs at all, the conclusion by @doi:10.1371/journal.pbio.1002106 should not be taken at face value considering that the results are not robust to (minor) choices in the data analysis approach. As such, the evidence for widespread left-skew $p$-hacking is ambiguous at best.

## Supporting Information

S1 File. Full reanalysis results per discipline: [https://doi.org/10.7717/peerj.3068/supp-1](https://doi.org/10.7717/peerj.3068/supp-1).