# Summary {-}

This dissertation focuses on understanding and detecting threats to the epistemology of science (Chapters 1-6) and on making practical advances to remedy such threats (Chapters 7-9).

Chapter 1 reviews the literature on responsible conduct of research, questionable
research practices, and research misconduct. Responsible conduct of
research is often defined in terms of a set of abstract, normative
principles, professional standards, and ethics in doing research. In
order to accommodate the normative principles of scientific research,
the professional standards, and a researcher's moral principles,
transparent research practices can serve as a framework for responsible
conduct of research. Here I suggest a 'prune-and-add' project structure to
enhance transparency and by extension, responsible conduct of research.
Questionable research practices are defined as practices that are
detrimental to the research process. The prevalence of questionable
research practices remains largely unknown and reproducibility of
findings has been shown to be problematic. Transparent practices
discourage questionable practices, because such practices become more
readily apparent to scientific peers. Most effective might
be preregistrations of research design, hypotheses, and analyses, which
reduce particularism of results by providing an a priori research
scheme. Research misconduct has been defined as fabrication,
falsification, and plagiarism (FFP), which is the most detrimental type
of research practice. Although it is clearly wrong, it can be approached
from both a legal and a scientific perspective. The legal perspective sees
research misconduct as a form of white-collar crime. The scientific
perspective seeks to answer the question "were results invalidated
because of the misconduct?" I review how misconduct is typically
detected, how its detection can be improved, and how prevalent it might
be. Institutions could facilitate detection of data fabrication and
falsification by implementing data auditing. Nonetheless, the effect of
misconduct is pervasive: many retracted articles are still cited after
the retraction has been issued.

@doi:10.1371/journal.pbio.1002106 provided a large collection of
$p$-values that, from their perspective, indicates widespread
statistical significance seeking (i.e., $p$-hacking). Chapter 2
inspects this result for robustness. Theoretically, the $p$-value
distribution should be a smooth, decreasing function, but the
distribution of reported $p$-values shows systematically more reported
$p$-values for .01, .02, .03, .04, and .05 than $p$-values reported to
three decimal places, due to apparent tendencies to round $p$-values
to two decimal places. @doi:10.1371/journal.pbio.1002106 correctly
argue that an aggregate $p$-value distribution could show a bump below
.05 when left-skew $p$-hacking occurs frequently. Moreover, the
elimination of $p=.045$ and $p=.05$, as done in the original paper, is
debatable. Given that eliminating $p=.045$ is a result of the need for
symmetric bins and systematically more $p$-values are reported to two
decimal places than to three decimal places, I did not exclude
$p=.045$ and $p=.05$. I applied Fisher's method to $p$-values in the interval $.04<p<.05$ and
reanalyzed the data by adjusting the bin selection to
$.03875<p\leq.04$ versus $.04875<p\leq.05$. Results of the reanalysis
indicate that no evidence for left-skew $p$-hacking remains when I
look at the entire range $.04<p<.05$ or when I inspect only the
second decimal. Taking into account reporting tendencies when
selecting the bins to compare is especially important because this
dataset does not allow for the recalculation of the
$p$-values. Moreover, inspecting the bins that include two-decimal
reported $p$-values potentially increases sensitivity if strategic
rounding down of $p$-values as a form of $p$-hacking is
widespread. Given the far-reaching implications of supposedly widespread
$p$-hacking throughout the sciences [@doi:10.1371/journal.pbio.1002106],
it is important that these findings are robust to data analysis
choices if the conclusion is to be considered unequivocal. Although no
evidence of widespread left-skew $p$-hacking is found in this
reanalysis, this does not mean that there is no $p$-hacking at
all. These results nuance the conclusion by
@doi:10.1371/journal.pbio.1002106, indicating that the results are not
robust and that the evidence for widespread left-skew $p$-hacking is
ambiguous at best.
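
To make the combining step concrete, the sketch below applies Fisher's method to $p$-values in the $.04<p<.05$ interval. The particular rescaling onto the unit interval is an illustrative choice for this sketch, not necessarily the exact transformation used in the chapter.

```r
# Minimal sketch: Fisher's method applied to p-values in (.04, .05).
# The rescaling u = (.05 - p) / .01 maps the interval onto (0, 1); if
# p-values are uniform over this interval, u is uniform too. Small u
# (p close to .05) contributes heavily to the chi-square statistic, so
# a significant result signals clustering just below .05 (left skew).
fisher_interval <- function(p, lower = .04, upper = .05) {
  p <- p[p > lower & p < upper]
  u <- (upper - p) / (upper - lower)
  chi2 <- -2 * sum(log(u))
  df <- 2 * length(p)
  list(chi2 = chi2, df = df,
       p = pchisq(chi2, df = df, lower.tail = FALSE))
}

# Example with simulated p-values
set.seed(1)
fisher_interval(runif(200, .03, .05))
```
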
Chapter 3 examined 258,050 test results across 30,710 articles from eight high-impact journals to investigate the existence of a peculiar prevalence of $p$-values just below .05 (i.e., a bump) in the psychological literature, and a potential increase thereof over time. I indeed found evidence for a bump just below .05 in the distribution of exactly reported $p$-values in the journals Developmental Psychology, Journal of Applied Psychology, and Journal of Personality and Social Psychology, but the bump did not increase over the years and disappeared when using recalculated $p$-values. I found clear and direct evidence for the QRP "incorrect rounding of $p$-value" [@doi:10.1177/0956797611430953] in all psychology journals. Finally, I also investigated monotonic excess of $p$-values, an effect of certain QRPs that has been neglected in previous research, and developed two measures to detect this by modeling the distributions of statistically significant $p$-values. Using simulations and applying the two measures to the retrieved test results, I argue that, although one of the measures suggests the use of QRPs in psychology, it is difficult to draw general conclusions concerning QRPs based on modeling of $p$-value distributions.
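
As an illustration of what detecting incorrectly rounded $p$-values involves, the sketch below recomputes a $p$-value from a reported $t$-test and flags inconsistencies. The tolerance and decision rule are simplifications of my own for this summary, not the statcheck algorithm itself.

```r
# Minimal sketch: recompute a p-value from a reported t statistic and
# degrees of freedom, then compare it with the reported p-value. The
# .005 tolerance assumes rounding to two decimals; this rule is an
# illustrative simplification, not statcheck itself.
check_t_result <- function(t, df, reported_p) {
  recomputed <- 2 * pt(abs(t), df = df, lower.tail = FALSE)
  list(
    recomputed = recomputed,
    # Is the reported value within rounding distance of the recomputed one?
    consistent = abs(recomputed - reported_p) <= .005,
    # Does the discrepancy change the significance decision at .05?
    decision_error = (reported_p < .05) != (recomputed < .05)
  )
}

# Example: t(48) = 1.98 reported as p = .04
check_t_result(t = 1.98, df = 48, reported_p = .04)
```
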
In Chapter 4 I examined evidence for false negatives in nonsignificant
results in three different ways. I adapted the Fisher method to detect
the presence of at least one false negative in a set of statistically
nonsignificant results. Simulations show that the adapted Fisher
method is generally powerful in detecting false negatives. I
examined evidence for false negatives in the psychology literature in
three applications of the adapted Fisher method. These applications
indicate that (i) the observed effect size distribution of
nonsignificant effects exceeds the expected distribution assuming a
null effect, and approximately two out of three (66.7%) psychology
articles reporting nonsignificant results contain evidence for at
least one false negative, (ii) nonsignificant results on gender
effects contain evidence of true nonzero effects, and (iii) the
statistically nonsignificant replications from the Reproducibility
Project Psychology (RPP) do not warrant strong conclusions about the
absence or presence of true zero effects underlying these
nonsignificant results. I conclude that false negatives deserve more
attention in the current debate on statistical practices in
psychology. Potentially neglecting effects due to a lack of
statistical power can lead to a waste of research resources and stifle
the scientific discovery process.
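
To make the adapted Fisher test concrete, here is a minimal sketch of the idea described above; the exact transformation used in the chapter may differ, so treat this as a reconstruction rather than the chapter's implementation.

```r
# Minimal sketch of an adapted Fisher test for nonsignificant results.
# Under a true zero effect, nonsignificant two-tailed p-values are
# (approximately) uniform on (alpha, 1). Rescaling them to (0, 1) and
# combining them with Fisher's method tests whether they lie closer to
# alpha than chance predicts, which would indicate at least one false
# negative. This is a reconstruction of the idea, not the chapter's code.
adapted_fisher <- function(p, alpha = .05) {
  p <- p[p > alpha]                      # keep nonsignificant results
  u <- (p - alpha) / (1 - alpha)         # rescale to (0, 1)
  chi2 <- -2 * sum(log(u))
  df <- 2 * length(p)
  list(chi2 = chi2, df = df,
       p = pchisq(chi2, df = df, lower.tail = FALSE))
}

# Example: nonsignificant p-values hovering just above .05
adapted_fisher(c(.06, .08, .11, .20, .35))
```
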
Chapter 5 describes a dataset that is the result of content mining
167,318 published articles for statistical test results reported
according to the standards prescribed by the American Psychological
Association (APA). Articles published by the APA, Springer, Sage, and
Taylor & Francis were included (mining from Wiley and Elsevier was
actively blocked). As a result of this content mining, 688,112 results
from 50,845 articles were extracted. In order to provide a
comprehensive set of data, the statistical results are supplemented
with metadata from the article they originate from. The dataset is
provided as a comma-separated values (CSV) file in long format. For each of
the 688,112 results, 20 variables are included, of which seven are
article metadata and thirteen pertain to the individual statistical results
(e.g., reported and recalculated $p$-value). A five-pronged approach
was taken to generate the dataset: (i) collect journal lists; (ii)
spider journal pages for articles; (iii) download articles; (iv) add
article metadata; and (v) mine articles for statistical results. All
materials, scripts, etc. are available at
[https://github.com/chartgerink/2016statcheck_data](https://github.com/chartgerink/2016statcheck_data)^[This
GitHub repository has been deleted since this chapter was previously
published. The links are included to remain consistent with the
published version.] and preserved at
[https://doi.org/10.5281/zenodo.59818](https://doi.org/10.5281/zenodo.59818).
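
The long format means that each row is one statistical result, joined with the metadata of its article. The sketch below shows how such a file could be summarized in R; the file name and column names ("doi", "reported_p", "recalculated_p") are hypothetical placeholders, so consult the dataset's own documentation for the actual variable names.

```r
# Minimal sketch of working with the long-format CSV. File and column
# names below are hypothetical placeholders, not the dataset's codebook.
statcheck_data <- read.csv("statcheck_results.csv", stringsAsFactors = FALSE)

# How many extracted results does each article contribute?
results_per_article <- table(statcheck_data$doi)
summary(as.integer(results_per_article))

# Proportion of results where the reported and recalculated p-values
# differ by more than rounding to two decimals would explain
with(statcheck_data,
     mean(abs(reported_p - recalculated_p) > .005, na.rm = TRUE))
```
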
In Chapter 6, I test the validity of statistical methods to detect
fabricated data in two studies. In Study 1, I tested the validity of
statistical methods to detect fabricated data at the study level using
summary statistics. Using (arguably) genuine data from the Many Labs 1
project on the anchoring effect ($k=36$) and fabricated data for the
same effect by our participants ($k=39$), I tested the validity of our
newly proposed 'reversed Fisher method', variance analyses, and
extreme effect sizes, as well as a combination of these three indicators
using the original Fisher method. Results indicate that the variance
analyses perform fairly well when the homogeneity of population
variances is accounted for and that extreme effect sizes perform
similarly well in distinguishing genuine from fabricated data. The
performance of the 'reversed Fisher method' was poor and depended on
the types of tests included. In Study 2, I tested the validity of
statistical methods to detect fabricated data using raw data. Using
(arguably) genuine data from the Many Labs 3 project on the classic
Stroop task ($k=21$) and fabricated data for the same effect by our
participants ($k=28$), I investigated the performance of digit
analyses, variance analyses, multivariate associations, and extreme
effect sizes, as well as a combination of these four methods using the
original Fisher method. Results indicate that variance analyses,
extreme effect sizes, and multivariate associations perform fairly
well to excellent in detecting fabricated data using raw data, while
digit analyses perform at chance levels. The two studies provide mixed
results on how the use of random number generators affects the
detection of data fabrication. Ultimately, I consider the variance
analyses, effect sizes, and multivariate associations valuable tools
to detect potential data anomalies in empirical (summary or raw)
data. However, I argue against widespread (possibly automated)
application of these tools, because some fabricated data may be
irregular in one aspect but not in another. Considering how violations
of the assumptions of fabrication detection methods may yield high
false positive or false negative probabilities, I recommend comparing
potentially fabricated data to genuine data on the same topic.
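
To give a flavor of what a variance analysis checks, the sketch below asks whether a set of reported group standard deviations is more homogeneous than sampling variability predicts, using a parametric bootstrap under homogeneous population variances. This is an illustrative simplification of the general idea, not the chapter's exact procedure.

```r
# Minimal sketch of a variance analysis: are the reported group standard
# deviations more similar to one another than chance predicts?
variance_check <- function(sds, n, n_sim = 10000,
                           pooled_sd = sqrt(mean(sds^2))) {
  observed <- var(sds)
  simulated <- replicate(n_sim, {
    # Sample SDs for groups of size n drawn from one population SD
    sim_sds <- pooled_sd * sqrt(rchisq(length(sds), df = n - 1) / (n - 1))
    var(sim_sds)
  })
  # Proportion of simulations with SDs at least as homogeneous as observed
  mean(simulated <= observed)
}

# Example: four groups of n = 25 with suspiciously similar SDs
set.seed(1)
variance_check(sds = c(1.01, 1.02, 1.00, 1.01), n = 25)
```
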
Chapter 7 tackles the issue of data extraction. It is common for
authors to communicate their results in graphical figures, but those
data are frequently unavailable for reanalysis. Reconstructing data
points from a figure manually requires the author to measure the
coordinates either on printed pages using a ruler, or from the display
screen using a cursor. This is time-consuming (often hours) and
error-prone, and limited by the precision of the display or
ruler. What is often not realised is that the data themselves are held
in the PDF document to much higher precision (usually 0.0-0.01
pixels), if the figure is stored in vector format. We developed alpha
software to automatically reconstruct data from vector figures and
tested it on funnel plots in the meta-analysis literature. Our results
indicate that reconstructing data from vector-based figures is
promising: I correctly extracted the data for 12 out of 24 funnel
plots (50%). However, I observed that vector-based figures are
relatively rare (15 out of 136 papers with funnel plots), and I
strongly urge publishers to provide more vector-based data figures in
the near future for the benefit of the scholarly community.
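
As a small illustration of the reconstruction step, the sketch below maps recovered drawing coordinates back to data values using two labeled axis ticks. The numbers are made up for illustration; the actual software additionally handles the extraction of the vector paths from the PDF.

```r
# Minimal sketch: once point coordinates are recovered from a figure's
# vector drawing commands, mapping them to data values only requires a
# linear calibration of each axis. All numbers below are made up.
calibrate_axis <- function(coord, coord_ref, value_ref) {
  # Two reference points (e.g., two labeled axis ticks) define the
  # linear mapping from drawing coordinates to data values
  slope <- diff(value_ref) / diff(coord_ref)
  value_ref[1] + (coord - coord_ref[1]) * slope
}

# x-axis ticks at drawing coordinates 100 and 400 correspond to the
# data values 0 and 1
calibrate_axis(c(150, 250, 390), coord_ref = c(100, 400),
               value_ref = c(0, 1))
```
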
Scholarly research faces threats to its sustainability in multiple
domains (access, incentives, reproducibility, inclusivity). In Chapter 8 I argue that "after-the-fact" research papers do not help remedy
these threats and actually cause some of them, because the chronology
of the research cycle is lost in a research paper. I propose giving up
the academic paper in favour of a digitally native "as-you-go"
alternative. In this design, modules of research outputs are
communicated along the way and are directly linked to each other to
form a network of outputs that can facilitate research
evaluation. This embeds chronology in the design of scholarly
communication and facilitates recognition of more diverse outputs that
go beyond the paper (e.g., code, materials). Moreover, using network
analysis to investigate the relations between linked outputs could
help align evaluation tools with evaluation questions. I illustrate
how such a modular "as-you-go" design of scholarly communication could
be structured and how network indicators could be computed to assist
in the evaluation process, with specific use cases for funders,
universities, and individual researchers.
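
To illustrate what such network indicators could look like, here is a minimal sketch using the igraph package on a made-up module network; the indicators shown are examples of my own choosing, not the specific indicators proposed in the chapter.

```r
# Minimal sketch: a made-up "as-you-go" module network, where each link
# points from a module to the earlier module it builds on. The igraph
# package is assumed to be installed; the indicators are illustrative.
library(igraph)

edges <- data.frame(
  from = c("design", "data", "analysis", "report"),
  to   = c("hypothesis", "design", "data", "analysis")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# How many later modules build directly on each module?
degree(g, mode = "in")

# Longest chain of linked modules, a rough proxy for the depth of the
# documented research chronology
diameter(g)
```
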
A scholarly communication system needs to register, distribute,
certify, archive, and incentivize knowledge production. Chapter 9
proposes that the current article-based system technically fulfills
these functions, but suboptimally. I propose a module-based
communication infrastructure that attempts to take a wider view of
these functions and optimize the fulfillment of the five functions of
scholarly communication. Scholarly modules are conceptualized as the
constituent parts of a research process as determined by a
researcher. These can be text, but also code, data, and any other
relevant piece of information. The chronology of these modules is
registered by iteratively linking them to each other, creating a provenance
record of parent and child modules (and a network of modules). These
scholarly modules are linked to scholarly profiles, creating a network
of profiles, and a network of profiles and their constituent
modules. All these scholarly modules would be communicated on the new
peer-to-peer Web protocol Dat
([datproject.org](https://datproject.org)), which provides a
decentralized register that is immutable, facilitates greater content
integrity through verification, and is open by design. Being open by
design would also allow diversity to arise in how content is consumed,
discovered, and evaluated. This initial proposal needs to be refined and
developed further based on technical developments of the Dat protocol
and its implementations, and discussions within the scholarly
community to evaluate the qualities claimed here. Nonetheless, a
minimal prototype is available today, showing that the proposal is
technically feasible.
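
To make the parent/child linking concrete, here is a minimal sketch of a provenance record in which each module stores identifiers for the modules it builds on; the fields and identifiers are illustrative only and are not part of the Dat protocol itself.

```r
# Minimal sketch of a provenance record: each module stores identifiers
# of its parent modules, so the chronology can be reconstructed by
# following these links. Fields and identifiers are illustrative only.
new_module <- function(id, content, parents = character(0)) {
  list(id = id, content = content, parents = parents,
       timestamp = Sys.time())
}

modules <- list(
  new_module("m1", "hypothesis"),
  new_module("m2", "study design", parents = "m1"),
  new_module("m3", "data", parents = "m2"),
  new_module("m4", "analysis code", parents = c("m2", "m3"))
)

# Reconstruct the chronology by listing each module with its parents
vapply(modules, function(m) {
  paste0(m$id, " <- [", paste(m$parents, collapse = ", "), "]")
}, character(1))
```
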