Why the Way We Use Statistical Significance Has Created a Crisis in Science

31/03/2019


In a 1919 paper titled ‘Mathematical vs. Scientific Significance’, Edwin G. Boring, an American psychologist, tried to explain why basing scientific intuition on mathematical results alone was misguided. He pointed out that “scientific generalisation is a broader question than mathematical description.”

One hundred years later, the same argument against the flattening of science to one magical number has reared its head once more. In March 2019, the American Statistical Association (ASA) devoted a special issue of its journal The American Statistician to the question of how to move to a world beyond “p < 0.05”. At the same time, three scientists from different fields, Valentin Amrhein, Sander Greenland and Blake McShane, wrote an editorial in the journal Nature, backed by the signatures of 800 colleagues, calling for “statistical significance” to be retired.

These two events show that both expert analysis and popular sentiment support the need for a paradigm shift in how we talk about statistics and scientific evidence. And after more than a century of debate, the tide might finally be turning.

A p-value is a number used in statistical hypothesis testing. When researchers want to test a claim against some data, they start from its counterclaim, called the null hypothesis (for example, that a drug has no effect), and calculate how probable it would be to see data at least as extreme as theirs if that counterclaim were true. That probability is the p-value. By convention, the counterclaim is rejected, and the result declared statistically significant, if p is less than or equal to an arbitrarily defined number, such as 0.05 or 0.01.
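To make the idea concrete, here is a minimal sketch of one such calculation. The scenario is an assumption chosen for illustration and does not come from any of the papers discussed: the counterclaim (the null hypothesis) is that a coin is fair, and the data are 15 heads in 20 tosses. The function name `binomial_p_value` is hypothetical.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: the probability of seeing at least `successes`
    heads in `trials` tosses if the coin really lands heads with
    probability `p_null` (the null hypothesis)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Illustrative data (assumed): 15 heads in 20 tosses of a supposedly fair coin.
p = binomial_p_value(successes=15, trials=20)
print(f"p = {p:.4f}")  # prints: p = 0.0207
```

Here p falls below 0.05, so the result would conventionally be labelled “statistically significant”. Note what the number does and does not say: it is the probability of data this lopsided if the coin were fair, not the probability that the coin is fair.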


The case against p-values and statistical significance is not a criticism of the concepts themselves but of their misuse. As the ASA editorial states:

… no p-value can reveal the plausibility, presence, truth or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false or unimportant. Yet the dichotomisation into ‘significant’ and ‘not significant’ is taken as an imprimatur of authority on these characteristics.

P-values have not only become strong evidence in favour of or against a theory; they have also morphed into a mandatory requirement to be published in a journal of repute. They have become a threshold that separates results into two neat categories: a dichotomy of ‘yes’ or ‘no’, ‘valid’ or ‘invalid’, ‘evidence’ or ‘trash’.

One reason for this situation is poor statistical literacy among scientists. In 2016, when the ASA first came out with a statement on p-values, Andrew Gelman, a statistician at Columbia University, New York, wrote, “Statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an ‘uncertainty laundering’ that begins with data and concludes with success as measured by statistical significance.”

But placing the burden on individual scientists would ignore the huge role that journals, policymakers and funding agencies have played. As the trio of authors wrote in the Nature editorial:

The false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature… [Any] discussion that focuses on estimates chosen for their significance will be biased.

These bodies incentivise the use of these techniques through their selection mechanisms. So when they ignore the increasing clamour for reform, they perpetuate bad science. In areas like the biomedical sciences, where the need for a ‘yes’ or ‘no’ answer is greatest, scientists’ abuse of threshold p-values as decision-making tools has the most potential for harm.
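The distortion the Nature authors describe, in which estimates chosen for their significance are biased, can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis from either editorial: the true effect (0.2), the standard error (0.15), the number of studies and the function name `simulate_significance_filter` are all hypothetical values chosen for illustration.

```python
import random
from statistics import mean

def simulate_significance_filter(true_effect=0.2, std_error=0.15,
                                 n_studies=100_000, seed=0):
    """Simulate many studies of the same effect and compare averages.

    Each study's estimate is drawn from a normal distribution centred on
    the true effect. A study counts as 'significant' when its estimate is
    more than 1.96 standard errors from zero (two-sided p < 0.05).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) / std_error > 1.96]
    return mean(estimates), mean(significant), len(significant) / n_studies

mean_all, mean_significant, share_significant = simulate_significance_filter()
print("true effect:                   0.20")
print(f"mean of all estimates:         {mean_all:.2f}")
print(f"mean of significant estimates: {mean_significant:.2f}")
print(f"share reaching significance:   {share_significant:.0%}")
```

Averaged over every simulated study, the estimates sit close to the true effect. Averaged over only the studies that cross the threshold, they overshoot it, because with these assumed numbers an estimate has to be well above the true effect to count as significant at all. A literature that publishes only the second group inherits that exaggeration.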

While there is no call to eliminate p-values entirely, the way scientists use them must change. Instead of a binary pass-or-fail, scientists are encouraged to report them as a continuous quantity (like p = 0.025) as well as describe in clear language what scientific meaning can be interpreted from this value.

When they use more descriptive language, scientists – and science – can make room for more nuance and also better communicate what weight they place on their result. Moreover, eliminating p-values altogether might not even work: there is some evidence that doing so could lead scientists to overstate the meaning of their results.


The special issue of The American Statistician offers some concrete alternatives, including using second-generation p-values, “analysis of credibility” that takes confidence intervals and other factors into account, and complementing p-values with measures of “false positive risk”. Alternatively, various scientists have discussed the idea of manuscripts being evaluated in a manner that is “results-blind”.
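As a rough illustration of the idea behind a “false positive risk”, the sketch below asks: of all the results that reach significance, what fraction come from effects that are not real? This is a simplified textbook-style calculation and may differ from the exact measures proposed in the special issue. The inputs (a 10% prior chance that a tested effect is real, 80% power, a 0.05 threshold) and the function name `false_positive_risk` are assumptions made for illustration.

```python
def false_positive_risk(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of 'significant' results that are false positives.

    prior: assumed probability that a tested effect is real
    power: probability that a real effect reaches significance
    alpha: significance threshold, i.e. the probability that a
           non-existent effect reaches significance anyway
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (false_positives + true_positives)

# Illustrative inputs (assumed): 1 in 10 tested effects is real, 80% power.
risk = false_positive_risk(prior=0.1, power=0.8)
print(f"false positive risk = {risk:.2f}")  # prints: false positive risk = 0.36
```

Under these assumed inputs, roughly a third of “significant” findings would be false, even though every one of them passed the 0.05 threshold. The figure depends heavily on the prior, which is why such measures are offered as a complement to the p-value rather than a replacement for judgement.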

This would put the emphasis on research design and importance rather than on whether the result was spectacular, null or anything in between. The hope is that, as a result, the focus would return to ensuring rigorous processes rather than picking out exaggerated outcomes.

Of course, none of these ideas claim to be silver bullets. They are simply ways of moving in the right direction. It is clear from the recent editorials that there is much healthy disagreement about how to go forward – and that is undoubtedly a good thing.

Interestingly, the ASA editorial’s strongest suggestion for how to go forward is a seven-word mantra: “Accept uncertainty. Be thoughtful, open and modest.” In their unpacking of this simple message, the authors attack the current structure of scientific publishing and research incentives, call for a greater commitment to open science and reproducible research, and criticise the hype and bombast that is the enemy of careful progress.

It is a rallying cry for serious reform, not only in the formal institutions of science publishing and funding but also in the more informal cultures of research practice in universities and labs everywhere.

Thomas Manuel is the winner of The Hindu Playwright Award 2016.