Should science abandon “statistical significance” or just stop misusing the concept?

In a recent Comment in Nature (20 March 2019; 567:305–307), Amrhein, Greenland, and McShane make a case for banishing the concept of “statistical significance” from scientific reasoning. Is this big change really called for? The idea of statistical significance came from Ronald A. Fisher, a founder of modern statistics, as a way to deal with the persistent problem of random variability in observations. In trying to understand the world, science often needs to determine whether some factor, say an experimental treatment, has an effect or not. In its simple classical form, the experimenter takes two groups of subjects from the same population but treats only one, expecting that a genuine treatment effect will show up as a difference between the two groups. However, because of random variability, any two groups, even untreated groups from the same population, will not be exactly the same. So how can we tell whether the treatment did anything? Fisher suggested that if the two groups were different enough according to some appropriate measure, it would be reasonable to suspect that the treatment was effective. For instance, if the observed difference was expected to occur only 1 time in 20 by random chance alone, this could be a “significant” difference and would justify further study of the treatment. This led to the idea of the P value, in particular to the P < 0.05 threshold that is standard in many fields. It all sounds very reasonable.
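As a purely illustrative sketch of this classical setup (the data below are simulated draws, not real measurements, and the 0.05 cutoff is simply the conventional threshold described above, not a recommendation):

```python
# An illustrative simulation of the classical two-group comparison; the data
# are random draws, and 0.05 is just the conventional cutoff discussed above.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=30)  # untreated group
treated = rng.normal(loc=11.5, scale=2.0, size=30)  # treated group, with a genuine effect built in

t_stat, p_value = ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
if p_value < 0.05:
    print("By convention, this difference would be called 'statistically significant'.")
else:
    print("By convention, this difference would be called 'not statistically significant'.")
```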

Amrhein et al. feel that things have gotten out of hand; the concepts have become so distorted that the entire notion of statistical significance must be discarded. Among other things, they argue that P values are misused to make rigid, dichotomous decisions, which was not Fisher’s intention. They ridicule the mistaken belief that a statistically insignificant P value means that there is “no difference” between the groups being compared, and the related notion that only statistically significant effects are “real” ones. In sum, they consider the misuse of the P value so rampant that it is time “for the entire concept of statistical significance to be abandoned.”

There is no doubt that Amrhein et al. have serious, legitimate concerns; however, their argument goes too far in some respects and is ultimately unconvincing. Several correspondents to Nature strongly disagree with Amrhein et al. on a number of points, and I’ll offer a few more. As Amrhein et al. are aware, decision-making is a complex matter of human psychology that goes beyond statistical practice. I believe the problems they bring up also reveal shortcomings in scientific training, especially regarding philosophical subtleties in the nature of science, and suggest that we can make major improvements by enriching scientific education and practice.

For instance, Amrhein et al. strongly oppose the use of P values in “categorizing” or “dichotomizing” – e.g., resting binary decisions on distinctions between so-called significant and non-significant test results. They do note exceptions, e.g., quality control in manufacturing, where categorization is mandatory (the gear is either acceptable or it isn’t), but they still feel that categorization should be avoided as much as possible. Instead, they say, we should favor a flexible, nuanced evaluative process that weighs degrees of “compatibility” between data and a hypothesis.

There is much to be said for keeping an open mind and interpreting data expansively rather than narrowly, but the authors’ view of dichotomous categorization is overly pinched. Many kinds of decisions in areas far removed from manufacturing quality control are dichotomous. Scientists must decide either to take action or not, despite having limited time, money, and so forth. We take in information, evaluate it, and then decide what to do: we decide between true and false hypotheses, which predictions to test and which to ignore, which experiment to do next, and so on. These are all dichotomous decisions. Thus the problem is not with categorization itself, but with how we categorize and the finality with which we take the results. In making these decisions, there is no obvious reason to exclude the relevant information that P values provide, exactly as Fisher envisioned.

The second major issue involves psychological matters. The article implies that the concept of statistical significance leads automatically and unavoidably to misinterpretation. Its proposed remedy is to identify “compatibility intervals” for data and teach scientists to interpret them “in a way that avoids overconfidence.” But this recommendation is too vague to be helpful. If scientists can learn to “avoid overconfidence” in interpreting compatibility intervals, why couldn’t they learn to stop misinterpreting P values? In their Comment, the authors present data showing that ~50% of scientists already do interpret P values appropriately. Is there evidence that the rest are untrainable? Rather than wiping the slate clean and starting from scratch with compatibility intervals, surely we can do a better job of educating scientists and incentivizing the statistically correct interpretation of P values?

In fact, the complaint expressed by Amrhein et al. appears to be more about deficiencies in scientists’ understanding of the nature of science than about concrete statistical matters. The authors flatly reject the use of P values “to decide whether a result supports or refutes a scientific hypothesis,” which raises an explicitly philosophical issue about the relationship between data and hypotheses. Abolishing statistical significance won’t address any of these issues. It may be more productive to expose science students, scientists, and the public to the philosophical issues that we usually skip over. Karl Popper was the most influential philosopher of science of the 20th century, and I suggest that his work is a good place to start.

Popper was deeply affected by the problem of uncertainty: how can science make progress in understanding nature in the face of such uncertainty? Most people have heard of Popper’s principle of “falsification,” i.e., that we can’t prove that a hypothesis is true, but, with care, we can determine whether it is false. Therefore, he says, we should design experiments that are capable of revealing that a hypothesis is wrong if it is truly wrong. There are always multiple ways of explaining a natural phenomenon, so finding data compatible with an explanation is neither very difficult nor very informative. We should aim to challenge a hypothesis by probing for its weaknesses, not try to “support” it. Simply taking this principle seriously should allay some of Amrhein et al.’s concerns.

But what happens if we test a hypothesis and find that it is not wrong? Logically, we can’t say that it is “truer,” or even “more likely to be true,” after the testing, because we still don’t know what the real truth is. For Popper, nothing special happens to a hypothesis that passes a test; it remains a possibly true explanation. As a purely practical matter, of course, it is reasonable to take rigorously tested hypotheses as the basis for action (such hypotheses constitute the scientific “facts” that we have to work with). Nevertheless, Popper does recognize that this reasoning, while solid, has an awkward consequence. Referring to “tested-and-not-falsified” hypotheses is indisputably unsatisfying. We seem to have a deep-seated psychological urge to conclude that we have actually acquired new positive knowledge as a result of a test. This is what leads to our misuse of the language regarding “statistical significance” and “support” for hypotheses. The problem again lies not with the statistical concept, but with our careless language combined with this psychological urge.

Zooming out, our appreciation of “the nature of science” also influences the ways in which “statistical significance” affects us. Popper often underscores the point that “falsification is never final”: like all scientific conclusions, our judgement may change as more information becomes available. Science, for Popper, is a constant process of trial and error. This means that however we interpret our results – with either compatibility analysis or P values – the end result is not a fact etched in stone. Hence, even though they use different terms, the spirit of Popper’s philosophy overlaps with that of Amrhein et al. Notably, both demand a critical yet flexible evaluation of data, and both reject the notion that experiments can “support” hypotheses. Yet Popper’s program offers practical advice about how scientists should proceed: we should work to ensure that our hypotheses are not wrong, not to find data that are compatible with them.

Finally, I turn to a topic that Amrhein et al. don’t overtly consider, but which has implications for their arguments. They reserve their most intense ire for dichotomous decisions that are based on single results, despite the fact that the conclusions of hypothesis-testing basic laboratory research almost never rest on single results. Rather, important scientific conclusions are based on an aggregate assessment of the outcomes of many tests. The overall assessment is more meaningful than the outcome of any single test. Thus, even if every one of the tests were assessed at P = 0.05, it would make no sense to treat the probability of the net outcome of the group of related tests as P = 0.05. This is a point that worries about single P values miss: a single P value feeds into the overall conclusion, but does not determine it. Regrettably, science has not adopted a standard way of combining P values into a comprehensive quantitative estimate of the overall outcome. Yet there are actually several statistical alternatives for doing this (see https://www.biorxiv.org/content/10.1101/537365v1 for a note outlining one of them, Fisher’s Method, which was devised by R.A. Fisher himself). The main point here is that, to the extent that science does not place major weight on single tests, some of the worry about individual tests expressed by Amrhein et al. is overblown.
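To make the mechanics concrete, here is a minimal sketch of Fisher’s Method in Python; the individual P values are invented purely for illustration. The method converts k P values into the statistic -2 * sum(ln(p_i)), which, under the joint null hypothesis, follows a chi-square distribution with 2k degrees of freedom; SciPy’s combine_pvalues implements this calculation.

```python
# A minimal sketch of Fisher's Method using SciPy; the P values below are
# invented purely for illustration, not taken from any real study.
from scipy.stats import combine_pvalues

# Hypothetical P values from three tests of predictions of the same hypothesis.
p_values = [0.03, 0.082, 0.20]

# Fisher's statistic is -2 * sum(ln(p_i)); under the joint null hypothesis it
# follows a chi-square distribution with 2k degrees of freedom (k = 3 here).
statistic, combined_p = combine_pvalues(p_values, method="fisher")
print(f"chi-square statistic = {statistic:.2f}, combined P = {combined_p:.4f}")
```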

In quantitatively summarizing the likelihood of a collection of results, Fisher’s Method does not assume or require any level of statistical significance for the individual P values that it combines. If a test result has P = 0.082, then 0.082 goes into the calculation. This alone has two benefits: it helps deflect the excessive focus on individual P values, and it ensures the inclusion of results presently considered insignificant, which is one of Amrhein et al.’s objectives. The Method should also remove the temptation to conclude that there is “no difference” between two groups just because a particular P value is “insignificant.” Thus the inclusiveness of Fisher’s Method accomplishes the objective of taking evidence of varying degrees of strength into account.
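As a hypothetical illustration of this inclusiveness (the numbers are again invented), compare combining all of the P values with what happens if the results above the 0.05 cutoff are simply discarded as if they showed “no difference”:

```python
# Hypothetical comparison: Fisher's Method uses every P value as-is, whereas
# discarding the "insignificant" results throws away real evidence.
from scipy.stats import combine_pvalues

all_results = [0.03, 0.082, 0.20]            # every test of the hypothesis
_, p_all = combine_pvalues(all_results, method="fisher")

kept = [p for p in all_results if p < 0.05]  # what survives a rigid 0.05 filter
_, p_filtered = combine_pvalues(kept, method="fisher")

print(f"All results combined:            P = {p_all:.4f}")
print(f"Only 'significant' results kept: P = {p_filtered:.4f}")
```

In this made-up example, the combined P that includes the “insignificant” values (about 0.02) is actually smaller than the lone “significant” one (0.03), a reminder that weak results still carry evidence.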

An additional benefit of Fisher’s Method is that it demands careful thought about the internal logic of a scientific project. Only results that genuinely test predictions of the same hypothesis can be combined; extraneous observations and tests of other hypotheses are excluded. It will be up to a given research community to decide how to evaluate the parameter coming out of the Fisher’s Method calculation (I’ve suggested “PFM,” as Fisher apparently didn’t name it). In any case, PFM should reduce or eliminate errors in interpretation that can arise because of random sample variability, which is also one of Amrhein et al.’s concerns. Can Fisher’s Method be improperly or unethically manipulated? Of course; any system can be gamed. We’re not hoping for perfection; we are hoping to achieve thoughtful, balanced, sound ways of evaluating and presenting the conclusions of experimental tests. Instilling an appreciation of Karl Popper’s philosophy, promoting a clearer understanding of how basic experimental science is done, and adopting ways of aggregating the results of complex, multi-part studies, such as Fisher’s Method, would all be steps in the right direction.