P-Values and Throwing Out the Baby With the Bath Water

So last week, the science bloggysphere was briefly abuzz (atwitter?) about the decision by the journal Basic and Applied Social Psychology (BASP) to ban p-value statistics from their articles:

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would not be required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.

They are currently on the fence regarding Bayesian/likelihood approaches, but seem inclined, by my reading, to not like them very much either. What they do call for are more descriptive statistics:

BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem. However, we will stop short of requiring particular sample sizes, because it is possible to imagine circumstances where more typical sample sizes might be justifiable.
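
For what it's worth, the "increasingly stable" point is just the sampling error of a descriptive statistic shrinking roughly as 1/sqrt(n). Here's a quick sanity check with made-up normal data (the mean, spread, and sample sizes are arbitrary choices of mine, not anything from the editorial):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sampling error of the sample mean shrinks roughly as 1/sqrt(n):
    # re-run the "study" 2,000 times at each sample size and see how
    # much the estimated mean bounces around.
    for n in (10, 100, 1_000, 10_000):
        means = [rng.normal(loc=50, scale=10, size=n).mean() for _ in range(2_000)]
        print(f"n = {n:>6}: SD of sample means = {np.std(means):.3f} "
              f"(theory: {10 / np.sqrt(n):.3f})")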

I get their emphasis on descriptive statistics and effect size, and I also agree with Potnia Theron:

My simple point was that effect size is as important as significance. I’ve talked about this some, before. Something can be significant (because of pseudo-replication, over sampling, etc) but that the effect size is trivial, or less than your resolution of measurement. For (made-up) example, say you find a difference in a variable measuring “age at disease onset” between males and females, but that difference is on the order of days, i.e., males 60 years, 2 months and 3 days, females, 60 years, 2 months and 4 days. What does that mean in any kind of measure of effect size other than indicating there is probably some other bias lurking in your data. This one is obvious because we have an intuitive sense of what the data mean. Other problems are often less obvious.
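
To put made-up numbers on that (a minimal sketch, not anything from Potnia's post): scipy can run a t-test from summary statistics alone, which makes it easy to see that a one-day difference in mean onset age will eventually cross p < 0.05 if you keep piling on samples, while the effect size stays trivial the whole time.

    from scipy import stats

    # Invented summary numbers echoing the example above: mean age at onset
    # differs by one day against a ~5-year spread (everything in days).
    mean_m, mean_f = 60 * 365 + 63, 60 * 365 + 64   # one-day difference
    sd = 5 * 365                                     # spread in onset age
    d = (mean_f - mean_m) / sd                       # Cohen's d, about 0.0005

    for n in (1_000, 100_000, 10_000_000, 50_000_000):
        t, p = stats.ttest_ind_from_stats(mean_m, sd, n, mean_f, sd, n)
        print(f"n per group = {n:>10,}   Cohen's d = {d:.5f}   p = {p:.3g}")

Only the absurdly large samples push p below 0.05, and the effect size never budges; that's the "significant but trivial" trap in a nutshell.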

But here's what I don't get. Even when you're doing something very simple, such as determining whether two treatments are different (e.g., with and without a drug), it's helpful to know how likely you would be to see a pattern at least as extreme as the one you observed if only random chance were at work ("The p value is the probability to obtain an effect equal to or more extreme than the one observed presuming the null hypothesis of no effect is true"). If your p-value is 0.4, that result is probably not worth following up. But when it comes to hypothesis generation, or to identifying patterns that could lead to hypotheses, null hypothesis significance testing has utility, though it shouldn't be slavishly followed, and you shouldn't necessarily abandon a result, especially when the power of your test is low.
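
The low-power point is easy to demonstrate with a small simulation (the effect size, sample size, and cutoff here are my own arbitrary choices, not anything from BASP):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # A real but modest drug effect (Cohen's d = 0.5) tested with only
    # 10 subjects per arm: count how often p < 0.05 across many experiments.
    n_per_arm, d, n_sims = 10, 0.5, 5_000
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(d, 1.0, n_per_arm)
        if stats.ttest_ind(control, treated).pvalue < 0.05:
            hits += 1

    print(f"Power is roughly {hits / n_sims:.2f}")  # around 0.2: most runs miss a real effect

At that power, a "non-significant" run tells you very little about whether the effect is real, which is exactly why you shouldn't toss a result just because p came out above the magic threshold.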

They also don't seem keen on Bayesian/likelihood approaches, even though those are useful when you're trying to assess different hypotheses, such as "if we start fishing again, will my fishery grow, remain stable, or shrink?" In many cases, it's difficult to conclusively exclude one or more outcomes, and some sort of quantitative evaluation procedure is needed.
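
Here's a crude sketch of what I mean by a quantitative evaluation of competing hypotheses (the fishery numbers, growth rates, and noise level are all invented for illustration): score "shrinking," "stable," and "growing" models against the same observations and see how the weight of evidence splits, instead of forcing a single accept/reject decision.

    import numpy as np
    from scipy import stats

    # Invented survey data: log relative stock size over six years.
    years = np.arange(6)
    obs = np.array([0.00, 0.04, 0.02, 0.09, 0.07, 0.13])  # made up
    sigma = 0.10                                           # assumed survey noise

    # Three competing hypotheses about the annual growth rate (log scale).
    hypotheses = {"shrinking": -0.03, "stable": 0.0, "growing": 0.03}

    # Log-likelihood of the data under each hypothesis, assuming Gaussian noise.
    logliks = {
        name: stats.norm.logpdf(obs, loc=r * years, scale=sigma).sum()
        for name, r in hypotheses.items()
    }

    # Normalize to relative weights (equivalent to a flat prior over the three).
    ll = np.array(list(logliks.values()))
    weights = np.exp(ll - ll.max())   # subtract the max for numerical stability
    weights /= weights.sum()
    for name, w in zip(logliks, weights):
        print(f"{name:>9}: relative weight {w:.2f}")

With these invented numbers the "growing" model gets most of the weight, but "stable" hangs around and can't be conclusively excluded, which is exactly the kind of nuance a bare reject/don't-reject decision throws away.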

I suppose if your entire experimental methodology is solely looking for differences among treatments, then maybe this will work. But biology is often more complicated than that.

It will be interesting to see how things turn out at BASP.

4 Responses to P-Values and Throwing Out the Baby With the Bath Water

  1. Pingback: Down with P-values! | The Jackson Lab Blog

  2. Robert L Bell says:

    I am following this discussion with great interest, as I am in the camp that holds p values to be tools – nothing more, nothing less. The scandal lies in the number of papers published in the serious literature where the investigators clearly do not understand what they are doing and the referees do not understand enough to stop them. The resolution lies in more effective education (and perhaps the culling of the incurably inept from the research herd): merely banning one tool in the hope that people will magically understand better tools is an act of desperation.

  3. Tracy Lightcap says:

    They’re wrong, pure and simple. What they are doing is stopping researchers from presenting frequentist hypothesis testing because they won’t ask those researchers to:

    a. Distinguish between Fisher-style and Neyman-Pearson hypothesis testing (spoiler alert: they’re very different) and

    b. Be more transparent about presenting causal effects.

    They could have done this pretty easily with a series of editorial requirements, but they decided to go all "we-like-you-to-do-this-but-we-won't-require-it" until they saw clearly that nobody was paying attention. Then they decided to simply stop contributors from submitting anything that does things they don't like. Of course, the easiest thing to do would be to have a, you know, editorial policy in the first place.

    I’m a dyed-in-the-wool frequentist myself; I think this is the only way to actually test scientific models. But I’m also a political scientist and, as a consequence, I don’t have to put up with this foolishness. Good.

  4. Pingback: P-values deemed no longer significant | Brokering Closure
