The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.
They are currently on the fence regarding Bayesian/likelihood approaches but, by my reading, seem inclined not to like them much either. What they do call for is more descriptive statistics:
BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem. However, we will stop short of requiring particular sample sizes, because it is possible to imagine circumstances where more typical sample sizes might be justifiable.
My simple point was that effect size is as important as significance; I've talked about this before. Something can be significant (because of pseudo-replication, oversampling, etc.) while the effect size is trivial, or smaller than your resolution of measurement. For a (made-up) example, say you find a difference in a variable measuring "age at disease onset" between males and females, but that difference is on the order of days: males, 60 years, 2 months, and 3 days; females, 60 years, 2 months, and 4 days. What does that mean by any measure of effect size, other than indicating there is probably some other bias lurking in your data? This one is obvious because we have an intuitive sense of what the data mean; other problems are often less obvious.
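To make that concrete, here is a minimal sketch (all numbers invented to match the made-up example above, including the absurdly large sample) of how a one-day difference becomes "significant" once the sample is big enough, while the effect size stays negligible:

```python
from scipy import stats

# All numbers invented: a one-day difference in mean age at onset
# (measured in days), a ~10-year SD, and an absurdly large sample.
n = 200_000_000                      # per-group sample size
sd = 3650.0                          # SD of onset age, in days
mean_m, mean_f = 21975.0, 21976.0    # means differ by a single day

res = stats.ttest_ind_from_stats(mean_m, sd, n, mean_f, sd, n)
d = (mean_f - mean_m) / sd           # Cohen's d
print(f"p = {res.pvalue:.3g}, Cohen's d = {d:.6f}")
# p comes out around 0.006, yet d is about 0.0003: statistically
# detectable, scientifically meaningless.
```

The p-value alone says "real difference"; the effect size says "who cares," which is exactly why BASP is right that descriptive statistics matter, even if banning the test outright is another question.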
But here's what I don't get. Even when you're doing something very simple, such as determining whether two treatments differ (e.g., with and without a drug), it's helpful to know how likely a pattern at least as extreme as the one you observed would be if only chance were operating ("The p value is the probability to obtain an effect equal to or more extreme than the one observed presuming the null hypothesis of no effect is true"). If your p-value is 0.4, that result is probably not worth following up. And when it comes to hypothesis generation, or identifying patterns that could lead to hypotheses, null hypothesis significance testing has utility, though it shouldn't be slavishly followed, and you shouldn't necessarily abandon a result when the power of your test is low.
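As a rough illustration of that last point, here is a hedged sketch (fabricated data, an assumed true effect of d = 0.25, and made-up group sizes) of checking a test's power before discarding a null result:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 15)   # without the drug
treated = rng.normal(10.5, 2.0, 15)   # with the drug; true effect is d = 0.25

t, p = stats.ttest_ind(control, treated)
print(f"p = {p:.2f}")                 # quite possibly well above 0.05

# Power to detect a modest effect (d = 0.25) with only 15 per group:
power = TTestIndPower().solve_power(effect_size=0.25, nobs1=15, alpha=0.05)
print(f"power = {power:.2f}")         # about 0.10: a null result here is
                                      # weak grounds for abandoning the idea
```

With power around 0.10, a non-significant result tells you almost nothing; the honest move is to collect more data, not to declare the effect absent.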
They also don't seem keen on Bayesian/likelihood approaches, which is a problem when you're trying to assess competing hypotheses, such as "if we start fishing again, will my fishery grow, remain stable, or shrink?" In many cases, it's difficult to conclusively exclude one or more outcomes, and some sort of quantitative evaluation procedure is needed.
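Here is one minimal way such an evaluation might look: a toy sketch with invented stock numbers and an assumed Gaussian observation model, scoring the three trends by their posterior probabilities under a flat prior:

```python
import numpy as np
from scipy import stats

# Hypothetical annual stock indices after reopening the fishery:
obs = np.array([100.0, 103.0, 99.0, 106.0, 104.0])
years = np.arange(len(obs))
noise_sd = 5.0                                   # assumed measurement noise

# Each hypothesis is a linear trend, in index units per year:
hypotheses = {"grow": 2.0, "stable": 0.0, "shrink": -2.0}

# Log-likelihood of the data under each trend, starting from 100:
logliks = {
    name: stats.norm.logpdf(obs, loc=100.0 + slope * years, scale=noise_sd).sum()
    for name, slope in hypotheses.items()
}

# Flat prior, so posteriors are just normalized likelihoods:
logs = np.array(list(logliks.values()))
post = np.exp(logs - logs.max())
post /= post.sum()

for name, pr in zip(logliks, post):
    print(f"P({name} | data) = {pr:.2f}")
# With these toy numbers, "shrink" is effectively ruled out while "grow"
# and "stable" both remain in play: a graded answer that no single
# reject/fail-to-reject decision provides.
```

The point isn't this particular model; it's that weighing several hypotheses against each other requires some quantitative machinery, and dismissing both NHSTP and Bayesian tools leaves you with descriptive statistics alone.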
I suppose if your entire experimental methodology is solely looking for differences among treatments, then maybe this will work. But biology is often more complicated than that.
It will be interesting to see how things turn out at BASP.