With the recent release of the results from the Reproducibility Project–an effort to replicate 100 experimental findings in psychology–there has been a lot of commentary and coverage about the finding that the majority of the studies failed to replicate. Most of the discussion I’ve seen has focused on issues such as selective reporting of positive results (and failure to report negative ones) and various forms of ‘p-hacking’–tweaking your analysis until you get a p-value below 0.05, the conventional cutoff for statistical significance. (Strictly speaking, a p-value below 0.05 means that, if there were no real effect, data at least this extreme would turn up less than five percent of the time–not, as it’s often misread, that there is a less than five percent chance the observed effect is due to random chance.)
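To make the p-hacking mechanism concrete, here’s a minimal simulation (my sketch, not anything from the Reproducibility Project itself) of one common form–‘optional stopping’, where you keep collecting data and re-testing until the p-value dips below 0.05. Even when the true effect is exactly zero, peeking inflates the false-positive rate well above the nominal five percent. The sample sizes, peek schedule, and the simple z-test are all assumptions chosen for illustration:

```python
import math
import random

def z_test_p(sample):
    """Two-sided p-value for H0: mean == 0, with known sd == 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # Normal CDF via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_experiment(hack, rng, start=10, step=10, max_n=100):
    """Simulate one study under the null (true effect = 0).

    hack=False: collect all max_n observations, test once.
    hack=True:  peek after every `step` observations and stop
                as soon as p < 0.05 (optional stopping).
    Returns True if the study 'finds' a significant effect.
    """
    data = [rng.gauss(0, 1) for _ in range(start)]
    if not hack:
        data += [rng.gauss(0, 1) for _ in range(max_n - start)]
        return z_test_p(data) < 0.05
    while True:
        if z_test_p(data) < 0.05:
            return True
        if len(data) >= max_n:
            return False
        data += [rng.gauss(0, 1) for _ in range(step)]

rng = random.Random(42)
trials = 2000
honest = sum(one_experiment(False, rng) for _ in range(trials)) / trials
hacked = sum(one_experiment(True, rng) for _ in range(trials)) / trials
print(f"false-positive rate, fixed n:           {honest:.3f}")
print(f"false-positive rate, optional stopping: {hacked:.3f}")
```

With ten peeks per study, the stopping-rule version typically ‘finds’ a nonexistent effect two to four times as often as the honest version–which is the whole problem.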
P-hacking could be part of the explanation, along with the Decline Effect. It doesn’t seem to be an issue of ‘pre-treating’ the data, as the analyses appear to have been performed using the same methods. But I’m always suspicious of pat, one-size-fits-all answers.
The less exciting, more difficult, yet far more interesting explanation–and one which might improve the quality of the science–has to do with methodology. To understand why two supposed replicates differed, we have to understand the experimental details, which is the antithesis of one-size-fits-all punditry. One methodological explanation is that what we thought were identical methods actually differed in subtle (to us, not to the test subjects) ways. That’s not really a problem–we just have to develop better techniques and control the variables more tightly. This makes science better.
The more disturbing possibility is that some of these are inherently variable experiments. In some theoretical sense, we could minimize variability if we understood every source of variation, but in reality that’s not possible, so some experiments are simply hard to reproduce. To use a genomics example (cuz everyone picks on psychology), the dirty secret of genome sequencing (for the cognoscenti, I’m referring to Illumina sequencing) is that we don’t get the same bacterial sequence twice. We get something really close: the genomic content will differ by a small fraction (two replicates will differ by ~0.0001%, give or take), and across the regions the two replicates do share, there will often be a couple of errors per million bases (there are ways to reduce the false positives among those errors). This level of variability–or ‘lack’ of reproducibility–is tolerable. But some fields–for that matter, some methods–might, at least today, yield much more variable and less reproducible results, and there isn’t a whole lot we can do about that. To the extent we can do something, the fixes will be very domain-specific.
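To put those rates on a concrete scale, here’s a back-of-envelope calculation. The 5 Mb genome size is my assumption (a typical bacterium); the two rates are the ones quoted above:

```python
# Back-of-envelope scale of sequencing irreproducibility.
genome_size = 5_000_000           # bases; assumption: typical bacterial genome
content_diff_frac = 0.0001 / 100  # ~0.0001% of genomic content differs
errors_per_mb = 2                 # ~a couple of errors per million shared bases

differing_bases = genome_size * content_diff_frac
expected_errors = (genome_size / 1_000_000) * errors_per_mb

print(f"content differing between replicates: ~{differing_bases:.0f} bases")
print(f"expected base-call discrepancies:     ~{expected_errors:.0f}")
```

So two replicate runs of a five-million-base genome disagree at roughly a dozen positions–which is why this level of irreproducibility is tolerable.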
Put another way, every unreproducible experiment might be unreproducible in its own way.
Which doesn’t make for very easy punditry.
I don’t think it’s an either-or: some problems may be specific to individual experiments, while others are probably overarching. I do think p-values, along with the default settings for other statistical controls, are part of the problem. I’m not expecting psychology, or medicine, to be as “rigorous” as physics, but why shouldn’t the conventional p-value threshold move from, say, 0.05 to 0.03?
That said, methodology may also be part of the problem. And here it could be multiple methodological issues: too small or too narrow a class of participants in a study; in psych, lack of an adequate definition of the thing under study; WEIRD issues (subject pools that are Western, Educated, Industrialized, Rich, and Democratic); or more.
I do agree that things are usually multi-causal. That’s just a riff on Idries Shah and his “there are never just two sides to an issue” aphorism, which deserves to be a lot better known.