A couple of weeks ago, Jonah Lehrer wrote about the Decline Effect, where the support for a scientific claim often tends to decrease or even disappear over time (ZOMG! TEH SCIENTISMZ R FALSE!). There’s been a lot of discussion explaining why we see this effect and how TEH SCIENTISMZ are doing ok: PZ Myers gives a good overview, and David Gorski and Steven Novella also address this topic. Monday, a long-time reader asked me about the article, and I raised another issue that hadn’t–or so I thought–been broached:
There’s also a sample size issue: if the effect should be weak and sample sizes aren’t that large, any statistically significant result will likely be spurious (i.e., the result you see is probably a statistical fluke and not the biology, since the biology would result in a much weaker difference that statistically you can’t detect). The way around this is either retrospective power tests (i.e., trying to figure out what you could actually detect and determining if your result makes any sense biologically) or a Bayesian approach (i.e., given a prediction of a weak effect, I would expect to see outcome X; the only problem here is that your prior needs to be appropriate).
Fortunately, by way of a follow-up piece by Lehrer, I stumbled across a similar explanation by Andrew Gelman. Unfortunately, Lehrer doesn’t really describe what Gelman is getting at.
Gelman (and he has some good slides over at his post) is claiming, correctly, that if the effect is weak and you don’t have enough samples (e.g., subjects enrolled in the study), any statistically significant result will be so much greater than what the biology would provide that it’s probably spurious. You might get lucky and have a spurious result that points in the same direction as the real phenomenon, but that’s just luck.
Let me give an example. If I flip an evenly-weighted coin 100 times, 95% of the time, I will see 43-57 heads. So if I flip the coin 100 times, and get 55 heads and 45 tails, I can’t conclude that my coin is biased. Now suppose I nick the coin to give it a slight bias: 51% heads, 49% tails. Let’s say this time I also wind up with 55 heads and 45 tails. I still can’t conclude that the coin is biased*. If I flipped the coin eleventy gajillion times (i.e., lots), and wound up with a ratio of 55 heads:45 tails, I could conclude that the coin is biased. But in the first case of 100 flips of a slightly biased coin, there is a ~6.6% chance that I could wind up with a result of 58 heads or more**, a result that also happens to be statistically significant. But that’s a fluke: most of the time, the coin will show a weak bias, about which I will be unable to conclude anything (other than that my thumb, after flipping a coin a bunch of times, gets very tired).
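For anyone who wants to check the coin arithmetic, here is a minimal sketch that computes exact binomial probabilities for 100 flips using only the Python standard library. The helper names (`binom_pmf`, `prob_between`) are made up for illustration, and the cutoffs (43-57, 58 or more, 42 or fewer) are simply the ones used in the example above. It prints the exact figures so they can be compared with the rounded ones quoted in the text, which they need not match.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_between(lo, hi, n, p):
    """Probability that the number of heads lands in [lo, hi], inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

n = 100

# Fair coin: how often does the head count land in a central range?
fair_43_57 = prob_between(43, 57, n, 0.5)
fair_40_60 = prob_between(40, 60, n, 0.5)

# Nicked coin (51% heads): chance of each "extreme" outcome.
upper_tail = prob_between(58, n, n, 0.51)   # 58 or more heads
lower_tail = prob_between(0, 42, n, 0.51)   # 42 or fewer heads

# Of the extreme outcomes, what share points the wrong way (too few heads)?
wrong_direction_share = lower_tail / (upper_tail + lower_tail)

print(f"fair coin, 43-57 heads: {fair_43_57:.3f}")
print(f"fair coin, 40-60 heads: {fair_40_60:.3f}")
print(f"51% coin, 58+ heads:    {upper_tail:.3f}")
print(f"51% coin, 42- heads:    {lower_tail:.3f}")
print(f"share of extreme results in the wrong direction: {wrong_direction_share:.2f}")
```

Changing `n` to something much larger shows how quickly the same 51% bias becomes distinguishable from a fair coin.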
In the coin-flipping example above, I know a priori it’s a weak effect. But the problem with real-life experimentation is that we often don’t have any idea what the outcome should look like. Do I have the Bestest Cancer Drug EVAH!, or simply one that has a small, but beneficial effect? If you throw in a desire, not always careerist or greedy (cancer does suck), for a large effect, or a tendency to overestimate one, the healthy habit of skepticism is sometimes honored more in the breach than in the observance. Worse, if you’re ‘data-mining’, you often have no a priori assumptions at all!
Note that this is not a multiple comparisons or ‘p-value’ issue–the point isn’t that sometimes you’ll get a significant result by chance. The problem has to do with detection: with inadequate study sizes plus weak effects, any effect large enough to be detected is far bigger than the real one, so what you detect is spurious, albeit sometimes fortuitously in the right direction.
So what someone will do is report the statistically significant result (since we tend to not report the non-significant ones). But further experiments, which often aren’t well designed either, fail to pick up an effect. The ones that are well designed and have a large sample size will either identify a very weak real effect, leading to a consensus in the field of “Meh”, or correctly fail to find a non-existent effect.
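The sequence described above (small studies, only the significant ones reported, then a large follow-up) can be sketched as a toy simulation. Everything here is an assumption for illustration: the 51% coin from the earlier example stands in for a weak real effect, significance is judged with a normal-approximation z-test at the two-sided 5% level, and names such as `flip_study` and `published` are invented.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable
TRUE_P = 0.51            # hypothetical weak bias, as in the coin example
Z_CRIT = 1.96            # two-sided 5% cutoff, normal approximation

def flip_study(n):
    """Flip the coin n times; return (estimated P(heads), significant?)."""
    heads = sum(rng.random() < TRUE_P for _ in range(n))
    z = (heads - n / 2) / (n / 4) ** 0.5   # test against a fair coin
    return heads / n, abs(z) > Z_CRIT

# Many small studies, of which only the significant ones get reported.
small = [flip_study(100) for _ in range(5000)]
published = [est for est, sig in small if sig]
wrong_sign = [est for est in published if est < 0.5]

# How big does the bias look in the reported studies, versus the truth?
mean_published_effect = sum(abs(est - 0.5) for est in published) / len(published)
exaggeration = mean_published_effect / (TRUE_P - 0.5)

# One well-powered follow-up study.
big_estimate, big_sig = flip_study(100_000)

print(f"small studies reaching significance: {len(published)} of {len(small)}")
print(f"...of which in the wrong direction:  {len(wrong_sign)}")
print(f"average reported bias: {mean_published_effect:.3f} (true bias 0.010)")
print(f"exaggeration factor:   {exaggeration:.1f}x")
print(f"large follow-up estimate of P(heads): {big_estimate:.4f}")
```

With 100 flips, a study can only reach significance by showing 60 or more heads (or 40 or fewer), so every reported bias is at least about ten times the true one, and the large follow-up then appears to show the effect "declining".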
Sounds like the Decline Effect to me.
Do your power calculations, kids….
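As a rough sketch of what such a calculation looks like for the coin example, the snippet below uses the textbook normal-approximation formula for a one-sample test of a proportion. The function names `flips_needed` and `approx_power` are hypothetical, and the defaults (two-sided 5% test, 80% power) are conventional choices rather than anything from the discussion above.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def flips_needed(p_true, p_null=0.5, alpha=0.05, power=0.80):
    """Approximate flips needed to detect p_true against p_null."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    numerator = (z_alpha * sqrt(p_null * (1 - p_null))
                 + z_power * sqrt(p_true * (1 - p_true)))
    return ceil((numerator / abs(p_true - p_null)) ** 2)

def approx_power(n, p_true, p_null=0.5, alpha=0.05):
    """Approximate chance of a significant result in the true direction."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (abs(p_true - p_null) * sqrt(n)
             - z_alpha * sqrt(p_null * (1 - p_null)))
    return STD_NORMAL.cdf(shift / sqrt(p_true * (1 - p_true)))

n_weak = flips_needed(0.51)       # slightly nicked coin
n_strong = flips_needed(0.55)     # much more obvious bias
power_at_100 = approx_power(100, 0.51)

print(f"flips needed for a 51% coin: {n_weak}")
print(f"flips needed for a 55% coin: {n_strong}")
print(f"power with only 100 flips of a 51% coin: {power_at_100:.3f}")
```

The point of running it is the order of magnitude: a bias of one percentage point needs tens of thousands of flips, not a hundred.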
*Likelihood and Bayesian methods will also fail in this particular example.
**There is also a 4.4% chance that I will observe 42 or fewer heads. This means that 11% of the time (6.6% + 4.4%), I will conclude that the coin is biased, but, in those 11% of cases, 40% of the time (4.4 out of 11), I will think the coin is biased in the wrong direction.
Excellent. I haven’t worked my way through your math, but get the overall idea. I truly didn’t understand the big deal about the Lehrer article. I’m no scientist but I knew every reason for the DE that he discussed. It’s extremely interesting, but it is, to put it mildly, absurd to conclude from all those confounding possibilities that what to believe about an experiment is a matter of choice, let alone that empirical science is in trouble from some mysterious Unknown Force.
Forgive some extra words that make part of it easier for me (and perhaps others).
Toss fair coin 100 times, then get p<.05 5% of the time, p<.01 1% of the time, etc.
Toss fair coin 10000 times, then p-value has the same distribution.
1) if I do get p<.01, the estimated probability of heads will on average be far larger for the 100 toss experiment. I hope that is obvious. For example estimated probability of .52 will not give p<.01 for 100 tosses, but would if the sample size were large enough. Or again: getting estimated probability of .58 is “easy” for 100 tosses, but “hard” for large samples.
2) If the coin was not really fair, that is still true.
Nerd version: the critical region for the proportion changes with the sample size.
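The "nerd version" can be made concrete with a small sketch. It assumes a two-sided test of a fair coin using the normal approximation, and the function name `critical_proportion` is made up for illustration; it returns the smallest estimated probability of heads (above 0.5) that would reach the chosen significance level for a given number of tosses.

```python
from math import sqrt
from statistics import NormalDist

def critical_proportion(n, alpha=0.01):
    """Estimated P(heads) needed to reject a fair coin at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided cutoff
    return 0.5 + z * sqrt(0.25 / n)           # 0.25 = variance of one fair toss

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tosses: need an estimate above {critical_proportion(n):.4f}")
```

Under these assumptions an estimate of .52 falls well short of the cutoff at 100 tosses but clears it at 10,000.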
A very nice summary. I’ll need to figure out how to impress this info on a few of my labmates.
I woke up thinking about this this morning and had to rush over to the first blog I could find about decline effect on google 🙂 Maybe this has been mentioned before but…
Isn’t one problem with the Decline Effect that only successful experiments usually get replicated? So effectively ‘the only way is down’. It’s not as if scientists are replicating the unsuccessful experiments and discovering that they actually work the second time round 🙂