P-Values and Power of Test: Why So Many Results Can’t Be Replicated

It’s all the rage to argue that science is facing a crisis because so many results can’t be repeated. David Colquhoun has a really nice explanation why replication experiments fail so often. Before we get to his explanation, we’ll have to march through a couple of ugly terms (but it will be worth it!). First, there’s “P(real)”, which is the probability that there is a real effect. For example, if ten percent of drugs tested have a real effect (assume we’re omniscient), then P(real) would equal 0.1. “Power” is the probability that a given experiment will detect that real effect. If the power = 0.8, then eighty percent of the time when there is a real biological effect, it will be observed (most studies shoot for a power of 0.8 or higher); of course, this also means twenty percent of real effects will be missed. Then there’s the significance level, usually referred to as p (or a ‘p-value’). I’ll use Colquhoun’s definition:

The P value is the probability that you would find a difference as big as that observed, or a still bigger value, if in fact A and B were identical.

With a p-value threshold of 0.05, the typical minimum, there’s a five percent chance of seeing a difference that large even when there is no real effect at all. Finally, there’s the false discovery rate (FDR), which is the percentage of positive results that are actually false positives (i.e., you believe your drug works, but it doesn’t). Colquhoun makes a cartoon example using some reasonable values (power = 0.8, P(real) = 0.1, p = 0.05):


Yep, 36 percent false positives with a p-value of 0.05. Depressingly, many studies are underpowered (i.e., power is lower than 0.8) and most treatments won’t work, so P(real) is much less than 0.1. If you want to play around, here’s a generalized version of the cartoon:
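The arithmetic behind that 36 percent is simple enough to sketch in a few lines. This is a minimal reconstruction of Colquhoun’s cartoon, assuming 1,000 hypothetical tests (the test count is arbitrary; the FDR doesn’t depend on it):

```python
# Colquhoun's cartoon: of 1,000 tests, 100 have a real effect (P(real) = 0.1).
n_tests = 1000
p_real = 0.1   # fraction of tested hypotheses with a real effect
power = 0.8    # probability an experiment detects a real effect
alpha = 0.05   # significance threshold (p-value cutoff)

# Real effects detected: 100 real effects x 0.8 power
true_positives = n_tests * p_real * power            # 80
# Null effects that cross the threshold anyway: 900 x 0.05
false_positives = n_tests * (1 - p_real) * alpha     # 45
# Of 125 "significant" results, 45 are spurious
fdr = false_positives / (true_positives + false_positives)
print(f"False discovery rate: {fdr:.0%}")
```

So of the 125 results that clear p < 0.05, 45 are false positives: 45/125 = 36 percent.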


With Big Data and data mining, P(real) will often be very small. Thanks to limited budgets, power will also be low. Rub these two things together, and we should expect high false discovery rates. Put another way, we get the science we pay for: mining existing data and low numbers of replicates is cheaper than funding research at the scale the experiments require.
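To see how badly this bites in the data-mining regime, the cartoon generalizes to a one-line formula. Here’s a sketch (the parameter values in the example are illustrative, not from any particular study):

```python
def false_discovery_rate(p_real, power, alpha=0.05):
    """Expected fraction of 'significant' results that are false positives."""
    true_pos = p_real * power          # real effects that get detected
    false_pos = (1 - p_real) * alpha   # null effects that cross the threshold
    return false_pos / (true_pos + false_pos)

# Colquhoun's cartoon values reproduce the 36 percent above:
print(f"{false_discovery_rate(p_real=0.1, power=0.8):.0%}")

# Data-mining regime: rare real effects, underpowered studies.
print(f"{false_discovery_rate(p_real=0.01, power=0.4):.0%}")
```

With P(real) = 0.01 and power = 0.4, more than nine out of ten “discoveries” are false, which is the point: cheap data plus cheap experiments buys a literature that can’t be replicated.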

Though it does keep things interesting

An aside: Don’t even think the word Bayesian. We’re not having that argument…

This entry was posted in Funding, Statistics.

7 Responses to P-Values and Power of Test: Why So Many Results Can’t Be Replicated

  1. cettel22 says:

And I thought that “FDR” referred to Franklin Delano Roosevelt.

  2. Paul Orwin says:

    Is it just my poor blog reading skills, or did you not link to the OP in this description (good one, btw)?

  3. Jonathan Kaplan says:

    (cough cough “P(real)” sounds a lot like a Bayesian prior cough cough) 🙂

  4. While medicine will still want a “looser” p-value than the natural sciences, for good reasons, I can’t see why, in today’s world, we can’t tighten it to, say, 0.03.

  5. Darkling says:

    cough effect size cough multiple comparison correction cough publication bias cough If only there was some literature out there explaining why P-values aren’t good science cough.

    Strictly speaking, the p-value is the probability of getting data at least as extreme as what you observed if the null hypothesis were correct. In most situations that’s trivially true because no two populations are the same; the interesting question isn’t whether there’s a difference, it’s the direction and the size.

    The problems with NHST seem to come up in every field, eventually.

  6. Pingback: Washington Equitable Center for Growth | Things to Read on the Morning of April 15, 2014

  7. Pingback: Arguably the Best Paper Abstract Ever | Mike the Mad Biologist
