A few weeks ago, there was a minor ruckus about a post that claimed that the shutdown of Chrysler dealerships was biased to protect dealers who supported Clinton’s campaign. At the time, I ignored it, figuring it was just another case of Conservative Clinton Derangement Syndrome (BILL CLINTON’S PENIS!!! BILLARY IS A SHE DEVIL!! AAAIIIEEE!!!). But I finally got around to reading it, and guess what?
It’s stupid, and based on a sloppy understanding of probability theory.
The authors of the post performed multiple linear regressions to determine if dealerships owned by campaign supporters of various presidential candidates were more or less likely to be closed. What they found is that the regression for Clinton supporters, which indicated that Clinton supporters were more likely to survive the shutdown, had a p-value of 0.125.
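A “p-value” is the probability of seeing a result at least this extreme by chance alone, assuming there is no ‘real’ phenomenon. So, in this case, if dealer politics had nothing to do with the closures, a bias towards Clinton supporters at least this large would still turn up as a sampling accident 12.5% of the time (which is not the same as an 87.5% probability that the bias is real, though that is how it tends to get read).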
ZOMG!!! TEH CLINTONZ R EVULS!! Except for one thing: multiple tests were performed, and a correction was not applied. What this means in English is that if you perform enough statistical comparisons, eventually one will be significant by chance alone. This is why, when polls are discussed, people will often claim that one out of twenty answers is wrong*. (Polls typically use the scientifically accepted significance threshold of 0.05, meaning that, on average, one of twenty (five percent) comparisons where no real difference exists will still look significant by chance).
The authors, in an update, recognize this, and remark (italics mine):
A word about multiple experiments:
We found what I will call the “Clinton Effect” after running the data in separate regressions just with Clinton, Obama, McCain. The rest of the variables we added in later testing. One could make the argument that Zero Hedge was “data mining” or “fishing” with multiple experiments eventually bound to find something. Readers will have to judge the import of this observation for themselves.
Actually, we don’t have to “judge the import” of anything; we can use math to figure this out. If we make N independent comparisons and nothing real is going on, the probability that one or more of them will have a p-value of p or less is equal to:
$$1 - (1 - p)^N$$
If one performs three comparisons (and I think the ‘minimum’ set should have been four, and have included the ‘None’–no donation–category), there is a 33% chance of observing one or more tests with a p-value of 0.125. With four tests, it increases to 41%. Five tests, you get 49%, and six tests, yields 55%.
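For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the tests are independent and uses the standard formula 1 - (1 - p)^N for the chance of at least one such result; the function name prob_at_least_one is mine, not anything from the original post.

```python
def prob_at_least_one(p: float, n_tests: int) -> float:
    """Chance that at least one of n_tests independent tests gives a
    p-value of p or less when there is no real effect anywhere."""
    # (1 - p) ** n_tests is the chance that every test misses the mark,
    # so the complement is the chance that at least one hits it.
    return 1.0 - (1.0 - p) ** n_tests


if __name__ == "__main__":
    # Reproduce the percentages quoted for three to six tests at p = 0.125.
    for n in range(3, 7):
        print(f"{n} tests: {prob_at_least_one(0.125, n):.0%}")
```

Run as a script, this prints 33%, 41%, 49% and 55% for three, four, five and six tests.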
So, judging the import of these findings, I say they’re bullshit.
There is an interesting scientific problem here. When you conduct large-scale surveys, such as are done in human genomics, where you look for correlations with hundreds of thousands of genetic markers, finding something that is truly significant becomes very difficult.
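To put rough numbers on that, here is a small Python sketch. The marker count of 500,000 is a made-up stand-in for “hundreds of thousands”, and the Bonferroni correction (divide the threshold by the number of tests) is simply one common, conservative fix that I am using for illustration; it is not something the dealership analysis or any particular genomics study is known to have used.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Average number of tests that look significant at threshold alpha
    when there is no real effect in any of them."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the chance of even one false
    positive across all n_tests at or below alpha (Bonferroni)."""
    return alpha / n_tests


if __name__ == "__main__":
    n_markers = 500_000  # hypothetical number of genetic markers
    print(expected_false_positives(0.05, n_markers))  # about 25,000 flukes
    print(bonferroni_threshold(0.05, n_markers))      # about 1e-07
```

In other words, at the usual 0.05 threshold you would expect about 25,000 markers to look significant by luck alone, and to guard against that each individual marker has to clear a threshold of about one in ten million.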
It’s also humbling to think that if you have conducted many experiments (non-reinforcing ones) over your career, a small fraction of what you think is real is simply artifact.
*This is actually sloppy and incorrect, but it’s too much to ask our pundit class to understand statistics. Actually, it’s a miracle they can even dress themselves.