There’s a 538 article, “Science Isn’t Broken,” that’s winding its way through the tubes of the internet. It has a cool ‘p-hacking’ interactive feature, which is probably why most of the associated commentary has focused on the problem of p-hacking (p-hacking is where you keep tweaking your analysis until you get a p-value of less than 0.05 — that is, a result at least that extreme would occur less than five percent of the time by chance alone if there were no real effect). But I think the interactive feature has distracted people from the real problem: the treatment of data before you even run a statistical test. I’ve described how this happens in genomics as well as in a preliminary analysis of D.C. student test scores. In both cases, how you decide to treat your data (e.g., which kids are defined as poor, or which assembly method to use when reconstructing a genome) can qualitatively influence your outcomes.
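To make the p-hacking point concrete, here’s a minimal simulation sketch (this is my own toy example, not the 538 interactive; the variable names and the “ten analysis variants” scenario are made up for illustration). You generate pure noise, try several defensible ways of slicing it, and keep whichever slice gives the smallest p-value — and you “find” an effect far more than five percent of the time.

```python
# Toy p-hacking simulation (illustrative only): under a true null hypothesis,
# shopping among several "reasonable" analysis variants inflates the
# false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations = 2000
n_per_group = 50
n_variants = 10          # hypothetical alternative ways to slice/define the data
alpha = 0.05

naive_hits = 0           # significant results when one pre-specified test is run
hacked_hits = 0          # significant results when the best of several variants is kept

for _ in range(n_simulations):
    # Two groups drawn from the same distribution: no real effect exists.
    group_a = rng.normal(size=(n_variants, n_per_group))
    group_b = rng.normal(size=(n_variants, n_per_group))

    # Each row stands in for a different analytic choice (e.g., a different
    # cutoff for "poor" students, or a different way of filtering the sample).
    p_values = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]

    naive_hits += p_values[0] < alpha        # commit to one analysis up front
    hacked_hits += min(p_values) < alpha     # shop around for the best p-value

print(f"False-positive rate, single pre-specified test: {naive_hits / n_simulations:.2f}")
print(f"False-positive rate, best of {n_variants} variants: {hacked_hits / n_simulations:.2f}")
```

With ten independent variants you’d expect the “best of” false-positive rate to land somewhere around forty percent, versus roughly five percent for the pre-specified test — same data-generating process, very different conclusions.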
To me, the example described in the 538 story is the critical one (boldface mine):
Nosek’s team invited researchers to take part in a crowdsourcing data analysis project. The setup was simple. Participants were all given the same data set and prompt: Do soccer referees give more red cards to dark-skinned players than light-skinned ones? They were then asked to submit their analytical approach for feedback from other teams before diving into the analysis.
Twenty-nine teams with a total of 61 analysts took part. The researchers used a wide variety of methods, ranging — for those of you interested in the methodological gore — from simple linear regression techniques to complex multilevel regressions and Bayesian approaches. They also made different decisions about which secondary variables to use in their analyses.
Despite analyzing the same data, the researchers got a variety of results. Twenty teams concluded that soccer referees gave more red cards to dark-skinned players, and nine teams found no significant relationship between skin color and red cards.
The variability in results wasn’t due to fraud or sloppy work. These were highly competent analysts who were motivated to find the truth, said Eric Luis Uhlmann, a psychologist at the Insead business school in Singapore and one of the project leaders. Even the most skilled researchers must make subjective choices that have a huge impact on the result they find.
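To see how those subjective choices can flip a conclusion, here’s a toy sketch (simulated data, not the actual red-card data set; the variable names and effect sizes are hypothetical, and the confounding structure is something I chose for illustration). Two analysts fit the same outcome on the same data; they differ only in whether a correlated secondary variable is included as a covariate.

```python
# Toy illustration: the same simulated data set, two defensible model
# specifications, and a coefficient of interest that looks significant in one
# and (in most runs) not in the other.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Hypothetical variables: 'skin_tone' is the predictor of interest, 'league' is
# a secondary variable correlated with it, and 'red_cards' is the outcome,
# which here is driven by the secondary variable rather than by skin tone.
league = rng.normal(size=n)
skin_tone = 0.7 * league + rng.normal(size=n)
red_cards = 0.4 * league + rng.normal(size=n)

# Analyst 1: simple regression, predictor of interest only.
model_simple = sm.OLS(red_cards, sm.add_constant(skin_tone)).fit()

# Analyst 2: same outcome, but the secondary variable is included as a covariate.
X_full = sm.add_constant(np.column_stack([skin_tone, league]))
model_adjusted = sm.OLS(red_cards, X_full).fit()

print(f"Unadjusted p-value for skin_tone: {model_simple.pvalues[1]:.3f}")
print(f"Adjusted p-value for skin_tone:   {model_adjusted.pvalues[1]:.3f}")
```

Neither specification is fraudulent or sloppy; they’re just different judgment calls about which secondary variables belong in the model — exactly the kind of choice the twenty-nine teams had to make.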
And of course, this kind of study doesn’t even take into account experimental methods (e.g., lab technique, experimental lines, engineering capabilities), which can be very difficult to get right. Think of it this way: how many statisticians would you trust to do brain surgery? Some experiments are very hard to conduct and/or measure.
For me, the real story in the 538 article is that much of the variability occurs before you run any statistical analyses. That’s why science, as the article puts it, is really fucking hard.