When I worked on the human microbiome, I regularly confronted two problems with the data. The first is that species frequencies are almost never normally distributed (‘the bell curve’), and many standard statistical techniques assume normally distributed data. The second is that the data often have a lot of zero values. That is, if I look at a bunch of gut samples from people (actually, at the data: the samples themselves are VERY STINKY!), in many samples a bacterial species* will be quite frequent (2-20%), but in other samples it will be very rare (0.01%) or completely absent (i.e., 0%).
Often, people will use a log transformation of the data, but that presents problems if you have zeros (the log of 0 is undefined). One transformation that can handle zeros and frequency data** is the arcsine square root transformation. It turns out that economists have to deal with the same issue: some people earn millions, while others earn nothing. So what do they use?
They use the inverse hyperbolic sine transformation. Here’s how Frances Woolley describes it:
Happily, there’s an easy solution to this problem: the inverse hyperbolic sine transformation. It sounds intimidating and impressive; it isn’t.
The inverse hyperbolic sine transformation is defined as:
log(yi + (yi^2 + 1)^(1/2))
Except for very small values of y, the inverse sine is approximately equal to log(2yi) or log(2)+log(yi), and so it can be interpreted in exactly the same way as a standard logarithmic dependent variable. For example, if the coefficient on “urban” is 0.1, that tells us that urbanites have approximately 10 percent higher wealth than non-urban people.
But unlike a log variable, the inverse hyperbolic sine is defined at zero.
So why don’t people use it? Why did I find myself this morning, once again, writing a revise-and-resubmit letter along the lines of “and re-do the estimation using an inverse hyperbolic sine transformation”?
It’s not that the inverse hyperbolic sine is fancy and new – John Burbidge, Lonnie Magee and Les Robb wrote a nice paper on it back in 1988, and that paper cites a 1949 piece by Johnson.
I think it’s just a matter of ignorance. Most of the time, a log transformation will do the job, so that’s what most people are familiar with. Plus now there are newer and sexier alternatives to the IHS, like quantile regression.
Seems like it could be useful.
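To make the comparison concrete, here is a minimal Python sketch using NumPy. The numbers are made up for illustration (they are not real microbiome or income data), and the variable names are my own. It shows the log blowing up at zero while the arcsine square root and the inverse hyperbolic sine stay finite, and it checks the log(2y) approximation on some hypothetical incomes.

```python
import numpy as np

# Hypothetical relative abundances of one taxon across six samples,
# expressed as proportions (0.02 = 2%). Illustrative values only.
freqs = np.array([0.0, 0.0001, 0.02, 0.05, 0.20, 0.0])

# The log transform is undefined at zero: NumPy returns -inf there.
with np.errstate(divide="ignore"):
    logged = np.log(freqs)

# Both of these are defined at zero and map 0 to 0.
arcsine_sqrt = np.arcsin(np.sqrt(freqs))
ihs = np.arcsinh(freqs)  # same as log(y + sqrt(y**2 + 1))

# Hypothetical incomes, to check the approximation asinh(y) ~ log(2y),
# which only holds once y is reasonably large.
incomes = np.array([0.0, 10.0, 1_000.0, 1_000_000.0])
ihs_income = np.arcsinh(incomes)
with np.errstate(divide="ignore"):
    log_2y = np.log(2 * incomes)

print("log:         ", logged)
print("arcsine sqrt:", arcsine_sqrt)
print("IHS:         ", ihs)
print("IHS - log(2y) for incomes:", ihs_income - log_2y)
```

One caveat worth keeping in mind: for values well below 1, such as raw proportions, the inverse hyperbolic sine is nearly equal to the value itself, so the log-like interpretation only kicks in for larger numbers (percentages or read counts, say, rather than fractions).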
*Actually, we typically use either genera, or operational taxonomic units (‘OTUs’), which are a set of closely related bacteria.
**The issue with frequencies is that it’s impossible to have a negative frequency. The arcsine square root transformation gets around this.