One of the things I’ve repeated on this blog a few times is that you have to understand the limitations of your data. In a good Nature piece about how the problems with using p-values are only the beginning of our statistical problems, there’s a very good figure illustrating what I mean by the limitations of your data.
Most science, whether you’re doing genomics or blathering on about education, involves multiple transformations of your data, along with decisions about how to analyze your data. Those transformations really matter. Let’s pretend we have a project looking at antimicrobial resistance genes in bacteria–we have 100 isolates each from people exposed to an antibiotic and from people not exposed to an antibiotic, and we want to determine if the number and type of resistance genes differ between the two groups. In principle, a simple thing to do.
In reality, this very simple question involves many decision points.
If we start with a bunch (millions) of sequencing reads (short sequences that are each about 1/20,000 the length of a bacterial genome, or ~200 ‘letters of DNA’), we first have to decide which reads we won’t use at all: the signal from some reads is so bad that they’re essentially random sequence. Once we do that, we have to trim the reads: the starts and ends of reads are also low quality, so we need to ignore those regions. Of course, how we define “those regions” can affect what the genome assembly will be.
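To make that concrete, here's a minimal Python sketch of the filter-and-trim step. Everything in it is an assumption for illustration: the function names, the Phred-style quality scores, and the cutoffs (`min_q`, `min_mean_q`, `min_len`) are made up, and real pipelines use dedicated tools with more sophisticated rules. The point is only that moving one cutoff changes which bases survive.

```python
def trim_read(seq, quals, min_q):
    """Trim bases with quality below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def filter_and_trim(reads, min_q=20, min_mean_q=15, min_len=5):
    """Drop reads with a low mean quality, then trim the ends of the rest.

    All thresholds are hypothetical defaults, not recommendations.
    """
    kept = []
    for seq, quals in reads:
        if not quals or sum(quals) / len(quals) < min_mean_q:
            continue  # the whole read is too noisy to use
        trimmed_seq, trimmed_quals = trim_read(seq, quals, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


# Toy reads (sequence, per-base quality); values are invented for illustration.
reads = [
    ("ACGTACGTAC", [10, 18, 30, 35, 36, 38, 35, 30, 22, 12]),
    ("GGGTTTAAAC", [5, 6, 4, 8, 7, 5, 6, 4, 3, 5]),
]

lenient = filter_and_trim(reads, min_q=15)
strict = filter_and_trim(reads, min_q=25)
print(lenient[0][0])  # CGTACGTA
print(strict[0][0])   # GTACGT
```

With the lenient cutoff the first read keeps eight bases; with the strict one it keeps six. The second read is thrown out either way.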
Did I say assembly? Why, yes, I did. Once we have a bunch of filtered, trimmed reads, we then have to decide which algorithm we’ll use to assemble these reads–and the presence or absence of genes (especially resistance genes) can be affected by how we choose to assemble genomes.
So now we have a genome assembly. Now we gotta find the damn antibiotic resistance genes. The first step is known as gene calling, in which we identify stretches of genome that look like they encode proteins–and we make assumptions about how to do this (again, algorithms matter). Then we have to have a good database to use when we try to find resistance genes*. Then we have to decide what search criteria we use (how similar to known genes; what if there is only a part of a gene, etc…).
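Here's a rough Python sketch of how the search criteria alone can change the answer. The hit records, the gene names (`geneA`, `geneB`, `geneC`), the function name, and the identity and coverage cutoffs are all hypothetical; a real search tool would produce the hits, and you'd pick cutoffs to suit your database.

```python
def call_resistance_genes(hits, min_identity, min_coverage):
    """Return the genes whose best-matching hits pass both cutoffs.

    identity: percent of aligned positions that match the known gene.
    coverage: percent of the known gene's length covered by the hit.
    """
    return sorted(
        {
            hit["gene"]
            for hit in hits
            if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage
        }
    )


# Invented hits for one genome, for illustration only.
hits = [
    {"gene": "geneA", "identity": 99.0, "coverage": 100.0},  # near-perfect match
    {"gene": "geneB", "identity": 92.0, "coverage": 100.0},  # diverged full-length gene
    {"gene": "geneC", "identity": 99.5, "coverage": 60.0},   # only part of a gene
]

strict_calls = call_resistance_genes(hits, min_identity=95.0, min_coverage=90.0)
lenient_calls = call_resistance_genes(hits, min_identity=90.0, min_coverage=50.0)
print(strict_calls)   # ['geneA']
print(lenient_calls)  # ['geneA', 'geneB', 'geneC']
```

Same genome, same hits, and the isolate carries either one resistance gene or three depending on where the cutoffs sit.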
And then we do this 199 more times. Only then can we start to worry about the statistics used.
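As a back-of-the-envelope illustration of how quickly these choices multiply, here's a small Python sketch that enumerates every combination of options across the decision points described above. The assumption of exactly two options per step, and the option labels themselves, are hypothetical; in practice there are usually far more.

```python
from itertools import product

# Hypothetical: two placeholder options per decision point.
decision_points = {
    "read_filter": ["lenient", "strict"],
    "trimming": ["lenient", "strict"],
    "assembler": ["assembler_A", "assembler_B"],
    "gene_caller": ["caller_A", "caller_B"],
    "database": ["database_A", "database_B"],
    "search_criteria": ["lenient", "strict"],
}

# Each pipeline is one choice per decision point.
pipelines = [
    dict(zip(decision_points, combo))
    for combo in product(*decision_points.values())
]
print(len(pipelines))  # 64, i.e. 2**6
```

Even with only two options at each of six steps, there are 64 distinct ways to get from reads to a list of resistance genes, before any statistics are run.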
It’s also worth noting that every one of these decision points is potentially a source of reproducibility problems.
Anyway, this is what I mean by ‘you have to understand the limitations of your data.’
*I’ve actually oversimplified here, as one could search the nucleotide sequence of the genome, one could search the nucleotide sequences of the called genes, or one could search the protein sequences of the ‘called genes.’