This post will have to be a little vague, but from my vantage point, I’m seeing two disturbing trends in microbial genomics. For those who aren’t among the cognoscenti, the weird thing about open-access data arguments for anyone in microbial genomics is that, historically, there has always been open-access data. When most sequencing was done at genomics centers, these centers were committed, both as a matter of principle and because of funding requirements, to releasing the raw sequence* data (archived at NCBI/NIH’s Sequence Read Archive, formerly the Short Read Archive) along with the metadata (information about the sequenced organism). That established a scientific culture of public sequence deposition.
Making these data publicly available has been a real boon. While the popular presentation of a genome is that it’s carved in stone, in reality, a genome assembly (the sequence of A, T, C, G) is a result of multiple decisions: which software to use to construct the genome, which parameters to use with that software, how to filter out low-quality raw data, whether to remove possible contaminants after assembly, and more. To be consistent when using other groups’ data, going back to the raw sequence and assembling all your genomes the same way (even if you’re ‘consistently wrong’) is critical: you don’t want what looks like an interesting pattern to simply be a result of different methods (e.g., British and U.S. strains of E. coli are different… because the U.K. and U.S. groups that did the original sequencing used different assembly methods. Oops).
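To make one of those decisions concrete, here is a minimal sketch of a read-quality filter, the kind of step that happens before assembly. It is an illustration only, not any particular group's pipeline: the function names, the Phred+33 quality encoding, the four-lines-per-record layout, and the cutoff of 20 are all assumptions chosen for the example.

```python
import io


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding is assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ handle.

    Assumes the simple four-lines-per-record layout with no line wrapping.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean quality meets a (hypothetical) cutoff."""
    for name, seq, qual in read_fastq(handle):
        if qual and mean_phred(qual) >= min_mean_quality:
            yield name, seq, qual


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
)
kept = [name for name, _, _ in filter_reads(example)]
print(kept)
```

Change `min_mean_quality` and a different set of reads goes into the assembler. Two groups starting from identical raw data but using different cutoffs can end up with different assemblies, which is exactly why having the raw reads lets you redo everything one way.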
Of course, as in all fields, technology improves, and assembling genomes is no different: a new assembly algorithm might yield an improved assembly, even with ‘old’ data. And, of course, scientists are human and make mistakes. Looking at the raw data can identify mistakes in the literature** (Got Rogoff/Reinhart?).
So that’s a long introduction to what I see as two coming problems for microbial genomics. The first is that there will be a lot of genomic data produced in the U.S. and abroad by government agencies, and some, perhaps many, of those agencies are not committed to an open vision of microbial genomics. If they do release data, it will have very little metadata attached to it, to the point where there is little or no useful information (e.g., is the bacterial isolate ‘clinical’, that is, was the microbe from a sick person?).
The second problem is that, increasingly, I’m seeing papers published without any raw sequence deposition: just last week, I found two, one in an American Society for Microbiology journal and one in a PLoS journal. Not exactly open-access philosophy, I think. Sure, assemblies were available, but to a considerable extent, without the raw data, these genomes are dead to me. I can’t really do much with them (and my nefarious purposes aren’t publishing but identifying disease outbreaks. Not helping).
I hope scientific societies, both as advocacy organizations and as publishers, will support policies requiring the release of raw sequencing data.
*For the cognoscenti, I’m referring to FASTQ files, not images. If you already knew that, don’t be a pedantic twerp.
**Might have identified one; trying to figure out if we’re correct now.