The Closing of Microbial Genomics Data?

This post will have to be a little vague, but from my vantage point, I’m seeing two disturbing trends in microbial genomics. For those who aren’t among the cognoscenti, the weird thing about open-access data arguments for anyone in microbial genomics is that, historically, the data have always been open access. When most sequencing was done at genomics centers, these centers were committed, as a matter of both principle and funding requirements, to releasing the raw sequence* data (archived at NCBI/NIH’s Sequence Read Archive, formerly the Short Read Archive) along with the metadata (information about the sequenced organism). That established a scientific culture of public sequence deposition.

Making these data publicly available has been a real boon. While the popular presentation of a genome is that it’s carved in stone, in reality, a genome assembly–the sequence of A, T, C, G–is a result of multiple decisions: what software to use to construct the genome, the parameters used with a particular piece of software, filtering out low quality raw data, post-assembly decisions to remove possible contaminants, and more. To be consistent when using other groups’ data, going back to the raw sequence and assembling all your genomes the same way (even if you’re ‘consistently wrong’) is critical: you don’t want what looks like an interesting pattern to simply be a result of different methods (e.g., British and U.S. strains of E. coli are different… because the U.K. and U.S. groups that did the original sequencing used different assembly methods. Oops).

Of course, as in all fields, technology improves, and genome assembly is no different: a new assembly algorithm might yield an improved assembly, even with ‘old’ data. And, of course, scientists are human and make mistakes. Looking at the raw data can identify mistakes in the literature** (Got Rogoff/Reinhart?).

So that’s a long introduction to what I see as two coming problems for microbial genomics. The first is that there will be a lot of genomic data produced in the U.S. and abroad by government agencies, and some, perhaps many, of those agencies are not committed to an open vision of microbial genomics. If they do release data, it will have very little metadata attached, to the point where there is little or no useful information (e.g., is the bacterial isolate ‘clinical’, that is, was the microbe from a sick person?).

The second problem is that, increasingly, I’m seeing papers published without any raw sequence deposition: just last week, I found two, one in an American Society for Microbiology journal and one in a PLoS journal. Not exactly the open-access philosophy, I think. Sure, assemblies were available, but to a considerable extent, without the raw data, these genomes are dead to me. I can’t really do much with them (and my nefarious purposes don’t include publishing, but rather identifying disease outbreaks. Not helping).

I hope scientific societies, both as advocacy organizations and as publishers, will support policies requiring the release of raw sequencing data.

*For the cognoscenti, I’m referring to FASTQ files, not images. If you already knew that, don’t be a pedantic twerp.

**Might have identified one; trying to figure out if we’re correct now.

This entry was posted in Genomics, Public Health, Publishing.

1 Response to The Closing of Microbial Genomics Data?

  1. antifer says:

    I hope my comment will be of interest to some.

    I have regularly been asked by editors/reviewers to provide raw reads along with a publication, and I complied and believed it was a really good thing. On the other hand, I have also been told off by reviewers for using ‘convenience datasets’ made up mostly of public genomes for my analyses. And I believe this is an even better thing.

    It’s a good thing that metadata is not so easily available on public genomes, precisely because metadata is ALL. Studies have very different samplings, depending on their focus, and there is a very unhealthy trend right now to make the information on every single isolate fit dropdown lists of precise labels. The absence of readily available metadata is a reflection of the difficulty of harmonising everything, rather than malice from the authors. However, I believe it also prevents confusion for those who wouldn’t otherwise read the original studies and sampling details, and forces them to do so. It prevents blind ‘data hoovering’ without any consideration that what you are comparing is perhaps not comparable. And it prevents the over-interpretation of results by hardcore bioinformaticians who, frankly, most often disregard (or worse, are genuinely ignorant of) basic biology and ecology principles.

    I have met too many of these incredibly and impressively skilled statistical geneticists with wrong ideas about how biology and ecology work, or even how a cell works, or how DNA is extracted. However, I believe that they are a disappearing bunch. Designing and carrying out careful and robust sampling, performing sequencing and making the right analyses, WITH REPLICATES, followed up by careful phenotypic validation, WITH REPLICATES, is hard, costly, and time-consuming, and meaningful genomics-based microbiology can only be done by multidisciplinary teams or labs, with strong and competent coordination.

    After the hype of NGS and the stream of technical manuscripts it brought, biology is thankfully coming back to the heart of bacterial genomics studies. The number of sequences publicly available is a good sign of this, but there might be an unsolvable problem here concerning metadata.
