The National Center for Biotechnology Information (NCBI) recently announced that it will shut down the Sequence Read Archive (SRA), which stored the raw and semi-processed reads from genomics projects so that researchers could reexamine a project's underlying data. The reason given by NCBI is “budget constraints.” While I’m saddened by this, I’m not surprised: the volume of data produced by a single genome center is tremendous, to the point where storage and data upload become prohibitive:
when several centers collaborated to test new sequencing technologies, the datasets were so large that the centers actually shipped hard drives to each other to compare results. That may be exactly what has to happen just to upload data:
If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail the way that the Protein Data Bank once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays.
The other possibility is that the raw data, or even the ‘first-step’ processed data, might no longer be made publicly available at all; think of this as the physics model:
At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest.
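The "filter and discard" idea above can be sketched in a few lines. This is purely illustrative: the FASTQ parsing and the mean-Phred cutoff below are my own assumptions for the example, not any archive's or genome center's actual pipeline.

```python
# A minimal sketch of the physics-style "filter and discard" model:
# stream sequencing reads, keep only the informative ones (here, those
# passing a quality threshold), and throw the rest away instead of
# archiving them. The threshold and parsing are illustrative assumptions.

def mean_phred(quality_line):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in quality_line) / len(quality_line)

def filter_reads(fastq_lines, min_mean_quality=30):
    """Yield (header, sequence) for reads whose mean quality passes the cutoff."""
    # FASTQ records are 4 lines each: header, sequence, '+', quality string.
    records = [fastq_lines[i:i + 4] for i in range(0, len(fastq_lines), 4)]
    for header, seq, _plus, qual in records:
        if mean_phred(qual) >= min_mean_quality:
            yield header, seq  # an informative "event": keep it
        # everything else is discarded, never stored

# Two toy reads: 'I' encodes Phred 40 (high quality), '#' encodes Phred 2.
fastq = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTTT", "+", "####",
]
kept = list(filter_reads(fastq))
print(kept)  # only @read1 survives the filter
```

The point of the analogy is the last comment: like a particle detector's trigger system, the filter runs once, close to the instrument, and only its small output ever reaches long-term storage.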
As genomics and other data-intensive disciplines of biology move towards cloud computing (and I think it will definitely happen), it will be interesting to see how NIH funding shifts.
Well, now we know how one part of that funding will shift.