There’s a very interesting pre-print “The next 20 years of genome research” that made the genomics bloggysphere rounds recently. I like most of what’s in there, but, this being the bloggysphere, I’m going to obsessively dwell on the one part I don’t like, which, as the title suggests, has to do with public genomic repositories (boldface mine):
Most immediately impacted by the massive growth to sequencing and sensor technologies will be the computational systems used for storing and transferring biological data. For more than twenty years, NCBI and its international counterparts at the EBI and DDBJ have served as the central clearinghouse for genomic data. Over the next twenty years, these resources will continue to steadily grow, although as the sequencing facilities grow from petabyte to exabyte scale, it will become less and less practical to transfer data into these archives as they exist today. Furthermore, as sequencing shifts from research purposes and into more direct medical applications, the incentive for making the data publically available in a centralized archive will be reduced or perhaps even legally restricted. In its place, we will see the rise of federated approaches for exchanging biological data, especially computing centers dedicated to large sequencing facilities. Already this trend is beginning, and the NCBI Sequence Read Archive (SRA) currently only stores ~1/10th of the worldwide sequence production, around 3.8 Pbp of the more than 35 Pbp sequenced so far (http://www.ncbi.nlm.nih.gov/Traces/sra/). Fortunately the rest of the data are not completely lost, and we are beginning to see the emergence of new exchange systems outside of traditional archives. These systems often consolidate regional and/or topical interests inside a dedicated cloud-based portal, such as CGHub or ICGC for consolidating cancer genomic data, or the recently launched BGI-cloud to provide access to the great resources available there (http://bgiamericas.com/data-analysis/bgi-cloud/). Illumina BaseSpace (https://basespace.illumina.com), DNAnexus (https://www.dnanexus.com/), Google Genomics (https://cloud.google.com/genomics/), and other commercial vendors are also emerging to help manage the deluge of data using commercial cloud platforms.
The problem with this argument is that, at least in infectious disease epidemiology, the power of genomics arises when different groups can look at each other’s data. When that doesn’t happen, we miss important things (boldface mine):
Trevor Bedford, a computational biologist at the Fred Hutchinson Cancer Research Center in Seattle, and Richard Neher, a physicist at the Max Planck Institute for Developmental Biology in Germany, prepared a graphic that is something of a genetic family tree of the outbreak.
“We’re trying to connect things,” Dr. Bedford said. “We basically had no idea what was going on for a long time.”
That was because for many months, samples were not easily transported across borders; the affected countries did not have the technology to sequence samples; and some scientists were reluctant to share data before publishing their results.
“It would have been great if these papers came out six months ago,” Dr. Bedford said. “You could imagine a situation where you don’t really have to publish your Nature paper; instead, you make a blog post. It could have been a bit more timely.”
When we don’t share data (that is, place them in repositories that are accessible to all), we lose an incredible public health tool. When data aren’t deposited in public databases before publication, we all lose. In addition, there’s an opportunity to get data from clinical labs and make those data publicly available (obviously, for HIPAA reasons, some data will have to be scrubbed). We really don’t want these data stored in private databases.