I spent part of last week at the ASM Conference on Rapid Next-Generation Sequencing and Bioinformatic Pipelines for Enhanced Molecular Epidemiologic Investigation of Pathogens, which for obvious reasons is also referred to as ASMNGS. Lots of good science (though I like science better at 9am, not 8am. Just saying).

Scott Federhen discussed NCBI’s microbial genomics taxonomy efforts (NCBI is the National Center for Biotechnology Information at NIH). The punchline is this: after consultation with external phylogenetics efforts, NCBI has instituted a policy of correcting genomes that are assigned to the wrong species (e.g., a Klebsiella genome is called an E. coli genome). There are a lot of reasons why genomes would be mislabelled, with the most common being contamination (i.e., someone accidentally mixed together two species and then sequenced them), sample swaps (someone thought he was sequencing sample X, when he sequenced sample Y), data handling fuckups (that’s the highly technical term), or submitter error (the person who submitted the genome identified it incorrectly and didn’t change the submission after sequencing)*.

While NCBI is often thought of as a sequence repository (GenBank), it’s actually part of the National Library of Medicine, so changing erroneous genome submissions is a significant shift in policy: imagine if NCBI or NLM changed erroneous articles in PubMed**. That said, the submitters of the genomes are being contacted to inform them of this.

This is a much-needed change. Many research groups as well as public health labs routinely use the genomes in GenBank as part of genome-based surveillance. Having a few misnamed genomes within a species for which there are hundreds or thousands of genomes might not sound like much, but that can really screw up these systems in any number of ways***.

Personally, some of the things I work on have been hampered by this, so, from my perspective, as well as that of most microbiologists and bioinformaticians, this is a very good development.

*To get into the weeds, you can submit the metadata for a genome (what species it is, where it was isolated, etc.) before any sequencing has begun. Sometimes people fail to correct the metadata after the genome sequencing.

**There are ways to note retractions and for people to leave comments.

***A short, very incomplete, highly technical list of problems:

  1. When trying to type strains by placing them in a genome phylogeny, you could end up misidentifying strains.
  2. If you’re developing a typing system for a species (e.g., cgMLST), you don’t want to include data from a completely different species.
  3. If you’re trying to find reference assemblies to improve your own genome assemblies, including incorrectly assigned taxa can screw things up.
  4. Researchers asking basic research questions can chase after red herrings because they think they’ve found something unusual in species X, when… you don’t have species X.
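To make items 2 and 3 a bit more concrete, here is a minimal sketch of how you might screen a set of downloaded genomes before building a typing scheme or picking references. This is not NCBI's method. It assumes you have already computed average nucleotide identity (ANI) between each genome and the type strain of each candidate species with some external tool, and it uses roughly 95% ANI as a rule-of-thumb species boundary. The record layout, the accession names, and the ANI values are all hypothetical and made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: ~95% ANI is used here as a rule-of-thumb species boundary.
ANI_SPECIES_THRESHOLD = 95.0


@dataclass
class GenomeRecord:
    """One genome plus precomputed ANI values (hypothetical layout)."""
    accession: str                        # made-up identifier, not a real accession
    labeled_species: str                  # species name in the submitted metadata
    ani_to_type_strain: dict[str, float]  # species -> ANI (%) vs. that species' type strain


def best_match(record: GenomeRecord) -> tuple[str, float]:
    """Return the (species, ANI) pair with the highest ANI for this genome."""
    return max(record.ani_to_type_strain.items(), key=lambda kv: kv[1])


def flag_mislabeled(records, threshold=ANI_SPECIES_THRESHOLD):
    """List genomes whose label disagrees with their closest type strain.

    A genome is flagged only if its best match is a different species AND
    its ANI to the labeled species' type strain falls below the threshold.
    """
    flagged = []
    for rec in records:
        species, ani = best_match(rec)
        labeled_ani = rec.ani_to_type_strain.get(rec.labeled_species, 0.0)
        if species != rec.labeled_species and labeled_ani < threshold:
            flagged.append((rec.accession, rec.labeled_species, species, ani))
    return flagged


# Illustrative, invented values: GENOME_B is labeled E. coli but looks like Klebsiella.
example_records = [
    GenomeRecord("GENOME_A", "Escherichia coli",
                 {"Escherichia coli": 98.2, "Klebsiella pneumoniae": 80.1}),
    GenomeRecord("GENOME_B", "Escherichia coli",
                 {"Escherichia coli": 79.8, "Klebsiella pneumoniae": 98.7}),
]

for accession, labeled, likely, ani in flag_mislabeled(example_records):
    print(f"{accession}: labeled {labeled}, closest type strain is {likely} ({ani}% ANI)")
```

Anything flagged this way is a candidate for exclusion or a closer look, not proof of a mislabel: contamination or a genuine hybrid can produce the same signal.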
  2. Nabeeh says:

    This is excellent. One species I work on has a genome that is either drastically misidentified or a hybrid of sorts. Genome-wide analyses and in silico MLST analyses illustrate the erroneous identification, yet nothing has been done to correct its taxonomy. With the complicated BioSample and BioProject registrations, perhaps it would be smart to require 16S rRNA, rpoB, and ITS or some MLST identification prior to NCBI genome submission.

