By way of the Twitterz, we come across this fantastic poster (embiggened version here):
To translate from Genomics, to examine the microbiome–the microorganisms that live on or in a certain habitat (e.g., your gut or skin, the surface of your toothbrush, a tomato, etc.)–we get DNA from that habitat, sequence it, and then assemble those small sequences* into larger sequences. We can then compare these larger sequences to genetic databases to determine which organisms they came from. If we see that a lot of our sequences match the bacterium E. coli, we can assume that E. coli is an important part of the microbial community we sampled (there are procedures we can use to remove human sequences–in these studies, humans are a contaminant). One of the common programs used to assign sequences to organisms is known as MG-RAST. Gonzalez et al. (the authors of the poster) found that thirty percent of all publicly-available microbiome samples, when analyzed by MG-RAST, contain… Ornithorhynchus anatinus–the platypus.
Which leads the authors to snark, “[this] could lead to believe that platypus prefers warm, saline, and alkaline habitats, but… lives everywhere.”
Heh. And, yes, the software to fix this problem is called Platypus Conquistador.
BONUS VIDEO!
*Without getting into the details, we actually sequence very small pieces of genomes and then stitch them together (this is called ‘assembly’). If you’re a biologist, this is obviously oversimplified–don’t be pedantic (or at least, start your own blog and be pedantic there).
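For the non-biologists, here is a toy sketch of the "compare sequences to a database and see who matches" step described above. It is emphatically not how MG-RAST works internally. It just counts shared 8-letter substrings (k-mers) between each assembled sequence and each reference. The organism names, the sequences, and the `assign` and `community_profile` helpers are all made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign(query, references, k=8):
    """Name of the reference sharing the most k-mers with query, or None if none match."""
    query_kmers = kmers(query, k)
    scores = {name: len(query_kmers & kmers(ref, k))
              for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def community_profile(contigs, references, k=8):
    """Count how many contigs are assigned to each reference organism."""
    return Counter(assign(c, references, k) for c in contigs)


# Hypothetical reference "database": invented sequences, not real genomes.
references = {
    "organism_A": "ATGGCGTACGTTAGCCTAGGATCCGATCGATTACG",
    "organism_B": "TTGACCGGTAACCGGTTTAAACCCGGGTTTACAGT",
}

# Hypothetical assembled sequences from a sample.
contigs = [
    "GCGTACGTTAGCCTAGG",    # a piece of organism_A
    "CCGGTAACCGGTTTAAAC",   # a piece of organism_B
    "CTAGGATCCGATCGATT",    # another piece of organism_A
    "ACACACACACACACACAC",   # matches nothing in the database
]

profile = community_profile(contigs, references)
print(profile)
```

The important thing to notice is that the profile can only ever be as good as the labels in the reference database: the code has no way of knowing whether a sequence filed under "organism_A" really belongs there.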
Neither the poster nor your post makes clear the reason for these sorts of errors. Is it that there is contamination of samples with genuine platypus-derived nucleic acids? Or is it that the sequence assembly/annotation algorithms are not stringent enough and are misattributing reads?
We have done a shittetonne of de novo transcriptome assembly from Illumina RNAseq data, and we always get a decent number of assembled contigs that are definitely real, but definitely from contamination of the sequence libraries. So we find real sequences that are unambiguously from E. coli K12 (the strain of E. coli used in all molecular biology labs), pUC (common molecular biology plasmid), human, dog and cat (technicians making libraries have pets!), common trees and food plants, etc. These are all at extremely low abundances in our samples, but we always find them.
I’d put most of my money on contamination of the platypus reference sequence with various common molecular tools (like you mentioned) combined with the same type of contamination of the samples that people are collecting and processing. Although there is probably some signal processing error there as well (allowing for various sorts of alignment error, etc). A substantial issue IMHO is the disconnect in expertise between those who come up with these tools and those who want to use them.
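A minimal sketch of the scenario in the comment above, under loud assumptions: every sequence here is invented, `best_hit` is a hypothetical k-mer matcher rather than any real tool's method, and "vector" simply stands in for some common lab sequence. The same read gets no assignment against a clean reference set, but is called "platypus" once the platypus reference carries a stretch of the same lab sequence.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(read, references, k=8):
    """Name of the reference sharing the most k-mers with read, or None if none match."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & kmers(ref, k))
              for name, ref in references.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] > 0 else None


# All sequences are invented placeholders, not taken from any database.
vector = "GCTAGCAATTCGCGATATCGGCCTTAAGCG"          # stand-in for a common lab sequence
platypus_clean = "ATGCCATTGACGGTTCAAGCTTGGCACTGA"  # stand-in for genuine platypus sequence
bacterium = "TTGACAGCTAGCTCAGTCCTAGGTATAATG"       # stand-in for a microbial reference

clean_db = {"platypus": platypus_clean, "bacterium": bacterium}
# Same database, except the platypus entry has vector sequence stuck on the end.
contaminated_db = {"platypus": platypus_clean + vector, "bacterium": bacterium}

# A read that came from the lab sequence carried into the sample during prep.
read = vector[5:25]

before = best_hit(read, clean_db)         # no match against the clean references
after = best_hit(read, contaminated_db)   # now attributed to "platypus"
print(before, after)
```

Nothing about the read changed between the two calls, only the reference it was compared against.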
The MG-RAST web pages prominently state: “analysis of viruses and eukaryotic sequences is not currently supported”. From what I remember of my 1960’s vintage biology, platypuses are primitive forms of whatever they are but definitely not prokaryotes. How did they get in there?!?
“I’d put most of my money on contamination of the platypus reference sequence with various common molecular tools (like you mentioned) combined with the same type of contamination of the samples that people are collecting and processing.”
Ah. That makes a lot of sense. I hadn’t considered that the reference genome itself would include contaminants. We work either with organisms that have no reference genomes, or with those whose reference genomes are so exhaustively annotated as to not contain any such garbage.
But Platypi *DO* rule the earth. It’s just that they are subtle masters, and a bit shy.