Earlier this week, I attended the International Human Microbiome Consortium Meeting (the human microbiome consists of the organisms that live on and in us). I'm not sure what to make of the whole microbiome initiative, but one thing is clear to me: this is being driven by the wrong group of scientists.
Instead of being directed by biologists (primarily medical ones) who have devised a set of important questions and want to use the power of high-throughput genomics, including metagenomics, which sequences all of the DNA in a specimen (bacteria, viruses, fungi, protozoa, and, yes, human, which raises all sorts of bioethics questions), the human microbiome effort is being driven in large part by the major genome centers.
Of course, the major centers need to be involved: they’re the only ones who have the sequencing capacity, along with the sequence data assembly and annotation (figuring out what the sequence encodes) capabilities. Likewise, these centers do have plenty of biologists who are quite competent at designing experiments. What centers often lack, depending on the area, are the experts who know precisely what questions need to be asked, and who have also figured out how to analyze the data.
This brings me to my first concern. Much of the initial focus is on really complex systems, such as the human gut, which contains hundreds of bacterial species (that’s before you get to the viruses and eukaryotes). Because there are so many different genomes, even with massive throughput, most genomes recovered will be fragmentary–very fragmentary. I’m not sure what that will tell us.
Second, we don't have enough reference genomes: a recent estimate put the number of Streptococcus pneumoniae genomes needed to capture ninety percent of the total 'pan-genome' of that single species at 142. There are going to be a lot of genes whose origins we won't be able to figure out. I'm not really interested in the diversity of gyrase B protein; to a considerable extent, it's the variable loci that will be interesting, and those will be the hardest to assign to a particular organism.
Third, these will be awful data to analyze. Here's why. Ideally, you want as many replicates as possible (in this case, human volunteers), and only as much data as needed to answer the question. The last part doesn't seem to make sense until you consider that when you conduct enough tests, some of them will yield false positives (one in twenty if you use a p = 0.05 cutoff) unless you correct for this*, and that correction makes the per-test significance threshold really small (one in a million or worse), which guts your statistical power. This will be the mother of all SNP hunts. In human genomics, they deal with this problem all the time; the latest technology can screen a single genome for 900,000 SNPs, and these studies have to enroll (or combine smaller studies to reach) thousands of people. With the human microbiome, we will have maybe 500 human volunteers, each of whom is associated with literally megabytes (if not gigabytes) of data.
The mother of all SNP hunts, indeed.
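The multiple-testing arithmetic above can be sketched in a few lines. This is a minimal illustration using the numbers from the post (900,000 SNPs, a p = 0.05 cutoff) and the Bonferroni correction mentioned in the footnote; the specific figures are the post's examples, not results from any actual microbiome study.

```python
# Multiple-testing sketch: why an uncorrected p = 0.05 cutoff fails
# when you run hundreds of thousands of tests, and what Bonferroni
# does to the per-test threshold.

n_tests = 900_000  # SNPs screened per genome, per the post
alpha = 0.05       # the conventional significance cutoff

# Uncorrected: among truly null tests, about one in twenty will come
# up "significant" by chance -- here, tens of thousands of false hits.
expected_false_positives = alpha * n_tests

# Bonferroni correction: divide the cutoff by the number of tests, so
# the chance of even one false positive across the whole screen stays
# near alpha. The resulting per-test threshold is tiny ("one in a
# million or worse"), which is why power becomes the limiting factor.
bonferroni_cutoff = alpha / n_tests

print(f"expected false positives, uncorrected: {expected_false_positives:.0f}")
print(f"Bonferroni per-test cutoff: {bonferroni_cutoff:.2e}")
```

With only a few hundred volunteers, very few real effects will clear a threshold that small, which is the power problem the post is pointing at.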
So what to do? First, all of the genome centers, big and small, need to collaborate on a single system. Second, this system needs to be simple, with not very many species, so we can begin to get some kind of replication in microbial communities that can be statistically assessed**. Third, specific questions need to be asked. We can't just go searching for what is 'out there.' We need specific hypotheses so we don't drown in all of the data. We'll collect lots of 'excess' data whether we like it or not, so the signal-to-noise ratio needs to be maximized as much as possible.
OK, I’ll stop now, so I don’t lose my three remaining readers….
*The Bonferroni correction is one such method.
**As I’ve mentioned before, if we want to do high-throughput Latin binomials (species counting), the problem gets much simpler. It also doesn’t require metagenomics; I’m dealing with metagenomic approaches here.
Mike, not all science needs to be "hypothesis driven". Whenever a field is in its infancy, "just searching for what is out there" *is* what is needed, because there aren't sufficient data to make any meaningful hypotheses. Nobody, neither genomicists nor medical researchers, really knows enough to ask directed questions about the microbiome at this point.
Mike, you’ve put your finger on a real problem: the science is no longer driving the big genome centers. Once NHGRI (and DOE) created these huge centers, they created a self-perpetuating interest group that really wants to keep the centers going. This group includes not just the centers (obviously), but the funders – the program officers and staff at NIH – as well. They need ever-larger sequencing projects to keep those sequencers fed, and they’re making up the scientific justifications in a post-hoc fashion.
It’s not that the science is bad – actually, I think the human microbiome project is an excellent idea – but rather that the project leaders are the wrong people. So I agree with you, but that being said, I don’t know any way to change the way the Microbiome project is being led. The funders (NIH) are happy, and they like being able to see such tangible productivity (which you don’t always get from classic, hypothesis-driven research), and the big centers are happy too.
The microbiome is being done because it’s relatively easy and quick. Plus the centers appear to have the available capacity. Much harder is understanding the underlying relationships and how the various species interact. Consortiums are a bitch to pick apart and only the simplest are somewhat understood. I agree that a few “simple” systems should be blanketed first.
What Steven Salzberg said is exactly correct. These Centers are like carpenters with big fucking $100,000,000 hammers looking for nails to pound.