Jonathan Eisen has a paper in PLoS One describing software that he’s developed for analyzing 16S rRNA sequence data. Rather than walk through everything, I’ve decided this post will be different: I’m going to treat this as a manuscript that I’m reviewing (there will be some differences, and it won’t be as formally written as a ‘real’ review). But I wanted to phrase some ‘real’ questions, as opposed to extensively distilling it for the ‘lay’ reader so non-scientists could see what we really criticize each other about (hint: it’s not whether evolution is real). Onto the review.
Eisen has a good summary of what the program does:
…it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for “manual” work in analyzing rRNAs….
The basics of the software are summarized below: (see flow chart too).
- Stage 1: Domain Analysis
- Take a rRNA sequence
- blast it against a database of representative rRNAs from all lines of life
- use the blast results to help choose sequences to use to make a multiple sequence alignment
- infer a phylogenetic tree from the alignment
- assign the sequence to a domain of life (bacteria, archaea, eukaryotes)
- Stage 2: First pass alignment and tree within domain
- take the same rRNA sequence
- blast against a database of rRNAs from within the domain of interest
- use the blast results to help choose sequences for a multiple alignment
- infer a phylogenetic tree from the alignment
- assign the sequence to a taxonomic group
- Stage 3: Second pass alignment and tree within domain
- extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)
- make a multiple sequence alignment
- infer a phylogenetic tree
From the above path, we end up with an alignment, which is useful for things such as counting number of species in a sample as well as a tree which is useful for determining what types of organisms are in the sample.
Lots of things to comment on here
1. We compared it to blastn because that is the most commonly used method for people to do things in their own computational pipeline.
2. We did do some comparisons to various web tools like those available through RDP, GreenGenes, etc. Not all of these are in the paper but I will try to post on them here or on the PloS One site. The main goal was to make stuff that did not need to do thiings on web servers so really we wanted to compare to things you could install locally.
3. We have not used STAP for short read sequences. I think Mitch Sogin’s approach to this is better than STAP. He basically compares the short reads to full lenngth sequences and then uses the information about the full length sequence to say what the short read might be.
4. As for the robustness to the alignment method — this is an issue. My hope is that we can eventually use programs being developed by Sean Eddy to do the alignment and skip all of the other programs.
5. Not sure what you mean by point #4 minor quibble about DOTUR and RDP open source.
5. Easy to retrain STAP – you just use a different database or taxonomy list for the pre-aligned sequences
In the middle of a few things. Will post more later
Mike,
We will be posting some analysis data soon from using Maq to do alignments with NextGen data on our company blog – FinchTalk. I’ll let you know when it’s posted – that might be helpful when thinking about the best way to align short reads. A lot of people are using Maq for resequencing and SNP analysis.
One of the problems that I see with Eisen’s paper is the lack of discussion concerning sequence quality. A great advancement in the human genome community, as you probably know, came about because phred could evaluate sequence quality and phrap was able to use sequence quality in assembling sequences.
Perhaps these analyses are happening upstream, but I don’t see anything in the Eisen paper that discusses assemblies or quality scores. It seems that those steps would be an important part, early on in the pipeline.
Why do I think quality analyses are important? Because I have been looking at lots of traces from amplified 16S rRNA and I can find many cases (with either phred or KB) where bases have been miscalled – or there are polymorphisms because of the multiple copies of rRNA in the genome. I have some examples here – my examples are from plant genes – but I see the same kinds of things with bacterial rRNA sequences – plus lots of polymorphisms. Assembly tools, like phred, cross_match and phrap (even though they are not open source), that are aware of mixed bases, and can perform trimming and vector masking, would reduce early problems with miscalls and would be a useful addition to any kind of automated pipeline.
And of course, with NextGen sequencing, the whole workflow changes anyway.
Wow, less than an hour to have the author respond, the interweb being put to good use.
Infernal’s almost ready to be slotted into rRNA analysis pipelines. ‘Til recently it had two main problems — too damned slow, and not very good at dealing with local alignment to small (454-sized) reads. Eric Nawrocki has pretty much addressed the first problem, and Eric and Diana Kolbe have developed good local alignment approaches for the second problem (which is more interesting that it may seem at first, because you’re doing a local alignment to a *structure*, not a sequence — but the 454 read may have thrown away parts of the sequence that you needed to see the base pairs). Infernal 1.0 is close to release — we’re in testing now. Release candidate rc1 is out at infernal.janelia.org, and rc2 (fixing some bugs) will be out in the next couple of days.
Eric’s now working with a variety of rRNA folks to bring it up to production strength specifically for rRNA alignment and subsequent phylogenetic analysis.
Interestingly, we’re finding that profile HMMs (sequence consensus only) are nearly as good as Infernal’s profile SCFGs (seq + structure) for most rRNA alignment tasks; both outdo BLASTN alignment by a fair margin. It might ultimately make sense to use profile HMMs for all but the most phylogenetically weird rRNAs (where you need to peer at secondary structure to get the alignment right). Any kind of profile-based method ought to be a leap forward w/ respect to nearest neighbor pairwise alignment.
I for one welcome ‘inside’ discussions among our sequence analysis overlords. I think I actually generally understand everything you are talking, about despite not knowing the specific methods all those strings of capital letters represent.
Anyways, I wish you all the best of luck, because I really don’t want to have to think too deeply about these algorithms 😉 I’ll shut up now and leave you to it.
thanks write…