Jonathan Eisen has a paper in PLoS One describing software that he’s developed for analyzing 16S rRNA sequence data. Rather than walk through everything, I’ve decided this post will be different: I’m going to treat this as a manuscript that I’m reviewing (there will be some differences, and it won’t be as formally written as a ‘real’ review). I wanted to pose some ‘real’ questions, rather than distill everything for the ‘lay’ reader, so that non-scientists can see what we really criticize each other about (hint: it’s not whether evolution is real). On to the review.
Eisen has a good summary of what the program does:
…it describes automated software for analyzing rRNA sequences that are generated as part of microbial diversity studies. The main goal behind this was to keep up with the massive amounts of rRNA sequences we and others could generate in the lab and to develop a tool that would remove the need for “manual” work in analyzing rRNAs….
The basics of the software are summarized below: (see flow chart too).
- Stage 1: Domain Analysis
  - Take a rRNA sequence
  - blast it against a database of representative rRNAs from all lines of life
  - use the blast results to help choose sequences to use to make a multiple sequence alignment
  - infer a phylogenetic tree from the alignment
  - assign the sequence to a domain of life (bacteria, archaea, eukaryotes)
- Stage 2: First pass alignment and tree within domain
  - take the same rRNA sequence
  - blast against a database of rRNAs from within the domain of interest
  - use the blast results to help choose sequences for a multiple alignment
  - infer a phylogenetic tree from the alignment
  - assign the sequence to a taxonomic group
- Stage 3: Second pass alignment and tree within domain
  - extract sequences from members of the putative taxonomic group (as well as some others to balance the diversity)
  - make a multiple sequence alignment
  - infer a phylogenetic tree
From the above path, we end up with an alignment, which is useful for things such as counting number of species in a sample as well as a tree which is useful for determining what types of organisms are in the sample.
I note – the key is that it is completely automated and can be run on a single machine or a cluster and produces comparable results to manual methods. In the long run we plan to connect this to other software and other labs develop to build a metagenomics and microbial diversity workflow that will help in the processing of massive amounts of sequence data for microbial diversity studies.
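To make the control flow concrete, here is a toy Python sketch of the three stages described above. It is emphatically not STAP's code: the BLAST search is replaced by a crude k-mer similarity score, the "assign from the tree" steps are replaced by "take the label of the best hit", and the reference entries, sequences, and function names are all made up for illustration. The only thing it is meant to show is the narrowing from all of life, to a domain, to a taxonomic group plus a few extra sequences for balance.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical reference entry: (name, domain, group, sequence). All invented.
Ref = Tuple[str, str, str, str]


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets: a stand-in for a BLAST score."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def best_hits(query: str, refs: List[Ref], n: int) -> List[Ref]:
    """Return the n references most similar to the query."""
    return sorted(refs, key=lambda r: similarity(query, r[3]), reverse=True)[:n]


def three_stage(query: str, refs: List[Ref], n_others: int = 2) -> Dict[str, object]:
    # Stage 1: search everything, then assign a domain.
    # (A real pipeline would align the hits and read the domain off a tree.)
    domain = best_hits(query, refs, 1)[0][1]

    # Stage 2: search only within that domain, then assign a taxonomic group.
    in_domain = [r for r in refs if r[1] == domain]
    group = best_hits(query, in_domain, 1)[0][2]

    # Stage 3: gather members of the group plus a few others for balance;
    # these are the sequences that would go into the final alignment and tree.
    members = [r for r in in_domain if r[2] == group]
    others = [r for r in in_domain if r[2] != group][:n_others]
    return {
        "domain": domain,
        "group": group,
        "for_final_tree": [r[0] for r in members + others],
    }


REFS: List[Ref] = [
    ("bac_A1", "Bacteria", "GroupA", "ACGTACGTACGGTT"),
    ("bac_A2", "Bacteria", "GroupA", "ACGTACGTACGGTA"),
    ("bac_B1", "Bacteria", "GroupB", "ACGTTTTTACGGCC"),
    ("arc_X1", "Archaea", "GroupX", "GGGCCCGGGCCCAA"),
    ("euk_Y1", "Eukaryota", "GroupY", "TTTAAATTTAAACC"),
]

result = three_stage("ACGTACGTACGGTC", REFS)
print(result)
```

With these invented sequences the query lands in Bacteria, GroupA, and the final set contains both GroupA members plus the lone GroupB sequence as the "balancing" outsider.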
Before I get started, some disclosure: I am overseeing the development of this type of analytical pipeline for a major genome center. We are using some of the methods described here, some of which aren’t, and some we built ourselves (or heavily modified).
First, there are some good things:
- The software can be implemented across a cluster, making it very fast even though it’s computationally intensive.
- It’s phylogenetically based and, at least in principle, should be more accurate than methods that look for sequence identity (sequence identity can arise by convergent evolution, particularly in 16S rRNA, where the secondary structure of the rRNA, as opposed to the sequence itself, is also very important).
- The program will provide a phylogeny as output (which can be used in other programs).
Now the questions and issues:
- There are a lot of comparisons between STAP and BLASTN in terms of performance, but none with RDP. How does the performance of STAP compare to that of RDP?
- STAP appears to have the most utility when using long 16S reads. With the advent of 454 technology, where reads will be between 200 and 400 bp, the conserved regions covered by each read will be very short, and consequently will carry little phylogenetic signal. How well does the intermediate classification step work with short reads?
- How robust is the classification to alignment methods (i.e., NAST, MUSCLE, KOFFEE, INFERNAL)? This issue could potentially become even worse with shorter reads. (k-mer based methods are ‘alignment-immune’).
- A minor quibble, but the RDP aligner is open source, and the clustering algorithm is simply DOTUR.
- It’s possible to retrain the RDP classifier. How easy is it to retrain STAP if one wants to use a different classification scheme?
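For anyone wondering what ‘alignment-immune’ and ‘retraining’ mean in the questions above, here is a minimal Python sketch of a k-mer word classifier. It is in the spirit of a naive Bayes word-based classifier, but it is a simplified illustration under my own assumptions, not the RDP implementation: the 4-letter word size, the smoothing constants, the taxon labels, and the sequences are all invented. The point is that the query is never aligned to anything, and that switching classification schemes just means calling `train` again with different labels.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(labeled: List[Tuple[str, str]], k: int = 4) -> Dict[str, dict]:
    """Count, per taxon, how many training sequences contain each word."""
    model: Dict[str, dict] = {}
    for taxon, seq in labeled:
        entry = model.setdefault(taxon, {"n": 0, "counts": defaultdict(int)})
        entry["n"] += 1
        for word in kmers(seq, k):
            entry["counts"][word] += 1
    return model


def classify(query: str, model: Dict[str, dict], k: int = 4) -> str:
    """Pick the taxon with the highest summed log word probability."""
    words = kmers(query, k)

    def log_score(entry: dict) -> float:
        # Smoothed estimate (illustrative constants) so that a word never
        # seen in a taxon lowers the score without making it undefined.
        return sum(
            math.log((entry["counts"].get(w, 0) + 0.5) / (entry["n"] + 1))
            for w in words
        )

    return max(model, key=lambda taxon: log_score(model[taxon]))


# Invented training sequences under one labeling scheme.
SEQS = ["ACGTACGTACGGTT", "ACGTACGTACGGTA", "TTGACCTTGACCAA", "TTGACCTTGACCAT"]
scheme_1 = list(zip(["taxon_A", "taxon_A", "taxon_B", "taxon_B"], SEQS))

# A short, unaligned fragment: no alignment step is needed to classify it.
read = "CGTACGTACGG"
call_1 = classify(read, train(scheme_1))

# "Retraining" is just rebuilding the model with a different set of labels.
scheme_2 = list(zip(["clade_1", "clade_1", "clade_2", "clade_2"], SEQS))
call_2 = classify(read, train(scheme_2))

print(call_1, call_2)
```

The same read is called `taxon_A` under the first scheme and `clade_1` under the second. Whether retraining a tree-based pipeline like STAP is this painless is exactly what the last question is asking.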
I realize that most readers, other than Eisen (if he happens to stumble across this), will have no idea what I’m talking about. Don’t worry, I’ll return soon to calling idiots who desperately need my help fucking morons. But this post does have two uses. First, I do want to know the answers (consider this very high-order blogwhoring). Second, it just highlights how ridiculous creationists are, whether they be young-earth or intelligent design. As I’ve discussed previously, they’re arguing about horse-drawn buggies while we’re roaring past them in supersonic jets. They can’t even understand basic biological concepts, never mind the stuff most biologists do on a daily basis*. They are partying like it’s 1859.
*If any of the creationist Uruk-hai do stumble across this post, you know they’ll argue that this is genetics and not TEH DARWINISMZ!!, which just shows how stupid they are.