…assembly and analysis. From the depths of the Mad Biologist’s Archives comes this post.
The Wellcome Trust has a very good (and mostly accurate) article about the ‘next-gen’ sequencing technologies. I’m going to focus on bacterial genomics because humans are boring (seriously, compared to two bacteria in the same species, once you’ve seen one human genome, you’ve seen them all).
Most of the time, when you read articles about sequencing, they focus on the actual production of raw sequence data (i.e., ‘reads’). But that’s not the rate-limiting step. We have now reached the point where working with the data we generate is far more time-consuming than generating them.
Whole genomes don’t come flying out of the sequencing machines: we have to take hundreds of thousands or millions of reads and stitch them together–what is known in genomics as assembly. It’s pretty easy and fast to get a pretty good genome. By pretty good, I mean that most of the genome (~99%) is assembled into pieces 50,000 – 1,500,000 bases long*. Where the assemblers get hung up with bacteria is repeated elements–regions of the genome that are virtually identical (they don’t have to be completely identical, just close enough that the assembler thinks they’re identical reads with sequencing errors). Because the assembler can’t figure out where to put these reads (as far as it can tell, they’re all identical), it discards them–that’s where the breaks occur*.
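To make the repeat problem concrete, here is a minimal sketch in Python. It is not how any particular assembler works: it swaps in a toy k-mer graph (de Bruijn style) instead of read placement, and the sequences, the `repeat` string and the function names are all made up for illustration. The point is only that a sequence present twice leaves the assembler with two equally valid ways to continue once it reaches the end of the repeat.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def branching_nodes(seq, k):
    """Return the (k-1)-mers that are followed by more than one distinct base.

    In a k-mer graph these are the places where an assembler cannot
    tell which way to extend, so a contig has to stop there.
    """
    successors = defaultdict(set)
    for kmer in kmers(seq, k):
        successors[kmer[:-1]].add(kmer[-1])
    return {node: sorted(nxt) for node, nxt in successors.items() if len(nxt) > 1}


# Hypothetical toy sequences: 'repeat' stands in for an IS element.
repeat = "ACGTACGGTC"
two_copies = "TTGCA" + repeat + "GGATC" + repeat + "CATTG"
one_copy = "TTGCA" + repeat + "GGATC" + "CATTG"

print(branching_nodes(two_copies, 6))  # the end of the repeat has two possible continuations
print(branching_nodes(one_copy, 6))    # no repeat, no ambiguity
```

With two copies of the repeat, the last five bases of the repeat can be followed by either flanking sequence, and nothing in the k-mers themselves says which is right. Real repeats are much longer than real reads, which is why longer k-mers alone don’t rescue you.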
This is a problem because some of the most interesting genes, such as antibiotic resistance genes, are found sandwiched between repeated elements known as insertion sequence elements (‘IS elements’). IS elements are one of the major reasons resistance genes move from plasmid to plasmid, and from plasmid to chromosome (plasmids are mini-chromosomes that themselves can move from bacterium to bacterium). What this means is that we can assemble an antibiotic resistance gene (or genes) but we might not know if it’s found on a plasmid or on the chromosome–that’s a pretty critical biological question. To further complicate things, different plasmids can carry the same IS elements, and so can the bacterial chromosome. Not only will these introduce breaks into the assembly, but they can also lead to accidentally assembling plasmids together or incorrectly incorporating them into the genome.
Now, we do have methods to close up these gaps–this process is called finishing, and it involves either targeted sequencing or manually parsing through the existing data. But these are open-ended, slow processes (particularly the targeted sequencing). Worse, this involves thinking, and, relative to computer algorithms, thinking is very slow. This is also really expensive. So we can get a pretty good assembly, but I think a lot of people, thinking back to the Sanger sequencing days, when most bacterial genomes were closed, are going to have to understand that if you want a lot of genomes, they will be ‘pretty good’ assemblies, not closed, finished ones.
The other area is annotation: now that you have a bunch of sequences, you would like to know what genes are found on those sequences. This involves two things: identifying the open reading frame (‘ORF’) of the gene (that is, which nucleotides encode proteins), and then identifying what that open reading frame encodes (I’m making this sound like a two-step process; it’s actually an iterative process, where each step informs the other).
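For a feel of what the first step involves, here is a deliberately naive sketch in Python. It only scans the forward strand for an ATG followed in-frame by a stop codon; actual gene callers are far more sophisticated than this, and the function name, the `min_codons` cutoff and the test sequence are my own inventions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs, 1-based and inclusive, for forward-strand ORFs.

    An ORF here is simply the first ATG in a reading frame through the
    next in-frame stop codon. The reverse strand and alternative start
    codons are ignored to keep the sketch short.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count coding codons (the stop codon itself is excluded).
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence: ATG AAA CCC GGG TAA sits in the third reading frame.
print(find_orfs("CCATGAAACCCGGGTAACC"))
```

Even this toy version has to make judgment calls (which start codon, how short is too short), and that is exactly where different methods start to part ways.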
Here too, we have automated gene callers which are very fast. Actually, many different gene calling methods. That’s good! However, they will disagree with each other about five to ten percent of the time. By disagree, I don’t just mean that two different methods call the same exact region a different protein (e.g., an aldolase versus a dehydrogenase). We could cope with that for a lot of the downstream analyses we do, as long as we have identified the protein correctly**. The problem really arises when two different, overlapping regions of sequence are identified as ORFs (e.g., program A calls nucleotides 1-300 as a gene, and B calls nucleotides 13-360 as a gene). That is not good, because then a human has to go through the output manually and figure out what the actual ORF is (requiring more thinking, which is slow and expensive). I would note that most major sequencing centers do manual annotation, but it is slow.
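Flagging these disagreements is the easy part, as the following Python sketch suggests; resolving them is what eats the time. It assumes each gene caller’s output has already been boiled down to (start, end) coordinate pairs on the same strand, and the function names and example calls are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_calls(calls_a, calls_b):
    """Split two sets of gene calls into exact agreements and overlapping conflicts.

    Returns (agreed, conflicts): 'agreed' holds coordinates both callers
    report identically; 'conflicts' holds (call_a, call_b) pairs that
    overlap but do not match, i.e. the ones a human has to look at.
    """
    agreed = sorted(set(calls_a) & set(calls_b))
    conflicts = sorted(
        (a, b) for a in calls_a for b in calls_b if a != b and overlaps(a, b)
    )
    return agreed, conflicts


# The example from the text, plus one call both programs agree on.
program_a = [(1, 300), (500, 900)]
program_b = [(13, 360), (500, 900)]

agreed, conflicts = compare_calls(program_a, program_b)
print("agreed:", agreed)
print("needs a human:", conflicts)
```

This sketch treats any overlap between non-identical calls as a conflict, which is cruder than what a real pipeline would probably want; it is only meant to show where the manual work comes from.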
So, from a bacterial perspective, genome sequencing is really cheap and fast–in about a year, I conservatively estimate (very conservatively) that the cost of sequencing a bacterial genome could drop to about $1,500 (currently, commercial companies will do a high-quality draft for around $5,000–$6,000). We are entering an era where the time and money costs won’t be focused on raw sequence generation, but on the informatics needed to build high-quality genomes with those data.
*There are other technical reasons why breaks occur, but, to me, this is the worst offender.