By now, you might have heard about Oxford Nanopore’s first preview of their Gridion and Minion technologies at the recent AGBT conference. While I think some caution is in order, I don’t think this will turn out to be ‘vaportech.’ While much of the discussion has understandably focused on what Nanopore means for human genomics, I want to discuss what this technology could–note the word could–mean for bacterial genomics, since humans are boring (for ease of writing, I’m going to lump Gridion and Minion together and call them ‘Nanopore’).
I see the promise of bacterial genomics built around the sequencing of hundreds of genomes per week. Not that there’s anything wrong with sequencing a small number of genomes, but being able to do high-throughput genomic epidemiology is going to be a key area (and with 3,700 hospitals in the U.S. alone, a potentially very lucrative one). To do this, your processes will have to be automated, which means you will need very high quality data. In addition, we won’t be using reference genomes, but generating de novo assemblies. To translate this into English (or my approximation thereof), we are looking for new genes (e.g., antibiotic resistance genes) and genetic structures (e.g., plasmids), not asking how a given bacterium differs from a known ‘reference’ sequence (which has the limitation of not being able to examine differences that aren’t found in your reference sequence–you can’t compare something to something you don’t know about…).
Before we get to Nanopore, let’s quickly review sequencing. We don’t actually sequence a genome in one fell swoop: we chop the DNA into pieces, from one hundred to five hundred base pairs long (‘bp’, where one bp is a single ‘letter’ or nucleotide of DNA) and sequence those fragments. For reference, even a stripped down E. coli like E. coli K-12, the archetype microbial lab rat, has 4.6 million bp of DNA. We also then sequence another DNA preparation from the same bacterium (a ‘DNA library’) with larger fragments, usually around 5,000 bp (5 kb), and sequence each end of that fragment–this is called a ‘jumping library’ or ‘jumps’ (I’m leaving out a lot of molecular biology to make this happen).
The small fragments can be piled or ’tiled’ on each other to create a stretch of sequence known as a contig. Like so:
Becomes: AGCTCA (although we typically have much more ‘coverage’–many more reads confirming each base). This works fine until you have repetitive content–identical (or nearly so) sequences that occur throughout the genome. Suppose our genome has unique sequences A, B, and C, with a repetitive region X between them. It’s impossible to figure out from tiling reads if we have A-X-B-X-C or A-X-C-X-B. We solve this by using the bits of sequence at the end of the much larger pieces to stitch the individual contigs into a scaffold. Remember, with most current sequencing technologies we can’t simply sequence the entire large piece of DNA.
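For the code-minded, here’s a minimal sketch of the tiling idea and of why repeats cause trouble. The three reads and the repeat sequence are made up for illustration (they are not real data), and real assemblers are far more sophisticated than this; the function names are my own.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def tile(reads: list[str], min_len: int = 2) -> str:
    """Merge reads into one contig, assuming they are already in genome order."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError(f"no overlap of at least {min_len} bp with {read!r}")
        contig += read[k:]  # append only the part of the read that is new
    return contig


# Hypothetical short reads that tile into the contig from the example.
reads = ["AGCT", "GCTC", "CTCA"]
contig = tile(reads)
print(contig)

# Why repeats hurt: a contig that ends inside a repeat (here the made-up
# repeat GATTACA) overlaps equally well with a read that continues into
# region B and with one that continues into region C.
contig_end = "TTGATTACA"
into_b = "GATTACAGG"  # hypothetical read: repeat, then the start of B
into_c = "GATTACACT"  # hypothetical read: repeat, then the start of C
ambiguous = overlap(contig_end, into_b, 7) == overlap(contig_end, into_c, 7)
print(ambiguous)
```

Both candidate reads overlap the contig by the full length of the repeat, so tiling alone can’t tell you which one comes next. That’s the job the jumps do.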
That’s where Nanopore is very exciting.
Nanopore claims that they’ve been able to sequence huge pieces of DNA (much larger than 5 kb). This means we can avoid making two DNA libraries–and in bacterial genomics, the true cost of DNA preparation is currently about 8-9 times that of sequencing the prepared DNA*. Better still, the ‘jumping library’ with the large fragments, which is the one we’d get to skip, is the more expensive and time-consuming library. It also means that we’ll have fewer stretches of unknown sequence in bacterial genomes (the ‘Xs’ from the previous example). And, without getting into technical details, it’s much easier to assemble fragments than fragments and jumps.
So this could be a ‘game changer.’ If.
I have two concerns. The minor concern is that Nanopore will be more likely to sequence the smaller pieces of DNA than the larger ones. When we chop DNA into pieces, let’s say 5 kb, we actually get a spread of sizes. We can make that spread quite small, but doing so is not fast and requires a lot of labor. That is, money. But that’s a minor concern, since there are probably clever tricks one could do to get around this.
The major concern is the error rate. More specifically, it’s the kind of error. According to Nanopore, the ‘indel’ error rate is about four percent. That is, for roughly four out of every one hundred bases, a base is either erroneously added or deleted due to sequencing error. When I’ve seen data of the same strain–a clinical isolate of E. coli–sequenced with different technologies, the overall assembly statistics often look good: we’re able to assemble it into a few pieces (remember, fewer pieces is better). If there’s a known reference, most of the reference is successfully sequenced too.
But the problem comes when we try to annotate the genome–that is, figure out what genes are contained in the sequence. In my experience, technologies that have a lot of ‘indels’ in the raw sequence end up with lots of ‘broken genes’ because enough of those indels wind up in the final, assembled genome (for the cognoscenti, a 1-in-10,000 final indel rate will bust up lots of genes). To explain this, consider this ‘toy’ gene (genes are much longer, but bear with me):
ATG CCC ATA TGA
This encodes a protein that’s three amino acids long (an amino acid is the smallest subunit of a protein and is represented by three nucleotides known as a codon). The “TGA” is a ‘stop codon’ and represents the end of the protein. Now, imagine we have an indel in our sequence (the insertion is the extra C right after the second codon, CCC):
ATG CCC CAT ATG A
OH NOES! Not only have we changed the amino acid in the third position (ATA –> CAT), but we removed the stop codon so the software that identifies genes (a ‘gene caller’) just keeps on going, like the runaway subway car in Pelham 123 (ALL THE LIGHTS ARE GREEN!! AAAIIEEE!!!). We now incorrectly think the protein is much longer. Not only did we screw up this gene, but we might have also read into the next gene (ALL THE LIGHTS ARE GREEN!! AAAIIEEE!!!), thereby screwing that one up too. Like I mentioned, in my experience, indel-prone technologies yield many more indels in the final genome sequence. Many more.
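If you want to see the runaway subway car in action, here’s a small sketch using the toy gene above. It only looks for the three standard stop codons (TAA, TAG, TGA) in a single reading frame; the helper names are my own, and a real gene caller does a great deal more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three standard stop codons


def codons(seq: str) -> list[str]:
    """Split a sequence into complete codons, dropping any leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def first_stop(seq: str):
    """Index of the first in-frame stop codon, or None if the frame stays open."""
    for i, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return i
    return None


gene = "ATGCCCATATGA"               # the toy gene: ATG CCC ATA TGA
mutant = gene[:6] + "C" + gene[6:]  # one C inserted after the second codon

print(codons(gene), first_stop(gene))      # stop codon found in the last position
print(codons(mutant), first_stop(mutant))  # no stop codon left in frame
```

In the intact gene the stop codon sits in the fourth position. After the single-base insertion there is no in-frame stop at all, so anything reading this frame would keep going into whatever sequence comes next.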
Now, a human can catch this and note it, but humans are error-prone and slow. Slow means expensive and puts the kibosh on the whole hundreds-per-week idea. Unless the error is confirmed though, we have to treat it as ‘real’ for any kind of analysis (otherwise we’re cheating). We can manually confirm these errors, but that’s even slower and more expensive.
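To put rough numbers on ‘slow,’ here’s a back-of-the-envelope sketch. The genome size is the 4.6 million bp E. coli K-12 figure and the indel rate is the 1-in-10,000 figure from above; the 200 genomes per week is purely my stand-in for ‘hundreds,’ so treat it as an assumption.

```python
def expected_indels(genome_bp: int, one_in: int) -> float:
    """Expected indels in an assembly, given one indel per `one_in` bases."""
    return genome_bp / one_in


GENOME_BP = 4_600_000   # E. coli K-12, as quoted earlier in the post
FINAL_INDEL_1_IN = 10_000  # the 1-in-10,000 final indel rate
GENOMES_PER_WEEK = 200  # assumption: a stand-in for 'hundreds per week'

per_genome = expected_indels(GENOME_BP, FINAL_INDEL_1_IN)
per_week = per_genome * GENOMES_PER_WEEK

print(f"{per_genome:.0f} suspect indels per genome")
print(f"{per_week:.0f} suspect indels per week")
```

Under those assumptions that works out to 460 possible indels per genome, or 92,000 a week that somebody would have to look at. Not all of them will land in genes, but you don’t know which ones matter until you check.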
What profit a Mad Biologist who gains cheap sequence only to have to pay tons of money on the back end? (I think that’s in the Bible somewhere).
Now, Oxford Nanopore claims that, by the time the sequencers ship, they’ll have fixed this. If they do, this technology is a breakthrough. If not, it still has utility, in that we can use it to stitch contigs together, but then it’s a companion technology, not the droids, er, breakthrough we’ve been looking for. When research groups put Nanopore through its paces, then we’ll see if it lives up to the promise (or hype, if you prefer).
Hopefully, it will.
*Since I don’t work on human genomes, which are about 1,000 times larger than bacterial ones, the actual sequencing costs are trivial. Library construction costs are the same, however, regardless of the size of the genome. I occasionally joke that I’m not in the sequencing business but the library construction business, since that’s the expensive part.