So What Could Nanopore Mean for Bacterial Genomics? (And the Pelham 123 Problem)

By now, you might have heard about Oxford Nanopore’s first preview of their GridION and MinION technologies at the recent AGBT conference. While I think some caution is in order, I don’t think this will turn out to be ‘vaportech.’ While much of the discussion has understandably focused on what Nanopore means for human genomics, I want to discuss what this technology could–note the word could–mean for bacterial genomics, since humans are boring (for ease of writing, I’m going to lump GridION and MinION together and call them ‘Nanopore’).

I see the promise of bacterial genomics built around the sequencing of hundreds of genomes per week. Not that there’s anything wrong with sequencing a small number of genomes, but being able to do high-throughput genomic epidemiology is going to be a key area (and with 3,700 hospitals in the U.S. alone, a potentially very lucrative one). To do this, your processes will have to be automated, which means you will need very high-quality data. In addition, we won’t be using reference genomes, but generating de novo assemblies. To translate this into English (or my approximation thereof), we are looking for new genes (e.g., antibiotic resistance genes) and genetic structures (e.g., plasmids), not asking how a given bacterium differs from a known ‘reference’ sequence (which has the limitation of not being able to examine differences that aren’t found in your reference sequence–you can’t compare something to something you don’t know about…).

Before we get to Nanopore, let’s quickly review sequencing. We don’t actually sequence a genome in one fell swoop: we chop the DNA into pieces, from one hundred to five hundred base pairs long (‘bp’, where one bp is a single ‘letter’ or nucleotide of DNA), and sequence those fragments. For reference, even a stripped-down E. coli like E. coli K-12, the archetypal microbial lab rat, has 4.6 million bp of DNA. We then make another DNA preparation from the same bacterium (a ‘DNA library’) with larger fragments, usually around 5,000 bp (5 kb), and sequence each end of each fragment–this is called a ‘jumping library’ or ‘jumps’ (I’m leaving out a lot of the molecular biology needed to make this happen).
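To make that concrete, here’s a toy Python sketch of the two library types (short shotgun reads and paired ‘jumps’). Every size, name, and number below is invented purely for illustration; this is a sketch of the idea, not anyone’s actual protocol.

import random

random.seed(42)
# Toy 'genome': 50,000 random bases (far smaller than E. coli K-12's 4.6 million bp).
genome = "".join(random.choice("ACGT") for _ in range(50_000))

def shotgun_reads(seq, n_reads, read_len=300):
    """Sample short fragments (real libraries run roughly 100-500 bp) from random positions."""
    reads = []
    for _ in range(n_reads):
        start = random.randrange(len(seq) - read_len)
        reads.append(seq[start:start + read_len])
    return reads

def jumping_library(seq, n_pairs, insert_len=5_000, end_len=100):
    """Sample ~5 kb fragments but keep only the sequence at each end (a 'jump' pair)."""
    pairs = []
    for _ in range(n_pairs):
        start = random.randrange(len(seq) - insert_len)
        pairs.append((seq[start:start + end_len],
                      seq[start + insert_len - end_len:start + insert_len]))
    return pairs

reads = shotgun_reads(genome, n_reads=2_000)   # roughly 12x coverage of the toy genome
jumps = jumping_library(genome, n_pairs=200)
print(len(reads), "short reads;", len(jumps), "jump pairs")

The point of the jump pairs is that we know the two end reads sit about 5 kb apart, even though we never read the sequence in between.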

The small fragments can be piled or ‘tiled’ on each other to create a stretch of sequence known as a contig. Like so:

AGCT
GCTC
CTCA

Becomes: AGCTCA (although we typically have much more ‘coverage’–many more reads confirming each base). This works fine until you have repetitive content–identical (or nearly so) sequences that occur throughout the genome. Suppose our genome has unique sequences A, B, and C, with a repetitive region X between them. It’s impossible to figure out from tiling reads if we have A-X-B-X-C or A-X-C-X-B. We solve this by using the bits of sequence at the end of the much larger pieces to stitch the individual contigs into a scaffold. Remember, with most current sequencing technologies we can’t simply sequence the entire large piece of DNA.
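Here’s a minimal Python sketch of that tiling step, a greedy overlap merge of the AGCT/GCTC/CTCA example above (a toy illustration, not a real assembler, which has to cope with errors, coverage, and the repeat problem just described):

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (at least min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def tile(reads, min_len=3):
    """Greedily extend a contig by whichever remaining read overlaps its end best."""
    contig = reads[0]
    remaining = list(reads[1:])
    while remaining:
        best = max(remaining, key=lambda r: overlap(contig, r, min_len))
        k = overlap(contig, best, min_len)
        if k == 0:
            break  # no overlap left; with real data this is where a new contig starts
        contig += best[k:]
        remaining.remove(best)
    return contig

print(tile(["AGCT", "GCTC", "CTCA"]))  # prints AGCTCA

With a repeat in the mix, two different reads can overlap the growing contig equally well, and the greedy choice (A-X-B versus A-X-C) becomes a coin flip; that is exactly the ambiguity the jumps are there to resolve.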

That’s where Nanopore is very exciting.

Nanopore claims that they’ve been able to sequence huge pieces of DNA (much larger than 5 kb). Not only does this mean we can avoid making two DNA libraries–and in bacterial genomics, the true cost of DNA preparation is currently about 8-9 times that of sequencing the prepared DNA*–but the library we would get to drop, the ‘jumping library’ with the large fragments, is the more expensive and time-consuming of the two. It also means that we’ll have fewer stretches of unknown sequence in bacterial genomes (the ‘Xs’ from the previous example). And, without getting into technical details, it’s much easier to assemble fragments alone than fragments plus jumps.

So this could be a ‘game changer.’ If.

I have two concerns. The minor concern is that Nanopore will be more likely to sequence the smaller pieces of DNA than the larger ones. When we chop DNA into pieces, let’s say 5 kb, we actually get a spread of sizes. We can make that spread quite small, but doing so is not fast and requires a lot of labor. That is, money. But that’s a minor concern, since there are probably clever tricks one could do to get around this.

The major concern is the error rate. More specifically, it’s the kind of error. According to Nanopore, the ‘indel’ error rate is about four percent. That is, for every one hundred bases, roughly four are either erroneously added or deleted due to sequencing error. When I’ve seen data from the same strain–a clinical isolate of E. coli–sequenced with different technologies, the overall assembly statistics often look good: we’re able to assemble it into a few pieces (remember, fewer pieces is better). If there’s a known reference, most of the reference is successfully covered too.

But the problem comes when we try to annotate the genome–that is, figure out what genes are contained in the sequence. In my experience, technologies that have a lot of ‘indels’ in the raw sequence end up with lots of ‘broken genes’ because enough of those indels wind up in the final, assembled genome (for the cognoscenti, even a 1-in-10,000 final indel rate will bust up lots of genes). To explain this, consider this ‘toy’ gene (real genes are much longer, but bear with me):

ATG CCC ATA TGA

This encodes a protein that’s three amino acids long (an amino acid is the smallest subunit of a protein and is represented by three nucleotides known as a codon). The “TGA” is a ‘stop codon’ and represents the end of the protein. Now, imagine we have an indel in our sequence (the extra C inserted after the “CCC” is the error):

ATG CCC CAT ATG A

OH NOES! Not only have we changed the amino acid in the third position (ATA –> CAT), but we removed the stop codon, so the software that identifies genes (a ‘gene caller’) just keeps on going, like the runaway subway car in Pelham 123 (ALL THE LIGHTS ARE GREEN!! AAAIIEEE!!!). We now incorrectly think the protein is much longer. Not only did we screw up this gene, but we might have also read into the next gene (ALL THE LIGHTS ARE GREEN!! AAAIIEEE!!!), thereby screwing that one up too. As I mentioned, in my experience, indel-prone technologies yield many more indels in the final genome sequence. Many more.
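Here’s a small Python sketch of that frameshift, using a codon table trimmed to just what the toy gene needs (a real translator would use all 64 codons):

# Codon table trimmed to the toy example; '*' marks a stop codon.
CODONS = {"ATG": "M", "CCC": "P", "ATA": "I", "CAT": "H",
          "TGA": "*", "TAA": "*", "TAG": "*"}

def translate(seq):
    """Translate from the first base in frame; report whether a stop codon was found."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODONS.get(seq[i:i + 3], "X")  # 'X' = codon not in our trimmed table
        if aa == "*":
            return "".join(protein), True   # clean stop: the gene ends here
        protein.append(aa)
    return "".join(protein), False          # ran off the end without finding a stop codon

print(translate("ATGCCCATATGA"))   # ('MPI', True): the correct three-amino-acid protein
print(translate("ATGCCCCATATGA"))  # ('MPHM', False): one inserted C shifts the frame and loses the stop

In real data the gene caller doesn’t stop at the end of the toy sequence; it keeps reading into whatever comes next, which is the runaway-train behavior described above.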

Now, a human can catch this and note it, but humans are error-prone and slow. Slow means expensive and puts the kibosh on the whole hundreds-per-week idea. Unless the error is confirmed, though, we have to treat it as ‘real’ for any kind of analysis (otherwise we’re cheating). We can manually confirm these errors, but that’s even slower and more expensive.

What does it profit a Mad Biologist to gain cheap sequence only to pay tons of money on the back end? (I think that’s in the Bible somewhere.)

Now, Oxford Nanopore claims that, by the time the sequencers ship, they’ll have fixed this. If they do, this technology is a breakthrough. If not, it still has utility, in that we can use it to stitch contigs together, but then it’s a companion technology, not the droids breakthrough we’ve been looking for. When research groups put Nanopore through its paces, we’ll see if it lives up to the promise (or hype, if you prefer).

Hopefully, it will.

*Since I don’t work on human genomes, which are roughly 1,000 times larger than bacterial ones, the actual sequencing costs are trivial. Library construction costs are the same, however, regardless of the size of the genome. I occasionally joke that I’m not in the sequencing business but the library construction business, since that’s the expensive part.


7 Responses to So What Could Nanopore Mean for Bacterial Genomics? (And the Pelham 123 Problem)

  1. pm says:

    Even if it has a high indel error rate, you can still use the Nanopore sequence to generate a reference, and then use Illumina sequencing to clean up the indels.

    • Sure, but now I’m back in two-library territory. And either I have to pay for MiSeq-cost sequencing, or run it on a HiSeq (which isn’t fast).

      • pm says:

        From what I understand, it’s still only one library (the nanopore won’t need a library). Now you can get rid of your jump libraries, and the amount of sequencing that you need from the MiSeq or HiSeq to clean up the indels will be 4- or 5-fold lower than what you would need to get decent-sized contigs.

        And if the claimed read length for the nanopore is real, we are talking contigs in the 100s of kb range.

        I agree it would be ideal if the indel rate wasn’t so high, but I still think it’s a big f’ing deal if the nanopore sequencer performs as well as they are claiming now (big if, of course). It would mean the ability to create pretty good de novo reference genomes for the species of your choice to align Illumina reads to.

      • pm,

        I should have been clearer: you won’t need a library, but there will have to be some preparation that will be different from Illumina fragments. Also, I think we should be careful not to overestimate what Illumina can do in terms of error correction. In my experience, error-corrected 454 isn’t as good as ‘pure’ Illumina, for instance (although it’s a lot better than 454 alone).

  2. Ellen Clark says:

    Thanks for this post. As a person who has a BS in Biology from a million years ago, I like how you explain the whole Nanopore news in a way anyone can understand. I have added you to a running list on this story I have on my blog: http://clarksearch.com/blog/oxford-nanopores-exciting-sequencing-news/

  3. Thanks for the post! This issue with homopolymer errors is exactly what I’ve been pointing out to people as well.

    It’s also interesting to note that since Nanopore sequences bases three at a time, those errors should only crop up when you have three or more of the same base in a row. The chance of getting three identical bases in a row is around 1 in 16 (a 1 in 4 chance that the second base is equal to the first, and another 1 in 4 chance that the third one is the same as the first two). 1/16 is 6.25%. So to get a 4% error rate, the MAJORITY of the homopolymer sequences would have to be read wrong!

    The other pernicious issue with homopolymer sequencing errors is that they’re a *systematic* error. Which means that they don’t simply go away as you pile on more reads, as we’ve found out with 454. If your errors are uniformly distributed across the entire read (like with Illumina or PacBio), then you can expect the next read to have sequencing errors in entirely different places. With 454 (and Ion Torrent, and Nanopore), those errors just keep piling up in the same place.

