OK, I’m Jewish, so I don’t do the whole Christmas gift thing too much, but my birthday is right around Christmas, so that will do. So here’s what I want from sequencing technology manufacturers for my birthday: cheap, fast library construction. Especially jumping libraries.
Before I get to why I want this, let me translate for those who aren’t genomics cognoscenti (An aside for the cognoscenti: I’m cutting corners and simplifying some of the explanations–that’s the goal anyway. If you want to pick nits over the details, feel free, but you should know there’s a shampoo for that). Before we stick DNA in a sequencing machine, we prepare the DNA for sequencing, in a process known as library construction–the prepared DNA is called a ‘library.’ There are two types of libraries. One, called a fragment library, chops the DNA into small pieces (and then those pieces are sequenced). Making those libraries is pretty cheap and fast. In fact, the Ion Torrent sequencing technology now has an automated system that does this in about five hours, if memory serves correctly. What I want (cuz it’s my fucking birthday!) are jumping libraries.
Here, what we do (and I’m leaving out a lot of molecular biology) is sequence just the very ends of a long piece of DNA (typically around 5,000 base pairs, the subunits of DNA). These are key since they allow us to jump (hence the name) over repetitive sequence:
We don’t actually sequence a genome–we sequence little pieces of it…. What we do is tile or ‘stack’ reads on each other and build a sequence. Something like this toy example:
Becomes: AGCTCA (although we typically have much more ‘coverage’–many more reads confirming each base).
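To make the tiling idea concrete, here is a minimal sketch in Python. The three reads are made up for illustration (they are not the reads from the original toy example), and the greedy merge is a deliberately naive stand-in for what a real assembler does. The function names `overlap` and `tile` are hypothetical too.

```python
def overlap(a, b):
    """Length of the longest suffix of read a that is also a prefix of read b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def tile(reads):
    """Greedily merge the two reads with the largest overlap until nothing overlaps."""
    seqs = list(reads)
    while len(seqs) > 1:
        best = (0, None, None)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if i is None:
            break  # no overlaps left, so the assembly stays in several pieces
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]
    return seqs


# Hypothetical reads, chosen so that they stack up into the sequence in the text
print(tile(["AGCT", "GCTC", "CTCA"]))
```

With these three made-up reads the pieces collapse into a single sequence. Real data has errors, far more reads, and (as described next) repeats, which is where this naive approach falls over.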
This works fine until you have repetitive content–identical (or nearly so) sequences that occur throughout the genome. Suppose our genome has unique sequences A, B, and C, with a repetitive region X between them. It’s impossible to figure out from tiling reads if we have A-X-B-X-C or A-X-C-X-B. We solve this by using what are called ‘jumps’. These reads are composed of two regions of the genome that are a certain distance apart (lots of molecular biology is done here), enabling us to jump over the repetitive stuff. In the ABC example above, if we have jumps that contain parts of A and B, and parts of B and C, and no A and C jumps, we conclude the sequence is A-X-B-X-C.
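The A-B-C logic can be sketched in a few lines of Python. This assumes an idealized world in which every jump links two neighboring unique regions across exactly one copy of the repeat, and in which we see a jump for every such pair. The names `adjacent_pairs` and `layouts_supported` are made up for this illustration.

```python
def adjacent_pairs(layout, repeat="X"):
    """Pairs of unique regions that sit next to each other once the repeat is skipped."""
    regions = [r for r in layout.split("-") if r != repeat]
    return {frozenset(pair) for pair in zip(regions, regions[1:])}


def layouts_supported(candidates, jumps):
    """Keep the candidate layouts whose neighboring regions match the observed jumps exactly."""
    observed = {frozenset(j) for j in jumps}
    return [c for c in candidates if adjacent_pairs(c) == observed]


candidates = ["A-X-B-X-C", "A-X-C-X-B"]
jumps = [("A", "B"), ("B", "C")]  # jumps we observed; note there is no A-C jump
print(layouts_supported(candidates, jumps))
```

Only A-X-B-X-C survives, because A-X-C-X-B would require an A-C jump that we never saw.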
There’s a catch though: the jumps have to be larger than your repetitive regions, otherwise you’re right back where you started.
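Here is a toy Python sketch of that catch. The region lengths are invented and wildly scaled down (real jumps are thousands of base pairs), each jump is treated as two single positions a fixed distance apart, and `linked_regions` is a hypothetical helper, not anything from a real assembler.

```python
def linked_regions(layout, insert_size, repeat="X"):
    """Return the pairs of unique regions linked by a jump whose ends are insert_size apart.

    layout is a list of (region name, length) tuples in genome order.
    """
    # Label every position in the toy genome with the region it falls in
    labels = [name for name, length in layout for _ in range(length)]
    links = set()
    for start in range(len(labels) - insert_size):
        left, right = labels[start], labels[start + insert_size]
        # A jump is only informative if both ends land in different unique regions
        if left != right and repeat not in (left, right):
            links.add((left, right))
    return links


# Made-up genome: unique regions of 10 positions, repeats of 5 positions
layout = [("A", 10), ("X", 5), ("B", 10), ("X", 5), ("C", 10)]

print(linked_regions(layout, 8))  # jump longer than the repeat
print(linked_regions(layout, 3))  # jump shorter than the repeat
```

With the longer jump we recover the A-B and B-C links. With the shorter one, at least one end always lands in the same region or inside the repeat, so we learn nothing about the order.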
These are critical for assembling bacterial genomes into a small number of pieces, or if we’re lucky, one piece (note: a few, large pieces is what we’re shooting for). Many important bacterial traits, such as antibiotic resistance and the ability to cause disease, are found on plasmids (mini-chromosomes that can jump from strain to strain). Because nothing in biology is ever easy, these plasmids are riddled with repetitive sequence, which makes large jumps essential if we want to track the spread of antibiotic resistance plasmids through a hospital (or epidemic).
Here’s the thing: the actual cost of sequencing a bacterium is ludicrously cheap. One Illumina HiSeq can sequence ~250 E. coli (a bacterium with a relatively large genome, especially among clinically relevant species) in under two weeks. Regardless of the price (and prices are not costs), this is really cheap. The cost-limiting and rate-limiting step is library construction, in particular jumping libraries.
With all of the understandable focus on sequencing human genomes, there’s a great clinical opportunity to use genomics to track infectious disease–if we can lower the library construction costs.
Seems like a good bidness opportunity. The first technology* to deliver fast, cheap jumping libraries might pull ahead of the pack…
*Don’t get me started on PacBio. Don’t.