Indels in Phylogenies: How Should We Treat Them?

This will be something of a technical post, but I’ve decided to pick the hivemind’s brain. In some projects I’m involved with, we’re generating lists of SNPs for a bunch of bacterial strains using Illumina.

For a given strain, anywhere from ten to forty percent of the SNPs are indels (actually ‘SNP’ is a bit of a misnomer because we can detect small multi-nucleotide insertions and deletions). Here’s the question: is there any way to use maximum likelihood methods with indels? I could just use parsimony methods, and treat the indels as characters, but I don’t want to lose information about the molecular evolutionary model. For the substitutions, it’s pretty clear that they violate the de facto assumptions of parsimony (equal rates and transition frequencies across all sites), so I would like to use a method that incorporates a molecular model.

I would note that as it gets really cheap to scan entire microbial genomes for SNPs, this will be a problem we have to grapple with, so strap on yer thinkin’ caps and come up with a solution!

Seriously, any ideas?

Update: While I was on vacation, I stumbled across this paper that uses DNAML to deal with gaps. While it’s definitely an improvement, there are two problems: 1) it treats a gap that is longer than one character (e.g., “---”) as multiple characters, with each gapped site counted as its own character; 2) DNAML isn’t very fast computationally (although maybe this modified version of DNAML could be implemented in fastDNAML?).
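
To make the first problem concrete, here is a rough Python sketch of one way a multi-site gap might be collapsed into a single presence/absence character instead of being counted once per gapped site. This is an illustration under stated assumptions, not the method from the paper and not anything DNAML does: the function names (`gap_runs`, `code_indels`) and the toy strains are made up, and the coding is deliberately simplified (two gaps count as the same indel only if they share exactly the same start and end columns).

```python
from typing import Dict, List, Tuple

GAP = "-"


def gap_runs(seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column intervals of contiguous gap runs."""
    runs = []
    start = None
    for i, ch in enumerate(seq):
        if ch == GAP:
            if start is None:
                start = i  # a new gap run opens here
        elif start is not None:
            runs.append((start, i))  # the run closed at the previous column
            start = None
    if start is not None:
        runs.append((start, len(seq)))  # run extends to the end of the sequence
    return runs


def code_indels(alignment: Dict[str, str]):
    """Code each distinct gap run as one binary character (1 = gap present).

    Assumption: gaps with identical boundaries are the same indel event;
    overlapping gaps with different boundaries are separate characters.
    """
    if len({len(s) for s in alignment.values()}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    per_taxon = {name: set(gap_runs(seq)) for name, seq in alignment.items()}
    indels = sorted(set().union(*per_taxon.values()))
    matrix = {
        name: [1 if interval in runs else 0 for interval in indels]
        for name, runs in per_taxon.items()
    }
    return indels, matrix


# Hypothetical toy alignment: a 3-bp deletion shared by B and C,
# plus a 1-bp deletion at the end of C.
alignment = {
    "strainA": "ACGTACGT",
    "strainB": "AC---CGT",
    "strainC": "AC---CG-",
}
indels, matrix = code_indels(alignment)
gapped_columns = sum(
    any(seq[i] == GAP for seq in alignment.values())
    for i in range(len(alignment["strainA"]))
)
print(indels)          # one entry per indel event
print(matrix)          # binary presence/absence per strain
print(gapped_columns)  # what per-site gap coding would count instead
```

In this toy case the per-site approach sees four gapped columns, while the run-based coding sees two indel characters. The resulting binary matrix could then sit alongside the nucleotide data (with the gaps themselves treated as missing) as its own partition, though how best to model those binary characters is exactly the open question.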


9 Responses to Indels in Phylogenies: How Should We Treat Them?

  1. David says:

    This is a problem I’ve encountered many times, and I still haven’t come up with a good idea. There’s this paper; I haven’t read it yet, but it looks like it may be of use to you.
    I’ve used indels as characters in Bayesian analyses before. You have to partition the data but it works and I think it made the analysis much stronger.

  2. Super Jesus says:

    I think Sarah Palin will totally know this one.

  3. Larry Moran says:

    This doesn’t answer your question but I doubt very much that the quality of your data justifies using maximum likelihood or any other sophisticated algorithm.
    Many of your SNPs, including indels, are likely to be cloning and/or sequencing artifacts of one sort or another. You might as well stick with difference methods which are not only faster but have the advantage of allowing for various gap penalties to deal with indels. They also tend to swamp out errors.

  4. Aranae says:

    I think David’s right. Your best bet, based on what’s out there, is to let the gaps be treated as missing, create a binary character state matrix coding for presence/absence of individual gaps, and run a partitioned analysis in MrBayes or wherever.

  5. dj says:

    parsimony doesn’t have to assume equal frequencies across all sites. you can use weighted parsimony, for example, assigning diffs at 3rd positions less weight than those at 2nd positions, transitions less weight than transversions, etc. this is not new. check out any of many papers from the 90s in MPE or MBE, or Evolution, etc., when parsimony was getting more attention before maximum likelihood hit the pavement. for gaps, the extension of an existing gap would get less weight than the opening of a new gap, etc. there are workable solutions for the use of parsimony.
    and, to state the obvious, also make sure you are stone confident in your sequence data, and that you have the best alignment possible. visually examine any computed alignment, and tweak by hand as necessary to be sure that any gaps are real and make sense.

  6. Phil Stafford says:

    Why not select those SNPs that are conserved (use UCSC’s conservation db) and assign a weight to them in your analysis? Then a non-parametric method could be used for comparing ranks.

  7. Atle says:

    I have never read this blog before, so my answer might be quite insipid and out of place… But I’m sure I read quite a lot about different methods for solving this problem in the book “Bioinformatics and Molecular Evolution” (Higgs/Attwood) some years ago. It is a recurring problem when calculating evolutionary distance between related sequences. I think it would at least be worth having a quick browse in the book at the library, or maybe through Amazon’s online reader.
