This will be something of a technical post, but I’ve decided to pick the hivemind’s brain. In some projects I’m involved with, we’re generating lists of SNPs for a bunch of bacterial strains using Illumina sequencing.
For a given strain, anywhere from ten to forty percent of the SNPs are indels (actually ‘SNP’ is a bit of a misnomer because we can detect small multi-nucleotide insertions and deletions). Here’s the question: is there any way to use maximum likelihood methods with indels? I could just use parsimony methods, and treat the indels as characters, but I don’t want to lose information about the molecular evolutionary model. For the substitutions, it’s pretty clear that they violate the de facto assumptions of parsimony (equal rates and transition frequencies across all sites), so I would like to use a method that incorporates a molecular model.
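To make "treat the indels as characters" concrete, here is a minimal sketch in Python. It assumes the strains are already aligned with `-` marking gaps, and it codes each distinct gap run as a single present/absent character. The function names (`gap_runs`, `indel_characters`) and the toy alignment are made up for illustration; they don't come from any particular package, and this is only one way the coding could be done.

```python
import re

def gap_runs(seq):
    """Return the (start, end) span of each maximal run of '-' in an aligned sequence."""
    return [m.span() for m in re.finditer(r"-+", seq)]

def indel_characters(alignment):
    """Code each distinct gap run as one binary character per strain.

    alignment: dict of strain name -> aligned sequence (all the same length).
    Returns (events, matrix): the sorted list of distinct (start, end) gap
    spans, and a dict of strain -> list of 0/1 (1 = strain carries that gap).
    """
    events = sorted({span for seq in alignment.values() for span in gap_runs(seq)})
    matrix = {
        name: [1 if span in gap_runs(seq) else 0 for span in events]
        for name, seq in alignment.items()
    }
    return events, matrix

# Hypothetical toy alignment: strainB and strainC share a 3-bp deletion,
# and strainC carries an additional 1-bp deletion.
alignment = {
    "strainA": "ACGTACGTAC",
    "strainB": "ACG---GTAC",
    "strainC": "ACG---GT-C",
}
events, matrix = indel_characters(alignment)
for name, row in matrix.items():
    print(name, row)
```

A matrix like this can go straight into a parsimony program as extra characters, but it is exactly the step that throws away the substitution model, which is the problem I'm asking about.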
I would note that as it gets really cheap to scan entire microbial genomes for SNPs, this will be a problem we have to grapple with, so strap on yer thinkin’ caps and come up with a solution!
Seriously, any ideas?
Update: While I was on vacation, I stumbled across this paper that uses DNAML to deal with gaps. While it’s definitely an improvement, there are two problems: 1) it treats a gap that is longer than one character (e.g., “---”) as multiple characters, since each gapped site is counted as its own character; 2) DNAML isn’t very fast computationally (although maybe this modified version of DNAML could be implemented in fastDNAML?).
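To illustrate the first problem, here is a small Python sketch that counts gaps both ways for one aligned sequence: once per gapped site and once per contiguous gap run. It assumes `-` is the gap symbol, and `gap_counts` is a hypothetical helper of my own, not anything from DNAML.

```python
from itertools import groupby

def gap_counts(seq, gap="-"):
    """Return (gapped_sites, gap_events) for one aligned sequence.

    gapped_sites counts every gap column separately (the per-site view);
    gap_events counts each contiguous run of gaps once (the per-event view).
    """
    sites = seq.count(gap)
    events = sum(
        1 for is_gap, _ in groupby(seq, key=lambda c: c == gap) if is_gap
    )
    return sites, events

sites, events = gap_counts("ACG---GTAC")
print(sites, events)  # 3 gapped sites, but only 1 deletion event
```

If a 3-bp deletion really is one mutational event, the per-site view counts it three times, so long indels get more weight than they probably deserve.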