From the archives, here’s something about how we might be underestimating the strength of natural selection when we look at molecular data:
PZ Myers has a superb summary of a very interesting PLoS paper. In the paper, the authors identify those genes that have experienced strong selection, and thus might be responsible for the chimpanzee-human divergence. Here’s how he explains the method:
With all the data available from the human genome project and the ongoing chimpanzee genome project, we can start comparing DNA sequences. One parameter that can be assayed is the frequency of synonymous changes in the DNA: these are changes in the nucleotide sequence that produce synonyms in the triplet code, and therefore cause no changes at all in the protein sequence. These changes represent a kind of steady background noise, the rate of random, neutral changes in the genome. Non-synonymous changes, on the other hand, do change the amino acid sequence of the resulting protein, and are presumed to be more likely to have some kind of effect on the phenotype. The ratio of nonsynonymous to synonymous nucleotide changes within a gene, dN/dS, is a measure of the history of selection for change in that gene. High dN/dS values mean there has been selection pressure for novel forms, while low dN/dS values mean selection has been working to conserve the sequence.
So here’s the analysis: go through the list of human genes, find each one’s homolog in the chimpanzee, compute the dN/dS ratio, and rank them in order. What you end up with is a list, with the genes that have experienced the strongest selection for new properties between the two species at the top. Note that you can’t tell which of the two species has changed the most from their common ancestor from this analysis (although comparison with an outgroup can help with that), so all we know is which genes have diverged the most.
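To make the ranking step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the gene names and counts are made up, and the `dn_ds` helper uses raw per-site proportions with no correction for multiple substitutions at the same site, which real analyses would apply. It is not the paper’s pipeline, just the arithmetic of "compute the ratio, then sort."

```python
def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Crude dN/dS: nonsynonymous changes per nonsynonymous site,
    divided by synonymous changes per synonymous site.
    Returns None when there are no synonymous changes (ratio undefined)."""
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    if ds == 0:
        return None
    return dn / ds


# Hypothetical human-chimp comparisons (invented numbers):
# (nonsynonymous changes, nonsynonymous sites, synonymous changes, synonymous sites)
genes = {
    "geneA": (12, 700, 4, 200),
    "geneB": (2, 650, 9, 250),
    "geneC": (15, 600, 3, 300),
}

# Compute the ratio for each gene, dropping any where it is undefined.
ratios = {name: dn_ds(*counts) for name, counts in genes.items()}
ratios = {name: r for name, r in ratios.items() if r is not None}

# Rank from highest dN/dS (most divergent) to lowest (most conserved).
ranking = sorted(ratios, key=ratios.get, reverse=True)

for name in ranking:
    print(f"{name}: dN/dS = {ratios[name]:.3f}")
```

Note that each gene gets exactly one number here, no matter how its changes are distributed along the sequence.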
Here’s my problem with the article: this method will miss many, many genes. In other words, many ‘important’ genes will be missed. Now, this isn’t the authors’ fault: to paraphrase Rumsfeld, sometimes you have to analyze the genomes you have, not the genomes you wish you had. Note the plural genomes. But I’m getting ahead of myself.
Imagine a gene 300 amino acids long (that’s 900 base pairs of DNA; each three-base codon codes for one amino acid). In many genes, most of the possible non-synonymous substitutions will be deleterious (dN/dS at those codons will be very close to zero), some will be neutral (dN/dS = 1), and a few will be beneficial (dN/dS > 1). If you average across the gene, the dN/dS ratio will be much lower than 1. However, this doesn’t mean that the gene isn’t evolutionarily important: the few beneficial non-synonymous substitutions could be doing evolutionary backflips (dN/dS >> 1), and a gene-wide summary statistic still won’t detect selection at this gene, because it averages dN/dS across all sites.
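Here is the averaging problem as back-of-the-envelope arithmetic in Python. The proportions and per-class dN/dS values are invented for illustration, and treating the gene-wide ratio as a proportion-weighted mean of site classes is a simplifying assumption, not how any particular program estimates it.

```python
def mean_omega(site_classes):
    """Proportion-weighted mean of per-class dN/dS values.

    site_classes is a list of (proportion_of_codons, dN/dS) pairs;
    the proportions must sum to 1.
    """
    total = sum(p for p, _ in site_classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(p * w for p, w in site_classes)


# Hypothetical 300-codon gene:
#   90% of codons strongly constrained (dN/dS = 0.05)
#    8% of codons neutral              (dN/dS = 1.0)
#    2% of codons positively selected  (dN/dS = 5.0)
classes = [(0.90, 0.05), (0.08, 1.0), (0.02, 5.0)]

gene_wide = mean_omega(classes)
strongest_class = max(w for _, w in classes)

print(f"gene-wide dN/dS ~ {gene_wide:.3f}")
print(f"highest per-class dN/dS = {strongest_class}")
```

With these made-up numbers the gene-wide value comes out around 0.225, well below 1, even though 2% of the codons (six of them) are under strong positive selection. A ranking based on the gene-wide number would put this gene nowhere near the top.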
I’m not arguing a hypothetical case here. I’m currently in the process of submitting a manuscript about a gene in E. coli involved in the ecological divergence between ‘harmless’ E. coli and those involved in urinary tract infections. In this gene, about 2% of the amino acids appear to have a dN/dS ratio > 1.0, and in almost all of the other amino acids, amino acid substitutions are deleterious (dN/dS ~ 0.1). This gene has a gene-wide dN/dS ratio ~ 0.07, yet we know from functional and experimental studies that this gene is vital in the ecological divergence between the harmless and pathogenic forms. The ‘PLoS’ ranking system would most likely miss this gene.
Now, if your eyes haven’t completely glazed over at this point, you’re wondering, “How the hell does he know what’s happening at each codon?” Simple. I’m the Mad Biologist. Never, ever doubt the Mad Biologist.
Seriously, there is a method known as the codon substitution method (for the technical details and paper, click here). Essentially, this method allows you to examine the dN/dS ratio for each amino acid, as opposed to the whole gene. I won’t get into the technical details here, but what this method would require for the chimp-human analysis is lots of human and chimp genomes (at least ten of each, although two of each is the bare minimum and not very reliable). This is why I said earlier that you analyze the genomes you have, not the genomes you wish you had.
The punchline is that while this is a very interesting paper, I think we might be missing a lot of evolutionarily important genes simply because many, though not all, non-synonymous changes in these ‘missed’ genes are removed by natural selection. Instead, the PLoS method will be biased towards genes whose amino acid sequence can tolerate a lot of change without a degradation of function. What this means is that there might be even more genes that are responsible for the chimp-human divide. That’s pretty cool.
Note to creationists: If I catch a single one of you using this post to somehow try to ‘undermine’ the theory of natural selection, I’m going to flame your lame ass. The whole damn point of this post is that we might be underestimating the power of natural selection. In science, as opposed to crackpot theology, we use deduction and induction. Sometimes, in the face of incomplete evidence, we disagree over the particulars.