Who woulda thunk it? A recent paper in PLoS One argues that the NIH review process uses far too few reviewers to claim the level of scoring precision that the NIH provides.
NIH grants are scored on a scale from 1.0 to 5.0, with 1.0 being the best; reviewers can grade in tenths of a point (e.g., 1.1, 2.3, etc.). The authors, using some very straightforward statistics, demonstrate that four reviewers could accurately assign whole-integer scores (1, 2, 3…), but that to obtain reliable scores with a precision of 0.01, a proposal would require 38,416 reviewers.
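For the curious, that 38,416 falls straight out of the standard sample-size formula, n = (z·σ/E)², where E is the desired precision. Here's a minimal sketch; the 95% confidence level (z = 1.96) and a score standard deviation of 1.0 are my assumptions for illustration, though they do happen to reproduce the paper's numbers:

```python
# Minimal sketch of the sample-size arithmetic. Parameter choices are mine:
# 95% confidence (z = 1.96) and a score standard deviation of 1.0.
import math

def reviewers_needed(precision, sigma=1.0, z=1.96):
    """Reviewers needed to pin a mean score down to +/- `precision`."""
    return math.ceil((z * sigma / precision) ** 2)

for e in (1.0, 0.1, 0.01):
    print(f"precision {e}: {reviewers_needed(e):,} reviewers")
# precision 1.0: 4 reviewers
# precision 0.1: 385 reviewers
# precision 0.01: 38,416 reviewers
```

Note how the required panel grows with the square of the precision: each extra decimal place costs a hundredfold more reviewers.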
Not going to happen. Keep in mind that NIH is considering moving to scores with a supposed precision of 0.001. The authors note:
The disconnect between the needed precision in order to allocate funds in a fair way and the number of reviewers required for this level of precision demonstrates a major inconsistency that underlies NIH peer review. With only four reviewers used for the evaluation of applications, an allocation system that requires a precision level in the range of 0.01 to 0.1 is not statistically meaningful and consequently not reliable. Moreover, the 4 reviewers NIH proposes are not independent which degrades the precision that could be obtained otherwise.
Consequently, NIH faces a major challenge. On the one hand, a fine-grained evaluation is mandated by their review process. On the other hand, for such criterion to be consistent and meaningful, an unrealistically high number of evaluators, independent of each other, need to be involved for each and every proposal.
They also argue that the inappropriately small number of reviewers is stifling novel proposals:
…4 independent evaluators can provide statistical legitimacy only under the circumstance of all evaluators giving essentially the same evaluation. For proposals that are expected to be more controversial, as potentially transformative ideas have been proposed to be, a small number of evaluators would lead to unreliable mean estimates.
In the conclusion, there’s some pretty good snark (boldface mine):
It is commonly accepted that NIH will not fund clinical trials that do not include a cogent sample size determination. It is ironic that NIH insists on this analysis for clinical studies but has not recognized its value in evaluating its own system of peer review. We posit that this analysis should be considered in the revisions of NIH scientific review.
The NIH peer review structure has not been based in rigorous applications of statistical principles involving sampling. It is this deficiency that explains the statistical weakness and inconsistency of NIH peer review.
My only quibble with this article is that the scores that actually stand a chance of being funded typically range from 1.0 to 1.4 (although, like high school grades, there has been significant ‘grade inflation’), and I’m not sure what that narrower range does to some of the estimates. Granted, needing ‘only’ hundreds of reviewers isn’t comforting either.
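To put a rough number on that quibble: the required panel scales with the square of the score spread, so if the real contenders all cluster in a narrow band, the numbers shrink considerably. A back-of-the-envelope sketch, guessing a standard deviation of 0.1 among the contenders (my guess, not a figure from the paper):

```python
import math

# If the real contenders all score between 1.0 and 1.4, the spread among
# them is much smaller than a full-scale sigma of 1.0. Guessing sigma = 0.1
# (my assumption, not a figure from the paper):
z, sigma = 1.96, 0.1
for e in (0.1, 0.01):
    n = math.ceil((z * sigma / e) ** 2)
    print(f"precision {e}: {n:,} reviewers")
# precision 0.1: 4 reviewers
# precision 0.01: 385 reviewers
```

That's where the ‘hundreds’ ballpark comes from: far better than 38,416, but still wildly beyond a four-person review panel.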
One proposed solution is to radically shorten proposals to one or a few pages, so that the number of reviewers per proposal can increase. Before you dismiss this as crazy, keep in mind that the ‘meat’ of genomics white papers is typically only a few pages long (the rest is usually a discussion of how the sequencing will be done, which presumably the major sequencing centers have figured out by now).
I’ve always thought that proposals are like students applying to a ‘highly selective’ college: you kick out the bottom two-thirds, there is a small number of really qualified students that you obviously want, and the rest are pretty interchangeable (not that you want to tell the customers, er, students that…). My solution would be to keep the current process, triage the bottom sixty percent, and then randomly pick from the remainder, with the exception of any proposal that was scored in the top ten percent by all of the reviewers assigned to that grant.
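For concreteness, here's a toy sketch of that selection scheme. Everything here is my own framing: the field names are made up, and I'm reading the exception as ‘any proposal that every assigned reviewer placed in their top ten percent skips the lottery entirely’:

```python
import random

def select(proposals, n_fund, seed=None):
    """proposals: dicts with 'id', 'mean_score' (lower is better), and
    'unanimous_top_decile' (every reviewer scored it in their top 10%)."""
    rng = random.Random(seed)
    # Triage: drop the bottom sixty percent by mean score.
    ranked = sorted(proposals, key=lambda p: p["mean_score"])
    survivors = ranked[: max(1, len(ranked) * 40 // 100)]
    # The exception: unanimous top-decile proposals skip the lottery.
    funded = [p for p in survivors if p["unanimous_top_decile"]][:n_fund]
    # Lottery: fill the remaining slots at random from the other survivors.
    pool = [p for p in survivors if p not in funded]
    funded += rng.sample(pool, min(n_fund - len(funded), len(pool)))
    return funded
```

The appeal of the lottery step is that it stops pretending a 1.2 is meaningfully different from a 1.3, which is exactly the paper's point.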
Although I do like the idea of one-page proposals…
Cited articles:
Kaplan D, Lacetera N, Kaplan C (2008) Sample Size and Precision in NIH Peer Review. PLoS ONE 3(7): e2761. doi:10.1371/journal.pone.0002761
Kaplan D (2007) POINT: Statistical analysis in NIH peer review–identifying innovation. The FASEB Journal 21: 305-308.