Background: In many peer review settings, proposals are selected for funding on the basis of some summary statistic – such as the mean, median, or percentile – of review scores. There are numerous challenges to working with scores. These include low inter-rater reliability, epistemological differences, susceptibility to varying levels of leniency or harshness of reviewers, and the presence of ties. A different approach that can mitigate some of these issues is to additionally collect rankings, such as top-k preferences or paired comparisons, and incorporate them into the analysis of review scores. Rankings and paired comparisons are scale-free and can enforce demarcation between proposals by design. However, analyzing scores and rankings simultaneously has not been done until recently due to the lack of tools for principled modeling.
Methods: We first introduce an innovative protocol for collecting rankings among top-quality proposals. This collection of rankings is done as an add-on to the typical peer review procedures focused on scores and does not require reviewers to rank all proposals. We then present statistical methodology for obtaining an integrated score for each proposal and, from the integrated scores, an induced preference ordering that captures both types of peer review input: scores and rankings. Our statistical methodology allows the collected rankings to differ from the score-implied rankings; this feature is essential when the two quality assessments disagree, which, as we find empirically, often happens in peer review. We illustrate how our method quantifies uncertainty in order to better understand reviewer preferences among similarly scored proposals.
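The distinction between a score-implied ordering and a collected top-k ranking can be illustrated with a toy sketch. All data here are hypothetical, and the weighted rank-averaging rule is a naive stand-in used only for illustration; it is not the statistical methodology developed in the paper:

```python
# Toy illustration: score-implied ordering vs. a collected top-k ranking.
# All numbers and the integration rule are hypothetical stand-ins,
# not the paper's statistical methodology.

mean_scores = {"A": 4.1, "B": 4.0, "C": 3.9, "D": 3.2}  # higher = better

# Score-implied ordering (best first)
score_order = sorted(mean_scores, key=mean_scores.get, reverse=True)

# Collected top-3 ranking: reviewers prefer B over A despite the scores
collected_top3 = ["B", "A", "C"]

def rank_positions(order):
    """Map each proposal to its 0-based rank position (lower = better)."""
    return {p: i for i, p in enumerate(order)}

s_pos = rank_positions(score_order)
c_pos = rank_positions(collected_top3)

# Naive "integrated" rank: weighted average of the two rank positions;
# proposals missing from the top-k ranking keep their score rank only.
W = 0.7  # hypothetical weight on the collected ranking
integrated = {
    p: (1 - W) * s_pos[p] + W * c_pos[p] if p in c_pos else float(s_pos[p])
    for p in mean_scores
}
integrated_order = sorted(integrated, key=integrated.get)

print(score_order)       # ['A', 'B', 'C', 'D']
print(integrated_order)  # ['B', 'A', 'C', 'D']
```

The point of the sketch is only that the two inputs can disagree and that an integrated ordering need not coincide with the score-implied one; the paper's method additionally quantifies the uncertainty of such reorderings rather than applying a fixed weight.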
Results: Using artificial “toy” examples and real peer review data, we demonstrate that incorporating top-k rankings into scores allows us to better learn when reviewers can distinguish between proposals. We also examine the robustness of this system to partial rankings, inconsistencies between ratings and rankings, and outliers. Finally, we discuss how, using panel data, this method can provide information about funding priority with a level of accuracy and in a format that is well suited to the types of decisions research funders make.
Conclusions: Gathering both rating and ranking data, and using integrated scores and their induced preference ordering, can have many advantages over methods relying on ratings alone, leveraging more information to distill reviewer opinion more accurately into a useful output for making informed funding decisions.