Standard setting is both important and problematic in medical education. The Angoff method1 is widely used for setting standards on selected-response items in high-stakes settings, such as the General Medical Council tests for non-UK, non-EU doctors wishing to practise in the UK, and USMLE Step 1, yet its use poses a number of challenges.
Perhaps the most significant of these is the requirement that assessors conceptualise a particular kind of candidate, often described as the ‘minimally competent’ or ‘Borderline’ candidate. In the context of Angoff standard setting, ‘Borderline’ generally represents a ‘Borderline pass’, and it is in this sense that we use it here.
Whichever form of words is used, assessors may have very different ideas of what that class of candidates represents. This is compounded by the fact that subject specialists among the assessors may lack generalist knowledge,2 or may be unaware of the level that candidates at a particular stage could appropriately be expected to have achieved.
As a consequence, a minimum number of assessors may be required, and this in itself poses the practical problem of identifying enough assessors with adequate subject expertise and, indeed, experience in using the Angoff method. One safety-net option is to apply the Hofstee compromise method3 if any ‘Angoffed’ assessment fails a ‘Reality Check’.4
A particular tendency of novice assessors is ‘reversion to the mean’, whereby they award Angoff scores of around 50% rather than using the full scale range. This results in a low correlation between the predicted Angoff value and the observed Facility of the items (where Facility is the percentage of candidates answering an item correctly).
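As a minimal sketch of the point above, the mismatch between novice ratings and observed difficulty can be quantified with an ordinary Pearson correlation. All judges, items, and numbers below are invented for illustration; they are not data from this study.

```python
# Illustrative sketch (invented data): 'reversion to the mean' in novice
# Angoff ratings shows up as predictions clustered near 50% that correlate
# poorly with observed item Facility.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Hypothetical observed Facility per item (% of whole cohort correct).
facility = [92, 78, 64, 55, 41, 30]
# A hypothetical novice judge clustering predictions around 50% ...
novice = [50, 48, 52, 51, 49, 50]
# ... versus a hypothetical judge using the full scale range.
experienced = [90, 75, 60, 50, 45, 35]

print(round(pearson_r(facility, novice), 2))       # near zero
print(round(pearson_r(facility, experienced), 2))  # close to 1
```

The correlation for the compressed ratings is close to zero, whereas ratings spread across the full range track Facility closely, which is one simple way such a panel problem could be flagged.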
Some of the same considerations apply to Ebel standard setting5. Again, the just-passing candidate is difficult to conceptualise, and a panel of experts is required to carry out the necessary classification of items.
An inexpensive alternative is to use either the Cohen method6, which derives the cut score from a multiple of the score of the candidate at the 95th centile, or the similar modified Cohen method7, which relies on the candidate at the 90th centile. These methods are quick to implement and do not require the input of expensive staff time. However, they may be criticised on the basis that they rely on the score of an individual candidate (or, in the case of ties, a small number of candidates). We return to this issue in the Discussion.
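The Cohen-style calculation can be sketched in a few lines. The 0.60 multiplier below is the value commonly quoted for the original Cohen method; both it and the cohort scores are illustrative assumptions, not parameters from this study, and any local implementation would need its own calibration.

```python
# Illustrative sketch (assumed parameters): a Cohen-style cut score takes
# the score of the candidate at a high centile as a proxy for maximum
# attainable performance, then applies a fixed multiplier.
from statistics import quantiles

def cohen_cut_score(scores, centile=95, multiplier=0.60):
    """Cut score = multiplier x score at the given centile."""
    # quantiles(..., n=100) returns the 1st..99th centile cut points.
    reference = quantiles(scores, n=100)[centile - 1]
    return multiplier * reference

# Hypothetical cohort of percentage scores.
cohort = [48, 52, 55, 58, 60, 63, 65, 67, 70, 72, 75, 78, 80, 84, 90]
print(round(cohen_cut_score(cohort), 1))              # original Cohen
print(round(cohen_cut_score(cohort, centile=90), 1))  # 90th-centile variant
```

Because the reference score comes from one region of the distribution, a single unusually strong (or weak) top candidate can move the cut score, which is the criticism noted above.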
However, it is possible that assessments vary more in difficulty than does the ability of the cohort, since medical students are highly selected for academic ability prior to entry. In this case, the difficulty of the assessment may be the key variable, and the cumulative Facility of the items is a guide to this.
Of course, Facility represents the whole cohort performance, rather than the performance of the Borderline candidates. We hypothesised that for good quality One-Best-of-Five MCQs, the relationship between Facility for the whole cohort, and the Facility for Borderline candidates, would be curvilinear in nature, with the difference between them approaching zero as the Facility approaches 100% and 20%. This is because if the entire cohort scores an item correctly, then so will the Borderline candidates, and if the best candidates do no better than guessing, then neither will the Borderline candidates.
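One functional form consistent with the constraints just stated can make the hypothesis concrete. The quadratic gap below is purely an assumption for illustration (including the constant `k`); it is not the paper's fitted model, only a curve whose whole-cohort/Borderline difference vanishes at Facility 100% and at the 20% guessing level for One-Best-of-Five items.

```python
# Hypothetical illustration (assumed functional form, not a fitted model):
# a curvilinear relation in which the gap between whole-cohort Facility F
# and Borderline Facility shrinks to zero at F = 100 (everyone correct)
# and F = 20 (the best-of-five guessing level).
def borderline_facility(F, k=0.004):
    """Assumed Borderline Facility for whole-cohort Facility F (percent)."""
    gap = k * (100 - F) * (F - 20) if 20 <= F <= 100 else 0.0
    return F - gap

for F in (20, 40, 60, 80, 100):
    print(F, round(borderline_facility(F), 1))
```

Under this assumed form the gap is largest for mid-range items and closes at both extremes, which is the qualitative shape the hypothesis predicts; the actual shape and magnitude are what the study sets out to estimate.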
In this study we therefore attempted to explore the effect of classifying different numbers of students as ‘Borderline’ in comparison with the cohort as a whole. Classification was carried out based on performance across the whole range of modules undertaken by the students as described in the Methods.
The exact nature of the relationship between whole cohort and Borderline Facility will depend on the proportion of Borderline candidates in the class, and we discuss ways in which this might be estimated.
Where such a relationship emerges, it would be of value in assisting novice Angoff assessors in estimating the performance of Borderline candidates for an item which had been used before. It could also be used for adjusting any items where the discrepancy between the predicted Angoff value and the observed Facility for that item is greater than seems plausible.
More importantly, the relationship could be used by itself as a standard setting method in conditions in which Angoff or similar methods were not practical: for instance, if too few subject matter experts were available to form an assessor panel, or where the resource costs of using the Angoff method were too high. This would then be a retrospective method based on test takers, rather than a prospective method based on test items.
The purpose of this study is to show proof of concept, and although the analyses were carried out locally, we believe our results would be adaptable and of interest in settings outside our school.