To become competent physicians, undergraduate medical students must be assessed not only on factual knowledge but also on communication and clinical skills. The reliability of the clinical assessments used to test these skills, however, is known to be compromised by high levels of variability, i.e. different results on repeated testing1,2.
Candidate variability, case variability (case specificity) and examiner variability all contribute to the overall variability of a clinical assessment. Candidate variability reflects the differences between candidates; in the absence of other sources of variability (or error), it represents the true variability. Case specificity refers to the phenomenon whereby a candidate's performance can vary from one case to the next because cases differ in difficulty or content2,3. Examiner variability refers to the fact that two examiners observing the same performance may award different scores. Many studies have shown that examiner variability is the most significant factor contributing to variability in clinical examinations4,5 and that it may even exceed the variability accounted for by differences between candidates6. The generally accepted minimum level of inter-examiner agreement is 0.6, with 0.8 regarded as the gold standard (where 0 indicates no relationship between two examiners' scores and 1 indicates perfect agreement)7.
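To make these components concrete, a minimal sketch in the language of generalizability theory (the notation and independence assumption are ours, not drawn from the cited studies) decomposes the total observed score variance and expresses reliability as the proportion attributable to true differences between candidates:

\[
\sigma^{2}_{\text{total}} = \sigma^{2}_{\text{candidate}} + \sigma^{2}_{\text{case}} + \sigma^{2}_{\text{examiner}} + \sigma^{2}_{\text{residual}},
\qquad
\rho = \frac{\sigma^{2}_{\text{candidate}}}{\sigma^{2}_{\text{total}}}.
\]

On one common reading (an intraclass-correlation-style coefficient), the 0.6 and 0.8 thresholds would correspond to assessments in which 60% and 80% of the observed score variance reflects true differences between candidates, with the remainder contributed by cases, examiners and residual error.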
Variability in how examiners score candidates may be consistent: for example, an examiner who always marks candidates stringently (often referred to as a hawk) or one who is consistently lenient (a dove)3. This kind of consistent examiner behaviour can often be adjusted for when analysing results. However, examiner variability is not always so consistent and predictable.
Examiners in clinical assessments are subject to many forms of bias8. The 'halo effect' refers to the phenomenon whereby an examiner's overall first impression of a candidate ("he seems like he knows his stuff") leads to a failure to discriminate between discrete aspects of performance when awarding scores9. In addition, familiarity with candidates, the mood of the examiner, personality factors, and seeing information in advance have all been found to affect examiners' judgements10,11,12.
Variability may result in a borderline candidate achieving a score in the pass range in one assessment while failing a comparable assessment testing the same or similar competencies. In high-stakes examinations, such as medical licensing examinations, this can have serious implications for the candidate, the medical profession, and even society in general. Moreover, pass/fail decisions are now increasingly being challenged13.
Efforts to reduce variability in clinical assessments have ranged from utilising higher numbers of stations in Objective Structured Clinical Examinations (OSCEs) to employing objective checklists2,14. Many of these approaches have not been found to make any meaningful improvement to reliability15. However, increasing the number of observations in an assessment (by involving more examiners in the observation of many performances) has been shown to improve reliability16. In their evaluation of the mini-clinical evaluation exercise used in US medical licensing examinations, Margolis and colleagues stated that having a small number of raters rate an examinee multiple times was not as effective as having a larger number of raters rate the examinee on a smaller number of occasions; more raters enhanced score stability6. Consequently, an approach frequently adopted to improve reliability and limit the impact of inter-examiner variability is to pair examiners and ask them to agree a score for a candidate's performance. Little is known, however, about what occurs when these paired examiners interact to generate a score.
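The benefit of additional independent observations can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result offered here only as a sketch (the studies cited above do not necessarily rely on it). If a single observation has reliability \(\rho_1\), the reliability of the mean of \(k\) independent observations is

\[
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}.
\]

For example, if \(\rho_1 = 0.4\), averaging \(k = 6\) independent observations gives \(\rho_6 = 2.4/3.0 = 0.8\). The gain depends on the observations being independent, which is one reason that spreading observations across a larger number of raters can outperform repeated ratings by the same small pool of examiners.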
Summary of existing literature
Although the hawk-dove effect was described by Osler17 as far back as 1913, its impact on the reliability of clinical examinations was only explored in more recent years. In 1974, Fleming et al. described a major revision of the Membership of the Royal College of Physicians (MRCP) UK clinical examination and identified one examiner as a hawk18. The pass rate among candidates examined by this examiner was significantly lower than among the remainder (46.3% vs 66.0% respectively).
In 2006, an analysis of the reliability of the then-current MRCP UK clinical examination, the Practical Assessment of Clinical Examination Skills (PACES), found that 12% of the variability in this examination was due to the hawk-dove effect19. Examiners were more variable than stations.
In 2008, Harasym et al.20 found an even greater effect of the hawk-dove phenomenon in an OSCE evaluating communication skills: forty-four percent of the variability in scores was due to differences in examiner stringency/leniency, over four times the variance due to student ability (10.3%).
As mentioned above, many types of rater bias are known to be at play when human judgement forms part of an assessment process (the halo effect, the mood of the rater, familiarity with candidates, personality factors, etc.)8,9,10,11. In 2013, Yeates and colleagues proposed three themes to explain how examiner variability arises21. They termed these differential salience (what was important to one examiner differed from what was important to another), criterion uncertainty (assessors' conceptions of what equated to competence differed and were uncertain), and information integration (assessors tended to judge in their own unique descriptive language, forming global impressions rather than discrete numeric scores).
Govaerts suggests that some examiner variability may simply arise from individual examiners' peculiarities of approach and the idiosyncratic judgements made as a result of the interaction between social and cognitive factors12.
Earlier reports had suggested that employing objective checklists would help overcome examiner variability by regulating subjectivity2. More recently, however, several lines of evidence suggest that global judgements produce more reliable results than highly structured tools4,14. Furthermore, measurement instruments have been shown to account for less than 8% of the variance in performance ratings22.
Other proposals to improve reliability have involved increasing the number of items used per station. However, Wilkinson et al. analysed examiners' marks over a four-year period in New Zealand and found that while the number of items per station increased over that period, there was no correlation between items per station and station inter-rater reliability4.
The impact of examiner training has also been examined in many studies23. Cooke et al.24 found no significant effect, and while Holmboe et al.25 showed that training produced an increase in examiner stringency, the increase was inconsistent.
In a recent literature review on rater cognition in competency-based education, Gauthier et al.15 summarised the situation: "attempts to address this variability problem by improving rating forms and systems, or by training raters, have not produced meaningful improvements".