To become competent physicians, undergraduate medical students must be assessed not only on factual knowledge but also on communication and clinical skills. The reliability of clinical assessments of these skills, however, is known to be compromised by high levels of variability, i.e. different results on repeated testing1,2.
Candidate variability, case variability (case specificity) and examiner variability all contribute to the overall variability of a clinical assessment. Candidate variability reflects the difference between candidates and, in the absence of other variables (or error), represents the true variability. Case specificity refers to the phenomenon whereby a candidate’s performance can vary from one case to the next owing to differing levels of difficulty or content2,3. Examiner variability refers to the fact that two examiners observing the same performance may award different scores. Many studies have shown that examiner variability is the most significant factor contributing to variability in clinical examinations4,5 and may even exceed the variability accounted for by differences in candidates6. Examiner variability is generally quantified as the degree of inter-examiner reliability, more commonly termed inter-rater reliability. The minimum level of inter-rater reliability deemed acceptable is 0.6, with 0.8 regarded as the gold standard (where 0 indicates no relationship between two examiners’ scores and 1 indicates perfect agreement)7.
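For continuous marks of this kind, inter-rater reliability is commonly estimated with an intraclass correlation coefficient (ICC). The sketch below is purely illustrative and is not drawn from any of the studies cited here: it computes ICC(2,1) (two-way random effects, absolute agreement, single rater) for a small hypothetical matrix of candidates scored by two examiners, to be read against the 0.6/0.8 thresholds described above.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is an (n candidates x k examiners) matrix, one mark per cell.
    """
    n, k = scores.shape
    grand = scores.mean()
    cand_means = scores.mean(axis=1)   # per-candidate means
    exam_means = scores.mean(axis=0)   # per-examiner means

    ms_cand = k * np.sum((cand_means - grand) ** 2) / (n - 1)
    ms_exam = n * np.sum((exam_means - grand) ** 2) / (k - 1)
    resid = scores - cand_means[:, None] - exam_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return (ms_cand - ms_err) / (
        ms_cand + (k - 1) * ms_err + k * (ms_exam - ms_err) / n
    )

# Hypothetical marks: five candidates each scored by two examiners (0-10 scale).
marks = np.array([[7, 8], [5, 6], [9, 9], [4, 6], [6, 7]], dtype=float)
print(f"ICC(2,1) = {icc_2_1(marks):.2f}")  # ~0.78: acceptable, short of the gold standard
```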
Variability in how examiners score candidates may be consistent: for example, an examiner who always marks candidates stringently (often referred to as a ‘hawk’) or an examiner who is consistently lenient (a ‘dove’)3. This kind of consistent examiner behaviour can often be adjusted for when analysing results, as sketched below. However, examiner behaviour may not always be so consistent and predictable.
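Where stringency is stable in this way, a simple post-hoc correction is possible in principle. The following is a hypothetical illustration rather than a method taken from the cited literature: in a fully crossed design where every examiner marks every candidate, each examiner’s hawk/dove offset is estimated as the deviation of their mean mark from the grand mean and subtracted out.

```python
import numpy as np

# Hypothetical fully crossed design: rows = candidates, columns = examiners.
# Examiner 0 marks ~1 point low (a hawk); examiner 2 marks ~1 point high (a dove).
marks = np.array([[6.0, 7.0, 8.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])

# Each examiner's offset: how far their mean mark sits from the grand mean.
offsets = marks.mean(axis=0) - marks.mean()

# Remove the consistent stringency/leniency component from every mark.
adjusted = marks - offsets

print("offsets: ", offsets)      # [-1.  0.  1.] -> hawk, neutral, dove
print("adjusted:\n", adjusted)   # all columns now share a common mean
```

An inconsistent examiner, by contrast, leaves no stable offset to estimate, which is why unpredictable behaviour is the more troublesome case.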
Examiners in clinical assessments are subject to many forms of bias8. The ‘halo effect’ refers to the phenomenon whereby an examiner’s overall first impression of a candidate (“he seems like he knows his stuff”) leads to a failure to discriminate between discrete aspects of performance when awarding scores9. In addition, familiarity with candidates, the mood of the examiner and seeing information in advance have all been found to affect examiners’ judgements10,11,12. Variability may result in a borderline candidate achieving a score in the pass range in one assessment while failing a comparable assessment testing the same or similar competencies. In high-stakes examinations, such as medical licensing examinations, this can have serious implications for the candidate, the medical profession and even society in general. Moreover, pass/fail decisions are now increasingly being challenged13.
While several strategies to reduce variability in clinical assessments have not been found to make any meaningful improvement to reliability14, increasing the number of observations in an assessment (by involving more examiners in the observation of many performances) has15. In their evaluation of the mini-clinical evaluation exercise used in US medical licensing examinations, Margolis and colleagues stated that having a small number of raters rate an examinee multiple times was not as effective as having a larger number of raters rate the examinee on a smaller number of occasions; more raters enhanced score stability6.
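The classical expression of this observation is the Spearman-Brown prophecy formula, which predicts the reliability of the mean of k independent ratings from the single-rater reliability. The figures in the sketch below are illustrative assumptions, not values from the Margolis study.

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Predicted reliability of the mean of k independent raters' scores."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)

# Assuming a modest single-rater reliability of 0.35:
for k in (1, 2, 4, 8):
    print(f"{k} raters -> {spearman_brown(0.35, k):.2f}")
# 1 -> 0.35, 2 -> 0.52, 4 -> 0.68, 8 -> 0.81
```

The gain assumes each rating is an independent observation, which is consistent with the finding that spreading occasions across more raters outperforms repeated ratings by the same few.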
However, different raters are known to focus on different aspects of performance, and groups are more likely than single raters to make unpopular decisions16. In addition, it was previously assumed that assessments conducted with others present (the overt condition) should lead to more reliable assessments17. Consequently, some institutions (including our own) have adopted the practice of pairing examiners and asking them to come to an agreed score rather than using individual raters. Little is known, however, about what occurs when these paired examiners interact to generate a score.
In the field of occupational psychology, a meta-analysis by Harari et al. examined job performance ratings and found a relationship between the personality factors of the raters and the performance ratings given18. The ‘Big Five’ personality factors19 (neuroticism, extroversion, openness to experience, agreeableness and conscientiousness) accounted for between 6% and 22% of the variance in performance ratings. Furthermore, other research in the areas of personality and human behaviour has shown a relationship between the Big Five personality traits and the responsiveness of individuals to persuasion and influence strategies20,21. Could an examiner’s personality make them more likely to influence, or be influenced, when examining in a pair?
In some of his work, McManus hypothesised that personality may relate to examiner stringency22, and there is evidence from one study of a correlation between personality type and examiner stringency23. While there are anecdotal reports of medical educators expressing concern that employing paired examiners could allow a dominant individual to unduly influence the decision process, this has not been well explored in the literature16, and we found no studies that examined the interaction between examiners in pairs.
Summary of existing literature
Although the hawk-dove effect was described by Osler as far back as 191323, its impact on the reliability of clinical examinations has only been explored in recent years. In 1974, Fleming et al. described a major revision of the Membership of the Royal College of Physicians (MRCP) UK clinical examination and identified one examiner as a hawk24. The pass rate was significantly lower among the candidates this examiner examined than among the remainder (46.3% vs 66.0%, respectively).
In 2006, an analysis of the reliability of the MRCP UK clinical examination in use at that time, the Practical Assessment of Clinical Examination Skills (PACES), found that 12% of the variability in this examination was due to the hawk-dove effect22. Examiners were a greater source of variability than stations.
In 2008, Harasym et al.25 found an even greater effect of the hawk-dove phenomenon in an OSCE evaluating communication skills. Forty-four per cent of the variability in scores was due to differences in examiner stringency/leniency, over four times the variance due to student ability (10.3%).
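Percentages of this kind are typically obtained from a generalizability-style variance decomposition. As a simplified, hypothetical illustration (a fully crossed candidate x examiner design with one mark per cell, far simpler than the designs in the studies above), the components can be estimated from the two-way ANOVA mean squares; the simulated variances below are assumptions chosen to mimic a strong hawk-dove effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cand, n_exam = 40, 8

# Simulate marks = candidate ability + examiner stringency + residual noise.
ability = rng.normal(0, 1.0, n_cand)       # true candidate variance = 1
stringency = rng.normal(0, 2.0, n_exam)    # hawk-dove variance = 4
marks = (ability[:, None] + stringency[None, :]
         + rng.normal(0, 1.0, (n_cand, n_exam)))

grand = marks.mean()
ms_cand = n_exam * np.sum((marks.mean(axis=1) - grand) ** 2) / (n_cand - 1)
ms_exam = n_cand * np.sum((marks.mean(axis=0) - grand) ** 2) / (n_exam - 1)
resid = (marks - marks.mean(axis=1, keepdims=True)
         - marks.mean(axis=0, keepdims=True) + grand)
ms_err = np.sum(resid ** 2) / ((n_cand - 1) * (n_exam - 1))

# Solve the expected mean squares for the variance components.
var_cand = (ms_cand - ms_err) / n_exam
var_exam = (ms_exam - ms_err) / n_cand
total = var_cand + var_exam + ms_err
for name, v in [("candidate", var_cand), ("examiner", var_exam), ("residual", ms_err)]:
    print(f"{name}: {100 * v / total:.0f}% of total variance")
```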
As mentioned above, many types of rater bias are known to be at play when human judgement forms part of any assessment process (the halo effect, the mood of the rater, familiarity with candidates, personality factors, etc.8,9,10,11). In 2013, Yeates and colleagues proposed three themes to explain how examiner variability arises26. They termed these: differential salience (what was important to one examiner differed from what was important to another); criterion uncertainty (assessors’ conceptions of what equated to competence differed and were uncertain); and information integration (assessors tended to judge in their own unique descriptive language, forming global impressions rather than discrete numeric scores).
Govaerts suggests that some examiner variability may simply arise from individual examiners’ peculiarities in approach, and the idiosyncratic judgements made as a result of the interaction between social and cognitive factors12.
Strategies to improve reliability in clinical assessments have ranged from increasing the number of items per station to implementing examiner training. Wilkinson et al. analysed examiners’ marks over a four-year period in New Zealand and found that, while items per station increased over the four years, there was no correlation between items per station and station inter-rater reliability4. Cook et al.27 examined the impact of examiner training and found no significant effect, and while Holmboe et al.28 showed that examiner training was associated with an increase in examiner stringency, this increase was inconsistent.
In a recent literature review on rater cognition in competency-based education, Gauthier et al.14 summarised the situation, stating: “attempts to address this variability problem by improving rating forms and systems, or by training raters, have not produced meaningful improvements”.
In the field of psychology, the Five-Factor Model of Personality (also referred to as the ‘Big Five’) has been proposed as an integrative framework for studying individual differences in personality and is among the most widely accepted taxonomies of personality in the literature, with wide application in different domains and across cultures owing to its empirical validity18,20. In this personality index, no single cut-off point separates those who “have” a particular personality trait from those who do not; rather, individual scores represent degrees of each of the five main personality traits: neuroticism, extroversion, openness to experience, agreeableness and conscientiousness. Scores are usually expressed as T scores and can be further described as very low, low, average, high or very high for each of the domains. The different personality traits are often associated with certain personal characteristics. Neuroticism has been linked to susceptibility to social influence strategies20. Extroversion has been found to be positively related to networking behaviours in organisations29 and to success in managerial and sales positions that require social interaction. Openness has been found to be the personality trait least susceptible to persuasion21. Other research has found agreeableness to be related to a tendency to favour positive social relationships and avoid conflict30. Employees who are high in conscientiousness generally display superior job performance compared with employees who are lower in this trait18.
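To make the scale concrete: a T score places a raw trait score on a common metric with mean 50 and standard deviation 10 relative to a normative sample. In the sketch below the norms are placeholders, and the descriptive band cut-offs are an assumed, commonly used convention rather than any specific instrument’s published norms.

```python
def t_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Convert a raw trait score to a T score (mean 50, SD 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

def band(t: float) -> str:
    """Descriptive band for a T score (assumed cut-offs)."""
    if t < 35:
        return "very low"
    if t < 45:
        return "low"
    if t <= 55:
        return "average"
    if t <= 65:
        return "high"
    return "very high"

# Hypothetical: raw extroversion of 30 against norms (mean 25, SD 5).
t = t_score(30, norm_mean=25, norm_sd=5)
print(f"T = {t:.0f} ({band(t)})")  # T = 60 (high)
```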
In clinical examinations, Finn et al. found that examiner stringency was positively correlated with neuroticism and negatively correlated with openness to experience23. The influence of examiner personality factors on scoring by examiner pairs has not been explored to date.
Objectives
To analyse how an examiner’s marks vary between examining alone and examining in a pair
To explore associations, if any, between examiner personality factors and examiner behaviour in scoring candidates
To explore the usefulness of personality profiling in matching examiners to form an examiner pair
Research question
Do an examiner’s marks for a given candidate differ significantly when that examiner marks independently compared with when that examiner marks in a pair?
Is there an association between examiner personality factors and examiner behaviour in marking candidates’ performances?