This study used a new, bespoke programme within ExamWrite® (the question storage and delivery platform created by EpiGenysis for the Medical Schools Council Assessment Alliance) to generate four separate 50-question MCQ assessments using a weak-theory, or item-clone, model of AIG similar to that described by Lai et al. (2016a). Rather than using a generic question template for the clinical vignette, each question within the MSCAA question bank contains the minimum information required to derive the correct answer, so that cognitive load is reduced as far as possible. This inevitably creates variability in the amount of content and the overall length of each question. It is perhaps not surprising that items containing more information, in particular examination findings and investigation data, had the potential to create the greatest number of variants.
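To illustrate the item-clone approach described above, the sketch below shows, in Python, how a parent item might be stored as a template and have its incidental variables substituted to produce nominally isomorphic variants. This is a minimal illustration only; the template, variable names and values are hypothetical and do not represent the actual ExamWrite® implementation.

```python
# Illustrative item-clone AIG sketch (hypothetical template and values;
# not the ExamWrite implementation). Incidental variables are substituted
# into a parent vignette to produce nominally isomorphic variants.
from itertools import product
from string import Template

PARENT_ITEM = Template(
    "A $age-year-old $sex presents with $pain_description radiating to the groin. "
    "Urinalysis shows microscopic haematuria. What is the most likely diagnosis?"
)

# Incidental variables: intended to leave the underlying clinical reasoning unchanged.
INCIDENTALS = {
    "age": ["34", "41", "52"],
    "sex": ["man", "woman"],
    "pain_description": ["severe colicky loin pain", "sudden-onset flank pain"],
}

def generate_variants(template, incidentals):
    """Substitute every combination of incidental values into the parent item."""
    keys = list(incidentals)
    return [
        template.substitute(dict(zip(keys, values)))
        for values in product(*(incidentals[k] for k in keys))
    ]

if __name__ == "__main__":
    for stem in generate_variants(PARENT_ITEM, INCIDENTALS):
        print(stem)
```

As noted above, the more clinical detail a parent item contains (examination findings, investigation results), the more incidental variables are available and hence the more variants such a template can yield.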
The assumption was that altering incidental variables would produce ‘isomorphic’ items. The study therefore aimed to test whether the four AIG papers would have the same facility and the same standard set. The results, however, suggest that ‘isomorphic’ MCQ assessments created using AIG do not necessarily have the same facility when given to students of similar ability: average student performance varied by 10 percentage points across the four papers, from 55% (paper 4) to 65% (paper 3). This contrasts with the much smaller variation in the standard set for each paper, where the passing standard ranged only from 58% (papers 1 and 4) to 61% (paper 3).
As the cohort sitting each assessment was relatively large (444 to 659 students), it seemed reasonable to assume that the spread of student ability across the four papers would be similar. The assigned passing standard for each paper is consistent with this notion, with only small differences between papers. Importantly, however, the average facility of each paper diverged substantially. There are a number of possible explanations for the differences observed in overall performance. Firstly, assessment papers were assigned by medical school rather than randomly allocated to individual students. There is evidence that performance in MCQ assessments varies between institutions within the UK (Hope et al., 2021), and this may be one explanation for the difference in performance. The AIG papers were also sat within a relatively narrow window of six weeks, so that student cohorts from different medical schools had varying amounts of time between sitting this formative assessment and their subsequent summative examinations. Different medical school cohorts are therefore likely to have been at different stages of examination preparation when they took this assessment. Students may also have varied in their individual level of engagement with this online assessment, depending on their approach to formative assessment opportunities.
As noted above, the standard set for each paper showed far less variation than student performance but approximated the average student performance for each of the four papers. It is well known that those setting a passing standard using the Angoff method have a tendency to revert to the mean (McLachlan et al., 2021). A previous large study of standard setting in Australian medical schools (Ward et al., 2018) showed the same tendency for standard setters to underestimate the difficulty of hard items and overestimate the difficulty of easy items, with implications for how well the standards then correlate with actual student performance. Our study also showed marked reversion to the mean, in that on average judges underestimated the facility of easy questions and overestimated the facility of difficult questions. The standard set using the Angoff method should represent the performance of a borderline candidate (a minimum passing score), which with a normal distribution of ability would mean that the majority of students should pass. If the standards set in our study were applied, around 50% of the total cohort would fail. The initial assumption might be that the standard was too high and that the judges did not apply an appropriate standard for a final undergraduate medical examination. However, the members of the standard setting panel were drawn from a national standard setting group that has set a reliable standard for common content items used in final-year examinations across UK medical schools in previous years. The more likely explanation is that success in MCQ examinations depends heavily on student preparation: participants in this formative study had not completed their pre-examination studies and are therefore likely to have had a lower level of knowledge than they will have when they sit their summative assessments.
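For clarity, the sketch below illustrates how an Angoff-type passing standard is typically derived: each judge estimates, for every item, the probability that a borderline candidate would answer correctly, and the cut score is the mean of these estimates across judges and items. The figures are hypothetical and are not the study's data.

```python
# Minimal Angoff-style sketch with hypothetical judge estimates (not the study's data).
# judge_estimates[j][i] is judge j's estimated probability that a borderline
# candidate answers item i correctly.
from statistics import mean

judge_estimates = [
    [0.60, 0.45, 0.70, 0.55],
    [0.65, 0.50, 0.60, 0.50],
    [0.55, 0.55, 0.65, 0.60],
]

def angoff_cut_score(estimates):
    """Average each item's estimates across judges, then average across items (as a %)."""
    n_items = len(estimates[0])
    item_means = [mean(judge[i] for judge in estimates) for i in range(n_items)]
    return 100 * mean(item_means)

print(f"Passing standard: {angoff_cut_score(judge_estimates):.1f}%")  # 57.5% for these values
```

Reversion to the mean, as observed here, would appear in such data as judges' per-item estimates clustering more tightly around the overall average than the items' actual facilities do.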
It was not just overall student performance on each paper that differed. Individual questions showed substantial variation, both in how each of the four variants performed relative to each other and in the degree of correlation between student performance and the passing standard set (highlighted in Table 4). In seeking hypotheses to explain this finding, the predominant theme we identified was that the facility of question variants diverged most when clinical vignettes departed furthest from the typical keywords (or ‘buzzwords’) associated with the condition. Facility was lower for vignettes that deviated more from the prototypical description of signs and symptoms than for variants with a more classic description. Importantly, those setting the standard did not appear to anticipate the degree of difficulty that this type of variant would create.
There are several possible explanations for this observation. The illness script is a concept introduced by Feltovich and Barrows (1984) to explain how doctors make diagnoses. An illness script consists of the typical components and general sequence of events that occur in a specific disease and, once established, illness scripts allow automatic activation of pattern recognition (Custers, 2015). Script ‘instantiation’ occurs each time a physician sees a patient with a given condition; each patient seen therefore helps to refine that clinician's general illness script for the condition. Clinicians with more experience develop more refined illness scripts, in particular with regard to associated ‘enabling conditions’ (the patient and contextual features, such as age and risk factors, that influence the probability of that condition being the diagnosis). It is likely that those setting the standard for the questions have more developed illness scripts and more readily arrive at the correct answer regardless of the text variant used, and therefore give each variant a similar standard. Students, on the other hand, will have less well developed illness scripts, with a more rudimentary organisation of events, and may rely more on prototypical descriptions of individual signs and symptoms when reaching an answer (Schmidt and Rikers, 2007; Custers, 2015). When a student’s knowledge is based mainly on learnt prototypical descriptions (keywords or buzzwords) rather than clinical experience, their pattern recognition of a given condition is likely to be less developed and they may lack awareness of the variability in disease presentation seen in the real world (Stringer et al., 2021). This is exemplified by Item 12 (Fig. 3), where student performance is hypothesised to be related to the use of the phrase ‘loin to groin’, which acts as a buzzword for renal colic. In support of this hypothesis, Chan and Eppich (2020) found that doctors equated keywords (or buzzwords) with studying for undergraduate medical examinations; one of the participants in their study, a junior doctor, said ‘…when we’re first learning clinical medicine, a lot of the patterns that we recognise are in specific phrases’. They concluded that keywords can communicate entire diagnoses and activate illness scripts independently of any other information. Think-aloud studies of how candidates answer multiple choice questions have also identified recognition of buzzwords as a test-taking cognitive approach (Surry et al., 2017).
Similarly, in their think-aloud study of the cognitive approaches students use to answer written assessments, Sam et al. (2021) identified responding to buzzwords as a test-taking behaviour leading to superficial, non-analytical cognitive processing.
An alternative way to interpret the lower facility observed in variants containing less prototypical descriptions of a condition is to consider cognitive errors in the context of dual processing and bias. Norman (2009) describes ‘representativeness’ as a form of bias: the tendency to be influenced by the prototypical features of a disease and thereby risk missing an atypical presentation. An example of this is shown in Fig. 2, where students were more likely to correctly diagnose acute cholecystitis when there was a prototypical description of right upper quadrant or upper abdominal pain, but were less likely to make the diagnosis when the pain was described as epigastric, even when other evidence supported this diagnosis. Representativeness bias was also demonstrated in a study by Rathore et al. (2000) in which two role players (a white man and a black woman) presented with identical symptoms of ischaemic heart disease; students were less likely to characterise the black woman’s symptoms as angina than the white man’s (46% vs. 74%, P = 0.001). Croskerry (2003) describes a number of cognitive errors, including premature closure: the acceptance of a diagnosis based on initial salient information without consideration of the whole presentation. Overreliance on keywords can result in premature closure if the keywords are assumed to verify the diagnosis (Chan and Eppich, 2020), and premature closure due to homing in on keywords has also been cited as a cause of cognitive error when answering MCQs (Heist et al., 2014; Surry et al., 2018). In this study we also found that changing a keyword or phrase could invoke a false illness script, as shown in Fig. 3, where the undue emphasis on weight loss (even though only a modest amount) was thought to have led a substantial number of students to erroneously consider cancer as the most likely diagnosis, on the basis that this would be the primary reason for requesting a CT scan.
Whilst this study has demonstrated that question variants created using AIG (and presumed to be isomorphic) can have different psychometric properties, we acknowledge that the study had limitations. Firstly, participants were randomised by medical school rather than individually, so the time between sitting this assessment and the final summative examinations differed between medical schools, which may have affected student performance in this assessment. Furthermore, we know that performance on common content items in summative examinations also varies between medical schools across the UK (Hope et al., 2021). Differences in performance between papers may therefore reflect differences in school cohort performance rather than question characteristics. The common content items used for standard setting belong to a secure question bank and were not available for formative assessment because of concerns about item security; however, including common content items in the student assessment papers would have helped to identify whether the differences between papers were a function of the question items or of overall student ability. Secondly, whilst the study set out to investigate whether differences in performance and standard setting would be observed, it was not designed to test any specific hypotheses as to why this might occur.
We also offer a possible explanation for this phenomenon in terms of illness scripts, reliance on keywords, and the bias that can result. Further research into the effect that using or avoiding keywords and prototypical descriptors has on student performance and standard setting behaviour is warranted.