Pass Rate Set by Borderline Regression Method but not by Modified Angoff is Independent of Difficulty of Content in Objective Structured Clinical Exams

Background: Standard setting is a method of determining the cut-off point on the scoring scale that separates the competent from the non-competent. This is a crucial feature of any exam. The pass rate should ideally be independent of the difficulty of the exam content.

Methods: We compared the modified Angoff method (MAM) with the borderline regression method (BRM) of standard setting in 185 candidates examined by 137 examiners in the oral part of the European Diploma in Intensive Care exam, June 2018. We then compared the effect of removing the hardest questions on the performance of the two techniques. The exam comprised 299 items in total across 6 OSCE stations. OSCE stations were of two types: short computer-based OSCE stations (3 x 12 minutes), and longer structured discussion stations based on a clinical case (3 x 25 minutes). Our focus was the effect of item difficulty on the performance of the two standard setting techniques in determining the pass mark.

Results: MAM and BRM led to similar overall pass rates for the shorter computer-based 12 min OSCE stations. In the longer structured discussion 25 min stations, MAM set a pass mark much higher than BRM, failing more of the candidates whose performance had been judged by the examiners' global assessment as above the standard required to pass. Further analysis showed that the exam items most responsible were the more difficult items with lower discrimination; Angoff judges over-estimated borderline candidates' ability on these items. Elimination of these items led to convergence of the pass marks set by the two methods.

Conclusion: Pass mark setting by the modified Angoff method, but not by the borderline regression method, is influenced by the difficulty of exam content. This has practical implications for evaluating the results of OSCE exams.


Glossary Terms
Facility index: the proportion of candidates who answer a test item correctly. We prefer this term over "item difficulty" because the index ranges from 0 to 1, with higher values indicating easier questions; calling it "difficulty" is counterintuitive, as easy questions answered correctly by more candidates would then carry the higher "difficulty" value.
Item discrimination: the degree to which students with high overall exam scores also answered the particular item correctly. The higher the discrimination, the better the item discriminates between the competent and the non-competent. Easy questions (high facility index) are expected to have lower discrimination than hard questions (low facility index).
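Both glossary quantities can be computed directly from a binary score matrix. The sketch below uses a small made-up dataset to show one way of doing this in Python (NumPy); the variable names and numbers are illustrative only, not taken from the study.

```python
# Sketch: facility index and item-rest discrimination from a score matrix
# (rows = candidates, columns = items; 1 = correct, 0 = incorrect).
# The data below are invented for illustration.
import numpy as np

scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
])

# Facility index: proportion of candidates answering each item correctly
# (low = hard, high = easy).
facility = scores.mean(axis=0)

# Item-rest correlation (point-biserial discrimination): correlate each
# item with the total station score *excluding* that item.
def item_rest_correlation(scores, item):
    rest = scores.sum(axis=1) - scores[:, item]
    return np.corrcoef(scores[:, item], rest)[0, 1]

discrimination = [item_rest_correlation(scores, j)
                  for j in range(scores.shape[1])]
```

Items answered by everyone (or no one) have zero variance, so their item-rest correlation is undefined; real post-exam analyses flag such items separately.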

Background
The European Diploma in Intensive Care (EDIC) [1] is awarded to candidates who pass a two-stage CESMA-accredited exam [2] process at the end of their intensive care medicine (ICM) training. Part one of the EDIC is a multiple-choice test (MCQ) testing core knowledge of clinical intensive care medicine. Part two is an oral examination, further testing knowledge, but with greater focus on the application and integration of that knowledge into practical decision making at the bedside that comes with experience. Since 2013, the oral part of the exam has been held twice a year simultaneously in 5 to 7 examination centers across Europe. The format has been a modified Objective Structured Clinical Examination (OSCE) [3], with all candidates asked the same predefined questions with pre-specified correct answers. This Part 2 (oral) examination comprises two sections, each of a different OSCE style. The shorter section uses computer-based questions (3 x 12 minutes) to test, for example, radiology, investigations and monitoring. The longer section uses a structured discussion style based around a single clinical case (3 x 25 minutes). A candidate must pass two out of three clinical case scenarios (CCSs) and two of three computer-based stations (CBSs) in order to pass the exam. During the CBSs (12 min each) one examiner tests the candidate's capability to recognize common patterns of clinical imaging, different curves relevant in critical care (monitor traces, ventilator screens, ECGs etc.) and biochemical abnormalities. During each CCS (25 min each) two examiners test one candidate's ability to make routine decisions for a typical intensive care clinical case. The OSCE starts with questions on the initial clinical scenario provided. Questions then further explore the case as it unfolds, with new information introduced to the candidate.
The clinical cases are derived from 'real life' cases intended to reflect real critical care problems as closely as possible (for more details about EDIC exam see www.esicm.org) [1].
After the ESICM Examinations Committee decided to stop weighting exam items according to clinical relevance because of an unstable pass mark (see Appendix Part 1 and ref [4]), a new pass-mark technique had to be chosen. Consequently, for the Spring 2018 examination, the pass mark was derived by two separate, parallel techniques: the modified Angoff technique with unweighted marks (otherwise similar to the previous technique), and the borderline regression method [5][6][7][8]. Data were collected and analyzed to compare the performance of these two methods and the effect of the harder questions. This paper reports generalizable findings discovered by this comparison.

Methods
The examination process. The oral part is organized twice a year and runs over a single day simultaneously in 5-7 exam centers across Europe. Each centre typically accommodates around 36 candidates. Each candidate is exposed to 9 examiners across 3 CCSs and 3 CBSs. The examination begins for the candidate with a 30 min preparation period, during which they are given introductory vignettes to all 3 CCSs; these vignettes resemble an extract from the patient's notes at or near referral to the intensive care physician and contain information pertaining to the history, physical examination and results of laboratory tests and imaging. After this initial 30 minutes spent on their own reviewing the three vignettes, candidates move through the 3 CCSs and 3 CBSs.
Each CCS lasts 25 minutes. Initial questions pertain to the introductory vignette. Further vignettes are then provided as the case develops, containing additional information. There are 2 examiners in the CCS room; one interacts with the candidate, while the other records the correct answers on a tablet. Once the candidate has left the room at the end, the examiners spend 5 min agreeing which answers were correctly provided, and agreeing a global assessment of performance against the defined standard for the examination. The EDIC 2 (final, or part 2) examination is regarded as an exit examination, testing knowledge and the ability to apply that knowledge at a level commensurate with completion of training as an intensive care physician in Europe. The global assessment made jointly by both examiners at the end of each CCS and each CBS records where the examiners think the candidate fell along a spectrum for that question, and is independent of the difficulty of that particular question. The scale used comprises 1 for clearly unsatisfactory, 2 for bare fail, 3 for borderline, 4 for clearly satisfactory, and 5 for superior performance. For any given performance on this scale, a candidate might be expected to achieve a higher number of correct answers on an easier question, or a lower number of correct answers on a harder question.
The CBS is examined by 1 examiner aided by a computer and is marked in a similar way, with both marks for correct answers and a global assessment recorded using the same scale as in the CCS section (above). During the 12 min exam time, the candidate is shown 8-12 slides, each containing a brief description of the clinical context, an image and one or two questions. These single-slide questions test the candidate's ability to recognize patterns and integrate them with the clinical information. The focus of CBSs is imaging (x-rays and computed tomography), curves (haemodynamics, ventilator and monitor screens) and laboratory or blood test results. The examiner records correct answers provided by the candidate as the exam progresses, again on an electronic tablet. The examiner then has 3 minutes after the candidate leaves the room in which to record their impression of the candidate's general performance on the same scale as used in CCSs.
Selection and training of examiners. The ESICM Examinations Committee prepares the exam content with assistance from a consultant educationalist and supervises the process of selection and training of EDIC examiners in each center. Apart from mandatory training that includes observing one series of the exam, all examiners have a half-day training session before every exam, with a full explanation of the exam content. The measures taken to equalize examiner performance across the different centers have been described elsewhere [9].
Post-hoc evaluation of exam content. After each exam, the facility index and discrimination of each item are calculated and reviewed by the Examinations Committee during a post-exam key validation meeting. The facility index is the proportion of candidates who gave the correct answer (low being hard and high being easy); by convention it ranges between 0 and 1. The measure of discrimination used in this exam is the point-biserial correlation (item-rest correlation [10]). This describes how the answer on a particular item correlates with the total score of the whole exam station without that item. Point-biserial discrimination ranges between -1 and 1.
Modified Angoff method (MAM) of standard setting. In the first round, each rater independently reviewed the content item by item. The question asked was: "What percentage of minimally competent candidates would give the correct answer?" The minimally competent candidate in this context is one who possesses the minimum level of knowledge and ability to meet the standard of the exam, that is, to have completed training as an intensive care physician as considered by the ESICM Examinations Committee, and thus be able to safely look after patients unsupervised. The experts were instructed to predict each item's facility index within the range 10-90% (in accordance with Angoff's original text, 100% is typically avoided because of the connotation it carries of perfect performance). The original Angoff method [11] was later commonly modified by the addition of a second round of evaluation, as described [12]. During the second round, individual ratings were disclosed to the group. Items on which the raters differed by more than 40% were discussed, and the raters with extreme ratings were allowed but not forced to re-evaluate their probabilities. The predicted probability of the minimally competent candidate giving the correct answer for each item was taken as the mean of all the judges' scores for that item after the discussion.
The pass mark for each station was calculated as the sum of all these means for items in that station.
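As a concrete illustration of the Angoff arithmetic just described, the sketch below uses invented judge ratings (all numbers are hypothetical): each item's Angoff value is the mean of the judges' predicted probabilities, and the station pass mark is the sum of these means.

```python
# Sketch of the modified Angoff pass-mark calculation (illustrative data).
# Each row: one judge's post-discussion predicted probability that a
# minimally competent candidate answers each item correctly.
import numpy as np

judge_ratings = np.array([
    # item1 item2 item3
    [0.70, 0.40, 0.80],
    [0.60, 0.50, 0.90],
    [0.65, 0.35, 0.85],
])

# Per-item Angoff value: mean across judges.
item_means = judge_ratings.mean(axis=0)

# Station pass mark: sum of the per-item means, i.e. the expected number
# of correct answers from a minimally competent candidate.
pass_mark = item_means.sum()
```

Because the pass mark is a sum of the judges' predictions, any systematic over-estimation of hard items' facility feeds directly into an inflated pass mark, which is the mechanism the paper examines.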
Borderline regression method (BRM) of standard setting. BRM is a modern examinee-centered standard setting procedure, in which the experts' (examiners') decisions are based on the actual performance of the examinees during the course of the exam [6]. A rater evaluates the student's performance at each station by completing both a checklist recording the correct answers provided, and a global rating scale assessing overall performance against the defined standard for the exam (1 for clearly unsatisfactory, 2 for bare fail, 3 for borderline, 4 for clearly satisfactory, and 5 for superior performance). The scores (total correct answers for each candidate) from all examinees at each station are subsequently regressed on the global ratings, providing a linear equation. The pass mark is taken as the score of correct answers corresponding to the borderline global rating on this regression (see Figure 1). All candidates achieving that score or above pass.
Comparisons of the impact of hard questions on the performance of the two standard setting methods were made in two steps. First, the two techniques were compared on the full question set (299 questions) and then again on the 282 questions remaining after removal of the 17 questions deemed hardest and least discriminatory by the Examinations Committee. Second, the two methods were compared by ranking the questions by hardness (low facility index) and then performing iterative comparisons of the techniques as progressively more of the harder questions were removed. These iterative calculations of pass mark and pass rate by MAM and BRM, as an increasing proportion of questions from the hard end of the spectrum were removed, were automated by a script written in the Mata language [19] (Figure 2).
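The BRM calculation itself is a simple linear regression. The sketch below, using entirely synthetic scores and ratings (not study data), regresses checklist scores on the 1-5 global rating and reads off the pass mark at the borderline rating of 3.

```python
# Sketch of the borderline regression method (synthetic data for illustration).
import numpy as np

global_rating = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5])        # 1-5 scale
checklist     = np.array([4, 7, 6, 10, 11, 9, 14, 13, 17, 18])  # correct answers

# Regress checklist scores on global ratings to get a linear equation.
slope, intercept = np.polyfit(global_rating, checklist, 1)

# Pass mark: the fitted checklist score at the "borderline" rating of 3.
pass_mark = slope * 3 + intercept

# All candidates at or above the pass mark pass.
pass_rate = np.mean(checklist >= pass_mark)
```

Because the regression uses every candidate's performance, a handful of very hard items lowers all checklist scores together and the fitted line simply shifts, leaving the pass rate largely unchanged, consistent with the behaviour of BRM reported here.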
The comparison between the two standard setting techniques, and the effect of removal of the questions at the hard end of the spectrum, was explored for both CCS-type and CBS-type questions (Figure 3). Relative reliability G coefficients (see Appendix, Part 6) were calculated using EduG 6.1 software with a Candidates/Items measurement design [20].

The pass rate set by MAM was below the expectation of the committee and the examiners. Even after removal of the 17 questions, the pass rate remained lower by MAM than by BRM (41% vs 51%).

Results
The CCS question type was the main factor in the different effects seen between the two techniques when the 17 questions were removed. Removing these 17 questions made little difference to the pass mark for CBS questions with either MAM or BRM. MAM suggested an exceptionally low pass rate for CCS questions when these 17 were included, moving towards a more usual or expected pass rate only when they were excluded. The pass rate judged by BRM varied little with or without these 17 questions, and either way fell within the expected range.
The borderline regression method (BRM) requires, as described above, a contemporaneous judgement by the examiners along a global assessment scale as to where the candidate's performance fell along a pass-fail spectrum. Using this, we performed a sub-group analysis of the candidates deemed to be borderline. In this sub-group, the pass rate was close to 50% when BRM was used, both on the full 299 questions and on the 282 remaining after the 17 were removed, but much lower when MAM was used to set the standard (see Table 2). Of note, 22%, 73% and 23% of candidates rated as "clear pass" or "superior performance" by examiners during CCSs 1-3, respectively, would have failed that station had MAM been used for standard setting. In summary, our primary results showed that BRM produced believable and acceptable results with or without the 17 questions removed, and for both CCS and CBS question types. In contrast, MAM produced an unacceptable and improbable pass mark, particularly on the full 299-question dataset and particularly for CCS-type questions. We then looked more closely at CCS-type questions for the reasons for this.
We noticed that CCSs had a much higher proportion of harder (low facility index), poorly discriminatory items (27%) than CBSs (3%); see Supplementary Appendix Part 4 for details. We therefore hypothesized that these poorly performing exam items were affecting standard setting by MAM, but not by BRM. To evaluate this further, we looked at the effect on the two methods of progressively removing an increasing proportion of the harder items. We iteratively removed items in a stepwise manner, starting with those that were hardest (lowest facility index) and had the lowest discrimination (see Fig. 2), and calculated the impact on the hypothetical pass rates obtained when MAM and BRM were applied to set the standard. As shown in Fig. 2, the pass rate set by BRM remains almost unaffected, whilst the pass rate set by MAM increases and converges with the pass rate obtained by BRM. Of note, the difference between the pass rates by MAM and BRM at baseline is proportional to the proportion of low-facility, low-discrimination items.
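The study's stepwise removal analysis was scripted in Mata; a Python sketch of the same idea, on fully synthetic data with a crude stand-in for the examiners' global rating, might look as follows (every quantity here is invented for illustration).

```python
# Sketch of the stepwise hard-item removal analysis (synthetic data).
# Items are ranked by facility index and removed from the hard end;
# the BRM pass rate is recomputed at each step.
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_items = 100, 30

# Synthetic binary score matrix with a spread of item difficulties,
# weakly linked to a latent candidate ability.
facility_true = np.linspace(0.2, 0.9, n_items)
ability = rng.normal(0, 1, n_candidates)
scores = (rng.random((n_candidates, n_items))
          < facility_true + 0.1 * ability[:, None]).astype(int)

# Crude stand-in for the 1-5 global rating, derived from ability.
rating = np.clip(np.round(3 + ability), 1, 5)

def brm_pass_rate(scores, rating):
    total = scores.sum(axis=1)
    slope, intercept = np.polyfit(rating, total, 1)
    return float(np.mean(total >= slope * 3 + intercept))

# Hardest items (lowest observed facility) first.
order = np.argsort(scores.mean(axis=0))

rates = []
for n_removed in range(0, 16, 5):
    kept = order[n_removed:]
    rates.append(brm_pass_rate(scores[:, kept], rating))
```

A full reproduction would recompute the MAM pass mark at each step as well, using the judges' per-item predictions; the paper's finding is that the BRM series stays nearly flat while the MAM series rises toward it.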
In order to explore why MAM is influenced by the presence of hard items, we looked more closely at the relationship between the facility index predicted by the Angoff judges and the real facility index of each item. The hardness of an exam overall should not affect the pass rate for any given cohort. The ideal standard setting technique should take this into account, setting a higher pass mark for an easier question set and vice versa. By progressively removing an increasing percentage of the questions from the harder end of the spectrum, we were able to test the effect of the two standard setting methods on the pass mark and pass rate with a progressively easier question set on average.
BRM provided a more constant pass rate as the average question set difficulty was varied, which was reassuring. MAM, however, progressively resulted in a higher pass rate as the average question set became easier; conversely, MAM made the exam harder to pass if the question set was harder on average. MAM predicted the harder questions to be easier than they really were, inflating the pass mark and so pulling the pass rate down when these questions were included.
In our data, both techniques concurred when the question set's hardness and discrimination were such that the pass mark was assessed at around 65%; at the point of convergence of the pass rates by the two methods, the pass mark was in the range 63-66% (see Table S1 in Part 5 of the Supplementary appendix for detailed results).
The intercept of the line in Fig. 3 is the predicted facility index of a hypothetical item with a real facility index of 0 (no candidate getting it right), for example an extreme question whose answer nobody could possibly guess or know. Yet MAM predicts a percentage of correct answers (albeit low) even for such an item.

Discussion
The main finding of our analysis is that in an oral OSCE-type exam, the pass rate set by the modified Angoff method (MAM), but not by the borderline regression method (BRM), is dependent on the difficulty of the exam content. The presence of a larger proportion of harder items with low discrimination leads to an arbitrary increase in the pass mark set by MAM, leading to the failure of candidates rated as clear passes or superior performances by the examiners. BRM is not affected by these harder, non-discriminatory items, suggesting that examiners on site are not adversely influenced by them. Of particular note and importance, the inclusion of these items distorts the relationship between the MAM-predicted and real difficulty of items in the subgroup of borderline candidates; this is the sub-group of candidates most affected by incorrect setting of the pass mark. These data show that harder, lower-discrimination items represent a challenge to Angoff raters and therefore, in turn, a challenge to candidates.
This finding is important, as MAM is still widely used for standard setting in many high-stakes oral exams. As opposed to written exams where the options are laid out in front of the candidate (e.g. MCQs), it is much more difficult to give the pre-specified correct answers when an open question is asked. Our analysis clearly demonstrates that Angoff judges systematically fail to identify items that are next to impossible to answer. It would require a detailed analysis of the discussion during the second round of the MAM to find out whether there was an "emperor's new clothes" phenomenon, in which some of the judges guard their feeling of competence [21] and consciously or subconsciously refuse to admit that they did not know the answer either. It has already been found that Angoff judges tend to reach consensus at a higher cut-off score when group discussion is allowed (Wood 2006), a phenomenon that has been called group polarization [22,23].
Harder, poorly discriminatory items were mostly found in the longer (25 min) CCSs as opposed to the shorter (12 min) CBSs.
BRM is an extension of the borderline group method [24,25], in which the mean score of the borderline group is used as the station pass mark; BRM instead uses a regression line based on the actual performance of all examinees [5]. BRM cannot be used in exams sat by fewer candidates (e.g. below 50), which is often the case for high-stakes specialist exams. Given the relatively high number of both examinees and examiners (>100), we consider this method optimal for standard setting in the EDIC exam.
One weakness of this study is that we used only 8 Angoff judges.

Availability of data and material. All data generated or analysed during this study are included in this published article and its supplementary information files.
Competing interests: The authors declare that they have no competing interests.