Simulated data generation
We simulated the operation of VESCA through three sequential processes (see Fig. 1), modelling the combined effect of several known influences on OSCE scores. All parameter estimates were empirically derived from analysis of Yeates et al.'s (9) data.
Firstly, we modelled the “true” performance of a range of students on each station in an OSCE using a simple sum-score approach. Data were generated using the GeCos scale (14), which combines ratings on several performance domains to give a scale minimum of 6 and a maximum of 27. We randomly generated a distribution of students’ overall ability (M = 19.47 out of 27; SD = 1.13 (5.4% of scale)), a range of station difficulties (SD = 1.52 (7.2% of scale)) and an idiosyncratic student-by-station interaction (SD = 1.71 (8.1% of scale)). We combined these using a linear function to produce students’ simulated “true” performance on each station in the OSCE.
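As a minimal sketch of this first step (a Python illustration; the study implemented data generation in R, and all variable names here are our own), using the parameter values above:

```python
import random

random.seed(42)
N_STUDENTS, N_STATIONS = 60, 12

# Empirically derived parameters from the text (GeCos scale, 6-27)
ability = [random.gauss(19.47, 1.13) for _ in range(N_STUDENTS)]
difficulty = [random.gauss(0.0, 1.52) for _ in range(N_STATIONS)]

# "True" performance: ability + station difficulty + student-by-station interaction
true_score = [
    [ability[s] + difficulty[st] + random.gauss(0.0, 1.71)
     for st in range(N_STATIONS)]
    for s in range(N_STUDENTS)
]
```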
Secondly, we added examiner variability to these scores by creating a distribution of examiners (SD = 1.40 (6.7% of scale)). Examiners were randomly allocated to a station and to 1 of 4 examiner cohorts (i.e. distinct groups of examiners), such that each student’s “true” scores were exposed to a unique group of examiner stringencies, and the same examiner stringency applied to all students for a given station within a cohort. As examiners did not change station, we could not model examiner-by-station effects. Next, we simulated an additional random error term (SD = 2.35 (11% of scale)) to capture further unmodelled variation in examiners’ scoring (for example, due to time-of-day (15), contrast (16) or halo (17) effects from previous candidates, examiner-by-student interactions, and any other unknown sources of variability). We summed each student’s “true” performance score on each station with the examiner stringency and the additional random term to give the student’s “observed score” on each station in the OSCE – the score they would actually have received in the exam. Formally, generation of the observed students’ scores can be expressed as:
$${Score}_{ijk}= {\beta }_{0}+{u}_{1}{Station}_{i}{ + u}_{2}{Student}_{k}{ + u}_{3}{Student:Station}_{ik}+{ u}_{4}{Examiner}_{j}+ {\epsilon }_{ijk}$$
Where \({\beta }_{0}\) is the overall model intercept (i.e., the average student score in the dataset), \({u}_{1}\) the difficulty of station \(i\), \({u}_{2}\) the ability of student \(k\), \({u}_{3}\) the interaction between student \(k\) and station \(i\), \({u}_{4}\) the stringency of examiner \(j\), and \({\epsilon }_{ijk}\) the residual error.
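The score-generation equation can be sketched term by term as follows (a Python illustration of the model; the actual generation used an R function, and the one-examiner-per-station simplification here is our own):

```python
import random

random.seed(0)
N_STUDENTS, N_STATIONS = 60, 12
BETA0 = 19.47  # overall intercept (average student score)

u_station = [random.gauss(0, 1.52) for _ in range(N_STATIONS)]   # station difficulty
u_student = [random.gauss(0, 1.13) for _ in range(N_STUDENTS)]   # student ability (centred)
u_examiner = [random.gauss(0, 1.40) for _ in range(N_STATIONS)]  # examiner stringency

# observed_ik = beta0 + station_i + student_k + interaction_ik + examiner_i + eps_ik
observed = [
    [BETA0 + u_station[i] + u_student[k]
     + random.gauss(0, 1.71)    # student-by-station interaction
     + u_examiner[i]            # stringency of the examiner on station i
     + random.gauss(0, 2.35)    # residual error term
     for i in range(N_STATIONS)]
    for k in range(N_STUDENTS)
]
```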
Thirdly, we mimicked the VESCA procedure by randomly selecting a specified number of student performances on each station and nominating these as “video performances”. A proportion of examiners were then randomly selected to “participate” (see RQ3), and the stringency values of these examiners, plus the random error term, were applied to the relevant “video performances” for the station they had examined. This created an additional set of crossed “video scores” for each station, as would be collected using VESCA (i.e. the same “video performances” were scored by multiple examiners from different examiner cohorts). The resulting dataset comprised students’ “live” observed scores on each station in the OSCE, together with the observed video scores allocated to station-specific videos by examiners. All data generation was performed via a flexible function written in R (18). The function defaults to four examiner cohorts but allows manipulation of: i) the number of linking videos, ii) the minimum and maximum of the score range, iii) the number of stations, iv) the number of candidates, v) the number of cohorts, vi) the number of examiners, vii) the mean candidate ability, viii) the standard deviation of candidate scores, ix) the standard deviation of station difficulties, x) the standard deviation of examiner stringencies, xi) the standard deviation of the station-by-candidate interaction (i.e., the error in the “performance score”) and xii) the expected proportion of examiners who would participate in the linking process. See Fig. 1 for details.
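This third step can be sketched as follows (a Python illustration; the names and the flat dictionary of video scores are our own simplification of the R function described above):

```python
import random

random.seed(1)
N_STUDENTS, N_STATIONS, N_COHORTS = 60, 12, 4
N_VIDEOS = 4          # linking videos per station
P_PARTICIPATE = 0.8   # expected examiner participation rate

stringency = [[random.gauss(0, 1.40) for _ in range(N_STATIONS)]
              for _ in range(N_COHORTS)]
true_perf = [[random.gauss(19.47, 2.0) for _ in range(N_STATIONS)]
             for _ in range(N_STUDENTS)]

# Crossed "video scores": each station's videos are rated by the
# participating examiner from every cohort
video_scores = {}
for station in range(N_STATIONS):
    videos = random.sample(range(N_STUDENTS), N_VIDEOS)
    for cohort in range(N_COHORTS):
        if random.random() < P_PARTICIPATE:          # examiner participates
            for v in videos:
                video_scores[(v, station, cohort)] = (
                    true_perf[v][station]
                    + stringency[cohort][station]
                    + random.gauss(0, 2.35))         # residual error term
```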
The Many-Facet Rasch Model
As in the procedures used by Yeates et al (9), these data were then analysed using Many-Facet Rasch Modelling, in FACETS (19), to produce an adjusted overall (i.e. average) score for each student (see Fig. 1). The Many-Facet Rasch Model (MFRM) (20) expands the simple Rasch model (21), which models item difficulty and student ability, to include additional facets capturing effects such as rater leniency, schools, or locations. A simple three-facet model can be expressed as:
$$log\left(\frac{{P}_{nijk}}{{P}_{nij(k-1)}}\right)={B}_{n}-{D}_{i}-{C}_{j}-{F}_{k}$$
Where \({P}_{nijk}\) is the probability that person \(n\), on item \(i\), rated by judge \(j\), is given a rating of \(k\); \({P}_{nij(k-1)}\) is the probability that person \(n\), on item \(i\), rated by judge \(j\), is given a rating of \(k-1\); \({B}_{n}\) is the ability measure of test taker \(n\); \({D}_{i}\) is the ‘difficulty’ of test item \(i\); \({C}_{j}\) is the severity of rater \(j\); and \({F}_{k}\) relates to the probability of being rated in category \(k\) of item \(i\), rather than category \(k-1\). Applying this within our study, the specific model used was:
$$log\left(\frac{{P}_{nijk}}{{P}_{nij(k-1)}}\right)={Student}_{n}-{Station}_{i}-{Cohort}_{j}-{F}_{k}$$
This models the probability of student n, on station i, examined by an examiner in examiner cohort j, being rated in category k rather than category k-1.
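To make the model concrete, the following sketch (our own Python illustration, not the FACETS implementation) converts the adjacent-category logits defined above into rating-category probabilities:

```python
import math

def mfrm_category_probs(B, D, C, thresholds):
    """Category probabilities under the rating-scale MFRM.
    B = person ability, D = station difficulty, C = rater/cohort severity,
    thresholds = F_1..F_m (all in logits). Returns [P_0, ..., P_m]."""
    cum, logits = 0.0, [0.0]
    for F in thresholds:
        cum += B - D - C - F      # log(P_k / P_{k-1}) = B - D - C - F_k
        logits.append(cum)
    top = max(logits)             # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example values (arbitrary, for illustration only)
probs = mfrm_category_probs(B=1.0, D=0.2, C=0.3, thresholds=[-1.0, 0.0, 1.0])
```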
We ran each simulation 1000 times in order to obtain stable estimates. As this was computationally demanding, simulations were run via 16 virtual machines on a 16-core server, each linking R to FACETS using the R package “immer” (22).
Simulations
Several simulations were conducted to mimic the VESCA method in various contexts. Unless otherwise specified, simulations modelled 12 stations and 60 students in 4 cohorts, with 48 examiners, an assumed 80% examiner participation rate, and 4 linking videos.
Study 1 – The first study addressed RQ1 by varying the number of linking videos (0, 2, 4, 6 and 8) and the expected proportion of examiners consenting to provide linking data (50%, 65%, 80% and 100%). This included modelling “typical” conditions (i.e. Yeates et al 2021), comprising 4 linking videos and 80% participating examiners. No baseline differences between schools were modelled in Study 1. All permutations of parameter values were simulated, giving 5 (numbers of linking videos) x 4 (examiner participation rates) = 20 conditions, each run 1000 times.
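The resulting simulation grid can be enumerated straightforwardly; as a sketch (Python for illustration, with the parameter values from the text):

```python
from itertools import product

n_videos_options = [0, 2, 4, 6, 8]          # numbers of linking videos
participation_options = [0.50, 0.65, 0.80, 1.00]  # examiner participation rates

# Every unique pair of values, each of which is simulated 1000 times
conditions = list(product(n_videos_options, participation_options))
```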
Study 2 – The second study addressed RQ2 by varying the number of stations (6, 12, 18) and the degree of site-related baseline difference in examiner stringency / student ability (0%, 5%, 10%, 20%) – see the last paragraph of the Background for a definition. Baseline differences were modelled by selecting 2 examiner cohorts as “school A” and 2 examiner cohorts as “school B”, and then adding or subtracting the relevant percentage of the scale to the student and examiner coefficients for each school. We assumed that examiner stringency was completely negatively correlated with student ability (i.e., as students became more able, examiners became more stringent, so the mean expected scores between sites were equal). All combinations of parameter values were simulated, giving 3 (numbers of stations) x 4 (degrees of baseline difference) = 12 conditions, each run 1000 times.
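The baseline-difference manipulation can be sketched as follows (a Python illustration; exactly how the percentage was apportioned between the student and examiner coefficients is our assumption, not stated in the text):

```python
SCALE_RANGE = 27 - 6      # width of the GeCos scale
baseline = 0.10           # e.g. a 10% site-related baseline difference
shift = baseline * SCALE_RANGE  # assumed: full percentage applied to each coefficient

# Cohorts 0-1 form "school A", cohorts 2-3 form "school B"
examiner_shift = [+shift, +shift, -shift, -shift]
# Student ability is shifted in the opposite direction at each school, so the
# expected observed scores at the two sites remain equal
student_shift = {"A": -shift, "B": +shift}
```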
Study 3 – The third study examined RQ3 by assessing the effect of reducing the size of the overall residual error term on the performance of the VESCA linking model. This was done by dividing the error term by 2 (error/2, i.e. 50% of the error in the prior studies), by 4 (error/4, 25%) or by 8 (error/8, 12.5%). The objective of this study was not to investigate a plausible real-life situation (reducing the residual error would be very difficult to achieve) but to understand the impact of this residual score error on the functioning of VESCA.
Measurement of performance
Having generated data using these parameters and obtained FACETS estimates of each student’s adjusted score, we used these estimates to determine their accuracy.
To do this, we calculated three variables for each student, for all 1000 iterations of each permutation of each study:
- Observed Score Error: the mean absolute difference (MAD) between the observed score and the performance score. This gave the residual error of each student’s observed score from their “true” score, prior to adjustment.
- Adjusted Score Error: the mean absolute difference (MAD) between the adjusted score and the performance score. This gave the residual error of each student’s score from their “true” score, after adjustment via the VESCA method.
For the VESCA method to show utility, we would expect the adjusted scores to be closer to the “true” scores than the observed scores. Lastly, we calculated:
- Score Adjustment: the mean absolute difference between the adjusted score and the observed score. This gave the size of the adjustment made to each student’s score using the VESCA method.
We then calculated the first of our dependent variables: the proportion of students whose adjusted score became more accurate than their observed score (for brevity, termed “pAcc”). This was defined as the proportion of students for whom “adjusted score error” < “observed score error” (i.e. VESCA score adjustment had resulted in a score nearer to their “true” performance score).
For each permutation of each study, we then calculated:

1. the mean of all students’ “observed score error”;

2. the mean of all students’ “adjusted score error”;

3. the ratio of mean “adjusted score error” to mean “observed score error” (i.e. item 2 divided by item 1).
This demonstrated, on average, how much score accuracy changed for each permutation in each study. For brevity, we term this the “error ratio” (ErR); values below 1 indicate improved accuracy and values above 1 indicate reduced accuracy.
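These dependent variables can be computed as follows (a Python sketch with made-up scores; in the study the adjusted scores came from FACETS):

```python
import random

random.seed(3)
true = [random.gauss(19.47, 1.13) for _ in range(60)]
observed = [t + random.gauss(0, 2.35) for t in true]   # pre-adjustment scores
adjusted = [t + random.gauss(0, 1.50) for t in true]   # hypothetical post-VESCA scores

observed_err = [abs(o - t) for o, t in zip(observed, true)]
adjusted_err = [abs(a - t) for a, t in zip(adjusted, true)]

# pAcc: proportion of students whose adjusted score is nearer their true score
p_acc = sum(a < o for a, o in zip(adjusted_err, observed_err)) / len(true)
# ErR: mean adjusted error / mean observed error; below 1 means improved accuracy
err_ratio = (sum(adjusted_err) / len(true)) / (sum(observed_err) / len(true))
```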
To address RQ4 (how does the proportion of candidates whose scores become more accurate vary with the size of score adjustment?), we categorised each student’s data in each permutation of each study by the size of the score adjustment they received, using categories of score adjustment (expressed as a percentage of the assessment scale) of: [0–1%), [1–2%), [2–3%), [3–4%), [4–5%), [5–6%), [6–7%), [7–8%), [8–9%) and >9%. Next, we further categorised students by the extent of change in the accuracy of their adjusted scores compared to their observed scores (i.e. how much more or less accurate their adjusted score became), again expressed as a percentage of the assessment scale: <-6%, (-6%, -4%], (-4%, -2%], (-2%, 0%], [0%, 2%), [2%, 4%), [4%, 6%) and >6%. We then tabulated these results for inspection. To aid interpretation of these findings, we used a target of 80% of students’ scores becoming more accurate to define whether a useful threshold could be established.
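The adjustment-size categorisation amounts to simple binning; as a sketch (Python illustration; the bin edges follow the text, the example adjustments are random placeholders):

```python
import random

random.seed(4)
SCALE_RANGE = 27 - 6
# Score adjustments expressed as a percentage of the assessment scale
adjustment_pct = [100 * abs(random.gauss(0, 0.5)) / SCALE_RANGE for _ in range(60)]

def adjustment_category(pct):
    """Bin an adjustment into [0-1%), [1-2%), ..., [8-9%), >9% (index 0..9)."""
    return min(int(pct), 9)

categories = [adjustment_category(p) for p in adjustment_pct]
```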