Data were generated in 35 separate simulations, resulting in 25,200,000 “performance” scores (i.e. scores for 2,100,000 students on an average of 12 stations).

Research Question 1: How is the accuracy of score estimates produced by VESCA influenced by:

a. The number of linking videos per examiner (0, 2, 4, 6, or 8 linking videos)

b. The proportion of examiners who participate in scoring videos (50%, 65%, 80%, 100%)

c. The combination of these two effects

These questions were addressed by study 1. The accuracy of adjusted scores across all parameters modelled in this study was low. Notably, this study assumed that there were no baseline differences between examiners in different sites. Error ratio (ErR) values ranged from a worst case of 1.22 (i.e. adjusted scores contained 22% *more* error than observed scores) for 2 linking videos with 50% examiner participation, to a best case of 0.94 (i.e. score adjustment removed 6% of the error in the observed scores) for 8 linking videos with 100% examiner participation. The proportion of students whose scores became more accurate (pAcc) as a result of adjustment corresponded closely, ranging from pAcc = 0.42 (42% of students’ scores became more accurate; 58% of students’ scores became *less* accurate) for 2 linking videos / 50% examiner participation, to pAcc = 0.53 (53% of students’ scores became more accurate) for 8 linking videos / 100% examiner participation. A detailed breakdown of all permutations of these parameters can be seen in Table 1.

Table 1

Influence of number of linking videos per examiner and proportion of participating examiners on adjusted score accuracy

| Number of linking videos per examiner | Proportion of participating examiners (%) | Mean error in observed scores (SD) | Mean error in adjusted scores (SD) | Error ratio | Proportion of students whose scores became more accurate through adjustment |
| --- | --- | --- | --- | --- | --- |
| 0 | 50 | 0.603 (0.46) | 0.623 (0.47) | 1.03 | 0.48 |
| 0 | 65 | 0.605 (0.46) | 0.619 (0.47) | 1.02 | 0.48 |
| 0 | 80 | 0.605 (0.46) | 0.618 (0.47) | 1.02 | 0.49 |
| 0 | 100 | 0.600 (0.45) | 0.618 (0.47) | 1.03 | 0.48 |
| 2 | 50 | 0.597 (0.45) | 0.728 (0.56) | 1.22 | 0.42 |
| 2 | 65 | 0.587 (0.45) | 0.676 (0.52) | 1.15 | 0.44 |
| 2 | 80 | 0.588 (0.45) | 0.643 (0.50) | 1.09 | 0.46 |
| 2 | 100 | 0.589 (0.45) | 0.612 (0.47) | 1.04 | 0.48 |
| 4 | 50 | 0.584 (0.45) | 0.661 (0.52) | 1.13 | 0.44 |
| 4 | 65 | 0.580 (0.45) | 0.618 (0.48) | 1.07 | 0.47 |
| 4 | 80 | 0.579 (0.45) | 0.592 (0.46) | 1.02 | 0.49 |
| 4 | 100 | 0.579 (0.45) | 0.565 (0.44) | 0.98 | 0.52 |
| 6 | 50 | 0.573 (0.44) | 0.625 (0.49) | 1.09 | 0.46 |
| 6 | 65 | 0.569 (0.44) | 0.586 (0.46) | 1.03 | 0.48 |
| 6 | 80 | 0.570 (0.45) | 0.563 (0.44) | 0.99 | 0.50 |
| 6 | 100 | 0.563 (0.45) | 0.538 (0.43) | 0.96 | 0.52 |
| 8 | 50 | 0.567 (0.44) | 0.614 (0.48) | 1.08 | 0.46 |
| 8 | 65 | 0.563 (0.45) | 0.569 (0.45) | 1.01 | 0.49 |
| 8 | 80 | 0.557 (0.44) | 0.544 (0.43) | 0.98 | 0.51 |
| 8 | 100 | 0.556 (0.45) | 0.524 (0.42) | 0.94 | 0.53 |
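The two summary measures reported in Table 1 can be illustrated with a minimal sketch. It assumes that "error" is the absolute difference between a score and the student's true (simulated) score, that ErR is the mean adjusted error divided by the mean observed error (which agrees with the table columns), and that pAcc counts a student as more accurate only when the adjusted score is strictly closer to the true score. The function name and the toy values are hypothetical and are not taken from the study.

```python
from statistics import fmean


def accuracy_metrics(true, observed, adjusted):
    """Return (mean observed error, mean adjusted error, ErR, pAcc).

    Assumption: error is the absolute distance from the true score.
    """
    err_obs = [abs(o - t) for o, t in zip(observed, true)]
    err_adj = [abs(a - t) for a, t in zip(adjusted, true)]
    mean_obs = fmean(err_obs)
    mean_adj = fmean(err_adj)
    # ErR < 1: adjustment removed error; ErR > 1: adjustment added error.
    error_ratio = mean_adj / mean_obs
    # pAcc: share of students whose adjusted score is strictly closer to truth.
    p_acc = fmean(1.0 if a < o else 0.0 for a, o in zip(err_adj, err_obs))
    return mean_obs, mean_adj, error_ratio, p_acc


# Toy data for four hypothetical students (not study data).
true_scores = [10.0, 12.0, 14.0, 16.0]
observed_scores = [11.0, 12.5, 13.0, 16.5]
adjusted_scores = [10.5, 13.0, 13.5, 16.0]

mean_obs, mean_adj, error_ratio, p_acc = accuracy_metrics(
    true_scores, observed_scores, adjusted_scores
)
print(f"ErR = {error_ratio:.2f}, pAcc = {p_acc:.2f}")
```

In this toy case three of the four students move closer to their true score, while the second student moves further away, so a mean improvement can coexist with individual scores becoming less accurate.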

Accuracy of the adjusted scores was independently influenced by both the number of linking videos and the proportion of participating examiners. Changing the number of linking videos per examiner (whilst averaging across all of the included categories of examiner participation, i.e. keeping this constant) gave error ratios of 1.03 for 0 videos, 1.13 for 2 videos, 1.05 for 4, 1.02 for 6, and 1.00 for 8, with corresponding proportions of students seeing increased score accuracy (pAcc) of 0.48, 0.45, 0.48, 0.49, and 0.50 respectively. Notably, therefore, the accuracy of adjusted scores was reduced (compared to no linking) by having 2 linking videos per examiner, but then increased slowly and progressively with larger numbers of linking videos.

Changing the proportion of participating examiners (whilst averaging across all of the included categories of linking videos, thereby keeping those constant) showed a more linear pattern, giving error ratios of 1.11 for 50% of examiners, 1.06 for 65%, 1.02 for 80%, and 0.99 for 100% of examiners. Corresponding proportions of students whose scores became more accurate (pAcc) were 0.45 for 50% of examiners, 0.47 for 65%, 0.49 for 80%, and 0.51 for 100% respectively.
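The marginal error ratios quoted in the two preceding paragraphs can be approximated by averaging the relevant cells of Table 1. The sketch below does this from the rounded values printed in the table, so its results may differ from the reported figures in the final decimal place (the reported figures were presumably computed from unrounded data). The dictionary layout and the helper name `marginal_means` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import fmean

# (linking videos, % of examiners participating) -> error ratio, from Table 1.
ERROR_RATIO = {
    (0, 50): 1.03, (0, 65): 1.02, (0, 80): 1.02, (0, 100): 1.03,
    (2, 50): 1.22, (2, 65): 1.15, (2, 80): 1.09, (2, 100): 1.04,
    (4, 50): 1.13, (4, 65): 1.07, (4, 80): 1.02, (4, 100): 0.98,
    (6, 50): 1.09, (6, 65): 1.03, (6, 80): 0.99, (6, 100): 0.96,
    (8, 50): 1.08, (8, 65): 1.01, (8, 80): 0.98, (8, 100): 0.94,
}


def marginal_means(table, axis):
    """Average the table values over every level of the other factor.

    axis=0 groups by number of linking videos, axis=1 by participation.
    """
    groups = defaultdict(list)
    for key, value in table.items():
        groups[key[axis]].append(value)
    return {level: fmean(values) for level, values in sorted(groups.items())}


by_videos = marginal_means(ERROR_RATIO, axis=0)
by_participation = marginal_means(ERROR_RATIO, axis=1)

for level, value in by_videos.items():
    print(f"{level} linking videos: mean ErR = {value:.3f}")
for level, value in by_participation.items():
    print(f"{level}% participation: mean ErR = {value:.3f}")
```

Averaging in this way treats every level of the other factor as equally weighted, which matches the equal-sized design described in the text.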

Research Question 2: How is the accuracy of score estimates produced by VESCA influenced by:

a. Differing extents of baseline differences in examiner stringency between different sites (0%, 5%, 10%, 20%)

b. The number of stations in the OSCE (6, 12, or 18 stations)

c. The combination of these two effects

These questions were addressed by study 2. The accuracy of adjusted scores varied substantially in this study. Error ratio (ErR) values ranged from a worst case of 1.42 (i.e. adjusted scores contained 42% *more* error than observed scores) for 0% baseline difference in examiner stringency with 18 OSCE stations, to a best case of 0.29 (i.e. score adjustment removed 71% of the error in the observed scores) for 20% difference in baseline examiner stringency with 12 OSCE stations. The proportion of students whose scores became more accurate (pAcc) as a result of adjustment showed a corresponding pattern, ranging from pAcc = 0.37 (only 37% of students’ scores became more accurate) for 0% baseline difference and 18 OSCE stations, to pAcc = 0.93 (93% of students’ scores became more accurate) for 20% baseline difference and 18 OSCE stations, with a very similar finding (pAcc = 0.92) for 20% baseline difference and 12 OSCE stations. A detailed breakdown of all permutations of these parameters can be seen in Table 2.

Table 2

Influence of number of stations in the OSCE and degree of baseline difference in examiner stringency on adjusted score accuracy

| Degree of baseline difference between schools (% of scale) | Number of stations in OSCE | Mean error in observed scores (SD) | Mean error in adjusted scores (SD) | Error ratio | Proportion of students whose scores became more accurate through adjustment |
| --- | --- | --- | --- | --- | --- |
| 0 | 6 | 0.814 (0.63) | 0.829 (0.64) | 1.02 | 0.49 |
| 0 | 12 | 0.579 (0.45) | 0.592 (0.46) | 1.02 | 0.49 |
| 0 | 18 | 0.475 (0.37) | 0.674 (0.52) | 1.42 | 0.37 |
| 5 | 6 | 0.907 (0.70) | 0.828 (0.64) | 0.91 | 0.54 |
| 5 | 12 | 0.712 (0.54) | 0.592 (0.46) | 0.83 | 0.59 |
| 5 | 18 | 0.635 (0.47) | 0.673 (0.52) | 1.06 | 0.49 |
| 10 | 6 | 1.172 (0.85) | 0.825 (0.64) | 0.70 | 0.67 |
| 10 | 12 | 1.056 (0.68) | 0.589 (0.46) | 0.56 | 0.75 |
| 10 | 18 | 1.026 (0.59) | 0.670 (0.52) | 0.65 | 0.70 |
| 20 | 6 | 2.012 (1.07) | 0.820 (0.64) | 0.41 | 0.85 |
| 20 | 12 | 1.996 (0.83) | 0.586 (0.45) | 0.29 | 0.92 |
| 20 | 18 | 1.998 (0.72) | 0.667 (0.52) | 0.33 | 0.93 |

Table 3

Influence of reduction in examiner random error on adjusted score accuracy

| Reduction in error | Mean error in observed scores (SD) | Mean error in adjusted scores (SD) | Error ratio | Proportion of students whose scores became more accurate through adjustment |
| --- | --- | --- | --- | --- |
| Error / 2 | 0.375 (0.29) | 0.323 (0.25) | 0.86 | 0.56 |
| Error / 4 | 0.302 (0.24) | 0.207 (0.16) | 0.69 | 0.62 |
| Error / 8 | 0.280 (0.22) | 0.166 (0.13) | 0.59 | 0.66 |

Accuracy of the adjusted scores showed different relationships with the baseline difference in examiner stringency and the number of OSCE stations. Changing the baseline difference in examiner stringency (whilst averaging across the 3 different numbers of OSCE stations, i.e. keeping this parameter constant) gave error ratios of 1.15 for 0% baseline difference, 0.93 for 5%, 0.64 for 10%, and 0.34 for 20%, with corresponding proportions of students seeing increased score accuracy (pAcc) of 0.45 for 0% baseline difference, 0.54 for 5%, 0.71 for 10%, and 0.90 for 20% respectively. Consequently, at 0% baseline difference in examiner stringency, score adjustment made scores less accurate, whereas at 20% baseline difference in examiner stringency, 66% of error was removed and 90% of students’ scores became more accurate.
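The step from an error ratio to a "percentage of error removed" can be written out explicitly. The notation below is an assumption introduced for illustration: $\bar{e}_{\mathrm{obs}}$ and $\bar{e}_{\mathrm{adj}}$ denote the mean error in observed and adjusted scores (the two mean-error columns of Table 2), and ErR is taken to be their ratio, which is consistent with the tabulated values.

```latex
\mathrm{ErR} = \frac{\bar{e}_{\mathrm{adj}}}{\bar{e}_{\mathrm{obs}}},
\qquad
\text{proportion of error removed} = 1 - \mathrm{ErR}

% 20% baseline difference (averaged over station numbers):
1 - 0.34 = 0.66 \quad \text{(66\% of error removed)}

% 0% baseline difference (averaged over station numbers):
1 - 1.15 = -0.15 \quad \text{(adjusted scores contain 15\% more error)}
```

A negative value therefore indicates that adjustment added error rather than removing it.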

Changing the number of stations in the OSCE (whilst averaging across all levels of baseline difference in examiner stringency, thereby keeping those constant) gave error ratios of 0.76 for 6 OSCE stations, 0.68 for 12 stations, and 0.87 for 18 stations. Corresponding proportions of students whose scores became more accurate (pAcc) were 0.64 for 6 stations, 0.69 for 12 stations, and 0.62 for 18 stations. Consequently, these different numbers of OSCE stations produced a U-shaped pattern in error ratio, with adjustments made from an OSCE with 12 stations showing greater accuracy than the score adjustments made from either a 6 or 18 station OSCE. Notably, however, the extent of error in observed scores for 18 stations (i.e. the amount of error contained in the unadjusted scores produced by examiners) was lower than for 12 stations (3rd column of Table 2), so this observation may arise from an interaction of the effectiveness of score adjustment with the amount of error originally present.

Research Question 3: How is the accuracy of score estimates produced by VESCA influenced by reduction in the degree of random variability in examiners’ scoring (random error divided by 2, by 4, and by 8)?

This question was addressed by study 3. As in study 1, it was performed with an assumption of 0% baseline difference between sites, and used standard parameters (12 stations, 4 linking videos and 80% examiner participation). Accuracy of adjusted scores increased progressively as the amount of random error was reduced. Error ratios (ErR) were 1.02 for the usual extent of random examiner error, 0.86 for half the usual random examiner error (err/2), 0.69 for one quarter (err/4), and 0.59 for one eighth (err/8). Corresponding proportions of students whose scores became more accurate were: usual examiner error = 0.49, err/2 = 0.56, err/4 = 0.62, err/8 = 0.66. Consequently, whilst reducing the degree of random error we modelled within examiners’ scoring increased accuracy, even a very substantial reduction in examiners’ random error (to one eighth of its usual value) produced only a moderate increase in accuracy (41% reduction in error; 66% of students’ scores became more accurate).

Research Question 4: How does the proportion of candidates whose scores become more accurate vary for different sizes of score adjustment for each of the parameters investigated within RQs 1–2?

This study produced 32 tables of results. These findings, along with summary text and further details of how they were calculated, are presented in appendix 1. In summary, when there was no baseline difference between sites (i.e. study 1), the findings did not demonstrate a threshold for any of the studied parameters beyond which the target of pAcc > 0.8 was achieved. Notably, the vast majority of adjustments made in study 1 were comparatively small. When larger baseline differences existed (10–20% baseline difference, see study 2), adjustments were typically larger, with a majority exceeding 9% of the assessment scale for 20% baseline differences. Thresholds in the region of 3–4% of the assessment scale could be set for scenarios where a baseline difference of 20% existed, to achieve a target of pAcc > 0.8. Notably, therefore, adjustment thresholds depended on the degree of baseline difference rather than on an absolute value of the adjustment threshold.
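One way to picture the threshold analysis is to restrict pAcc to students whose score adjustment is at least a given size. The sketch below is an illustration only: the actual calculation is described in appendix 1 and may differ. It assumes scores are expressed as a percentage of the assessment scale, that adjustment size is the absolute difference between adjusted and observed scores, and that a score is "more accurate" when the adjusted score is strictly closer to the true score. The function name and toy data are hypothetical.

```python
def p_acc_above_threshold(true, observed, adjusted, threshold):
    """pAcc among students whose adjustment size is at least `threshold`.

    Returns None when no student's adjustment reaches the threshold.
    """
    improved = []
    for t, o, a in zip(true, observed, adjusted):
        if abs(a - o) >= threshold:  # keep only sufficiently large adjustments
            improved.append(abs(a - t) < abs(o - t))
    if not improved:
        return None
    return sum(improved) / len(improved)


# Toy data for five hypothetical students, in % of the assessment scale.
true_pct = [50.0, 60.0, 70.0, 80.0, 55.0]
observed_pct = [52.0, 66.0, 69.0, 85.0, 55.5]
adjusted_pct = [51.0, 61.0, 70.5, 81.0, 56.5]

p_all = p_acc_above_threshold(true_pct, observed_pct, adjusted_pct, threshold=0.0)
p_large = p_acc_above_threshold(true_pct, observed_pct, adjusted_pct, threshold=3.0)
p_none = p_acc_above_threshold(true_pct, observed_pct, adjusted_pct, threshold=10.0)
print(p_all, p_large, p_none)
```

In the toy data the only adjustment that reduces accuracy is a small one, so restricting attention to larger adjustments raises pAcc, which mirrors the qualitative pattern described for scenarios with large baseline differences.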