Remote Proctored Exam for Proficiency Testing: A Cross-Sectional Study

Background: The COVID-19 pandemic has profoundly affected assessment practice in higher education, including the complex planning of supervision. To organise a remote proficiency test for admission to the Advanced Master of General Practice (AMGP) safely and reliably, we developed a supervisor app tracking and tracing candidates' behaviour. Methods: A cross-sectional design was adopted with candidates applying for admission to the AMGP. The supervisor app registered events on three levels: recording actions, analysing behaviour, and live supervision. Each suspicious event was given a score. The outcome measures were the number of suspicious events and the exam outcome compared to the past year. To gain more insight into candidates' perceptions of the app, a post-test questionnaire was administered. An exploratory factor analysis was performed to explore the quantitative data, while qualitative data were thematically analysed. Results: In total, 472 (79%) candidates used the app in an off-campus setting and 121 (20%) were on campus with live supervision. Test results of both groups were comparable. The app detected 22 candidates with a suspicious level >1, mainly due to background noise. All events occurred without fraudulent intent. Out of 472 candidates, 304 filled in the post-test questionnaire. Two factors were extracted from the analysis and identified as candidates' appreciation of the app and emotional distress because of the app. Four themes were identified in the thematic analysis, providing more insight into candidates' emotional well-being. Conclusions: A supervisor app registering and recording behaviour to prevent fraud during off-campus exams is efficient without influencing the exam outcome. Although candidates' perceptions were mixed, increased anxiety was due to the lack of clear guidelines about the app.
Future research should compare, in a controlled design, the cost-benefit balance between the supervisor app and candidates' awareness of being monitored combined with a safe exam browsing plug-in.


Introduction
The COVID-19 pandemic has profoundly affected teaching and assessment programs at all levels of education. The Flemish (Belgian) universities were challenged to switch fluently from analogue to digital mode, which went with ups and downs. Moreover, the COVID-19 countermeasures required sufficient physical distance before and during exams, necessitating smaller student groups per exam session. This situation demanded complex planning of manpower, location and, in the case of digital exams, adequate equipment, both hardware and software.
The challenge of reorganization is even more impressive in an interuniversity collaboration for education. The Advanced Master's in General Practice (AMGP) in Flanders, Belgium, is formally organized and offered by four Flemish universities (KU Leuven, University of Antwerp, University of Ghent, VUB). This collaboration comprises a common administration, curriculum, examinations, and residencies, but separate registration of residents at the university of their choice. Currently, we have over 900 residents in AMGP training and education. Exam planning is a complex logistical and administrative process. Therefore, more than a decade ago, we built an intelligent, comprehensive, and interactive assessment platform.
Medical candidates are in favour of computerized exams, and these exams may enhance the learning experience and effect.(1) Our platform offers the interface for summative and formative knowledge testing (in six question formats), Objective Structured Clinical Examination (OSCE) performance, and proficiency testing.
Proficiency testing takes place outside the regular exam regulations and is organised for the four universities together. The test comprises three stages, starting with an administrative stage, followed by the actual exam, and finalized by a jury exam for candidates who failed the exam in stage two. The actual exam is an online machine-assisted test that runs on the digital assessment platform and consists of three components: knowledge testing, critical reasoning testing, and situational judgment testing. The whole procedure has run since 2016 and has proved its reliability, validity, acceptability, and feasibility in this format.(2,3) The necessity of respecting COVID-19 countermeasures while adhering to the original exam format forced us to adopt a creative solution. Therefore, we organised a remote proctored exam in addition to the on-campus exam. To prevent fraud, we developed a virtual supervisor app tracking and tracing candidates' behaviour during the exam. The technology we built and applied goes beyond traditional proctored exam systems, where the focus lies on recording sound and image.(4) In this article, we report the outcome of the anti-fraud measures and compare the scores between off- and on-campus exams. This study also aims at providing insights into candidates' perceptions of the supervisor app.

Methods
In collaboration with the developers of the assessment platform and in discussion with the coordinators of the AMGP, we determined the criteria and conditions to design the supervisor app based on our online assessment platform. This implies that the app is not immediately available to third parties because of potential compatibility issues. Specifically, we integrated Application Programming Interface software (Vonage APIs) within our assessment platform. The Vonage APIs enabled us to interact with and record the candidates live. Along with this software, we implemented several metrics to detect events implying suspicious behaviour (Figure 1). Suspicious events were defined as: switching to another browser, returning to the page, closing the page, disconnection from the internet, and sound/noise.
During the exam, the system recorded three channels: the computer screen, the camera, and the microphone. The supervisor app operated on three levels: recording of actions, analysis of behaviour, and live supervision. These recordings were immediately encrypted and saved on a secured server.
Additionally, the app used an algorithm for pattern recognition in responses, clicking behaviour, and timestamp analysis. It analysed both individual behaviour and correlations across candidates.
Each event was given a score. Every suspicious event increased the result by 0.1. A candidate scoring 0.5 or higher was considered suspicious. The tracking was stored in the user's browser and sent to the server after every 20 non-suspicious events or when a suspicious event occurred. When the suspicious rating was above 0.5, the exam submission was flagged and a report on the suspicious behaviour was downloaded for assessment.
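As a minimal sketch, the scoring and buffering rules described above could be implemented as follows. This is illustrative only: the class and event names are our own, and the production app ran client-side in the candidate's browser rather than as a standalone script.

```python
# Illustrative sketch of the suspicious-event scoring described above.
# Event names and class structure are assumptions; the production app
# ran in the candidate's browser and synced with the assessment server.

SUSPICIOUS_EVENTS = {"switch_browser", "return_to_page", "close_page",
                     "disconnect", "noise"}
EVENT_WEIGHT = 0.1    # each suspicious event adds 0.1 to the score
FLAG_THRESHOLD = 0.5  # initial flag level (raised to 1 after the pilot,
                      # because of the sensitivity of sound detection)
SYNC_BATCH = 20       # upload buffered events after 20 non-suspicious ones

class SupervisorTracker:
    def __init__(self):
        self.score = 0.0
        self.buffer = []              # events awaiting upload to the server
        self.non_suspicious_seen = 0

    def record(self, event):
        """Register one event; return True when a server sync is due."""
        self.buffer.append(event)
        if event in SUSPICIOUS_EVENTS:
            self.score += EVENT_WEIGHT
            return True               # suspicious events sync immediately
        self.non_suspicious_seen += 1
        if self.non_suspicious_seen >= SYNC_BATCH:
            self.non_suspicious_seen = 0
            return True
        return False

    def is_flagged(self):
        """Submissions at or above the threshold are flagged for review."""
        return self.score >= FLAG_THRESHOLD
```

Under these assumptions, five suspicious events push a candidate to the 0.5 flag level, while routine activity is only uploaded in batches of twenty.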
Remote supervision allowed for optional human oversight during the examination. The human monitor could immediately join the live feed of each candidate to obtain more information or to send a warning in a private message. In case of a crash of the supervisor app, the affected candidates were switched to Safe Exam Browser (SEB), which was integrated in our assessment platform. SEB allowed only the exam interface to remain accessible on the machine.
In addition to the technical solution, we expected the app to have an impact on candidates' behaviour regarding fraud prevention. Candidates were comprehensively briefed in advance, via an animation video, on how to install the app, how to test it, and the specific features of the supervising technology.
A voluntary panel tested the supervisor app in two sessions. During the first session, the participants were instructed to behave in a suspicious manner: talking, making noise, turning away from the screen, using the internet, etc. Afterwards, we made the following adjustments to the app and the procedure: we isolated the sound-suspicious level from the other suspicious events because of the sensitivity of sound detection. We also increased the overall suspicious level from 0.5 to 1 to take the sound sensitivity into account.
The intervention took place during the regular exam period of the proficiency test for admission to the AMGP. All participants were candidates applying for admission. To take the exam, candidates were free to register for either off- or on-campus participation. On campus, a human supervisor was present, and candidates used the campus equipment. Candidates who chose the off-campus option took the exam simultaneously. For remote monitoring, we engaged an experienced supervising team of six staff members. The off-campus supervisors were able to send a notification or warning to candidates who were behaving suspiciously, and they intervened in case of technical issues. The developers of the app were also fully available during the exam to provide technical assistance, if necessary.
The actual exam and the associated procedures were set up as in previous years: all candidates completed the same exams, and candidates who failed the machine-assisted test (the actual exam) were invited to a jury exam one week later. Candidates for whom suspicious behaviour was flagged during the exam or in post-exam analyses were also invited to the jury exam.
To gain insight into candidates' perceptions of the supervisor app, we administered a post-test survey in the form of an online questionnaire. The questionnaire was sent only to the candidates participating in the off-campus exam. Filling in the questionnaire was anonymous and voluntary.
Candidates were asked whether they were taking the proficiency test for the first time to avoid any effects due to retakes (e.g. test anxiety). Respondents had to specify their level of agreement or disagreement on a 6-point Likert scale for six items regarding the supervisor app. In addition, candidates could explain how they perceived the influence of the app on their exam experience and exam outcome. Table 1 displays a list of the survey questions. All methods were carried out in accordance with relevant guidelines and regulations. Ethical approval was granted by the Social and Societal Ethics Committee of the KU Leuven with the following approval number: G-2020-2262-R2(MAR).
Table 1
Structure of the questionnaire measuring candidates' perceptions of the supervisor app

Closed questions (6-point Likert scale)
Q1: I was well informed about the app.
Q2: I was more nervous than usual before the exam because of the supervisor app.
Q3: I was more nervous than usual during the exam because of the supervisor app.
Q4: The supervisor app had an impact on my results.
Q5: I found the supervisor app reassuring.
Q6: I would use the supervisor app again in the future for other exams.
Open questions
Q7: How did the supervisor app influence you when taking the exam?
Q8: How do you think the supervisor app influenced your result?

Analysis
Outcome measures were defined as the number of suspicious events per student and the exam outcome compared to the results of the past year. To analyse the post-test questionnaire, we performed an exploratory factor analysis to understand patterns related to candidates' perceptions while using the supervisor app.(5) We used an oblique rotation, since we expected some correlation among factors.(6) The reliability of the scale was calculated based on Cronbach's alpha.(7) We analysed quantitative data with SPSS 27 (IBM SPSS Statistics 27). Qualitative data were thematically analysed by two researchers separately (VA and BS).(8) Discrepancies in coding were discussed until consensus was reached. Data from open-ended questions were analysed in the software program QSR International's NVivo (Release 1.0).
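For readers less familiar with these steps, the factor-retention rule used below (Kaiser's eigenvalue > 1 criterion) and Cronbach's alpha can be sketched in a few lines. This is an illustrative computation with our own function names on constructed data, not the study's SPSS analysis.

```python
import numpy as np

def kaiser_retained_factors(data):
    """Number of factors retained under Kaiser's criterion:
    eigenvalues of the item correlation matrix greater than 1."""
    corr = np.corrcoef(data, rowvar=False)   # items in columns
    eigenvalues = np.linalg.eigvalsh(corr)
    return int(np.sum(eigenvalues > 1.0))

def cronbach_alpha(data):
    """Cronbach's alpha for a respondents-by-items matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1)     # per-item sample variance
    total_var = data.sum(axis=1).var(ddof=1) # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

For example, three perfectly correlated items collapse onto a single retained factor with alpha of 1, whereas items that covary only partially yield a lower alpha, as in the 0.72 reported below.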

Results
A total of 593 candidates subscribed to the exam at the four Flemish universities. Four hundred seventy-two (79%) candidates used the supervisor app for an off-campus exam and 121 (20%) were present on campus (Table 2). Most candidates registered at KU Leuven (227) and the University of Ghent (203) chose to take the exam off-campus. The results of both (off- and on-campus) groups were comparable. Overall, we registered and solved 15 technical issues in the off-campus context (Table 3). Eight of these issues concerned software problems (in particular, loading a reading text in a new tab). Two candidates experienced a negative impact on their exam performance due to technical issues. The developer team switched one candidate to SEB mode to complete the exam. Based upon the post-exam analyses and after deliberation, the exam coordinator exempted the other candidate from the jury exam. In total, the app detected 22 (4%) candidates with a suspicious level >1. All cases concerned one or more noise events (background noise). All other non-critical events consisted of leaving the webpage, closing a page, or typing text. Live monitoring and a post-exam review of the records revealed that all these events occurred by chance and without fraudulent intent.
The monitoring supervisors flagged two candidates who were typing more than expected (in a multiple-choice exam). After reviewing the records, it turned out these candidates were using 'control find' to search for words in the reading text. During the exam, the supervisors intervened eight times for a technical issue, warned two candidates to stop talking to themselves, and sent a group message asking candidates to turn down background noise.
Out of the 472 candidates who used the supervisor app, 304 filled in the post-test questionnaire: 213 women and 91 men. All of them were taking the proficiency test for the first time. Almost seventy-two percent of the candidates (219) had chosen GP training as their first-choice postgraduate medical training (Figure 2). Ten percent (30) wanted to follow medical training in Internal Medicine, but they were participating in the proficiency test to be able to register for the AMGP as a second choice.
An exploratory factor analysis was initially conducted on the six items of the questionnaire with oblique rotation. However, one item, "I was well informed about the supervisor app", had to be omitted because of a communality lower than .40.(6) The other five items (communality > 0.40) were included in a secondary analysis. This analysis yielded a Kaiser-Meyer-Olkin (KMO) measure of .63, verifying the sampling adequacy (Table 4). After analysing the data to obtain eigenvalues for each factor, two factors had eigenvalues over Kaiser's criterion of 1, explaining 82.46% of the variance. The scree plot also justified retaining two factors (Figure 3). Table 5 shows the factor loadings after rotation. The items that cluster on the same factors suggest that factor 1 represents candidates' appreciation of the supervisor app, while factor 2 represents emotional distress because of the supervisor app. Our sample size was appropriate for a factor analysis, since it satisfies the 10:1 rule of thumb for the subject-to-item ratio.(6) The reliability of the questionnaire was calculated based on Cronbach's alpha as 0.72 (Table 6). In the qualitative analysis, four different themes were discerned. All themes were related to candidates' emotional well-being. The first two themes referred to stress and anxiety before and during the exam. Some candidates felt more anxious because they feared technical problems or running out of time. The third theme was anxiety and stress because of the supervisor app. Candidates mentioned that they felt stress because the app might detect something as fraud, without any intention of fraud.
Candidates also admitted that they felt awkward, observed, and distracted because of the app.
Instead of just thinking about a question and the possible answers, I had to constantly remind myself about my behaviour (not to look upwards or left and right).The feeling of constantly being watched was not helpful (Anonymous candidate).
The fourth theme was connected to positive emotions about the supervisor app. Candidates thought that the supervisor app was reassuring and that, in case of need, someone could intervene and help.
The app worked pretty well for me, it was rather reassuring that it would be taken into account, if technical problems arose (Anonymous candidate).
Finally, candidates identified some technical issues before and while taking the exam. Before the exam, the main problem was a slow internet connection. During the exam, candidates experienced problems when opening multiple tabs at the same time.

Discussion
This pilot study demonstrated that a supervisor app with recording and registration of behaviour could detect all suspicious events without an impact on exam performance. Suspicious events were also manually screened after the exam.
The imbalance among the universities between registrations for off- or on-campus participation is striking but explained by the availability of infrastructure (classrooms and IT infrastructure). We see indeed that candidates from larger universities were more likely to register for an off-campus exam. The psychometrics of the exam were comparable between the off- and on-campus groups. Here, we can conclude that there was probably no bias induced by the voluntary option for off- or on-campus participation. Other authors compared scores from proctored exams to scores from the traditional exams of past years and also found no difference.(9) The number of technical issues was very low and mainly related to software issues. The complexity of the app (registration of behaviour, recording, etc.) in combination with a multicomponent exam with different question types required high-performance IT equipment. Only in one case did a technical issue lead to interruption of the exam of the affected candidate. All other issues were solved within an acceptable time span by the supervisors and the development team. These issues disclosed the vulnerability and low quality of candidates' personal equipment. Here, the educational institution should guarantee a fair and safe exam environment for every candidate by offering high-quality infrastructure and logistics.
During the entire exam, a team of eight persons was constantly monitoring and in contact with the candidates. Although we did not perform a cost-benefit analysis, we assume that the whole procedure is not cost saving, since the off-campus format was run in duplicate to the on-campus exam.
On average, candidates provoked three events, of which 'noise/sound' was the most prevalent and critical one. Sound is indeed difficult to avoid or to control (we noticed children crying, birds singing, candidates talking out loud, slamming doors, street works, etc.). The development team immediately increased the threshold sound level during the exam (in the first minutes).
A very small number of candidates was flagged as suspicious by the app, but live monitoring during the exam lifted them above suspicion. The supervisors reviewed the recordings of two candidates who were flagged by both the system and the live monitoring during the exam. The records did not reveal any fraudulent behaviour. Prior to the exam, we thoroughly instructed candidates on the registration of suspicious events. This single intervention might have been a decisive argument to deter candidates from fraud in this high-stakes exam. Probably, a simple recording of sound and image combined with safe web browsing blocking all other webpages might have been just as efficient. A less comprehensive registration of events might reduce the number of technical issues, so that the supervisors could focus on live monitoring.
During the exams, supervisors limited the number of interventions and warnings to avoid distracting the candidates. Technical issues were solved, and supervisors sent a warning to only two candidates who were talking to themselves. After the exam, we only received reports of technical issues but no complaints about the interventions or interruptions by the supervisors.
The findings from the exploratory factor analysis indicate that the supervisor app was met with mixed reactions. Considering that candidates took the exam during the COVID-19 pandemic, the supervisor app seems to have provided reassurance that communication with the university was established in case of technical difficulties. However, the findings indicate that candidates also experienced emotional distress because of the app. Factor 2, identified as emotional distress, had high loadings on items that referred to increased stress before and during the exam caused by the supervisor app.
The thematic analysis of the qualitative data provided a more in-depth insight into candidates' perceptions, especially on well-being. The third theme relates to the negative impact of the supervisor app on candidates' emotional well-being. A potential explanation for this is that candidates were not sure about what the supervisor app could detect as fraud and suspicious behaviour. Hence, clearer instructions could be beneficial for reducing stress and anxiety while using the app. Nevertheless, the fourth theme comprises candidates' positive emotions and appreciation of the supervisor app. The direct connection with the university through the app seemed to reassure and comfort candidates that assistance was available in case of technical issues. Lastly, the first and second themes relate more to exam anxiety, and specifically to the format of the exam rather than to using the supervisor app. Therefore, they fall outside the scope of this paper and are not extensively discussed. Regarding technical issues, candidates' comments showed that software issues were the most frequent, confirming what supervisors observed during the exam.
Finally, some recommendations were proposed for improving the supervisor app. The most frequent suggestion was that clearer guidelines are necessary regarding what can be perceived as fraud. Another recommendation was to allow candidates to record a video of their surroundings before the exam. These recommendations could be taken into consideration for future improvements of the supervisor app.

Strengths and limitations
The major strength of this intervention is the feasibility of using the app for a large group of candidates in a high-stakes exam. Second, we compared the exam outcome of the intervention with a control group composed of candidates who preferred to take the exam on campus. We cannot completely rule out a bias here, but since exam outcomes were comparable for both groups, we consider the risk as low. From other authors, we also know that candidates' preference for computerized or paper-based exams does not influence the exam outcome.(10) A limitation of this study is the low number of questions included in the questionnaire. Although the questionnaire could be considered reliable based on the Cronbach's alpha calculation, repeated measurements are necessary to confirm the reliability of the scale.

Conclusion
A sophisticated supervisor app registering behaviour and recording sound and image to prevent fraud during exams has proved to be efficient without affecting the exam outcome. Nevertheless, live monitoring and standing by to solve technical issues induced a high human workload. The supervisor app is also user-friendly, provided that what can be detected as fraud is more thoroughly explained. In future research, a controlled design should compare the cost-benefit balance between the complex intervention of the supervisor app and the combination of candidates' awareness of being monitored with a safe exam browsing plug-in.

Figures

Figure 1

Table 2
Participation and test result off versus on campus

Table 3
Comparison of exam procedure and outcome off- versus on-campus