Examining The Objectivity of Scoring and Evaluation of Simulation Training

Objective: The increase in minimally invasive surgery and endovascular procedures has led to a decrease in surgical experience. This may adversely affect both surgical training and postoperative management. Since it poses no risk to patients, simulation training may be a solution to these problems. The social distancing required by COVID-19 has negatively impacted the simulation educational environment. To date, there is only limited research examining whether skills are evaluated objectively and equally in simulation training, especially in microsurgery. The purpose of this study was to analyze the objectivity and equality of simulation evaluation results conducted in a contest format. Methods: A nationwide recruitment process was conducted to select study participants. Participants were recruited from a pool of qualified physicians with less than 10 years' experience. In this study, the simulation procedure consisted of incising a 1 mm artificial blood vessel and suturing it with a 10-0 thread under a surgical microscope. To evaluate the simulation procedures, a scoring chart was developed with a maximum of 5 points for each of eight evaluation criteria. Five neurosurgical supervisors from different hospitals were asked to use this scoring chart to grade the simulation procedures. Results: Initially, we planned to have the neurosurgical supervisors score the simulation procedures by direct observation. However, due to COVID-19, some study participants were unable to attend, requiring some simulation procedures to be scored by video review. A total of 14 trainees participated in the study. The Cronbach's alpha coefficient among the scorers was 0.99, indicating strong inter-rater consistency. There was no statistically significant difference between the scores from video review and direct observation. There was a statistically significant difference (p < 0.001) between the scores for some criteria.
For the eight criteria, individual scorers assigned scores in a consistent pattern. However, this pattern differed between scorers, indicating that some scorers were more lenient than others. Conclusions: The results of this study indicate that both video review and direct observation are useful and highly objective techniques for evaluating simulation procedures, despite differences in score assignment patterns between individual scorers.


Introduction
The Coronavirus Disease 2019 (COVID-19) pandemic has forced the restructuring of surgical training programs around the world [3,5,23]. Simulation training for surgical procedures is a proven method to improve surgeon skills [2,20]. Some countries have adopted virtual reality simulation for training and recertification in surgical specialties, including obstetrics and gynecology [9] and abdominal surgery [14,16,18]. However, there is only limited reported research on the use of simulation training in the field of microsurgery. This research only discusses the development of simulation equipment and does not extend to analysis of the objectivity of the evaluation results from the simulation training. A survey [19] by the American Society for Reconstructive Microsurgery reported that only about 6% of microsurgeons had experience with high-precision simulation training. Additionally, 24% of the respondents thought that simulation training was a useful indicator of clinical performance, despite having no actual experience with simulation training [19].
In Japan, the Organization for the Certification of Cardiovascular Surgeons (which consists of three academic societies, including the Japanese Society for Vascular Surgery) mandates 30 hours of off-the-job training under a new medical specialist system. The Japanese Society of Endoscopic Surgeons requires an educational seminar of more than 10 hours and at least three practical training sessions for technical certification. However, there is no mention of off-the-job training for board certification by the Japan Neurosurgical Society. The Japanese Society on Surgery for Cerebral Stroke has a technical certification medical education seminar, but the content is limited to a single three-hour session that corresponds to off-the-job training. The Japanese Society of Plastic and Reconstructive Surgery (which conducts microsurgery) has a similar system for medical specialists but with no mention of off-the-job training. This is because the models that can be used in microsurgery for the fields of plastic surgery [13] and neurosurgery [17] are microscopic in nature. Consequently, research on the objectivity of evaluation may be difficult to conduct.
The evaluation of skills from simulation practice should not just confirm that a trainee has practiced, but also identify the strengths and weaknesses of individual skills and report these back to the trainee. Thus, it is necessary to analyze how differences in evaluator scoring patterns may affect trainee results. There is only limited research examining the objectivity of evaluations of microsurgical techniques by scorers from different backgrounds, or differences in scoring patterns between video review and actual face-to-face review of simulation procedures. This paper investigates the effect of scoring patterns on simulation training procedure evaluations, even for situations where it is difficult for trainees to participate in person due to COVID-19 restrictions.
Additionally, to objectively evaluate the improvement of skills over time due to simulation training, it needs to be confirmed that similar evaluations are possible even if there is a change of evaluator (e.g., due to retirement or transfer) or the evaluation method (video or face-to-face) changes.
The purpose of this study is to clarify the effectiveness of simulation-based evaluation in microsurgery education by analyzing the results of simulation training evaluations conducted in a contest format with many experts participating as judges. Additionally, this study assesses the objectivity and equality of these results.

Materials
We invited neurosurgeons from all over Japan who became physicians in 2010 or later to participate in a surgical technique contest with the task of incising and suturing artificial blood vessels. The contest was open to the public, and the judges observed the participants directly.
The task contents were uploaded to YouTube and made public. Each participant was given 5 minutes to incise and suture a 1 mm diameter artificial blood vessel. The suture thread was 10-0 nylon, and the participants could select the micro forceps, scissors, and needle holder for the procedure. An identical artificial blood vessel was prepared by the organizers for each participant to ensure equality and fairness. A scoring chart was prepared based on the Objective Structured Assessment of Technical Skill scoring method [21]. Scores were assigned on a six-point scale, where 0 = Failing, 1 = Bad, 2 = Poor or Below Average, 3 = Fair or Average, 4 = Good or Fairly Good, and 5 = Excellent or Almost Excellent. There were eight evaluated criteria, resulting in a maximum possible score of 40 points per scorer. Each participant was evaluated by 5 scorers, so the maximum total score for the contest was 200 points. Details of the eight evaluation criteria follow: 1. POSTURE: Surgeon's positioning / instrument layout (preparation).

Methods
This study was conducted with the approval of the Ethical Review Committee of the Medical University Hospital.
We asked neurosurgeons who are qualified as supervisors by the Neurosurgical Society and the Society of Surgery for Stroke, from different universities and affiliated institutions, to be the scorers for this experiment. Selection of scorers from these societies ensured that differences in surgical techniques and philosophical approaches were represented in the study. After the contest, the specialist judges discussed technique strengths and weaknesses with each participant. Due to the impact of COVID-19, the contest was rescheduled from March 2020 to September 2020, and the contest format was changed from an in-person format for all participants to a web conference for some participants. Video judging was provided for those who wanted to participate but were unable to attend the contest in person due to restrictions on domestic travel.
An example of a video review was prepared and sent to participants not able to attend in person. The review video consisted of a video of the microscope screen and a video of the surgeon's entire body taken from the side and back perspectives (Figure 1). The contest organizer mailed a urethane stand to fix the artificial blood vessel used for suturing, a 1 mm artificial blood vessel, and 10-0 thread to each of these participants. The review videos and the scoring sheet were then sent to the judges. Judges were asked to write down the advantages and disadvantages of each participant's technique as a comment. Results of these evaluations were sent to the participants so that they could assess their skills. The same scoring chart was used for both the video review and on-site judging.
A venue with a 200-person capacity was chosen as the final judging site. However, due to COVID-19 restrictions, only about 40 people attended. A KINEVO 900 surgical microscope (Carl Zeiss Meditec, Germany) was placed in the front center of the room, with five judges positioned around it (Figure 2). The final score for each participant was determined from the scores of all judges, and the final contest ranking was determined by comparing the judges' average scores. For fairness, a judge's score was excluded from the final average if he or she was from the same institution as the participant. For instances where participants became doctors in the same year and had identical final average scores, the younger doctor was ranked higher.
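As an illustrative sketch only (the entry names, data structures, and `birth_year` field are hypothetical, not taken from the study), the ranking rules above — averaging each participant's judge scores while excluding any judge from the participant's own institution, then breaking ties in favor of the younger doctor — could be expressed as:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Entry:
    name: str
    institution: str
    birth_year: int  # used only for the tie-break (later year = younger)
    # judge name -> (judge's institution, total score out of 40)
    scores: dict = field(default_factory=dict)

def average_score(entry):
    # For fairness, exclude judges from the participant's own institution.
    eligible = [s for inst, s in entry.scores.values() if inst != entry.institution]
    return mean(eligible)

def rank(entries):
    # Higher average first; on a tie, the younger doctor ranks higher.
    return sorted(entries, key=lambda e: (-average_score(e), -e.birth_year))
```

Sorting on a negated composite key keeps the exclusion rule and the tie-break in one pass; with real contest data, the year of becoming a doctor would also need to be checked before applying the tie-break.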

Statistical Analysis of Evaluation Results
The evaluation scores from all the judges for both the video and on-site judging were averaged to form the final rankings. We examined whether there were any differences in the ranking patterns among the judges. Statistical analysis was performed to assess the following: 1) differences in overall scores and years of medical experience between the video and on-site judging; 2) differences in individual judging scores between video review and on-site judging; 3) correlation analysis of all judges in the video and on-site judging. A commercially available software package (JMP Pro 14, SAS Institute) was used for the statistical analyses. All statistical results are presented as mean and standard error of the mean, or as median and standard deviation. Statistical significance was evaluated using the nonparametric Mann-Whitney U test and Fisher's exact test. These tests were chosen due to heterogeneity of variance and the small sample size. The Dunn test with merged ranks was used for all paired mean testing between multiple groups. The statistical significance level was set to p < 0.05. Consistency of ratings among scorers was estimated using the Cronbach's alpha coefficient.
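The study computed Cronbach's alpha in JMP; as a hedged illustration of what that coefficient measures, the standard formula α = k/(k−1) · (1 − Σσᵢ² / σ²_total) over k raters can be sketched in a few lines (the data layout shown is an assumption, not the study's actual data):

```python
from statistics import pvariance

def cronbach_alpha(ratings):
    """Cronbach's alpha as a measure of inter-rater consistency.

    ratings: one list per rater, aligned so that index j in every
    list is the same participant's total score from that rater.
    """
    k = len(ratings)                                  # number of raters
    rater_vars = sum(pvariance(r) for r in ratings)   # sum of per-rater variances
    totals = [sum(col) for col in zip(*ratings)]      # per-participant totals
    return k / (k - 1) * (1 - rater_vars / pvariance(totals))

# Raters who agree perfectly (identical scores) give alpha = 1.0;
# values near 1, like the 0.99 reported here, indicate strong consistency.
```

Population variance is used consistently in both numerator and denominator, which is all the formula requires; using sample variance throughout gives the same result.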

Results
Initially, the April 2020 contest had an enrollment of 18 participants, but it had to be postponed due to the COVID-19 pandemic. When the rescheduled contest was held in September 2020, only six participants were able to attend the in-person contest. The reduction in attendance was due to work relocation, study abroad, and domestic travel restrictions. There were eight participants in the video review contest, and four participants took part in both the in-person and video review contests. One of the video review participants was studying abroad during the contest. Five judges took part in each of the video review and on-site contests, two of whom participated in both. Table 1 shows the comparison of the participants' years of medical experience and the overall scoring results for both the video review and on-site review contests. There was no significant difference in overall score or years of medical experience between the participants in either of the contests. Of the four participants who took part in both the video review and in-person contests, two had higher scores in the video review contest and two had higher scores in the in-person contest. Table 2 shows the participant evaluation scores by judge and the overall total for the video review contest (Table 2a) and the in-person contest (Table 2b).
Examination of these tables indicates that there were differences between judges' scoring patterns, but final participant rankings tended to be similar. Table 3 shows the statistical results for each evaluation criterion, broken down by judge and overall total, for the video review contest (Table 3a) and the in-person contest (Table 3b).
Review of the video review contest results (Table 3a) shows that there was no statistically significant difference between judges in scoring the cutting and needle criteria. However, the other six criteria showed a statistically significant difference in scores between judges. Of these six criteria, Judge C scored statistically significantly lower than the other judges on five. For four of the six criteria, Judge D gave statistically significantly higher scores than the other judges. Judge D also scored statistically significantly higher than Judge C in the total score (p = 0.0118). For the in-person contest (Table 3b), there was no statistically significant difference in scoring between the judges for the posture, ligation, and knots criteria. The remaining five evaluation criteria had a statistically significant difference in scores between judges. In particular, Judge I scored statistically significantly higher than Judge F (p = 0.0083) and Judge G (p = 0.001) in the in-person contest. Table 4 shows the discussions between each trainee and the specialists on the content of each skill after the evaluation.

Discussion
Using the current evaluation criteria, there was homogeneity in the evaluation results among the five judges in both the video review and in-person contests. In both contests, judges tended to give higher or lower evaluation scores depending on the individual evaluation criteria. Some judges tended to give "severe" evaluation scores, while others tended to give "gentle" ones. However, the difference in scoring tendencies did not affect the final relative ranking of the participants. Two judges participated in both the video review and the in-person review contests, but neither of these judges was among those assigning "harsh" or "gentle" scoring patterns. There was also no statistically significant difference between the scoring results for the video review contest and the in-person review contest. For our study, when an expert is in charge of the judging and is provided with the task content and scoring table, the results indicate the possibility of similar and objective evaluation scores using either video review or in-person review.
The impact of COVID-19 has created a need for social distancing, which has affected the simulation educational environment. In other words, it has become difficult for many trainees to assemble in person, practice a simulation, and subsequently be evaluated on their performance. Of course, the purpose of simulation training is not competition in surgical procedure contests, but to motivate trainees to improve their skills and to appropriately deliver performance feedback to them. To achieve this, we conducted this study to confirm that objective and equal evaluation is not affected by differences in judges, video observation, or in-person observation. Results from this study show that it is possible to objectively evaluate microsurgical technique using a web system because the details of the technique can be projected on a monitor. This is an advantage of microsurgical education, since it is easy to record detailed techniques. After both contests conducted for this study, contestants were given the opportunity to discuss the strengths and weaknesses of their techniques with the judges and commented to the study organizers that their participation was very meaningful.

Simulation Materials Needed for Microsurgical Technique Evaluation
In this contest, we used artificial blood vessels. Artificial blood vessels were selected based on the following factors: availability of the same simulation material for all participants, cost, safety, animal welfare, ease of access, and similarity to actual surgery. There are many reports on the practice of vascular anastomosis using arteries from chicken wings [1,6], and the similarity to actual human blood vessels has been discussed. However, for this study, artificial blood vessels were selected because they could serve as a better guarantee of uniformity than chicken wing arteries. Nevertheless, if artificial blood vessels are not available, chicken wing arteries are a feasible alternative. A 5-minute time limit for each participant was set to allow time for scoring after the observation of the actual procedure, as well as to provide sufficient time for participant changeover and preparation. The in-person contest with six participants took more than three hours from start to end, including two hours of participant discussion with the judges. Obviously, more participants will require more time. If chicken wing arteries are used, drying of the wing blood vessels is required in the preparation, which will require more time for the contest. In addition to artificial blood vessels and chicken wing arteries, other simulation training has been reported [7,11,15,22] using placentas and vascular models created by 3D printers. These models have higher similarity to the actual surgical field than artificial blood vessels [10,12,22]. It has been suggested that these alternatives may contribute to the improvement of surgical techniques, but identical items are difficult to prepare, thus limiting their use in objective evaluation programs. Furthermore, they are not readily available.
In order to investigate and compare the educational effects of simulation-based training models, McGaghie et al. [8] proposed a translational outcome classification for simulation-based training models. This is a five-level classification of effectiveness, ranging from trainee satisfaction (effectiveness level 1) to patient outcomes (effectiveness level 4) and cost reduction and skill improvement (effectiveness level 5). Skill assessment with simulator tools is categorized by level of effectiveness. A recent review of 108 articles on neurosurgical simulation training reported [12] that there were 15 models at level 2 and only six models above level 2. In other words, most of the papers [12] on simulation training are about useful methods for improving skills, not about objective evaluation of skills. The objective evaluation method for microsurgical skills used in this study is McGaghie's classification level 2, but previous reports that fall into this classification have used 3D-printed models and cadavers [4]. Both of these preparations require time and costs for model creation. Patel et al. [12] pointed out the need for a cost-effective training simulator to continue simulation training. Our method can be practiced with an artificial blood vessel or a chicken wing vessel and has the potential to be a cost-effective method of technical evaluation when used in conjunction with a web-based screening method.

Limitation
Due to the small number of participants, multi-factorial statistical analysis could not be performed. We requested expert reviewers from all over Japan, but not all facilities have experts, and some facilities do not have people who can conduct evaluations. The use of an artificial vessel model allows for a more basic setup at a lower cost. However, it limits the ability to evaluate anastomotic patency due to the lack of active blood flow.
Ultimately, it is necessary to be able to objectively evaluate the relationship between actual improvement in procedural skill and subsequent patient prognosis. In other words, we would like to investigate the correlation between evaluation results from simulation training and patient prognosis.

Declarations
Funding: Not applicable. Financial support for this study was not provided.
Conflicts of interest/Competing interests: The authors declare that they have no conflicting or competing interests.
Availability of data and material: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. Code availability: Not applicable.
Ethics approval: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Consent to participate: Informed consent was obtained from all individual participants included in the study.
Consent for publication: Not applicable.
Author contributions: All authors contributed to the study conception and design.
Material preparation, data collection, and data analysis were performed by YM, SS, AT, KA, and MT. Data analysis was performed by YM and SS.