Face, Content, and Construct Validity Evaluation of Simulation Models in General Surgery Laparoscopic Training and Education: A Systematic Review.

Background Laparoscopic technical surgical skills (LTS) are considered a fundamental competence for General Surgery residents. Several simulation tools (ST) have been explored to develop LTS. Although a plethora of systematic reviews evaluate the translation of LTS developed in simulation to real surgery, there is a lack of evidence clarifying the effectiveness of different validated ST in the acquisition of LTS by surgical residents. The aim of this systematic review (SR) is to summarize published evidence on the validation of ST used for surgery education and training. Methods A protocol was published in PROSPERO. A SR was carried out following PRISMA guidelines. Complete published articles in English or Spanish that validate either content or construct validity, plus another form of validation, of ST used to acquire LTS in general surgery were included. Articles that used only one validation strategy or did not validate an ST were excluded. Results 1052 publications were initially identified across all searched databases. Title review identified 240 studies eligible for full-text screening, and 10 studies were included in the final review. Two studies assessed both face and content validity, 4 face and construct validity, and 4 face, content, and construct validity. None of the studies presented comparable outcomes due to variation in the metrics and scores used for the validation strategies. Conclusions This study assessed validated laparoscopic simulation models, with particular attention to content and construct validity. Articles reported an increased use of simulation models in laparoscopic training with positive feedback from trainees, but few studies reported validation of the training model. Validation strategies are not standardized, limiting comparability between them.


Introduction
Development of laparoscopic technical surgical skills (LTS) in General Surgery residents is of paramount importance. 1 To develop laparoscopic ability in trainees, multiple tools have been implemented, such as video games, 2 virtual reality (VR) training models, box simulators (BS), and surgical simulation (SS) on synthetic, virtual, biological, and human cadaveric models, which have been shown to be effective educational methods for surgeons. 3 While tools such as video games have shown a positive correlation with the development of technical skills in the operating room (OR), training programs currently focus on models that allow simulation and training of laparoscopic movements. 2 These training models can be further divided into laparoscopy simulation models and SS models, depending on whether the aspect simulated is the technique (simple laparoscopic movements) or the procedure (emulating a surgical procedure on a synthetic or biological model). 2,4 Among the available training models to develop LTS, SS has been heavily favoured due to evidence of its superiority: a shortened learning curve for LTS acquisition, improved technical dexterity, and faster development of the skills necessary to perform these types of procedures. 4 Most laparoscopic simulations are done using VR, integrating visual and haptic equipment; synthetic models, live or cadaveric animal models, and human cadaveric specimens preserved by freezing are other popular ST currently in use. 2,4-7 The earliest laparoscopic simulators to be widely utilized were known as training boxes, or BS; nowadays, the most recent BS models are used for training basic coordination skills or as casings to simulate the intra-abdominal environment in SS.
All the aforementioned simulation tools are used to train skills through simple, standardized exercises and repetition, 7 with the objective of strengthening basic skills related to precision and movement of objects in space, which are fundamental for any laparoscopic procedure. New simulation models have evolved from BS, adding synthetic or biological components to enhance the experience by adding SS to basic dexterity exercises. 8,9 SS in VR offers an additional opportunity to measure students' skills and evaluate their progress objectively through time, number of movements, and errors committed, amongst other variables. 2,4-7,10 Mechanical VR simulators, or haptic simulators, allow training of the resistance and strength sensation required. 11 Nevertheless, due to the diversity and heterogeneity of available models, there is no clear comparison between the validation results of ST models reported in the literature. 4,12 Although multiple systematic reviews evaluate the degree to which skills acquired in simulated models can be transferred to a real-life scenario 3-5 or compare the reported effectiveness of specific ST, 13,14 to the best of our knowledge there is no current literature that assesses the available evidence on the effectiveness of different ST through face, content, and construct validation for the acquisition of LTS by surgery residents.
These validation strategies are of particular interest amongst the variety of available options 15 when evaluating and validating a model due to their aims: face validity considers realism and similarity of difficulty level in comparison to real tasks; content validity assesses the model's effectiveness in improving participants' technique during specific skill training; and construct validity evaluates the ability of a simulator to differentiate between different levels of expertise. 16 These validation strategies, when combined, can determine a model's effectiveness for training. 17 A publication by Gallagher et al. 15 highlighted the growing need for useful strategies to teach minimally invasive surgery techniques, such as laparoscopy. In response to an editorial entitled "Surgical Research or Comic Opera?", in which many valid points were raised, it outlined some strategies to improve research in surgical education.
Gallagher et al. 15 also noted that the function of good science is to show clear cause-and-effect relationships, but that education in minimally invasive surgery does not always require a prospective, randomized, double-blind experimental design. Indeed, they suggested the optimal method may not even be experimental, but statistical, such as structural equation models. Other types of validation included concurrent validity, defined as "an evaluation in which the relationship between the test scores and the scores on another instrument purporting to measure the same construct are related"; discriminate validity, defined as "an evaluation that reflects the extent to which the scores generated by the assessment tool actually correlate with factors with which they should correlate"; and predictive validity, defined as "the extent to which the scores on a test are predictive of actual performance". 15 All validations have advantages and disadvantages and are important in the construction of a test for high-stakes assessment; predictive validity is the one most likely to provide clinically meaningful assessment, but it cannot be measured outside the operating room. 15 As such, the aim of this SR is to summarize current published evidence on ST used for general surgery education and training, validated through at least 2 validation strategies among face, content, and construct validity.

Search Strategy and Assessment of Eligibility
An initial protocol was published in PROSPERO (Registration code: [details omitted for double-anonymized peer review]) and, based on it, a SR was undertaken in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 18 guidelines up until October 2021 using predefined search terms. Complete published articles in English or Spanish that validate either content or construct validity, plus another form of validation, of ST used to acquire LTS in general surgery were included. Articles that used only one validation strategy, presented an ST but did not validate the model, evaluated an ST for other specialties, or did not include general surgery residents were excluded.

Data Extraction
Databases reviewed included PubMed (Medline), Cochrane, Google Scholar, EMBASE, and SciELO; the references of systematic reviews found during the search process were also searched for additional manuscripts. The review was updated up to October 2021. The search strategy was replicated across all databases, and MeSH and non-MeSH terms were used as part of the search algorithm, beginning with ("laparoscopy"[MeSH Terms] OR "laparoscopy"). Articles obtained were imported to the Rayyan Systems Inc 19 online platform. Two reviewers independently reviewed the articles obtained by the search strategy in a blind extraction by title and abstract; once done, reviewers discussed any conflicts, and a third peer had the final decision on disagreements, after which a full-text review was carried out. Full papers in conflict between reviewers were discussed; no third-party mediation was required at this point. Data extraction was carried out independently by the reviewers into an Excel template previously designed with the data of interest (Annex 1).

Validation
In the context of simulation for medical education, validation is considered a complex concept that aims to establish that a proposed simulation model accurately achieves its training purpose; it comprises a series of principles including face, content, construct, concurrent, discriminate, and predictive validity, amongst others. 15 To assess ST that provided objective evidence of their utility to develop LTS, and to enhance reliability, only ST which were evaluated for either content or construct validity and included at least one other form of validation were included. The validity principles appraised in this study were defined as follows:
Construct Validity. The degree to which the model can identify the ability it is designed to simulate; 15 in simulation this is represented by the capability of the ST to differentiate between experts and novices in a specific task. 20
Content Validity. A detailed evaluation of the content of the simulation to determine whether it is appropriate. 21 In the context of ST for medical education, it assesses the usefulness of the ST as a training tool.
Face Validity. The realism of a simulator and the similarity of its difficulty level in comparison to real training tasks. 22
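As an illustration of how construct validity is typically established, the sketch below compares novice and expert task-completion times with a simple two-sample permutation test. The data and threshold are purely hypothetical and are not drawn from any reviewed study; a low p-value indicates that the metric distinguishes experience levels.

```python
import random
from statistics import mean

def permutation_test(novice, expert, n_perm=10_000, seed=42):
    """Two-sample permutation test on the difference of group means.
    Returns an approximate two-sided p-value."""
    rng = random.Random(seed)
    observed = mean(novice) - mean(expert)
    pooled = novice + expert  # new list; inputs are not mutated
    k = len(novice)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_perm

# Hypothetical task-completion times in seconds (illustrative only)
novice_times = [182, 175, 169, 190, 201, 178, 185]
expert_times = [102, 95, 110, 99, 105, 92, 108]

p = permutation_test(novice_times, expert_times)
print(f"p = {p:.4f}")  # a small p-value supports construct validity
```

A permutation test is used here only because it needs no distributional assumptions; the reviewed studies used a variety of statistical comparisons for this purpose.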

Risk of Bias
PRISMA guidelines suggest that risk of bias be assessed using Cochrane's RoB2; 23 however, this tool is exclusive to clinical trials and is not appropriate for medical education studies. For that reason, the Medical Education Research Study Quality Instrument (MERSQI), 24 which better suits the evaluation of quality in education research, was used.
The Medical Education Research Study Quality Instrument is a measurement score designed in 2007, validated and widely used to evaluate the methodological quality of medical education research. 24 It consists of 10 items reflecting 6 domains of study quality: study design, sampling, type of data, validity, data analysis, and outcomes, each of which carries a maximum score of 3; the maximum possible MERSQI score is 18, with a potential range from 5 to 18. Higher MERSQI scores have been associated with a higher acceptance rate at competitive journals and an increased likelihood of external funding. 24
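The instrument's arithmetic can be sketched as follows; the per-domain scores here are purely illustrative and are not taken from any study in this review:

```python
# Illustrative MERSQI scoring for a single hypothetical study.
# Each of the 6 domains contributes at most 3 points, so totals
# fall between 5 and 18.
MAX_PER_DOMAIN = 3
domain_scores = {
    "study design": 2,
    "sampling": 1.5,
    "type of data": 3,
    "validity": 2,
    "data analysis": 2,
    "outcomes": 1.5,
}
assert all(0 < s <= MAX_PER_DOMAIN for s in domain_scores.values())
total = sum(domain_scores.values())
print(f"MERSQI total: {total}/18")  # prints "MERSQI total: 12.0/18"
```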

Results
The search strategy produced 1052 publications across all searched databases. Title review of the manuscripts using the selection criteria identified 240 studies eligible for full-text screening, of which 10 studies were included in the final review; the results of the search and screening are summarized in Figure 1. The general characteristics of each study, the simulation type, and the ST's validity characteristics are summarized in Table 1.
Initial article selection excluded 812 of the 1052 studies found. Reasons for exclusion were: (1) no validation of the model was included in the study (340 articles excluded); (2) subjects in the study did not include General Surgery residents (384 articles excluded); (3) articles not in English or Spanish, or articles that did not pertain to laparoscopic surgical simulation training (88 articles excluded).
A total of 240 full texts were screened; 156 papers were excluded due to a lacking, incomplete, or improperly described validation strategy (Figure 1, Reason 4; 146 articles excluded) or a wrong outcome (Figure 1, Reason 5; 10 articles excluded). Papers that did not have at least 2 different validation strategies, at least one of them being content or construct validity, were also excluded (Figure 1, Reason 7; 74 articles excluded). Finally, articles were clustered for analysis by the validation strategies employed: face and content; face and construct; and face, content, and construct.
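The screening flow reported above can be sanity-checked with a few lines of arithmetic, using the counts stated in this section:

```python
# PRISMA-style flow arithmetic for the counts reported above.
identified = 1052
title_abstract_excluded = 340 + 384 + 88   # exclusion reasons 1-3
full_texts_screened = identified - title_abstract_excluded
full_text_excluded = 146 + 10 + 74         # Figure 1 reasons 4, 5, and 7
included = full_texts_screened - full_text_excluded
print(full_texts_screened, included)       # prints "240 10"
```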

Face and Content Validity
Two articles validated ST by face and content validity; 25,26 both presented a BS training model. Bahsoun et al. 25 proposed a cardboard-based model which replaced the traditional laparoscopic camera with a tablet; the study included 10 participants, and face validation was done by experienced laparoscopic surgeons who answered an evaluation form on the tablet feedback. For content validation, the authors examined the quality of the video feed and compared it with the laparoscopic camera and stack. This study examined the components of a laparoscopic training setup and recreated a simplified, less expensive model based on the use of a tablet, to overcome the obstacle of laparoscopic stack and camera costs. 25 Achurra et al. 26 presented a high-definition BS, for which 76 expert bariatric surgeons answered a four-item questionnaire for face validity; the results showed that 98% considered the BS to have high fidelity and to be optimal for advanced laparoscopic simulation. Content validity was evaluated by assessing improvements in the global and specific rating scores of 15 first-year general surgery residents. The model evidenced a statistically significant improvement in scores after training with the BS. 26

Face and Construct Validity
A total of 4 articles that assessed face and construct validity were included. De Moura Júnior et al. 27 described a BS for laparoscopic surgery in a fiberglass-based console with 37 participants, including both surgeons and general surgery residents with no further stratification; both face and construct validity were assessed by means of self-administered surveys which tested satisfaction with the model as well as perception of performance, and consistently higher scores were found for the model evaluated. The authors conclude that the model is a useful tool in the development and evaluation of residents in training. 27 Another 3 authors presented ST based on VR with different variations. Sánchez-Peralta et al. 28 described a model based on the SINERGIA laparoscopic virtual reality simulator; participants were divided into novices and experts. Both groups positively rated the VR simulator as similar to reality, and a statistically significant difference was found in at least one of the metrics for each task between groups, providing construct validity. 28 Lahanas et al. 29 proposed an augmented reality simulator for skills assessment in minimally invasive surgery; the tasks evaluated included instrument navigation, peg transfer, and clipping. Participants were divided into 2 groups, novices and experienced surgeons. Construct validity was evaluated by a statistical comparison of the 2 experience groups while performing the 3 tasks described previously. 29 All participants answered a five-point Likert questionnaire regarding the realism and difficulty of each task, establishing face validity of the instrument.
Finally, Sankaranarayanan et al. 30 presented a ST consisting of a VR simulator called Gen2-VR; participants had differing experience levels and were not stratified, and the study compared the use of the model under different scenarios against an older VR model. Face validity was assessed by means of a five-point Likert scale, and a statistically significant difference was found between scenarios and VR models, providing construct validity. 30 This ST is of particular interest as it includes the assessment of factors that may interfere with surgery outcomes, such as noise and interruptions, which can be added to the immersive VR environment and should be considered while training is performed. 30

Face, Content, and Construct Validity
Four articles amongst all the publications presented all 3: face, content, and construct validity; 2 of them 31,32 proposed BS training models while the other 2 presented VR models. 33,34 Van Empel et al. 31 proposed a BS model named TrEndo; participants were divided into 2 groups based on prior laparoscopic experience, novices and experts. Face and content validity were assessed by a questionnaire, and construct validity was appraised by a series of tasks such as needle positioning. 31 Construct validity showed differences between the 2 groups, with significantly better results in the experts' group. Perez Escamirosa et al. 32 presented a ST called EndoViS, a training system for psychomotor skills development in laparoscopic surgeons; sample groups were divided by expertise level (medical students, residents, and surgeons). Face and content validity were assessed by a questionnaire of 13 statements that combined design, realism of the cavity, functionality of the simulator, and training capacities, answered with a five-point Likert scale; 32 construct validity was measured by evaluation of 4 skill tasks: peg transfer, rubber band, pattern cutting, and intracorporeal knot suturing. 32 The authors stated that the EndoViS training system has been presented and successfully validated. They proposed that the system provides a non-obstructive alternative to traditional tracking systems and a reliable method to capture and analyze the motion of surgical instruments, offering potential for surgical training programs. 32 Kawaguchi et al. 33 presented a ST named LAP-X to train basic laparoscopic skills; groups were divided into novices and experts, and all participants completed a questionnaire designed to evaluate face and content validity, with content validity evaluated by experts only. Questions were answered by selecting a mark on an ordinal scale ranging from 1 to 5, and construct validity was tested by the parameters of a task completed using the VR simulator.
Arts et al. 34 presented an augmented reality VR model called EoSim. Participants were divided into experienced, intermediate, and inexperienced groups. To assess face and content validity, each participant completed a questionnaire comprising five-point Likert scales on realism, didactic value, and usability; construct validity was assessed using the outcome parameters generated by the tracking software. Construct, face, and content validity are presented for the ST. 34

Methodological Quality and Risk of Bias
Risk of bias assessment using Cochrane's RoB2, 23 as suggested by PRISMA guidelines, was not conducted due to the educational research nature of the included manuscripts. The methodological quality of the studies was reviewed, classified, and scored using the MERSQI scoring system. 24 MERSQI scores were compiled for each domain and totalled to allow comparisons between papers; information was summarized in an Excel template, resulting in a mean score of 11.69 (95% CI 11.40-11.98). Individual scores can be found in Table 1.
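The sketch below illustrates how a normal-approximation 95% confidence interval for a mean score is computed; the per-study totals used here are hypothetical (the review's actual individual scores are listed in Table 1), so the resulting interval is illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def ci95(scores):
    """95% confidence interval for the mean (normal approximation)."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores))  # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se

# Hypothetical per-study MERSQI totals (illustrative only)
scores = [10.5, 11, 11.5, 12, 12, 11.5, 12.5, 11, 12.5, 12.5]
low, high = ci95(scores)
print(f"mean={mean(scores):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# prints "mean=11.70, 95% CI [11.26, 12.14]"
```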

Discussion
Evidence provided by educational papers and validity definitions suggests that ST for LTS development should be validated. 15,18,20 Our systematic review highlights evidence of the feasibility, utility, and importance of multiple validation strategies in the development and presentation of laparoscopic simulation models. Heterogeneity between papers is rampant, largely explained by the unavailability of standardized validation strategies for simulation models in surgical education. One of the most vital difficulties lies in the fact that the terms utilized across studies are not standardized, resulting in conflicting and incomplete information on participant experience levels. Terms such as "novice" are not precisely defined in the literature; authors take novices to be medical students, first-year residents, or residents of any year, even though these represent extremely different populations with high variability in informal training and skill acquisition. Furthermore, in some cases no data are provided regarding previous training or laparoscopic experience when differentiating experience levels, further increasing the difficulty of cross-study comparisons of ST.
The reviewed literature showed that most validated simulations are carried out in VR, using both visual and haptic equipment. Other ST currently in use include synthetic models, live or cadaveric animal models, and human cadaveric specimens preserved by freezing. 2,4-7 All these simulation tools are used to train skills through standardized exercises and repetition, 7 offering the opportunity to objectively measure skills and evaluate progress, 10 and allowing training of the resistance and strength sensation required for laparoscopic surgery. 11 While previous studies suggest residents prefer training on frozen human or animal cadaveric models, 12 no validation of these ST was found in the reviewed literature. Even though these were the preferred models among students, high costs and difficulties in the preservation of cadaveric specimens are strong drawbacks that may contribute to the scarcity of literature objectively evaluating their usefulness. Multiple systematic reviews focus on the effectiveness of specific ST such as virtual or cadaveric simulation, 13,15 but do not compare results across different ST.
Most authors attempted to assess face validity; however, no standardized validated score exists for this purpose, and questionnaires and scores tend to vary; even the questions on surveys are not comparable and thus represent a source of heterogeneity. Due to the high variability of the questionnaires used, and in the absence of a standardized strategy to validate ST, comparison between papers is not objectively possible at this time.
All studies evidenced some measure of improvement in LTS when using simulation, which is supported by the current literature and suggests that simulation training in laparoscopy is useful to develop LTS in residents. 1-3 Simulation exercises allow students to practice in a controlled environment, giving them a safe space to make mistakes while receiving feedback from teachers and experienced clinicians, thus reducing the risk of errors in future procedures. 35 Laparoscopic simulation of some kind should certainly be included as a requirement of general surgery residency programs, 1 ideally before residents perform any surgery on patients. 4,16 As stated previously, it is not possible to determine which ST performs best for general surgery residents due to the heterogeneity of models and the lack of standardized definitions, measuring tools, and validation strategies, which limits comparability between them. 4,20 Even though the quality of individual studies was moderate, as evidenced by the MERSQI tool, these differences in methodological procedure prohibit adequate comparison between the different ST. Despite this limitation, improvement in LTS acquisition after simulation has been described regardless of the model, validation methods, and validation type, which reinforces the need for simulation during training in surgery. 1,4-6,9,20,26-31,33,34

Face and Content Validity
Both studies in this group, Bahsoun et al. 25 and Achurra et al., 26 evaluated BS models. Although they had a similar ST, there were substantial differences between them, particularly in the subjects included in each study. Due to a lack of clarification on expertise levels, the study populations were not comparable: while Bahsoun et al. 25 described the study population as "residents and experts", Achurra et al. 26 defined them as "novices and experts", and each author used different selection criteria to classify the participants' expertise. Face validation of these ST was performed in both studies with non-standardized questionnaires developed for the singularities of each BS model; in accordance with the literature, a lack of standardized and validated questionnaires can be a confounding factor in multiple scenarios. 35

Face and Construct Validity
Four studies assessed both face and construct validity. In this sub-group most ST were VR-based, consistent with the current prevalence of this type of simulation for laparoscopy training. 2,4-7 De Moura Júnior et al. 27 presented a BS model; face validity was assessed by means of a self-administered survey which tested both satisfaction with the model and perception of performance. Sánchez-Peralta et al. 28 and Sankaranarayanan et al. 30 each presented a different VR model. The main distinction between those 2 models was that Sankaranarayanan et al. 30 included assessment of environmental factors that may interfere with surgery outcomes, such as operating room noise and interruptions, which can be included in the immersive VR environment and should be considered while training is performed, giving this model an interesting additional perspective.
Lahanas et al. 29 proposed an augmented reality simulator, in which virtual objects were superimposed on reality. Participants were described as either novices with no previous laparoscopic experience or experts, but no further information was provided. It was the only study in this sub-group that proposed an augmented reality simulator. 29

Face, Content, and Construct Validity
This group comprised Van Empel et al., 31 Perez Escamirosa et al., 32 Kawaguchi et al., 33 and Arts et al. 34 Van Empel et al. 31 assessed face and content validity by means of a questionnaire, while construct validity was appraised by a series of tasks. Perez Escamirosa et al. 32 presented a ST and assessed face and content validity by means of a questionnaire comprising 13 statements answered with a five-point Likert scale; construct validity was measured by skill performance in 4 tasks.
Kawaguchi et al. 33 assessed face and content validity with a questionnaire designed by the authors; questions were answered by selecting a mark on an ordinal scale ranging from 1 to 5, and content validity was evaluated by experts only. Construct validity was tested by the parameters of a task completed using the VR simulator. Arts et al. 34 described an augmented reality VR model and assessed face and content validity through a questionnaire comprising five-point Likert scales, while construct validity was assessed through the outcome parameters generated by tracking software.
As already stated, an important variation in Likert scales and questionnaire design is apparent in the included studies, which hinders data analysis and comparability of results. 35 Considering all the above and recalling Gallagher et al., 15 there is no "magic bullet" experimental methodology for surgical research; instead, there is a range of experimental designs that should be applied, and the development of validated education tools is a necessity. 15 Predictive validity is the type most likely to provide clinically meaningful assessment, but due to its execution difficulties there is still no standardised research methodology to establish it; therefore, in medical education for minimally invasive surgery and laparoscopic skills before the transition to the operating room, face, content, and construct validity are well-established and well-studied options for measuring LTS.
COVID-19 and Simulation Models. The development of novel learning modalities, such as synthetic models and VR, has been accelerated during the COVID-19 pandemic. 36 However, many simulation models are nascent in development and entail high application costs. 16 Currently, BS and VR models are the most prevalent simulation tools available; BS models are favoured because they offer visual and haptic realism while maintaining relatively low costs. The rapid advance of technology may expand the possibility of VR models using haptic feedback while lowering development and usage costs.
Strengths and Limitations. As already stated in the literature, limitations of this type of review are related to restrictions in medical education research, ethical barriers, and the fact that no gold standard has been described to assess surgical training outcomes. 16,37 A major limitation of this review is the lack of inclusion of predictive validity, considered the "ultimate" assessment to establish the validity of an educational tool; 16,35 nevertheless, it remains difficult to implement in practical terms. In addition, the inability to compare data acquired by diverse researchers, due to the heterogeneity of data-gathering tools and variations in Likert scales, hinders the possibility of acquiring strong evidence about the use of simulation in different contexts.
The fact that only manuscripts considering 2 types of validation, with at least one being content or construct validity, were included excluded many partially validated ST models which are being used worldwide. The restriction of the population to general surgery residents also excluded multiple simulation scenarios that could be of importance.
Furthermore, our definition and selection of research participants excluded studies considering exclusively medical students; nevertheless, several manuscripts validated content by assessing improvement in the performance of laparoscopically "naive" subjects, defined as undergraduate students. 38 To our knowledge, however, this review is the first to address face, content, and construct validity for the validation of laparoscopic surgery training ST. A recent SR published by Shah et al. 39 aimed to describe the status of simulation-based training tools in general surgery in the current literature, assess their validity, and determine their effectiveness; that systematic review may be complementary to ours. While Shah et al. 39 included all published simulation models and described a variety of validations proposed by authors, no distinction was made to highlight and review models with at least 2 kinds of validation, one of them being either content or construct validity, which are previously established validations that support the validity of ST as training tools to effectively develop LTS in surgical education. 15 Future work should aim to standardize scales for validation that incorporate both subjective and objective validation assessments, such as the GOALS and OSATS scores for laparoscopic training. This would allow researchers to collect adequate data to validate multiple models and, in the future, compare results between published ST.

Conclusion
This SR assessed laparoscopic simulation models that were validated, with particular interest in content and construct validity. Articles reported an increased use of simulation models in laparoscopic training. Although studies demonstrated that laparoscopic surgery trainees preferred the use of frozen human or animal cadaveric models, synthetic and BS models are currently the most convenient and practical for developing skills in residents. VR models provided visual realism and improved haptic feedback, but still need further development.
Although surgical simulation models receive positive feedback from participants, proper validation of the models is not standardized. For that reason, comparing results from different publications is limited by the methodological heterogeneity amongst articles, even when individual articles are well executed. Moreover, studies examining simulation methods seldom use objective validation methods such as construct and content validity. Future work should examine how to facilitate and standardize objective validation strategies that are applicable to a variety of ST.