Institutional review board approval was obtained from the Medical Ethics Committee of the Pontifical Catholic University of Chile (Nº 190107005), in accordance with ethical guidelines. All subjects were over 18 years of age and gave informed consent. Final-year medical students and general physicians were recruited to complete four nonconsecutive training sessions over a three-month period. We also recruited six orthopedic surgeons experienced in knee arthrocentesis as experts against whom trainee improvement could be compared. Five days before their first training session, trainees received instructional material comprising written directions and the rationale for performing a diagnostic knee arthrocentesis, a description of the training scenario (including the assessment tool used to evaluate technical and nontechnical competencies), and a step-by-step video of the procedure.
A high-fidelity hybrid simulation scenario was created. A patient-actor was trained with a script portraying a 30-year-old patient arriving in the emergency department (ED) with two days of left knee pain accompanied by fever and joint inflammation. The patient-actor was stationed on a gurney; upon uncovering her left knee, trainees encountered a simulated knee (Sawbones©, Pacific Research Laboratories; Vashon, WA, USA). The model joint was a non-articulated knee with a partially mobile patella. From a side table, trainees had to select the materials required to perform the procedure, including hospital paperwork and the informed consent form. A health care assistant was available to help the trainee upon request but limited their participation to orders given by the trainee. During each session, trainees were required to take an abbreviated history and perform a physical examination of the patient. They then had to explain the procedure and obtain written informed consent. After preparing the required implements, they had to execute the procedure and fill the laboratory test tubes. Finally, they completed hospital exam forms and gave the patient postprocedure recommendations (figure 1). A single orthopedic surgeon evaluated all trainees, and sessions were recorded for a secondary evaluation used to determine the inter-rater reliability of the evaluation tool. We used a direct observation of procedural skills (DOPS) scale designed specifically for the scenario (supplemental material files 1 and 2). Immediately after the procedure, all trainees were taken to a debriefing room to receive feedback from the surgeon who had evaluated their performance. The surgeon had previously been trained to give effective feedback using the Pendleton model.13 Feedback sessions were also recorded. Each trainee's DOPS result constituted a point in their individual learning curve, and trainee learning curves were compared with expert performance to measure proficiency in the training scenario.
Proficiency was defined as the trainee's ability to conduct the procedure safely, with careful consideration for the patient and in accordance with the best practices outlined in the educational material they received, as measured with the de novo DOPS scale.14
After feedback, each trainee completed a validated satisfaction scale.15 This tool measured trainees' perceptions of scenario realism, the quality of the instructional material sent beforehand, the feedback received, and the perceived utility of the training session. One year after training, all participants were contacted to determine whether they had performed a knee arthrocentesis. Those who had were given a questionnaire measuring how confident and prepared they felt undertaking the real-life procedure. Specifically, we asked whether training had enabled them to obtain patient consent and provide education, perform a safe knee arthrocentesis, fill laboratory tubes and paperwork, and explain postprocedure care to the patient. Additionally, we asked the trainees to rate the perceived utility of participating in the training sessions.
DOPS adaptation and validation
We adapted the new score from a DOPS previously validated in the same cultural setting.14 The adaptation maintained the 11 items of the original DOPS but adjusted their descriptors to assess knee arthrocentesis. We determined the content validity of the de novo DOPS through a Delphi panel composed of experts in orthopedic surgery, rheumatology, and emergency medicine. Ratings and comments for each item were registered, and modifications were made for repeat expert assessment. Expert consultation through the Delphi panel was repeated until at least 80% agreement was obtained on all items.
Inter-rater reliability was assessed using a second evaluation performed by another orthopedic surgeon.19 Validity analysis was carried out on the DOPS scale applications from each trainee's consecutive sessions. Construct validity was determined through exploratory and confirmatory factor analyses. The exploratory factor analysis detected latent variables, or constructs, underlying the observed variables.20,21 A confirmatory factor analysis was then performed to validate the factor structure identified in the exploratory analysis.21
Adaptation and validation of the DOPS scale
Inter-rater reliability was measured with the weighted kappa (wK) coefficient. Levels of agreement for wK were determined as proposed by Landis and Koch, considering wK values of 0.00–0.20 as slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement.15
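As a minimal illustration of the statistic and banding described above, the following Python sketch computes a linearly weighted kappa from two raters' ordinal scores and maps it to the Landis and Koch categories. The actual analysis was run in Stata; the function names and the choice of linear weights here are assumptions for illustration only.

```python
from collections import Counter

def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Linear weights: full credit on the diagonal, partial credit nearby.
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    obs = Counter(zip(rater_a, rater_b))
    row, col = Counter(rater_a), Counter(rater_b)
    p_o = sum(w[idx[a]][idx[b]] * c for (a, b), c in obs.items()) / n
    p_e = sum(w[i][j] * row[categories[i]] * col[categories[j]]
              for i in range(k) for j in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def landis_koch(wk):
    """Map a kappa value to the Landis and Koch agreement bands."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for cut, label in bands if wk <= cut)
```

For example, two raters who agree on 4 of 5 ordinal scores, with one near-miss, yield wK = 0.5, which falls in the "moderate" band.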
In the exploratory factor analysis, the number of factors (or dimensions) was selected according to the Kaiser–Guttman22 and Cattell23 criteria: factors with an eigenvalue above one, and those above the inflection point of the scree plot, were retained. The coefficient of determination (R2) was estimated to quantify the percentage of the variance in the scale's items explained by the two factors identified in the exploratory analysis. The internal consistency of each dimension detected in the factor analysis was assessed using Cronbach's alpha.
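The retention rule and the internal-consistency statistic above can be sketched as follows. This Python illustration applies the Kaiser–Guttman criterion to a list of eigenvalues and computes Cronbach's alpha from raw item scores; the visual scree-plot inspection and the R2 estimation are omitted, and the function names and example data are hypothetical.

```python
def kaiser_guttman(eigenvalues):
    """Retain factors whose eigenvalue exceeds one (Kaiser-Guttman rule)."""
    return [ev for ev in eigenvalues if ev > 1]

def cronbach_alpha(item_scores):
    """Cronbach's alpha from per-subject rows of item scores (one dimension)."""
    k = len(item_scores[0])          # number of items in the dimension
    def var(xs):                     # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[j] for row in item_scores]) for j in range(k)]
    total_var = var([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

With eigenvalues of 4.2, 1.3, 0.8, and 0.4, for instance, the Kaiser–Guttman rule retains two factors; perfectly correlated items yield an alpha of 1.0.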
Learning curve analysis
A mixed-effects/multilevel model with a random intercept was constructed to study differences in each trainee's consecutive DOPS results. Multilevel models were used because each trainee's performance was assessed in repeated training sessions: DOPS scores from consecutive sessions of the same subject are correlated, which would otherwise produce biased estimates of standard errors and confidence intervals. Mixed-effects/multilevel models yield standard errors that account for this clustering within subjects. Multilevel statistical modeling enables quantitative analysis of learning curves and has been shown to have higher statistical power than conventional repeated-measures analysis of variance (ANOVA).16 This method has also been used in previous research to analyze how trainees acquire skills.29-31
Because the residuals of the mixed-effects model were not normally distributed, standard errors were estimated using bootstrapping (10,000 replications), and 95% confidence intervals (95% CIs) were obtained with the bias-corrected and accelerated method. Mean scores and 95% CIs are reported for each training session.
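The subject-level bootstrap underlying these intervals can be sketched as follows. This minimal Python illustration resamples whole subjects, so the correlation among a subject's repeated DOPS scores is preserved, and returns a simple percentile interval; the reported analysis used the bias-corrected and accelerated variant, which adjusts these percentiles, and was run in Stata. All names and data here are hypothetical.

```python
import random

def cluster_bootstrap_ci(subjects, stat, reps=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for `stat`, resampling whole subjects.

    `subjects` is a list of per-subject score sequences (one score per
    training session); resampling at the subject level keeps the
    within-subject correlation of repeated measures intact.
    """
    rng = random.Random(seed)
    n = len(subjects)
    stats = sorted(
        stat([subjects[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps)
    )
    lo = stats[int(alpha / 2 * reps)]
    hi = stats[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical data: five trainees, DOPS scores over four sessions.
scores = [[10, 14, 17, 19], [12, 15, 18, 20],
          [9, 13, 16, 18], [11, 14, 17, 21], [8, 12, 15, 17]]

# 95% CI for the mean score in the first session.
mean_session1 = lambda subs: sum(s[0] for s in subs) / len(subs)
lo, hi = cluster_bootstrap_ci(scores, mean_session1)
```

The same call can be repeated per session to trace the learning curve with its interval at each point.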
All analyses were conducted in Stata version 16 (StataCorp. 2019. Stata Statistical Software: Release 16. College Station, TX: StataCorp LLC).