We found that the performance of practicing physicians on simulated critical event scenarios is highly context-specific. Variation due to task sampling (context specificity attributable to the content of the scenario) is known to affect the validity and reliability (generalizability) of scores13,14. However, measurement properties of assessment scores have not been well characterized for the population of practicing anesthesiologists we studied, nor have they been well evaluated for the types of critical event scenarios we modeled. The simulation cases used in this study were constructed to reflect the timing of events in actual clinical practice, in scenarios that require both technical and behavioral expertise to manage the patient’s condition effectively. These cases were presented with as much realism as is achievable with current simulation techniques. Scenarios were developed to accurately reflect the types of challenges faced by practicing anesthesiologists in real-world emergency situations where the correct answer is not clear and the outcome is not predetermined. Thus, we believe that the performances elicited were likely a fair reflection of how the subjects would have acted in a real situation, although that cannot be known with certainty. Yet, even with some reservations about drawing strong conclusions from these data about the reliability of an assessment using this approach, we were surprised to find that the performance of physicians in one of two critical event simulation scenarios often did not match their performance in the other. This was true for both behavioral and technical scores.
Previous studies have investigated the psychometric properties (reliability, validity) of scores from standardized patient encounters as well as other simulation-based assessments15–17. While reliable and valid scores, and associated high-stakes competency decisions, can be obtained, these decisions demand broad sampling of the domain and effective rater training18. High-stakes applications, such as the introduction of objective structured clinical examinations (OSCE) into the primary board certification of anesthesiologists19, require an evaluation of the sources and magnitude of measurement error to determine the number of scenarios needed to obtain sufficiently precise estimates of ability. The American Board of Anesthesiology (ABA) recently introduced OSCEs to assess two domains that “may be difficult to evaluate in written or oral exams - communication and professionalism and technical skills related to patient care”20. Those examinations comprise seven stations. Other certification bodies, including the Royal College of Physicians and Surgeons of Canada, also recognize the unique ability of simulation-based assessment to evaluate domains not covered by traditional assessment techniques21. However, the types of assessment encounters administered can be highly context specific; that is, because of the nature of the management task, the skills measured in one patient management problem may not generalize to another. This indicates that numerous performance samples are needed to obtain sufficiently reliable ability estimates. To address this issue, many high-stakes performance assessments focus on straightforward interactions or problems. While simple, uncomplicated scenarios may be appropriate for some tests, they are not likely appropriate for assessing the ability of practicing anesthesiologists to manage critically ill patients. To assess practicing physicians, realistic scenarios that require integration of multiple skills, such as the patient and team management simulations described here, are needed. Unfortunately, these types of scenarios are typically very context specific. This makes it challenging to administer enough stations to yield reliable and fair estimates of an individual clinician’s abilities while remaining practical in terms of the time and cost required to administer them.
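As a frame of reference for the sources of measurement error discussed above, the standard generalizability-theory decomposition for a fully crossed person (p) × scenario (s) × rater (r) design is shown below; this is a textbook formulation offered for orientation, not a restatement of our specific analysis model.

$$
X_{psr} \;=\; \mu + \nu_p + \nu_s + \nu_r + \nu_{ps} + \nu_{pr} + \nu_{sr} + \nu_{psr,e},
$$

so that the total observed-score variance partitions into

$$
\sigma^2(X_{psr}) \;=\; \sigma^2_p + \sigma^2_s + \sigma^2_r + \sigma^2_{ps} + \sigma^2_{pr} + \sigma^2_{sr} + \sigma^2_{psr,e}.
$$

Context specificity manifests chiefly in the person-by-scenario component, $\sigma^2_{ps}$, which can only be averaged down by increasing the number of scenarios sampled per examinee.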
The D study, although limited because each participant was evaluated in only three different scenarios, administered in two pairs, nevertheless suggests that more than 20 scenarios would be required to achieve a reliability of 0.8 (desirable for high-stakes assessments12). Controlling for numbers of scenarios and raters, the estimated generalizability coefficients from our study were lower than those reported elsewhere22,23. While the scenarios were modeled to present management challenges that all practicing, board-certified anesthesiologists should be able to handle, we found that some participants could perform well on one scenario and poorly on the next. A similar observation was made in a recent analysis of anesthesiology residents who were scored on four simulation scenarios24. In our analysis, this variation was seen in both technical and behavioral performance. The scenarios were developed to elicit nuanced performances, which may have made them more content specific because clear-cut management expectations were accompanied by ambiguous, real-world interactions with others embedded in various provider roles within the scenario. This result highlights the challenge of developing content-valid and practice-relevant simulation-based performance assessments for practicing physicians, especially if these are to be used for summative purposes.
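For readers who wish to reproduce this kind of projection, the conventional D-study expression for the relative generalizability coefficient under the crossed design above is (again, a standard formulation rather than our exact model):

$$
E\rho^2(n_s, n_r) \;=\; \frac{\sigma^2_p}{\sigma^2_p \;+\; \dfrac{\sigma^2_{ps}}{n_s} \;+\; \dfrac{\sigma^2_{pr}}{n_r} \;+\; \dfrac{\sigma^2_{psr,e}}{n_s n_r}}.
$$

Fixing the number of raters $n_r$ and setting a target coefficient $\rho^{*}$ (here 0.8), the required number of scenarios follows directly:

$$
n_s \;\ge\; \frac{\sigma^2_{ps} \;+\; \sigma^2_{psr,e}/n_r}{\dfrac{1-\rho^{*}}{\rho^{*}}\,\sigma^2_p \;-\; \sigma^2_{pr}/n_r},
$$

provided the denominator is positive; when the person-by-scenario interaction dominates the variance components, this bound grows quickly.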
For most performance-based assessments, the variance attributable to the task, and associated interactions, outweighs that associated with the rater25. While the variance attributable to the rater was less than that attributable to the task (scenario), it was not zero for the second pair of scenarios. Even though rater training was quite stringent, individual evaluators still varied with respect to how they used the scoring rubrics. Also, with longer scenarios, the raters had to aggregate holistic judgments over time, potentially leading to more variation among raters. Future studies could explore these potentially biasing effects by collecting performance ratings over time and comparing them with overall judgments. As it stands, at least for the types of complex scenarios modeled in our investigation, the ability estimates of the practitioners were highly dependent on the choice of scenarios and, to a lesser extent, the choice of raters.
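To make the relative contributions of scenarios and raters concrete, the brief sketch below projects the coefficient defined above across different numbers of scenarios and raters. The variance components are hypothetical placeholders chosen only so that scenario-related variance dominates rater-related variance, as in our data; they are not the estimates from this study.

```python
# Illustrative D-study projection for a fully crossed person x scenario x rater design.
# All variance components below are hypothetical placeholders, not estimates from this study.

def g_coefficient(var_p, var_ps, var_pr, var_psr_e, n_scenarios, n_raters):
    """Projected relative generalizability (E rho^2) when scores are averaged
    over n_scenarios scenarios and n_raters raters."""
    relative_error = (var_ps / n_scenarios
                      + var_pr / n_raters
                      + var_psr_e / (n_scenarios * n_raters))
    return var_p / (var_p + relative_error)

# Placeholder components in which scenario-related variance dominates rater-related variance.
components = dict(var_p=0.20, var_ps=0.60, var_pr=0.05, var_psr_e=0.15)

for n_s in (2, 6, 12, 24):
    for n_r in (1, 2, 4):
        coefficient = g_coefficient(**components, n_scenarios=n_s, n_raters=n_r)
        print(f"scenarios={n_s:2d}  raters={n_r}  E rho^2 = {coefficient:.2f}")
```

With components of this shape, adding scenarios raises the projected coefficient far more than adding raters does, mirroring the pattern we observed.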
To improve reliability, the problem of context specificity can be addressed in a number of ways, including shortening the scenarios (to allow for the collection of more performance samples) and making scenario content more generic. A study of junior anesthesiology trainees that used a behaviorally anchored rating system to score seven 15-minute scenarios achieved a generalizability coefficient of 0.817. However, shortening the scenarios, while increasing the number of performance samples, could have a negative impact on validity. One of the strengths of utilizing longer scenarios is that they more accurately represent the clinical environment, thus allowing for the assessment of patient management strategies over a realistic, evolving event.
We intentionally scored behavioral and technical skills separately, hypothesizing that behavioral skills would be less content specific than technical skills. Our results did not support this hypothesis; behavioral performance was as scenario-specific as technical performance. While one might expect behavioral skills to be more generalizable across different patient encounters, our scenarios included communication with various providers, including a first-responder anesthesiologist, other physicians, and various healthcare professionals. It may be that, unlike typical standardized patient scenarios that measure doctor-patient communication26, the generalizability of communication skills for critical events depends on the context and criticality of the patient presentation, as well as on the particular person or persons present.
Our results could be interpreted to mean that a robust simulation-based high-stakes performance assessment for practicing anesthesiologists would be challenging and, perhaps, impractical. We hesitate to draw such a firm conclusion because of the limited number of samples for each subject in this analysis. Regardless of the practicality of simulation for high-stakes assessment, formative assessment of individual performance in these kinds of longer, more complex critical event scenarios still has considerable value for individuals as well as for learning how clinicians perform in general. Numerous studies have shown that simulation-based medical education fosters self-reflection and identification of performance gaps27–29. As part of ongoing professional improvement, providing feedback to individual physicians about their management of specific clinical emergencies is likely to have a positive impact on the quality of their subsequent patients’ care. Additionally, standardized, learner-specific technical and behavioral feedback would likely have a greater impact on the learner’s awareness of their knowledge and performance gaps for a particular event than self-assessment. This use of simulation could be initiated using the scenarios and assessment tools we have developed. Objective, specific feedback should have a positive long-term impact on the quality of patient care delivered by individuals who participate in these formative, simulation-based assessments30.
Although there have been numerous changes in undergraduate medical education and residency training guidelines, “graduate medical education (GME) lacks a data-driven feedback system to evaluate how residency-level competencies translate into successful independent practices...”31. Simulation-based performance data from practicing clinicians could be aggregated to inform modifications in educational and training programs to address specific performance deficiencies across specialties. The impact of this approach on the profession and our patients might actually be greater than that of administering high-stakes summative examinations, because the goal would be to raise the performance of the entire profession rather than to identify and restrict low performers from practicing.
Our study had a number of limitations. First, only a small group of participants consented to being studied as the primary provider in two scenarios and, to the extent that these participants are not representative of practicing anesthesiologists, the generalizability of our findings may be affected. A larger-scale study, in which participants manage more scenarios, would better quantify the effect of task sampling on the reliability of the scores. Further, our study was limited to two independent ratings of each scenario. While rater effects should tend to cancel out with sufficient numbers of scenarios and raters, we were not able to investigate this adequately. Future studies specifically designed to assess the numbers of scenarios and raters needed to achieve adequate reliability for high-stakes assessment should incorporate a design in which more participants manage a larger number of encounters scored by more raters.
Second, the study was embedded within a required formative educational experience for board-certified anesthesiologists28, and this affected the design of the scenarios, which were found to have some differences in difficulty. Although this may be attributable to the clinical problem being managed, it may also reflect a scenario that was not optimally designed or administered and hence was more difficult for the participants to interpret and manage. Because the cases were designed primarily for formative education, their content, timing, and delivery may have been shaped by educational rather than assessment priorities. Thus, our results may not fully generalize to a high-stakes assessment setting, where both individual factors (e.g., motivation) and environmental factors could be quite different.
Despite these limitations, our study showed that performance in complex manikin-based simulation encounters is context specific. The administration of a larger number of scenarios would yield a more reliable assessment of an individual’s clinical performance but would be logistically challenging and would increase costs. However, the use of relatively few scenarios still allows for the provision of individual, case-specific feedback to clinicians. Given the rarity of some clinical presentations, the performance data, in aggregate, could also inform quality improvement initiatives, including focused training programs and educational activities.