The background and context criteria viewed as most important to use in clinical simulation to generate evidence for SaMD were a clear description of the SaMD being evaluated, including its purpose and intended users, and a description and justification of the simulation performed, alongside any other research being conducted to evaluate the SaMD. While remaining important, the criteria for appropriately declaring sources of funding and other conflicts of interest scored less highly. In terms of overall study design criteria, participants viewed it as important that potential limitations of the study design, and any associated biases, are discussed. Similarly, participants noted the importance of discussing strategies to minimize potential study biases, particularly regarding issues of equity (e.g., high-risk patient profiles, racial disparities). Again, while consensus was reached to include information on how digital literacy is considered in the study design, participants did not rate this as highly as the other areas. The study population criteria rated as most important included the eligibility criteria for clinicians who took part in the clinical simulation being representative of the intended end users (e.g., staff level, qualification, experience) and transparency around the number of clinicians who took part in the simulation.
The highest rated delivery of the simulation criteria included the need to describe the environment in which the simulation took place (e.g., physical/virtual, type of facility), the equipment used, the initial orientation and training provided to clinicians before taking part, and how the SaMD was presented to clinicians beforehand. The fidelity criteria rated as most important were that the clinical simulation has high conceptual fidelity matching the intended use of the SaMD, uses high-fidelity synthetic patient cases, and has high clinical scenario fidelity. Beyond these elements, participants highly rated the need to describe the methodology and rationale for developing the synthetic patient cases, the overall representativeness of the synthetic patient cases, their potential limitations, and potential data biases in their development. On software and AI, participants viewed all criteria as highly important, including the need to describe any continuous machine learning (ML) algorithms embedded in the SaMD, covering their design and development, to ensure they are reviewed at regular intervals to monitor changes, and to describe and justify any software updates made to the SaMD since the clinical simulation study.
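As one illustration of what reviewing a continuous ML algorithm "at regular intervals" might involve in practice, the sketch below re-scores a model against a frozen reference set and flags drift from the performance documented at the clinical simulation. This is a minimal, hypothetical example, not a prescribed monitoring method: the function names, baseline value, and alert threshold are all assumptions for illustration.

```python
# Hedged sketch of periodic performance monitoring for a continuously
# learning SaMD model. All names (current_model, load_reference_cases,
# the baseline and threshold values) are hypothetical illustrations,
# not a real API or a mandated procedure.
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.91      # performance documented at the clinical simulation
ALERT_THRESHOLD = 0.05   # maximum tolerated drop before escalation

def scheduled_review(current_model, load_reference_cases):
    """Re-score a fixed, held-out reference cohort at each review
    interval and flag any drift from the documented baseline."""
    X_ref, y_ref = load_reference_cases()  # frozen reference cohort
    auc = roc_auc_score(y_ref, current_model.predict_proba(X_ref)[:, 1])
    drift = BASELINE_AUC - auc
    if drift > ALERT_THRESHOLD:
        # In practice this would trigger change control and documentation
        # of the update, per the software and AI criteria described above.
        raise RuntimeError(f"Performance drift detected: AUC fell to {auc:.3f}")
    return auc
```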
The most important study analysis criteria included the need to clearly define primary and secondary outcome measures, with a rationale and justification for selecting them, and to report the impacts of any unintended consequences (e.g., harm) from the study, the data analysis methods, the generalizability of the findings, and the results of a sensitivity analysis to assess the robustness of the clinical simulation findings.
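As a concrete, hypothetical illustration of the sensitivity analysis criterion, a simple bootstrap over a simulation's primary outcome shows whether the headline result is robust to sampling variation across cases. The outcome data below are placeholders, not study data; this is one possible approach among many.

```python
# Hedged sketch of a bootstrap sensitivity analysis over a primary
# outcome from a clinical simulation (e.g., per-case decision accuracy).
import numpy as np

rng = np.random.default_rng(42)
outcomes = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1])  # 1 = correct decision (hypothetical)

# Resample cases with replacement to estimate the variability of the mean.
boot_means = np.array([
    rng.choice(outcomes, size=len(outcomes), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"primary outcome: {outcomes.mean():.2f} (95% bootstrap CI {lo:.2f}-{hi:.2f})")
```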
The inclusion of the majority (43/55) of criteria presented to the Delphi participants suggests that substantial data collection and reporting need to be considered and executed if clinical simulation is to be increasingly used to authorise medical devices for market based in part or exclusively on validation simulations. However, the benefits of such evidence generation and reporting are also clear; evidence can be used to achieve regulatory authorisation and, post-market, to detect errors and validate technologies11. Developing the SIROS framework with guidance across the 43 criteria presented in this study can enable manufacturers to work methodically on evidence generation and regulatory submissions for more streamlined SaMD approval. A next step will be to develop an accompanying checklist that makes the SIROS framework actionable for SaMD developers.
While our study is the first of its kind in collecting consensus on how to assess clinical simulation used to generate evidence on SaMD, participants viewed many of the criteria as important, which is complementary to findings from other contemporary studies on methods to generate evidence on DHTs. More generally, participant comments on the importance of such guidance are in line with research by Day et al.12, who note that many digital health start-ups have limited clinical robustness, as measured by regulatory filings and clinical trials. The authors note a lack of meaningful clinical validation for almost half of digital health companies (44% had a clinical robustness score of 0), highlighting the need for guidance such as the SIROS framework. In recent years, national guidance and regulations have been developed to enable rapid assessment of DHTs. In the United States, the Digital Medicine Society has created a regulatory compass tool entitled RegPath, with input from the FDA, to enable improved understanding of whether a specific DHT falls within FDA regulation, and if so, which regulatory pathway is relevant13. In Europe, German regulators have developed a fast-track pathway for digital health applications (in German, DiGA) to be reimbursed by statutory health insurances14,15, a model that will now be employed in other European countries. Developing a framework for assessing clinical simulation can therefore be seen as a next step to complement ongoing regulatory developments internationally, where there are limited tailored guidelines or frameworks at present.
The seven areas and associated criteria agreed by the participants are in line with existing literature and research studies that utilized clinical simulation to evaluate DHTs. Gardener et al. used clinical simulation to evaluate a clinical decision support tool for matching cancer patients to clinical trials16. Participants in that research stated that they were provided sufficient guidance on the exercises and enough clinical information in the synthetic patient cases, though a small number noted that they would have preferred more histology information. Such findings suggest the importance of providing regulators with information on the initial orientation and training given to clinicians before taking part, as a key factor in the clinical simulation's success. Gardener et al. also reported that participants noted that a lack of familiarity with the novel solution could potentially challenge the clinical simulation approach. However, as there are few published studies on the role of clinical simulation in evaluating DHTs, there is an urgent need for further research in this area that both utilizes the areas developed through our research and validates them in practice. Similarly, there is the potential for the seven-dimension SIROS framework developed through this research to be utilized in the evaluation of other types of DHTs, where many similar issues will be of concern to regulators.
Echoing the well-researched importance of fidelity in simulation used in health education17,18, participants in the Delphi identified this as a key area where researchers were required to outline how the clinical simulation sought high fidelity with the planned future use of the SaMD. The specific context in which the SaMD is intended to be used must be considered when planning the clinical simulation to enable accurate reporting, with particular attention paid to high-fidelity synthetic patient cases and their implications for representativeness and equity. In this regard, the evidence required is similar to that of all DHTs developed with AI/ML methods. For example, the Good Machine Learning Practice for Medical Device Development: Guiding Principles, developed by the FDA, UK Medicines and Healthcare products Regulatory Agency (MHRA) and Health Canada in 2021, encourage good practice in medical device development using AI/ML, including the reduction of bias through representative clinical study participants and datasets19. In cases where the SaMD being simulated uses continuous self-learning algorithms, the Delphi participants highlighted the need to report plans for continuous monitoring and other steps to maintain quality and safety as part of defining, controlling, and improving the software life cycle processes outlined in ISO/IEC 1220720, and to adhere to relevant local legislation and regulatory guidance. Such views were echoed by Carolan et al., who note the need for international standards and guiding principles addressing the uniqueness of SaMD with a continuous learning algorithm6.
The importance of presenting issues of bias and equity through the simulation process is a central element of how to assess clinical simulation being used to generate evidence on SaMD, according to our research participants. This is perhaps unsurprising given the increasing research outlining the potential risk of bias and increased health inequities associated with poorly developed or implemented DHTs, including AI/ML7,21-23. Guo et al. identified a range of relevant tools and frameworks providing guidance on different aspects of bias in evidence generation studies24. These include the Quality In Prognosis Studies tool (QUIPS), the Cochrane risk-of-bias tool for randomized trials (RoB2), the Prediction model Risk Of Bias ASsessment Tool (PROBAST), and the Risk Of Bias In Non-randomized Studies of Interventions tool (ROBINS-I). Such frameworks offer SaMD developers a ready source of information to address and report on issues related to bias as part of clinical simulation research during submissions for regulatory approval. In response to the challenge of adaptive technologies, the proposed FDA framework for modifications to AI/ML-based SaMD further seeks to ensure safety and effectiveness are maintained25.
Current tools and guidance remain at best a stopgap until regulatory environments and international guidelines can be developed to ensure SaMD is developed and deployed with a clearer understanding of its impact on quality and safety throughout the software life cycle. To advance this process, manufacturers should engage with regulators and propose clinical simulation methods for the purpose of regulatory approval. While there are a few existing examples of clinical simulation data being used, in part or wholly, as evidence for regulatory approvals, developing more real-life use cases will enable the development of best-practice guidelines and lessons learned that will benefit all stakeholders. Regulators and notified bodies should also work with manufacturers on the application of clinical simulation. Providing greater clarity on what they would like to see from data, and how it can best be collected and presented for SaMD approval, will enable manufacturers to be increasingly targeted in their approach to clinical simulation.
Limitations
Delphi studies traditionally begin with an open-ended question, to which participants provide numerous free-text responses. The results of this initial idea-generation stage are then analysed, summarized and presented in subsequent rounds26. However, some studies have taken a different approach, where pre-existing information is initially presented to participants, who are then asked their opinion of it27,28. This approach was taken in this study, as outlined in the procedure. It can be justified as it prepares participants for the upcoming rounds and can reduce the potentially overwhelming task of data analysis. However, it carries limitations, including potential bias in responses and the exclusion of relevant ideas that participants might have contributed had they been asked in an open-ended format.
There was some confusion apparent in the panelists' qualitative comments, particularly in the scoping round, about the terminology used and the context of the questions. For example, panelists most commonly misinterpreted the terms 'study participants', 'intervention' and 'evaluation opinion'. This may have been due to the complexity of the use case scenario and the absence of practical examples to aid understanding. To overcome this, an analysis of the comments regarding lack of clarity was carried out and used to improve the wording of the questions and criteria for round 1.
Despite efforts to recruit a global cohort representative of both high-income and low- and middle-income countries, most participants came from high-income countries, particularly the UK. This may have arisen from several factors, such as the study sponsor's informal networks being predominantly based in the UK, or the global SaMD community having a greater base in high-income countries. Regardless, further research is required to ensure that the study results are applicable to other settings and country contexts.
The two pre-defined rules for deciding whether to retain items between successive rounds may have led to some potentially important items being removed unnecessarily. As mentioned previously, in the quantitative analysis of rounds 1 and 2, an item was removed if more than 10% of panelists rated it 'not important' or 'not important at all', or if fewer than 60% of panelists rated it 'important' or 'very important'. The analysis of round 1 identified 5 items that failed the former rule while satisfying the latter (i.e., at least 60% of panelists rated them 'important' or 'very important', but more than 10% rated them 'not important' or 'not important at all'). Because each failed one of the pre-determined rules, they were excluded from round 2. These 5 items may therefore have contained important information for regulators to consider when evaluating clinical simulation methods, yet were excluded.
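To make the interaction of the two rules concrete, the retention logic described above can be expressed as a short check over each item's ratings. This is a minimal sketch assuming a five-point rating scale; the labels and example data are hypothetical, not the study's actual analysis code.

```python
# Hedged sketch of the rounds 1-2 retention rule described above.
# Rating labels and the example data are hypothetical placeholders.
from collections import Counter

NOT_IMPORTANT = {"not important", "not important at all"}
IMPORTANT = {"important", "very important"}

def retain_item(ratings: list[str]) -> bool:
    """Apply both pre-defined rules: an item is retained only if
    <=10% of panelists rated it unimportant AND >=60% rated it important."""
    counts = Counter(r.lower() for r in ratings)
    n = len(ratings)
    frac_not_important = sum(counts[label] for label in NOT_IMPORTANT) / n
    frac_important = sum(counts[label] for label in IMPORTANT) / n
    return frac_not_important <= 0.10 and frac_important >= 0.60

# Example: an item rated important by 70% of panelists is still dropped
# when more than 10% rated it unimportant -- the situation that affected
# the 5 borderline items discussed above.
ratings = ["very important"] * 7 + ["not important"] * 2 + ["neutral"]
print(retain_item(ratings))  # False: 20% rated it unimportant
```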