Search results
The search in PubMed, Scopus, and Web of Science yielded a total of 1397 hits. After the elimination of 567 duplicates, 821 records were screened by title and abstract. We checked 193 full-text publications against the predefined eligibility criteria, resulting in the inclusion of 20 studies. In addition, four studies were identified through the search of the reference lists of included articles and the complementary Google search. In total, 24 studies were included for further investigation. Details of the article screening and selection are shown in Fig. 1. The list of excluded studies is provided in Supplementary file 2.
Main characteristics of the publications
Table 1 summarizes the 24 articles included for further analysis. The first guideline, CHARMS (28), which focuses on the appraisal of prediction modelling studies for systematic reviews, was published in 2014, followed in 2016 by the guideline of Luo et al. (29) on the reporting of ML predictive models. Most guidelines were published subsequently, in 2020 (n = 9, 38%) and 2021 (n = 9, 38%), followed by 2022 (n = 3, 13%) and a single guideline in 2023.
The first authors were most often affiliated with institutions in the United States (n = 9, 38%), followed by the United Kingdom (n = 4, 17%), Australia (n = 2, 8%), and Canada (n = 2, 8%). The remaining guidelines had first authors affiliated with institutions in France, Italy, the Netherlands, Sweden, Spain, Germany, and Switzerland. The 24 guidelines were published in 22 journals, with Nature Medicine (n = 3, 13%) being the most common.
The CHARMS guideline (28) received the highest number of Google Scholar citations (1080), followed by Luo et al. (29) with 571 citations. Four further guidelines, CONSORT-AI (30), SPIRIT-AI (31), CLAIM (32), and MI-CLAIM (33), attracted more citations than the average of 178, with 523, 444, 442, and 212 citations, respectively. All highly cited guidelines included a point-by-point checklist.
Table 1
Summary of the 24 included reporting guidelines for medical AI studies
| First author (Year) / Country of affiliation | Name of the guideline / N of reporting items | Focus of the guideline | Journal | Purpose of the guideline | Google Scholar citations^a |
|---|---|---|---|---|---|
| Narrative guidelines | | | | | |
| Buvat (2021) / France (34) | T.R.U.E. / 4 | Nuclear medicine | The Journal of Nuclear Medicine | To aid the identification of studies reporting ground-breaking developments in AI-based research in nuclear medicine. | 15 |
| Stevens (2020) / US (35) | NS^b / 19 | ML in clinical research | Circulation: Cardiovascular Quality and Outcomes | A guideline for the transparent and systematic presentation of outcomes from ML analyses, addressed primarily to clinical researchers and designed to supplement current clinical reporting requirements. | 90 |
| Faes (2020) / UK (36) | NS / 8 | ML clinical studies | Translational Vision Science & Technology | To improve the quality of research on the therapeutic use of ML by equipping clinicians and researchers with the tools needed to conduct their own rigorous assessments. | 111 |
| Bates (2020) / US (37) | NS / 8 | Clinical research of AI-based interventions | Annals of Internal Medicine | Suggested reporting standards to enable assessment of the incremental benefits of ML and AI and to remove barriers to their clinical adoption. | 53 |
| General reporting checklists | | | | | |
| Al-Zaiti (2022) / US (21) | ROBUST-ML / 30 | ML in clinical studies | European Heart Journal - Digital Health | To increase physicians' understanding of ML by equipping them with the information and tools needed to comprehend and evaluate clinical research focusing on ML. | 10 |
| Cabitza (2021) / Italy (19) | NS / 55 | ML in clinical studies | International Journal of Medical Informatics | To qualitatively analyse the scientific rigor of a medical ML contribution and the reliability of its findings. | 93 |
| Olczak (2021) / Sweden (38) | CAIR / 40 | Clinical AI research | Acta Orthopaedica | Clinical reporting guidelines for AI and ML, with guidance on selecting appropriate outcome indicators. | 33 |
| Scott (2021) / Australia (39) | NS / 12 | ML algorithms in healthcare | BMJ Health Care Informatics | To evaluate the clinical usefulness of ML technologies in healthcare. | 58 |
| Hernandez-Boussard (2020) / US (40) | MINIMAR / 21 | AI in healthcare | Journal of the American Medical Informatics Association | To facilitate the diffusion of algorithms across healthcare systems, enable transparency to address biases and unintended effects, and encourage external validation and the use of secondary resources. | 145 |
| Norgeot (2020) / US (33) | MI-CLAIM / 19 | Clinical AI modelling | Nature Medicine | To propose a reporting baseline that ensures transparency and practicality in the use of AI in healthcare. | 212 |
| Luo (2016) / Australia (29) | NS / 52 | ML predictive models in biomedical research | Journal of Medical Internet Research | To provide guidelines for the application of ML-based prediction models in healthcare settings. | 571 |
| Checklists for specific study designs | | | | | |
| Liu (2020) / UK (30) | CONSORT-AI / 49 | Randomised clinical trials of interventions with an AI component | BMJ; Nature Medicine; Lancet Digital Health | To establish a standard for the reporting of clinical trials of AI-based interventions. | 523 |
| Rivera (2020) / UK (31) | SPIRIT-AI / 66 | Clinical study protocols involving an AI method | BMJ; Nature Medicine; Lancet Digital Health | To enhance the comprehensiveness of clinical trial protocol documentation. | 444 |
| Vasey (2022) / UK (24) | DECIDE-AI / 38 | Early-stage clinical evaluation of AI-driven decision support systems | BMJ; Nature Medicine | To facilitate the evaluation and the reproducibility of results of healthcare studies using AI-based decision support systems. | 96 |
| Moons (2014) / Netherlands (28) | CHARMS / 35 | Systematic reviews of prediction modelling studies | PLOS Medicine | To assist with the formulation of the review question and the evaluation of all types of primary prediction modelling studies in systematic reviews. | 1080 |
| Checklists for specific clinical areas | | | | | |
| Daneshjou (2021) / US (41) | CLEAR Derm / 25 | Image-based AI in dermatology | JAMA Dermatology | To synthesize current minimum requirements into a guide for dermatological AI developers and reviewers. | 42 |
| Haller (2022) / Switzerland (23) | R-AI-DIOLOGY / 21 | AI in clinical neuroradiology | Neuroradiology | To assist neuroradiologists in evaluating AI tools for clinical neuroradiology applications. | 5 |
| Kwong (2021) / Canada (42) | STREAM-URO / 29 | ML in urology | European Urology Focus | To improve ML literacy in urology by establishing a standard for reporting ML applications. | 18 |
| Mongan (2020) / US (32) | CLAIM / 42 | AI in medical imaging | Radiology: Artificial Intelligence | A recommended approach to reporting studies of AI in medical imaging. | 442 |
| Mörch (2020) / Canada (43) | Canada protocol / 36 | AI in mental health | Artificial Intelligence in Medicine | To identify and respond more effectively to ethical challenges of AI, focusing on mental health and suicide prevention. | 13 |
| Schwendicke (2021) / Germany (44) | NS / 31 | AI in dental research | Journal of Dentistry | Instructions for the design, conduct, and reporting of dental AI research. | 102 |
| Sengupta (2020) / US (45) | PRIME / 28 | Cardiovascular imaging-related ML | Cardiovascular Imaging | A comprehensive guide and associated checklist to ensure consistent reporting of ML models used in cardiovascular imaging studies. | 97 |
| Naqa (2021) / US (46) | CLAMP / 26 | AI in medical physics | Medical Physics | A new checklist (CLAMP) to ensure rigorous and reproducible AI/ML research in medical physics. | 20 |
| Cerdá-Alberich (2023) / Spain (22) | MAIC-10 / 10 | AI in medical imaging | Insights into Imaging | A guide for examining publications on AI in medical imaging, with a focus on study design and evaluation. | 4 |

^a As of July 18, 2023; ^b NS: not specified
Guideline development process
Thirteen articles (54%) provided methodological details about the guideline development process. Ten guidelines (42%) were registered on the EQUATOR website. Two guidelines (8%) were extensions of existing EQUATOR guidelines (CONSORT-AI, SPIRIT-AI); the remaining 22 (92%) were standalone guidelines. Of the seven main components of guideline development defined by the EQUATOR Network, a literature review, a Delphi survey, an expert consensus meeting, and pilot testing were reported by 13 (54%), 6 (25%), 7 (29%), and 6 (25%) guidelines, respectively. Eleven guidelines (46%) reported a funding source, four referred, albeit vaguely, to a future update policy, and six (25%) were adopted or endorsed by a journal or professional organisation. Guideline development involved on average 2.3 (SD 2.0, range 0–7) of the seven investigated components.
The development process was most comprehensive for CONSORT-AI (30), with all seven development steps completed, followed by SPIRIT-AI (31), which involved all key steps except endorsement by a journal or professional organisation. The development of DECIDE-AI (24) was also comprehensive, but no reference was made to plans for future updates.
The mean (SD) number of development steps was 0.5 (0.6) for narrative guidelines, 1.0 (1.2) for general reporting checklists, 5.3 (1.7) for checklists for specific study designs, and 2.6 (1.6) for checklists for specific clinical areas, with a significant difference between the groups (ANOVA, F(3, 20) = 10.53, p < 0.001). Registration on the EQUATOR Network and the availability of funding were associated with more comprehensive development processes, whereas endorsement by journals or professional societies was not an indicator of methodological rigor. The mean (SD) number of development components (excluding the grouping variable) was 3.4 (2.3) in guidelines registered on EQUATOR versus 1.4 (1.4) in unregistered ones (Welch's t-test, p = 0.031). Funded guidelines featured 2.6 (1.9) development steps, whereas non-funded guidelines featured 1.1 (1.3) (Welch's t-test, p = 0.034). The difference between endorsed and non-endorsed guidelines was not significant (Welch's t-test, p = 0.95). Neither EQUATOR registration nor endorsement was associated with higher citation counts; however, the Google Scholar citation count was higher for funded guidelines (mean 303.8, SD 102.9) than for those without funding (mean 71.9, SD 17.1) (Welch's t-test, p = 0.049). Further development details are reported in Table 2.
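As an illustration of the group comparisons reported above, the following Python sketch shows how the ANOVA across guideline types and the Welch's t-tests could be computed; the arrays hold hypothetical per-guideline development-step counts, not the actual extracted data.

```python
# Illustrative sketch only: the development-step counts below are hypothetical
# placeholders, not the values extracted in this review.
from scipy import stats

# Development steps (0-7) per guideline, grouped by guideline type.
narrative     = [0, 0, 1, 1]
general       = [0, 1, 0, 2, 0, 1, 3]
study_design  = [7, 6, 5, 3]
clinical_area = [3, 0, 2, 4, 1, 3, 4, 0, 4]

# One-way ANOVA across the four guideline types (reported above as F(3, 20)).
f_stat, p_anova = stats.f_oneway(narrative, general, study_design, clinical_area)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Welch's t-test (unequal variances), e.g. funded versus non-funded guidelines.
funded     = [2, 3, 5, 6, 7, 1, 3, 4, 2, 5, 4]
non_funded = [0, 0, 1, 1, 0, 2, 1, 0, 3, 1, 2, 0, 1]
t_stat, p_welch = stats.ttest_ind(funded, non_funded, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_welch:.3f}")
```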
Table 2
Development details of the included guidelines
| Studies | Development methods reported | Registered on the EQUATOR website | Literature review | Delphi survey | Expert consensus meeting | Pilot testing | Funded | Update policy stated | Adopted/endorsed by a journal or society |
|---|---|---|---|---|---|---|---|---|---|
| Narrative guidelines | | | | | | | | | |
| T.R.U.E. / Buvat (2021) (34) | × | × | × | × | × | × | × | × | ✓ |
| Stevens (2020) (35) | × | ✓ | × | × | × | × | × | × | × |
| Faes (2020) (36) | × | × | × | × | × | × | × | × | × |
| Bates (2020) (37) | × | × | × | × | × | × | × | × | × |
| General reporting checklists | | | | | | | | | |
| ROBUST-ML / Al-Zaiti (2022) (21) | × | × | × | × | × | × | ✓ | × | × |
| Cabitza (2021) (19) | ✓ | × | ✓ | × | × | × | × | × | ✓ |
| CAIR / Olczak (2021) (38) | × | × | × | × | × | × | ✓ | × | × |
| Scott (2021) (39) | ✓ | × | ✓ | × | × | × | × | × | × |
| MINIMAR / Hernandez-Boussard (2020) (40) | × | × | × | × | × | × | × | × | × |
| MI-CLAIM / Norgeot (2020) (33) | × | ✓ | × | × | × | × | × | × | × |
| Luo (2016) (29) | ✓ | ✓ | ✓ | ✓ | × | × | ✓ | × | × |
| Checklists for specific study designs | | | | | | | | | |
| CONSORT-AI / Liu (2020) (30) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SPIRIT-AI / Cruz Rivera (2020) (31) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × |
| DECIDE-AI / Vasey (2022) (24) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × |
| CHARMS / Moons (2014) (28) | ✓ | × | ✓ | × | ✓ | ✓ | ✓ | × | × |
| Checklists for specific clinical areas | | | | | | | | | |
| CLEAR Derm / Daneshjou (2021) (41) | ✓ | × | ✓ | × | ✓ | × | ✓ | × | × |
| R-AI-DIOLOGY / Haller (2022) (23) | × | × | × | × | × | × | × | × | × |
| STREAM-URO / Kwong (2021) (42) | ✓ | ✓ | ✓ | × | ✓ | × | × | ✓ | × |
| CLAIM / Mongan (2020) (32) | × | ✓ | × | × | × | × | ✓ | × | ✓ |
| Canada protocol / Mörch (2020) (43) | ✓ | × | ✓ | ✓ | × | × | × | × | × |
| Schwendicke (2021) (44) | ✓ | ✓ | ✓ | ✓ | × | ✓ | × | × | ✓ |
| PRIME / Sengupta (2020) (45) | ✓ | ✓ | ✓ | × | × | × | ✓ | ✓ | ✓ |
| CLAMP / Naqa (2021) (46) | × | × | × | × | × | × | × | × | ✓ |
| MAIC-10 / Cerdá-Alberich (2023) (22) | ✓ | × | ✓ | × | ✓ | ✓ | ✓ | × | × |
Main characteristics of the guidelines
Four papers (17%) were narrative guidelines (34–37) and twenty (83%) comprised a checklist. Seven checklists (30%) were formulated as general AI reporting standards without a specific focus on any particular research domain (19, 21, 29, 33, 38–40).
Four guidelines (17%) explicitly stated their focus on distinct study designs, encompassing randomised controlled trials (30), clinical trial protocols (31), early-stage clinical evaluation of AI-driven decision support systems (24), and systematic reviews of prediction modelling studies (28). Two of these are AI-related extensions of well-established checklists, namely CONSORT for randomised trials (11) and SPIRIT for clinical trial protocols (47).
Nine guidelines (38%) were designed to address specific clinical areas, including urology (42), neuroradiology (23), mental health (43), medical physics (46), medical imaging (22, 32), dermatology (41), dentistry (44), and cardiovascular imaging (45). Although they discuss general AI-related standards, five further guidelines (21%) could be indirectly associated with a clinical area through the publishing journal or the examples used for elaboration: nuclear medicine (34), cardiovascular medicine (21, 35), ophthalmology (36), and orthopaedics (38). Some areas overlapped, such as cardiovascular imaging, medical imaging, and cardiovascular medicine.
Of the publications analysed, 20 (74%) were designed for authors, reviewers, and editors, eight (30%) were tailored to clinicians and model users, and three (11%) were intended for application developers; some guidelines addressed more than one audience. Two guidelines (7%) did not specify their intended audience (29, 37). Table 3 provides an overview of the focus areas covered by the guidelines and their respective target audiences.
Table 3
Characterisation of guidelines by their focus areas and target audiences
Analysis of the structure of reporting guidelines
The domains and areas covered by the guidelines and checklists (as classified in the original publications), together with the number of respective items, are summarized in Table 4. The structure and level of detail of both the narrative guidelines and the point-by-point checklists showed considerable heterogeneity.
Nine guidelines followed the usual IMRAD (Introduction, Methods, Results, and Discussion) structure of research articles, with one or more additional sections (e.g., Title/Abstract, Statements/Other information) or omissions (22, 24, 29–32, 38, 42, 46). The number of reporting items within the IMRAD group ranged from 10 (22) to 66 (31), with a median of 40.
Albeit with greater variation, the structure of the other 12 guidelines followed the ML pipeline of clinical AI studies as summarized by MI-CLAIM (i.e., Study design, Data and optimisation, Model performance, Model examination, Reproducibility), with frequent additions of partial clinical information domains (e.g., Participants, Outcomes, or Clinical deployment) or omissions (19, 21, 23, 28, 33, 35–37, 39–41, 45). The number of items within the ML pipeline group ranged from 8 (36, 37) to 55 (19), with a median of 21.
The third group comprised three guidelines whose structures fit neither the IMRAD nor the ML pipeline framework (34, 43, 44), with item counts ranging from 4 (34) to 36 (43).
The structure of the checklists also differed considerably within subgroups. The most similar structures were observed across the checklists for specific study designs, as three of the four (CONSORT-AI, SPIRIT-AI, DECIDE-AI) followed IMRAD with additional sections such as Title/Abstract and Statements/Other information. Although the fourth checklist in this subgroup (CHARMS) was also designed to be used by authors, namely researchers performing systematic reviews, its structure corresponded more closely to the ML pipeline, focusing on details of the data and the model (26 items) with less emphasis on participants and results (9 items).
Of the general checklists, two followed the IMRAD structure (29, 38) and five the ML pipeline (19, 21, 33, 39, 40). Among the checklists for specific clinical areas, four followed IMRAD (22, 32, 42, 46) and three the ML pipeline (23, 41, 45). CLAIM integrated the ML workflow elements within the IMRAD format (32).
Altogether, the 24 guidelines contained 704 items in 224 sections. Many items were complex statements covering more than one reporting element. The mean number of items (i.e., the depth of detail) differed significantly by the type and focus of the guidelines: the mean (SD) item count was 9.8 (6.4) for narrative guidelines, 32.7 (16.8) for general checklists, 47.0 (14.0) for checklists for specific study designs, and 27.6 (9.0) for checklists for specific clinical areas (ANOVA, F(3, 20) = 6.31, p = 0.003). However, we found no association between the depth of detail and guideline structure, with mean (SD) item counts of 39.1 (16.3) for guidelines with an IMRAD structure, 23.4 (13.0) for those following the ML pipeline, and 23.7 (17.2) for guidelines with other structures (ANOVA, F(2, 21) = 3.17, p = 0.063). Furthermore, we found no association between the type and focus of the guidelines (i.e., narrative guideline, general checklist, study design specific checklist, or clinical area specific checklist) and their structure (Fisher's exact test, p = 0.272). The structure of the guidelines and the comprehensiveness of their development were, however, associated, with a mean (SD) of 2.8 (1.9), 0.9 (1.2), and 2.3 (1.5) development steps for guidelines with IMRAD, ML pipeline, and other structures, respectively (ANOVA, F(2, 21) = 3.72, p = 0.041). The number of development steps and the number of items showed a moderate positive correlation (r = 0.56, p = 0.005).
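The per-group descriptive statistics above can be recomputed directly from the item counts in Table 1; the minimal Python sketch below does this and runs the corresponding one-way ANOVA. The correlation with development steps is only indicated, since it additionally requires the per-guideline tallies from Table 2.

```python
# Item counts per guideline, taken from Table 1 and grouped by guideline type.
import numpy as np
from scipy import stats

narrative     = [4, 19, 8, 8]
general       = [30, 55, 40, 12, 21, 19, 52]
study_design  = [49, 66, 38, 35]
clinical_area = [25, 21, 29, 42, 36, 31, 28, 26, 10]

groups = {"narrative": narrative, "general": general,
          "study design": study_design, "clinical area": clinical_area}
for name, items in groups.items():
    # Sample SD (ddof=1), as usually reported for small groups.
    print(f"{name}: mean = {np.mean(items):.1f}, SD = {np.std(items, ddof=1):.1f}")

# One-way ANOVA of item counts across the four guideline types.
f_stat, p_val = stats.f_oneway(narrative, general, study_design, clinical_area)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")

# The Pearson correlation between development steps and item counts would use the
# per-guideline development tallies derived from Table 2, e.g.:
# r, p = stats.pearsonr(dev_steps, narrative + general + study_design + clinical_area)
```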
Table 4
Structure of the reporting guidelines
For each guideline, the sections (as classified in the original publication) and the overall structure (IMRAD, ML pipeline of clinical AI studies, or other) are listed.

Narrative guidelines

T.R.U.E. / Buvat (2021) (34)
Sections: Four questions with elaboration
• Is it true?
• Is it reproducible?
• Is it useful?
• Is it explainable?
Structure: Other

Stevens (2020) (35)
Sections: Explanations and examples around three areas
• Research Question and ML Justification (6 items)
• Data sources and pre-processing (training and validation data) (7 items)
• Model Training and validation (6 items)
Structure: ML pipeline of clinical AI studies

Faes (2020) (36)
Sections: Eight questions with elaboration
• Was the Study Methodology Prespecified?
• Is the Model Being Evaluated in its Intended Stage in the Care Pathway?
• Do the Authors Provide Sufficient Clarity on How the Data Were Split?
• Are the Image Labels Likely to Reflect the True Disease State?
• How Is Diagnostic Accuracy Reported?
• Is the Dataset Used in Model Development Reflective of the Setting in Which the Model Will Be Applied?
• Is the Output of the Model Interpretable and Can it Be Interrogated? Are Differential Diagnoses and Estimates of Confidence Provided?
• Is the Performance Reproducible and Generalizable?
Structure: ML pipeline of clinical AI studies

Bates (2020) (37)
Sections: Two parallel sections ("recommendations for study design and conduct" and "recommendations for reporting"), each addressing four main issues through broad items that cover multiple elements
• Validation (3 items)
• Uncertainty (1 item)
• Implementation (1 item)
• Data (3 items)
Structure: ML pipeline of clinical AI studies

General reporting checklists

ROBUST-ML / Al-Zaiti (2022) (21)
Sections: Quality items, with corresponding red flags to observe, organised in five domains
• General design considerations (3 items)
• Data quality considerations (6 items)
• Feature engineering considerations (5 items)
• Model development considerations (10 items)
• Considerations for clinical utility (6 items)
Structure: ML pipeline of clinical AI studies

Cabitza (2021) (19)
Sections: Questions, which may cover multiple items, organised around six domains
• Problem understanding (6 questions)
• Data understanding (3 questions and 13 subitems)
• Data preparation (4 questions and 2 subitems)
• Modelling (3 questions)
• Validation (8 questions and 9 subitems)
• Deployment (7 questions)
Structure: ML pipeline of clinical AI studies

CAIR / Olczak (2021) (38)
Sections: Recommendations follow the reporting structure of a research article (5 parts)
• Title and abstract (2 items)
• Introduction (1 item)
• Methods (9 items, 25 subitems)
• Results (1 item)
• Discussion and other information (2 items)
Structure: IMRAD

Scott (2021) (39)
Sections: 12 questions
• What is the purpose of the algorithm? (1)
• How good were the data used to train the algorithm? (2a)
• To what extent were the data accurate and free of bias? (2b)
• Were the data standardised and interoperable? (2c)
• Were there sufficient data to train the algorithm? (3)
• How well does the algorithm perform? (4)
• Is the algorithm transferable to new clinical settings? (5)
• Are the outputs of the algorithm clinically intelligible? (6)
• How will this algorithm fit into and complement current workflows? (7)
• Has use of the algorithm been shown to improve patient care and outcomes? (8)
• Could the algorithm cause patient harm? (9)
• Does use of the algorithm raise ethical, legal or social concerns? (10)
Structure: ML pipeline of clinical AI studies

MINIMAR / Hernandez-Boussard (2020) (40)
Sections: Reporting items and their descriptions organised in four domains
• Study population and setting (4 items)
• Patient demographic characteristics (5 items)
• Model architecture (8 items)
• Model evaluation (4 items)
Structure: ML pipeline of clinical AI studies

MI-CLAIM / Norgeot (2020) (33)
Sections: Statements of good reporting practice organised around five domains in six parts
• Study design: Part 1 (5 items)
• Data and optimisation: Parts 2 and 3 (5 items)
• Model performance: Part 4 (3 items)
• Model examination: Part 5 (5 items)
• Reproducibility: Part 6, choose the appropriate tier of transparency (1 item)
Structure: ML pipeline of clinical AI studies

Luo (2016) (29)
Sections: Topics and reporting elements organised along the reporting structure of a research article (5 parts)
• Title and abstract
  o Title (1 item)
  o Abstract (5 items)
• Introduction
  o Rationale (2 items)
  o Objectives (1 item)
• Methods
  o Describe the setting (2 items)
  o Define the prediction problem (8 items)
  o Prepare data for model building (11 items)
  o Build the predictive model (10 items)
• Results
  o Report the final model performance (6 items)
• Discussion
  o Clinical implications (1 item)
  o Limitations of the model (4 items)
  o Unexpected results during the experiments (1 item)
Structure: IMRAD

Checklists for specific study designs

CONSORT-AI / Liu (2020) (30)
Sections: Reporting items are AI-related extensions to the CONSORT checklist, organised in 25 sections along the reporting structure of a clinical study report (7 parts)
• Title and abstract
  o Title and abstract (2 AI extension items)
• Introduction
  o Background and objectives (2 items + 1 AI extension)
• Methods
  o Trial design (2 items)
  o Participants (2 items + 3 AI extensions)
  o Interventions (1 item + 6 AI extensions)
  o Outcomes (2 items)
  o Sample size (2 items)
• Randomization
  o Sequence generation (2 items)
  o Allocation concealment mechanism (1 item)
  o Implementation (1 item)
  o Blinding (2 items)
  o Statistical methods (2 items)
• Results
  o Participant flow (2 items)
  o Recruitment (2 items)
  o Baseline data (1 item)
  o Numbers analysed (1 item)
  o Outcomes and estimation (2 items)
  o Ancillary analyses (1 item)
  o Harms (1 item + 1 AI extension)
• Discussion
  o Limitations (1 item)
  o Generalisability (1 item)
  o Interpretation (1 item)
• Other information
  o Registration (1 item)
  o Protocol (1 item)
  o Funding (1 item + 1 AI extension)
Structure: IMRAD

SPIRIT-AI / Cruz Rivera (2020) (31)
Sections: Reporting items are AI-related extensions to the SPIRIT checklist, organised in 33 sections along the reporting structure of a clinical study protocol (8 parts)
• Administrative information
  o Title (1 item + 2 AI extensions)
  o Trial registration (2 items)
  o Protocol version (1 item)
  o Funding (1 item)
  o Roles and responsibilities (4 items)
• Introduction
  o Background and rationale (2 items + 2 AI extensions)
  o Objectives (1 item)
  o Trial design (1 item)
• Methods: participants, interventions and outcomes
  o Study setting (1 item + 1 AI extension)
  o Eligibility criteria (1 item + 2 AI extensions)
  o Interventions (4 items + 6 AI extensions)
  o Outcomes (1 item)
  o Participant timeline (1 item)
  o Sample size (1 item)
  o Recruitment (1 item)
• Methods: assignment of interventions (for controlled trials)
  o Sequence generation (1 item)
  o Allocation concealment (1 item)
  o Implementation (1 item)
  o Blinding (masking) (2 items)
• Methods: data collection, management and analysis
  o Data collection methods (2 items)
  o Data management (1 item)
  o Statistical methods (3 items)
• Methods: monitoring
  o Data monitoring (2 items)
  o Harms (1 item + 1 AI extension)
  o Auditing (1 item)
• Ethics and dissemination
  o Research ethics approval (1 item)
  o Protocol amendments (1 item)
  o Consent or assent (2 items)
  o Confidentiality (1 item)
  o Declaration of interests (1 item)
  o Access to data (1 item + 1 AI extension)
  o Ancillary and post-trial care (1 item)
  o Dissemination policy (3 items)
• Appendices
  o Informed consent materials (1 item)
  o Biological specimens (1 item)
Structure: IMRAD

DECIDE-AI / Vasey (2022) (24)
Sections: AI-specific and general reporting items in 17 AI-specific and 10 generic* reporting themes, organised around the reporting structure of a clinical study report (6 parts)
• Title and abstract
  o Title (1 item)
  o Abstract (1 item)*
• Introduction
  o Intended use (2 items)
  o Objectives (1 item)*
• Methods
  o Research governance (1 item)*
  o Participants (3 items)
  o AI system (3 items)
  o Implementation (2 items)
  o Outcomes (1 item)*
  o Safety and errors (2 items)
  o Human factors (1 item)
  o Analysis (1 item)*
  o Ethics (1 item)
  o Patient involvement (1 item)*
• Results
  o Participants (2 items)
  o Implementation (2 items)
  o Main results (1 item)*
  o Subgroup analysis (1 item)*
  o Modifications (1 item)
  o Human-computer agreement (1 item)
  o Safety and errors (2 items)
  o Human factors (2 items)
• Discussion
  o Support for intended use (1 item)
  o Safety and errors (1 item)
  o Strengths and limitations (1 item)*
• Statements
  o Data availability (1 item)
  o Conflicts of interest (1 item)*
Structure: IMRAD

CHARMS / Moons (2014) (28)
Sections: Reporting items organised in 11 domains
• Source of data (1 item)
• Participants (4 items)
• Outcome(s) to be predicted (6 items)
• Candidate predictors (or index tests) (5 items)
• Sample size (2 items)
• Missing data (3 items)
• Model development (5 items)
• Model performance (2 items)
• Model evaluation (2 items)
• Results (3 items)
• Interpretation and discussion (2 items)
Structure: ML pipeline of clinical AI studies

Checklists for specific clinical areas

CLEAR Derm / Daneshjou (2021) (41)
Sections: Items organised under four sections with unnamed subsections
• Data (15 items in 5 subsections)
• Technique (4 items in 2 subsections)
• Technical assessment (4 items in 2 subsections)
• Application (2 items in 1 subsection)
Structure: ML pipeline of clinical AI studies

R-AI-DIOLOGY / Haller (2022) (23)
Sections: Reporting items (questions) organised under 10 domains
• Disease/domain (2 items)
• Preselection of cases and reference dataset (4 items)
• Data parameters (2 items)
• Data quality check (1 item)
• Anonymization, coding and de-identification (2 items)
• Data storage and processing (2 items)
• Integration in the radiologist's workflow (2 items)
• Update (3 items)
• Validation and labels (2 items)
• Ground truth and reference (1 item)
Structure: ML pipeline of clinical AI studies

STREAM-URO / Kwong (2021) (42)
Sections: AI-related reporting items (and explanations) mapped to those of the TRIPOD statement, organised along the reporting structure of a research article (6 parts)
• Title, abstract (1 item)
• Introduction (2 items)
• Methods (12 items)
• Results (8 items)
• Discussion (3 items)
• Supplemental materials (3 items)
Structure: IMRAD

CLAIM / Mongan (2020) (32)
Sections: Items organised along seven sections of the reporting structure of a clinical study and nine topics
• Title or Abstract (1 item)
• Abstract (1 item)
• Introduction (2 items)
• Methods
  o Study design (2 items)
  o Data (7 items)
  o Ground truth (5 items)
  o Data partitions (3 items)
  o Model (3 items)
  o Training (3 items)
  o Evaluation (5 items)
• Results
  o Data (2 items)
  o Model performance (3 items)
• Discussion (2 items)
• Other information (3 items)
Structure: IMRAD

Canada protocol / Mörch (2020) (43)
Sections: Items organised in five broad categories
• Description (8 items)
• Privacy and transparency (8 items)
• Security (6 items)
• Health-related risks (6 items)
• Biases (8 items)
Structure: Other

Schwendicke (2021) (44)
Sections: Items listed under two main sections
• Planning and conduction (9 items)
• Reporting (22 items)
Structure: Other

PRIME / Sengupta (2020) (45)
Sections: Items organised under seven sections
• Designing the Study Plan (5 items)
• Data Standardization, Feature Engineering, and Learning (6 items)
• Selection of Machine Learning Models (6 items)
• Model Assessment (2 items)
• Model Evaluation (4 items)
• Best Practices for Model Replicability (3 items)
• Reporting Limitations, Biases and Alternatives (2 items)
Structure: ML pipeline of clinical AI studies

CLAMP / Naqa (2021) (46)
Sections: Items organised under six sections
• Abstract (3 items)
• Introduction (3 items)
• Materials (9 items)
• Methods: machine learning algorithm (3 items)
• Methods: performance statistics (5 items)
• Discussion (3 items)
Structure: IMRAD

MAIC-10 / Cerdá-Alberich (2023) (22)
Sections: Checklist items (with descriptions specifying multiple reporting elements) paired with four article sections, with overlaps
• Introduction (1 item)
• Materials and methods (5 items)
• Materials and methods, results (1 item)
• Results, discussion (1 item)
• Discussion (2 items)
Structure: IMRAD