The protocol is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols (PRISMA-P) statement[18, 19] (the checklist is shown in the Supplementary file in Appendix 1), and the review will be conducted using the methodology recommended for systematic reviews and meta-analyses of prediction models[20]. The framing of the review question (Table 1), study design, data extraction, data analysis, and critical appraisal will follow the relevant guidelines[21–23]. The study has been registered on PROSPERO (CRD42021256416).
2.1 Eligibility criteria
Studies will be screened on the basis of the following criteria: population (P), index model (I), comparator (C), outcomes (O), timing (T), and setting (S). PICOTS is an adaptation of the PICOS framework specific to systematic reviews of prediction models, with additional consideration of timing and clinical setting[20, 24].
2.1.1 Patients
Studies on prediction models for patients with AHF (or acute decompensated heart failure) will be considered for inclusion, but studies focusing exclusively on patients with specific comorbidities or on children will be excluded.
2.1.2 Index/Potential prognostic models
Studies reporting the derivation and validation of a multivariable prediction model will be eligible for this review. The derivation cohort should include at least 50 patients who experienced an event during the observation period, because studies with fewer events may not be sufficient to support convincing administrative or clinical use[17].
2.1.3 Outcome, timing and setting
The primary outcomes predicted by the models under review are mortality and readmission of AHF patients. Readmission is defined as all-cause or heart failure-related readmission, including emergency department revisits or re-hospitalizations. Included studies should report at least one of the primary outcomes. Studies of prediction models in inpatient and emergency department settings, with any prediction horizon, will be included.
2.1.4 Types of studies and limits
Cohort, nested case-control, case-cohort, or registry studies using any type of data source (e.g., administrative databases and electronic medical records) will be included in this review. Only original research articles published in peer-reviewed journals will be eligible. Secondary research, reviews, conference proceedings, dissertations, editorials, expert opinions, consensus papers, and abstracts will be excluded.
2.1.5 Information sources and search strategy
Embase, PubMed, Web of Science, and the Cochrane Library will be searched from their inception onwards. Detailed search strategies are presented in the Supplementary file in Appendix 2; the search terms cover expressions for acute heart failure, prediction model, and mortality/readmission[25]. We will focus on studies published in English. We will also perform a backward citation search on all model derivation studies to identify relevant external validation studies.
2.2 Data management and study selection
Because the reviewers have different backgrounds and levels of knowledge, they will undergo formal training before study selection begins. The eligibility criteria will be explained and discussed in detail to ensure that all reviewers share the same understanding. Search results will be combined in EndNote X9, and duplicates will be removed. Titles and abstracts of each study will then be screened by two reviewers (selected from XZ, SW, ZG, and MH), who will exclude articles according to the eligibility criteria.
2.3 Data collection and analysis
2.3.1 Data extraction
Two independent reviewers (selected from XZ, SW, ZG, and MH) will perform the data extraction, using a standardized data extraction form for all included studies. The form was developed based on the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist[22]. For each included study, the following information, among others, will be extracted: study information (title, authors, country and region, publication year); type of prediction model (regression-based model or risk score; derivation and/or validation); source of data (cohort, registry, case-control, or randomized trial participant data); participants (consecutive/non-consecutive participants, location, number of centers, ED/hospitalization/combination/unclear, acute heart failure/acute decompensated heart failure, age limitations, mean age, percentage of male patients, eligibility criteria, study dates); outcomes to be predicted (type of outcome, and the time point and period over which the outcome is predicted); candidate predictors (number and type of predictors, timing of predictor measurement); missing data; sample size; model derivation; model performance; and model evaluation and results. Full details of the form are presented in the Supplementary file in Appendix 3.
Disagreements will be resolved through discussion or, when consensus cannot be reached, by consultation with a third reviewer. We will request missing data from the authors where possible. A study will be excluded if key missing data cannot be obtained.
2.3.2 Critical appraisal
We will provide an overall critical appraisal of the methodological quality and reporting transparency of the included studies. To appraise methodological quality, covering both risk of bias and applicability, we will use the PROBAST tool[23, 24]. This tool consists of 20 signalling questions structured in four domains: participants, predictors, outcome, and analysis. After the four steps of the overall assessment, each domain will be rated as “high”, “low”, or “unclear” for both risk of bias and applicability. The final results of the PROBAST assessments will be presented in a summary table (Supplementary file in Appendix 4).
The TRIPOD statement[21, 26], which provides a checklist of 22 items considered essential for transparent reporting of a prediction model study, will be used to evaluate the reporting transparency of the included studies. Each element of the TRIPOD statement can be answered with “yes” or “no”, depending on whether the element is reported in the study. Each “yes” answer will receive 1 point, and each “no” answer will receive 0 points. In general, when several aspects are described within a TRIPOD item, all aspects have to be reported to obtain a point for that item. Elements that do not apply to a specific study can be marked as “not applicable”[27]. We will report the overall TRIPOD adherence score of each study, a measure developed to standardize the assessment of adherence to the TRIPOD statement. The adherence score is calculated by dividing the number of adhered TRIPOD items by the total number of applicable TRIPOD items[28].
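For clarity, the computation in [28] can be written as:

$$\text{TRIPOD adherence score} = \frac{\text{number of adhered TRIPOD items}}{\text{number of applicable TRIPOD items}} \times 100\%$$

For instance, a study adhering to 24 of 30 applicable items would score 80%; the item counts in this example are purely illustrative.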
To ensure consistency in evaluating these parameters, reviewers will receive training before the critical appraisal. Any disagreement will be handled as described above.
2.3.3 Qualitative data synthesis of the prediction models
The characteristics of the included studies will be described systematically in a narrative synthesis, and quantitative data from the included studies will be presented. Key findings, such as the outcomes to be predicted, predictors, performance measures, and the predictive accuracy of each model, will be tabulated to facilitate comparison. We will report uncertainty measures as they were published or approximate them using published methods[20].
2.3.4 Preparation for quantitative synthesis
Quantitative synthesis of the predictive performance of the included models will be based on the reported performance measures and their precision[20]. However, the types of performance measures vary among studies and are often inconsistent or not reported at all. We will therefore focus on discrimination and calibration, the two most commonly reported performance measures of prediction models. Discrimination refers to a prediction model’s ability to distinguish between patients who do and do not develop the target outcome and is often quantified by the concordance (C) statistic[29]. When the C-statistic is missing, it will be estimated from other reported measures where possible[20]. Calibration refers to the accuracy of a model’s predicted risk probabilities and indicates the degree to which expected and observed outcomes agree[29]. Previous studies offer different statistical measures of calibration, such as the calibration plot, the calibration slope, and the Hosmer-Lemeshow test. If data on the total numbers of expected (E) and observed (O) events can be extracted from the included models, the O:E ratio will be used for further analysis, providing a rough indication of overall model calibration[30]. Extracted C-statistics and total O:E ratios will be rescaled before meta-analysis to improve the validity of the underlying assumptions, according to statistical methods described in the literature[31, 32].
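As a minimal sketch of these rescaling steps, assuming the commonly used logit scale for the C-statistic and natural-log scale for the O:E ratio with delta-method standard errors (function names and all numbers below are hypothetical):

```python
import math

def logit_c(c_stat, se_c):
    """Logit-transform a C-statistic; standard error via the delta method."""
    logit = math.log(c_stat / (1.0 - c_stat))
    se_logit = se_c / (c_stat * (1.0 - c_stat))
    return logit, se_logit

def log_oe(observed, expected, n=None):
    """Natural-log transform of the total O:E ratio.

    The SE is approximated as sqrt((1 - O/N) / O) when the cohort size N
    is reported, and as sqrt(1/O) otherwise (a cruder approximation).
    """
    log_ratio = math.log(observed / expected)
    if n is not None:
        se_log = math.sqrt((1.0 - observed / n) / observed)
    else:
        se_log = math.sqrt(1.0 / observed)
    return log_ratio, se_log

# Hypothetical example: C-statistic 0.75 (SE 0.02); 80 observed events
# against 64 expected among 1,000 patients
print(logit_c(0.75, 0.02))     # -> (1.0986..., 0.1067...)
print(log_oe(80, 64, n=1000))  # -> (0.2231..., 0.1072...)
```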
2.3.5 Quantitative syntheses
The nature of the quantitative analyses will depend on the number of prediction models in the included studies. Model derivation and model validation will be considered separately.
Data will be synthesized by meta-analysis if a sufficiently large subset of studies (at least five) repeatedly appraises an analogous clinical question[33]. A meta-analysis of validation studies of a common prediction model will also be performed where feasible.
2.3.6 Meta-analyses and investigation of heterogeneity
Performance measures of discrimination and calibration will be analyzed in a random-effects meta-analysis to estimate the average performance across all included models, where appropriate[20]. The meta-analysis will be performed using Stata 16.0 (StataCorp, College Station, TX, USA) with relevant packages. To better handle the uncertainty in the estimated between-study heterogeneity, we will adopt restricted maximum likelihood (REML) estimation and use the Hartung-Knapp-Sidik-Jonkman (HKSJ) method when calculating 95% confidence intervals for the average performance[34, 35].
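The analyses themselves will be run in Stata; purely as an illustration of the pooling logic, the Python sketch below uses the simpler DerSimonian-Laird estimator of between-study variance as a stand-in for REML and applies the HKSJ adjustment to the confidence interval (function name and all input numbers are hypothetical):

```python
import numpy as np
from scipy import stats

def pool_hksj(estimates, ses, alpha=0.05):
    """Random-effects pooling with an HKSJ-adjusted confidence interval.

    estimates : transformed performance measures (e.g., logit C-statistics)
    ses       : their standard errors
    Note: DerSimonian-Laird tau^2 is used here for brevity; the protocol
    itself specifies REML estimation.
    """
    y, s = np.asarray(estimates, float), np.asarray(ses, float)
    k = len(y)
    w_fe = 1.0 / s**2                                   # fixed-effect weights
    mu_fe = np.sum(w_fe * y) / np.sum(w_fe)
    q = np.sum(w_fe * (y - mu_fe) ** 2)                 # Cochran's Q
    c = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
    tau2 = max(0.0, (q - (k - 1)) / c)                  # DL between-study variance
    w = 1.0 / (s**2 + tau2)                             # random-effects weights
    mu = np.sum(w * y) / np.sum(w)
    # HKSJ variance of the pooled estimate, with a t(k-1) critical value
    var_hksj = np.sum(w * (y - mu) ** 2) / ((k - 1) * np.sum(w))
    half = stats.t.ppf(1 - alpha / 2, k - 1) * np.sqrt(var_hksj)
    return mu, (mu - half, mu + half), tau2, q

# Hypothetical logit C-statistics and SEs from five validation studies
mu, ci, tau2, q = pool_hksj([0.85, 0.62, 0.95, 0.70, 1.10],
                            [0.10, 0.12, 0.15, 0.09, 0.20])
pooled_c = 1.0 / (1.0 + np.exp(-mu))   # back-transform to the C-statistic scale
```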
Heterogeneity in the pooled results usually reflects differences in design and population across the validation studies, such as differences in case mix or baseline risk[36, 37]. The case-mix variation of each study will be quantified by estimating the standard deviation of the linear predictor[20]. Cochran’s Q and the I² statistic will be calculated to assess statistical heterogeneity[38]. Potential sources of heterogeneity will be explored with meta-regression analyses if enough studies are included in the meta-analyses (≥10 studies)[39, 40].
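For reference, I² can be derived directly from Cochran's Q (as returned by the pooling sketch above; the values here are hypothetical):

```python
# I^2: percentage of total variability across studies attributable to
# heterogeneity rather than chance (Q and k are hypothetical here)
q, k = 9.8, 5                                       # Cochran's Q, number of studies
i_squared = max(0.0, (q - (k - 1)) / q) * 100.0     # ~59%
```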
2.3.7 Subgroup analysis
If the number of included studies is large enough for specific subgroups, we will perform the following subgroup analyses: (1) type of prediction model; (2) source of data; (3) outcome to be predicted (mortality or readmission); (4) method of model building; (5) location; (6) timing (the time point at which, and the period over which, the outcome is predicted); (7) setting; and (8) others. Further subgroup analyses will depend on the final data extraction.
2.3.8 Sensitivity analysis
Sensitivity analyses will be conducted by excluding studies at high risk of bias (rated “high” in at least four of the seven PROBAST risk-of-bias and applicability domains) and studies with low reporting transparency (overall TRIPOD adherence score <50%), to explore their influence on the pooled effect sizes.