Evaluation and validation of a prediction model for extubation success in very preterm infants

To perform an external validation of a publicly available model predicting extubation success in very preterm infants. Retrospective study of infants born <1250 g at a single center. Model performance evaluated using the area under the receiver operating characteristic curve (AUROC) and comparing observed and expected probabilities of extubation success, defined as survival ≥5 d without an endotracheal tube. Of 177 infants, 120 (68%) were extubated successfully. The median (IQR) gestational age was 27 weeks (25–28) and weight at extubation was 915 g (755–1050). The model had acceptable discrimination (AUROC 0.72 [95% CI 0.65–0.80]) and adequate calibration (calibration slope 0.96, intercept −0.06, mean observed-to-expected difference in probability of extubation success −0.08 [95% CI −0.01, −0.15]). The extubation success prediction model has acceptable performance in an external cohort. Additional prospective studies are needed to determine if the model can be improved or how it can be used for clinical benefit.


INTRODUCTION
Invasive mechanical ventilation in very preterm infants is associated with adverse outcomes such as bronchopulmonary dysplasia (BPD) and neurodevelopmental impairment [1]. Reducing the duration of invasive mechanical ventilation by early extubation may reduce these risks, but extubation failure is associated with an increased risk of BPD, death and intracranial hemorrhage [2,3].
Recently, a model to estimate the probability of successful extubation in very preterm infants was developed [3] and made available online at www.extubation.net. The model includes six variables associated with extubation success: gestational age at birth, chronological age at extubation, weight at extubation, pre-extubation blood gas pH, pre-extubation FiO2, and highest respiratory severity score (RSS) in the first 6 h of age. The area under the receiver operating characteristic curve (AUROC) for the model was 0.77 in the development cohort. However, the performance of the model in an external cohort of infants undergoing a clinically-indicated extubation attempt is unknown.
We aimed to assess the performance of the extubation success prediction model in a cohort of infants in whom extubation was attempted at an academic U.S. neonatal intensive care unit (NICU). Additionally, we aimed to evaluate the performance of alternative prediction models and to examine differences in characteristics and outcomes between infants with extubation success and those with extubation failure. Our goal is to guide the development of future tools that may inform decisions about extubation for very preterm infants.

METHODS
Study design
We conducted a retrospective observational study that included intubated infants with a birth weight <1250 g admitted from August 1, 2008 to July 31, 2017 at Emory University Hospital Midtown in Atlanta, Georgia, with initial extubation occurring within 60 days of birth. These criteria were used to match the selection criteria of the cohort in which the model was derived [3]. Infants with first extubation after 60 postnatal days, prior extubation, a first extubation that was unplanned, or with data missing for the extubation model or outcome were excluded. The study was approved by the Emory University Institutional Review Board with a waiver of informed consent and was performed in accordance with the Declaration of Helsinki. The TRIPOD Checklist for Prediction Model Validation was used to ensure transparent reporting of findings [4] (Supplementary Table 1).

Predictors and outcome
Data were collected on the six predictors in the extubation success calculator: gestational age at birth, chronological age at extubation, weight at extubation, pre-extubation FiO2, pre-extubation pH (closest to extubation, regardless of timing), and the highest respiratory severity score (RSS) in the first six hours of age. RSS was calculated as the product of the mean airway pressure and FiO2. Additional data available on infant and maternal demographics, surfactant and caffeine exposure, and several aspects of pre- and post-extubation ventilatory support were collected. Extubation success was defined as survival for ≥5 days without an endotracheal tube, consistent with the derivation cohort [3]. BPD was graded according to severity using published criteria [5]. Retinopathy of prematurity receiving treatment was defined as receipt of laser or intravitreal anti-VEGF therapy. Severe intraventricular hemorrhage, defined as grade 3 or 4, and periventricular leukomalacia were ascertained based on ultrasound reports by a pediatric radiologist. Necrotizing enterocolitis was defined as modified Bell Stage IIA or greater. No blinding of predictor or outcome assessment was performed.
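As an illustration of the RSS predictor described above, the following is a minimal sketch of the calculation. The function names are hypothetical, and two details are assumptions not stated in the text: FiO2 is entered as a fraction (0.21 to 1.0) and mean airway pressure in cmH2O.

```python
from typing import Iterable, Tuple


def respiratory_severity_score(mean_airway_pressure: float, fio2: float) -> float:
    """RSS = mean airway pressure x FiO2.

    Assumption: FiO2 is a fraction (0.21-1.0), not a percentage.
    """
    if not 0.21 <= fio2 <= 1.0:
        raise ValueError("FiO2 is expected as a fraction between 0.21 and 1.0")
    if mean_airway_pressure < 0:
        raise ValueError("mean airway pressure cannot be negative")
    return mean_airway_pressure * fio2


def highest_rss(readings: Iterable[Tuple[float, float]]) -> float:
    """Highest RSS among (mean airway pressure, FiO2) readings.

    The caller is responsible for passing only readings taken in the
    first six hours of age.
    """
    return max(respiratory_severity_score(m, f) for m, f in readings)


# Hypothetical readings for one infant: (MAP, FiO2)
example_readings = [(8.0, 0.40), (10.0, 0.30), (7.0, 0.50)]
peak = highest_rss(example_readings)
```

The highest RSS does not necessarily coincide with the highest MAP or the highest FiO2 alone, which is why the product is computed per reading before taking the maximum.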

Statistical methods
The sample size was pragmatic with an intent to have at least 10 outcome events per parameter in the extubation model. No imputation for missing data was performed. To assess model performance, we generated probabilities of extubation success using the parameters from the published extubation success calculator, available at www.extubation.net. Next, we generated receiver operating characteristic curves using predicted probabilities and calculated the AUROC as a measure of model discrimination for extubation success. We also created calibration plots to compare predicted vs. observed probabilities of extubation success by deciles of predicted probability [6,7]. Within each decile, the observed proportion of infants with extubation success and corresponding 95% binomial confidence intervals were reported. The calibration plot included a fitted linear model slope and intercept. In addition, we estimated the observed-to-expected mean difference in extubation success.
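The discrimination and calibration measures described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the analyses in this study were run in SPSS and GraphPad Prism, all function names and the toy data are hypothetical, and the sign convention for the mean difference (observed minus expected) is an assumption.

```python
from statistics import mean


def auroc(y_true, y_prob):
    """AUROC as the probability that a randomly chosen success has a higher
    predicted probability than a randomly chosen failure (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_by_group(y_true, y_prob, n_groups=10):
    """Sort by predicted probability, split into n_groups of near-equal size
    (deciles when n_groups=10), and return one tuple per group:
    (mean predicted probability, observed proportion, group size)."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:
            rows.append((mean(p for p, _ in chunk), mean(y for _, y in chunk), len(chunk)))
    return rows


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for the calibration plot."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_observed_minus_expected(y_true, y_prob):
    """Mean observed outcome minus mean predicted probability (assumed sign)."""
    return mean(y_true) - mean(y_prob)


# Toy data (1 = extubation success); not study data.
outcomes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
predicted = [0.9, 0.8, 0.3, 0.7, 0.6, 0.85, 0.5, 0.4, 0.95, 0.65]
rows = calibration_by_group(outcomes, predicted, n_groups=2)
slope, intercept = fit_line([r[0] for r in rows], [r[1] for r in rows])
```

A slope near 1 and an intercept near 0 indicate that the group-level observed proportions track the predicted probabilities; the toy example uses two groups only because ten observations cannot support deciles.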
To determine how model parameters compared from the original published extubation success calculator to our cohort, we recalibrated the Gupta et al. logistic regression model for our cohort and compared odds ratios of model parameters from the initial derivation cohort [3] to those generated from this validation cohort. Subsequently, we evaluated how a more parsimonious model or expanded model changed discrimination, by comparing the AUROC among different models that removed variables with odds ratio estimates that were different from the derivation cohort (Reduced model) or included variables that were different at a P < 0.10 between extubation success groups that were not in the original model (Expanded model). Because chorioamnionitis was added as a descriptive characteristic after the initial analysis and ascertainment of clinical chorioamnionitis can be subjective, this was not included as a variable in the Expanded model. We also included the evaluation of a recently published two-variable model by Kidman et al. that included gestational age at birth and pre-extubation mean airway pressure (MAP) [8], as well as gestational age only, given the strong prognostic influence of gestational age on neonatal outcomes and the goal to determine the incremental prognostic value of the other variables in the calculator. For comparisons of AUROCs across models, we only reported results generated from the validation study cohort, which was paired with statistical comparisons between models using the likelihood ratio test as described below. Additionally, we evaluated for the presence of possible interaction between gestational age and postnatal age using interaction terms in our model, which could be considered a combined indicator of the postmenstrual maturity of an infant. We compared the −2 log-likelihood of the Gupta et al. model to other models that had a nested structure (gestational age-only, Reduced and Expanded model) using likelihood ratio tests, which account for differences in the degrees of freedom (number of variables) of the different models. Because pre-extubation MAP was not in the Gupta et al. model, we did not perform a direct comparison with the Kidman et al. model. As an alternative, we compared the Kidman et al. model to the Expanded model, which did contain pre-extubation MAP.
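The likelihood ratio test for nested models can be sketched as follows. This is an illustration under stated assumptions rather than the study's SPSS procedure: the function names are hypothetical, the −2 log-likelihood values in the usage lines are invented for demonstration, and the chi-square tail probability is computed with a closed-form recurrence that is valid for integer degrees of freedom.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution with integer df.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), starting
    from Q(1, y) = exp(-y) for even df or Q(1/2, y) = erfc(sqrt(y)) for odd df.
    """
    if df < 1 or x < 0:
        raise ValueError("df must be >= 1 and x must be >= 0")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(neg2ll_reduced: float, neg2ll_full: float, df_diff: int):
    """Compare a nested (reduced) model with a fuller model.

    The statistic is the drop in -2 log-likelihood; df_diff is the number
    of additional parameters in the fuller model.
    """
    statistic = neg2ll_reduced - neg2ll_full
    p_value = chi2_sf(max(statistic, 0.0), df_diff)
    return statistic, p_value


# Hypothetical values: a fuller model with two extra variables lowers
# the -2 log-likelihood from 200.0 to 194.0.
stat, p = likelihood_ratio_test(200.0, 194.0, df_diff=2)
```

The test applies only when one model's variables are a subset of the other's, which is why models that do not share a nested structure cannot be compared this way.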
To explore optimal cut-offs of predicted probability of extubation success considering both sensitivity and specificity, Youden's J [9] and Euclidean distances were plotted by predicted probability of extubation success. Youden's J values can indicate the point of predicted probability of extubation success where equal weight is given to sensitivity and specificity, with the highest value reflective of a probability threshold that could potentially be used for informed decision making. Characteristics between infants with extubation success and failure were compared using chi-square or Fisher's exact tests for categorical variables, independent sample t-tests for means and Wilcoxon rank-sum tests for medians. The mean probability of extubation success at time of extubation was compared by severity of BPD using linear regression, with and without adjustment for gestational age, to determine the independent association of extubation success probability after accounting for gestational age. All other outcome comparisons were unadjusted, to only account for potential differences after considering random variation, and were not intended for causal inference. Analyses were performed using SPSS version 27 (IBM) and GraphPad Prism 8. A two-sided P < 0.05 was considered statistically significant.
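The threshold exploration described above can be sketched as below. The function names and toy data are hypothetical. Two conventions are assumptions: an infant is classified as "predicted success" when the predicted probability is strictly greater than the threshold, and the Euclidean distance is measured from the operating point to the perfect-classification corner of the ROC plot (sensitivity 1, specificity 1).

```python
import math


def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, Youden's J and Euclidean distance to the
    perfect-classification corner at one probability threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p <= threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p <= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p > threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome classes are required")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "threshold": threshold,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_j": sensitivity + specificity - 1.0,
        "distance": math.hypot(1.0 - sensitivity, 1.0 - specificity),
    }


def best_threshold_by_youden(y_true, y_prob, thresholds):
    """Return the metrics for the threshold with the highest Youden's J."""
    return max(
        (threshold_metrics(y_true, y_prob, t) for t in thresholds),
        key=lambda m: m["youden_j"],
    )


# Toy data (1 = extubation success); not study data.
outcomes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
predicted = [0.9, 0.8, 0.3, 0.7, 0.6, 0.85, 0.5, 0.4, 0.95, 0.65]
best = best_threshold_by_youden(outcomes, predicted, [0.2, 0.4, 0.6, 0.8])
```

The highest J and the lowest distance often, but not always, fall at the same threshold, so plotting both across thresholds is a useful cross-check.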

RESULTS
Among a total of 182 infants meeting inclusion criteria, we had complete data for the extubation prediction model in 177 infants, of whom 120 (68%) had extubation success (Supplementary Fig. 1). Infants who failed extubation had a lower birth weight and gestational age (P < 0.001, Table 1). In addition, a diagnosis of clinical chorioamnionitis was more common among infants who failed extubation (P = 0.01). There were no significant differences in maternal race, infant sex, multiple gestation, mode of delivery, Apgar score at 1 and 5 min, highest MAP and FiO2 in the first 12 h of age, receipt of surfactant within 2 h of age, or RSS in the first 6 h of age by extubation success groups (Table 1).

Model validation
The AUROC of the primary six-variable model was 0.72 (95% CI 0.65–0.80, Fig. 1). A linear fit line of the calibration plot had a slope of 0.96, y-intercept of −0.06 and R² of 0.81 (Fig. 1). The mean observed-to-expected difference in probability of extubation success was −0.08 (95% CI −0.01, −0.15). Exploring cut-points of extubation success considering both specificity and sensitivity, the highest Youden's J and lowest Euclidean distance were at a predicted probability of extubation success of 0.80 (sensitivity of 0.63, specificity of 0.74) (Supplementary Table 2 and Supplementary Fig. 2). Infants with a predicted probability of extubation success of >0.80 had an 83% incidence of first-attempt extubation success, compared to 52% for infants with a predicted probability of extubation success ≤0.80.
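As a worked check of the reported cut-point, Youden's J and the Euclidean distance can be recomputed from the stated sensitivity and specificity at a predicted probability of 0.80. The distance formula below assumes the common definition (distance from the operating point to the point of perfect sensitivity and specificity), which is not spelled out in the text.

```latex
\begin{align*}
J &= \text{sensitivity} + \text{specificity} - 1 = 0.63 + 0.74 - 1 = 0.37 \\
d &= \sqrt{(1 - 0.63)^2 + (1 - 0.74)^2} = \sqrt{0.1369 + 0.0676} = \sqrt{0.2045} \approx 0.45
\end{align*}
```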

Model modification
When the original model was derived again in our validation cohort dataset using the six variables in the original extubation success calculator, the AUROC was 0.759 (Fig. 2). There was no evidence of statistical interaction between gestational age and postnatal age (P = 0.80). After removal of RSS as a predictor, the AUROC was 0.744. After expanding the model by adding birth weight, pre-extubation MAP, receipt of peri-extubation steroids and caffeine, the AUROC was 0.787. The AUROC of a gestational age-only model was 0.712 and of a recently published two-variable model based only on gestational age and MAP was 0.721. Statistical comparisons of the various models are reported in the legend of Fig. 2.

Comparison of factors between derivation and validation cohort
Odds ratio estimates were similar among the validation cohort and the derivation cohort, with the exception of RSS in the first 6 h of age, which showed differing directions in risk estimates between the two cohorts (Supplementary Fig. 3). Selected characteristics of both cohorts are shown in Supplementary Table 3.

Peri-extubation characteristics
Infants that failed extubation weighed less at the time of extubation (P = 0.002, Table 2), and received a higher FiO2 (P < 0.001), had a lower blood gas pH (P = 0.003), higher MAP (P = 0.02) and an earlier postnatal age at extubation (P = 0.02). There was no significant difference in the interval in hours between the pre-extubation blood gas and time of extubation between groups (P = 0.55). Infants who failed extubation were more likely to receive peri-extubation steroids (P = 0.009). There was no significant difference between the groups in receipt of caffeine prior to extubation (P = 0.06).

Clinical outcomes
Infants who failed extubation had an increased severity of BPD or death (P < 0.001, Table 3). In addition, among infants with extubation attempts prior to 36 weeks' postmenstrual age, the probability of extubation success at first extubation attempt was inversely correlated with the severity of BPD (Supplementary Fig. 4). Infants who failed extubation had a longer length of stay (P = 0.003) and a longer duration of ventilation (P < 0.001), and were more likely to receive postnatal steroids (P < 0.001). Other outcomes are shown in Table 3.

DISCUSSION
The extubation prediction model in our validation cohort had an AUROC of 0.72, compared to an AUROC of 0.77 in the previously published derivation cohort [3]. By convention, an AUROC of 0.5 is non-informative, 0.7 to 0.8 is considered acceptable, 0.8-0.9 is considered excellent and more than 0.9 is considered outstanding, although this depends on the clinical context [10,11]. Based on these categorizations, both the Gupta et al. study [3] and the results from our validation study demonstrate acceptable model performance in determining extubation success among infants undergoing a clinically-determined extubation attempt. Based on the calibration data, the model slightly underestimated extubation success on average in our cohort (8% lower than expected), but the calibration slope and intercept supported acceptable calibration.
In our study, 68% of very preterm infants were successfully extubated on initial attempt and those that failed extubation had a higher severity of BPD. By comparison, extubation success in the Gupta et al. [3] study was 73%. Notably, the association between the highest RSS in the first 6 h of life and extubation success differed between the two cohorts. This may reflect differences in early respiratory care, including surfactant administration and approaches to mechanical ventilation, which may influence RSS. Because this variable is included in the original model, we assessed the impact of its removal and found only a small reduction in model discrimination.
We defined extubation success as survival without an endotracheal tube for 5 days to be consistent with the definition used in the original study by Gupta et al. [1,3]. Other definitions of extubation success exist, with a 3-7 days duration of observation being most common and rates of extubation success varying based on definition [12]. All of the infants in this study received caffeine, although some received caffeine after extubation. In addition, some received peri-extubation steroids. Inclusion of these factors into an expanded model suggested they might slightly improve model performance, although the association between these factors and extubation success may not be generalizable to other centers. A recent study reported that a two-factor model that included gestational age and mean airway pressure predicted extubation success with an AUROC of 0.77 [8]. This two-variable model in our cohort had an AUROC of 0.72, although this was based on rederivation using the corresponding variables and not the model equation from the Kidman et al. study. Therefore, these two AUROCs may not be directly comparable. In both this study and the study by Kidman et al. [8], pre-extubation MAP was associated with extubation success. By contrast, the study by Gupta et al. [3] reported no association with pre-extubation MAP and extubation success. In addition, we found an association between peri-extubation corticosteroid use and extubation success, which was not observed in the study by Kidman et al. [8]. These differences may be due to differences in practices among centers, and highlight the importance of external validation of predictive models used in clinical practice. Additional studies of model performance of this and other extubation readiness models are needed in other cohorts of similar infants in order to determine which factors might be universally associated with extubation readiness and which factors are impacted by local center practices. 
It would be useful to assess the association with RSS within 6 h of birth and extubation success in additional studies, given the differences observed between the original Gupta et al. [3] study and this study. In order to make the calculator easier to use, a reduced model without RSS may be more practical and not sacrifice model performance. Additional model inputs such as pre-extubation pH may also be influenced by center practices with regards to pH tolerance or targets. The severity of BPD or death was associated with a lower predicted probability of extubation success in our cohort, even after adjustment for gestational age. This may be reflective of the correlation between other factors in the model, aside from gestational age, and the severity of BPD. These findings are consistent with the association between extubation failure and BPD reported in prior studies [3,8]. However, other factors associated with both BPD and extubation success may either confound or explain the observed association. Additional studies are needed to determine how use of the extubation success calculator may improve outcomes, including what specific thresholds might inform extubation decisions by clinicians. Because all Youden's J values were <0.5, this may indicate limitations of using specific probability thresholds generated by the model to guide decisions regarding extubation. Furthermore, we believe additional study is needed before routine clinical use of this model, given the potential unintended consequences of delaying extubation based on a clinician's interpretation of model results that could prolong invasive mechanical ventilation. Additionally, the extubation prediction model could be useful to evaluate heterogeneity in treatment effect [13] of interventions to improve extubation success [14].
There are some limitations to our study. Our findings may not be generalizable to infants that weigh >1250 grams at birth, have previously been extubated, have an unplanned extubation, or are extubated after 60 postnatal days. In addition, this study's generalizability may be limited in settings with standardized extubation criteria, which were not in place in this study setting. Additionally, over 90% of infants in our study received any dose of caffeine within 24 h before extubation, and the association between pre-extubation caffeine use and extubation success may differ for centers with more variable caffeine use. Furthermore, differences in outcomes among infants who had extubation success, compared to extubation failure, may be explained by many factors, including confounders that were not accounted for in our study. For example, infants with extubation failure had a higher frequency of non-invasive positive pressure ventilation immediately following extubation, which could indicate that a clinician had a high concern for the potential for extubation failure. Additionally, any peri-extubation systemic corticosteroids exposure, which is not routine practice at the study center, was more common among infants who failed extubation. This might  reflect a higher clinician concern for extubation failure related to airway edema or related airway issues. Importantly, our goals were not causal inference on the association between extubation success and clinical outcomes, but a focus on the external validation of the prediction model, and differences reported in outcomes should be viewed accordingly.
In conclusion, a publicly available prediction model for first-attempt extubation success in very preterm infants demonstrated acceptable performance in this external cohort of infants, supporting its potential utility in clinical practice. Additional studies are needed to further refine the model and to determine if its use can improve outcomes.

Table 3. Outcomes of infants by initial extubation failure or success.
Columns: Outcome; Extubation failure (n = 57); Extubation success (n = 120); P.

DATA AVAILABILITY
The corresponding author can be contacted for source data.