Our study found that a simple model based on pH, base excess, and gestational age may accurately predict the risk of severe intraventricular hemorrhage in infants born at extremely low birth weight. We applied a variety of methodologies, including machine learning-based algorithms, to create and compare multiple models. The model using only these three features yielded an AUC of 0.857 and produced a simple, interpretable decision tree that does not require any subjective variables, such as the use of inotropic therapy. A low pH appears to be the key factor for identifying the great majority of cases that require additional attention: preterm infants whose lowest pH in the first 3 days of life was greater than 7.2 had an sIVH prevalence of only 6%, whereas those with a lowest pH below 7.2 had an sIVH prevalence of 40%.
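The root split of the tree reduces to a one-line rule. The sketch below (in Python; the function name is ours, while the threshold and prevalence figures come from the cohort results described above) makes the triage logic explicit:

```python
def sivh_risk_group(lowest_ph_first_3_days: float) -> str:
    """Root split of the decision tree: lowest pH in the first
    3 days of life, thresholded at 7.2.

    In our cohort, infants with a lowest pH above 7.2 had a 6%
    sIVH prevalence, versus 40% when the lowest pH was below 7.2.
    """
    if lowest_ph_first_3_days > 7.2:
        return "lower risk"   # ~6% sIVH prevalence in this cohort
    return "higher risk"      # ~40% sIVH prevalence in this cohort

print(sivh_risk_group(7.35))  # -> lower risk
print(sivh_risk_group(7.10))  # -> higher risk
```

Because the rule uses a single objective laboratory value, it can be applied at the bedside without any judgment call about hemodynamic management.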
The prevalence of severe intraventricular hemorrhage remains relatively high (13), and more effective strategies for preventing sIVH are needed (14). Its consequences are serious: over half of all affected neonates develop significant neurocognitive impairment (1). One reason the prevalence of sIVH remains stagnant is that it is a complex, multifactorial disease with many risk factors, including changes in cerebral blood flow (hypoxia, hypercarbia, acidosis, ventilation asynchrony, patent ductus arteriosus, suctioning of the airway), high cerebral venous pressure (pneumothorax, high ventilator pressure, prolonged labor, and vaginal delivery), abnormal blood pressure (hypotension, hypertension, sepsis, dehydration), the inherent fragility of the germinal matrix vasculature (hypoxic-ischemic insult, sepsis, thrombocytopenia), and hemostatic disturbance (14). It is therefore difficult for a neonatologist to identify the infants at higher risk of sIVH. Previous predictive models were created to help fill this gap. Luque et al. (15) developed a stepwise logistic regression model, with an AUC of 0.78, using the following variables: gestational age, mechanical ventilation, antenatal steroids, 1-min Apgar score, birth weight, cesarean section, male gender, and respiratory distress syndrome. However, essential hemodynamic and respiratory variables were left out of this model, and when we applied it to our population, the AUC was only 0.70. Siddappa et al. (16) developed another model with an AUC of 0.78 using only a severity score (SNAPPE-II); in our population, a model based on a severity score (CRIB-II) yielded an AUC of 0.76.
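When comparing the AUC figures above, it helps to recall the statistic's rank interpretation: the probability that a randomly chosen affected infant receives a higher predicted risk score than a randomly chosen unaffected one (ties counting as one half). A minimal, dependency-free sketch (the toy scores are illustrative only, not study data):

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: probability that a randomly chosen positive
    case is scored higher than a randomly chosen negative case,
    with ties counting as 0.5 (the Mann-Whitney U formulation)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy example: predicted risk scores for affected vs. unaffected infants
print(auc([0.9, 0.7, 0.6], [0.4, 0.5, 0.65]))  # -> 0.888...
```

An AUC of 0.78 therefore means that, for a randomly drawn affected/unaffected pair, the model ranks the affected infant higher 78% of the time.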
The importance of classification analysis in assessing the associations between independent variables (predictors) and dependent variables (outcomes) cannot be overstated. However, a large number of predictors increases the computational complexity of a model and makes it more prone to overfitting: when a model becomes too complex, it may begin to describe random error rather than the relationships between variables. To avoid this, the smallest set of features required to predict the outcome should be determined (17). Because sIVH has a large number of risk factors, however, selecting a few key parameters to build a predictive model is challenging. Here, machine learning algorithms can help identify the variables that matter most to the outcome: Boruta is a feature selection algorithm, and XGBoost can provide estimates of feature importance.
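To illustrate the idea behind Boruta: each real feature is compared against shuffled "shadow" copies of the features, and only features that consistently out-score the best shadow are retained. The sketch below is a simplified stand-in (the function names, the correlation-based importance proxy, and the toy data are all ours); Boruta proper uses random-forest importances and a statistical test rather than a simple majority of rounds:

```python
import random

def abs_corr(col, y):
    """Importance proxy: absolute Pearson correlation with the outcome.
    A dependency-free stand-in for the random-forest importances that
    Boruta actually uses."""
    n = len(col)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    var = (sum((a - mx) ** 2 for a in col)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return abs(cov) / var if var else 0.0

def boruta_style_screen(X, y, importance=abs_corr, n_rounds=50, seed=0):
    """Keep a feature only if it beats the best shuffled 'shadow'
    feature in a majority of rounds -- the core idea behind Boruta."""
    rng = random.Random(seed)
    n_features = len(X[0])
    real = [[row[j] for row in X] for j in range(n_features)]
    hits = [0] * n_features
    for _ in range(n_rounds):
        shadows = [col[:] for col in real]
        for col in shadows:
            rng.shuffle(col)  # shuffling destroys any true association
        best_shadow = max(importance(col, y) for col in shadows)
        for j in range(n_features):
            if importance(real[j], y) > best_shadow:
                hits[j] += 1
    return [j for j in range(n_features) if hits[j] > n_rounds // 2]

# Toy data: feature 0 tracks the outcome, feature 1 is pure noise.
y = [0] * 10 + [1] * 10
X = [[y[i], i % 2] for i in range(20)]
print(boruta_style_screen(X, y))  # selected feature indices
```

The shuffled shadows provide a data-driven null: a feature that cannot beat a randomly permuted column carries no signal worth keeping.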
Using traditional feature selection methods, we were only able to create reasonable models, with AUCs ranging from 0.707 to 0.823, similar to the results of Luque et al. and Siddappa et al. Furthermore, those features did not yield easily interpretable decision-tree models. For example, the root node (the first split of the tree) of some decision-tree models was the need for inotropic therapy. Because there is no consensus on neonatal hemodynamic management, the use of inotropes is largely subjective and varies between physicians (18). As a result, a decision-tree model with a subjective variable at its root may fit our population but fail external validation.
In predictive modeling, feature selection is crucial. We demonstrated that machine learning algorithms can assist feature selection by identifying the variables with the greatest predictive value. Interestingly, only the machine learning algorithms (Boruta and XGBoost) identified the lowest pH as the most important predictor of sIVH. Our study has several limitations. First, it was a single-center retrospective study, and a multi-center prospective study is needed. Second, the accuracy of our models varied widely, implying that there are sIVH-related variables we did not analyze. Finally, the results may apply to our population, but external validation is needed.
We propose a simple, interpretable decision-tree model for identifying the newborns with extremely low birth weight who are at highest risk of severe intraventricular hemorrhage. Feature selection is crucial in predictive modeling.