The results presented here demonstrate that it is possible to learn RF models with superior, well-rounded performance for early prediction of preeclampsia at multiple time points throughout pregnancy, with minimal preprocessing of data, feature engineering, or feature selection. Exhibiting a relatively balanced score for PPV and Sensitivity, RF increases performance by all metrics at each new visit as more information becomes available. The feature importance plots confirm existing knowledge about known predictive features such as blood pressure, uterine artery blood flow, and placental analytes and identify features not commonly referenced in the prediction literature, such as Endoglin, Cholesterol, and Inhibin A. Review of RF fairness metrics indicated a correctable bias against Black participants.
Our study confirmed that blood pressure and placental analytes were significant in predicting PE across study visits[19,20,21]. The results of our statistical tests deviate from other works[2,10,22] in that risk factors such as maternal age, race, sleep apnea, and family history of PE were not significant. Socio-economic status did not contribute to the prediction of preeclampsia in our study cohort, as suggested by other studies such as Arechvo et al.[23]. Thus, care must be taken in comparing the model performance presented here for the nuMoM2b dataset with other studies, given that the nuMoM2b dataset characterizes demographically diverse nulliparous mothers with unknown risk for PE at the time of first prediction, while the target label is strictly focused on sPE+E criteria.
Our selected predictors in the first trimester of pregnancy are similar to those used by previously published competing risk models from Akolekar et al., Poon et al., and O’Gorman et al.[24,25,26], but our study contains more features and focuses solely on a nulliparous study cohort. To compare our results to these prior studies, we reconstructed their experiment using our nulliparous cohort and features from V1. We found that our model yielded better outcomes across the board. In Table 4, our model performance, on average, has a 3-4% higher AUC. While Poon et al.[24] report a 91% AUC for preterm PE and a 78% AUC for term PE using only features such as maternal risk factors, MAP, PlGF, uterine artery pulsatility index, and PAPP-A, we did not observe AUCs this high in our prediction model. This might be attributed to the fact that our prediction task focuses on PE with severe features in nulliparous women only, which makes the prediction task much more difficult.
Ensemble methods, specifically RF and XGBoost[27], are the top performers in our study. Other studies have shown ensemble methods to have strong predictive power for preeclampsia[28,29,30]. This may be due to the ensemble nature and the ability of the underlying model, decision trees, to capture some of the subtle distinctions between the varied and poorly understood subgroups of preeclampsia patients[31]. The PDP for BMI, a well-known risk factor for PE, shown in Figure 5.c indicates that the risk of PE begins to increase at a BMI of around 22.41 and peaks at 35. One possible rationale is that the effect of magnesium circulation is reduced when the BMI is at 35, since good magnesium circulation can significantly reduce the risk of eclampsia or convulsions[32]. Furthermore, PDPs for various placental analytes indicate that a decreased level of PlGF during the first and second trimesters precedes the onset of PE[2,33,34]. Agrawal et al.[35] found that the predictive value was highest for PlGF levels between 80 and 120 pg/mL, which coincides with the sharp increase in the predicted risk of PE observed in the PDP for PlGF at Visit 1 for measurements less than 100 pg/mL. MacDonald et al.[36] suggested a sFlt-1:PlGF ratio > 33.4, which agrees with our PDP in supplement Figure 5. Levine et al.[37] found that endoglin levels at 25 through 28 weeks of gestation were significantly higher (8.5 ng/mL) in term PE patients. We observe a similar cut-off value in the PDP in supplement Figure 4.c, which shows a pronounced increase in the risk of PE at around 9 ng/mL at V1, albeit occurring much earlier, at 6-13 weeks of gestation. Analytes such as PlGF, unlike blood pressure, were consistently important across the sPE+E vs. NPH model and the early vs. late model (Figure 5), indicating their predictive power, particularly their ability to rule out early onset[4,27].
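To make the PDP readings above concrete, the following minimal sketch shows how a one-dimensional partial dependence curve can be computed: the feature of interest is forced to each grid value for every participant, and the predicted risks are averaged. The function `toy_risk`, the feature names `plgf_v1` and `bmi`, and all numeric values are hypothetical stand-ins chosen for illustration; they are not the fitted RF model or data from this study.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def partial_dependence(
    predict_risk: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[float],
) -> List[float]:
    """Average predicted risk when `feature` is forced to each grid value."""
    if not rows:
        raise ValueError("rows must not be empty")
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the cohort is left untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict_risk(modified)
        curve.append(total / len(rows))
    return curve


def toy_risk(row: Row) -> float:
    """Hypothetical stand-in for a fitted classifier (assumption, not the study's RF)."""
    risk = 0.05
    if row["plgf_v1"] < 100.0:  # illustrative step below 100 pg/mL
        risk += 0.20
    if row["bmi"] >= 35.0:      # illustrative step at a BMI of 35
        risk += 0.10
    return risk


# Made-up cohort of three participants.
cohort = [
    {"plgf_v1": 60.0, "bmi": 24.0},
    {"plgf_v1": 150.0, "bmi": 36.0},
    {"plgf_v1": 110.0, "bmi": 29.0},
]

pd_curve = partial_dependence(toy_risk, cohort, "plgf_v1", [50.0, 100.0, 150.0])
```

In this toy setting the curve is higher at the lowest grid value than at the other two, which is the kind of step that a PDP makes visible. A real analysis would use the trained model's predicted probabilities and a finer grid.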
Implications
This study demonstrates the utility of early screening for PE at multiple time points. It shows that early blood pressure measurement can be a proxy for the risk of high blood pressure later in pregnancy. In addition, information about placental analytes, which can be gathered at a reasonable cost tradeoff between assessment and hospitalization[4], allows predictions that far surpass the accuracy of a model based only on ACOG guidelines[38]. Further validation is required for the proposed separate models for multiple time points to ensure prediction consistency: a patient identified as high risk early in pregnancy should not be deemed low risk later without sufficient explanation. Finally, identifying women at increased risk in the first trimester allows for timely prophylaxis with low-dose aspirin, which is highly effective in preventing preterm disease[39].
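One simple way to audit the prediction consistency described above is to flag every participant whose predicted risk drops below the decision threshold at a later visit after having exceeded it at an earlier one. The sketch below assumes one risk score per visit in chronological order; the function name `flag_risk_downgrades`, the participant identifiers, the scores, and the threshold of 0.5 are all hypothetical.

```python
from typing import Dict, List


def flag_risk_downgrades(
    risk_by_visit: Dict[str, List[float]], threshold: float
) -> List[str]:
    """Return IDs of participants who were high risk at some visit and
    low risk at a later visit (scores are in chronological order)."""
    flagged = []
    for participant_id, risks in risk_by_visit.items():
        seen_high = False
        for risk in risks:
            if risk >= threshold:
                seen_high = True
            elif seen_high:
                # High-to-low transition: needs a clinical explanation.
                flagged.append(participant_id)
                break
    return flagged


# Made-up scores for three participants at V1, V2, and V3.
scores = {
    "p01": [0.60, 0.30, 0.70],  # high at V1, low at V2: flagged
    "p02": [0.20, 0.40, 0.80],  # risk only rises: not flagged
    "p03": [0.70, 0.80, 0.90],  # consistently high: not flagged
}

downgraded = flag_risk_downgrades(scores, threshold=0.5)
```

Flagged cases are candidates for review rather than errors in themselves, since a genuine change in clinical measurements may justify a lower risk estimate.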
Fairness metrics and analysis of the causes of bias should become standard practice in model validation. We first hypothesized that the bias against Black participants was caused by the limited sample size, which was skewed disproportionately towards White participants, and by the potentially inappropriate over-representation of Black participants in the sPE+E class relative to the NPH class (20.9% vs. 13.8%, respectively). However, the bias persisted after correcting for this imbalance. We then hypothesized that the bias might come from a difference in the distribution of values for the top placental analytes, as suggested in another study[40]. We did observe significant differences in the distribution of top predictive features (P<0.001), such as BMI and PlGF (V1, V2). Because some of the top features are correlated, we cannot simply normalize each by race. Therefore, adjusting the predictive threshold for the Black population is still an efficient way to reduce bias. While the cost of a false negative diagnosis for maternal and fetal health is very high, the stress, fees, and possibly inappropriate treatment resulting from a false positive should not be ignored.
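The threshold adjustment mentioned above can be sketched as follows: for each group, choose the highest decision threshold at which sensitivity still reaches a common target, so that no group is left with a markedly lower detection rate. This is a minimal illustration under stated assumptions, not the procedure used in this study; the group labels, scores, outcome labels, and the target sensitivity of 0.6 are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def sensitivity(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of true cases (label 1) with a score at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("no positive cases in this group")
    return sum(s >= threshold for s in positives) / len(positives)


def threshold_for_sensitivity(
    scores: Sequence[float], labels: Sequence[int], target: float
) -> float:
    """Highest observed score that, used as a threshold, reaches the target sensitivity."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must lie in [0, 1]")
    for candidate in sorted(set(scores), reverse=True):
        if sensitivity(scores, labels, candidate) >= target:
            return candidate
    return min(scores)  # defensive fallback; the lowest score always gives sensitivity 1.0


# Made-up (scores, labels) per group; group names are placeholders.
groups: Dict[str, Tuple[Sequence[float], Sequence[int]]] = {
    "A": ([0.9, 0.7, 0.6, 0.3, 0.2], [1, 1, 1, 0, 0]),
    "B": ([0.8, 0.5, 0.4, 0.3, 0.1], [1, 1, 1, 0, 0]),
}

thresholds = {
    name: threshold_for_sensitivity(scores, labels, target=0.6)
    for name, (scores, labels) in groups.items()
}

# With a single shared threshold of 0.7, group B would have lower sensitivity.
shared_b = sensitivity(*groups["B"], threshold=0.7)
```

Lowering a group's threshold raises its sensitivity at the price of more false positives, so the target should be chosen with both error costs in mind.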
Distinguishing between sPE+E and NPH is critical, but the binary labels pose a challenge. The former group undoubtedly contains different subgroups and phenotypes of preeclampsia, and learning to make these distinctions will have the dual benefit of enhancing our understanding of preeclampsia and allowing for better predictive performance. Thus, moving beyond the initial literature-inspired feature set to a broader set of features will be the target of future work. Furthermore, temporal features capturing change between clinical measurements at different visits will be investigated, as this may enhance prediction quality at the second and third time points[28]. This would enable more timely monitoring and treatment of late onset preeclampsia.
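As an illustration of the temporal features mentioned above, the sketch below derives the change in each measurement between consecutive visits and skips any pair in which a value is missing. The column naming scheme (`map_v1`, `plgf_v2`, and so on), the helper name `add_visit_deltas`, and the example values are assumptions made for this example and do not reflect the nuMoM2b variable names.

```python
from typing import Dict, Optional, Sequence

Record = Dict[str, Optional[float]]


def add_visit_deltas(
    record: Record,
    features: Sequence[str],
    visits: Sequence[str] = ("v1", "v2", "v3"),
) -> Record:
    """Return a copy of the record with change-between-visit features added."""
    enriched = dict(record)
    for feature in features:
        # Pair each visit with the one that follows it: (v1, v2), (v2, v3).
        for earlier, later in zip(visits, visits[1:]):
            before = record.get(f"{feature}_{earlier}")
            after = record.get(f"{feature}_{later}")
            if before is None or after is None:
                continue  # measurement unavailable at one of the two visits
            enriched[f"{feature}_delta_{earlier}_{later}"] = after - before
    return enriched


# Made-up participant: blood pressure at three visits, PlGF at two visits only.
participant: Record = {
    "map_v1": 80.0, "map_v2": 86.0, "map_v3": 91.0,
    "plgf_v1": 40.0, "plgf_v2": 120.0,
}

with_deltas = add_visit_deltas(participant, ["map", "plgf"])
```

Here the analyte receives only a V1 to V2 change, mirroring the situation in which a measurement is not collected at every visit.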
A more significant departure will involve re-framing the prediction task. Compelling arguments have been made that preeclampsia is best interpreted as a syndrome rather than a disease[27,41]. Label difficulties have led at least one study of short term preeclampsia screening to focus on a label that consists of the presence, or not, of at least one of multiple maternal or adverse fetal outcomes[27].
Limitations
A set of features identified in the related medical literature was employed for this initial study, but this can be expanded without issue. Using the nuMoM2b data represents an exciting opportunity to learn from a sizable sample of U.S. mothers that is more diverse than those of other similar studies and that has been captured in a longitudinal study with a considerable number of features[3,27,42]. The occurrence rate of PE in this study was consistent with reported rates[4,43]. However, this meant that even with such a sizable sample, the analysis was limited to little more than a couple of hundred sPE+E cases. The sub-study also had limitations: analytes were only available for V1 and V2. Our study applies only to the nulliparous population within the US. Therefore, our models do not take previous obstetric history into account.
One notable limitation of the study is the small number of participants with pre-existing medical conditions in the placental analytes sub-study. This low prevalence can cause the model to attribute less importance to these risk factors, although they could be crucial in clinical practice. Lastly, our study focuses only on comparing patients with sPE+E and NPH, without addressing patients who developed PE with mild features or hypertension alone.