Identifying the Early Signs of a Preterm Birth: A Large Cohort Study

Background and Purpose: Preterm birth (PTB) is the leading cause of infant mortality in the U.S. and globally. The goal of this study is to increase understanding of PTB risk factors that are present early in pregnancy by applying statistical and machine learning techniques to big data.

Methods: The 2016 U.S. birth records were obtained and combined with two area-level datasets, Area Health Resources File and County Health Ranking. We then applied multiple machine learning techniques to a cohort of 3.6 million singleton deliveries to identify generalizable preterm risk factors.

Results: The most important predictors of preterm birth are gestational and chronic hypertension, interval since last live birth, and history of a previous preterm birth, which respectively explain 14.91%, 6.92%, and 6.50% of the AUC. Parental education is one of the influential variables in the prediction of PTB, explaining 10.5% of the AUC. The relative importance of race declines when parents are more educated or have received adequate prenatal care. Gradient boosting machines outperformed the other machine learning techniques, with an AUC of 0.75 (recall: 0.64, specificity: 0.73) on the validation dataset.

Conclusions: Application of ML techniques improved the performance measures in the prediction of preterm birth. The results emphasize the importance of socioeconomic factors, such as parental education, as some of the most important indicators of a preterm birth. More research is needed on the mechanisms through which socioeconomic factors affect biological responses.
One of the major findings of this study is that the importance of race in predicting preterm birth can be explained when both individual risk factors, such as interval since live birth, education of parents, and whether the person received adequate care during pregnancy, and their interactions are added to the model. This analytical finding is consistent with the theory of Lifecourse for addressing racial disparities in preterm birth outcomes (36-38). The theory of Lifecourse emphasizes socioeconomic factors as the main determinants of health, which can result in a positive shift in an individual's long-term health trajectory.


Introduction
Preterm birth (PTB), defined as a birth before 37 weeks of pregnancy, is the leading cause of infant mortality in the U.S. and in the world (1). In 2013, PTB accounted for 36% of U.S. infant deaths in the first year of life (2). In addition to the monetary cost of PTB, which exceeds 25 billion dollars annually, these babies may suffer from life-long deficiencies (3,4). Many of the current interventions for reducing the likelihood of a preterm delivery, like progesterone therapy, are effective only if administered early in the pregnancy, between 16 and 24 weeks of gestation (5). In prenatal care settings, patients can be enrolled in helpful interventions for reducing behavioral risks without significant disruption of services (6). Therefore, it is critical to study risk factors of a preterm delivery that are present early in pregnancy or even before it. In addition, identifying the risk factors might help define a population useful for studying specific interventions. The identification of risk factors might also provide insight into the mechanisms of preterm birth, which are still largely unknown (7,8).

A large and growing body of literature has focused on finding the individual risk factors of preterm birth (7,9,10). The most important individual risk factor for predicting preterm delivery is a history of a previous PTB (both indicated and spontaneous) (11-13). Race is another major predictor of PTB. The preterm birth rate (PBR) among non-Hispanic (NH) Black mothers is 52% higher than among NH White mothers (13.77% vs. 9.04%, respectively) (14). Other significant risk factors of preterm birth include age (15), short cervix between 16 and 28 weeks of pregnancy (16), and chronic medical disorders like hypertension (17) or diabetes (18). Some studies attempted to increase the generalizability of the risk factors by including large cohorts (19).
Machine learning techniques are extensively used in advancing the understanding of spontaneous PTB risk factors (20-24). Despite the vast body of literature on the risk factors of PTB, very few interventions have been proven to effectively prolong gestational age in at-risk women (13,25). This is partly because two-thirds of preterm deliveries happen to women with no risk factors (26). Current risk assessment is limited by the low prevalence of individual risk factors in the general obstetric population (27). For example, the most important risk factor for preterm birth in singleton pregnancies is the history of a previous PTB (14,27). However, the history of a previous PTB is not applicable to women without a prior birth (nulliparous), who account for more than a third of total births. Many of the proposed studies consider only the main effect of the individual risk factor of PTB while controlling for a limited number of confounding variables and interactions that were selected manually (10,20,26,28

Data visualization is a challenging but insightful task in this study due to the large number of observations. We used violin plots from the ggplot2 package in R to plot the data and gain more information about the features and their relationship with preterm birth. Appendix C shows the visualization of each variable.

Model Development
Our dataset has five characteristics that guide the selection of methods. First, the distribution of the response variable is imbalanced: preterm birth occurs in only eight percent of singleton deliveries, and the remainder are full-term. Second, many of the features, such as age and education, are collinear (Pearson's correlation coefficient = 0.41). This limits the use of methods like logistic regression, which assumes little or no multicollinearity among the independent features. Third, we are interested in finding significant interactions among the variables. One of the best methods for learning interactions with minimal supervision is decision trees (30). Fourth, our dataset has 3.6 million records with 77 variables, which limits the use of memory-intensive methods like support vector machines. Fifth, the dataset has 20 categorical variables, which limits the application of distance-based methods like K-Nearest Neighbor. Based on these five characteristics, we apply regularized logistic regression, random forest, gradient boosting machines (GBM), and LightGBM to our dataset (see Appendix D for more details).

We used a grid search to find the best hyperparameters of logistic regression and random forest. For the GBM and LightGBM, however, we coupled Bayesian optimization (BO) with the ML performance measures to reduce training time. BO sequentially solves an optimization problem that looks for the set of hyperparameters with the most potential to improve the outcome, so it needs fewer iterations than an exhaustive grid search (31-33). To prevent overfitting and reduce run time, we also use early stopping (1e-4 after 5 rounds). We used a system equipped with a Core i7 2.50 GHz processor and 32.0 GB of memory, running the Ubuntu 18.04.3 operating system.
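The early stopping rule mentioned above can be illustrated with a short sketch. It assumes one plausible reading of "1e-4 after 5 rounds": training stops once five consecutive rounds fail to improve the best validation score by more than 1e-4. The function name, the parameter names, and the AUC values are hypothetical and are not taken from the study.

```python
def early_stopping_round(scores, tolerance=1e-4, patience=5):
    """Return the index of the round at which training would stop.

    scores: validation metric per boosting round (higher is better, e.g. AUC).
    Training stops once `patience` consecutive rounds fail to beat the best
    score seen so far by more than `tolerance` (an assumed interpretation).
    """
    best = float("-inf")
    rounds_without_gain = 0
    for i, score in enumerate(scores):
        if score > best + tolerance:
            best = score
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                return i
    return len(scores) - 1  # rule never triggered: all rounds were used


# Hypothetical validation AUC per round, made up for illustration only.
auc_per_round = [0.70, 0.72, 0.735, 0.7421, 0.7421,
                 0.74212, 0.7420, 0.7421, 0.74215, 0.75]
stop_at = early_stopping_round(auc_per_round)
print(stop_at)
```

With these made-up scores the rule fires before the last round is ever trained, which shows the trade-off: run time is saved at the risk of missing a late improvement.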

Handling Missing Values and Model Assessment
To handle missing observations and categorical variables, we use a method in which strings are internally mapped to integers, and splits are done over these integers. The performance metrics that we use in this study focus on the true positive rate (sensitivity or recall), because correctly identifying a preterm birth is more important than avoiding the mislabeling of a full-term birth as preterm.
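As a minimal sketch of the assessment metrics named in this paper (recall, specificity, precision), the following computes them from confusion-matrix counts. The counts are hypothetical, chosen to mimic a sample with 8% preterm births; they are not results from the study.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Compute recall, specificity, precision, and accuracy from counts."""
    return {
        "recall": safe_ratio(tp, tp + fn),        # true positive rate
        "specificity": safe_ratio(tn, tn + fp),   # true negative rate
        "precision": safe_ratio(tp, tp + fp),     # positive predictive value
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts: 1,000 deliveries, 80 of them preterm.
metrics = classification_metrics(tp=50, fp=250, tn=670, fn=30)

# A trivial classifier that labels every delivery as full-term.
always_full_term = classification_metrics(tp=0, fp=0, tn=920, fn=80)

print(metrics)
print(always_full_term)
```

The trivial classifier reaches 92% accuracy while identifying no preterm births at all, which is why accuracy alone is a poor guide for an imbalanced response and recall is emphasized instead.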

Interpretation Techniques
To get the 'effect size' of each variable on the response, we use partial dependence plots (PDP). This is a useful tool for our study, particularly because we consider high-order interactions between our independent variables. A partial dependence plot returns the marginal 'effect size' of each variable on the response after accounting for the (average) effect of the other variables:

$$\bar{f}_s(X_s) = E_{X_c}\left[ f(X_s, X_c) \right]$$

where $X_s$ and $X_c$ are complementary subsets of the full predictor set $X$, and $f$ is the fitted model.

It is important to note that the PDP does not ignore the effect of $X_c$. The latter case can be estimated by the conditional expectation

$$\tilde{f}_s(X_s) = E\left[ f(X_s, X_c) \mid X_s \right]$$

The quantities $\bar{f}_s$ and $\tilde{f}_s$ will be the same only if $X_c$ and $X_s$ are independent, which is an unlikely situation.
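The difference between partial dependence and the conditional expectation can be seen in a toy calculation. The sketch below uses a hypothetical additive model and eight made-up observations in which the feature of interest (`x_s`) and the other feature (`x_c`) are negatively correlated. Nothing here comes from the study's data or fitted model.

```python
def model(row):
    # Hypothetical fitted model: both features raise the predicted risk.
    return 0.05 + 0.02 * row["x_s"] + 0.04 * row["x_c"]


# Made-up data in which x_s = 1 mostly coincides with x_c = 0.
rows = (
    [{"x_s": 0, "x_c": 1}] * 3 + [{"x_s": 0, "x_c": 0}]
    + [{"x_s": 1, "x_c": 0}] * 3 + [{"x_s": 1, "x_c": 1}]
)


def partial_dependence(model, rows, feature, value):
    """Average prediction over ALL rows with `feature` forced to `value`."""
    return sum(model({**row, feature: value}) for row in rows) / len(rows)


def conditional_mean(model, rows, feature, value):
    """Average prediction over only the rows where `feature` equals `value`."""
    matching = [row for row in rows if row[feature] == value]
    return sum(model(row) for row in matching) / len(matching)


pd_curve = [partial_dependence(model, rows, "x_s", v) for v in (0, 1)]
cond_curve = [conditional_mean(model, rows, "x_s", v) for v in (0, 1)]
print(pd_curve, cond_curve)
```

In this example the partial dependence values are about 0.07 and 0.09, recovering the 0.02 effect of `x_s`, while the conditional means are both about 0.08 because the correlation with `x_c` masks that effect.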

We randomly separated 75% of the data for the training set and the remaining 25% for validation purposes. The performance metrics are reported for this held-out set, which is not part of the training process. We used five-fold cross-validation for all methods.
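A minimal sketch of this kind of split is given below. It assumes the five folds are drawn from within the 75% training portion, which the text does not state explicitly, and it uses a small hypothetical row count and an arbitrary seed in place of the 3.6 million records.

```python
import random


def train_validation_split(n_rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and split them into training and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=5):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out


# Hypothetical dataset size, used only to keep the example small.
train_idx, valid_idx = train_validation_split(1000)
folds = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(valid_idx), len(folds))
```

Under this assumed design the validation rows never enter any fold, so the metrics reported on them are not influenced by hyperparameter tuning.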

Study Design
The parameters for Logistic Regression with Elastic Net regularization (LR-EN) are set as

Comparison with other studies
Few similar studies have used a high-dimensional dataset. Weber, Darmstadt (20) developed their model on a high-dimensional dataset with 1000 initial features and 2.7 million observations. However, they developed their predictive model for early spontaneous preterm birth, which happens at a much lower rate (1.02%) than the singleton preterm deliveries in our study (7.63%). Another study by Alleman, Smith (19) has the closest setup in terms of developing the predictive model for singleton pregnancies but has a smaller dataset than our study.

Table 2 compares the performance of our best GBM with the most relevant preterm birth studies. To be included, a paper has to either use data with a large sample size that includes demographic information as predictors or use machine learning techniques to build a predictive model for preterm birth. We report the sample size, prevalence of the positive class, test AUC, recall, and specificity for each study. As can be seen in Table 2, our best GBM model outperforms the frameworks in these studies, improving the AUC by more than 5%, 9%, and 13% compared to the work of Goodwin, Iannacchione (34), Alleman, Smith (19), and Weber, Darmstadt (20), respectively. The improvement in the combined AUC, recall, and accuracy stems from pre-processing steps that remove anomalies and noise, regularization methods, an optimized set of hyperparameters, and the superior ability of the GBM algorithms to extract high-level features from the data.

Gestational and chronic hypertension, interval since last live birth, and history of a previous preterm birth are the most important predictors of preterm birth, respectively explaining 14.91%, 6.92%, and 6.5% of the AUC. Mothers' pre-pregnancy BMI is also an important predictor of preterm birth.
Figure 1 shows the interesting result that race has less relative importance when we consider factors like parents' education, age, and adequacy of care during pregnancy.

... have consistently been at a higher risk of preterm delivery (14,35). In 2016, 10.88% of Black singleton pregnancies resulted in a preterm baby versus 7.11% for White mothers. Our results in Figure 2 show that this likelihood is 7.02% (P-value < 0.001) for Black versus 6.32% (P-value < 0.001) for White mothers when we account for the (average) effect of all factors such as education and age of parents, and adequacy of care during pregnancy in each class.

... preterm birth while accounting for the effect of other variables. Figure 3 shows two examples of partial dependence plots (PDP). Figure 3.a shows the relationship between a mother's BMI and the likelihood of preterm delivery. The PDP shows that mothers with very low BMI (less than 22) are at higher risk of delivering a preterm baby.

... increases. However, mothers with a Bachelor's degree are the least likely group to have a preterm baby (6.24% with P-value < 0.001), and the likelihood increases for any degree above or below that. The graph also offers an important insight about the interpretation of missing values. A missing value in the education of a father or mother carries important information: the likelihood of a preterm delivery for these observations is the highest (6.92% with P-value < 0.001) compared to other groups. Appendix G shows the PDP of other major risk factors.

In this study, we deployed statistical and machine learning techniques to first build a predictive model and then extract the risk factors of preterm birth (PTB) that are present during the early stages of pregnancy. This study is novel in that the application of ML techniques to a large cohort increases the generalizability of the risk factors.
We included both nulliparous and multiparous mothers and both spontaneous and indicated preterm births (excluding multifetal pregnancies), which also increases the generalizability of our PTB prediction model. We reported the variable importance and partial dependence plots for the first time in the study of PTB.

Figure 3. (a) Preterm delivery likelihood for different values of BMI. (b) Preterm delivery likelihood for different levels of education.
The reported metrics indicate that our best GBM model improves the performance of preterm prediction compared to similar works that combined maternal characteristics with important biological markers like serum analytes (19,34). One of the major findings of this study is that the importance of race in predicting preterm birth can be explained when both individual risk factors, such as interval since live birth, education of parents, and whether the person received adequate care during pregnancy, and their interactions are added to the model. This analytical finding is consistent with the theory of Lifecourse for addressing racial disparities in preterm birth outcomes (36-38).

The results of our GBM model agree with the findings of previous studies. Variables like hypertension ("hyper"), interval since last live birth ("interval"), and history of PTB ("Previous_preterm") are among the most important predictors of a preterm birth, which is consistent with past studies (7,27

... (19,20). For example, some studies used a majority White population or a sample from one geographical location to assess the PTB risk factors (19,21). A major strength of this study was the application of data science to population-based linked singleton birth records in the U.S. to address this gap. However, using the U.S. birth dataset had its own challenges, like the existence of anomalous observations and random errors. To mitigate this problem, we applied an advanced machine learning technique, auto-encoders with deep neural nets, to perform data cleaning and preparation. This study also contributes to the preterm birth literature by providing important insights through advanced visualization techniques.
The initial visualization of variables like mother's age versus gestational age (see Appendix C) shows a clear relationship between these two variables, in which the risk of a preterm delivery is highest at the extremes of maternal age. These findings match the results of multiple other in-depth analyses (15,40). Partial dependence plots (PDP) are the other insightful tool that we used in this analysis. PDPs like the one for mother's BMI in Figure 3 show that the extremes of pre-pregnancy BMI are associated with increased rates of PTB, which is compatible with the findings of other studies (27,41). The PDP provides a better estimation of this association compared to previous studies (42), because it takes the (average) interdependent effect of other variables into account.

There is still significant room for improving the precision of preterm birth prediction in large cohort studies. The positive predictive value (precision) of past studies varied between 17 and 30 percent depending on the sample used in the analysis (26,43). Our model shows a maximum precision of 28.13% on a national-level dataset, which approaches the best results of similar studies. However, this metric is still relatively low. This low precision is due to the lack of knowledge regarding the cause(s) of PTB and the absence of important predictors of preterm birth (e.g., cervical length) in the CDC dataset (26). Our study was subject to other limitations. Despite using the obstetric estimate for categorization of PTB, there remains potential for errors (44). However, we used large samples and multifold cross-validation, which minimize the effect of incorrect categorization. Also, some of the biomarkers, like cervical length or fetal fibronectin, that are routinely measured in obstetrical screenings were unavailable in the U.S. linked birth datasets.
The association of these biomarkers and their interactions with the likelihood of a PTB can be assessed in future research.