Nomogram to Early Screen Multiparous Women for Preterm Birth in a Cohort Study

Background: Preterm birth (PTB) can negatively affect the health of mothers as well as infants. Prediction of this pregnancy complication remains difficult, especially in middle- and low-income countries, because of limited access to specific tests and scarce data collection. Multiparous women in our study presented a higher PTB prevalence than nulliparous women.
Methods: In a cohort study of 1996 women from Northern Lebanon, 922 were multiparous, with a PTB prevalence of 8%. We analyzed the personal, demographic, and health indicators available for this group of women. We compared 4 modified logistic regression models (up-sampling, LASSO penalized regression) to develop a nomogram that can screen for preterm birth in multiparous women. The models were trained and validated on different data sets.
Results: The best PTB prediction reached around 88%. It was obtained using a logistic regression model trained on up-sampled data and penalized with LASSO (Least Absolute Shrinkage and Selection Operator). The regression coefficients of the 6 selected variables (Pre-hemorrhage, Social status, Residence, Age, BMI, and Weight gain) were used to create a nomogram to screen multiparous women for PTB risk.
Conclusions: The nomogram, based on readily available indicators for multiparous women, identified most of the women at risk of PTB reasonably well. This tool will allow physicians to screen for women at high risk of spontaneous preterm birth and to run appropriate additional tests, leading to better medical surveillance that can reduce PTB incidence.


Plain English Abstract
Preterm birth (PTB) is still one of the pregnancy complications that negatively affect the health of mothers as well as infants. Predicting preterm birth remains difficult, especially in middle- and low-income countries, because of scarce data collection and limited resources to perform advanced clinical tests. In our study, women with at least one child presented a higher PTB prevalence than women in their first pregnancy. In the absence of specific preterm clinical tests, we used data collected on past pregnancies to develop a graphical tool, a nomogram, which can also be used in a spreadsheet to evaluate the PTB risk of these women. The evaluation of the nomogram showed promising results: it identified 88% of the women at risk of PTB using information that is easily available at the beginning of pregnancy, including past hemorrhage or diabetes, high social status, residence in a city, age above 25 years, obese BMI, and excessive weight gain.

Background
Although preterm birth (PTB) prevalence varies widely among countries, it is generally estimated at between 3% and 13% of all pregnancies [1,2]. PTB is also among the leading causes of morbidity and mortality in children under 5 years of age, particularly in countries with a large number of low- to middle-income households, notably some Asian and African countries [3].
Screening for PTB remains difficult in the absence of specific tests that identify mothers at high risk of preterm birth, although measurements such as cervical length and cervicovaginal fetal fibronectin have been used with some success [4]. Hence, most prediction studies have used maternal factors associated with PTB. Some are considered non-modifiable, such as a history of PTB, extremes of maternal age (< 19 and > 35 years) [1], multiple pregnancies, short cervical length, uterine abnormalities, and genetic factors [5]. Others may be preventable with close medical surveillance: factors related to nutrition, socioeconomic status, low body mass index (BMI), obesity, poor pregnancy weight gain, smoking, substance abuse, short inter-pregnancy interval, periodontal disease, bacterial vaginosis, late or no prenatal care, untreated antenatal depression, and the use of assisted reproductive technologies [3].
These maternal factors have been used to develop models that predict preterm birth [6]. The models range from traditional logistic regression, used to identify risk factors and estimate odds ratios, to more recent machine learning algorithms, including neural networks [7]. Although neural network algorithms have been shown to yield very high preterm prediction results, it is difficult to derive from them a simple version that physicians can use, especially in developing countries where the gynecologist takes all the decisions. By contrast, the linear coefficients of logistic regression models have been used in nomograms and spreadsheets to deliver prediction tools that any physician can use [8].
Early detection of PTB can help lower the risks for infants and mothers through corticosteroid administration, cervical cerclage, and other effective treatments [9]. However, because of the low prevalence of PTB, women need to be screened first so that only those selected undergo further appropriate tests for potential PTB, especially in developing countries with limited resources.
Our retrospective data on 1996 women showed that the 922 multiparous women had a preterm prevalence (8%) more than double that of nulliparous women (3%). Reports on the relation between PTB incidence and multiparity have been inconsistent and variable [10]. Although many models exist to screen for preterm birth, a more focused analysis of multiparous women is needed, especially because indicators from past pregnancies are available for them.
The main objective of this project was to develop a valid and easy-to-use tool that lets physicians screen non-nulliparous pregnant women for preterm birth risk, based on routinely collected data such as medical history and demographic and weight parameters. We improved the prediction by training the models on resampled datasets (up-sampling) to mitigate the problem of the low prevalence of preterm birth. We also used LASSO (Least Absolute Shrinkage and Selection Operator) regularized logistic regression models to help analyze and select the covariates giving the best possible preterm risk evaluation.

Methods
Source of data: Data were obtained from the medical records of five hospitals in North Lebanon (private and public Islamic hospitals, Sayyidet Zgharta hospital, governmental hospital of Akkar, and governmental hospital of Tripoli).
Participants: In addition to the aforementioned collection of data from medical records, we also collected data directly from 688 women. The participants were chosen in concert with many local gynecologists.
Outcome: The objective was to develop a model that predicts spontaneous preterm risk for multiparous women and that can also be expressed as a nomogram that is easy for physicians to use.
Predictors: The cohort study included binary responses to 15 variables, with the positive class defined as follows: Age (25-35 years), BMI (obese), Education-husband (high: university degree), Education-mom (high: university degree), Pre-Cesarean (presence in last pregnancy), Pre-Diabetes (presence in last pregnancy), Pre-eclampsia (presence in last pregnancy), Pre-Hemorrhage (presence in last pregnancy), Pre-Induction (presence in last pregnancy), Residence (city), Preterm (spontaneous, presence in last pregnancy), Smoking (smoker), Social-status (high), Weight-gain (excess), Work-husband (external job), and Work-mom (external job). The Body Mass Index (BMI) of each woman was calculated using the formula BMI = weight (kg) / height² (m²). Women were divided into obese and non-obese weight groups based on WHO guidelines [11] (BMI below or above 30). The underweight group was discarded because it had a negligible number of representatives.
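As an illustration of the BMI coding described above, the following minimal Python sketch computes the index and the binary obese indicator. The function names are hypothetical, and treating a BMI of exactly 30 as obese is an assumption, since the text only states "below or above 30".

```python
OBESITY_CUTOFF = 30.0  # BMI cut-off between non-obese and obese used in the text


def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float) -> bool:
    """Binary predictor: True for the positive (obese) class.

    Assumption: a BMI of exactly 30 is counted as obese.
    """
    return bmi(weight_kg, height_m) >= OBESITY_CUTOFF


example_bmi = bmi(81.0, 1.80)          # 81 / 3.24
example_flag = is_obese(100.0, 1.80)   # 100 / 3.24 is above the cut-off
```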
Missing data: There were no missing data because records with incomplete data, women aged under 17 or above 35, and women suspected of carrying fetuses with congenital malformations were excluded from the study.
Sample size: The data used in this work were part of a program to evaluate fetal complications of pregnancy in Northern Lebanon. There were 922 multiparous women among a total of 1996 who gave birth between January 2014 and January 2016. We divided the multiparous data into two files. The first, called testdata, served as the training set and was composed of 706 profiles originating from the medical records. The second file, used for model validation, comprised the 216 multiparous profiles, out of the 922, that we collected directly from the women.
Statistical analysis methods: All the predictors were coded as binary variables. The first model (glm) was a logistic regression fitted on the testdata file. The second model (glmup) was also a logistic regression, fitted on a new file generated from the testdata file by up-sampling. This file, called upsampleddata, included 1258 profiles representing 649 profiles of non-preterm women and 649 profiles randomly generated by the up-sampling algorithm for women with a preterm birth. The third model (glmnetup) was a LASSO penalized logistic regression. The final model (glmglmnetup) was a logistic regression using only the predictors selected by the LASSO penalized regression, trained on the upsampleddata.
All models were validated first using the testdata and then using the validation data set (validationset).
The models were compared in terms of statistically significant predictors along with the percentage of true positives and false positives. A woman was classified as positive when her predicted risk (probability) was higher than 50%. We also compared the risk distribution profiles given by each model.
Chi-square tests, Fisher tests, and Principal Component Analysis for categorical variables were performed using SPSS. The logistic regression modeling, up-sampling, and LASSO penalization were carried out using R version 3.6.1. The nomogram was created using the lrm function (rms package) in R version 3.6.1.
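The up-sampling step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R routine used in the study: the function name, the `preterm` key, and the toy data are hypothetical, and the sketch keeps every original profile and adds randomly drawn copies of minority-class profiles (sampling with replacement) until both classes have the same size.

```python
import random


def upsample_minority(rows, label_key="preterm", seed=0):
    """Return a class-balanced copy of rows by resampling the minority class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = [row for row in rows if row[label_key] == 1]
    negatives = [row for row in rows if row[label_key] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    minority, majority = sorted((positives, negatives), key=len)
    # Draw extra minority profiles with replacement until both classes match.
    extra = [dict(rng.choice(minority)) for _ in range(len(majority) - len(minority))]
    balanced = [dict(row) for row in rows] + extra
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 10 non-preterm profiles and 2 preterm profiles.
toy = [{"preterm": 0, "bmi_obese": i % 2} for i in range(10)]
toy += [{"preterm": 1, "bmi_obese": 1}, {"preterm": 1, "bmi_obese": 0}]
balanced = upsample_minority(toy)
```

Because the added profiles are duplicates of existing ones, the majority class is left unchanged and no new information is created; only the class balance seen by the regression changes.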
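The comparison criteria above can be illustrated with a short Python sketch that applies the 50% risk threshold and counts true and false positives. The function name, the returned keys, and the toy values are hypothetical; in particular, reporting false positives as a share of all women is an assumption about how the percentages were expressed.

```python
def screen_metrics(probabilities, outcomes, threshold=0.5):
    """Classify each woman as positive when risk > threshold and summarize."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = tn = fn = 0
    for risk, outcome in zip(probabilities, outcomes):
        flagged = risk > threshold
        if flagged and outcome == 1:
            tp += 1
        elif flagged and outcome == 0:
            fp += 1
        elif not flagged and outcome == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        # Share of actual PTB cases that the model flags.
        "ptb_detected": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Assumption: false positives expressed as a share of all women.
        "false_positive_share": fp / total,
        "accuracy": (tp + tn) / total,
    }


# Toy example (hypothetical risks and outcomes).
metrics = screen_metrics([0.9, 0.6, 0.4, 0.7, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
```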

Results
The multiparous women of the sample represented 46% of the total retrospective data. Despite some overlap, the multiparous women formed a distinct group characterized by a relatively lower social status and a higher incidence of gynecological complications, as shown by the projection on the first two principal components (Fig. 1).
These 922 multiparous women were mostly urban, relatively older, working women with a high level of education living in good-income households (Table 1). Most had a university-level education (79%), as did their husbands (81%). About 65% reported having a job. Almost all the husbands reported having a job (96%), with a good social level (high income for 71%). The women were also predominantly in the 25 to 35 age bracket (64%) and residing in the city (76%). About 33% of the women had an obese BMI, and 47% presented an excessive weight gain during the pregnancy. The percentage of mothers who smoked during pregnancy was 13%. The dominant gynecological complications during past pregnancies were diabetes (5%) and pre-eclampsia (4%). Approximately 31% had had an induction and 29% a hemorrhage.
There were 75 spontaneous preterm cases among the 922 multiparous women representing a PTB prevalence around 8%, which represented more than double the prevalence for nulliparous women. The percentage of women with PTB was slightly higher in the validationset with about 9.7% (21 among 216) than the testdataset with 7.6% (54 among 706).
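The prevalence figures quoted above follow directly from the case counts, as this small Python check illustrates. The helper name is hypothetical; the counts are those reported in the text.

```python
def prevalence_percent(cases: int, total: int) -> float:
    """Prevalence as a percentage of the group."""
    if total <= 0 or not 0 <= cases <= total:
        raise ValueError("need 0 <= cases <= total and total > 0")
    return 100.0 * cases / total


all_multiparous = prevalence_percent(75, 922)   # whole multiparous group
training_file = prevalence_percent(54, 706)     # testdata file
validation_file = prevalence_percent(21, 216)   # validationset file
multiparous_share = prevalence_percent(922, 1996)  # share of the full cohort

print(round(all_multiparous, 1), round(training_file, 1), round(validation_file, 1))
```

The two files are consistent with the totals: 54 + 21 = 75 cases and 706 + 216 = 922 women.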
The covariates showing the largest difference in percentage between the PTB-positive and PTB-negative classes were Pre-hemorrhage, Weight gain, Age, BMI, and Social status (Fig. 2). The Chi-square test revealed that most of these variables were statistically significant at least at the 5% level (Fig. 2). Smaller, non-statistically significant differences were observed for Pre-diabetes, Work-husband, and Pre-eclampsia. Pre-eclampsia and Pre-diabetes were discarded from further modeling because of their low prevalence, which even reached 0 in the positive class. It is most likely that women with these indicators were already under surveillance for PTB, which may explain their low prevalence.
The first logistic regression model (glm) was fitted on the original training data (Table 2). Despite presenting a high AUC (Area Under the Curve) of 0.84, this logistic model gave a low prediction of PTB that did not exceed 16% for the training set and 12% for the validation dataset. The women of the majority non-PTB class were classified correctly, which explains the high AUC observed (accuracy higher than 92% for the training and validation datasets). In contrast, after creating a balanced sample with the up-sampling algorithm and running the logistic model (glmup) on it, PTB prediction improved notably (Table 3). Indeed, PTB prediction ranged from 78% for the training set to 92% for the validation dataset, although the number of misclassified non-PTB women increased significantly, from a few cases for the first model (glm) to about 25% of the total number of pregnant women for this regression model. Comparable results were obtained with the LASSO regularized model (glmnetup) and with the logistic regression using the variables selected by the LASSO regularization (glmglmnetup). The latter gave the lowest number of false positives of all the models (lower than 21%) while maintaining high PTB prediction (Table 3), but its accuracy still decreased to around 79%.
The comparison of the distribution of the PTB risk estimated by each model with the original data (Fig. 3) showed that logistic regression before up-sampling (glm) and the LASSO model (glmnet) generally underestimated the probabilities in comparison to the other models. Even the last logistic model, which used the LASSO-selected variables, slightly underestimated those probabilities. However, both logistic regressions with up-sampling, before or after LASSO regularization, gave a risk (probability) distribution closer to the original data than the other models (Fig. 3).
Along with the improvement in preterm prediction, the number of statistically significant covariates (at least at the 5% level) also increased from 5 for glm to 10 for glmup, but glmnetup reduced this number to 6 (Table 2). The regression model using the LASSO-selected variables (glmglmnetup) was used to develop a nomogram (Fig. 4). Validation of this nomogram with the data of this study showed that it can give a reasonably accurate PTB risk for a multiparous woman given her levels of Social status, Residence, Pre-hemorrhage, Age, BMI, and Weight gain.
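A nomogram of this kind is a graphical reading of the logistic regression equation, which can equally be evaluated in a spreadsheet or a few lines of code. The Python sketch below shows the principle for the 6 binary covariates. The intercept and coefficients are hypothetical placeholders chosen only for illustration; they are not the fitted values of this study, and the dictionary keys and function names are likewise assumptions.

```python
import math

# HYPOTHETICAL values for illustration only; NOT the coefficients fitted in this study.
INTERCEPT = -2.0
COEFFICIENTS = {
    "social_status_high": 0.8,
    "pre_hemorrhage": 2.1,
    "residence_city": 0.4,
    "weight_gain_excess": 0.9,
    "bmi_obese": 0.7,
    "age_25_35": 0.5,
}


def ptb_risk(profile: dict) -> float:
    """Logistic risk: 1 / (1 + exp(-(intercept + sum of coefficient * indicator)))."""
    linear = INTERCEPT
    for name, coefficient in COEFFICIENTS.items():
        value = profile.get(name, 0)  # a missing indicator is read as 0 (negative class)
        if value not in (0, 1):
            raise ValueError(f"{name} must be coded 0 or 1")
        linear += coefficient * value
    return 1.0 / (1.0 + math.exp(-linear))


def flag_for_surveillance(profile: dict, threshold: float = 0.5) -> bool:
    """Screen positive when the estimated risk is above the chosen threshold."""
    return ptb_risk(profile) > threshold


baseline_risk = ptb_risk({})                       # all indicators in the negative class
hemorrhage_risk = ptb_risk({"pre_hemorrhage": 1})  # only past hemorrhage present
```

Raising or lowering `threshold` changes how many women are flagged, which is how the size of the screened group could be matched to the available health care capacity.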

Discussion
The results of this work led to a significant improvement in early preterm birth prediction, reaching up to 88%, for multiparous women using routinely collected social, demographic, and health indicators. The model that gave the best PTB prediction and the lowest number of false positives was used to draw a graphical nomogram that physicians could easily use to screen for high PTB risk. Nevertheless, physicians will need to place about 21% of the total number of multiparous women (those at risk of PTB plus false positives) under stricter medical surveillance.
To achieve this level of PTB prediction, data augmentation of the initial sample through up-sampling algorithms was used. Hence, it is probable that the low PTB prediction of the logistic regression model based on the original data was at least partially due to the low prevalence of preterm birth. This model still predicted the majority class of non-PTB women with levels comparable to reported data on preconception PTB modeling [8].
However, using logistic regression to predict low-prevalence events may lead to meaningless outcomes [12]. Data augmentation by up-sampling randomly increases the number of positive preterm birth profiles in the newly generated dataset without changing the other class, which comprises the women not presenting PTB [13]. This technique has been used successfully in investigations with low or very low prevalence, including with some machine learning techniques such as convolutional neural networks [7].
The logistic regression model on low-prevalence data clearly underestimates the general probability [14]. A similar phenomenon was also observed for the LASSO-based model, albeit with a significantly smaller underestimation. Furthermore, the regressions on up-sampled data included a higher number of significant variables to explain the model. The significant variables identified by logistic regression corresponded almost exactly to the variables selected by LASSO regularization. However, the final model using the 6 variables selected by LASSO regularization decreased the number of false positives and hence gave the best results for PTB prediction.
The selected covariates that seem to significantly affect PTB in this study were Social status, Pre-hemorrhage, Residence, Weight gain, BMI, and Age. These variables were used to draw a nomogram that can be used to screen multiparous women for PTB. Hence, it seems that access to adequate medical care through a high income and the avoidance of weight problems are key factors in decreasing PTB incidence for this group of multiparous women. Nevertheless, although residing in the city may grant easier access to medical care than residing in villages, urban women presented a slightly higher PTB risk. A higher PTB risk in urban areas has also been reported in China [15]. Indicators of excess weight, in terms of BMI or pregnancy weight gain, increase preterm risk, especially when coupled with older pregnancy age [16,17]. It is noteworthy that, besides social status, the high incidence of hemorrhage in this group of women (29%, higher even than in some countries of lower national income [18]) led to the highest adjusted odds ratio for PTB, with a 95% interval of 6.88 to 10.24.
However, this study has several limitations. It would be improved by a larger number of women in the sample. On top of the low number of cases, the sample was fairly homogeneous because data are better kept in hospitals that treat a larger number of patients of high social status. We hope that this type of work will encourage health authorities to establish public databases on births in low- to middle-income countries such as this one. Pre-eclampsia and diabetes were not used in the models because their very low prevalence affected the interpretation of the models. More variables could be added, such as past PTB, the number of children, stressful work, anxiety, and planned pregnancies, among others. Measurements such as cervical length and cervicovaginal fetal fibronectin should be added to the screening model or at least carried out on the group of women screened by the nomogram.

Conclusion
Using readily available information from past pregnancies along with social and weight indicators, we developed a nomogram that can be used to screen for PTB risk in multiparous women. The best logistic regression model, which was used to develop the nomogram, showed that a group representing about 1/5 of the total number of women included 88% of the women at high PTB risk. This group, identified using a risk threshold of 50%, should undergo additional tests or at least closer medical surveillance for PTB. The size of this group could be adjusted to the health care capacity by decreasing or increasing the probability threshold used with the nomogram. The nomogram uses the binary responses to 6 covariates: Social status, Pre-hemorrhage, Residence, Weight gain, BMI, and Age.
To achieve a reasonably high PTB prediction, the logistic regression was trained on a sample augmented by up-sampling, and LASSO penalization was used to help select the final covariates. These methods have proven their effectiveness for diseases or health complications with low or very low prevalence.

List Of Abbreviations
PTB: Preterm Birth; BMI: Body Mass Index; LASSO: Least Absolute Shrinkage and Selection Operator; AUC: Area Under the Curve.

Authors' contributions
Mrs Traboulsi Mayssa: is a Ph.D. candidate who collected the data and participated in the design and write-up of this work.
Pr. Zainab E. El Alaoui-Talibi: is the main Ph.D. advisor and participated in the design and write-up of this work.
Pr. Boussaid Abdellatif: is the Ph.D. co-advisor, participated in the design and write-up of this work, and executed and helped interpret the statistical analyses.

Figure 1. Projection on the first and second axes (34% of total variance) of a Principal Component Analysis of all the retrospective data, showing the separation between nulliparous (blue) and multiparous (red) women.
Figure 2. Percentage of preterm cases within the positive and the negative class for each covariate, with levels of significance for the Chi-square test (significant at the 5% level *, 1% level **, and 1‰ level ***).