Risk Factors of PTB
PTB has been associated with many risk factors, including inadequate antenatal care [34]–[37], antepartum hemorrhage [34], [38], [39], preeclampsia [34], [40], nulliparity [35], [37], [41], short interpregnancy interval [35], [42], [43], maternal age <20 years [35], [36], [44]–[46], advanced maternal age (≥35 years) [45], [46], [49], single marital status [35], [45], history of PTB [47]–[49], history of abortion [36], [45], [50], pre-pregnancy hypertension [49], [51], history of fetal demise [45], maternal underweight [46], [52], [53], first antenatal visit after the first trimester [37], [54], lower maternal education [41], smoking in pregnancy [55], prior cesarean delivery [42], [56], and pre-pregnancy diabetes [51].
Related Works
Many studies have examined preterm birth in developing countries [34], [35], [57]–[61]. These studies aimed to determine the risk factors of PTB, mainly using traditional statistical analysis. For instance, Bater et al. [35] studied the predictors of LBW and PTB in rural Uganda. They derived household, maternal, and infant characteristics from a prospective birth cohort study conducted from 2014 to 2016 in 12 districts, and applied stepwise logistic regression to data from 3,841 women (744 with PTB) to determine predictors of PTB. Ayebare, Ntuyo, Malande, and Nalwadda [34] studied maternal, reproductive, and obstetric factors associated with preterm births in Kampala’s National Referral Hospital. They also used logistic regression, but on a smaller sample of 296 women (99 with PTB). Other studies have similarly addressed determinants of PTB in developing countries without assessing the predictive power of the models [57]–[61].
Beyond statistical approaches, other studies have assessed the power of machine learning methods to predict PTB. However, many of these have been done in developed countries in the context of Electronic Health Records (EHR) [62]–[67]. For instance, Mercer et al. [62] developed a risk score-based system to predict PTB. They identified a number of risk factors, including fetal fibronectin, short cervix, and history of preterm birth, and used a sample of 2,929 women in the United States (US) to train a multivariate logistic regression. The model yielded a sensitivity of 24.2% and a specificity of 28.6% for nulliparous women, and a sensitivity of 18.2% and a specificity of 33.3% for multiparous women. Using the same dataset, Vovsha et al. [63] compared Support Vector Machine (SVM), logistic regression, and lasso regression under different model selection approaches, along with a model based on decision rules, for the prediction of PTB. Their linear SVM yielded 47% sensitivity and 57% specificity for predicting PTB at 28 weeks, an improvement over the sensitivity and specificity obtained by Mercer et al. Goodwin et al. [64] used data mining techniques and identified seven demographic variables that predict PTB. They used an ethnically diverse sample of 19,970 women in the US and obtained an area under the receiver operating characteristic curve (AUC) of 0.72. Weber et al. [65] utilized administrative data and extracted records for singleton pregnancies among nulliparous women in California from 2007 to 2011. They predicted PTB from demographic, maternal, and residency characteristics using K-nearest neighbors (KNN), lasso regression, and random forests (RF). The model yielded a low AUC of 0.67.
Koivu & Sairanen [66] used logistic regression (LR), artificial neural networks (ANN), gradient boosting decision trees, and ensemble models to construct individual classifiers to predict early stillbirth, late stillbirth, and preterm birth. They used pregnancy data provided by the Centers for Disease Control and Prevention (CDC), National Center of Health Statistics via their National Vital Statistics System in the US. They achieved an AUC of 0.64 for PTB under the best-performing model. Sun et al. [67] extracted data from EHR in a Beijing hospital. They used data based on physical examination, blood tests, urine test strips, and gynecological examination. They compared six algorithms for the prediction of PTB: Naive Bayesian (NBM), SVM, RF, ANN, K-means, and logistic regression. A total of 9,550 pregnant women were included in the study, of whom 4,775 had PTB. At 81.6%, the accuracy of the RF model was the highest among the algorithms compared.
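Because the studies above report different performance measures (sensitivity, specificity, accuracy, and AUC), it may help to see how these quantities relate to one another. The following is a minimal illustrative sketch in Python, not code from any of the cited studies. The labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the function names are hypothetical.

```python
# Toy illustration of the performance measures reported in the reviewed studies.
# All data below are invented for illustration; 1 = preterm birth, 0 = term birth.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case is scored
    higher than a randomly chosen negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true outcomes and model risk scores for eight pregnancies.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 to turn scores into predicted classes.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)           # share of PTB cases correctly flagged
specificity = tn / (tn + fp)           # share of term births correctly cleared
accuracy = (tp + tn) / len(y_true)     # share of all cases classified correctly
auc_value = auc(y_true, scores)        # threshold-free ranking quality

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"accuracy={accuracy:.3f} AUC={auc_value:.3f}")
```

Sensitivity, specificity, and accuracy all depend on the chosen threshold and, in the case of accuracy, on the proportion of PTB cases in the sample, whereas AUC summarizes ranking quality across all thresholds. This is one reason results reported with different measures are not directly comparable across studies.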
Studies on PTB using machine learning have also been carried out in developing and semi-developed countries [68]–[70]. For example, Prema and Pushpalatha [68] used data from local hospitals of Mysuru, India, and compared SVM with linear and nonlinear kernels against logistic regression. The risk factors they considered included age, number of times pregnant, diabetes, obesity, and hypertension. In the balanced dataset, SVM with a linear kernel yielded an accuracy of 76% (sensitivity 84% and specificity 73%) and logistic regression yielded an accuracy of 75% (sensitivity 70% and specificity 80%). Raja, Mukherjee, and Sarkar [69] used data from community health centers in Jharkhand, India. They used a feature selection approach based on the notion of entropy and compared the prediction accuracy of three classifiers, namely decision tree (DT), logistic regression, and SVM, for PTB prediction. The SVM classifier yielded an accuracy of 90.9%, the highest predictive accuracy among the studies reviewed here. Batoul et al. [70] compared SVM and logistic regression for predicting and classifying factors affecting PTB in women from Tehran, Iran. Using a dataset of demographic and pregnancy characteristics, they achieved 57% and 67% accuracy with logistic regression and SVM, respectively.
The evidence shows an abundance of literature addressing PTB. Studies in developing countries have mostly sought to determine the risk factors of PTB without assessing the predictive power of the models. Further, studies on the prediction of PTB have mostly been carried out in developed countries using EHR. Compared with paper-based health records, EHR have higher rates of completeness and are easier to access when needed [71], [72]. While some studies have addressed PTB prediction in developing countries, the etiology of PTB depends on the geographical and demographic features of the population studied [73]. Therefore, the results of studies in developed countries may not apply fully to the situation in developing countries. This study, therefore, seeks to address this gap in the literature by using data extracted from paper-based maternal health records in Uganda to train machine learning methods to predict PTB.