Machine Learning Prediction of Preterm Birth: An Analysis of Facility-Based Paper Health Records in Uganda

Literature Review

Risk Factors of PTB

PTB has been associated with many risk factors, including inadequate antenatal care [34]–[37], antepartum hemorrhage [34], [38], [39], preeclampsia [34], [40], nulliparity [35], [37], [41], short interpregnancy interval [35], [42], [43], maternal age < 20 years [35], [36], [44]–[46], advanced maternal age (≥ 35 years) [45], [46], [49], single status of mothers [35], [45], history of PTB [47]–[49], history of abortion [36], [45], [50], pre-pregnancy hypertension [49], [51], history of fetal demise [45], underweight mothers [46], [52], [53], first antenatal visit after the first trimester [37], [54], lower education of mothers [41], smoking in pregnancy [55], prior cesarean delivery [42], [56], and pre-pregnancy diabetes [51].

Related Works

Many studies have been done on preterm birth in developing countries [34], [35], [57]–[61]. These studies sought to determine risk factors of PTB, mainly using traditional statistical analysis. For instance, Bater et al. [35] studied the predictors of low birth weight (LBW) and PTB in rural Uganda. They derived household, maternal, and infant characteristics data from a prospective birth cohort study conducted from 2014 to 2016 in 12 districts. Stepwise logistic regression was applied to data from 3,841 women (744 with PTB) to determine predictors of PTB. Ayebare, Ntuyo, Malande, and Nalwadda [34] studied maternal, reproductive, and obstetric factors associated with preterm births in Kampala’s National Referral Hospital. They also used logistic regression, but on a smaller sample of 296 women (99 with PTB). Other studies have similarly addressed determinants of PTB in developing countries without assessing the predictive power of the models [57]–[61].

In addition to statistical approaches, other studies have assessed the predictive power of machine learning methods for PTB. However, many of these have been done in developed countries in the context of Electronic Health Records (EHR) [62]–[67]. For instance, Mercer et al. [62] developed a risk score-based system to predict PTB. They identified a number of risk factors, including fetal fibronectin, short cervix, and history of preterm birth, and used a sample of 2,929 women in the United States (US) to train a multivariate logistic regression. The model yielded a sensitivity of 24.2% (18.2%) and a specificity of 28.6% (33.3%) for nulliparous (multiparous) women. Using the same dataset, Vovsha et al. [63] compared Support Vector Machine (SVM), logistic regression, and lasso regression with different model selection approaches, along with a model based on decision rules, for the prediction of PTB. With linear SVM yielding 47% sensitivity and 57% specificity for predicting PTB at 28 weeks, they demonstrated an improvement over the sensitivity and specificity obtained by Mercer et al. Goodwin et al. [64] used data mining techniques and identified seven demographic variables that predict PTB. They used an ethnically diverse sample of 19,970 women in the US and obtained an area under the receiver operating characteristic curve (AUC) of 0.72. Weber et al. [65] utilized administrative data and extracted records for singleton pregnancies among nulliparous women in California from 2007 to 2011. The prediction of PTB was performed using K-nearest neighbors (KNN), lasso regression, and random forests (RF), with demographic, maternal, and residency characteristics as predictors. The model yielded a low AUC of 0.67. Koivu & Sairanen [66] used logistic regression (LR), artificial neural networks (ANN), gradient boosting decision trees, and ensemble models to construct individual classifiers to predict early stillbirth, late stillbirth, and preterm birth pregnancies. They used pregnancy data provided by the Centers for Disease Control and Prevention (CDC), National Center of Health Statistics via their National Vital Statistics System in the US. They achieved a 0.64 AUC for PTB under the best performing model. Sun et al. [67] extracted data from EHR in a Beijing hospital. They used data based on physical examination, blood test, urine test strip, and gynecological examination. They compared six algorithms in the prediction of PTB: Naive Bayesian (NBM), SVM, RF, ANN, K-means, and logistic regression. A total of 9,550 pregnant women were included in the study, of whom 4,775 had PTB. At 81.6%, the accuracy of the RF model was the highest among the algorithms compared.

Studies on PTB using machine learning have also been carried out in developing and semi-developed countries [68]–[70]. For example, Prema and Pushpalatha [68] used data from local hospitals of Mysuru, India, and compared SVM with linear and nonlinear kernels against logistic regression. The risk factors they considered included age, number of times pregnant, diabetes, obesity, and hypertension. In the balanced dataset, SVM with a linear kernel yielded an accuracy of 76% (sensitivity 84% and specificity 73%) and logistic regression yielded an accuracy of 75% (sensitivity 70% and specificity 80%). Raja, Mukherjee, and Sarkar [69] used data from community health centers in Jharkhand, India. They used a feature selection approach based on the notion of entropy and compared the prediction accuracy of three classifiers, namely decision tree (DT), logistic regression, and SVM, for PTB prediction. The SVM classifier yielded an accuracy of 90.9%, the highest predictive accuracy among the studies reviewed here. Batoul et al. [70] compared SVM and logistic regression for predicting and classifying factors affecting PTB in women from Tehran, Iran. Their dataset included demographic and pregnancy characteristics, and they achieved 57% and 67% accuracy with logistic regression and SVM, respectively.

The evidence shows an abundance of literature addressing PTB. Studies in developing countries have mostly sought to determine the risk factors of PTB without assessing the predictive power of the models. Further, studies on the prediction of PTB have mostly been carried out in developed countries using EHR. Unlike paper-based health records, EHR have higher rates of completeness and are easier to access when needed [71], [72]. While some studies have addressed PTB prediction in developing countries, the etiology of PTB depends on the geographical and demographic features of the population studied [73]. Therefore, the results of studies in developed countries may not fully apply to the situation in developing countries. This study, therefore, seeks to address this gap in the literature by using data extracted from paper-based maternal health records in Uganda to train machine learning methods to predict PTB.

Materials And Methods

Study Design, Site and Population

This study utilized a facility-based retrospective case-control approach based on administrative records of women from Kawempe National Referral Hospital. The Hospital deals mainly in Maternal Health Services (MHS), including Antenatal Care (ANC), intrapartum care, and postnatal care for both mothers and newborns. On average, about 2,000 women give birth at the Hospital every month. The Hospital provides free ANC and delivery services to any pregnant woman who seeks the services; further, it accepts referrals from all parts of the country. For the purpose of this study, only records of women who had given birth at the Hospital in the period January 2017 to January 2021, and who had at least one ANC visit, were considered. Figure 1 displays the workflow adopted in this study.

Inclusion Criteria

We included only live preterm and term singleton births. Women who delivered before 37 completed weeks of gestation were considered cases. Women who delivered at term (≥ 37 weeks of gestation) formed the controls. Only records with mostly complete ANC cards were captured because the cards held most of the required data.

Exclusion Criteria

Extreme preterm births (< 28 weeks), post-term births (≥ 42 weeks), stillbirths, and multiple pregnancies were excluded. Records with missing or mostly incomplete ANC cards were not considered.

Sampling Procedure

Starting with the most recent records, each maternal file was assessed for PTB. Files with PTB were taken as the cases. We used a case-control ratio of 1:1: for every file with PTB, a file without PTB from the same day was selected as a control. The process was repeated until all files in the time interval considered were exhausted. A total of 1,540 records were captured. Of these, 770 women delivered prematurely, while 770 women gave birth at full term.

Data Management and Quality Control

An online data capturing tool was developed using Open Data Kit (ODK) to capture data from maternal records, which contain ANC Cards and Maternal Delivery Notes (MDN). Six Research Assistants (RAs) with medical backgrounds and experience in clinical research worked on data extraction. Prior to the actual data collection, the RAs were trained on the tool and on the selection of records. Pre-testing was done to ensure adequacy of the tool, after which data collection began. The Principal Investigator (PI) worked closely with the RAs to ensure reliable data collection procedures. Additionally, the PI regularly checked whether extracted data matched the data in the maternal files. Variables with missingness greater than 90% were dropped at the pretest stage.

Variables Adopted in the Study

The maternal records include the following: (i) an ANC section, which captures baseline data on the women’s socio-demographic characteristics, chronic illnesses, surgical history, and gynecological and obstetric history; (ii) delivery notes, which capture aspects such as type of delivery, weeks of gestation at delivery, morbidities developed, and care given; and (iii) infant notes, which capture the baby’s weight, Apgar score, and anomalies.

The following variables were captured from the respective sections: (a) ANC section: district, age, marital status, religion, occupation, education level, pre-delivery weight, HIV sero-status, STD, hypertension, gestational diabetes, gravidity, parity, pre-partum anemia, previous caesarean section (c/s), previous stillbirth, previous PTB, previous abortion, previous Early Neonatal Death (ENND), birth spacing, number of ANC visits, gestational age at first ANC; (b) Delivery notes: multiple pregnancy, preeclampsia, antepartum hemorrhage, mode of delivery, gestational age at birth, maternal death; (c) Infant notes: birth weight, Apgar score, ENND. The outcome variable was incidence of PTB. This was deduced from the gestational age at birth as indicated in the delivery notes. Births at less than 37 completed weeks of gestation were classified as preterm births.

Data Analysis

Data was exported from ODK to RStudio for further cleaning and subsequent analysis. Missing data was imputed using Random Forest imputation. Imputation was done because analysis of primary care data using only complete cases reduces predictive power and produces biased estimates, leading to invalid conclusions [74]–[76]. Random Forest imputation was chosen for its ability to impute missing data in multiple categorical variables simultaneously [76], [77]. Next, a descriptive summary of the maternal records was produced using frequency distributions, to describe the mothers included in the study. Simple Logistic Regression was used to produce crude odds ratios for each independent variable. Importance analysis of the RF model was then done to find the variables with the greatest effect on PTB. Further, Fisher’s exact test was used to check whether PTB had a significant effect on mode of delivery, maternal death, low birth weight, low Apgar score, and early neonatal death. Thereafter, the complete dataset was split into a training set (75%) and a validation set (25%). The following classification methods were applied to the training set: Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB).
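The 75%/25% split can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study; the function name `train_validation_split`, the fixed seed, and the use of integer record identifiers are assumptions made only for this example.

```python
import random

def train_validation_split(records, train_fraction=0.75, seed=42):
    """Shuffle the records and cut them into training and validation sets."""
    rng = random.Random(seed)          # fixed seed (assumed) for reproducibility
    shuffled = records[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical identifiers standing in for the 1,540 maternal records.
record_ids = list(range(1540))
train, validation = train_validation_split(record_ids)
print(len(train), len(validation))
```

With 1,540 records, a 75%/25% split leaves 1,155 records for training and 385 for validation.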

Machine Learning Algorithms

Logistic Regression (LR) – It is used to determine the relationship between a categorical dependent variable and one or more independent variables. LR is used when the dependent variable is binary in nature, i.e., it has only two categories, for instance 0 and 1, yes and no, or true and false. If the dependent variable takes on the values 1 and 0, LR takes the following form:

$$P\left(Y=1\right)= \frac{1}{1+ {e}^{-x\beta }}$$

where x is the set of independent variables and β are their corresponding regression coefficients. LR models the probability of one event (out of two possible events) by using a logistic function to map the result of a linear combination of the predictors to a value between 0 and 1. The log-odds (the logarithm of the odds) for the event is a linear combination of one or more independent variables [78]. Simple LR was used to produce crude odds ratios for each independent variable. Only variables with p < 0.1 in simple LR were used in the final LR model.
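As a small worked illustration of the logistic form and the log-odds relationship, the sketch below evaluates the formula for one observation. It is written in Python rather than the R used in the study, and the predictor values and coefficients are hypothetical, not estimates from this dataset.

```python
import math

def logistic_probability(x, beta):
    """P(Y = 1) = 1 / (1 + exp(-x.beta)) for one observation."""
    linear = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-linear))

def log_odds(p):
    """Logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical values: an intercept term followed by two binary predictors.
x = [1.0, 1.0, 0.0]
beta = [-0.5, 0.8, 0.3]

p = logistic_probability(x, beta)
# The log-odds recover the linear combination x.beta (here 0.3).
print(round(p, 4), round(log_odds(p), 4))
# Exponentiating a coefficient gives the odds ratio for that predictor.
odds_ratio = math.exp(beta[1])
```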

Decision Tree (DT) – A Decision Tree (DT) is made up of a root node, internal nodes, and leaf nodes. It is built by recursive binary splitting. The classification error rate (CE) is used as the criterion for making the binary splits. For a given region, this is the fraction of the training observations in that region that do not belong to the most common class. The CE is measured for each independent variable, and the variable that yields the smallest CE is selected as the root node. CE is measured as follows:

$$CE=1-\max_{k}\left({\widehat{p}}_{mk}\right)$$

where ${\widehat{p}}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class. Subsequent nodes are again determined using CE. However, if a node itself has a lower CE than a subsequent node, it is not split further and becomes a leaf node. To prevent overfitting, the fully grown tree is pruned. DTs have several advantages, including their ability to handle both categorical and continuous data, their simplicity, and their comprehensibility. However, they are not very flexible and tend to perform poorly on test data; that is, they suffer from inaccuracy [79]–[82]. DT classification was done using the rpart package in RStudio.
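The classification error of a region can be computed directly from its class labels. The Python sketch below is illustrative only: the labels are hypothetical, and weighting each child region by its size when scoring a candidate split is an assumption, since the text does not state how the errors of the two regions are combined.

```python
from collections import Counter

def classification_error(labels):
    """CE = 1 - (proportion of the most common class in the region)."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)

def split_error(left_labels, right_labels):
    """Size-weighted CE of the two regions produced by a binary split (assumed)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * classification_error(left_labels) \
        + (len(right_labels) / n) * classification_error(right_labels)

# Hypothetical split on one predictor.
left = ["PTB"] * 30 + ["Term"] * 10
right = ["PTB"] * 10 + ["Term"] * 30
print(classification_error(left), split_error(left, right))
```

The candidate split with the smallest error would be chosen at each step.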

Random Forest (RF) – Random Forests (RF) are built from decision trees but overcome their inaccuracy. Based on the bagging idea of ensemble learning, they integrate multiple DTs. From the original dataset, a new bootstrapped dataset is created and used to build a DT using only a random subset of the independent variables at each step. The observations left out of the bootstrapped dataset form the Out-Of-Bag (OOB) dataset and are used to measure the accuracy of the resulting trees. This is repeated hundreds of times to yield a large variety of trees. It is this variety that allows a random forest to overcome the drawback of a single decision tree. To make a prediction for $x_0$, the data for $x_0$ is run through each tree. The predicted classes from all trees are tallied, and the class with the most votes is selected as the predicted y for $x_0$. Bootstrapping the data and aggregating the trees' votes in this way is known as bagging.

To measure accuracy, each OOB observation is run through all the trees that were built without it. The class with the most votes is selected and then compared to the actual class. The accuracy is measured by the proportion of OOB observations that were correctly classified by the random forest. The proportion of OOB observations that were misclassified forms the OOB error. This process can also be used to determine the number of independent variables to consider at each step when building the trees [83]. RF classification was done using the randomForest package in RStudio.
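The two building blocks of bagging, bootstrap sampling with its out-of-bag remainder and majority voting, can be sketched in a few lines. This Python example is an illustration under simplifying assumptions, not the randomForest implementation; the helper names and the example votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Sample n row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed (assumed) for reproducibility
in_bag, out_of_bag = bootstrap_indices(10, rng)

# Hypothetical predictions from five trees for one observation.
votes = ["PTB", "Term", "PTB", "PTB", "Term"]
print(majority_vote(votes))
```

Each tree would be trained on the rows in `in_bag`, and the rows in `out_of_bag` would be used to score it.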

Random forests can be used to rank the importance of variables in a regression or classification problem. This is done in the following steps:

1. The number of votes for the correct class in the OOB data is computed for each tree in the RF.

2. The values of a predictor (say predictor m) are shuffled in the OOB data, and the number of votes for the correct class is computed again.

3. The number of votes for the correct class with predictor m shuffled is subtracted from the number of votes for the correct class in the original OOB data.

4. The average of this difference over all the trees in the RF forms the raw importance score for predictor m.

5. The variables with the highest scores are ranked as the most important.

The varImp package was used to evaluate variable importance in RStudio.
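The permutation steps listed above can be sketched as follows. This Python example is a simplified illustration, not the R code used in the study: each "tree" is represented by a hypothetical prediction function together with its own OOB rows and labels, and the single stump and four OOB rows are invented for the example.

```python
import random

def correct_votes(predict, rows, labels):
    """Number of OOB rows for which the tree predicts the true class."""
    return sum(1 for row, y in zip(rows, labels) if predict(row) == y)

def raw_importance(trees, predictor, rng):
    """Average drop in correct OOB votes after shuffling one predictor."""
    drops = []
    for predict, rows, labels in trees:
        baseline = correct_votes(predict, rows, labels)
        values = [row[predictor] for row in rows]
        rng.shuffle(values)  # break the link between predictor and outcome
        permuted = [dict(row, **{predictor: v}) for row, v in zip(rows, values)]
        drops.append(baseline - correct_votes(predict, permuted, labels))
    return sum(drops) / len(drops)

# Hypothetical one-split tree that looks only at the number of ANC visits.
def anc_stump(row):
    return "PTB" if row["anc_visits"] < 4 else "Term"

oob_rows = [
    {"anc_visits": 1, "district": "A"},
    {"anc_visits": 2, "district": "B"},
    {"anc_visits": 5, "district": "A"},
    {"anc_visits": 6, "district": "B"},
]
oob_labels = ["PTB", "PTB", "Term", "Term"]
trees = [(anc_stump, oob_rows, oob_labels)]

rng = random.Random(1)
importance_anc = raw_importance(trees, "anc_visits", rng)
importance_district = raw_importance(trees, "district", rng)
```

Because the stump ignores `district`, shuffling it cannot change any prediction, so its raw importance is zero, whereas shuffling `anc_visits` can only reduce the number of correct votes.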

K-Nearest Neighbors (KNN) – This is one of the simplest classifiers to understand, and it approximates the gold-standard Bayes classifier. It works in the following steps:

For a positive integer K and test observation ${x}_{0}$:

i) It ‘looks’ for K points in the training data that are nearest to ${x}_{0}$. These are denoted by ${N}_{0}$.

ii) Next, it estimates the conditional probability for class j:

$$\Pr\left(Y=j|X={x}_{0}\right)=\frac{1}{K}\sum _{i\in {N}_{0}}I({y}_{i}=j)$$

iii) Then, it applies the Bayes Rule; it classifies ${x}_{0}$ to the class with the largest probability.

The advantage of the KNN classifier is that it does not need to know the true distribution of Y given X, yet it can still yield results close to those of the optimal Bayes classifier [84], [85]. KNN classification was done using the class package in RStudio.
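The three KNN steps can be sketched as below. This Python example is illustrative rather than the class package used in the study; the use of Euclidean distance, the two numeric features, and the training points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_class_probabilities(train_x, train_y, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the K nearest points in class j."""
    by_distance = sorted((math.dist(x, x0), y) for x, y in zip(train_x, train_y))
    neighbours = [y for _, y in by_distance[:k]]
    counts = Counter(neighbours)
    return {label: count / k for label, count in counts.items()}

def knn_predict(train_x, train_y, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probabilities = knn_class_probabilities(train_x, train_y, x0, k)
    return max(probabilities, key=probabilities.get)

# Hypothetical training data with two numeric features.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ["Term", "Term", "Term", "PTB", "PTB", "PTB"]

probabilities = knn_class_probabilities(train_x, train_y, (2, 2), k=3)
prediction = knn_predict(train_x, train_y, (2, 2), k=3)
print(probabilities, prediction)
```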

Support Vector Machine (SVM) – Support Vector Machines (SVM) are an extension of support vector classifiers, which are in turn an extension of maximal margin classifiers. A maximal margin classifier uses the threshold that gives the largest distance between the threshold itself and the closest points of each class. However, this classifier is highly sensitive to outliers; outliers can “pull” the threshold in one direction, which leads to poor predictive performance.

The solution to this problem is a soft margin, which allows some level of misclassification and hence reduces the sensitivity to outliers. Cross-validation is used to determine how many observations and misclassifications to allow inside the soft margin. The resulting classifier is known as a support vector classifier. It also has a drawback: it fails when there is a lot of overlap between the classes, because wherever the threshold is placed, there will still be many misclassifications.

The solution to overlapping data is to move the data to a higher dimension using a kernel and then place a threshold through it that separates the data into two groups. This is known as a support vector machine [86], [87]. SVM classification was done using the e1071 package in RStudio.
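The idea of moving overlapping data to a higher dimension can be illustrated with a toy example. In the Python sketch below, an explicit squared feature stands in for a kernel, and a simple threshold search stands in for the margin-based fitting of a real SVM; both substitutions, the function name, and the data are assumptions made purely for illustration.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy achievable by any single threshold on one feature."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = 0.0
    for threshold in candidates:
        for above in (0, 1):  # class assigned to values above the threshold
            correct = sum(
                1 for v, y in zip(values, labels)
                if (above if v > threshold else 1 - above) == y
            )
            best = max(best, correct / len(values))
    return best

# Hypothetical one-dimensional data: class 1 lies at both extremes.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [1, 1, 0, 0, 0, 1, 1]

accuracy_original = best_threshold_accuracy(x, y)
accuracy_squared = best_threshold_accuracy([v * v for v in x], y)
print(accuracy_original, accuracy_squared)
```

No single threshold separates the classes on the original feature, but after squaring, one threshold separates them perfectly.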

Naïve Bayes (NB) – Naïve Bayes (NB) is a probabilistic classifier that uses Bayes' theorem for classification tasks [88], [89]. Bayes' theorem is:

$$P\left(B|A\right)= \frac{P\left(A|B\right)P\left(B\right)}{P\left(A\right)}$$

where P(B|A) is the posterior probability of class B given predictor A, P(A|B) is the likelihood of predictor A given class B, P(A) is the prior probability of predictor A, and P(B) is the prior probability of class B. For m classes, c₁, c₂, …, c_m, this classifier assigns a test observation x₀ to the class with the largest posterior probability, i.e., x₀ belongs to class c_i if and only if:

$$P\left({c}_{i}|{x}_{0}\right)>P\left({c}_{j}|{x}_{0}\right), j\ne i$$

NB classification was done using the naivebayes package in RStudio.
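A minimal categorical Naïve Bayes classifier can be sketched as follows. This Python example is not the naivebayes package used in the study: the add-one (Laplace) smoothing, the assumption that every predictor has two possible values, and the six training rows are all choices made only for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies of each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, predictor index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def posterior(model, row, n_values=2):
    """Posterior class probabilities, treating predictors as independent."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for j, value in enumerate(row):
            # Likelihood with add-one smoothing (an assumption of this sketch).
            score *= (value_counts[(label, j)][value] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical rows: (previous PTB, preeclampsia).
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("no", "no"), ("no", "no")]
labels = ["PTB", "PTB", "PTB", "Term", "Term", "Term"]

model = train_naive_bayes(rows, labels)
probabilities = posterior(model, ("yes", "yes"))
predicted = max(probabilities, key=probabilities.get)
print(probabilities, predicted)
```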

Performance Evaluation of Methods to Predict PTB

The validation set was used to determine final performance in terms of accuracy, sensitivity, and specificity for the classification of PTB. Accuracy is defined as:

$$Accuracy= \frac{tp+tn}{tp+tn+fn+fp}$$

where tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Sensitivity and specificity are defined as:

$$Sensitivity= \frac{tp}{tp+fn}$$

$$Specificity= \frac{tn}{fp+tn}$$

In RStudio, “Yes” for PTB was selected as the positive response to ensure that sensitivity reflects accurate prediction of PTB.
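The three measures follow directly from the four cells of a confusion matrix. The Python sketch below uses hypothetical counts, not counts from this study, with PTB treated as the positive class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of PTB cases correctly flagged
        "specificity": tn / (fp + tn),   # share of term births correctly cleared
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=60, tn=70, fp=30, fn=40)
print(metrics)
```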

Ethical Considerations

Ethical approval for conducting the study was obtained from: (i) Uganda National Council of Science and Technology (UNCST) with Registration Number Ref: HS977ES, and (ii) Mulago Hospital Research and Ethics Committee (REC/IRB). Administrative clearance to access the administrative records was obtained from Kawempe National Referral Hospital. To ensure confidentiality of the women, the study was conducted according to the Helsinki Declaration (1975; 2008) guidelines for medical research: names of mothers were not captured, and only their file numbers were recorded. At data analysis, no individual records are reported; only aggregate findings are used. Archived data will be password-protected so that it is accessed for the sole purpose of this study. Data was extracted from the records after discharge or death, and hence study inclusion had no effect on treatment. Because of the approach used in data collection, individual informed consent was not required.

Results

This section presents a description of the mothers in the study, the importance of the independent variables used, the performance of methods used to predict PTB, and the effects of PTB on mother and newborn.

Characteristics of Mothers

A total of 1,540 mothers (PTB: 770, control: 770) were included in our study. The numbers of women aged < 20 years and of women aged ≥ 35 years were both higher in the PTB group than in the control group, whereas the number of women aged between 20 and 34 years was lower in the PTB group than in the control group (Table 1).

There were more cases of grand multigravida (women who have had ≥ 5 births (live or stillborn) at ≥ 20 weeks of gestation) in the PTB group than in the control group. Histories of stillbirth, early neonatal death, and PTB were more prevalent in the PTB group than in the control group (Table 2).

More women in the control group had attended ANC at least four times during their pregnancies than women with PTB. More women who experienced PTB attended ANC only once or twice than women in the control group. There were more cases of women with pre-delivery weight ≤ 55 kg in the PTB group than in the control group. There were more cases of chronic hypertension, preeclampsia, and antepartum hemorrhage in women with PTB than in those without. The median gestational age at birth of women who experienced PTB was 35 weeks, whereas the median gestational age for women who delivered at term was 38 weeks (Table 3).

The variables with significant crude odds ratios are shown in bold in Table 1, Table 2, and Table 3. These include age group, gravidity, history of stillbirth, history of early neonatal death, history of PTB, number of ANC visits, pre-delivery weight, chronic hypertension, preeclampsia, and antepartum hemorrhage. These were the variables used in the final LR classification model. Parity was not used because it was highly collinear with gravidity.

Variable Importance

The importance analysis of the RF model found that the top 10 most important variables (mean decrease in accuracy (MDA) > 10 in the RF model) were number of ANC visits, preeclampsia, age group, weight, gravidity, birth spacing, previous ENND, previous PTB, occupation, and district. The variable with the biggest effect on PTB was number of ANC visits (Figure 2).

Prediction of PTB

Table 4 shows that for classification of PTB, the SVM model yielded the highest accuracy at 0.64, whereas the KNN model yielded the lowest accuracy at 0.58. Even though DT gave the highest sensitivity (0.66), its specificity was the lowest at 0.55. KNN had the worst sensitivity at 0.42, followed by RF at 0.59. SVM yielded a sensitivity and specificity that were both above 0.63.

Table 4: Performance Evaluation of Methods to Predict Preterm Birth

Method	Accuracy	Sensitivity	Specificity
LR	0.6052	0.6207	0.5924
DT	0.6026	0.6609	0.5545
RF	0.6201	0.5882	0.6453
SVM	0.6364	0.6437	0.6303
KNN	0.5844	0.4195	0.7204
NB	0.6104	0.6092	0.6114

Effect of PTB on Mother and Newborn

Table 5 shows that PTB had a significant effect on mode of delivery and LBW. Women who experienced PTB were less likely to deliver by cesarean section than women who delivered at term (p < 0.05). On the other hand, babies born prematurely were more likely to have low birth weight (< 2.5 kg) than babies born at term (p < 0.05).
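For readers unfamiliar with Fisher's exact test, the sketch below computes a two-sided p-value for a 2×2 table from the hypergeometric distribution, using the maternal death counts in Table 5 as input. It is a Python illustration under stated assumptions, not the R routine used in the study: the two-sided rule used here sums the probabilities of all tables no more likely than the observed one, and the result may differ slightly from the reported value depending on the software and rounding.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows are outcome yes/no; columns are the two groups.
    """
    with_outcome = a + b
    first_group = a + c
    n = a + b + c + d
    denominator = comb(n, first_group)

    def probability(k):
        # Hypergeometric probability of k outcome cases in the first group.
        return comb(with_outcome, k) * comb(n - with_outcome, first_group - k) / denominator

    low = max(0, first_group - (n - with_outcome))
    high = min(with_outcome, first_group)
    observed = probability(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(p for p in (probability(k) for k in range(low, high + 1))
               if p <= observed * (1 + 1e-7))

# Maternal deaths from Table 5: 3 of 770 preterm, 1 of 770 control.
p_value = fisher_exact_two_sided(3, 1, 767, 769)
print(round(p_value, 4))
```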

Table 5: Outcomes of Preterm Delivery on Mother and Newborn

Variable	Category	Preterm n (%)	Control n (%)	Fisher's exact p-value
Mode of Delivery	Normal	485 (63.0)	416 (54.0)	0.00
	Cesarean-Section	285 (37.0)	354 (46.0)	
Maternal Death	Yes	3 (0.4)	1 (0.1)	0.63
	No	767 (99.6)	769 (99.9)	
Low Birth Weight	Yes	495 (64.3)	127 (16.5)	0.00
	No	275 (35.7)	643 (83.5)	
Low Apgar Score	Yes	185 (24.0)	216 (28.1)	0.08
	No	585 (76.0)	554 (71.9)	
ENND	Yes	19 (2.5)	9 (1.2)	
	No	751 (97.5)	761 (98.8)	