Sample Design and Sample Size
This study used the latest National Family Health Survey (NFHS-5) dataset, a nationally representative survey of India covering 707 districts, 28 states, and 8 union territories. The survey provides information on sexual behavior; husband’s background and women’s work; HIV/AIDS knowledge, attitudes, and behavior; domestic violence; and health and family welfare, along with health indicators such as fertility, infant and child mortality, and maternal and child health. The total sample of approximately 610,000 households was sized to produce reliable indicator estimates for each district. The rural sample was selected through a two-stage design, with villages as the primary sampling units (PSUs) selected with probability proportional to size at the first stage, followed by a random selection of 22 households within each PSU at the second stage. Urban areas also used a two-stage design, with Census Enumeration Blocks (CEBs) selected at the first stage and 22 households selected at random within each CEB at the second stage. In both urban and rural areas, second-stage households were chosen after a systematic mapping and household listing operation in the selected first-stage units. A total of 636,699 households were interviewed; within these households, 724,115 eligible women aged 15–49 years and 101,839 eligible men aged 15–54 years were interviewed [17].
Data Source
This study used the birth recode dataset file from the NFHS-5 survey. This dataset contains records of all children born to surveyed women; essentially, it is the complete birth history of each interviewed woman, including pregnancy and postpartum information, immunizations, and the health status of children born in the last five years, together with data on the mothers of these children. The file can be used to calculate health indicators, birth rates, and death rates. This study used the child’s date of birth and age at death to calculate under-five mortality in India. The dataset contains various predictors (features), and the target variable is the child’s survival status before completing five years of age.
Analytic Strategies for Under-five Mortality Data
We used several analytic approaches to meet the objectives of the study. First, data preprocessing was performed to prepare the dataset for the statistical and machine learning analyses. Second, descriptive statistics were used to summarize the baseline structure of the data, and a multivariate logistic regression model was used to identify important factors (features) associated with under-five mortality. Third, several supervised machine learning models were fitted to the dataset using all factors to identify the best-performing model. Feature selection techniques were then used to identify the most important predictors of under-five mortality. Finally, after screening the important features based on logistic regression and the ML feature selection techniques, the machine learning models were refitted using only the important features to find the best predictive model for under-five mortality.
Preprocessing of the data
In this study, we used a modified analytical framework proposed by Mosley and Chen for analyzing child survival in developing nations [18], as applied in previous literature [19], to identify relevant independent features (variables). A total of 55 independent variables were initially selected and categorized into socioeconomic, biodemographic, health-related, and environmental factors. Data preprocessing is essential for building reliable models; here it included data filtering, data cleaning, quality checks, handling of missing values, and removal of outliers and duplicate variables. After preprocessing, 34 of the 55 independent variables were retained for the analysis of U5MR in India: 11 biodemographic, 8 health-related, 4 environmental, and 12 socioeconomic variables [Fig. 2]. Categorical values were converted to numeric codes, and one-hot encoding was applied to all predictor (feature) variables.
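To illustrate the encoding step, a minimal stdlib-only Python sketch of one-hot encoding (the `one_hot` helper and the example values are hypothetical; in practice this is typically done with pandas `get_dummies` or scikit-learn's `OneHotEncoder`):

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows), where each row is a 0/1 vector
    aligned with the sorted category list.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical values for a 'place of residence' predictor
cats, encoded = one_hot(["urban", "rural", "rural"])
```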
Balancing the class (target) variable is necessary for building better predictive models. To balance it, we used the synthetic minority oversampling technique (SMOTE), which generates synthetic minority-class samples and, as commonly applied, is combined with undersampling of the majority class to offset the weaknesses of either method alone. It is a widely used approach for imbalanced data [20]. The number of dead children is very small compared with the number of living children in the dataset; after applying SMOTE, the class variable was balanced to a 50:50 ratio from the initial 96:4 ratio.
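The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as follows. This is an illustrative stdlib-only toy, not the implementation used in the analysis; packages such as imbalanced-learn provide a production `SMOTE` class:

```python
import random

def smote_sample(minority, k=1, rng=None):
    """Generate one synthetic minority point by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbors (the core idea of SMOTE)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest neighbors by squared Euclidean distance, excluding base
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation fraction in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

# Hypothetical 2-D minority-class points
minority = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1)]
synthetic = smote_sample(minority)
```

The synthetic point always lies on the segment between two existing minority points, so it stays inside the minority region of feature space.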
Predictor variables
After removing duplicates and handling missing values, the predictor (independent) variables considered in this research included the mother’s age, mother’s current marital status, age at first birth, age at marriage, sex of the household head, child’s sex, birth order, number of living children, child’s size at birth, births in the last five years, desire for more children, vaccinations, contraceptive use, place of delivery, mode of delivery, antenatal care, current breastfeeding, anemia level, postnatal care, source of drinking water, toilet facilities, type of cooking fuel, electricity availability, household education, household working status, mother’s education, mother’s working status, father’s education, place of residence, region of residence, household size, wealth index, religion, time in days, and caste.
Machine learning approach framework of U5MR prediction
In this study, we first used the modified analytical approach developed by Mosley and Chen for analyzing child survival in developing nations to identify important features. Only these features were retained from the dataset, and data preprocessing was then applied to produce the final dataset for further computation. The algorithmic computations and machine learning frameworks are depicted in Fig. 3. The Statistical Package for the Social Sciences (SPSS), version 27.0, was used for data preprocessing, descriptive statistics, and multivariate logistic regression analysis. All machine learning algorithms were implemented in Python 3.8 using Jupyter Notebook.
Model Building
In this step, the full dataset (100%) was divided into 70% training data and 30% test data. The 70% training partition was used with tenfold cross-validation to select the best-performing model, and the remaining 30% was used to assess model performance. The machine learning models fitted to the 70% training partition are described below.
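The 70/30 partition can be sketched as follows (a stdlib-only illustration with a fixed random seed; the `train_test_split` helper here is hypothetical and mirrors what scikit-learn's function of the same name does):

```python
import random

def train_test_split(rows, test_frac=0.30, seed=42):
    """Shuffle and split rows into train/test partitions (70/30 here)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data)
```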
Decision Tree (DT)
The decision tree algorithm is widely used in the healthcare domain for label classification and is a well-established classifier for estimating conditional probabilities [21–22]. Decision trees can be grown as binary trees for classification problems, with each terminal node representing a particular class label [23]. The classifier can also be used to generate decision rules that reveal the underlying patterns in the data.
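The basic operation a decision tree repeats, choosing the split that minimizes impurity, can be illustrated with a Gini-based decision stump on a single feature (toy data; illustrative only):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Find the threshold on one feature minimizing the weighted Gini
    impurity of the two child nodes (a single decision-tree split)."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: the split at x <= 2 separates the classes perfectly
threshold, impurity = best_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```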
Random Forest (RF)
We used random forest (RF) to classify the class labels and to identify the best-performing classifier on the dataset. RF creates bootstrapped datasets of the same size as the original data and grows a decision tree on each bootstrap sample, with each tree receiving its own inputs. A new input is passed to every tree, and the class label receiving the most votes across all trees is assigned to that input. Because bootstrap sampling draws random samples with replacement, some observations are left out of each tree by chance [24].
The bagged prediction is given by Equation (1), where \(B\) denotes the number of bagging repetitions (bootstrap samples):
$$\widehat{f}=\frac{1}{B}\sum _{b=1}^{B}{f}_{b}\left({x}^{{\prime}}\right)$$
Here, \({f}_{b}\) is the classification tree trained on the \(b\)-th bootstrap sample; for classification, this aggregation corresponds to a majority vote across the trees.
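The vote-aggregation step can be sketched as follows, with three stand-in "trees" written as simple functions (illustrative only; real trees would be trained on bootstrap resamples of the data):

```python
from collections import Counter

def bagged_predict(trees, x):
    """Aggregate the predictions of B bootstrap-trained classifiers
    by majority vote (the classification form of bagging)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical stand-in "trees" over a 2-feature input
trees = [
    lambda x: "alive" if x[0] > 0.5 else "dead",
    lambda x: "alive" if x[1] > 0.5 else "dead",
    lambda x: "alive",
]
pred = bagged_predict(trees, (0.9, 0.2))  # votes: alive, dead, alive
```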
Naive Bayes
The naive Bayes (NB) classifier identifies determinants by computing the posterior probability of each class from the prior probability and the likelihood. It is widely used in data analytics and often produces strong classification results [25–26].
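The posterior computation at the heart of the classifier is Bayes' rule; a minimal sketch with hypothetical priors and likelihoods:

```python
def posterior(priors, likelihoods):
    """Posterior P(class | evidence) from priors and likelihoods via
    Bayes' rule: P(c | x) is proportional to P(x | c) * P(c),
    normalized over all classes."""
    unnorm = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical two-class example: a rare 'dead' class with a
# feature pattern that is much more likely among deaths
post = posterior({"dead": 0.04, "alive": 0.96},
                 {"dead": 0.50, "alive": 0.10})
```

Even with a strong likelihood ratio in favor of "dead", the very small prior keeps the posterior for "dead" below that of "alive", which is exactly the behavior a prior-weighted classifier should show.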
Neural Network
We used a multilayer perceptron (MLP) for model building. The MLP is a feed-forward artificial neural network (ANN) with multiple layers of nodes, in which each layer is fully connected to the next. The input nodes correspond to the number of attributes. An activation function is applied to the linear combination of the inputs with weights “W” and bias “B”. The sigmoid activation function used in the MLP is
$$f\left({z}_{i}\right)=\frac{1}{1+{e}^{-{z}_{i}}}$$
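The sigmoid activation and the weighted-sum-plus-bias computation of a single node can be sketched as follows (weights and inputs are illustrative):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One MLP node: activation of the weighted input sum plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so the output is 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```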
K-nearest neighbor (KNN)
K-nearest neighbors (KNN) is a method for classification and regression. By measuring the distance between the test data point and every training point, KNN identifies the K training points that most closely resemble the test data. The algorithm then examines how these K nearest neighbors are distributed across the classes, and the class with the highest frequency among them is chosen. For regression, the predicted value is the average of the K selected training points [27].
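A minimal from-scratch KNN classifier using Euclidean distance and majority vote (toy data with hypothetical "dead"/"alive" labels):

```python
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote of its k nearest training
    points, ranked by (squared) Euclidean distance."""
    neighbors = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], test_point)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points as ((features), label) pairs
train = [((0, 0), "dead"), ((0, 1), "dead"),
         ((5, 5), "alive"), ((6, 5), "alive"), ((5, 6), "alive")]
pred = knn_predict(train, (5.5, 5.5), k=3)
```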
Logistic Regression (LR)
Logistic regression is a robust statistical and supervised machine learning model for binary outcomes. It models the relationship between a binary outcome (dependent) variable and one or more independent variables [28].
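The predicted probability under a fitted logistic regression model follows directly from the linear predictor; a sketch with hypothetical coefficients:

```python
import math

def logistic_prob(x, coefs, intercept):
    """P(y = 1 | x) under a logistic regression model:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients for two predictors;
# z = -0.8 + 0.8*1.0 - 0.3*0.0 = 0, so p = 0.5
p = logistic_prob([1.0, 0.0], coefs=[0.8, -0.3], intercept=-0.8)
```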
Model Evaluation
After classification, each trained model was evaluated on the test data to determine which model best predicts under-five mortality, based on measures such as sensitivity, specificity, accuracy, precision, F1-score, and the area under the receiver operating characteristic curve (AUROC). All measures were calculated from the confusion matrix.
The confusion matrix was used to assess the accuracy of the classifiers with respect to the dead and alive child classes in this study [29]. A true positive (TP) is a positive case correctly classified as positive; a true negative (TN) is a negative case correctly classified as negative; a false positive (FP) is a negative case incorrectly classified as positive; and a false negative (FN) is a positive case incorrectly classified as negative. From these counts, metrics such as sensitivity (recall), specificity, accuracy, and precision were used to evaluate the models.
Accuracy – Accuracy is the proportion of correctly classified cases among the total number of cases tested.
$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
Sensitivity – Sensitivity measures the probability of correctly predicting the positive class and is calculated by the given formula.
$$\text{Sensitivity}/\text{Recall}=\frac{TP}{TP+FN}$$
Specificity – Specificity measures the probability of correctly predicting negative outcomes among all true negative outcomes. It can be calculated by the given formula:
$$\text{Specificity}=\frac{TN}{TN+FP}$$
Precision – Precision is the proportion of predicted positive cases that are truly positive; it is also known as the positive predictive value (PPV).
$$\text{Precision}/\text{PPV}=\frac{TP}{TP+FP}$$
Negative predictive value – The negative predictive value is the number of true negatives divided by the total number of cases predicted as negative.
$$\text{Negative predictive value}=\frac{TN}{TN+FN}$$
F1 score – The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two; a higher F1 score indicates a better model. It is calculated as
$$\text{F1 score}=\frac{2TP}{2TP+FN+FP}$$
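The measures above can be computed directly from the confusion-matrix counts, as in the following sketch (the counts are hypothetical):

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluation measures computed from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "f1": 2 * tp / (2 * tp + fn + fp),
    }

# Hypothetical confusion matrix for a balanced test set of 100 cases
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```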
Cohen Kappa
Cohen’s kappa statistic is a robust measure that handles both multiclass and class-imbalance problems well [30]. It compares the agreement between the labels predicted by a model and the actual labels in the data and is calculated using the formula
$$K=\frac{{p}_{0}-{p}_{e}}{1-{p}_{e}}$$
where \({p}_{0}\) is the observed agreement and \({p}_{e}\) is the expected agreement under chance. The statistic indicates how much better the classifier performs than one that guesses randomly according to the frequency of each class.
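A direct implementation of the kappa formula for a binary confusion matrix (hypothetical counts; the marginal frequencies give the expected agreement):

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary confusion matrix:
    K = (p_o - p_e) / (1 - p_e)."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed agreement
    # expected agreement under chance, from the marginal frequencies
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(tp=40, tn=45, fp=5, fn=10)
```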
Receiver operating characteristic (ROC) curve
The area under the ROC curve (AUC-ROC) is a widely used metric for evaluating the performance of classification models; it quantifies a model’s ability to distinguish between classes. The higher the AUC, the better the model. ROC curves graphically represent the relationship, and trade-off, between sensitivity and specificity.
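The AUC also admits a simple rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch (the scores are hypothetical):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Perfect ranking: every positive scores above every negative
area = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```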
K-fold Cross-Validation
Cross-validation is a validation technique that evaluates a model’s ability to generalize to an independent dataset. In k-fold cross-validation, the training dataset is randomly divided into k mutually exclusive subsamples (folds) of equal size. The model is trained and evaluated k times; in each iteration, one of the k folds is held out for testing (validation) and the remaining k-1 folds are used to train the model.
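The fold construction can be sketched as follows (an illustrative stdlib-only version; scikit-learn's `KFold` provides an equivalent production implementation):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal, disjoint folds."""
    indices = list(range(n))
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def train_test_folds(n, k):
    """Yield (train, test) index lists: each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    folds = kfold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, test

splits = list(train_test_folds(10, 5))
```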
Important features (factors) selection method
Feature selection reduces the dimensionality of the data by choosing a subset of k features (variables) from the original pool, with the aim of selecting the subset of input variables that best describes the target variable. Feature selection methods in machine learning include filter, wrapper, and embedded (built-in) approaches. In this study, we used chi-square scores, recursive feature elimination, the extra trees classifier, random forest importance, and a sequential feature selector to identify the most important predictors of under-five mortality.
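As an illustration of the filter approach, the chi-square score for a single binary feature against the binary target can be computed from its 2x2 contingency table (the table values are hypothetical; scikit-learn's `SelectKBest` with the `chi2` score function offers an equivalent production tool):

```python
def chi_square_score(observed):
    """Chi-square statistic for a 2x2 contingency table of a binary
    feature against the binary target; higher scores indicate a
    stronger feature-target association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            score += (obs - expected) ** 2 / expected
    return score

# Hypothetical table: rows = feature present/absent, cols = dead/alive
score = chi_square_score([[30, 20], [10, 40]])
```

A table whose rows are proportional to the column totals (feature independent of the target) scores zero, which is why low-scoring features are dropped by the filter.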
Ethical considerations
This study was based on freely available online data; therefore, ethical approval was not required.