Machine Learning Algorithms for understanding the determinants of under-five Mortality

Background Under-five mortality is a matter of serious concern for child health as well as the social development of any country. This paper aimed to assess the accuracy of machine learning models in predicting under-five mortality and to identify the most significant factors associated with it. Method The data were taken from the National Family Health Survey (NFHS-IV) of Uttar Pradesh. First, we used multivariate logistic regression because of its capability for identifying important factors; then we used machine learning techniques, namely decision tree, random forest, Naïve Bayes, K-nearest neighbor (KNN), logistic regression, support vector machine (SVM), neural network, and ridge classifier. Each model’s performance was assessed using the confusion matrix, accuracy, precision, recall, F1 score, Cohen’s Kappa, and area under the receiver operating characteristic curve (AUROC). Information gain rank was used to find the important factors for under-five mortality. Data analysis was performed using STATA 16.0, Python 3.3, and IBM SPSS Statistics for Windows, Version 27.0. Result The neural network was the best predictive model for under-five mortality when compared with the other predictive models, with accuracy (95.29% to 95.96%), recall (71.51% to 81.03%), precision (36.64% to 51.83%), F1 score (50.46% to 62.68%), Cohen’s Kappa (0.48 to 0.60), AUROC (93.51% to 96.22%), and precision-recall curve (99.52% to 99.73%). Although the neural network was the most efficient model, logistic regression also performed well in predicting under-five mortality, with accuracy (94% to 95%), AUROC (93.4% to 94.8%), and precision-recall curve (99.5% to 99.6%).
The number of living children, survival time, wealth index, child size at birth, birth in the last five years, the total number of children ever born, mother’s education level, and birth order were identified as important factors influencing under-five mortality. Conclusion The neural network model was a better predictive model compared to other machine learning models in predicting under-five mortality, but logistic regression analysis also shows good results. These models may be helpful for the analysis of high-dimensional data for health research.


Introduction
Under-five mortality is the most widely used indicator of the health status of children. It is also an index of the general development of any country. Under-five mortality is the probability of children dying before their fifth birthday. Worldwide, under-five mortality rates are higher in South Asian and Sub-Saharan African countries. In India, the under-five mortality rate fell from 83 deaths per 1000 live births in 2000 to 42 deaths per 1000 live births in 2017 [1]. State-wise reports have found that under-five mortality is highest in Uttar Pradesh, followed by Madhya Pradesh and Chhattisgarh [2], as shown in Fig. 1. Although there has been a significant reduction in under-five deaths in these states, under-five mortality remains a major issue for child health in developing countries like India. Understanding the important factors in explaining childhood mortality is integral to reducing the death rate, but it is not enough.
Machine learning (ML) techniques are now widely used in public health research. Various machine learning models have been used to predict and classify health and biomedical data. These ML models can automatically identify interactions and find non-linear relationships between the target variable and the independent variables. Machine learning approaches can be utilized to discover the exposures related to health outcomes of interest and the potential interactions between those exposures [3]. Various prediction and classification methods, such as regression, logistic regression, principal component analysis (PCA), decision trees, and maximum likelihood methods, have been used to obtain accurate estimates from health data. These approaches could help to obtain early prediction and insight into the important factors for under-five mortality. A study from Ethiopia provides evidence for the use of J48 decision tree and artificial neural network (ANN) techniques to find the causes of child mortality [4]. Another study showed that a machine learning model effectively predicted the under-nutrition status of under-five children in the Ethiopian administrative zones [5]. Other studies assessed the performance of machine learning techniques in predicting the risk of neonatal mortality and morbidity [6,7]. One study used Iterative Dichotomiser 3 (ID3), random forest, and decision tree models to predict the nutritional status of under-five children [8]. Another Indian study predicted the nutrient effects on human health using machine learning techniques [9]. In our literature search, we found no published study that used machine learning models to predict under-five mortality. Past studies have also noted the lack of a generic prediction framework for accurately estimating child mortality rates using machine learning techniques.
There is a need for accurate prediction and classification models to provide highly accurate results and allow health researchers to experiment with various sets of aspects. This study offers an opportunity to assess the accuracy or efficacy of the machine learning models and find the important factors with the help of the information gain method in studying under-five mortality.

Methodology
The methods of this study are explained step by step through a framework for under-five mortality prediction. The data analysis was performed in several steps. First, multivariate logistic regression analysis was performed to find the important factors (p < 0.05); thereafter, machine learning models were applied to the dataset. The machine learning framework is portrayed in Fig. 2. All analyses were conducted using Python 3.3, STATA 16.0, and SPSS-27 software.

Importance of ML methods over traditional methods
A study has shown that a machine learning framework can be used to detect significant risk factors of under-five mortality and that deep learning techniques are superior to logistic regression for the classification of child survival [10]. Machine learning models can accurately predict neonatal, perinatal, and infant mortality [11][12][13]. Several studies done to predict the bankruptcy of banks have shown that intelligent techniques (specifically ANN) seem to work more effectively than statistical techniques. ANN and KNN methods perform more effectively than traditional methods [14].

Dataset
The National Family Health Survey (NFHS-IV) is a large-scale, multi-round, cross-sectional, nationally representative survey conducted in households throughout the Indian states and union territories, and it is one of the most extensive data collection exercises in India. The reports are summarized from district level to state level. The survey collects extensive information on population, health, and nutrition, with an emphasis on women and young children. In this study, we used secondary data from the NFHS-IV survey of Uttar Pradesh, restricted to the target group of under-five children. This dataset has a record for every interviewed woman whose child was born in the five years preceding the survey. It contains information related to the mother's pregnancy, postnatal care, and health. This file was used to obtain information related to child health indicators such as immunization coverage, vitamin A supplementation, recent occurrences of diarrhoea, fever, and cough for young children, and treatment of childhood diseases. A total of 1377 variables were available in this dataset. The dataset contained 41,751 children in total, of whom 2830 died before the age of five.

Study variables
Following an analytical framework for child survival in developing countries [15], we selected the 19 most important variables (out of 1377) related to under-five mortality, as most of the variables were not useful for this study. Because of missing values, only 15 variables were used for the analysis, including the outcome/target variable. A missing value is defined as an item that should have a response but does not, either because the question was not asked (due to interviewer error) or because the respondent did not want to answer. The outcome/target (dependent) variable was under-five mortality, defined as the death of a child before completing 60 months of age (that is, at 0 to 59 months).
The predictor (independent) variables considered in this study were mothers' educational level, births in the last five years, any exposure, currently breastfeeding, total number of living children, wealth index, mass media exposure (MXP), survival time, the total number of children ever born, desire for more children, sex of the child, child-size at birth, ANC visits and birth order.

Data pre-processing
After assembling the final dataset, the next step was to pre-process the data. In this step, duplicate records were removed and missing values were handled using the predictive mean matching method. Thereafter, all string and categorical variables were transformed into numerical values.
An important point in data pre-processing is the need to balance the target or outcome variable. In the dataset, the target variable was highly imbalanced (38,921 live children vs 2830 under-five deaths). A random over-sampling method was used to balance the target (dependent) variable, after which a ratio of 50:50 was obtained, compared with the initial ratio of 93:7.
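The balancing step can be illustrated with a minimal sketch of random over-sampling. The function name `random_oversample`, the seed, and the toy arrays are assumptions made for illustration; the toy data only mimic the 93:7 class ratio and are not the NFHS-IV records, and this is not necessarily the implementation used in the study.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes are equal in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        n_extra = int(target - count)
        if n_extra > 0:
            # Sample with replacement from the existing rows of this class.
            idx = np.concatenate([idx, rng.choice(idx, size=n_extra, replace=True)])
        keep.append(idx)
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data with a 93:7 class ratio (hypothetical, for illustration only).
y = np.array([0] * 93 + [1] * 7)
X = np.arange(100).reshape(-1, 1)
X_bal, y_bal = random_oversample(X, y)
counts_after = np.bincount(y_bal)
```

After the call, `counts_after` holds the number of rows per class, which should be equal for both classes.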

Feature selection
Feature selection ranks the major risk factors in the dataset according to their importance, based on the information gain value calculated for each of the selected variables. In this study, we used a random forest model to find the risk factors or important features that contribute most to child mortality. Higher information gain values indicate variables that are more strongly associated with the class variable. We selected the eight top-ranked variables by information gain, which were later used in model building.
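As an illustration of the quantity behind this ranking, the sketch below computes the entropy-based information gain of a single categorical feature with respect to a binary outcome. It is a simplified stand-in, not the random-forest-based ranking used in the study; the function names, variable names, and toy values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 1 = died before age five, 0 = alive.
outcome = [1, 1, 0, 0, 0, 0, 0, 0]
features = {
    "small_at_birth": [1, 1, 0, 0, 0, 0, 0, 0],  # separates the outcome perfectly
    "child_sex": [0, 1, 0, 1, 0, 1, 0, 1],       # carries no information here
}
gains = {name: information_gain(values, outcome) for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Sorting the features by gain, as in `ranked`, gives the kind of ordering from which a top-k subset can be taken.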

Data Splitting
In this step, we split the dataset into training and test data: 70% of the data were used for model training and 30% for model evaluation. The split was then repeated with 80% for training and 20% for testing to obtain a clearer picture of classification performance. All the independent features were converted to one-hot encoding to build better predictive models. In this study, the dependent variable was binary, i.e., dead/alive. We then used various suitable machine learning models, namely decision tree, random forest, Naïve Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier.
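The two steps described here, one-hot encoding and splitting at 70/30 and 80/20, can be sketched as follows. The helper names, the seed, and the coded `wealth_index` values are assumptions for illustration, not the study's code or data.

```python
import numpy as np

def one_hot(column):
    """Return a 0/1 indicator matrix with one column per distinct category."""
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(int), categories

def train_test_split_indices(n_samples, test_fraction, seed=0):
    """Shuffle row indices and split them into training and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

# Hypothetical coded categories for one predictor.
wealth_index = np.array([0, 1, 2, 3, 4, 2, 1, 0, 4, 3])
encoded, categories = one_hot(wealth_index)

train_70, test_30 = train_test_split_indices(len(wealth_index), 0.30)
train_80, test_20 = train_test_split_indices(len(wealth_index), 0.20)
```

Each row of `encoded` contains exactly one 1, and the index arrays can be used to slice both the encoded predictors and the outcome.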

Decision Tree (DT)
The decision tree is one of the most intuitive and straightforward techniques in machine learning and is based on the divide-and-conquer paradigm [16]. In a decision tree, tests (on input patterns) and categories (of patterns) are used as inner and leaf nodes, respectively. The technique assigns a class to an input pattern by filtering the pattern down through the tests in the tree [12].

Random Forest (RF)
The random forest algorithm takes hyper-parameters that specify the number of trees and the maximum depth of each tree. The random forest is an ensemble learning approach for classification that uses a large collection of decorrelated decision trees [17].

Support Vector Machine (SVM)
The SVM is a supervised machine learning technique for analyzing and recognizing patterns in data [18]. It finds a hyperplane that divides the classes, and new observations are classified according to the side of the hyperplane on which they fall. The support vectors are the data points nearest to this hyperplane.

Logistic Regression (LR)
Logistic regression is a probabilistic statistical classification model that predicts the probability of occurrence of an event. It is used to model a categorical dependent variable, most commonly a dichotomous outcome. Binary (multinomial) logistic regression is used to predict binary (multi-class) responses [16]. The predictors need to be independent and significantly associated with the outcome variable [19].
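A minimal sketch of how a fitted logistic regression turns predictor values into a probability, and how its coefficients translate into odds ratios, is given below. The intercept and coefficients are hypothetical values chosen for illustration; they are not estimates from this study.

```python
from math import exp

def predicted_probability(intercept, coefficients, features):
    """Logistic model: P(event) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-z))

# Hypothetical fitted values (assumptions, not results from the paper).
intercept = -3.0
coefficients = [0.8, -0.5]

p_high = predicted_probability(intercept, coefficients, [1, 0])
p_low = predicted_probability(intercept, coefficients, [0, 1])

# exp(coefficient) is the odds ratio for a one-unit increase in that predictor.
odds_ratios = [exp(b) for b in coefficients]
```

A positive coefficient gives an odds ratio above 1 (higher odds of the event), and a negative coefficient gives an odds ratio below 1.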

Naive Bayes (NB)
Naive Bayes is a simple machine learning algorithm based on the Bayes theorem, and it has a necessary assumption that the attributes are conditionally independent for the given class. Naive Bayes gives competitive classification accuracy [20]. Naïve Bayes is widely applied because of its computational efficiency and desirable features [21].
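The conditional independence assumption can be made concrete with a small sketch of a categorical Naive Bayes classifier: the score of each class is its prior multiplied by the per-feature conditional probabilities. The function names, the Laplace smoothing, the `n_values` parameter, and the toy rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count class frequencies and, per (feature index, class), the feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_proba(model, row, n_values=2):
    """Posterior class probabilities; n_values is the number of categories per feature."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # P(value | class) with Laplace (add-one) smoothing.
            score *= (value_counts[(i, cls)][value] + 1) / (count + n_values)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical binary features and outcome (1 = died, 0 = alive).
rows = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]
labels = [1, 1, 0, 0, 0, 0]
model = fit_naive_bayes(rows, labels)
proba = predict_proba(model, (1, 1))
```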

K-Nearest Neighbours (KNN)
The KNN is a simple and effective non-parametric method of classification [22]. To classify a data record 't', its 'k' nearest neighbours are collected, forming a neighbourhood of 't'. Majority voting among the data records in the neighbourhood is generally used to decide the classification for 't', with or without distance-based weighting. When applying the KNN, an appropriate value for 'k' must be chosen, and the success of the classification depends on this value. There are several methods of determining k, but the simplest one is to run the algorithm many times with varying k values and choose the one with the best performance [23].
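The voting rule and the simple search over k can be sketched as below. The sketch uses unweighted majority voting and Euclidean distance; the function names and the toy points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy_for_k(train_points, train_labels, val_points, val_labels, k):
    """Share of validation points classified correctly for a given k."""
    hits = sum(
        knn_predict(train_points, train_labels, point, k) == label
        for point, label in zip(val_points, val_labels)
    )
    return hits / len(val_points)

# Hypothetical two-dimensional toy data with two well-separated classes.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]
val_points = [(1, 1), (4, 4)]
val_labels = [0, 1]

# Run the classifier for several k values and keep the best-performing one.
accuracies = {
    k: accuracy_for_k(train_points, train_labels, val_points, val_labels, k)
    for k in (1, 3, 5)
}
best_k = max(accuracies, key=accuracies.get)
prediction = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
```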

Neural network
Neural networks reflect the behavior of the human brain and allow computer programs to find patterns and solve common problems in artificial intelligence, machine learning, and deep learning. An ANN comprises layers of nodes: an input layer, one or more hidden layers, and an output layer [24]. Each node connects to others and has an associated weight and threshold. If the output of an individual node exceeds the given threshold value, that node is activated and sends data to the next layer of the network.
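The layer-and-threshold description can be illustrated with a tiny forward pass through one hidden layer. The weights below are hand-picked (hypothetical) so that the network computes XOR, a relationship that no single linear threshold unit can represent; this is a sketch of the mechanism, not the network trained in the study.

```python
import numpy as np

def step(z, threshold=0.0):
    """A node is activated (outputs 1) only when its input exceeds the threshold."""
    return (z > threshold).astype(float)

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Input layer -> one hidden layer -> output layer."""
    hidden = step(W_hidden @ x + b_hidden)
    return step(W_out @ hidden + b_out)

# Hand-picked weights (assumptions): hidden node 1 acts as OR, hidden node 2 as AND.
W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
W_out = np.array([[1.0, -1.0]])  # fires when OR is on and AND is off
b_out = np.array([-0.5])

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [
    int(forward(np.array(x, dtype=float), W_hidden, b_hidden, W_out, b_out)[0])
    for x in inputs
]
```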

Ridge regression
Ridge regression is a method for estimating the coefficients of multiple-regression models when the independent variables are highly correlated. It was developed as a possible solution to the imprecision of least squares estimators when there is multi-collinearity among the independent variables in the linear regression model [25]. Ridge parameter estimates are more precise because their variance and mean square error are smaller than those of the least squares estimators.
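A minimal sketch of the ridge estimator in closed form, without an intercept, shows how the penalty stabilises coefficients when two predictors are almost collinear. The penalty name `alpha`, the simulated data, and the true coefficients are assumptions for illustration; the ridge classifier used in the study applies the same penalised estimator to a coded class label.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Solve (X'X + alpha * I) beta = X'y; alpha = 0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Simulated data with two nearly collinear predictors (hypothetical).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=200)

ols = ridge_coefficients(X, y, alpha=0.0)
ridge = ridge_coefficients(X, y, alpha=1.0)
```

For any positive `alpha`, the ridge coefficient vector has a smaller norm than the least squares one, which is the shrinkage that reduces variance under multi-collinearity.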

Evaluation for predictive models
In this study, to identify the best model for predicting under-five mortality, the models were evaluated using several indices: confusion matrix, sensitivity, specificity, precision, accuracy, F1 score, negative predictive value, Cohen's Kappa, and AUROC. The details are given below.

Confusion matrix
The confusion matrix visualizes the actual and predicted classes [26]. To examine the performance of a classification algorithm, the confusion matrix compares the predicted classes with the actual classes. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Sensitivity/Recall - Sensitivity measures the correctly predicted positive events out of the total number of positive events, that is, how many positives are predicted out of all actual positives. It is also known as recall and can be calculated by the given formula:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity - Specificity is the proportion of correctly predicted negative outcomes among all negative outcomes. It can be calculated by the given formula:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Precision - Precision is the number of correctly predicted positive events divided by the total number of events that the classifier predicts as positive. It is also known as positive predictive value. In this study, it was calculated from the confusion matrix and used to check the model output:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Negative predictive value - The negative predictive value is the number of true negatives divided by the total number of cases predicted as negative:

$$\text{NPV} = \frac{TN}{TN + FN}$$

Accuracy - Accuracy is the percentage of correctly classified cases among the total number of cases tested. In this study, it was calculated from the confusion matrix and used to determine model efficacy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

F1 score - The F1 score (or F-measure) balances the trade-off between precision and recall. A higher F1 score indicates a better model. It is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Cohen's Kappa - Cohen's Kappa is a coefficient used to assess the performance of a binary classification model [27]. It is a very useful evaluation statistic when working with imbalanced data. Cohen's Kappa ($\kappa$) is calculated by the given formula:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the overall accuracy of the model and $p_e$ is the agreement between the model predictions and the actual class values that would be expected by chance. A value of 0 represents agreement no better than chance and 1 represents perfect agreement between classes; negative values indicate agreement worse than chance.
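The measures in this section can all be derived from the four cells of a binary confusion matrix. The sketch below assumes the usual notation (TP, FP, FN, TN for true positives, false positives, false negatives, and true negatives), takes F1 as the harmonic mean of precision and recall, and uses hypothetical counts rather than results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the evaluation measures from the cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)              # positive predictive value
    npv = tn / (tn + fn)                    # negative predictive value
    accuracy = (tp + tn) / total
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's Kappa: observed agreement p_o corrected for chance agreement p_e.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p_o - p_e) / (1 - p_e)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }

# Hypothetical counts for an imbalanced test set of 1000 children.
metrics = classification_metrics(tp=80, fp=70, fn=20, tn=830)
```

With imbalanced classes, accuracy can be high while precision and Kappa remain moderate, which is why several measures are reported together.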

Area under Receiver Operator Characteristic (AUROC) Curve
The receiver operating characteristic (ROC) curve is a probability curve that shows the trade-off between sensitivity and specificity by plotting sensitivity against 1 - specificity across classification thresholds. It is the most used metric for binary classification outcomes. The area under the ROC curve (AUC) shows how well the predicted probabilities separate the positive class from the negative class. An AUC close to 1 indicates good prediction, while a value near 0.5 indicates performance no better than chance. In this study, we use this measure to assess each model's efficiency.
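One way to see what the area under the ROC curve measures is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it this way; the function name and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities and true classes (1 = died, 0 = alive).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = auroc(scores, labels)

# Identical scores for every case give chance-level separation.
auc_uninformative = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```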

Precision-recall curve
The precision-recall curve plots recall (sensitivity, x-axis) against precision (y-axis). It is used as an alternative to ROC curves [28]. High precision relates to a low false positive rate, while high recall relates to a low false negative rate. A large area under the curve denotes both high precision and high recall. High scores for both measures indicate that the classifier is returning most of the positive cases (high recall) and that the cases it labels as positive are mostly correct (high precision).

Results
Multivariate logistic regression analysis was applied to identify the important factors in the under-five mortality data. Table 1 shows that births in the last five years, breastfeeding status, sex of the child, number of living children, child size at birth, birth order, survival time, children ever born, and desire for more children were important factors for under-five mortality. The machine learning models, namely decision tree, random forest, Naïve Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier, were applied to build a predictive model of under-five mortality. The eight machine learning models were compared under a 70% training and 30% validation split and an 80% training and 20% validation split, using various evaluation measures, with all factors and with only the important factors.
All predictive models of under-five mortality were trained on 70% of the data with all factors and tested on the remaining 30%. The performance of the predictive models was evaluated and compared using the metrics described above, including the confusion matrix. The ROC curve is shown in Fig. 3, and the precision-recall curve is shown in Fig. 4. Both curves of the neural network model show the highest AUROC (96.4%) and highest precision-recall curve (99.7%), again indicating that it is the best predictive model among all models. The second-best model was logistic regression, with 94.5% AUROC and 99.6% precision-recall curve value. Next, all predictive models of under-five mortality were trained on 80% of the data with all factors to get a better idea of the accuracy or efficacy of the models. The result of the model evaluation for the 80% training split is shown in Table 2. The result indicated that the neural network model was the best predictive model for under-five mortality compared to the other predictive models. The neural network model predicted under-five mortality with the highest accuracy (95.96%), with recall (79.27%), precision (51.83%), F1 score (62.68%), and Cohen's Kappa value (0.60). The ROC curve is shown in Fig. 5, and the precision-recall curve is shown in Fig. 6. The curve of the neural network model shows the highest AUC (93.87%) and highest precision-recall curve (99.7%), indicating it is the best predictive model among the models. The second-best model was logistic regression, with 94.8% AUROC and 99.6% precision-recall curve value.
After that, we used a random forest model to find the risk factors or important features that contributed most to the mortality of under-five children. We used the information gain rank method of random forest to check each feature's importance with respect to its predictive power, and we selected only the top eight features. The result of feature importance is shown in Fig. 7. The most important determinants of under-five mortality were the number of living children, survival time, wealth index, child size at birth, birth in the last five years, total children ever born, mother's education level, and birth order, because they ranked highest. We then repeated all procedures using only these important factors to assess the value of the features selected by the information gain measure.
All machine learning models, namely decision tree, random forest, Naive Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier, were trained on 70% of the data with the eight important factors and tested on the remaining 30%. The result of the model evaluation for the 70% training split is shown in Table 3. The result indicates that the neural network model was the best predictive model for under-five mortality compared to the other predictive models. The neural network model predicted under-five mortality with the highest accuracy (95.31%), with recall (81.03%), precision (36.64%), F1 score (50.46%), and Cohen's Kappa value (0.48). The ROC curve is shown in Fig. 8, and the precision-recall curve is shown in Fig. 9. The curve of the neural network model showed the highest AUC (93.51%) and precision-recall curve (99.5%), indicating that it is the best predictive model among the models. The logistic regression model was the second-best model, with 93.3% AUROC and 99.5% precision-recall curve value.
Next, all predictive models of under-five mortality were trained on 80% of the data with the eight important factors and tested on the remaining 20%. The result of the model evaluation for the 80% training split is shown in Table 3. The neural network model predicted under-five mortality with the highest accuracy (95.29%), with recall (71.51%), precision (45.05%), F1 score (55.28%), and Cohen's Kappa value (0.53), indicating it is the best predictive model among the models. The ROC curve is shown in Fig. 10, and the precision-recall curve is shown in Fig. 11. The curve of the neural network model shows the highest AUC (93.95%) and precision-recall curve (99.5%), again indicating that it is the best predictive model among the models. The second-best model was logistic regression, with 94.8% AUROC and 99.5% precision-recall curve value. Overall, the results indicate that the neural network classifier is the most accurate model for predicting under-five mortality in this predictive analytics framework. The results also suggest that the machine learning models give better accuracy than the traditional statistical model and that the information gain rank method identifies the factors of under-five mortality.

Discussion
This study predicts the important factors of under-five mortality using logistic regression analysis and machine learning models, and it evaluates the importance of machine learning techniques in predicting those factors. This is the first study to use machine learning techniques to predict under-five mortality in data from Uttar Pradesh, an Indian state with high under-five mortality. To find the better accuracy of the machine learning models, we applied two different train/test ratios, 70/30 and 80/20, and observed that the 70/30 ratio was the appropriate ratio for the model; this result is supported by previous studies [29,30]. This study showed that the neural network is better than the other predictive models for predicting under-five mortality. Concerning predictive analysis, previous studies found that neural networks were superior to logistic regression [31][32][33]. Other articles found no differences between LR and neural networks, and some found that logistic regression was better than neural networks [34,35]. It may not be possible to determine which model is superior for every dataset, but the neural network has the ability to detect complex nonlinear relationships and all possible interactions between predictor variables. With its many free parameters, a neural network can give impressive results from an overfitted model, while logistic regression has less potential for overfitting. It is rare for all variables in a dataset to be useful for developing machine learning models. Adding too many variables to the analysis reduces the competence and accuracy of the models. Thus, feature selection is an important tool in machine learning for finding the factors that are useful in the models.
The information gain method showed that the number of living children, survival time, wealth index, child size at birth, birth in the last five years, total children ever born, mother's education level, and birth order are the top eight important predictors for under-five mortality.
Various studies have also confirmed that these factors are crucial for under-five mortality. From this study, we can confirm that the wealth index was one of the important factors for under-five mortality, which is in line with other studies [40][41][42]. This study found that survival time was a significant factor in under-five mortality [43,44]. The findings also showed that the mother's education level was one of the major under-five mortality factors. Previous studies also confirmed that the mother's education played an important role in reducing the risk of under-five mortality [45][46][47][48]. This may be because educated mothers might have better knowledge about the health services, care, and immunization of their children. This study found that the number of births in the last five years and birth order are important factors related to under-five mortality. The machine learning approach can predict and simulate the mortality rates in the human population and make accurate predictions of mortality risk for most preterm infants [59,60]. Previous research also confirms that machine learning methods are better than traditional analysis methods [61,62]. A previous study reported that machine learning models are more suitable for finding the factors of infant mortality and give better goodness of fit in the most critical groups [63]. Moreover, machine learning models are very valuable for prediction in health studies, which can lead to healthier and more suitable policy decisions.

Study limitation
This study has limitations that stem from the use of machine learning models rather than statistical models. Machine learning models produce results without coefficients or odds ratios, unlike statistical models, so it is difficult to understand how much, and in which direction, each factor affects the outcome. Another limitation is that the research hypothesis has to be decided by the researcher, because machine learning models cannot frame research hypotheses for either prediction or classification. The results of the study are based on data from the NFHS-IV questionnaires, a general survey that was not designed specifically for, and has no precise objectives related to, under-five mortality. Several variables in the dataset had missing values, and those variables were not included in the study.

Conclusion
The objective of this study was to apply various machine learning models to under-five mortality data. This study assesses the accuracy of the ML models and identifies the important factors related to under-five mortality. The neural network model performed best in predicting under-five mortality, with the highest accuracy among the machine learning models used in this study. The study also indicates that logistic regression analysis can be useful in predicting under-five mortality, with some limitations. However, this study also highlighted that some of the variables have an equally significant impact on under-five mortality in both LR and ML models. The number of children, survival time, child size at birth, birth in the last five years, the total number of children ever born, and birth order were found to be the most important factors for under-five mortality. The machine learning models provide some important factors that may add to analysis capabilities compared to traditional statistical models. These models may be helpful for the analysis of high-dimensional data for health research.