Sample Design and Sample Size
This study used the latest National Family Health Survey (NFHS-5) dataset, a nationally representative survey of India covering 707 districts, 28 states, and 8 union territories. The survey provides information on sexual behavior; husband’s background and women’s work; HIV/AIDS knowledge, attitudes, and behavior; domestic violence; and health and family welfare, along with health indicators such as fertility, infant and child mortality, and maternal and child health. The total sample of approximately 610,000 households was sized to produce reliable indicator estimates for each district. The rural sample was selected through a two-stage design, with villages as the primary sampling units (PSUs) selected with probability proportional to size at the first stage, followed by a random selection of 22 households within each PSU at the second stage. Urban areas also used a two-stage design, with Census Enumeration Blocks (CEBs) selected at the first stage and 22 households selected at random within each CEB at the second stage. In both urban and rural areas, second-stage households were chosen after a systematic mapping and household listing operation in the selected first-stage units. A total of 636,699 households were interviewed; within these households, 724,115 eligible women aged 15–49 years and 101,839 eligible men aged 15–54 years were interviewed [17].
Data Source
This study used the birth recode dataset file from the NFHS-5 survey. This dataset contains records of all children born to surveyed women; essentially, it is the complete birth history of each interviewed woman, including pregnancy and postpartum information, immunizations, and the health status of children born in the last five years, together with data on the mothers of these children. The file can be used to calculate health indicators, birth rates, and death rates. This study used the child’s date of birth and age at death to calculate under-five mortality in India. The dataset contains various predictors (features), and the target variable is the child’s survival status before completing five years of age.
Analytic Strategies for Under-five Mortality Data
We used several analytic approaches to meet the objectives of the study. First, data preprocessing was performed to prepare the dataset for the statistical and machine learning analyses. Second, descriptive statistics were used to summarize the baseline structure of the data, and a multivariate logistic regression model was used to identify important factors (features) associated with under-five mortality. Third, several supervised machine learning models were fitted to the dataset using all factors to identify the best-performing model. Feature selection techniques were then used to identify the most important predictors of under-five mortality. Finally, after screening the important features based on logistic regression and the ML feature selection techniques, the machine learning models were refitted using only the important features to find the best predictive model for under-five mortality.
Preprocessing of the data
In this study, we used a modified analytical framework proposed by Mosley and Chen for analyzing child survival in developing nations [18], as applied in previous literature [19], to identify relevant independent features (variables). A total of 55 independent variables were initially selected and categorized into socioeconomic, biodemographic, health-related, and environmental factors. Data preprocessing is essential for building reliable models; here it included data filtering, data cleaning, quality checks, handling of missing values, and removal of outliers and duplicate variables. After preprocessing, 34 of the 55 independent variables were retained for the analysis of U5MR in India: 11 biodemographic, 8 health-related, 4 environmental, and 12 socioeconomic variables [Fig. 2]. Categorical values were converted to numeric codes, and one-hot encoding was applied to all predictor (feature) variables.
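To illustrate the encoding step, a minimal stdlib-only Python sketch of one-hot encoding (the `one_hot` helper and the example values are hypothetical; in practice this is typically done with pandas `get_dummies` or scikit-learn's `OneHotEncoder`):

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows), where each row is a 0/1 vector
    aligned with the sorted category list.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical values for a 'place of residence' predictor
cats, encoded = one_hot(["urban", "rural", "rural"])
```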
Balancing the class (target) variable is necessary for building better predictive models. To balance it, we used the synthetic minority oversampling technique (SMOTE), which generates synthetic minority-class samples and, as commonly applied, is combined with undersampling of the majority class to offset the weaknesses of either method alone. It is a widely used approach for imbalanced data [20]. The number of dead children is very small compared with the number of living children in the dataset; after applying SMOTE, the class variable was balanced to a 50:50 ratio from the initial 96:4 ratio.
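The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as follows. This is an illustrative stdlib-only toy, not the implementation used in the analysis; packages such as imbalanced-learn provide a production `SMOTE` class:

```python
import random

def smote_sample(minority, k=1, rng=None):
    """Generate one synthetic minority point by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbors (the core idea of SMOTE)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest neighbors by squared Euclidean distance, excluding base
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation fraction in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

# Hypothetical 2-D minority-class points
minority = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1)]
synthetic = smote_sample(minority)
```

The synthetic point always lies on the segment between two existing minority points, so it stays inside the minority region of feature space.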
Predictor variables
After removing duplicates and handling missing values, the predictor (independent) variables considered in this research included the mother’s age, mother’s current marital status, age at first birth, age at marriage, sex of the household head, child’s sex, birth order, number of living children, child’s size at birth, births in the last five years, desire for more children, vaccinations, contraceptive use, place of delivery, mode of delivery, antenatal care, current breastfeeding, anemia level, postnatal care, source of drinking water, toilet facilities, type of cooking fuel, electricity availability, household education, household working status, mother’s education, mother’s working status, father’s education, place of residence, region of residence, household size, wealth index, religion, time in days, and caste.
Machine learning approach framework of U5MR prediction
In this study, we first used the modified analytical approach developed by Mosley and Chen for analyzing child survival in developing nations to identify important features. Only these features were retained from the dataset, and data preprocessing was then applied to produce the final dataset for further computation. The algorithmic computations and machine learning frameworks are depicted in Fig. 3. The Statistical Package for the Social Sciences (SPSS), version 27.0, was used for data preprocessing, descriptive statistics, and multivariate logistic regression analysis. All machine learning algorithms were implemented in Python 3.8 using Jupyter Notebook.
Model Building
In this step, the full dataset (100%) was divided into 70% training data and 30% test data. The 70% training partition was used with tenfold cross-validation to select the best-performing model, and the remaining 30% was used to assess model performance. The machine learning models fitted to the 70% training partition are described below.
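The 70/30 partition can be sketched as follows (a stdlib-only illustration with a fixed random seed; the `train_test_split` helper here is hypothetical and mirrors what scikit-learn's function of the same name does):

```python
import random

def train_test_split(rows, test_frac=0.30, seed=42):
    """Shuffle and split rows into train/test partitions (70/30 here)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data)
```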
Decision Tree (DT)
The decision tree algorithm is widely used in the healthcare domain for label classification and is a well-established classifier for estimating conditional probabilities [21–22]. Decision trees can be grown as binary trees for classification problems, with each terminal node representing a particular class label [23]. The classifier can also be used to generate decision rules that reveal the underlying patterns in the data.
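The basic operation a decision tree repeats, choosing the split that minimizes impurity, can be illustrated with a Gini-based decision stump on a single feature (toy data; illustrative only):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Find the threshold on one feature minimizing the weighted Gini
    impurity of the two child nodes (a single decision-tree split)."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: the split at x <= 2 separates the classes perfectly
threshold, impurity = best_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```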
Random Forest (RF)
We used random forest (RF) to classify the class labels and to identify the best-performing classifier on the dataset. RF creates bootstrapped datasets of the same size as the original data and grows a decision tree on each bootstrap sample, with each tree receiving its own inputs. A new input is passed to every tree, and the class label receiving the most votes across all trees is assigned to that input. Because bootstrap sampling draws random samples with replacement, some observations are left out of each tree by chance [24].
The bagged prediction is given by Equation (1), where \(B\) denotes the number of bagging repetitions (bootstrap samples):
$$\widehat{f}=\frac{1}{B}\sum _{b=1}^{B}{f}_{b}\left({x}^{{\prime}}\right)$$
Here, \({f}_{b}\) is the classification tree trained on the \(b\)-th bootstrap sample; for classification, this aggregation corresponds to a majority vote across the trees.
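The vote-aggregation step can be sketched as follows, with three stand-in "trees" written as simple functions (illustrative only; real trees would be trained on bootstrap resamples of the data):

```python
from collections import Counter

def bagged_predict(trees, x):
    """Aggregate the predictions of B bootstrap-trained classifiers
    by majority vote (the classification form of bagging)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical stand-in "trees" over a 2-feature input
trees = [
    lambda x: "alive" if x[0] > 0.5 else "dead",
    lambda x: "alive" if x[1] > 0.5 else "dead",
    lambda x: "alive",
]
pred = bagged_predict(trees, (0.9, 0.2))  # votes: alive, dead, alive
```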
Naive Bayes
The naive Bayes (NB) classifier identifies determinants by computing the posterior probability of each class from the prior probability and the likelihood. It is widely used in data analytics and often produces strong classification results [25–26].
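The posterior computation at the heart of the classifier is Bayes' rule; a minimal sketch with hypothetical priors and likelihoods:

```python
def posterior(priors, likelihoods):
    """Posterior P(class | evidence) from priors and likelihoods via
    Bayes' rule: P(c | x) is proportional to P(x | c) * P(c),
    normalized over all classes."""
    unnorm = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical two-class example: a rare 'dead' class with a
# feature pattern that is much more likely among deaths
post = posterior({"dead": 0.04, "alive": 0.96},
                 {"dead": 0.50, "alive": 0.10})
```

Even with a strong likelihood ratio in favor of "dead", the very small prior keeps the posterior for "dead" below that of "alive", which is exactly the behavior a prior-weighted classifier should show.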
Neural Network
We used a multilayer perceptron (MLP) for model building. The MLP is a feed-forward artificial neural network (ANN) with multiple layers of nodes, in which each layer is fully connected to the next. The input nodes correspond to the number of attributes. An activation function is applied to the linear combination of the inputs with weights “W” and bias “B”. The sigmoid activation function used in the MLP is
$$f\left({z}_{i}\right)=\frac{1}{1+{e}^{-{z}_{i}}}$$
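The sigmoid activation and the weighted-sum-plus-bias computation of a single node can be sketched as follows (weights and inputs are illustrative):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One MLP node: activation of the weighted input sum plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so the output is 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```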
K-nearest neighbor (KNN)
K-nearest neighbors (KNN) is a method for classification and regression. By measuring the distance between the test data point and every training point, KNN identifies the K training points that most closely resemble the test data. The algorithm then examines how these K nearest neighbors are distributed across the classes, and the class with the highest frequency among them is chosen. For regression, the predicted value is the average of the K selected training points [27].
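A minimal from-scratch KNN classifier using Euclidean distance and majority vote (toy data with hypothetical "dead"/"alive" labels):

```python
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote of its k nearest training
    points, ranked by (squared) Euclidean distance."""
    neighbors = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], test_point)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points as ((features), label) pairs
train = [((0, 0), "dead"), ((0, 1), "dead"),
         ((5, 5), "alive"), ((6, 5), "alive"), ((5, 6), "alive")]
pred = knn_predict(train, (5.5, 5.5), k=3)
```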
Logistic Regression (LR)
Logistic regression is a robust statistical and supervised machine learning model for binary outcomes. It models the relationship between a binary outcome (dependent) variable and one or more independent variables [28].
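The predicted probability under a fitted logistic regression model follows directly from the linear predictor; a sketch with hypothetical coefficients:

```python
import math

def logistic_prob(x, coefs, intercept):
    """P(y = 1 | x) under a logistic regression model:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients for two predictors;
# z = -0.8 + 0.8*1.0 - 0.3*0.0 = 0, so p = 0.5
p = logistic_prob([1.0, 0.0], coefs=[0.8, -0.3], intercept=-0.8)
```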
Model Evaluation
After classification, each trained model was evaluated on the test data to determine which model best predicts under-five mortality, based on measures such as sensitivity, specificity, accuracy, precision, F1-score, and the area under the receiver operating characteristic curve (AUROC). All measures were calculated from the confusion matrix.
The confusion matrix was used to assess the accuracy of the classifiers with respect to the dead and alive child classes in this study [29]. A true positive (TP) is a positive case correctly classified as positive; a true negative (TN) is a negative case correctly classified as negative; a false positive (FP) is a negative case incorrectly classified as positive; and a false negative (FN) is a positive case incorrectly classified as negative. From these counts, metrics such as sensitivity (recall), specificity, accuracy, and precision were used to evaluate the models.
Accuracy – Accuracy is the proportion of correctly classified cases among the total number of cases tested.
$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
Sensitivity – Sensitivity measures the probability of correctly predicting the positive class and is calculated by the given formula.
$$\text{Sensitivity}/\text{Recall}=\frac{TP}{TP+FN}$$
Specificity – Specificity measures the probability of correctly predicting negative outcomes among all true negative outcomes. It can be calculated by the given formula:
$$\text{Specificity}=\frac{TN}{TN+FP}$$
Precision – Precision is the proportion of predicted positive cases that are truly positive; it is also known as the positive predictive value (PPV).
$$\text{Precision}/\text{PPV}=\frac{TP}{TP+FP}$$
Negative predictive value – The negative predictive value is the number of true negatives divided by the total number of cases predicted as negative.
$$\text{Negative predictive value}=\frac{TN}{TN+FN}$$
F1 score – The F1 score is the harmonic mean of precision and recall, balancing the trade-off between the two; a higher F1 score indicates a better model. It is calculated as
$$\text{F1 score}=\frac{2TP}{2TP+FN+FP}$$
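The measures above can be computed directly from the confusion-matrix counts, as in the following sketch (the counts are hypothetical):

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluation measures computed from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "f1": 2 * tp / (2 * tp + fn + fp),
    }

# Hypothetical confusion matrix for a balanced test set of 100 cases
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```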
Cohen Kappa
Cohen’s kappa statistic is a robust measure that handles both multiclass and class-imbalance problems well [30]. It compares the agreement between the labels predicted by a model and the actual labels in the data and is calculated using the formula
$$K=\frac{{p}_{0}-{p}_{e}}{1-{p}_{e}}$$
where \({p}_{0}\) is the observed agreement and \({p}_{e}\) is the expected agreement under chance. The statistic indicates how much better the classifier performs than one that guesses randomly according to the frequency of each class.
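A direct implementation of the kappa formula for a binary confusion matrix (hypothetical counts; the marginal frequencies give the expected agreement):

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary confusion matrix:
    K = (p_o - p_e) / (1 - p_e)."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed agreement
    # expected agreement under chance, from the marginal frequencies
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(tp=40, tn=45, fp=5, fn=10)
```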
Receiver operating characteristic (ROC) curve
The area under the ROC curve (AUC-ROC) is a widely used metric for evaluating the performance of classification models; it quantifies a model’s ability to distinguish between classes. The higher the AUC, the better the model. ROC curves graphically represent the relationship, and trade-off, between sensitivity and specificity.
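The AUC also admits a simple rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch (the scores are hypothetical):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Perfect ranking: every positive scores above every negative
area = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```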
K-fold Cross-Validation
Cross-validation is a validation technique that evaluates a model’s ability to generalize to an independent dataset. In k-fold cross-validation, the training dataset is randomly divided into k mutually exclusive subsamples (folds) of equal size. The model is trained and evaluated k times; in each iteration, one of the k folds is held out for testing (validation) and the remaining k-1 folds are used to train the model.
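The fold construction can be sketched as follows (an illustrative stdlib-only version; scikit-learn's `KFold` provides an equivalent production implementation):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal, disjoint folds."""
    indices = list(range(n))
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def train_test_folds(n, k):
    """Yield (train, test) index lists: each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    folds = kfold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, test

splits = list(train_test_folds(10, 5))
```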
Important features (factors) selection method
Feature selection reduces the dimensionality of the data by choosing a subset of k features (variables) from the original pool, with the aim of selecting the subset of input variables that best describes the target variable. Feature selection methods in machine learning include filter, wrapper, and embedded (built-in) approaches. In this study, we used chi-square scores, recursive feature elimination, the extra trees classifier, random forest importance, and a sequential feature selector to identify the most important predictors of under-five mortality.
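As an illustration of the filter approach, the chi-square score for a single binary feature against the binary target can be computed from its 2x2 contingency table (the table values are hypothetical; scikit-learn's `SelectKBest` with the `chi2` score function offers an equivalent production tool):

```python
def chi_square_score(observed):
    """Chi-square statistic for a 2x2 contingency table of a binary
    feature against the binary target; higher scores indicate a
    stronger feature-target association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            score += (obs - expected) ** 2 / expected
    return score

# Hypothetical table: rows = feature present/absent, cols = dead/alive
score = chi_square_score([[30, 20], [10, 40]])
```

A table whose rows are proportional to the column totals (feature independent of the target) scores zero, which is why low-scoring features are dropped by the filter.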
Ethical considerations
This study was based on freely available online data; therefore, ethical approval was not required.